How to split a column of string variables (words of various lengths and numbers) into separate columns

Simwinga Simwinga

Join Date: Apr 2022

Posts: 36
#1

How to split a column of string variables (words of various lengths and numbers) into separate columns

09 Dec 2023, 10:33

Hello, I'm attempting to split a column that contains words and numbers of varying lengths into distinct columns. The difficulty lies in the uneven number of columns in some rows, resulting from the varying word lengths. I aim to have all numbers into uniform columns. How can I achieve this task? The first picture is how the column (origin) is, the second picture is how I want the data to be and the third picture is how the data is being spit when I command split origin. PLEASE PARDON ME FOR ATTACHING THESE UGLY PICTURES, COULD NOT DO ANY BETTER.

Attached Files

Last edited by Simwinga Simwinga; 09 Dec 2023, 10:36.
Tags: None
Clyde Schechter

Join Date: Apr 2014

Posts: 29795
#2

09 Dec 2023, 10:51

Are you sure you really want the data to look like the second picture? The third picture, showing what you got from -split- looks to me like the ideal layout of this data. In particular, if you go with the layout of the second picture, you are implying that there is no rationale for the different columns in the original data, because you want the first number after the name to occur in either the third or the fourth variable (with everything else moved over one variable and a new orig9 variable created) depending on how many characters appear in the name (orig2) that precedes the numbers. The columns in your third picture combine numbers that belong to different columns in the original layout--does that really make sense?

Assuming it does, here's how you can do it, starting from what you have in the third picture:

Code:

gen orig9 = . forvalues j = 9(-1)4 { replace orig`j' = orig`=`j'-1' if strlen(orig2) > 8 } replace orig3 = . if strlen(orig2) > 8

Note: While I understand using screenshots here so that the physical appearance of the data on screen, important for your purpose, is displayed, the data cannot be imported to Stata for development and testing of code from a screenshot. So this code is untested and may be incorrect. In particular, this code assumes that the variables orig3-orig8 are numeric variables. (There is no way to know from a screenshot like this.) If, in fact, they are string variables, then all of the appearances of -= .= need to be replaced by -= ""-.
Comment

Simwinga Simwinga

Join Date: Apr 2022
Posts: 36

10 Dec 2023, 20:21

Hello Clyde, Apologies for the confusion. I prefer my data to resemble the third picture, not the second one. The second picture depicts what I get when I split the data. Here is the original dataset. Note that all these are in one column named orig.

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input strL orig
"010085 CHILUBI                   1    8    7   10    2   28"
"010086 MUCHINSHI                 9   10   13   13    0   45"
"010094 MOFU                      1    6    9   21    1   38"
"010095 MATIPA                    5   16    4    6    7   38"
"010097 CHABA                     0    0    6   23    9   38"
"010098 SHITIMALI                 0    8   12    2    0   22"
"010175 CHINUNGWE                 3   15    8    6    1   33"
"010180 CHABUKA SANSHA            2   12   14    7    0   35"
"010182 MUBILI                    0    2    3    1    5   11"
"010183 KABESHA                  12   12    8    5    9   46"
"010201 MAYUKA                    1    7   13   20   17   58"
"010208 MWANAKASABI               9    9    3    1    3   25"
end

Last edited by Simwinga Simwinga; 10 Dec 2023, 20:42.

Comment

Clyde Schechter

Join Date: Apr 2014
Posts: 29795

10 Dec 2023, 21:43

This should do it:

Code:

gen orig1 = substr(orig, 1, 6)
gen orig2 = substr(orig, 7, 26)
gen part3 = substr(orig, 33, .)
split part3, gen(numeric) destring
rename numeric# orig#, renumber(3)
drop orig part3

Comment

Simwinga Simwinga

Join Date: Apr 2022

Posts: 36
#5

11 Dec 2023, 01:54

Thank you very much, this worked perfectly.
Comment
Simwinga Simwinga

Join Date: Apr 2022

Posts: 36
#6

23 Dec 2023, 08:03

Hi Clyde, I'm reaching out again regarding the substring code you shared. It worked perfectly for the initial dataset, but I'm encountering a new challenge. I'm unsure how to automatically determine the starting point and length of the substring for each observation. Is manual counting necessary, or is there a more efficient method? I'd appreciate any guidance or insights you can offer on this matter. Thanks again for your help.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 29795
#7

23 Dec 2023, 10:51

Please post back with a new -dataex- example showing some of the new data for which the original code does not produce correct results. Please make the example as representative of the variety of situations you encounter as possible. I'll see if I can figure it out.
Comment

Simwinga Simwinga

Join Date: Apr 2022
Posts: 36

26 Dec 2023, 04:34

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str140 orig
"FAC    EXAM NUM.    NAME                       SEX D.O.B.      ENGL   SS     MATHS  IS     ZL     SP1    SP2    CTS    S/T1 S/T2 SCORE "     
"090005 230000000005 KKKKK LLLLLLL NNNNNNNNNN   F  11/10/2010 143    144    140    142    ---    145    145    144     573  290    863    ONE"
"090005 230000000005 NNNNN MMMMMM ZZZZZ         F  19/02/2011 143    137    140    142    ---    141    145    142     567  286    853    ONE"
"090005 230000000005 KKKKKKK CCCCCCCC           M  17/04/2011 144    143    128    141    124    145    135    142     570  280    850    ONE"
"090005 230000000005 PPPPP TTTTTTT JJJJJJ       M  14/07/2010 138    138    140    139    ---    150    145    134     555  295    850    ONE"
"090005 230000000005 TTTTT BBBBBBB              M  12/05/2011 144    141    140    138    107    141    145    136     563  286    849    ONE"
"090005 230000000005 NNNNNNNN NNNNN             F  01/07/2011 144    125    140    144    114    137    145    138     566  282    848    ONE"
"090005 230000000005 EEEEE BBBBBB               F  22/03/2012 140    143    140    138    119    150    135    137     561  285    846    ONE"
"090005 230000000005 VVVVVVV KKKKKKK MMMMMM MORIFH 07/12/2009 144    141    136    141    116    141    135    144     570  276    846    ONE"
"090005 230000000005 NNNNNNNNN SSSSSSSSSS NNNN  F  12/01/2011 137    143    140    132    ---    145    145    135     555  290    845    ONE"
"090005 230000000005 LLLLLLLL LLLLLLL           F  23/03/2011 137    144    140    138    ---    145    135    142     564  280    844    ONE"
"090005 230000000005 MMMMMM MMMMMM              M  20/05/2011 143    143    140    141    102    137    135    144     571  272    843    ONE"
"090005 230000000005 IIIIII CCCCCCCC  SSSSSS    M  05/06/2011 142    136    140    141    ---    141    135    144     567  276    843    ONE"
"090005 230000000005 MMMMMM CCCCCCC AAAAAA      F  01/09/2011 144    135    140    138    113    137    145    138     560  282    842    ONE"
"090005 230000000005 MMMMMMMM MMMMMMMMM         M  22/01/2011 140    141    140    141     92    145    135    135     562  280    842    ONE"
"090005 230000000005 SSSSSSSSSS KKKKKK GGGGGGGG M  04/05/2011 144    133    140    144    113    145    135    134     562  280    842    ONE"
"090005 230000000005 MMMMM NNNNNNN              F  10/09/2011 143    138    128    129     76    150    145    137     547  295    842    ONE"
"090005 230000000005 MMMMMM MMMMMM              F  09/06/2011 137    138    140    141    102    141    145    137     556  286    842    ONE"
"090005 230000000005 MMMMMM KKKKKKKK            M  30/04/2011 136    144    134    136    ---    150    135    140     556  285    841    ONE"
"090005 230000000005 CCCCCCCCCC CCCCCCCC        M  14/07/2011 139    136    132    139    108    150    135    142     556  285    841    ONE"
"090005 230000000005 MMMMMMM KKKKKK NNNNNNN     F  11/03/2011 140    136    140    139     88    150    128    144     563  278    841    ONE"
"090005 230000000005 NNNNNNNNN MMMMM            M  16/03/1979 140    138    133    141    118    141    145    136     555  286    841    ONE"
"090005 230000000005 NNNNNNNN AAAAAAA LLLLLLL   F  11/07/2011 140    138    138    139     91    150    135    136     555  285    840    ONE"
"090005 230000000005 HHHHHHHHH WWWWWWW NCHIMUNYAM  10/09/2010 138    142    133    141    101    141    145    133     554  286    840    ONE"
"090005 230000000005 CCCCCCC GGGGG MMMMM        F  09/07/2011 140    142    140    142    ---    137    135    144     568  272    840    ONE"
"090005 230000000005 CCCCCCCCCC LLLLLL LLLLLL   M  14/06/2011 144    138    140    138    ---    145    135    138     560  280    840    ONE"
"090005 230000000005 BBBBB KKKKKKK              F  04/02/2011 141    138    132    141    114    145    135    140     560  280    840    ONE"
"090005 230000000005 MMMMMMM SSSSSS             M  08/03/2011 143    133    140    141    101    145    128    142     566  273    839    ONE"
"090005 230000000005 MMMMMMM NNNNNNNNN          F  22/06/2011 138    141    138    136    108    145    135    142     559  280    839    ONE"
end

Comment

Simwinga Simwinga

Join Date: Apr 2022
Posts: 36

26 Dec 2023, 07:59

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str140 orig
"FAC    EXAM NUM.    NAME                       SEX D.O.B.      ENGL   SS     MATHS  IS     ZL     SP1    SP2    CTS    S/T1 S/T2 SCORE "    
"090005 230000000005 KKKKK LLLLLLL NNNNNNNNNN   F  11/10/2010 143    144    140    142    ---    145    145    144     573  290    863    ONE"
"090005 230000000005 NNNNN MMMMMM ZZZZZ         F  19/02/2011 143    137    140    142    ---    141    145    142     567  286    853    ONE"
"090005 230000000005 KKKKKKK CCCCCCCC           M  17/04/2011 144    143    128    141    124    145    135    142     570  280    850    ONE"
"090005 230000000005 PPPPP TTTTTTT JJJJJJ       M  14/07/2010 138    138    140    139    ---    150    145    134     555  295    850    ONE"
"090005 230000000005 TTTTT BBBBBBB              M  12/05/2011 144    141    140    138    107    141    145    136     563  286    849    ONE"
"090005 230000000005 NNNNNNNN NNNNN             F  01/07/2011 144    125    140    144    114    137    145    138     566  282    848    ONE"
"090005 230000000005 EEEEE BBBBBB               F  22/03/2012 140    143    140    138    119    150    135    137     561  285    846    ONE"
"090005 230000000005 VVVVVVV KKKKKKK MMMMMM MORIFH 07/12/2009 144    141    136    141    116    141    135    144     570  276    846    ONE"
"090005 230000000005 NNNNNNNNN SSSSSSSSSS NNNN  F  12/01/2011 137    143    140    132    ---    145    145    135     555  290    845    ONE"
"090005 230000000005 LLLLLLLL LLLLLLL           F  23/03/2011 137    144    140    138    ---    145    135    142     564  280    844    ONE"
"090005 230000000005 MMMMMM MMMMMM              M  20/05/2011 143    143    140    141    102    137    135    144     571  272    843    ONE"
"090005 230000000005 IIIIII CCCCCCCC  SSSSSS    M  05/06/2011 142    136    140    141    ---    141    135    144     567  276    843    ONE"
"090005 230000000005 MMMMMM CCCCCCC AAAAAA      F  01/09/2011 144    135    140    138    113    137    145    138     560  282    842    ONE"
"090005 230000000005 MMMMMMMM MMMMMMMMM         M  22/01/2011 140    141    140    141     92    145    135    135     562  280    842    ONE"
"090005 230000000005 SSSSSSSSSS KKKKKK GGGGGGGG M  04/05/2011 144    133    140    144    113    145    135    134     562  280    842    ONE"
"090005 230000000005 MMMMM NNNNNNN              F  10/09/2011 143    138    128    129     76    150    145    137     547  295    842    ONE"
"090005 230000000005 MMMMMM MMMMMM              F  09/06/2011 137    138    140    141    102    141    145    137     556  286    842    ONE"
"090005 230000000005 MMMMMM KKKKKKKK            M  30/04/2011 136    144    134    136    ---    150    135    140     556  285    841    ONE"
"090005 230000000005 CCCCCCCCCC CCCCCCCC        M  14/07/2011 139    136    132    139    108    150    135    142     556  285    841    ONE"
"090005 230000000005 MMMMMMM KKKKKK NNNNNNN     F  11/03/2011 140    136    140    139     88    150    128    144     563  278    841    ONE"
"090005 230000000005 NNNNNNNNN MMMMM            M  16/03/1979 140    138    133    141    118    141    145    136     555  286    841    ONE"
"090005 230000000005 NNNNNNNN AAAAAAA LLLLLLL   F  11/07/2011 140    138    138    139     91    150    135    136     555  285    840    ONE"
"090005 230000000005 HHHHHHHHH WWWWWWW NCHIMUNYAM  10/09/2010 138    142    133    141    101    141    145    133     554  286    840    ONE"
"090005 230000000005 CCCCCCC GGGGG MMMMM        F  09/07/2011 140    142    140    142    ---    137    135    144     568  272    840    ONE"
"090005 230000000005 CCCCCCCCCC LLLLLL LLLLLL   M  14/06/2011 144    138    140    138    ---    145    135    138     560  280    840    ONE"
"090005 230000000005 BBBBB KKKKKKK              F  04/02/2011 141    138    132    141    114    145    135    140     560  280    840    ONE"
"090005 230000000005 MMMMMMM SSSSSS             M  08/03/2011 143    133    140    141    101    145    128    142     566  273    839    ONE"
"090005 230000000005 MMMMMMM NNNNNNNNN          F  22/06/2011 138    141    138    136    108    145    135    142     559  280    839    ONE"
end

Comment

Clyde Schechter

Join Date: Apr 2014

Posts: 29795
#10

26 Dec 2023, 10:29

So this data set presents a couple of difficulties. The first is that the first observation purports contain variable names. But, a) some of these are not legal variable names because they contain punctuation or embedded spaces, and b) there is no variable name provided for the final variable (which, oddly, always takes the value "ONE") But these are details that are easily overcome.

It is easy enough to write code to break apart the orig variable once we get rid of the first observation:

Code:

drop in 1 gen fac = substr(orig, 1, 6) gen exam_num = substr(orig, 8,12) gen name = substr(orig, 21, 27) gen sex = substr(orig, 48, 1) gen dob = daily(substr(orig, 51, 10), "DMY") format dob %tdDD/NN/CCYY gen the_rest = substr(orig, 61, .) split the_rest, gen(vble) destring ignore("-") rename vble* (engl ss maths is zl sp1 sp2 cts s_t1 s_t2 score huh_) drop orig

But the "magic numbers" appearing in the -substr()- expressions are hand counted. I understand your request to be some way of doing this without first counting out the locations of the starts of these variable by hand, but rather automating the process. The major obstacle here is the part of the string that represents name, as these strings vary not only in overall length but in the number of embedded blanks. But there are some reliable anchor points in the structure of orig that make it possible. fac and exam_num are always of the same length in each observation, so the blanks that follow them are easily located. Getting around name is a bit harder, but the first "/" in the date of birth will do. So the following code lifts the burden of counting this out from you:

Code:

drop in 1 // START OUT RELYING ON SEPARATION OF VARIABLES BY BLANKS gen blank_location = strpos(orig, " ") gen fac = substr(orig, 1, blank_location), before(orig) replace orig = subinstr(orig, fac, "", 1) replace blank_location = strpos(orig, " ") gen exam_num = substr(orig, 1, blank_location), before(orig) replace orig = subinstr(orig, exam_num, "", 1) // NOW WE HAVE REACHED THE NAMES. THE RELIABLE ANCHOR NOW IS THE FIRST / IN DOB drop blank_location gen anchor = strpos(orig, "/") gen name = substr(orig, 1, anchor-6), before(orig) replace orig = subinstr(orig, name, "", 1) // AND EVERYTHING ELSE CAN BE HANDLED BY SPLIT split orig, gen(vble) destring ignore("-") rename vble* (sex _dob engl ss maths is zl sp1 sp2 cts s_t1 s_t2 score huh_) drop orig gen dob = daily(_dob, "DMY"), after(_dob) format dob %tdDD/NN/CCYY assert missing(dob) == missing(_dob) drop _dob
Comment
Farza Chow

Join Date: Sep 2018

Posts: 7
#11

24 Jun 2024, 07:47

Hello;

I am trying to separate the country name and year. Example below:

Afghanistan2008
Germany2021

I tried the following code, but the country name leaves out the first letter of the country.

ssc inst moss

moss country, match("([0-9]+)") regex pre(num)
moss country, match("([A-z]+)") regex pre(num)
list country *match*

Thank you for your help
Comment

Andrew Musau

Join Date: Oct 2014
Posts: 9945

#12

24 Jun 2024, 08:26

Stata has plenty of string functions that you can use, no need for an external program here.

Code:

help string_functions

Code:

clear
input str200 text
"Afghanistan2008"
"Germany2021"
end

gen year1= real(substr(text, -4, 4))
gen country1= subinstr(text, string(year), "", .)

gen year2= real(ustrregexra(text, "(.*)(\d{4})$", "$2"))
gen country2= ustrregexra(text, "(.*)(\d{4})$", "$1")

Res.:

Code:

. l

     +-------------------------------------------------------------+
     |            text   year1      country1   year2      country2 |
     |-------------------------------------------------------------|
  1. | Afghanistan2008    2008   Afghanistan    2008   Afghanistan |
  2. |     Germany2021    2021       Germany    2021       Germany |
     +-------------------------------------------------------------+

Last edited by Andrew Musau; 24 Jun 2024, 08:43.

Announcement