Splitting a string varaibles that has a space

Nick Cox

Join Date: Mar 2014

Posts: 35698
#16

21 Feb 2019, 09:17

We don't encourage .dta attachments, as the FAQ Advice explains.

Why is dist str105??? Is there garbage somewhere in there pushing it up to 105 characters?

I would try

Code:

replace dist = trim(itrim(subinstr(dist, char(160), " ", .)))
Comment

Robert Picard

Join Date: Mar 2014
Posts: 1536

#17

21 Feb 2019, 09:41

It looks like Nick's intuition is correct and the data contains Unicode non-breaking space characters as shown in the chartab output in #10. Here's a simple example that shows that split does not work as expected:

Code:

. clear

. set obs 1
number of observations (_N) was 0, now 1

. gen s = "1" + uchar(160) + "2"

. split s
variable created as string: 
s1

. list

     +-----------+
     |   s    s1 |
     |-----------|
  1. | 1 2   1 2 |
     +-----------+

.

It's a good idea to perform some Unicode due diligence when working with string data now that Stata operates in Unicode. A quick check shows the presence of a Unicode character:

Code:

. chartab s

   decimal  hexadecimal   character |     frequency    unique name
------------------------------------+--------------------------------
        49       \u0031       1     |             1    DIGIT ONE
        50       \u0032       2     |             1    DIGIT TWO
       160       \u00a0             |             1    NO-BREAK SPACE
------------------------------------+--------------------------------

                                    freq. count   distinct
ASCII characters              =               2          2
Multibyte UTF-8 characters    =               1          1
Unicode replacement character =               0          0
Total Unicode characters      =               3          3

Since this is a Unicode character, you have to use the uchar() function to refer to it. If you want these non-breaking Unicode characters to be treated as regular ASCII spaces, you can substitute all occurrences using something like:

Code:

. gen ss = subinstr(s, uchar(160), " ", .)

. split ss
variables created as string: 
ss1  ss2

. list

     +-----------------------------+
     |   s    s1    ss   ss1   ss2 |
     |-----------------------------|
  1. | 1 2   1 2   1 2     1     2 |
     +-----------------------------+

.

You can use subinstr() in this case because uchar(160) is a multi-byte Unicode character, in other words a string composed of two bytes.

You could also replace any and all Unicode whitespace characters using regular expressions:

Code:

. gen sfix = ustrregexra(s, "\s", " ", .)

. chartab sfix

   decimal  hexadecimal   character |     frequency    unique name
------------------------------------+---------------------------
        32       \u0020             |             1    SPACE
        49       \u0031       1     |             1    DIGIT ONE
        50       \u0032       2     |             1    DIGIT TWO
------------------------------------+---------------------------

                                    freq. count   distinct
ASCII characters              =               3          3
Multibyte UTF-8 characters    =               0          0
Unicode replacement character =               0          0
Total Unicode characters      =               3          3

Comment

Falco Wolf

Join Date: Oct 2018
Posts: 83

#18

21 Feb 2019, 09:47

Nick and the rest that have helped until now, thank you very much.

The original data comes from here:
http://www.bcn.cat/estadistica/angle...a2002/VL03.htm

Which I copied into an excel and then imported into stata. The dataex's come from there. So if anyone would like to see what is happening exactly to my dataset, and perhaps you find a way that is much more straight forward than what I am putting here, please do share! I would be very happy, since I have to do the same process and create a panel with yearly data of this variable and many more. this specific data is a dataset for Cadastral Value of certain district sections of Barcelona and the dist is the district.

Nick, I did your command and it almost did the trick.
replace dist = trim(itrim(subinstr(dist, char(160), " ", .)))
split dist, and the below appeared. Which now, effectively seperates the characters that needed to be seperated. However, there is � character attached to the first character. Is there a way to avoid that?

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str105 dist str3 dist1 str1(dist2 dist3 dist4 dist5) str3 dist6
"2� � � � � 47"  "2�" "�" "�" "�" "�" "47" 
"2� � � � � 48"  "2�" "�" "�" "�" "�" "48" 
"2� � � � � 49"  "2�" "�" "�" "�" "�" "49" 
"2� � � � � 50"  "2�" "�" "�" "�" "�" "50" 
"2� � � � � 51"  "2�" "�" "�" "�" "�" "51" 
"2� � � � � 52"  "2�" "�" "�" "�" "�" "52" 
"2� � � � � 53"  "2�" "�" "�" "�" "�" "53" 
"2� � � � � 54"  "2�" "�" "�" "�" "�" "54" 
"2� � � � � 55"  "2�" "�" "�" "�" "�" "55" 
"2� � � � � 56"  "2�" "�" "�" "�" "�" "56" 
"2� � � � � 57"  "2�" "�" "�" "�" "�" "57" 
"2� � � � � 58"  "2�" "�" "�" "�" "�" "58" 
"2� � � � � 59"  "2�" "�" "�" "�" "�" "59" 
"2� � � � � 60"  "2�" "�" "�" "�" "�" "60" 
"2� � � � � 61"  "2�" "�" "�" "�" "�" "61" 
"2� � � � � 62"  "2�" "�" "�" "�" "�" "62" 
"2� � � � � 63"  "2�" "�" "�" "�" "�" "63" 
"2� � � � � 64"  "2�" "�" "�" "�" "�" "64" 
"2� � � � � 65"  "2�" "�" "�" "�" "�" "65" 
"2� � � � � 66"  "2�" "�" "�" "�" "�" "66" 
"2� � � � � 67"  "2�" "�" "�" "�" "�" "67" 
"2� � � � � 68"  "2�" "�" "�" "�" "�" "68" 
"2� � � � � 69"  "2�" "�" "�" "�" "�" "69" 
"2� � � � � 70"  "2�" "�" "�" "�" "�" "70" 
"2� � � � � 71"  "2�" "�" "�" "�" "�" "71" 
"2� � � � � 72"  "2�" "�" "�" "�" "�" "72" 
"2� � � � � 73"  "2�" "�" "�" "�" "�" "73" 
"2� � � � � 74"  "2�" "�" "�" "�" "�" "74" 
"2� � � � � 75"  "2�" "�" "�" "�" "�" "75" 
"2� � � � � 76"  "2�" "�" "�" "�" "�" "76" 
"2� � � � � 77"  "2�" "�" "�" "�" "�" "77" 
"2� � � � � 78"  "2�" "�" "�" "�" "�" "78" 
"2� � � � � 79"  "2�" "�" "�" "�" "�" "79" 
"2� � � � � 80"  "2�" "�" "�" "�" "�" "80" 
"2� � � � � 81"  "2�" "�" "�" "�" "�" "81" 
"2� � � � � 82"  "2�" "�" "�" "�" "�" "82" 
"2� � � � � 83"  "2�" "�" "�" "�" "�" "83" 
"2� � � � � 84"  "2�" "�" "�" "�" "�" "84" 
"2� � � � � 85"  "2�" "�" "�" "�" "�" "85" 
"2� � � � � 86"  "2�" "�" "�" "�" "�" "86" 
"2� � � � � 87"  "2�" "�" "�" "�" "�" "87" 
"2� � � � � 88"  "2�" "�" "�" "�" "�" "88" 
"2� � � � � 89"  "2�" "�" "�" "�" "�" "89" 
"2� � � � � 90"  "2�" "�" "�" "�" "�" "90" 
"2� � � � � 91"  "2�" "�" "�" "�" "�" "91" 
"2� � � � � 92"  "2�" "�" "�" "�" "�" "92" 
"2� � � � � 93"  "2�" "�" "�" "�" "�" "93" 
"2� � � � � 94"  "2�" "�" "�" "�" "�" "94" 
"2� � � � � 95"  "2�" "�" "�" "�" "�" "95" 
"2� � � � � 96"  "2�" "�" "�" "�" "�" "96" 
"2� � � � � 97"  "2�" "�" "�" "�" "�" "97" 
"2� � � � � 98"  "2�" "�" "�" "�" "�" "98" 
"2� � � � � 99"  "2�" "�" "�" "�" "�" "99" 
"2� � � � � 100" "2�" "�" "�" "�" "�" "100"
"2� � � � � 101" "2�" "�" "�" "�" "�" "101"
"2� � � � � 102" "2�" "�" "�" "�" "�" "102"
"2� � � � � 103" "2�" "�" "�" "�" "�" "103"
"2� � � � � 104" "2�" "�" "�" "�" "�" "104"
"2� � � � � 105" "2�" "�" "�" "�" "�" "105"
"2� � � � � 106" "2�" "�" "�" "�" "�" "106"
"2� � � � � 107" "2�" "�" "�" "�" "�" "107"
"2� � � � � 108" "2�" "�" "�" "�" "�" "108"
"2� � � � � 109" "2�" "�" "�" "�" "�" "109"
"2� � � � � 110" "2�" "�" "�" "�" "�" "110"
"2� � � � � 111" "2�" "�" "�" "�" "�" "111"
"2� � � � � 112" "2�" "�" "�" "�" "�" "112"
"2� � � � � 113" "2�" "�" "�" "�" "�" "113"
"2� � � � � 114" "2�" "�" "�" "�" "�" "114"
"2� � � � � 115" "2�" "�" "�" "�" "�" "115"
"2� � � � � 116" "2�" "�" "�" "�" "�" "116"
"2� � � � � 117" "2�" "�" "�" "�" "�" "117"
"2� � � � � 118" "2�" "�" "�" "�" "�" "118"
"2� � � � � 119" "2�" "�" "�" "�" "�" "119"
"2� � � � � 120" "2�" "�" "�" "�" "�" "120"
"2� � � � � 121" "2�" "�" "�" "�" "�" "121"
"2� � � � � 122" "2�" "�" "�" "�" "�" "122"
"2� � � � � 123" "2�" "�" "�" "�" "�" "123"
"2� � � � � 124" "2�" "�" "�" "�" "�" "124"
"2� � � � � 125" "2�" "�" "�" "�" "�" "125"
"2� � � � � 126" "2�" "�" "�" "�" "�" "126"
"2� � � � � 127" "2�" "�" "�" "�" "�" "127"
"2� � � � � 128" "2�" "�" "�" "�" "�" "128"
"2� � � � � 129" "2�" "�" "�" "�" "�" "129"
"2� � � � � 130" "2�" "�" "�" "�" "�" "130"
"2� � � � � 131" "2�" "�" "�" "�" "�" "131"
"2� � � � � 132" "2�" "�" "�" "�" "�" "132"
"2� � � � � 133" "2�" "�" "�" "�" "�" "133"
"2� � � � � 134" "2�" "�" "�" "�" "�" "134"
"2� � � � � 135" "2�" "�" "�" "�" "�" "135"
"2� � � � � 136" "2�" "�" "�" "�" "�" "136"
"2� � � � � 137" "2�" "�" "�" "�" "�" "137"
"2� � � � � 138" "2�" "�" "�" "�" "�" "138"
"2� � � � � 139" "2�" "�" "�" "�" "�" "139"
"2� � � � � 140" "2�" "�" "�" "�" "�" "140"
"2� � � � � 141" "2�" "�" "�" "�" "�" "141"
"2� � � � � 142" "2�" "�" "�" "�" "�" "142"
"2� � � � � 143" "2�" "�" "�" "�" "�" "143"
"2� � � � � 144" "2�" "�" "�" "�" "�" "144"
"2� � � � � 145" "2�" "�" "�" "�" "�" "145"
"2� � � � � 146" "2�" "�" "�" "�" "�" "146"
end

Comment

Robert Picard

Join Date: Mar 2014
Posts: 1536

#19

21 Feb 2019, 10:08

I copied the first few lines from the link you provided and pasted them into Excel. Here's the solution offered in #17 applied to your data:

Code:

. import excel using "test.xlsx", clear

. chartab A

   decimal  hexadecimal   character |     frequency    unique name
------------------------------------+--------------------------------
        49       \u0031       1     |            10    DIGIT ONE
        50       \u0032       2     |             1    DIGIT TWO
        51       \u0033       3     |             1    DIGIT THREE
        52       \u0034       4     |             1    DIGIT FOUR
        53       \u0035       5     |             1    DIGIT FIVE
        54       \u0036       6     |             1    DIGIT SIX
        55       \u0037       7     |             1    DIGIT SEVEN
        56       \u0038       8     |             1    DIGIT EIGHT
        57       \u0039       9     |             1    DIGIT NINE
       160       \u00a0             |            63    NO-BREAK SPACE
------------------------------------+--------------------------------

                                    freq. count   distinct
ASCII characters              =              18          9
Multibyte UTF-8 characters    =              63          1
Unicode replacement character =               0          0
Total Unicode characters      =              81         10


. gen dist = subinstr(A, uchar(160), " ", .)

. split dist
variables created as string: 
dist1  dist2

. list

     +-------------------------------------------------------------------------+
     |         A         B         C         D   E        dist   dist1   dist2 |
     |-------------------------------------------------------------------------|
  1. | 1     1     533,9     314,2     219,7     .   1     1         1       1 |
  2. | 1     2     506,4     377,7     128,7     .   1     2         1       2 |
  3. | 1     3     504,8     358,8     146,1     .   1     3         1       3 |
  4. | 1     4     527,3     355,7     167,3     .   1     4         1       4 |
  5. | 1     5     749,8     387,9     361,1     .   1     5         1       5 |
     |-------------------------------------------------------------------------|
  6. | 1     6     382,5     232,8     149,6     .   1     6         1       6 |
  7. | 1     7     456,8     288,5     168,3     .   1     7         1       7 |
  8. | 1     8     393,3     240,5     152,8     .   1     8         1       8 |
  9. | 1     9     385,5     248,1       137,4   .   1     9         1       9 |
     +-------------------------------------------------------------------------+

.

Comment

Robert Picard

Join Date: Mar 2014

Posts: 1536
#20

21 Feb 2019, 10:16

I just used the Sandbox forum and can confirm that dataex will output these Unicode whitespace characters to Stata's Results window. However, when pasted in the forum, these are replaced with regular spaces.
2 likes
Comment
Jorrit Gosens

Join Date: Jan 2015

Posts: 1019
#21

21 Feb 2019, 11:11

If you are going to copy more of those sheets and want to import into Stata, you can save yourself some headache by copy pasting into notepad (txt file).
Then in Stata:

Code:

import delimited C:\Desktop\test.txt, delimiter(whitespace, collapse)

Same as options 'whitespace' as separator, and 'treat sequential delimiters as one' checked.

That results in:

Code:

input byte v1 int v2 str7(v3 v4) str5 v5 byte v6 4 98 "860,7" "500,4" "360,3" . 4 99 "1.018,6" "643,6" "375,0" . 4 100 "835,4" "511,2" "324,2" . 4 101 "1.200,1" "853,5" "346,6" . 5 102 "925,9" "583,4" "342,5" .

With the benefit that 1) your tricky variable is already split into two parts, and 2) any remaining trailing or leading whitespaces in the other variables are eaten up already as well.

then you still need to get rid of the commas as decimal points and dots as thousand separators:

Code:

ds, has(type string) local strvars "`r(varlist)'" foreach var of local strvars { replace `var'=subinstr(`var',".","",.) } destring, replace dpcomma

And all your data is in order:

Code:

clear input byte v1 int v2 double(v3 v4 v5) byte v6 4 98 860.7 500.4 360.3 . 4 99 1018.6 643.6 375 . 4 100 835.4 511.2 324.2 . 4 101 1200.1 853.5 346.6 . 5 102 925.9 583.4 342.5 . end
1 like
Comment
Falco Wolf

Join Date: Oct 2018

Posts: 83
#22

22 Feb 2019, 02:29

Thank you all. It finally works.
I'm very thankful for all the time you all spent here working on this problem.
1 like
Comment
Awais Ahmed

Join Date: Oct 2020

Posts: 8
#23

05 Jun 2021, 20:03

Dear all, I have one question regarding how to split var. I have var that has month and days such as Apr1, Apr11, Mar1, Mar22, Dec1, Dec22 etc..... Basically days are next to month name therefore if anyone know how to split days from month . I need only new variable that shows me only days. Any help would be appreciated. Many thanks ahead.
Comment
Joro Kolev

Join Date: Aug 2018

Posts: 3050
#24

05 Jun 2021, 22:38

Try this

Code:

. dis substr("Apr1",4,l) 1

and if you do not manage to make it work, show a sample of your data using -dataex-.

Originally posted by Awais Ahmed View Post

Dear all, I have one question regarding how to split var. I have var that has month and days such as Apr1, Apr11, Mar1, Mar22, Dec1, Dec22 etc..... Basically days are next to month name therefore if anyone know how to split days from month . I need only new variable that shows me only days. Any help would be appreciated. Many thanks ahead.
Comment

Announcement

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment