Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • #16
    We don't encourage .dta attachments, as the FAQ Advice explains.

    Why is dist str105??? Is there garbage somewhere in there pushing it up to 105 characters?

    I would try

    Code:
    replace dist = trim(itrim(subinstr(dist, char(160), " ", .))) 

    Comment


    • #17
      It looks like Nick's intuition is correct and the data contains Unicode non-breaking space characters as shown in the chartab output in #10. Here's a simple example that shows that split does not work as expected:
      Code:
      . clear
      
      . set obs 1
      number of observations (_N) was 0, now 1
      
      . gen s = "1" + uchar(160) + "2"
      
      . split s
      variable created as string: 
      s1
      
      . list
      
           +-----------+
           |   s    s1 |
           |-----------|
        1. | 1 2   1 2 |
           +-----------+
      
      .
      It's a good idea to perform some Unicode due diligence when working with string data now that Stata operates in Unicode. A quick check shows the presence of a Unicode character:
      Code:
      . chartab s
      
         decimal  hexadecimal   character |     frequency    unique name
      ------------------------------------+--------------------------------
              49       \u0031       1     |             1    DIGIT ONE
              50       \u0032       2     |             1    DIGIT TWO
             160       \u00a0             |             1    NO-BREAK SPACE
      ------------------------------------+--------------------------------
      
                                          freq. count   distinct
      ASCII characters              =               2          2
      Multibyte UTF-8 characters    =               1          1
      Unicode replacement character =               0          0
      Total Unicode characters      =               3          3
      Since this is a Unicode character, you have to use the uchar() function to refer to it. If you want these non-breaking Unicode characters to be treated as regular ASCII spaces, you can substitute all occurrences using something like:
      Code:
      . gen ss = subinstr(s, uchar(160), " ", .)
      
      . split ss
      variables created as string: 
      ss1  ss2
      
      . list
      
           +-----------------------------+
           |   s    s1    ss   ss1   ss2 |
           |-----------------------------|
        1. | 1 2   1 2   1 2     1     2 |
           +-----------------------------+
      
      .
      You can use subinstr() in this case because uchar(160) is a multi-byte Unicode character, in other words a string composed of two bytes.

      You could also replace any and all Unicode whitespace characters using regular expressions:

      Code:
      . gen sfix = ustrregexra(s, "\s", " ", .)
      
      . chartab sfix
      
         decimal  hexadecimal   character |     frequency    unique name
      ------------------------------------+---------------------------
              32       \u0020             |             1    SPACE
              49       \u0031       1     |             1    DIGIT ONE
              50       \u0032       2     |             1    DIGIT TWO
      ------------------------------------+---------------------------
      
                                          freq. count   distinct
      ASCII characters              =               3          3
      Multibyte UTF-8 characters    =               0          0
      Unicode replacement character =               0          0
      Total Unicode characters      =               3          3

      Comment


      • #18
        Nick and the rest that have helped until now, thank you very much.

        The original data comes from here:
        http://www.bcn.cat/estadistica/angle...a2002/VL03.htm

        Which I copied into an excel and then imported into stata. The dataex's come from there. So if anyone would like to see what is happening exactly to my dataset, and perhaps you find a way that is much more straight forward than what I am putting here, please do share! I would be very happy, since I have to do the same process and create a panel with yearly data of this variable and many more. this specific data is a dataset for Cadastral Value of certain district sections of Barcelona and the dist is the district.


        Nick, I did your command and it almost did the trick.
        replace dist = trim(itrim(subinstr(dist, char(160), " ", .)))
        split dist, and the below appeared. Which now, effectively seperates the characters that needed to be seperated. However, there is � character attached to the first character. Is there a way to avoid that?



        Code:
        * Example generated by -dataex-. To install: ssc install dataex
        clear
        input str105 dist str3 dist1 str1(dist2 dist3 dist4 dist5) str3 dist6
        "2� � � � � 47"  "2�" "�" "�" "�" "�" "47" 
        "2� � � � � 48"  "2�" "�" "�" "�" "�" "48" 
        "2� � � � � 49"  "2�" "�" "�" "�" "�" "49" 
        "2� � � � � 50"  "2�" "�" "�" "�" "�" "50" 
        "2� � � � � 51"  "2�" "�" "�" "�" "�" "51" 
        "2� � � � � 52"  "2�" "�" "�" "�" "�" "52" 
        "2� � � � � 53"  "2�" "�" "�" "�" "�" "53" 
        "2� � � � � 54"  "2�" "�" "�" "�" "�" "54" 
        "2� � � � � 55"  "2�" "�" "�" "�" "�" "55" 
        "2� � � � � 56"  "2�" "�" "�" "�" "�" "56" 
        "2� � � � � 57"  "2�" "�" "�" "�" "�" "57" 
        "2� � � � � 58"  "2�" "�" "�" "�" "�" "58" 
        "2� � � � � 59"  "2�" "�" "�" "�" "�" "59" 
        "2� � � � � 60"  "2�" "�" "�" "�" "�" "60" 
        "2� � � � � 61"  "2�" "�" "�" "�" "�" "61" 
        "2� � � � � 62"  "2�" "�" "�" "�" "�" "62" 
        "2� � � � � 63"  "2�" "�" "�" "�" "�" "63" 
        "2� � � � � 64"  "2�" "�" "�" "�" "�" "64" 
        "2� � � � � 65"  "2�" "�" "�" "�" "�" "65" 
        "2� � � � � 66"  "2�" "�" "�" "�" "�" "66" 
        "2� � � � � 67"  "2�" "�" "�" "�" "�" "67" 
        "2� � � � � 68"  "2�" "�" "�" "�" "�" "68" 
        "2� � � � � 69"  "2�" "�" "�" "�" "�" "69" 
        "2� � � � � 70"  "2�" "�" "�" "�" "�" "70" 
        "2� � � � � 71"  "2�" "�" "�" "�" "�" "71" 
        "2� � � � � 72"  "2�" "�" "�" "�" "�" "72" 
        "2� � � � � 73"  "2�" "�" "�" "�" "�" "73" 
        "2� � � � � 74"  "2�" "�" "�" "�" "�" "74" 
        "2� � � � � 75"  "2�" "�" "�" "�" "�" "75" 
        "2� � � � � 76"  "2�" "�" "�" "�" "�" "76" 
        "2� � � � � 77"  "2�" "�" "�" "�" "�" "77" 
        "2� � � � � 78"  "2�" "�" "�" "�" "�" "78" 
        "2� � � � � 79"  "2�" "�" "�" "�" "�" "79" 
        "2� � � � � 80"  "2�" "�" "�" "�" "�" "80" 
        "2� � � � � 81"  "2�" "�" "�" "�" "�" "81" 
        "2� � � � � 82"  "2�" "�" "�" "�" "�" "82" 
        "2� � � � � 83"  "2�" "�" "�" "�" "�" "83" 
        "2� � � � � 84"  "2�" "�" "�" "�" "�" "84" 
        "2� � � � � 85"  "2�" "�" "�" "�" "�" "85" 
        "2� � � � � 86"  "2�" "�" "�" "�" "�" "86" 
        "2� � � � � 87"  "2�" "�" "�" "�" "�" "87" 
        "2� � � � � 88"  "2�" "�" "�" "�" "�" "88" 
        "2� � � � � 89"  "2�" "�" "�" "�" "�" "89" 
        "2� � � � � 90"  "2�" "�" "�" "�" "�" "90" 
        "2� � � � � 91"  "2�" "�" "�" "�" "�" "91" 
        "2� � � � � 92"  "2�" "�" "�" "�" "�" "92" 
        "2� � � � � 93"  "2�" "�" "�" "�" "�" "93" 
        "2� � � � � 94"  "2�" "�" "�" "�" "�" "94" 
        "2� � � � � 95"  "2�" "�" "�" "�" "�" "95" 
        "2� � � � � 96"  "2�" "�" "�" "�" "�" "96" 
        "2� � � � � 97"  "2�" "�" "�" "�" "�" "97" 
        "2� � � � � 98"  "2�" "�" "�" "�" "�" "98" 
        "2� � � � � 99"  "2�" "�" "�" "�" "�" "99" 
        "2� � � � � 100" "2�" "�" "�" "�" "�" "100"
        "2� � � � � 101" "2�" "�" "�" "�" "�" "101"
        "2� � � � � 102" "2�" "�" "�" "�" "�" "102"
        "2� � � � � 103" "2�" "�" "�" "�" "�" "103"
        "2� � � � � 104" "2�" "�" "�" "�" "�" "104"
        "2� � � � � 105" "2�" "�" "�" "�" "�" "105"
        "2� � � � � 106" "2�" "�" "�" "�" "�" "106"
        "2� � � � � 107" "2�" "�" "�" "�" "�" "107"
        "2� � � � � 108" "2�" "�" "�" "�" "�" "108"
        "2� � � � � 109" "2�" "�" "�" "�" "�" "109"
        "2� � � � � 110" "2�" "�" "�" "�" "�" "110"
        "2� � � � � 111" "2�" "�" "�" "�" "�" "111"
        "2� � � � � 112" "2�" "�" "�" "�" "�" "112"
        "2� � � � � 113" "2�" "�" "�" "�" "�" "113"
        "2� � � � � 114" "2�" "�" "�" "�" "�" "114"
        "2� � � � � 115" "2�" "�" "�" "�" "�" "115"
        "2� � � � � 116" "2�" "�" "�" "�" "�" "116"
        "2� � � � � 117" "2�" "�" "�" "�" "�" "117"
        "2� � � � � 118" "2�" "�" "�" "�" "�" "118"
        "2� � � � � 119" "2�" "�" "�" "�" "�" "119"
        "2� � � � � 120" "2�" "�" "�" "�" "�" "120"
        "2� � � � � 121" "2�" "�" "�" "�" "�" "121"
        "2� � � � � 122" "2�" "�" "�" "�" "�" "122"
        "2� � � � � 123" "2�" "�" "�" "�" "�" "123"
        "2� � � � � 124" "2�" "�" "�" "�" "�" "124"
        "2� � � � � 125" "2�" "�" "�" "�" "�" "125"
        "2� � � � � 126" "2�" "�" "�" "�" "�" "126"
        "2� � � � � 127" "2�" "�" "�" "�" "�" "127"
        "2� � � � � 128" "2�" "�" "�" "�" "�" "128"
        "2� � � � � 129" "2�" "�" "�" "�" "�" "129"
        "2� � � � � 130" "2�" "�" "�" "�" "�" "130"
        "2� � � � � 131" "2�" "�" "�" "�" "�" "131"
        "2� � � � � 132" "2�" "�" "�" "�" "�" "132"
        "2� � � � � 133" "2�" "�" "�" "�" "�" "133"
        "2� � � � � 134" "2�" "�" "�" "�" "�" "134"
        "2� � � � � 135" "2�" "�" "�" "�" "�" "135"
        "2� � � � � 136" "2�" "�" "�" "�" "�" "136"
        "2� � � � � 137" "2�" "�" "�" "�" "�" "137"
        "2� � � � � 138" "2�" "�" "�" "�" "�" "138"
        "2� � � � � 139" "2�" "�" "�" "�" "�" "139"
        "2� � � � � 140" "2�" "�" "�" "�" "�" "140"
        "2� � � � � 141" "2�" "�" "�" "�" "�" "141"
        "2� � � � � 142" "2�" "�" "�" "�" "�" "142"
        "2� � � � � 143" "2�" "�" "�" "�" "�" "143"
        "2� � � � � 144" "2�" "�" "�" "�" "�" "144"
        "2� � � � � 145" "2�" "�" "�" "�" "�" "145"
        "2� � � � � 146" "2�" "�" "�" "�" "�" "146"
        end

        Comment


        • #19
          I copied the first few lines from the link you provided and pasted them into Excel. Here's the solution offered in #17 applied to your data:
          Code:
          . import excel using "test.xlsx", clear
          
          . chartab A
          
             decimal  hexadecimal   character |     frequency    unique name
          ------------------------------------+--------------------------------
                  49       \u0031       1     |            10    DIGIT ONE
                  50       \u0032       2     |             1    DIGIT TWO
                  51       \u0033       3     |             1    DIGIT THREE
                  52       \u0034       4     |             1    DIGIT FOUR
                  53       \u0035       5     |             1    DIGIT FIVE
                  54       \u0036       6     |             1    DIGIT SIX
                  55       \u0037       7     |             1    DIGIT SEVEN
                  56       \u0038       8     |             1    DIGIT EIGHT
                  57       \u0039       9     |             1    DIGIT NINE
                 160       \u00a0             |            63    NO-BREAK SPACE
          ------------------------------------+--------------------------------
          
                                              freq. count   distinct
          ASCII characters              =              18          9
          Multibyte UTF-8 characters    =              63          1
          Unicode replacement character =               0          0
          Total Unicode characters      =              81         10
          
          
          . gen dist = subinstr(A, uchar(160), " ", .)
          
          . split dist
          variables created as string: 
          dist1  dist2
          
          . list
          
               +-------------------------------------------------------------------------+
               |         A         B         C         D   E        dist   dist1   dist2 |
               |-------------------------------------------------------------------------|
            1. | 1     1     533,9     314,2     219,7     .   1     1         1       1 |
            2. | 1     2     506,4     377,7     128,7     .   1     2         1       2 |
            3. | 1     3     504,8     358,8     146,1     .   1     3         1       3 |
            4. | 1     4     527,3     355,7     167,3     .   1     4         1       4 |
            5. | 1     5     749,8     387,9     361,1     .   1     5         1       5 |
               |-------------------------------------------------------------------------|
            6. | 1     6     382,5     232,8     149,6     .   1     6         1       6 |
            7. | 1     7     456,8     288,5     168,3     .   1     7         1       7 |
            8. | 1     8     393,3     240,5     152,8     .   1     8         1       8 |
            9. | 1     9     385,5     248,1       137,4   .   1     9         1       9 |
               +-------------------------------------------------------------------------+
          
          .

          Comment


          • #20
            I just used the Sandbox forum and can confirm that dataex will output these Unicode whitespace characters to Stata's Results window. However, when pasted in the forum, these are replaced with regular spaces.

            Comment


            • #21
              If you are going to copy more of those sheets and want to import into Stata, you can save yourself some headache by copy pasting into notepad (txt file).
              Then in Stata:
              Code:
              import delimited C:\Desktop\test.txt, delimiter(whitespace, collapse)
              Same as options 'whitespace' as separator, and 'treat sequential delimiters as one' checked.

              That results in:
              Code:
              input byte v1 int v2 str7(v3 v4) str5 v5 byte v6
              4  98 "860,7"   "500,4" "360,3" .
              4  99 "1.018,6" "643,6" "375,0" .
              4 100 "835,4"   "511,2" "324,2" .
              4 101 "1.200,1" "853,5" "346,6" .
              5 102 "925,9"   "583,4" "342,5" .
              With the benefit that 1) your tricky variable is already split into two parts, and 2) any remaining trailing or leading whitespaces in the other variables are eaten up already as well.

              then you still need to get rid of the commas as decimal points and dots as thousand separators:
              Code:
              ds, has(type string) 
              local strvars "`r(varlist)'"
              foreach var of local strvars {
                  replace `var'=subinstr(`var',".","",.)    
              }
              destring, replace dpcomma
              And all your data is in order:
              Code:
              clear
              input byte v1 int v2 double(v3 v4 v5) byte v6
              4  98  860.7 500.4 360.3 .
              4  99 1018.6 643.6   375 .
              4 100  835.4 511.2 324.2 .
              4 101 1200.1 853.5 346.6 .
              5 102  925.9 583.4 342.5 .
              end

              Comment


              • #22
                Thank you all. It finally works.
                I'm very thankful for all the time you all spent here working on this problem.

                Comment


                • #23
                  Dear all, I have one question regarding how to split var. I have var that has month and days such as Apr1, Apr11, Mar1, Mar22, Dec1, Dec22 etc..... Basically days are next to month name therefore if anyone know how to split days from month . I need only new variable that shows me only days. Any help would be appreciated. Many thanks ahead.

                  Comment


                  • #24
                    Try this
                    Code:
                    . dis substr("Apr1",4,l)
                    1
                    and if you do not manage to make it work, show a sample of your data using -dataex-.

                    Originally posted by Awais Ahmed View Post
                    Dear all, I have one question regarding how to split var. I have var that has month and days such as Apr1, Apr11, Mar1, Mar22, Dec1, Dec22 etc..... Basically days are next to month name therefore if anyone know how to split days from month . I need only new variable that shows me only days. Any help would be appreciated. Many thanks ahead.

                    Comment

                    Working...
                    X