  • character encoding problem with string variable value - cannot get STATA to recognize a string

    Hi everyone, my programmer collaborator and I have been banging our heads on this one, can you help? It's for our medical survey.

    I have a string variable, QID88_7_TEXT. Each obs has an id, which I'll use here for clarity: the observation in question has id 1747.
    We cannot change any values in the original data, so we can't replace this value; we can only create a new dataset through do files.
    In the STATA data browser window, the QID88_7_TEXT value for id 1747 is this:


    Code:
     I’ve stopped asking. And they don’t tell me voluntarily. I give so few fox, that I honestly can’t answer this question with confidence. In my mind the most accurate is PTSD, but I’m sure depression and bipolar are on the chart somewhere.
    When I run this code:

    Code:
    replace dx_recode = 24 if QID88_7_TEXT == "I’ve stopped asking. And they don’t tell me voluntarily. I give so few fox, that I honestly can’t answer this question with confidence. In my mind the most accurate is PTSD, but I’m sure depression and bipolar are on the chart somewhere."
    STATA returns this, indicating it is not finding a == match:

    Code:
    (0 real changes made)
    Looking at the string, I try it with a leading space, like this:

    Code:
    replace dx_recode = 24 if QID88_7_TEXT == " I’ve stopped asking. And they don’t tell me voluntarily. I give so few fox, that I honestly can’t answer this question with confidence. In my mind the most accurate is PTSD, but I’m sure depression and bipolar are on the chart somewhere."
    STATA returns this, indicating it is not finding a == match:

    Code:
    (0 real changes made)
    When I do a partial string match, it finds the obs:

    Code:
    list QID88_7_TEXT if strpos(QID88_7_TEXT, "asking") > 0
    
          +-----------------------------------------------------------------------------------------------------------------------+
          | QID88_7_TEXT                                                                                                          |
          |-----------------------------------------------------------------------------------------------------------------------|
    1138. |  I’ve stopped asking. And they don’t tell me voluntarily. I give so few fox, that I honestly can’t answer this ques.. |
          +-----------------------------------------------------------------------------------------------------------------------+
    
    .
    I think the problem is with the special character. In UTF-8 it is this:

    Code:
     
    E2 80 99 Right single quotation mark
    Note that it appears multiple times in the string. Here is the string again, copy-pasted on OS X directly from the STATA browser into Chrome:

    Code:
    I’ve stopped asking. And they don’t tell me voluntarily. I give so few fox, that I honestly can’t answer this question with confidence. In my mind the most accurate is PTSD, but I’m sure depression and bipolar are on the chart somewhere.
    So the question is: how do I rewrite the following code so that the == comparison matches the string with the special character, identifies the obs, and does the value replace?

    Code:
    replace dx_recode = 24 if QID88_7_TEXT == " I’ve stopped asking. And they don’t tell me voluntarily. I give so few fox, that I honestly can’t answer this question with confidence. In my mind the most accurate is PTSD, but I’m sure depression and bipolar are on the chart somewhere."
    THANK YOU!!!!!

    PS
    I have tried the code with a leading space and without a leading space. In the STATA data browser a leading space does seem to be in the string in question, but for some reason copy-paste drops the leading space. This might be a red herring but thought I'd mention it!
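
    A possible way to see what is actually stored in that value, before rewriting the comparison, is sketched below. It assumes Stata 14 or later (for the Unicode string functions) and the variable names used above (id, QID88_7_TEXT). The helper variables nbytes, nchars and hexhead are hypothetical names, created only in memory and dropped again, so the original values stay as they are.

    ```stata
    * Byte length vs. Unicode character length. U+2019 takes three bytes in
    * UTF-8 (E2 80 99), so each occurrence adds two to nbytes - nchars.
    generate long nbytes = strlen(QID88_7_TEXT)
    generate long nchars = ustrlen(QID88_7_TEXT)

    * Escaped hex code points of the first five characters, to see whether the
    * leading character is a plain space (\u0020) or something else.
    generate hexhead = ustrtohex(usubstr(QID88_7_TEXT, 1, 5))

    list id nbytes nchars hexhead if id == 1747

    * Remove the helper variables again (assumed names, see above).
    drop nbytes nchars hexhead
    ```

    If hexhead starts with anything other than \u0020, the "leading space" seen in the browser is not an ordinary blank, which could explain why a literal typed with a plain space does not match.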



    Last edited by Will Hall; Yesterday, 10:05.

  • #2
    You can use dataex to capture the entire string as is.

    Code:
    dataex QID88_7_TEXT in 1138


    • #3
      Thanks, but our do file needs to run and show the actual string, so that outside researchers can follow the changes as valid from a methodology standpoint. Otherwise we could just use the ID number or even the row number, so I don't think your suggestion works. We need something that actually matches the string, like a + char(34) + code or something.

      Really appreciate the prompt reply!
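
      One way to keep the full text visible in the do file while still controlling the exact characters being compared is to build the literal with uchar(), much like the char(34) idea. This is only a sketch under assumptions: Stata 14 or later, a stored value that really contains U+2019 (decimal 8217) as reported in #1, and leading or trailing whitespace that should be ignored. ustrtrim() is applied inside the comparison only, so the data are not changed.

      ```stata
      * uchar(8217) is U+2019 RIGHT SINGLE QUOTATION MARK.
      * ustrtrim() strips leading/trailing Unicode whitespace for the comparison only.
      * The /// line continuations work in a do file, not at the interactive prompt.
      replace dx_recode = 24 if ustrtrim(QID88_7_TEXT) == ///
          "I" + uchar(8217) + "ve stopped asking. And they don" + uchar(8217) + ///
          "t tell me voluntarily. I give so few fox, that I honestly can" + uchar(8217) + ///
          "t answer this question with confidence. In my mind the most accurate is PTSD, but I" + uchar(8217) + ///
          "m sure depression and bipolar are on the chart somewhere."
      ```

      Because the special character is written as a code point, the comparison does not depend on how the do file itself is encoded or on what copy-paste does to the quotation mark.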


      • #4
        Show us the dataex output.


        • #5

          . dataex QID88_7_TEXT if id == 1747

          ----------------------- copy starting from the next line -----------------------
          Code:
          * Example generated by -dataex-. For more info, type help dataex
          clear
          input str644 QID88_7_TEXT
          " I’ve stopped asking. And they don’t tell me voluntarily. I give so few fox, that I honestly can’t answer this question with confidence. In my mind the most accurate is PTSD, but I’m sure depression and bipolar are on the chart somewhere."
          end
          ------------------ copy up to and including the previous line ------------------

          Listed 1 out of 4406 observations


          • #6

            . dataex QID88_7_TEXT if id == 1747

            ----------------------- copy starting from the next line -----------------------
            Code:
            * Example generated by -dataex-. For more info, type help dataex
            clear
            input str644 QID88_7_TEXT
            " I’ve stopped asking. And they don’t tell me voluntarily. I give so few fox, that I honestly can’t answer this question with confidence. In my mind the most accurate is PTSD, but I’m sure depression and bipolar are on the chart somewhere."
            end
            ------------------ copy up to and including the previous line ------------------

            Listed 1 out of 4406 observations


            • #7
              Sorry, I have no idea how to delete these duplicate posts. I click "edit" but there is no "delete" button to be found.


              • #8
                I don't see any issues here. You can use chartab from SSC to look at the characters that make up the string. Otherwise see

                Code:
                help strtrim()
                and

                Code:
                help stritrim()
                to remove leading and trailing blanks and to collapse repeated internal blanks.

                Code:
                * Example generated by -dataex-. For more info, type help dataex
                clear
                input str644 QID88_7_TEXT
                " I’ve stopped asking. And they don’t tell me voluntarily. I give so few fox, that I honestly can’t answer this question with confidence. In my mind the most accurate is PTSD, but I’m sure depression and bipolar are on the chart somewhere."
                end
                
                gen test=.
                replace test= 1 if QID88_7_TEXT==" I've stopped asking. And they don't tell me voluntarily. I give so few fox, that I honestly can't answer this question with confidence. In my mind the most accurate is PTSD, but I'm sure depression and bipolar are on the chart somewhere."
                
                *ssc install chartab 
                chartab QID
                Res.:

                Code:
                . 
                . replace test= 1 if QID88_7_TEXT==" I've stopped asking. And they don't tell me voluntarily. I give so few fox, that I honestly can'
                > t answer this question with confidence. In my mind the most accurate is PTSD, but I'm sure depression and bipolar are on the chart 
                > somewhere."
                (1 real change made)
                
                . 
                . 
                . 
                . *ssc install chartab 
                
                . 
                . chartab QID
                
                   decimal  hexadecimal   character |     frequency    unique name
                ------------------------------------+----------------------------------------
                        32       \u0020             |            42    SPACE
                        39       \u0027       '     |             4    APOSTROPHE
                        44       \u002c       ,     |             2    COMMA
                        46       \u002e       .     |             4    FULL STOP
                        65       \u0041       A     |             1    LATIN CAPITAL LETTER A
                        68       \u0044       D     |             1    LATIN CAPITAL LETTER D
                        73       \u0049       I     |             5    LATIN CAPITAL LETTER I
                        80       \u0050       P     |             1    LATIN CAPITAL LETTER P
                        83       \u0053       S     |             1    LATIN CAPITAL LETTER S
                        84       \u0054       T     |             1    LATIN CAPITAL LETTER T
                        97       \u0061       a     |            11    LATIN SMALL LETTER A
                        98       \u0062       b     |             2    LATIN SMALL LETTER B
                        99       \u0063       c     |             6    LATIN SMALL LETTER C
                       100       \u0064       d     |             7    LATIN SMALL LETTER D
                       101       \u0065       e     |            22    LATIN SMALL LETTER E
                       102       \u0066       f     |             3    LATIN SMALL LETTER F
                       103       \u0067       g     |             2    LATIN SMALL LETTER G
                       104       \u0068       h     |             9    LATIN SMALL LETTER H
                       105       \u0069       i     |            11    LATIN SMALL LETTER I
                       107       \u006b       k     |             1    LATIN SMALL LETTER K
                       108       \u006c       l     |             6    LATIN SMALL LETTER L
                       109       \u006d       m     |             6    LATIN SMALL LETTER M
                       110       \u006e       n     |            15    LATIN SMALL LETTER N
                       111       \u006f       o     |            13    LATIN SMALL LETTER O
                       112       \u0070       p     |             4    LATIN SMALL LETTER P
                       113       \u0071       q     |             1    LATIN SMALL LETTER Q
                       114       \u0072       r     |             9    LATIN SMALL LETTER R
                       115       \u0073       s     |            13    LATIN SMALL LETTER S
                       116       \u0074       t     |            18    LATIN SMALL LETTER T
                       117       \u0075       u     |             5    LATIN SMALL LETTER U
                       118       \u0076       v     |             3    LATIN SMALL LETTER V
                       119       \u0077       w     |             4    LATIN SMALL LETTER W
                       120       \u0078       x     |             1    LATIN SMALL LETTER X
                       121       \u0079       y     |             4    LATIN SMALL LETTER Y
                ------------------------------------+----------------------------------------
                
                                                    freq. count   distinct
                ASCII characters              =             238         34
                Multibyte UTF-8 characters    =               0          0
                Unicode replacement character =               0          0
                Total Unicode characters      =             238         34
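
                Note that the chartab table above lists only U+0027 APOSTROPHE and reports 0 multibyte UTF-8 characters, so the example as run contained straight apostrophes, whereas #1 reports E2 80 99 (U+2019) in the original data. If that is the difference, one possible approach is to compare against a normalized copy and leave the original variable untouched. The sketch below is meant for the original dataset (where id and dx_recode exist) and assumes Stata 14 or later; n_rsq and txt_norm are hypothetical variable names.

                ```stata
                * Count U+2019 in each value: the number of characters lost when it is removed.
                generate n_rsq = ustrlen(QID88_7_TEXT) - ustrlen(usubinstr(QID88_7_TEXT, uchar(8217), "", .))
                list id n_rsq if id == 1747

                * Normalized copy: U+2019 -> ASCII apostrophe, outer whitespace trimmed,
                * repeated internal blanks collapsed to one. QID88_7_TEXT itself is unchanged.
                generate txt_norm = stritrim(ustrtrim(usubinstr(QID88_7_TEXT, uchar(8217), "'", .)))

                * Match on the normalized copy using a plain ASCII literal.
                replace dx_recode = 24 if txt_norm == "I've stopped asking. And they don't tell me voluntarily. I give so few fox, that I honestly can't answer this question with confidence. In my mind the most accurate is PTSD, but I'm sure depression and bipolar are on the chart somewhere."
                ```

                A nonzero n_rsq for id 1747 would suggest the stored text does use the curly quotation mark, in which case a literal typed with ' can only match after this kind of normalization.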
