character encoding problem with string variable value - cannot get STATA to recognize a string

Will Hall

Join Date: Dec 2019
Posts: 38

Yesterday, 10:02

Hi everyone, my programmer collaborator and I have been banging our heads on this one. Can you help? It's for our medical survey.

I have a string variable, QID88_7_TEXT. Each observation has an id number, so I'll use that for clarity: the observation in question has id 1747.
We cannot change any values in the original data, so we can't replace this value; we can only create a new dataset from do-files.
In the Stata data browser, the value of QID88_7_TEXT for id 1747 is this:

Code:

 I’ve stopped asking. And they don’t tell me voluntarily. I give so few fox, that I honestly can’t answer this question with confidence. In my mind the most accurate is PTSD, but I’m sure depression and bipolar are on the chart somewhere.

When I run this code:

Code:

replace dx_recode = 24 if QID88_7_TEXT == "I’ve stopped asking. And they don’t tell me voluntarily. I give so few fox, that I honestly can’t answer this question with confidence. In my mind the most accurate is PTSD, but I’m sure depression and bipolar are on the chart somewhere."

STATA returns this, indicating it is not finding a == match:

Code:

(0 real changes made)

Looking at the string, I try it with a leading space, like this:

Code:

replace dx_recode = 24 if QID88_7_TEXT == " I’ve stopped asking. And they don’t tell me voluntarily. I give so few fox, that I honestly can’t answer this question with confidence. In my mind the most accurate is PTSD, but I’m sure depression and bipolar are on the chart somewhere."

STATA returns this, indicating it is not finding a == match:

Code:

(0 real changes made)

When I do a partial string match, it finds the obs:

Code:

list QID88_7_TEXT if strpos(QID88_7_TEXT, "asking") > 0

      +-----------------------------------------------------------------------------------------------------------------------+
      | QID88_7_TEXT                                                                                                          |
      |-----------------------------------------------------------------------------------------------------------------------|
1138. |  I’ve stopped asking. And they don’t tell me voluntarily. I give so few fox, that I honestly can’t answer this ques.. |
      +-----------------------------------------------------------------------------------------------------------------------+

.

I think the problem is the special character. In UTF-8 it is this:

Code:

 ’
E2 80 99
Right single quotation mark
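
One way to check that hypothesis from within Stata (version 14 or later, where strings are stored as UTF-8) is to search for the character by code point: uchar(8217) is U+2019, the right single quotation mark, and char(39) is the ASCII apostrophe. The sketch below runs on a one-row toy dataset with a shortened string; the names has_curly, has_straight and extra_bytes are made up for illustration.

```stata
* Toy data standing in for the survey file (hypothetical, shortened text)
clear
set obs 1
gen str60 QID88_7_TEXT = " I" + uchar(8217) + "ve stopped asking."

* ustrpos() returns the character position of the first match, 0 if absent
gen byte has_curly    = ustrpos(QID88_7_TEXT, uchar(8217)) > 0
gen byte has_straight = strpos(QID88_7_TEXT, char(39)) > 0

* strlen() counts bytes, ustrlen() counts Unicode characters.
* U+2019 is stored as 3 bytes (E2 80 99), so each one adds 2 to the difference.
gen int extra_bytes = strlen(QID88_7_TEXT) - ustrlen(QID88_7_TEXT)

list has_curly has_straight extra_bytes
```

If has_curly is 1 and has_straight is 0 on the real data, the stored apostrophes are U+2019 and any literal typed with a plain ' cannot match under ==.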

Note that it appears multiple times in the string. Here is the string again, copied and pasted on macOS directly from the Stata data browser into Chrome:

Code:

I’ve stopped asking. And they don’t tell me voluntarily. I give so few fox, that I honestly can’t answer this question with confidence. In my mind the most accurate is PTSD, but I’m sure depression and bipolar are on the chart somewhere.

So the question is: how do I rewrite the following code so that the == comparison matches the special character, identifies the observation, and performs the replacement?

Code:

replace dx_recode = 24 if QID88_7_TEXT == " I’ve stopped asking. And they don’t tell me voluntarily. I give so few fox, that I honestly can’t answer this question with confidence. In my mind the most accurate is PTSD, but I’m sure depression and bipolar are on the chart somewhere."

THANK YOU!!!!!

PS
I have tried the code with a leading space and without a leading space. In the STATA data browser a leading space does seem to be in the string in question, but for some reason copy-paste drops the leading space. This might be a red herring but thought I'd mention it!
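
One possible way to write the comparison so that it does not depend on how the do-file itself is encoded is to build the right single quotation mark from its code point with uchar(8217) and splice it into the literal through a local macro, which keeps the text readable for outside reviewers. This is only a minimal sketch on a one-row toy dataset with a shortened version of the response. It assumes the stored character really is U+2019 and that the value begins with one ordinary space, neither of which is confirmed at this point in the thread. The macro name q is a placeholder.

```stata
* Toy data standing in for the survey file (hypothetical, shortened text)
clear
set obs 1
gen id = 1747
gen str80 QID88_7_TEXT = " I" + uchar(8217) + "ve stopped asking. And they don" + uchar(8217) + "t tell me voluntarily."
gen dx_recode = .

* Store U+2019 in a local macro so the literal below stays readable
local q = uchar(8217)

* The macro q expands to the curly apostrophe before the comparison is evaluated
replace dx_recode = 24 if QID88_7_TEXT == " I`q've stopped asking. And they don`q't tell me voluntarily."

list id dx_recode
```

The original variable is never modified; only dx_recode changes, which fits the constraint of leaving the source data untouched.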

Last edited by Will Hall; Yesterday, 10:05.

Tags: Encoding, noob help, special characters, string

Andrew Musau

Join Date: Oct 2014

Posts: 9969
#2

Yesterday, 11:49

You can use dataex to capture the entire string as is.

Code:

dataex QID88_7_TEXT in 1138
Will Hall

Join Date: Dec 2019

Posts: 38
#3

Yesterday, 14:40

Thanks, but our do-file needs to run and show the actual string so that outside researchers can follow the changes and accept them as valid from a methodology standpoint. Otherwise we could just use the ID number or even the row number. So I don't think your suggestion works for us. We need something that actually matches the string, something like a + char(34) + code.

Really appreciate the prompt reply!
Comment
Andrew Musau

Join Date: Oct 2014

Posts: 9969
#4

Yesterday, 15:33

Show us the dataex output.
Comment

Will Hall

Join Date: Dec 2019
Posts: 38

Today, 08:41

. dataex QID88_7_TEXT if id == 1747

----------------------- copy starting from the next line -----------------------

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str644 QID88_7_TEXT
" I’ve stopped asking. And they don’t tell me voluntarily. I give so few fox, that I honestly can’t answer this question with confidence. In my mind the most accurate is PTSD, but I’m sure depression and bipolar are on the chart somewhere."
end

------------------ copy up to and including the previous line ------------------

Listed 1 out of 4406 observations

Comment

Will Hall

Join Date: Dec 2019
Posts: 38

Today, 08:41

. dataex QID88_7_TEXT if id == 1747

----------------------- copy starting from the next line -----------------------

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str644 QID88_7_TEXT
" I’ve stopped asking. And they don’t tell me voluntarily. I give so few fox, that I honestly can’t answer this question with confidence. In my mind the most accurate is PTSD, but I’m sure depression and bipolar are on the chart somewhere."
end

------------------ copy up to and including the previous line ------------------

Listed 1 out of 4406 observations

Comment

Will Hall

Join Date: Dec 2019

Posts: 38
#7

Today, 08:44

Sorry, I have no idea how to delete these duplicate posts. I click "edit" but there is no "delete" button to be found.
Comment

Andrew Musau

Join Date: Oct 2014
Posts: 9969

Today, 09:03

I don't see any issues here. You can use chartab from SSC to look at the characters that make up the string. Otherwise see

Code:

help strtrim()

and

Code:

help stritrim()

to eliminate leading, trailing and internal blanks.
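
As a rough illustration of what those two functions do, here is a minimal sketch on a made-up string; the variable names raw and cleaned are hypothetical and not part of the poster's data.

```stata
* Toy data: one string with leading, trailing and repeated internal blanks
clear
set obs 1
gen str40 raw = "  I give   so few fox "

* strtrim() removes leading and trailing blanks;
* stritrim() collapses runs of internal blanks to a single blank
gen str40 cleaned = stritrim(strtrim(raw))

list raw cleaned
```

Using the trimmed value only inside an if condition, for example if strtrim(QID88_7_TEXT) == "...", leaves the stored value as it is.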

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str644 QID88_7_TEXT
" I’ve stopped asking. And they don’t tell me voluntarily. I give so few fox, that I honestly can’t answer this question with confidence. In my mind the most accurate is PTSD, but I’m sure depression and bipolar are on the chart somewhere."
end

gen test=.
replace test= 1 if QID88_7_TEXT==" I've stopped asking. And they don't tell me voluntarily. I give so few fox, that I honestly can't answer this question with confidence. In my mind the most accurate is PTSD, but I'm sure depression and bipolar are on the chart somewhere."

*ssc install chartab 
chartab QID

Result:

Code:

. 
. replace test= 1 if QID88_7_TEXT==" I've stopped asking. And they don't tell me voluntarily. I give so few fox, that I honestly can'
> t answer this question with confidence. In my mind the most accurate is PTSD, but I'm sure depression and bipolar are on the chart 
> somewhere."
(1 real change made)

. 
. 
. 
. *ssc install chartab 

. 
. chartab QID

   decimal  hexadecimal   character |     frequency    unique name
------------------------------------+----------------------------------------
        32       \u0020             |            42    SPACE
        39       \u0027       '     |             4    APOSTROPHE
        44       \u002c       ,     |             2    COMMA
        46       \u002e       .     |             4    FULL STOP
        65       \u0041       A     |             1    LATIN CAPITAL LETTER A
        68       \u0044       D     |             1    LATIN CAPITAL LETTER D
        73       \u0049       I     |             5    LATIN CAPITAL LETTER I
        80       \u0050       P     |             1    LATIN CAPITAL LETTER P
        83       \u0053       S     |             1    LATIN CAPITAL LETTER S
        84       \u0054       T     |             1    LATIN CAPITAL LETTER T
        97       \u0061       a     |            11    LATIN SMALL LETTER A
        98       \u0062       b     |             2    LATIN SMALL LETTER B
        99       \u0063       c     |             6    LATIN SMALL LETTER C
       100       \u0064       d     |             7    LATIN SMALL LETTER D
       101       \u0065       e     |            22    LATIN SMALL LETTER E
       102       \u0066       f     |             3    LATIN SMALL LETTER F
       103       \u0067       g     |             2    LATIN SMALL LETTER G
       104       \u0068       h     |             9    LATIN SMALL LETTER H
       105       \u0069       i     |            11    LATIN SMALL LETTER I
       107       \u006b       k     |             1    LATIN SMALL LETTER K
       108       \u006c       l     |             6    LATIN SMALL LETTER L
       109       \u006d       m     |             6    LATIN SMALL LETTER M
       110       \u006e       n     |            15    LATIN SMALL LETTER N
       111       \u006f       o     |            13    LATIN SMALL LETTER O
       112       \u0070       p     |             4    LATIN SMALL LETTER P
       113       \u0071       q     |             1    LATIN SMALL LETTER Q
       114       \u0072       r     |             9    LATIN SMALL LETTER R
       115       \u0073       s     |            13    LATIN SMALL LETTER S
       116       \u0074       t     |            18    LATIN SMALL LETTER T
       117       \u0075       u     |             5    LATIN SMALL LETTER U
       118       \u0076       v     |             3    LATIN SMALL LETTER V
       119       \u0077       w     |             4    LATIN SMALL LETTER W
       120       \u0078       x     |             1    LATIN SMALL LETTER X
       121       \u0079       y     |             4    LATIN SMALL LETTER Y
------------------------------------+----------------------------------------

                                    freq. count   distinct
ASCII characters              =             238         34
Multibyte UTF-8 characters    =               0          0
Unicode replacement character =               0          0
Total Unicode characters      =             238         34
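
The table reports only the ASCII apostrophe (decimal 39) and no multibyte characters in this copy of the data, whereas the original question concerns U+2019. If the stored value does contain U+2019, one hedged option is to normalise the apostrophes inside the comparison only and match against a literal typed with straight apostrophes. The following is a minimal sketch on toy data with a shortened string, not a confirmed resolution of the thread.

```stata
* Toy data containing a curly apostrophe (U+2019), hypothetical shortened text
clear
set obs 1
gen str40 QID88_7_TEXT = " I" + uchar(8217) + "m sure depression"
gen dx_recode = .

* usubinstr() replaces every U+2019 with char(39); "." means all occurrences.
* The result is used only in the condition, so the stored value is unchanged.
replace dx_recode = 24 if usubinstr(QID88_7_TEXT, uchar(8217), char(39), .) == " I'm sure depression"

list
```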
