Destring returns "contains nonnumeric characters; no replace"

Clyde Schechter

Join Date: Apr 2014

Posts: 29294
#16

15 Mar 2020, 12:45

Without example data that reproduces the problem, I can't give you specific advice. What I can say, at a general level, is that somehow, what you think are "En désaccord" and "En total désaccord" in your data set, really aren't. One possibility is that those responses are padded with leading or trailing blanks, which your eye does not see, but Stata does. If that is the problem,

Code:

replace `var' = trim(itrim(`var'))

before the -encode- will resolve the problem.

If it is not a matter of blanks, there may be "non-printing" characters embedded in Educ_16, which, again, your eye does not see, but Stata does. Those are more difficult to deal with, as there is no simple cleanup function like trim() to remove them. For a start you can run -chartab- (by Robert Picard, available from SSC) to identify all the characters contained in Educ_16. You will then have to use -subinstr()- or -usubinstr()- to remove them.
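As a hedged sketch of that last step: assume, purely for illustration, that -chartab- reports a non-breaking space (U+00A0, decimal 160) in Educ_16. That code point is an assumption, not something established in this thread; substitute whatever -chartab- actually reports. uchar() and usubinstr() require Stata 14 or later.

Code:

ssc install chartab
chartab Educ_16
* hypothetical culprit: replace 160 with the code point chartab reports
replace Educ_16 = usubinstr(Educ_16, uchar(160), " ", .)
replace Educ_16 = trim(itrim(Educ_16))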
Kamala Kaghoma

Join Date: Dec 2019

Posts: 26
#17

16 Mar 2020, 01:40

Many thanks John.
Surprisingly, the options that had disappeared are repeated, as you can see here:

educat:
1 En total désaccord
2 En désaccord
3 Neutre
4 D'accord
5 Tout à fait d’accord
6 D'accord
7 En total désaccord
8 En désaccord

What is strange is that when I try to recode them, I get the following message:

too few variables specified

This message appears when I use:

Code:

foreach var of varlist Educ_1-Educ_18 {
    recode (7=1) (8=2) (6=4)
}

Really strange!
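
The error most likely arises because -recode- was given no variable name: inside the loop, the recoding rules must follow the loop macro `var'. A minimal sketch, assuming Educ_1-Educ_18 are already numeric and carry the educat coding listed above:

Code:

foreach var of varlist Educ_1-Educ_18 {
    recode `var' (7=1) (8=2) (6=4)
}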
Kamala Kaghoma

Join Date: Dec 2019

Posts: 26
#18

16 Mar 2020, 02:08

Dear Clyde,
Many thanks. I've tried trimming, without success. I will look into Robert Picard's -chartab- as suggested. In the meantime, here is my attempt to mimic the data structure:

clear
* id holds the leading row number; str20 is wide enough for the longest response
input byte id str20(Educ_1 Educ_2 Educ_3)
1 "En total désaccord" "D'accord" "D'accord"
2 "En désaccord" "En désaccord" "Neutre"
3 "D'accord" "En désaccord" "D'accord"
4 "En désaccord" "En total désaccord" "D'accord"
5 "D'accord" "D'accord" "D'accord"
6 "D'accord" "En total désaccord" "En total désaccord"
7 "En total désaccord" "En désaccord" "En désaccord"
8 "Neutre" "Neutre" "En désaccord"
9 "Neutre" "En total désaccord" "En total désaccord"
10 "Neutre" "En total désaccord" "En total désaccord"
11 "Neutre" "Neutre" "Neutre"
end

I will look and see how it can help.
Many thanks
Andrew Musau

Join Date: Oct 2014

Posts: 9706
#19

16 Mar 2020, 02:42

I do not see anything wrong here. From your post in #14

. tab Educ_16

   Ensei. compétents |      Freq.     Percent        Cum.
---------------------+-----------------------------------
              Neutre |         33       34.74       34.74
            D'accord |         46       48.42       83.16
Tout à fait d’accord |          2        2.11       85.26
  En total désaccord |          5        5.26       90.53
        En désaccord |          9        9.47      100.00
---------------------+-----------------------------------
               Total |         95      100.00

. tab Educ_16, nolab

     Ensei. |
 compétents |      Freq.     Percent        Cum.
------------+-----------------------------------
          3 |         33       34.74       34.74
          4 |         46       48.42       83.16
          5 |          2        2.11       85.26
          6 |          5        5.26       90.53
          7 |          9        9.47      100.00
------------+-----------------------------------
      Total |         95      100.00

The variable Educ_16 does not contain the values 1 and 2. In addition, here is what label list shows

educat:
1 En total désaccord
2 En désaccord
3 Neutre
4 D'accord
5 Tout à fait d’accord
6 D'accord
7 En total désaccord
8 En désaccord

There is no rule that the same value label text cannot be used for more than one value. Therefore, the question is: what do you want the label to look like, given that the values you are labeling are within the range 1-8?
Kamala Kaghoma

Join Date: Dec 2019

Posts: 26
#20

16 Mar 2020, 04:41

Dear Andrew, many thanks. Sorry that the elements I posted were not as clear as required. In fact, on the tablet used to collect the data, I had just 5 options, which are also the first 5 in the list above. This is what showed me that something was wrong. I checked and did not find a cell with "6", "7" or "8" anywhere, but they appear when I encode the variable. I think that what Clyde suggested (post #16) is likely to be the true problem. I've read about and tried "chartab" by Robert Picard, but I have not made much progress so far.
Kamala Kaghoma

Join Date: Dec 2019

Posts: 26
#21

16 Mar 2020, 06:44

The good thing is that when I code all the variables separately (as opposed to what I reported in post #17), everything works perfectly. However, this allows me to see another strange thing: I have 1680 observations in total, but only 95 are being used by Stata for the different operations I want to run.

Code:

tab enquetecible

                Enquêté cîble |      Freq.     Percent        Cum.
------------------------------+-----------------------------------
                       Ménage |      1,259       74.90       74.90
          Unité de production |        327       19.45       94.35
Ménage et Unité de production |         95        5.65      100.00
------------------------------+-----------------------------------
                        Total |      1,681      100.00

Only the observations corresponding to the last option are being used for the different operations. The data base is imported from an Excel file using the following command:

Code:

import excel "BD_all1_versions_25.01.2020.xlsx", sheet("perception_qlty") firstrow clear

Additional commands are :

Code:

rename DATEDELENQUÊTE date_svy
rename Enquêtécible enquete
rename NOMDELENQUÊTÉ name_enqt
gen enquetecible=0
replace enquetecible=1 if enquete=="1. Ménage"
replace enquetecible=2 if enquete=="2. Unité de production"
replace enquetecible=3 if enquete=="3. Ménage et Unité de production"
label define enqueteciblecode 1 "Ménage" 2 "Unité de production" 3 "Ménage et Unité de production"
label value enquetecible enqueteciblecode
label var enquetecible "Enquêté cîble"
drop if enquetecible==0 /* one observation (=0) whose origin I do not know */

I have no clue what the origin of this misbehavior of the data could be. I attach the dataset in Excel in case more detail is needed.
Many thanks in advance.

Attached Files

BD_all1_versions_25.01.2020.xlsx (1.31 MB, 1 view)
Kamala Kaghoma

Join Date: Dec 2019

Posts: 26
#22

16 Mar 2020, 06:49

The good thing is that when I RECODE all the variables separately (as opposed to what I reported in post #17), everything works perfectly. However, this allows me to see another strange thing: I have 1680 observations in total, but only 95 are being used by Stata for the different operations I want to run.

Many thanks in advance.
Andrew Musau

Join Date: Oct 2014

Posts: 9706
#23

16 Mar 2020, 08:23

I would rather fix the problem than use recode later on. Here is one way, which makes the labels consistent beforehand.

Code:

foreach var of varlist Educ_1-Educ_18 {
    replace `var'= "total" if ustrregexm(lower(`var'), "total")
    replace `var'= "En désaccord" if ustrregexm(lower(`var'), "en")
    replace `var'= "Neutre" if ustrregexm(lower(`var'), "neut")
    replace `var'= "D'accord" if ustrregexm(lower(`var'), "^d")
    replace `var'= "Tout à fait d’accord" if ustrregexm(lower(`var'), "tout")
    replace `var'= "En total désaccord" if ustrregexm(lower(`var'), "total")
}
label define educat 1 "En total désaccord" 2 "En désaccord" 3 "Neutre" 4 "D'accord" 5 "Tout à fait d’accord"
foreach var of varlist Educ_1-Educ_18 {
    encode `var', gen(`var'_) label(educat)
    drop `var'
    rename `var'_ `var'
}

I have in total 1680 observations but only 95 are being used by STATA for the different operations I want to run.

This is usually due to missing values in other variables that you use. Stata's estimation commands use listwise (casewise) deletion: if any variable used by the command is missing in an observation, that whole observation is excluded from the computation (it is not removed from the dataset). This means that you have in total 95 complete cases, where no variable is missing, for the particular task that you were undertaking. If you search for "multiple imputation", you will see a way to deal with missing values. Finally, to see the sample after a regression:

Code:

regress ....
gen sample = e(sample)
browse if sample
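
One hedged way to check whether missing values explain the 95 without running a regression is to count, for each observation, how many of the variables involved are missing. The variable list Educ_1-Educ_18 is an assumption here (and is assumed to be numeric, that is, already encoded), and nmiss is a made-up name; substitute the variables used in the actual command.

Code:

* number of missing values per observation across the chosen variables
egen nmiss = rowmiss(Educ_1-Educ_18)
tab nmiss
* complete cases: observations with no missing value in any of them
count if nmiss == 0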
Nick Cox

Join Date: Mar 2014

Posts: 34555
#24

16 Mar 2020, 08:44

In #23 there are multiple adjacent spaces in some strings. Following #16 I would clean up all the string variables with trim(itrim()) and also check for non-standard characters and from the other end only use labels with single spaces.

Code:

tab1 Educ_1-Educ_18

would be a further check.
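
A sketch of that cleanup, assuming Educ_1-Educ_18 are still string variables at this point (that is, before -encode-), to be run before the -tab1- check:

Code:

foreach var of varlist Educ_1-Educ_18 {
    * strip leading/trailing blanks and collapse internal runs of blanks
    replace `var' = trim(itrim(`var'))
}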

Kamala Kaghoma

Join Date: Dec 2019
Posts: 26

#25

16 Mar 2020, 08:58

Dear Andrew,
Many thanks. This code

Code:

foreach var of varlist Educ_1-Educ_18 {
    replace `var'= "total" if ustrregexm(lower(`var'), "total")
    replace `var'= "En  désaccord" if ustrregexm(lower(`var'), "en")
    replace `var'= "Neutre" if ustrregexm(lower(`var'), "neut")
    replace `var'= "D'accord" if ustrregexm(lower(`var'), "^d")
    replace `var'= "Tout à fait d’accord" if ustrregexm(lower(`var'), "tout")
    replace `var'= "En  total désaccord" if ustrregexm(lower(`var'), "total")
}
label define educat 1"En  total désaccord" 2"En  désaccord" 3"Neutre" 4"D'accord" 5"Tout à fait d’accord"
foreach var of varlist Educ_1-Educ_18 {
    encode `var', gen(`var'_) label(educat)
    drop `var'
    rename `var'_ `var'
}

works perfectly. However, concerning the observations that are disappearing, I attached the whole dataset to give an indication of the true problem I am facing. The problem appears before I reach the regression stage, and for option "1" (Ménage) most of the observations have no missing values, yet they disappear. I even tried to run some of the operations conditional on options "1" and "2" (of the three options), and even then the operations were not carried out on more than the 95 observations. Many thanks,


Kamala Kaghoma

Join Date: Dec 2019

Posts: 26
#26

16 Mar 2020, 09:10

Many thanks, Nick. I will check that also. The solution suggested from the code provided by Andrew in post #23 provides a solution to the problem I had. I will though try the solution you have also suggested.

On another issue, Andrew has replied as follows

This is usually due to missing values in other variables that you use. Stata uses listwise deletion of missing values. Therefore, if an observation of a particular variable is missing, Stata deletes the whole observation. This means that you have in total 95 complete cases where no variable is missing for the particular task that you were undertaking. If you search for "multiple imputation", you will see a way to deal with missing values. Finally, to see the sample after a regression

to my request.

I've tried to check whether missing values may be the problem, and I don't really think they are. The problem appears before I reach the regression stage, and for option "1" (Ménage) most of the observations have no missing values, yet they disappear. I even tried to run some of the operations conditional on options "1" and "2" (of the three options), and even then the operations were not carried out on more than the 95 observations. I attached the whole dataset to give an indication of the true problem I am facing.

Many thanks.

Last edited by Kamala Kaghoma; 16 Mar 2020, 09:16.
Andrew Musau

Join Date: Oct 2014

Posts: 9706
#27

16 Mar 2020, 09:55

To Nick's point, your value label has spaces which I did not notice as I was copying and pasting, (e.g., "En total désaccord" has 2 spaces between "En" and "total"). This may explain the problem with the initial encode, and it's much better to address this than replacing the labels as I do.

Only the observations corresponding to the last option are being used for the different operations.

Can you provide code that leads you to conclude that some observations are ignored?

Last edited by Andrew Musau; 16 Mar 2020, 10:03.
Kamala Kaghoma

Join Date: Dec 2019

Posts: 26
#28

16 Mar 2020, 10:57

Dear Andrew, many thanks for your reaction.

To Nick's point, your value label has spaces which I did not notice as I was copying and pasting, (e.g., "En total désaccord" has 2 spaces between "En" and "total"). This may explain the problem with the initial encode, and it's much better to address this than replacing the labels as I do.

This has already been taken into account. I trimmed all the variables before running the suggested commands.

As for the second question, as I mentioned in post#21, tabulating

Code:

tab enquetecible

shows the three options I have in the dataset: 1 corresponds to "Ménage" (household), 2 to "Unité de production" (production unit), and 3 to "Ménage et Unité de production" (household and production unit), for a total of 1,681 observations. Option 3 corresponds to the situation where the whole questionnaire was administered to an individual who represents both her household and a production unit. In the former two options, the respondent is either a representative of a household, and thus given one component of the questionnaire, or a representative of a production unit, and thus given the producer's component of the questionnaire. However, even when I tabulate, for instance, the same variable as in post #21, I get the same results as in that post, instead of results for the 95+1,259 observations to which the analysis should be restricted after I have kept only part of the set of observations:

Code:

preserve
keep if enquetecible==1 | enquetecible==3
******************************

Thanks
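
One thing worth checking, offered as a possibility rather than a diagnosis: if these lines are run from a do-file, data set aside with -preserve- are restored automatically when the do-file ends, so a -tab- typed afterwards sees the full dataset again. A sketch that keeps the tabulation inside the preserved block:

Code:

preserve
keep if inlist(enquetecible, 1, 3)
* tabulate here, while the subset is still in memory
tab enquetecible
restore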

Andrew Musau

Join Date: Oct 2014
Posts: 9706

#29

16 Mar 2020, 12:50

I am not getting the same result running your code. Below, I rewrite it to save some lines

Code:

import excel "BD_all1_versions_25.01.2020.xlsx", sheet("perception_qlty") firstrow clear
rename (DATEDELENQUÊTE Enquêtécible NOMDELENQUÊTÉ) (date_svy enquete name_enqt)
tab enquete
gen enquetecible= real(substr(enquete, 1, 1))
tab enquetecible
preserve
keep if inlist(enquetecible,1, 3)
tab enquetecible

Res.:

Code:

. tab enquete

                    Enquêté cible  |      Freq.     Percent        Cum.
-----------------------------------+-----------------------------------
                         1. Ménage |      1,259       74.90       74.90
            2. Unité de production |        327       19.45       94.35
  3. Ménage et Unité de production |         95        5.65      100.00
-----------------------------------+-----------------------------------
                             Total |      1,681      100.00

.
. gen enquetecible= real(substr(enquete, 1, 1))
(1 missing value generated)

.
. tab enquetecible

enquetecibl |
          e |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |      1,259       74.90       74.90
          2 |        327       19.45       94.35
          3 |         95        5.65      100.00
------------+-----------------------------------
      Total |      1,681      100.00

.
. preserve

.
. keep if inlist(enquetecible,1, 3)
(328 observations deleted)

.
. tab enquetecible

enquetecibl |
          e |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |      1,259       92.98       92.98
          3 |         95        7.02      100.00
------------+-----------------------------------
      Total |      1,354      100.00

I get the same result running your code in #21. Note that I am using Stata 16, but I don't see a reason why the version should matter here.


Kamala Kaghoma

Join Date: Dec 2019
Posts: 26

#30

16 Mar 2020, 13:33

Dear Andrew,
This is quite intriguing. I agree that the version of Stata should not have much to do with it. I am in fact using Stata 14. I've just attached part of my do-file. Maybe you can help me spot a problem somewhere in it that I am missing. Below is the output of

Code:

tab Educ_1

and

Code:

tab enquetecible

after running

Code:

preserve

and before I

Code:

restore

. I can't really understand what is wrong.

Code:

. **********************************************/ 
end of do-file

tab Educ_1

 Ecoles: organisée & |
         bien gérées |      Freq.     Percent        Cum.
---------------------+-----------------------------------
 En  total désaccord |          9        9.47        9.47
       En  désaccord |         35       36.84       46.32
              Neutre |         19       20.00       66.32
            D'accord |         28       29.47       95.79
Tout à fait d’accord |          4        4.21      100.00
---------------------+-----------------------------------
               Total |         95      100.00

. tab enquetecible

                Enquêté cîble |      Freq.     Percent        Cum.
------------------------------+-----------------------------------
                       Ménage |      1,258       74.88       74.88
          Unité de production |        327       19.46       94.35
Ménage et Unité de production |         95        5.65      100.00
------------------------------+-----------------------------------
                        Total |      1,680      100.00

Many thanks in advance for your help.
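
A hedged diagnostic, not a confirmed explanation: -tab Educ_1- omits observations where Educ_1 is missing, so a total of 95 alongside 1,680 in -tab enquetecible- would be consistent with Educ_1 being recorded only for some respondents. The following shows how any missing values are distributed across respondent types:

Code:

* how many observations have no value for Educ_1
count if missing(Educ_1)
* cross-tabulate, listing missing values of Educ_1 as their own column
tab enquetecible Educ_1, missing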

Attached Files

part_of_my_dofile.do (11.8 KB, 1 view)
