Encode many observations

Hugo Rocha

Join Date: Feb 2021

Posts: 288
#1

Encode many observations

16 Sep 2021, 10:07

Dear all,

I have just concluded cleaning up my dataset with around 848,950 observations. Nevertheless, some of the variables are string variables which I need to encode (convert to numeric). When I follow the typical procedure of encoding the variables, I get the following error message:

error . . . . . . . . . . . . . . . . . . . . . . . . Return code 134
too many values
1) You attempted to encode a string variable that takes on
more than 65,536 unique values. 2) You attempted to tabulate
a variable or pair of variables that take on too many values.
If you specified two variables, try interchanging them.
3) You issued a graph command using the by option. The
by-variable takes on too many different values to construct
a readable chart.

Is there any other way I could encode these string variables?

Thank you so much!
Tags: None
Clyde Schechter

Join Date: Apr 2014

Posts: 29956
#2

16 Sep 2021, 10:24

I don't think there is anything you can do that will fully emulate the behavior of -encode- for this. The limit of 65,536 distinct values applies to the number of labels that can be placed in a value label.

You can create a numeric variable that distinguishes all the value of the string variable.

Code:

egen long numeric_var = group(string_var)

will do that. But that numeric variable cannot receive a value label because it will have too many distinct values. You can then, in principle, use this numeric variable for analytic purposes, while using the original string variable in data displays. On the other hand, as a practical matter, I have trouble thinking of any way that you can actually make use of a discrete variable that takes on this many values. What is this variable, and what do you hope to do with it?
2 likes
Comment
Hugo Rocha

Join Date: Feb 2021

Posts: 288
#3

16 Sep 2021, 14:28

Hi Clyde,

Thank you for your response. This variable represents in one single column values for value added, number of enterprises, formation of gross fixed capital and number of employees per sector. The way you know which variable each cell refers is by looking to next columns whose numbers and codes indicate the variable and the sector the specific cell refers. Given your answer, I was trying to think about ways of sorting the dataset to make a panel and then destring variable per variable. My final goal is calculate value added per worker and investment

Below you can see the dataex for numeric values

* Example generated by -dataex-. To install: ssc install dataex
clear
input byte ctable int(country year) str2 isic str3 isiccomb str16 value byte(utable sourcecode) str1 unit
1 4 1981 "15" "15" "62" 1 3 "#"
1 4 1981 "16" "16" "0" 1 3 "#"
1 4 1981 "17" "17" "56" 1 3 "#"
1 4 1981 "18" "18A" "7" 1 3 "#"
1 4 1981 "19" "18A" "..." 1 3 "#"
1 4 1981 "20" "20C" "22" 1 3 "#"
1 4 1981 "21" "21" "0" 1 3 "#"
1 4 1981 "22" "22" "18" 1 3 "#"
1 4 1981 "23" "23" "0" 1 3 "#"
1 4 1981 "24" "24" "5" 1 3 "#"
1 4 1981 "25" "25" "32" 1 3 "#"
1 4 1981 "26" "26" "9" 1 3 "#"
1 4 1981 "27" "27" "0" 1 3 "#"
1 4 1981 "28" "28" "0" 1 3 "#"
1 4 1981 "29" "29C" "0" 1 3 "#"
1 4 1981 "30" "29C" "..." 1 3 "#"
1 4 1981 "31" "31A" "0" 1 3 "#"
1 4 1981 "32" "31A" "..." 1 3 "#"
1 4 1981 "33" "33" "0" 1 3 "#"
1 4 1981 "34" "34A" "0" 1 3 "#"
1 4 1981 "35" "34A" "..." 1 3 "#"
1 4 1981 "36" "20C" "..." 1 3 "#"
1 4 1981 "37" "37" "..." 1 0 "#"
1 4 1981 "D" "D" "211" 1 3 "#"
1 4 1982 "15" "15" "69" 1 3 "#"
1 4 1982 "16" "16" "0" 1 3 "#"
1 4 1982 "17" "17" "63" 1 3 "#"
1 4 1982 "18" "18A" "7" 1 3 "#"
1 4 1982 "19" "18A" "..." 1 3 "#"
1 4 1982 "20" "20C" "20" 1 3 "#"
1 4 1982 "21" "21" "0" 1 3 "#"
1 4 1982 "22" "22" "18" 1 3 "#"
1 4 1982 "23" "23" "0" 1 3 "#"
1 4 1982 "24" "24" "5" 1 3 "#"
1 4 1982 "25" "25" "27" 1 3 "#"
1 4 1982 "26" "26" "10" 1 3 "#"
1 4 1982 "27" "27" "0" 1 3 "#"
1 4 1982 "28" "28" "0" 1 3 "#"
1 4 1982 "29" "29C" "0" 1 3 "#"
1 4 1982 "30" "29C" "..." 1 3 "#"
1 4 1982 "31" "31A" "0" 1 3 "#"
1 4 1982 "32" "31A" "..." 1 3 "#"
1 4 1982 "33" "33" "0" 1 3 "#"
1 4 1982 "34" "34A" "0" 1 3 "#"
1 4 1982 "35" "34A" "..." 1 3 "#"
1 4 1982 "36" "20C" "..." 1 3 "#"
1 4 1982 "37" "37" "..." 1 0 "#"
1 4 1982 "D" "D" "219" 1 3 "#"
1 4 1983 "15" "15" "22" 1 3 "#"
1 4 1983 "16" "16" "0" 1 3 "#"
1 4 1983 "17" "17" "73" 1 3 "#"
1 4 1983 "18" "18A" "14" 1 3 "#"
1 4 1983 "19" "18A" "..." 1 3 "#"
1 4 1983 "20" "20C" "34" 1 3 "#"
1 4 1983 "21" "21" "7" 1 3 "#"
1 4 1983 "22" "22" "20" 1 3 "#"
1 4 1983 "23" "23" "0" 1 3 "#"
1 4 1983 "24" "24" "9" 1 3 "#"
1 4 1983 "25" "25" "38" 1 3 "#"
1 4 1983 "26" "26" "2" 1 3 "#"
1 4 1983 "27" "27" "0" 1 3 "#"
1 4 1983 "28" "28" "0" 1 3 "#"
1 4 1983 "29" "29C" "0" 1 3 "#"
1 4 1983 "30" "29C" "..." 1 3 "#"
1 4 1983 "31" "31A" "0" 1 3 "#"
1 4 1983 "32" "31A" "..." 1 3 "#"
1 4 1983 "33" "33" "0" 1 3 "#"
1 4 1983 "34" "34A" "0" 1 3 "#"
1 4 1983 "35" "34A" "..." 1 3 "#"
1 4 1983 "36" "20C" "..." 1 3 "#"
1 4 1983 "37" "37" "..." 1 0 "#"
1 4 1983 "D" "D" "219" 1 3 "#"
1 4 1984 "15" "15" "22" 1 3 "#"
1 4 1984 "16" "16" "0" 1 3 "#"
1 4 1984 "17" "17" "75" 1 3 "#"
1 4 1984 "18" "18A" "15" 1 3 "#"
1 4 1984 "19" "18A" "..." 1 3 "#"
1 4 1984 "20" "20C" "59" 1 3 "#"
1 4 1984 "21" "21" "7" 1 3 "#"
1 4 1984 "22" "22" "20" 1 3 "#"
1 4 1984 "23" "23" "0" 1 3 "#"
1 4 1984 "24" "24" "9" 1 3 "#"
1 4 1984 "25" "25" "38" 1 3 "#"
1 4 1984 "26" "26" "2" 1 3 "#"
1 4 1984 "27" "27" "0" 1 3 "#"
1 4 1984 "28" "28" "0" 1 3 "#"
1 4 1984 "29" "29C" "0" 1 3 "#"
1 4 1984 "30" "29C" "..." 1 3 "#"
1 4 1984 "31" "31A" "0" 1 3 "#"
1 4 1984 "32" "31A" "..." 1 3 "#"
1 4 1984 "33" "33" "0" 1 3 "#"
1 4 1984 "34" "34A" "0" 1 3 "#"
1 4 1984 "35" "34A" "..." 1 3 "#"
1 4 1984 "36" "20C" "..." 1 3 "#"
1 4 1984 "37" "37" "..." 1 0 "#"
1 4 1984 "D" "D" "247" 1 3 "#"
1 4 1985 "15" "15" "26" 1 3 "#"
1 4 1985 "16" "16" "0" 1 3 "#"
1 4 1985 "17" "17" "78" 1 3 "#"
1 4 1985 "18" "18A" "19" 1 3 "#"
end
[/CODE]

Thank you so much,

Hugo
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 29956
#4

16 Sep 2021, 15:01

I think I get the general idea, but not the specifics. Which variable "represents in one single column values for value added, number of enterprises, formation of gross fixed capital and number of employees per sector," and which adjacent variables' "numbers and codes indicate the variable and the sector the specific cell refers." Moreover, how is that information coded. That is, which values correspond to which sector and to which variable--how would you identify which observations of the "in one single column" variable correspond to added value, which to investment, and which to number of workers?
Comment
Hugo Rocha

Join Date: Feb 2021

Posts: 288
#5

16 Sep 2021, 15:48

Yes, I think my description was very poor. The variable that represents in single column the values for value added, number of enterprises etc.. is, for some reason, not showing up on dataex I showed you (this is the string variable). In dataex, on the first column (from left to right) we have table codes (a number that tells us which variable we are talking about. For instance, the number 1 indicates number of enterprises). Then, after, we have a number that codes the country (for instance the number "4" is Afghanistan). Then, we have the year (most countries are from early 1970's until 2019). Then, the 4th column from left to right represents the sector code (for instance, 15 represents the sector for "Food and Beverages"). The 5th and 6th column represent also sectoral codes (under a different standard of industrial classification). 3 and 0 just describe the source for the imputation (either manufacturing survey or OECD database). The last column represent the currency in which value is measured ('#" represents current dollars).
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 29956
#6

16 Sep 2021, 17:07

OK. If I understand you correctly, the variable ctable is the one that tells you what variable that string variable that didn't show up in -dataex- represents the value of. (Although, as I look at your -dataex- example in light of your explanation in #5, the variable called value really looks like it might be the one--so for demonstration purposes I'm going to pretend that it is. You can adapt the code I show by replacing value with the name of the actual variable.) It also appears that apart from that, the variables country year isic and sourcecode uniquely identify observations in the data. If I have this right, the following should be a good start:

Code:

replace value = "" if value == "..." destring value, replace reshape wide value unit, i(country year isic sourcecode) j(ctable)

This will break up the "represents in one single column values for value added, number of enterprises, formation of gross fixed capital and number of employees per sector" string variable into separate variables. You might then want to rename those variables to something of better mnemonic value than value1 value2, etc. Then you can go ahead and work with those variables in whatever way you like--they are no longer piled on top of each other.

Note: The code above works on your example data, though it is not a very good test of the code since in the example shown there is only a single value of ctable represented. But, hopefully, in your real data, it will give you a more usable data layout, from which the calculations you want to do will be pretty straightforward.
1 like
Comment
Hugo Rocha

Join Date: Feb 2021

Posts: 288
#7

18 Sep 2021, 14:53

Thank you, Professor Schechter! Your idea worked!
Comment

Announcement

Encode many observations

Comment

Comment

Comment

Comment

Comment

Comment