Data cleaning - checking correct encoding of variables

clara gisoldo

Join Date: Nov 2021

Posts: 10
#1

Data cleaning - checking correct encoding of variables

29 Dec 2021, 05:21

Hello,

I'm very new to Stata and am trying to complete some data cleaning. I have a dataset with 5 variables and around 200 million observations. The variables are all numeric, and I would like to check that three of them have been encoded correctly, as they were originally categorical (string) variables. For example, I would like to know if the numerical code captures distinct countries for the country variable (there may be typos in the original categories, for instance).

The original string variables are not available, but Stata shows the country names in browse (the categorical variable), but treats the variable as numeric in the data editor. Is there any way to check what the equivalencies between the two are?

Thank you in advance for any help you might be able to give me!

Best wishes,
Clara
Tags: None
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17707
#2

29 Dec 2021, 05:30

Clara:
welcome to this forum.
I'm not totally clear with your post.
If the country variable were -destring-ed and -label-ed you should see a number for each country and country name reported in blue.
As an aside, please read and act on the FAQ to post more effectively. Thanks.

Kind regards,
Carlo
(Stata 19.0)
Comment
Jared Greathouse

Join Date: Sep 2021

Posts: 2170
#3

29 Dec 2021, 05:53

Hey Clara. Not too clear what you're asking, but...... how many countries are we talking about here? There may be some ways to proceed.... but first, you'd need to use dataex so we can inspect some sample data.
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35696
#4

29 Dec 2021, 06:05

DIfferent strings will be encoded to different numbers. It's as simple as that. Distinct countries should be encoded distinctly but even there could be problems (e.g. two different Congo Republics might be confused);

The bigger deal, as you realise, will be different versions of the same country name,

That includes differences in leading, trailing and embedded spaces, other punctuation (periods, apostrophes and accents, say), differences in use of upper and lower case, longer and shorter versions of the same name, accidental spelling errors, and so on. encode will take all such differences literally; there is no intelligence inside to appreciate that some differences are unimportant to the researcher and mean the same place.

The original string variables can easily be regained by using decode.

Cleaning large messy datasets is tough in any software and not something I would ask anyone "very new" to Stata to do. It's also tough call knowing whether to go back to the original string variables -- and clean those up before a fresh encode -- or to work with what you have.

At worst you have 200-plus countries and variants of each, but at least the size of the problem should be evident by looking at the results of tabulate with your country variable.
Comment

clara gisoldo

Join Date: Nov 2021
Posts: 10

29 Dec 2021, 06:29

Dear Carlo,

Thank you so much for your reply. I unfortunately can't share an example of the dataset as the FAQ suggests, because it is confidential. Here is an example below, I hope it helps you visualise what I mean, this is what I see when I browse the dataset:

countries

date 1

date 2

non-profit

type

USA

20766

true

Unincorporated Nonprofit Association (D)

France

20467

false

Corporation - Professional - Domestic

Italy

20660

false

Limited Partnership (D)

USA

20453

true

Unincorporated Nonprofit Association (D)

USA

14770

20291

false

For-profit Corporation

China

11983

21886

false

Professional Corporation

Russia

false

Limited Liability Company (D)

China

12235

17217

false

Corporation - Business - Foreign

China

true

Unincorporated Nonprofit Association (D)

China

false

Corporation - Professional - Domestic

14770

22410

false

Limited Partnership (D)

Germany

4336

9831

false

Corporation - Professional - Domestic

France

8068

10251

true

Corporation - Nonprofit - Domestic

France

-8299

9817

true

Corporation - Nonprofit - Domestic

The country variable is unlabeled and numeric (float), has a range of 1-140 (there are 140 countries), 1 unit, and 199348601observations (0 missing). Each country has a corresponding numeric code, and I would like to check that this code indeed captures distinct country categories. I would also like to check the numeric encoding of the binary non-profit variable and the type variable (also both originally string and now numerical variables). Thank you for your help!

Best wishes,
Clara

Comment

clara gisoldo

Join Date: Nov 2021

Posts: 10
#6

29 Dec 2021, 06:47

Dear Nick,

Thank you so much for your reply. This is indeed the unknown unknown I am facing: finding out which typos etc. were taken literally in the encoding to mean categorical differences by Stata.

As you suggest, I used the tabulate command on the country variable, which as you know yields a list of the values with the corresponding frequencies, percentage and cumulative percentages. From this, how do I check that the values that are shown in this table have been adequately encoded, as we have discussed?

Thank you,
Clara
Comment

Andrew Musau

Join Date: Oct 2014
Posts: 10190

29 Dec 2021, 06:51

I often deal with cross-country data, and the good news is that there has been some previous work to deal with this specific problem. The following uses kountry from SSC and illustrates an approach.

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input long country
 1
 2
 3
 5
 6
 7
 8
 9
12
15
16
11
14
13
10
 4
end
label values country Country
label def Country 1 "China", modify
label def Country 2 "France", modify
label def Country 3 "Germany", modify
label def Country 4 "Holland", modify
label def Country 5 "Hungary", modify
label def Country 6 "Italy", modify
label def Country 7 "Japan", modify
label def Country 8 "Netherlands", modify
label def Country 9 "Russia", modify
label def Country 10 "Russian Federation", modify
label def Country 11 "S. Korea", modify
label def Country 12 "South Korea", modify
label def Country 13 "UK", modify
label def Country 14 "USA", modify
label def Country 15 "United Kingdom", modify
label def Country 16 "United States", modify

decode country, gen(countryname)
*ssc install kountry
kountry countryname, from(other)
sort NAMES_STD
l, sepby(NAMES_STD)

Res.:

Code:

. l, sepby(NAMES_STD)

     +----------------------------------------------------------+
     |            country          countryname        NAMES_STD |
     |----------------------------------------------------------|
  1. |              China                China            China |
     |----------------------------------------------------------|
  2. |             France               France           France |
     |----------------------------------------------------------|
  3. |            Germany              Germany          Germany |
     |----------------------------------------------------------|
  4. |            Hungary              Hungary          Hungary |
     |----------------------------------------------------------|
  5. |              Italy                Italy            Italy |
     |----------------------------------------------------------|
  6. |              Japan                Japan            Japan |
     |----------------------------------------------------------|
  7. |            Holland              Holland      Netherlands |
  8. |        Netherlands          Netherlands      Netherlands |
     |----------------------------------------------------------|
  9. | Russian Federation   Russian Federation           Russia |
 10. |             Russia               Russia           Russia |
     |----------------------------------------------------------|
 11. |        South Korea          South Korea      South Korea |
 12. |           S. Korea             S. Korea      South Korea |
     |----------------------------------------------------------|
 13. |                 UK                   UK   United Kingdom |
 14. |     United Kingdom       United Kingdom   United Kingdom |
     |----------------------------------------------------------|
 15. |                USA                  USA    United States |
 16. |      United States        United States    United States |
     +----------------------------------------------------------+

Comment

clara gisoldo

Join Date: Nov 2021

Posts: 10
#8

29 Dec 2021, 07:13

Dear Nick,

Further to your other suggestion, I have also decoded the original variable: tostring country, generate(string_country) .

I then try to get a table to compare the encoded and decoded variables: tab country string_country, but receive the following error message: r(134) too many values.

I know that on R I would try to use bigquery, is there an equivalent I could use?

Thank you for your help.

Best,
Clara
Comment
William Lisowski

Join Date: Dec 2014

Posts: 10150
#9

29 Dec 2021, 07:31

Let me address the very specific question of understanding what country name is associated with each country number.

Your country variable has associated with it a value label, and that value label is what causes a given numeric value to be displayed as a particular country name.

Here is an example of displaying the value labels.

Code:

. sysuse auto, clear (1978 automobile data) . describe foreign Variable Storage Display Value name type format label Variable label ------------------------------------------------------------------------------------------------ foreign byte %8.0g origin Car origin . label list origin origin: 0 Domestic 1 Foreign .

Note that the variable foreign has the value label named origin, and when we list that value label we see that the value 0 is displayed as "Domestc" and 1 as "Foreign".

Also, a second approach. You tell us you ran

Code:

tabulate country

so now run

Code:

tabulate country, nolabel

to see exactly the same table but with the numeric values instead of the country names.

Also, a third approach.

Code:

generate N = 1 generate countrynum = country collapse (sum) N, by(country countrynum) list, clean
1 like
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35696
#10

29 Dec 2021, 07:49

In addition to several other helpful answers I suggest

Code:

groups country string_country

where groups is as described in https://www.statalist.org/forums/for...updated-on-ssc but is best downloaded using

Code:

net sj 18-1 st0496_1

Code:

tab country string_country

is not even in principle a good idea as most of its (thousands of?) cells would be empty.

EDIT: This is similar in spirit but less destructive than @WIlliam Lisowki's collapse suggestion.

Last edited by Nick Cox; 29 Dec 2021, 08:02.
1 like
Comment
clara gisoldo

Join Date: Nov 2021

Posts: 10
#11

29 Dec 2021, 08:46

Dear William,

Thank you kindly for all of your advice, it has been very helpful.

Using the three approaches you suggest, I am able to yield a list of the country names and their equivalent codes.

However, I am not sure how this allows me to check that each country observation has been assigned the correct code during the encoding?

Best wishes,
Clara
Comment
clara gisoldo

Join Date: Nov 2021

Posts: 10
#12

29 Dec 2021, 08:52

Dear Nick,

I feel like these suggestions might allow me to check that each country observation has been assigned the correct code during the encoding.

Although I am able to instal the groups files using the code you kindly provide, when I try to use the corresponding command, I get the following error message: command groups is unrecognised r(199). Why might this be?

Without this, the "tab country string_country" command returns the "too many values r(134)" error I mentioned previously.

Thank you for taking the time to help me with this.

Best wishes,
Clara
Comment
William Lisowski

Join Date: Dec 2014

Posts: 10150
#13

29 Dec 2021, 09:06

Suppose you started with a string variable named country_name and you encoded the country_name into the numeric country variable using

Code:

encode country_name, generate(country)

Then every numeric country corresponds to the same country name: country number 42 will not mean "Italy" in one observation and "Spain" in another. There is no issue of not getting the correct code assigned.

The exception to this is that your 200,000,000 observations must all have been encoded by a single command. If they were encoded in different datasets that were later combined into a single dataset, then you might well have problems.

Also, the same country may have multiple names, so "USA" will be assigned a different country number than "U.S.A.". So if you see "USA" and "U.S.A." in the output of "label list" or of "tabulate country" you know you have a problem.

The basic problem in advising you is that you have told us next to nothing about how your 200,000,000 observation dataset was created, or about how the numeric codes were created from the country names, Your question really isn't clear without more detail, or at a minimum it is too difficult to guess at a good answer from what you have shared. Please help us help you. The Statalist FAQ provides advice on effectively posing your questions, posting data, and sharing Stata output.
3 likes
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35696
#14

29 Dec 2021, 10:10

#11

Code:

net install st0496_1

is essential after

Code:

net sj 18-1 st0496_1

Typing the command just above yields that message.
Comment
Jared Greathouse

Join Date: Sep 2021

Posts: 2170
#15

29 Dec 2021, 10:30

Everything else everyone has said is fine. I guess the only thing i have to add is that the reason I asked about how many nations we have is because it *might* help us identify plausible solutions, which is why I asked for a data example.

Once we see if there are any spelling errors, as is possibly the case, we should correct them best we can. Once that's done, from a very practical perspective, I would advise the use of the assert command.

So let's say we have 100 nations. If "United States of America" has the value label of 1, what I would do is a unit test that goes something like

Code:

as country_encoded==1 if country_string=="United States of America"

I would likely loop this or find some equivalent way to check this.

What's more, and it's sort of unrelated, but however did we get to this point to begin with? I ask because in my experience, you're only dealing with north of 100 million observations when you're dealing with HUGE datasets like weather data or other gridded, fine datasets.

I mention this because surely there must be a better way to make a panel of 200 countries for whatever number of time periods you have. Put a little differently, there's likely a solution to this problem, but it's a problem I wouldn't want to think of dealing with unless I had to. How was your dataset constructed? What question are we investigating, how was the raw data compiled? Surely there's a better, more efficient way to address this question at the root, no, given that we're dealing with nation-states? clara gisoldo
1 like
Comment

Announcement