Data cleaning - checking correct encoding of variables

Clyde Schechter

Join Date: Apr 2014

Posts: 29799
#16

29 Dec 2021, 11:37

Building on Jared Greathouse's advice, here's a fairly simple way to verify that there is a one-to-one correspondence in both directions between country_encoded and country_string:

Code:

by country_encoded (country_string), sort: assert country_string[1] == country_string[_N]
by country_string (country_encoded), sort: assert country_encoded[1] == country_encoded[_N]

This will be somewhat time-consuming in a data set of this size because sorting is involved. But it's important to be sure that the data is right; we should not be in a rush to get the wrong results.
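The same one-to-one check can also be phrased in terms of the distinct (code, name) pairs. What follows is only a minimal sketch on made-up toy data, not something run on the real file: contract reduces the data to one row per pair, and isid then stops with an error if either variable fails to identify those rows uniquely. The variable names follow the code above; the country values are invented for illustration.

```stata
* Toy data standing in for the real file (values invented for illustration)
clear
input byte country_encoded str10 country_string
1 "France"
1 "France"
2 "Germany"
3 "Italy"
3 "Italy"
end

* Keep one row per distinct (code, name) pair; _freq records how often each occurs
contract country_encoded country_string

* Each command exits with an error if the variable does not uniquely identify the rows,
* i.e. if one code maps to two names or one name maps to two codes
isid country_encoded
isid country_string
```

On the real data, wrapping the contract and isid lines in preserve and restore would bring the full data set back into memory afterwards.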
clara gisoldo

Join Date: Nov 2021

Posts: 10
#17

30 Dec 2021, 04:05

#13
Dear William,

Thank you for these explanations, things are much clearer now. I know the 200 000 000 observations were encoded by a single command, as this was done to make the original dataset (over 66GB) smaller, along with removing some variables and compressing. This was not done by me, but I believe the country names were simply encoded with the numbers 1-140 using a command such as "encode country_name, generate(country)".

The good news is that the list I get from the output of "label list" or of "tabulate country" is clean; all of the country names are distinct, there are no typos, and each is uniquely identified by a different number.
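If the original string variable is still in the data, the label listing can be complemented by a round-trip comparison. The sketch below uses made-up toy data and assumes the names country (encoded) and string_country (original text); country_decoded is a hypothetical helper variable. It encodes the names the way guessed above, lists the mapping, then decodes and compares observation by observation.

```stata
* Toy data standing in for the real file (values invented for illustration)
clear
input str10 string_country
"France"
"Germany"
"France"
"Italy"
end

* Encode as the original was presumably built; encode assigns codes in alphabetical order
encode string_country, generate(country)

* Show the code-to-name mapping stored in the value label (named country by default)
label list country

* Round trip: turn the codes back into text and compare with the original string
decode country, generate(country_decoded)
assert country_decoded == string_country
```

Note that decode creates a new string variable, so on 200 000 000 observations this needs a fair amount of extra memory; the helper variable can be dropped once the assert has passed.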

To answer your question about the dataset and Jared Greathouse's point in #15: this huge dataset is built from different records of companies, showing when and in which countries they are/were registered. As I mentioned, there are 140 of these countries. So, the unit of observation is the company, but what we are actually interested in is how many companies are registered in a given country over time (date 1 and date 2 are important variables in this sense).

#14
Dear Nick,

Thank you for showing me the extra code to actually install the file.

This allows me to use the "groups country string_country" you suggested above. Stata has been running this command for over 45 minutes, but this is probably just because of the size of the dataset. I am sure this will be useful to check the names and corresponding codes efficiently.

#15
Dear Jared,

I believe I have answered most of your questions above. But to your general point, I agree that there must be a better way to make a panel of 140 countries showing how many companies are registered in each over time, instead of the way things are organised as shown in my second post in #5. Any suggestions are most welcome!

Thank you all for all of your pertinent advice, I really appreciate you taking the time to answer my questions.

Best wishes,
Clara
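
One possible starting point for the country panel asked about above, offered only as a sketch under stated assumptions: it supposes that date1 is the registration date and date2 the deregistration date (missing while a company is still registered), both stored as Stata daily dates, and that a yearly panel is wanted. None of this is confirmed in the thread. The toy data, the names d1, d2, entries, exits and registered, and the tempfile name are all hypothetical.

```stata
* Toy company records (invented): one row per company registration
clear
input byte country str9 d1 str9 d2
1 "01jan2015" "30jun2017"
1 "15mar2016" ""
2 "10oct2015" "05may2016"
2 "20feb2017" ""
end
gen date1 = daily(d1, "DMY")
gen date2 = daily(d2, "DMY")
format date1 date2 %td
drop if missing(date1)

* Count registrations (entries) per country and year, and set them aside
preserve
gen int year = year(date1)
collapse (count) entries = date1, by(country year)
tempfile entry_counts
save `entry_counts'
restore

* Count deregistrations (exits) per country and year
* (this replaces the data in memory; use preserve first on the real file)
drop if missing(date2)
gen int year = year(date2)
collapse (count) exits = date2, by(country year)

* Combine both counts and treat absent country-years as zero
merge 1:1 country year using `entry_counts', nogenerate
replace entries = 0 if missing(entries)
replace exits = 0 if missing(exits)

* Balance the panel so every country has a row for every year
xtset country year
tsfill, full
replace entries = 0 if missing(entries)
replace exits = 0 if missing(exits)

* Running total: companies registered at the end of each year
bysort country (year): gen long registered = sum(entries - exits)
list, sepby(country)
```

Collapsing first means the running total is computed on at most 140 countries times the number of years, rather than on the company-level records.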