
  • #16
    Building on Jared Greathouse's advice, here's a fairly simple way to verify that there is a one-to-one correspondence in both directions between country_encoded and country_string:

    Code:
    by country_encoded (country_string), sort: assert country_string[1] == country_string[_N]
    by country_string (country_encoded), sort: assert country_encoded[1] == country_encoded[_N]
    This will be somewhat time-consuming in a data set of this size because sorting is involved. But it's important to be sure that the data are right: we should not be in a rush to get the wrong results.
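
    As a hedged alternative sketch (not part of the original advice), the same two-way check can be run on the distinct pairs only. It assumes the variable names used above, country_encoded and country_string, and that neither has missing or empty values, since isid rejects missing values unless missok is specified. preserve may itself be slow on a data set this large.

    ```stata
    * Sketch under the assumptions stated above; variable names are taken from this thread.
    preserve
    * Reduce the data to one row per distinct (code, name) pair; contract adds _freq.
    contract country_encoded country_string
    * Errors out if any code appears with more than one name.
    isid country_encoded
    * Errors out if any name appears with more than one code.
    isid country_string
    restore
    ```

    If either isid fails, the contracted data set is small enough to list and inspect before restore brings the full data back.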



    • #17
      #13
      Dear William,

      Thank you for these explanations; things are much clearer now. I know the 200 000 000 observations were encoded by a single command, as this was done to make the original dataset (over 66GB) smaller, along with removing some variables and compressing. This was not done by me, but I believe the country names were just encoded with the numbers 1-140 by using a command such as "encode country_name, generate(country)".

      The good news is that the list I get from the output of "label list" or of "tabulate country" is clean; all of the country names are distinct, there are no typos, and each is uniquely identified by a different number.

      To answer your question about the dataset and Jared Greathouse's point in #15: this huge dataset is built from different records of companies, showing when and in which countries they are/were registered. As I mentioned, there are 140 of these countries. So the unit of observation is the company, but what we are actually interested in is how many companies are registered in a given country over time (date 1 and date 2 are important variables in this sense).

      #14
      Dear Nick,

      Thank you for showing me the extra code to actually install the file.

      This allows me to use the "groups country string_country" you suggested above. Stata has been running this command for over 45 minutes, but this is probably just because of the size of the dataset. I am sure this will be useful to check the names and corresponding codes efficiently.

      #15
      Dear Jared,

      I believe I have answered most of your questions above. But to your general point, I agree that there must be a better way to make a panel of 140 countries showing how many companies are registered in each over time, instead of the way things are organised as shown in my second post in #5. Any suggestions are most welcome!



      Thank you all for all of your pertinent advice, I really appreciate you taking the time to answer my questions.

      Best wishes,
      Clara
