Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Data cleaning - checking correct encoding of variables

    Hello,

    I'm very new to Stata and am trying to complete some data cleaning. I have a dataset with 5 variables and around 200 million observations. The variables are all numeric, and I would like to check that three of them have been encoded correctly, as they were originally categorical (string) variables. For example, I would like to know if the numerical code captures distinct countries for the country variable (there may be typos in the original categories, for instance).

    The original string variables are not available, but Stata shows the country names in browse (the categorical variable), but treats the variable as numeric in the data editor. Is there any way to check what the equivalencies between the two are?

    Thank you in advance for any help you might be able to give me!

    Best wishes,
    Clara

  • #2
    Clara:
    welcome to this forum.
    I'm not totally clear with your post.
    If the country variable were -destring-ed and -label-ed you should see a number for each country and country name reported in blue.
    As an aside, please read and act on the FAQ to post more effectively. Thanks.
    Kind regards,
    Carlo
    (StataNow 18.5)

    Comment


    • #3
      Hey Clara. Not too clear what you're asking, but...... how many countries are we talking about here? There may be some ways to proceed.... but first, you'd need to use dataex so we can inspect some sample data.

      Comment


      • #4
        DIfferent strings will be encoded to different numbers. It's as simple as that. Distinct countries should be encoded distinctly but even there could be problems (e.g. two different Congo Republics might be confused);

        The bigger deal, as you realise, will be different versions of the same country name,

        That includes differences in leading, trailing and embedded spaces, other punctuation (periods, apostrophes and accents, say), differences in use of upper and lower case, longer and shorter versions of the same name, accidental spelling errors, and so on. encode will take all such differences literally; there is no intelligence inside to appreciate that some differences are unimportant to the researcher and mean the same place.

        The original string variables can easily be regained by using decode.

        Cleaning large messy datasets is tough in any software and not something I would ask anyone "very new" to Stata to do. It's also tough call knowing whether to go back to the original string variables -- and clean those up before a fresh encode -- or to work with what you have.

        At worst you have 200-plus countries and variants of each, but at least the size of the problem should be evident by looking at the results of tabulate with your country variable.

        Comment


        • #5
          Dear Carlo,

          Thank you so much for your reply. I unfortunately can't share an example of the dataset as the FAQ suggests, because it is confidential. Here is an example below, I hope it helps you visualise what I mean, this is what I see when I browse the dataset:
          countries date 1 date 2 non-profit type
          USA 20766 . true Unincorporated Nonprofit Association (D)
          France 20467 . false Corporation - Professional - Domestic
          Italy 20660 . false Limited Partnership (D)
          USA 20453 . true Unincorporated Nonprofit Association (D)
          USA 14770 20291 false For-profit Corporation
          China 11983 21886 false Professional Corporation
          Russia . . false Limited Liability Company (D)
          China 12235 17217 false
          Corporation - Business - Foreign
          China . . true Unincorporated Nonprofit Association (D)
          China . . false Corporation - Professional - Domestic
          UK 14770 22410 false Limited Partnership (D)
          Germany 4336 9831 false Corporation - Professional - Domestic
          France 8068 10251 true Corporation - Nonprofit - Domestic
          France -8299 9817 true Corporation - Nonprofit - Domestic
          The country variable is unlabeled and numeric (float), has a range of 1-140 (there are 140 countries), 1 unit, and 199348601observations (0 missing). Each country has a corresponding numeric code, and I would like to check that this code indeed captures distinct country categories. I would also like to check the numeric encoding of the binary non-profit variable and the type variable (also both originally string and now numerical variables). Thank you for your help!

          Best wishes,
          Clara


          Comment


          • #6
            Dear Nick,

            Thank you so much for your reply. This is indeed the unknown unknown I am facing: finding out which typos etc. were taken literally in the encoding to mean categorical differences by Stata.

            As you suggest, I used the tabulate command on the country variable, which as you know yields a list of the values with the corresponding frequencies, percentage and cumulative percentages. From this, how do I check that the values that are shown in this table have been adequately encoded, as we have discussed?

            Thank you,
            Clara

            Comment


            • #7
              I often deal with cross-country data, and the good news is that there has been some previous work to deal with this specific problem. The following uses kountry from SSC and illustrates an approach.

              Code:
              * Example generated by -dataex-. For more info, type help dataex
              clear
              input long country
               1
               2
               3
               5
               6
               7
               8
               9
              12
              15
              16
              11
              14
              13
              10
               4
              end
              label values country Country
              label def Country 1 "China", modify
              label def Country 2 "France", modify
              label def Country 3 "Germany", modify
              label def Country 4 "Holland", modify
              label def Country 5 "Hungary", modify
              label def Country 6 "Italy", modify
              label def Country 7 "Japan", modify
              label def Country 8 "Netherlands", modify
              label def Country 9 "Russia", modify
              label def Country 10 "Russian Federation", modify
              label def Country 11 "S. Korea", modify
              label def Country 12 "South Korea", modify
              label def Country 13 "UK", modify
              label def Country 14 "USA", modify
              label def Country 15 "United Kingdom", modify
              label def Country 16 "United States", modify
              
              decode country, gen(countryname)
              *ssc install kountry
              kountry countryname, from(other)
              sort NAMES_STD
              l, sepby(NAMES_STD)
              Res.:

              Code:
              . l, sepby(NAMES_STD)
              
                   +----------------------------------------------------------+
                   |            country          countryname        NAMES_STD |
                   |----------------------------------------------------------|
                1. |              China                China            China |
                   |----------------------------------------------------------|
                2. |             France               France           France |
                   |----------------------------------------------------------|
                3. |            Germany              Germany          Germany |
                   |----------------------------------------------------------|
                4. |            Hungary              Hungary          Hungary |
                   |----------------------------------------------------------|
                5. |              Italy                Italy            Italy |
                   |----------------------------------------------------------|
                6. |              Japan                Japan            Japan |
                   |----------------------------------------------------------|
                7. |            Holland              Holland      Netherlands |
                8. |        Netherlands          Netherlands      Netherlands |
                   |----------------------------------------------------------|
                9. | Russian Federation   Russian Federation           Russia |
               10. |             Russia               Russia           Russia |
                   |----------------------------------------------------------|
               11. |        South Korea          South Korea      South Korea |
               12. |           S. Korea             S. Korea      South Korea |
                   |----------------------------------------------------------|
               13. |                 UK                   UK   United Kingdom |
               14. |     United Kingdom       United Kingdom   United Kingdom |
                   |----------------------------------------------------------|
               15. |                USA                  USA    United States |
               16. |      United States        United States    United States |
                   +----------------------------------------------------------+

              Comment


              • #8
                Dear Nick,

                Further to your other suggestion, I have also decoded the original variable: tostring country, generate(string_country) .

                I then try to get a table to compare the encoded and decoded variables: tab country string_country, but receive the following error message: r(134) too many values.

                I know that on R I would try to use bigquery, is there an equivalent I could use?

                Thank you for your help.

                Best,
                Clara

                Comment


                • #9
                  Let me address the very specific question of understanding what country name is associated with each country number.

                  Your country variable has associated with it a value label, and that value label is what causes a given numeric value to be displayed as a particular country name.

                  Here is an example of displaying the value labels.
                  Code:
                  . sysuse auto, clear
                  (1978 automobile data)
                  
                  . describe foreign
                  
                  Variable      Storage   Display    Value
                      name         type    format    label      Variable label
                  ------------------------------------------------------------------------------------------------
                  foreign         byte    %8.0g      origin     Car origin
                  
                  . label list origin
                  origin:
                             0 Domestic
                             1 Foreign
                  
                  .
                  Note that the variable foreign has the value label named origin, and when we list that value label we see that the value 0 is displayed as "Domestc" and 1 as "Foreign".

                  Also, a second approach. You tell us you ran
                  Code:
                  tabulate country
                  so now run
                  Code:
                  tabulate country, nolabel
                  to see exactly the same table but with the numeric values instead of the country names.

                  Also, a third approach.
                  Code:
                  generate N = 1
                  generate countrynum = country
                  collapse (sum) N, by(country countrynum)
                  list, clean

                  Comment


                  • #10
                    In addition to several other helpful answers I suggest

                    Code:
                    groups country string_country
                    where groups is as described in https://www.statalist.org/forums/for...updated-on-ssc but is best downloaded using

                    Code:
                    net sj 18-1 st0496_1
                    Code:
                    tab country string_country
                    is not even in principle a good idea as most of its (thousands of?) cells would be empty.

                    EDIT: This is similar in spirit but less destructive than @WIlliam Lisowki's collapse suggestion.
                    Last edited by Nick Cox; 29 Dec 2021, 09:02.

                    Comment


                    • #11
                      Dear William,

                      Thank you kindly for all of your advice, it has been very helpful.

                      Using the three approaches you suggest, I am able to yield a list of the country names and their equivalent codes.

                      However, I am not sure how this allows me to check that each country observation has been assigned the correct code during the encoding?

                      Best wishes,
                      Clara

                      Comment


                      • #12
                        Dear Nick,

                        I feel like these suggestions might allow me to check that each country observation has been assigned the correct code during the encoding.

                        Although I am able to instal the groups files using the code you kindly provide, when I try to use the corresponding command, I get the following error message: command groups is unrecognised r(199). Why might this be?

                        Without this, the "tab country string_country" command returns the "too many values r(134)" error I mentioned previously.

                        Thank you for taking the time to help me with this.

                        Best wishes,
                        Clara

                        Comment


                        • #13
                          Suppose you started with a string variable named country_name and you encoded the country_name into the numeric country variable using
                          Code:
                          encode country_name, generate(country)
                          Then every numeric country corresponds to the same country name: country number 42 will not mean "Italy" in one observation and "Spain" in another. There is no issue of not getting the correct code assigned.

                          The exception to this is that your 200,000,000 observations must all have been encoded by a single command. If they were encoded in different datasets that were later combined into a single dataset, then you might well have problems.

                          Also, the same country may have multiple names, so "USA" will be assigned a different country number than "U.S.A.". So if you see "USA" and "U.S.A." in the output of "label list" or of "tabulate country" you know you have a problem.

                          The basic problem in advising you is that you have told us next to nothing about how your 200,000,000 observation dataset was created, or about how the numeric codes were created from the country names, Your question really isn't clear without more detail, or at a minimum it is too difficult to guess at a good answer from what you have shared. Please help us help you. The Statalist FAQ provides advice on effectively posing your questions, posting data, and sharing Stata output.


                          Comment


                          • #14
                            #11

                            Code:
                            net install st0496_1
                            is essential after


                            Code:
                            net sj 18-1 st0496_1
                            Typing the command just above yields that message.

                            Comment


                            • #15
                              Everything else everyone has said is fine. I guess the only thing i have to add is that the reason I asked about how many nations we have is because it *might* help us identify plausible solutions, which is why I asked for a data example.

                              Once we see if there are any spelling errors, as is possibly the case, we should correct them best we can. Once that's done, from a very practical perspective, I would advise the use of the assert command.

                              So let's say we have 100 nations. If "United States of America" has the value label of 1, what I would do is a unit test that goes something like

                              Code:
                              as country_encoded==1 if country_string=="United States of America"
                              I would likely loop this or find some equivalent way to check this.

                              What's more, and it's sort of unrelated, but however did we get to this point to begin with? I ask because in my experience, you're only dealing with north of 100 million observations when you're dealing with HUGE datasets like weather data or other gridded, fine datasets.

                              I mention this because surely there must be a better way to make a panel of 200 countries for whatever number of time periods you have. Put a little differently, there's likely a solution to this problem, but it's a problem I wouldn't want to think of dealing with unless I had to. How was your dataset constructed? What question are we investigating, how was the raw data compiled? Surely there's a better, more efficient way to address this question at the root, no, given that we're dealing with nation-states? clara gisoldo

                              Comment

                              Working...
                              X