Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Unable to convert string ID variable to numeric using encode/destring

    Hi statalist,

    I'm trying to do a fuzzy matching using matchit. It turns out, the ID variables cannot be string. I have tried encoding the ID and it gives error that too many values.

    Code:
    encode id,gen(num_id)
    too many values
    I have tried destring which also gives error
    Code:
    destring id,gen(num_id)
    id: contains nonnumeric characters; no generate
    Here is the sample data for your use:

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str19 id str23 location
    "905016" "GUNTUR"
    "201416" "ANANTAPUR"
    "33751" "VIJAYAWADA"
    "641110" "Visakhapatnam"
    "783058" "BHIMAVARAM"
    "851102" "RAJAHMUNDRY"
    "498126" "HYDERABAD"
    "968955" "SRIKAKULAM"
    "591147" "Vizianagaram"
    "501199" "NELLORE"
    "747538" "KARIMNAGAR"
    "210431" "HYDERABAD"
    "878642" "RAJAHMUNDRY"
    "882513" "NELLORE"
    "279911" "GUNTUR"
    "86983" "Guntur"
    "146273" "VIJAYAWADA"
    "804336" "WARANGAL"
    "643155" "Ongole"
    "803403" "TIRUPATHI"
    "1136108" "VIJAYAWADA"
    "136532" "Hyderabad"
    "1232574" "HYDERABAD"
    "617628" "HYDERABAD"
    "371021" "Vijayawada"
    "888357" "HYDERABAD"
    "358401" "HYDERABAD"
    "245297" "NELLORE"
    "226732" "Hyderabad"
    "280930" "RAJAHMUNDRY"
    "704674" "VIJAYAWADA"
    "200807" "HYDERABAD"
    "540789" "TIRUPATHI"
    "413309" "Hyderabad"
    "67695" "KARIMNAGAR"
    "784241" "WARANGAL"
    "549517" "HYDERABAD"
    "1276446" "KURNOOL"
    "280262" "Guntur"
    "365927" "VISAKHAPATNAM"
    "675412" "HYDERABAD"
    "446856" "KARIMNAGAR"
    "69962" "HYDERABAD"
    "450854" "HYDERABAD"
    "730537" "Hyderabad"
    "567942" "HYDERABAD"
    "492382" "Visakhapatnam"
    "932050" "VIJAYAWADA"
    "126069" "KURNOOL"
    "1183824" "Guntur"
    "267487" "GUNTUR"
    "182371" "TIRUPATHI"
    "366956" "HYDERABAD"
    "488738" "GUNTUR"
    "459707" "HYDERABAD"
    "867766" "HYDERABAD"
    "825337" "VIJAYAWADA"
    "118670" "Hyderabad"
    "193624" "VIJAYAWADA"
    "118467" "Eluru"
    "661288" "Visakhapatnam"
    "974051" "Karimnagar"
    "25926" "Rajahmundry"
    "1134304" "HYDERABAD"
    "866420" "Kurnool"
    "294048" "KURNOOL"
    "736503" "HYDERABAD"
    "884434" "VISAKHAPATNAM"
    "433826" "HYDERABAD"
    "829731" "VIJAYAWADA"
    "779076" "HYDERABAD"
    "12630" "Vijayawada"
    "822021" "Hyderabad"
    "925622" "Hyderabad"
    "481261" "SANGAREDDY"
    "499392" "HYDERABAD"
    "531659" "WARANGAL"
    "178306" "HYDERABAD"
    "774302" "HYDERABAD"
    "201884" "Vijayawada"
    "557415" "HYDERABAD"
    "171842" "VIZIANAGARAM"
    "711075" "Anantapur"
    "48403" "HYDERABAD"
    "858462" "HYDERABAD"
    "398172" "VIJAYAWADA"
    "876915" "HYDERABAD"
    "219755" "Kakinada"
    "397756" "HYDERABAD"
    "544889" "GUNTUR"
    "1011093" "WARANGAL"
    "318906" "Hyderabad"
    "252161" "KARIMNAGAR"
    "940840" "Vijayawada"
    "797493" "RAJAHMUNDRY"
    "927880" "HYDERABAD"
    "273490" "Guntur"
    "630532" "VIJAYAWADA"
    "399764" "HYDERABAD"
    "516341" "NELLORE"
    end
    The non numeric characters dont appear in the sample data unfortunately. Could anyone suggest how to resolve this problem? I have 1,135,764 obs in my data.

    Thanks,
    Last edited by Titir Bhattacharya; 25 Apr 2022, 08:12.

  • #2
    Titir:
    works for me:
    Code:
    . destring id,gen(num_id)
    id: all characters numeric; num_id generated as long
    
    . list in 1/10
    
         +---------------------------------+
         |     id   num_id        location |
         |---------------------------------|
      1. | 905016   905016          GUNTUR |
      2. | 201416   201416       ANANTAPUR |
      3. |  33751    33751      VIJAYAWADA |
      4. | 641110   641110   Visakhapatnam |
      5. | 783058   783058      BHIMAVARAM |
         |---------------------------------|
      6. | 851102   851102     RAJAHMUNDRY |
      7. | 498126   498126       HYDERABAD |
      8. | 968955   968955      SRIKAKULAM |
      9. | 591147   591147    Vizianagaram |
     10. | 501199   501199         NELLORE |
         +---------------------------------+
    
    .
    Before -desting-ing, make sure that your -string- does not contain leading and/or trailing blanks:
    Code:
    gen wanted=strtrim(id)
    Kind regards,
    Carlo
    (StataNow 18.5)

    Comment


    • #3
      The issue here for -encode- is that it only permits 65,536 values (-help limits-). The failure of -destring()- in Titir's actual data but not the example here might also be due to the presence of invisible nonprinting characters in some observations of the actual file, which might be detected with
      Code:
      browse id if real(id) == .
      Another possibility is the presence of a completely blank observation, which can commonly (and almost invisibly) appear at the end of a data file that comes from a spreadsheet.

      Comment


      • #4
        Code:
        gen num_id=real(id)

        Comment


        • #5
          Be careful! -gen num_id = real(id)- may have run without any error messages. But it may also have seriously messed up your data.

          Earlier, you tried to convert id to numeric using -destring-, but you got an error message "id: contains nonnumeric characters; no generate." Unless you have subsequently done something to eliminate the values of id that contain non-numeric characters, they are still there. What -gen num_id = real(id)- does with those is set num_id = missing value (.). It does not give you any warning or messages about that. But this could be a very serious problem. For example if you have two different entities with ids 51634A and 51634B, they will have both been assigned num_id = missing value. In other words, two different entities have the same value of num_id. Everything from there will be wrong.

          The advise Mike Lacy gave you in #3 is correct. You should do that. Don't just make the error messages go away--solve the underlying problem! It seems likely that whatever the observations with non-numeric characters in id are, they are probably errors. Fix them if possible, eliminate them if not.

          If after inspecting the id's that have non-numeric characters you determine that, in fact, they are not errors--you just had a wrong expectation about what a valid id can be--then you can create a good numeric identifier with:
          Code:
          egen long num_id = group(id)
          But, again, don't do that until you have identified why those non-numeric values of id are there and dealt with them appropriately. Don't sweep bad data under the rug.

          Comment


          • #6
            Originally posted by Carlo Lazzaro View Post
            Titir:
            works for me:
            Code:
            . destring id,gen(num_id)
            id: all characters numeric; num_id generated as long
            
            . list in 1/10
            
            +---------------------------------+
            | id num_id location |
            |---------------------------------|
            1. | 905016 905016 GUNTUR |
            2. | 201416 201416 ANANTAPUR |
            3. | 33751 33751 VIJAYAWADA |
            4. | 641110 641110 Visakhapatnam |
            5. | 783058 783058 BHIMAVARAM |
            |---------------------------------|
            6. | 851102 851102 RAJAHMUNDRY |
            7. | 498126 498126 HYDERABAD |
            8. | 968955 968955 SRIKAKULAM |
            9. | 591147 591147 Vizianagaram |
            10. | 501199 501199 NELLORE |
            +---------------------------------+
            
            .
            Before -desting-ing, make sure that your -string- does not contain leading and/or trailing blanks:
            Code:
            gen wanted=strtrim(id)
            Thanks Carlo. However the problem is in my actual data, there are non numeric components to some ids which is causing problem.

            Comment


            • #7
              Originally posted by Mike Lacy View Post
              The issue here for -encode- is that it only permits 65,536 values (-help limits-). The failure of -destring()- in Titir's actual data but not the example here might also be due to the presence of invisible nonprinting characters in some observations of the actual file, which might be detected with
              Code:
              browse id if real(id) == .
              Another possibility is the presence of a completely blank observation, which can commonly (and almost invisibly) appear at the end of a data file that comes from a spreadsheet.
              Thanks Mike. Indeed destring isnt working in my actual data as there are non numeric components in some ids. Could you suggest a workaround?

              Comment


              • #8
                Originally posted by Clyde Schechter View Post
                Be careful! -gen num_id = real(id)- may have run without any error messages. But it may also have seriously messed up your data.

                Earlier, you tried to convert id to numeric using -destring-, but you got an error message "id: contains nonnumeric characters; no generate." Unless you have subsequently done something to eliminate the values of id that contain non-numeric characters, they are still there. What -gen num_id = real(id)- does with those is set num_id = missing value (.). It does not give you any warning or messages about that. But this could be a very serious problem. For example if you have two different entities with ids 51634A and 51634B, they will have both been assigned num_id = missing value. In other words, two different entities have the same value of num_id. Everything from there will be wrong.

                The advise Mike Lacy gave you in #3 is correct. You should do that. Don't just make the error messages go away--solve the underlying problem! It seems likely that whatever the observations with non-numeric characters in id are, they are probably errors. Fix them if possible, eliminate them if not.

                If after inspecting the id's that have non-numeric characters you determine that, in fact, they are not errors--you just had a wrong expectation about what a valid id can be--then you can create a good numeric identifier with:
                Code:
                egen long num_id = group(id)
                But, again, don't do that until you have identified why those non-numeric values of id are there and dealt with them appropriately. Don't sweep bad data under the rug.
                Thanks Clyde. Actually the non numeric components are not due to errors. At some point in my repeated cross section data, the ID convention changed and the admin started to include non numeric characters. It was an administrative decision. It wasnt causing any problem earlier due to nature of my analysis but now it has led me to big trouble. Could you suggest any workaround?

                Comment


                • #9
                  Did you try the following suggested workaround given in Clyde's post?
                  If after inspecting the id's that have non-numeric characters you determine that, in fact, they are not errors--you just had a wrong expectation about what a valid id can be--then you can create a good numeric identifier with:
                  Code:
                  egen long num_id = group(id)

                  Comment


                  • #10
                    Given that the non-numeric identifiers are, indeed, correct, the code in #5 is your workaround; it will create a parallel numeric identifier. You will not be able to put value labels on it because, as -encode- told you, there are too many distinct values to fit in a value label. So you will probably need to carry along both id and num_id in your data, because Stata will want you to use num_id in -xtset- and other commands, but sometimes you may need to refer to id so that you know what entities your observations refer to in a recognizable way.

                    Added: Crossed with #9.

                    Comment


                    • #11
                      Dear Cylde,

                      Thank you for #5. I will be careful about this. But, sometimes I import my data from Excel documents, and STATA automatically gets my continuous variables as a string. Then, I use -gen newvar=real(oldvar)- to handle it easily. These variables are GDP or other World Bank or OECD data.

                      Best

                      Comment


                      • #12
                        But, sometimes I import my data from Excel documents, and STATA automatically gets my continuous variables as a string. Then, I use -gen newvar=real(oldvar)- to handle it easily.
                        Well, I wouldn't do that if I were you. When I'm in that situation, I use -destring-. That way, if something that I think should be a numeric variable erroneously contains something that isn't a number, -destring- will refuse to convert it. That way, I find out right away that there's a problem with the data and I can go about fixing it (or I can find out after looking into it that my expectation that the variable is entirely numeric was mistaken and revise my data management and analysis plans accordingly).

                        By contrast, -real()- doesn't care. It just returns a missing value and moves on without giving you any warning that something is amiss. Then, if you're lucky, you'll find out about this when something bizarre shows up in analysis that you then have to laboriously trace back to this problem. If you're unlucky, you won't find out about it until after somebody else gets burned by relying on your incorrect results.

                        Comment


                        • #13
                          Originally posted by William Lisowski View Post
                          Did you try the following suggested workaround given in Clyde's post?
                          Thanks William. Indeed it worked perfectly

                          Comment


                          • #14
                            Originally posted by Clyde Schechter View Post
                            Given that the non-numeric identifiers are, indeed, correct, the code in #5 is your workaround; it will create a parallel numeric identifier. You will not be able to put value labels on it because, as -encode- told you, there are too many distinct values to fit in a value label. So you will probably need to carry along both id and num_id in your data, because Stata will want you to use num_id in -xtset- and other commands, but sometimes you may need to refer to id so that you know what entities your observations refer to in a recognizable way.

                            Added: Crossed with #9.
                            Thanks Clyde. It worked perfectly

                            Comment

                            Working...
                            X