
  • Error message: too many values when using encode command

    Hi,

    I imported a variable from a .csv file, but it came in as a string (it shows in red in the Data Browser), so I cannot use it in numeric analyses, and I would like to convert it to numeric.
    I tried the commands destring and encode, but I get an error message saying:
    too many values
    r(134);


    I have 246,312 observations. I read in a few posts that since my dataset is too big, I should get the BIGTAB package. I did, but it doesn't change anything; I still get the error message.
    Do you have any suggestions? What should I do to convert my variable to numeric?

    Thank you so much for your help.







  • #2
    Well, -destring- and -encode- do very different things, and basically if either is applicable to your data, the other probably isn't. So it isn't entirely clear what you're trying to do.

    I'm guessing you really wanted to do what -encode- does and that the -destring- part is based on a misunderstanding. -encode- would have taken your string variable and assigned distinct numeric values to each distinct value of your string variable, and then attached a value label to that variable so that in many kinds of output, the string equivalent would appear instead of the assigned numeric values. But -encode- only allows up to, if I recall correctly, about 65,000 numeric values. And that, if I also recall correctly, (I'm away from a computer with Stata at the moment so I can't check), is due to a limitation on the number of values a single value label can contain.

    So, you are not going to be able to have all of what you wanted. But for purposes of being able to use this variable in analyses, for example in an -xtset- command, you can get the numeric values easily enough, just not the value labels:

    Code:
    egen long numeric_variable = group(string_variable)
    You can use the resulting numeric_variable as a proxy for string_variable in any context where string variables are not allowed, but when you see it in the data browser or in Stata output listings, it will always be numeric and will not show the corresponding string information.
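    For example (a sketch with hypothetical names: time_variable stands in for whatever time variable your panel has), the proxy can feed straight into -xtset-:

    Code:
    egen long numeric_variable = group(string_variable)
    xtset numeric_variable time_variable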



    • #3
      Should it be numeric? We just can't tell on this evidence. You can't show your whole dataset but you could use dataex (SSC) to show us an example.



      • #4
        Yes, my variable is supposed to be numeric: it represents the size of a company, calculated as the log of the market value of equity. But when I imported my .csv file into Stata I got a string variable.
        Here is the example using dataex as suggested:
        Code:
        * Example generated by -dataex-. To install: ssc install dataex
        clear
        input str35 size
        ""
        "4.72893529332202"
        ""
        "-Inf"
        "4.2325400609735"
        ""
        "2.9838621009835"
        "4.41816568128975"
        "5.22643729539419"
        "4.21978156473198"
        ""
        "1.51424772543035"
        ""
        "3.63676345297492"
        "6.19806094631167"
        "5.39689492990903"
        ""
        "5.94307099177886"
        "3.29157780908395"
        "7.46757940235708"
        ""
        ""
        "5.25909565339426"
        ""
        ""
        "3.69185003755877"
        ""
        ""
        ""
        ""
        "5.10461125088744"
        ""
        "7.34136459703477"
        "4.75895079734846"
        "8.52113929529693"
        ""
        "2.44841554120559"
        ""
        "5.74829126458645"
        "4.35646841317719"
        "3.49832409239288"
        ""



        • #5
          The good news is that this is relatively easy. You need destring (and emphatically not encode or egen, group()). A plain

          Code:
          destring size, replace 
          would not have worked (you told us you tried destring but gave no details!) if only because values like "-Inf" would have prevented that.

          Ideally you should check what destring objects to by looking at the problematic values:

          Code:
          tab size if missing(real(size))
          to check that nothing informative would be lost with a brute force solution:

          Code:
          destring size, force replace 
          Do study the help and manual entry for destring.
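          If "-Inf" turns out to be the only offending value, a gentler alternative (a sketch, assuming that is the case for your data) is to handle it explicitly, so that force is unnecessary and nothing is dropped silently:

          Code:
          * recode the one known non-numeric value to missing, then convert
          replace size = "" if size == "-Inf"
          destring size, replace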



          • #6
            Today I stumbled on the same problem, and I solved it using Clyde's suggestion.
            That was timely to say the least!

            thanks a lot



            • #7
              Dino: @ClydeSchechter won't mind me emphasising that egen, group() is entirely the wrong way to approach a problem like that revealed in #4 and I am completely confident that had he seen #4 first he would have recommended what I did.

              With the sample data in #4 this is the mapping that such an approach produces. The best that can be said for it is that it preserves rank order, but nothing else. Missings don't even get treated meaningfully. Indeed, values like "12.3" will break rank order too.

              Code:
                +------------------------------------+
                |               size   group   Freq. |
                |------------------------------------|
                |                ""        1      18 |
                |            "-Inf"        2       1 |
                | "1.51424772543035"       3       1 |
                | "2.44841554120559"       4       1 |
                | "2.9838621009835"        5       1 |
                |------------------------------------|
                | "3.29157780908395"       6       1 |
                | "3.49832409239288"       7       1 |
                | "3.63676345297492"       8       1 |
                | "3.69185003755877"       9       1 |
                | "4.21978156473198"      10       1 |
                |------------------------------------|
                | "4.2325400609735"       11       1 |
                | "4.35646841317719"      12       1 |
                | "4.41816568128975"      13       1 |
                | "4.72893529332202"      14       1 |
                | "4.75895079734846"      15       1 |
                |------------------------------------|
                | "5.10461125088744"      16       1 |
                | "5.22643729539419"      17       1 |
                | "5.25909565339426"      18       1 |
                | "5.39689492990903"      19       1 |
                | "5.74829126458645"      20       1 |
                |------------------------------------|
                | "5.94307099177886"      21       1 |
                | "6.19806094631167"      22       1 |
                | "7.34136459703477"      23       1 |
                | "7.46757940235708"      24       1 |
                | "8.52113929529693"      25       1 |
                +------------------------------------+



              • #8
                Nick, I didn't describe my problem, which is similar to but not the same as Amira's.

                I needed to collapse records that correspond to pathologies within hospital discharges, which in turn are within patients. I have more than 450k records, and discharges are identified by an alphanumeric code (codiceevento), so when I tried encode I got the "too many values" error.
                I read the destring entry in the Stata manual and decided it wasn't my solution, because with the force option all string values containing nonnumeric characters would have been turned into missing. So I created the variable ric:

                Code:
                egen long ric = group(codiceevento)

                and obtained:

                Code:
                id    codiceevento    ric
                P13768    E100370    412
                P13768    E100370    412
                P13768    E100370    412
                P13768    E100370    412
                P13768    E96609    129654
                P13768    E96609    129654
                P13768    E96609    129654
                P13768    E97233    130344
                P13768    E98020    131211
                P13768    E98020    131211
                P13768    E98020    131211
                P13768    E98021    131212
                P13771    E95159    128055
                P13771    E95160    128057
                P13771    E95774    128732
                P13771    E95774    128732
                P13771    E95980    128959
                P13771    E95981    128960
                P13771    E95981    128960
                P13771    E95982    128961
                P13771    E96224    129228
                P13771    E96500    129534
                which, according to codebook, seems to have worked well:

                Code:
                . codebook, com
                
                Variable         Obs Unique      Mean     Min     Max  Label
                -----------------------------------------------------------------------------------------------------------
                id            453798  17749         .       .       .  Cod.PIRP
                codiceevento  453798 133384         .       .       .  Codice evento
                ric           453798 133384  68892.23       1  133384  group(codiceevento)

                Then I ran collapse successfully:
                Code:
                collapse (max) AKI data_dim data_ric datar_AKI datad_AKI, by(id ric)
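                As an optional sanity check (a sketch), the one-to-one mapping between codiceevento and ric that the Unique counts from codebook suggest can be asserted directly:

                Code:
                bysort codiceevento (ric): assert ric[1] == ric[_N]
                bysort ric (codiceevento): assert codiceevento[1] == codiceevento[_N]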



                • #9
                  Dino: Sounds good. You did say in #6 that you had "the same problem". Thanks for the clarification and correction.



                  • #10
                    I'm sorry, Nick, for being inaccurate. Actually, I should have said that I got the same error message.
                    Thank you for your help.



                    • #11
                      Nick Cox, THANK YOU so much!!!
                      It worked!
                      As suggested by you, I used:
                      Code:
                      destring size, force replace
                      and it worked great



                      • #12
                        Good, but anyone reading this should note my advice in #5 that you check that force is not causing loss of information.

