
  • Xtset and string variables

    Hi Everyone,

I am trying to set my data as a panel using the xtset command. I have uasid as the panel identifier and year as the time variable. Because uasid is a string variable, xtset fails when I use it.
I used encode to convert uasid into a numeric variable, id. When I did that, I ran into the problem shown in the attachment: after encoding, uasid and id do not match for each observation. How can I solve this problem?

  • #2
The encode command is designed for assigning numerical codes to non-numeric strings like "France", "Germany", "United States". The output of help encode instructs us:

    Do not use encode if varname contains numbers that merely happen to be stored as strings; instead, use generate newvar = real(varname) or destring; see real() or [D] destring.
    You should use something like
    Code:
    destring uasid, generate(id)
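If destring succeeds, declaring the panel should then work. A minimal follow-up, assuming the time variable is named year as described in #1:
Code:
xtset id year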



    • #3
      William Lisowski's recommendation to use -destring- is correct.

Unfortunately, the advice in the quote from -help encode-, to use generate newvar = real(varname), would lead to incorrect results in this case. That is because -generate-, by default, creates variables with a float storage type, and a float is not large enough to hold 9 decimal digits of precision, so the results would be wrong (including mapping some distinct uasid values to the same id value). If you were to use -generate ... real()- for this, you would have to explicitly make the new variable a long or double to get correct results.
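To illustrate, a minimal sketch (variable names other than uasid are just for illustration); a float significand holds only about 7 decimal digits, so distinct 9-digit identifiers can round to the same value:
Code:
* default storage type is float, which cannot hold 9 digits exactly,
* so distinct uasid values can collapse into the same number:
generate id_bad = real(uasid)

* declaring the storage type explicitly preserves all 9 digits:
generate long id = real(uasid)
* or: generate double id = real(uasid)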



      • #4
        Originally posted by Yaseminn Akcann View Post
I used encode to convert uasid into a numeric variable, id. . . . After encoding, uasid and id do not match for each observation.
        How did you manage to do that?

I agree with William that destring is the best option in your case, and with Clyde that there's a bug in the documentation that StataCorp needs to fix.

        But I can't see how you got those mismatches with encode.

Code:
. version 17.0

. clear *

. input str9 uasid

         uasid
  1. 140100007
  2. 140100007
  3. 140100007
  4. 140100010
  5. 140100041
  6. 140100011
  7. 140100011
  8. 140100047
  9. 140100035
 10. 140100035
 11. 140100072
 12. 140100041
 13. 140100038
 14. 140100079
 15. 140100047
 16. 140100041
 17. 140100081
 18. 140100048
 19. 140100047
 20. 140100090
 21. 140100072
 22. 140100048
 23. 140100108
 24. 140100079
 25. end

. encode uasid, generate(id) label(UASIDs)

. preserve

. sort uasid

. list uasid id, noobs sepby(uasid)

  +-----------------------+
  |     uasid          id |
  |-----------------------|
  | 140100007   140100007 |
  | 140100007   140100007 |
  | 140100007   140100007 |
  |-----------------------|
  | 140100010   140100010 |
  |-----------------------|
  | 140100011   140100011 |
  | 140100011   140100011 |
  |-----------------------|
  | 140100035   140100035 |
  | 140100035   140100035 |
  |-----------------------|
  | 140100038   140100038 |
  |-----------------------|
  | 140100041   140100041 |
  | 140100041   140100041 |
  | 140100041   140100041 |
  |-----------------------|
  | 140100047   140100047 |
  | 140100047   140100047 |
  | 140100047   140100047 |
  |-----------------------|
  | 140100048   140100048 |
  | 140100048   140100048 |
  |-----------------------|
  | 140100072   140100072 |
  | 140100072   140100072 |
  |-----------------------|
  | 140100079   140100079 |
  | 140100079   140100079 |
  |-----------------------|
  | 140100081   140100081 |
  |-----------------------|
  | 140100090   140100090 |
  |-----------------------|
  | 140100108   140100108 |
  +-----------------------+

. restore

. decode id, generate(backid)

. assert uasid == backid

. exit

end of do-file


        For my own edification, could you show what you typed that got you there?



        • #5
          Thank you all for the answers.

When I use the destring command as William recommended, I still cannot set my data as a panel with xtset: Stata gives me the "repeated time values within panel" error. When I check for duplicates and drop them after that message, half of my observations get deleted.


Joseph, I used the exact same code, encode uasid, generate(id), to convert the values, but it ended up producing these results. I don't understand why neither way works.



          • #6
Stata gives me the "repeated time values within panel" error. When I check for duplicates and drop them after that message, half of my observations get deleted.
Well, is that a problem or isn't it? If your data set contained duplicates of every observation (so that half the data disappeared when you dropped duplicates), you have not actually lost any information. The real problem is why those duplicates were there in the first place. That usually reflects an error in the data management that created the data set. Seldom does anybody intentionally create a data set with any duplicate observations, let alone large numbers of them. Removing those duplicates probably leaves you with the data set you initially intended to create.
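Before dropping anything, it helps to see what kind of duplicates you have. A minimal sketch, assuming id and year are your panel and time variables as elsewhere in this thread:
Code:
* how many observations are complete-record duplicates?
duplicates report

* how many observations merely repeat the panel identifiers?
duplicates report id year

* drop only observations that are identical on every variable
duplicates drop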

All of that holds unless you did something like -duplicates drop id timevar, force-, in which case you may have had observations that were not pure duplicates but merely agreed on id and the time variable while disagreeing on other variables. In that case, you need to review the original data set carefully and get a clear understanding of what is going on. There are two possibilities:

1. The observations that agree on id and timevar but otherwise disagree on some variables are all supposed to be there. They are all correct data. In this case, it is inappropriate to -xtset id timevar- because you do not have panel data. It may be that id in combination with some other variable(s) will uniquely identify observations in conjunction with the time variable--in which case creating a variable that combines those variables will serve instead of id in -xtset-. (Look at -egen, group- to create such a variable; a sketch follows this list.) If that is not the case, you can still -xtset id- without mentioning a time variable. You will still be able to use the -xt- commands for analyses. All you will lose is the ability to use time series operators (leads, lags, etc.) or do analyses with autoregressive structure.

            2. The observations that agree on id and timevar but otherwise disagree on some variables are not all supposed to be there. So you have a bad data set and you need to eliminate the surplus observations that are incorrect. Or perhaps the "correct" observations are combinations of the surplus ones. But even if you can simply handle that, the fact that you ended up with a bad data set suggests that there was something wrong with the data management that created it. So you should go back and review that from beginning to end and find out where things went wrong. In the course of doing that, there is a reasonable chance you will also uncover other errors. Best to get it all fixed now before it bites you later.
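For possibility 1, a minimal sketch of the -egen, group()- approach; member here is a hypothetical stand-in for whatever additional variable(s) distinguish your observations:
Code:
* combine id with another variable into a single panel identifier
* (member is a hypothetical example variable)
egen panel_id = group(id member)
xtset panel_id year

* or, if no such combination exists, declare the panel without a time variable:
xtset id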



            • #7
              Originally posted by Clyde Schechter View Post

              Well, is that a problem or isn't it?
It was a problem, since I was sure there were no duplicates in my data overall, and I had this problem after encoding. After reading the comments and seeing that the encoding had not worked out properly, I went back through every data set that I had appended or merged and figured out the problems. Everything seems fine now. Thanks for all the help!
