Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Add missing observations to the panel data

    Hello,

    I would like to add the missing observations for the missing years to the panel data below (my data set is very large, and this would help me easily manage and calculate the growth rates).

    The question is similar to this post (http://www.stata.com/statalist/archi.../msg00002.html); but this is a panel, and I cannot use –tsset– and –tsfill–


    The missing years are: 2004, 2005, 2009
    ccode cname pdtcode partcode year export Import
    NGA Nigeria 98 Africa 2000 16865.06 163984.4
    NGA Nigeria 98 Africa 2001 38273.86 462252.9
    NGA Nigeria 98 Africa 2002 417154.6 313921.3
    NGA Nigeria 98 Africa 2003 60113.79 642668.6
    NGA Nigeria 98 Africa 2006 399306.9 684482.1
    NGA Nigeria 98 Africa 2007 229000 1453937
    NGA Nigeria 98 Africa 2008 935363.9 1646058
    NGA Nigeria 98 Africa 2010 1988731 1629383
    NGA Nigeria 98 WLD 2000 65469.92 5474002
    NGA Nigeria 98 WLD 2001 60095.69 7344401
    NGA Nigeria 98 WLD 2002 1327962 8281683
    NGA Nigeria 98 WLD 2003 598970 14333524
    NGA Nigeria 98 WLD 2006 4060341 21208807
    NGA Nigeria 98 WLD 2007 1775116 29903932
    NGA Nigeria 98 WLD 2008 5111720 27254810
    NGA Nigeria 98 WLD 2010 18959458 42759921

    The expected result looks like this:
    ccode cname pdtcode partcode year export import
    NGA Nigeria 98 Africa 2000 16865.06 163984.4
    NGA Nigeria 98 Africa 2001 38273.86 462252.9
    NGA Nigeria 98 Africa 2002 417154.6 313921.3
    NGA Nigeria 98 Africa 2003 60113.79 642668.6
    NGA Nigeria 98 Africa 2004
    NGA Nigeria 98 Africa 2005
    NGA Nigeria 98 Africa 2006 399306.9 684482.1
    NGA Nigeria 98 Africa 2007 229000 1453937
    NGA Nigeria 98 Africa 2008 935363.9 1646058
    NGA Nigeria 98 Africa 2009
    NGA Nigeria 98 Africa 2010 1988731 1629383
    NGA Nigeria 98 WLD 2000 65469.92 5474002
    NGA Nigeria 98 WLD 2001 60095.69 7344401
    NGA Nigeria 98 WLD 2002 1327962 8281683
    NGA Nigeria 98 WLD 2003 598970 14333524
    NGA Nigeria 98 WLD 2004
    NGA Nigeria 98 WLD 2005
    NGA Nigeria 98 WLD 2006 4060341 21208807
    NGA Nigeria 98 WLD 2007 1775116 29903932
    NGA Nigeria 98 WLD 2008 5111720 27254810
    NGA Nigeria 98 WLD 2009
    NGA Nigeria 98 WLD 2010 18959458 42759921
    Many thanks.
    Hieu

  • #2
    Hi nguyenchihieu82vn,
    You can start with the task in a different way.
    use data,
    keep ccode
    duplicates drop ccode
    expand 11
    bysort ccode:gen year=1999+_n
    sort ccode year
    merge ccode year using data,
    This should work for the data you show.
    Best

    Comment


    • #3
      Thanks a lot FernandosRios,
      It works well that way.
      Best,

      Comment


      • #4
        I don't know why you say

        this is a panel, and I cannot use tsset and tsfill
        Consider this example:

        Code:
         
        . webuse grunfeld, clear
        
        . l if company == 7 & year == 1940
        
             +--------------------------------------------------+
             | company   year   invest   mvalue   kstock   time |
             |--------------------------------------------------|
        126. |       7   1940    33.71    145.1    222.9      6 |
             +--------------------------------------------------+
        
        . drop if company == 7 & year == 1940
        (1 observation deleted)
        
        . tsfill
        
        . l if company == 7 & year == 1940
        
             +--------------------------------------------------+
             | company   year   invest   mvalue   kstock   time |
             |--------------------------------------------------|
        126. |       7   1940        .        .        .      . |
             +--------------------------------------------------+
        This is a panel; it was previously tsset; and tsfill does what it is supposed to. That's not much use, but it works.

        Comment


        • #5
          @Nick

          My problem is that I cannot tsset the data.

          . tsset year
          repeated time values in sample
          r(451);

          Comment


          • #6
            Quite. As it is a panel, you must specify a panel identifier as well as a time variable. This is documented in the help for tsset.
            Last edited by Nick Cox; 30 May 2014, 11:53.

            Comment


            • #7
              Thanks a lot Nick. I would read more on that.
              Anyway, I find the suggestion of FernandosRios above really helpful for this task as I can quickly check the missing years and transform the data to appropriate form.

              Comment


              • #8
                Originally posted by FernandoRios View Post
                Hi nguyenchihieu82vn,
                You can start with the task in a different way.
                use data,
                keep ccode
                duplicates drop ccode
                expand 11
                bysort ccode:gen year=1999+_n
                sort ccode year
                merge ccode year using data,
                This should work for the data you show.
                Best
                I actually have this problem myself with a dataset that I am using. I have some years that are missing similar to Nguyen (I need to add the missing observations for the missing years). I don't understand Fernando's explanation/breakdown for Nguyen's question.

                Comment


                • #9
                  The code you show, written by Fernando Rios, starts by reading your data set and then keeping only the ccode variable, and only one observation for each ccode. It then expands that to 11 observations for each ccode, and goes on to create a new variable with consecutive values 2000, 2001,...,2010 for each ccode. That resulting "frame" of ccodes and years is then -merge-d with the original data. Where a given ccode-year combination already existed in the original data, that data will be brought back. Where there is no corresponding combination of ccode and year in the original data, there will now be an observation containing the ccode and year, but the other variables will have missing values, as there is no information about them.

                  Comment


                  • #10
                    Originally posted by Clyde Schechter View Post
                    The code you show, written by Fernando Rios, starts by reading your data set and then keeping only the ccode variable, and only one observation for each ccode. It then expands that to 11 observations for each ccode, and goes on to create a new variable with consecutive values 2000, 2001,...,2010 for each ccode. That resulting "frame" of ccodes and years is then -merge-d with the original data. Where a given ccode-year combination already existed in the original data, that data will be brought back. Where there is no corresponding combination of ccode and year in the original data, there will now be an observation containing the ccode and year, but the other variables will have missing values, as there is no information about them.
                    Thank you Clyde. I tried to do something similar from Fernando's code and logic, but with a dyad-year unit of analysis (two countries). I got an r4 error "no; data in memory would be lost."


                    I would like to add the missing years and observations for the dataset for the years 2000-2013, very similar to Nyugen. Where importer1 and importer2 are the countries.

                    Code:
                    use A.dta
                    keep importer1 importer2
                    duplicates drop importer1 importer2
                    expand 14
                    bysort importer1 importer2:gen year=1999+_n
                    sort importer1 importer2 year
                    merge importer1 importer2 year use A.dta
                    From what I read from Nick's example, it seems that just erased observations that were already there?

                    Comment


                    • #11
                      So I see two problems here. In the first line, if there is already data in memory that has not been saved since its last modification, Stata will give you precisely the error message you describe. Stata never permits you to clobber unsaved data in memory without either saving it first or indicating that you are aware you are throwing it away and are OK with that. So the first line needs to be -use A.dta, clear-.

                      There is another error in the last line. But I think it would give you a syntax error rather than the message you got. It should read:

                      Code:
                      merge import1 importer2 year using A.dta
                      A couple of things can also be simplified. -duplicates drop importer1 importer2- can be simplified to just -duplicates drop- because at that point in the code, importer1 and importer2 are the only variables in the data in memory. And the -sort importer1 importer2 year- command is not needed (unless you are using a pretty old version of Stata). Modern Stata automatically sorts the data, if necessary, when it does the -merge-. To be clear, neither of these things is an error; you can leave them the way they are. I'm just pointing out a way to make the code a tad cleaner.

                      Comment


                      • #12
                        Originally posted by Clyde Schechter View Post
                        So I see two problems here. In the first line, if there is already data in memory that has not been saved since its last modification, Stata will give you precisely the error message you describe. Stata never permits you to clobber unsaved data in memory without either saving it first or indicating that you are aware you are throwing it away and are OK with that. So the first line needs to be -use A.dta, clear-.

                        There is another error in the last line. But I think it would give you a syntax error rather than the message you got. It should read:

                        Code:
                        merge import1 importer2 year using A.dta
                        A couple of things can also be simplified. -duplicates drop importer1 importer2- can be simplified to just -duplicates drop- because at that point in the code, importer1 and importer2 are the only variables in the data in memory. And the -sort importer1 importer2 year- command is not needed (unless you are using a pretty old version of Stata). Modern Stata automatically sorts the data, if necessary, when it does the -merge-. To be clear, neither of these things is an error; you can leave them the way they are. I'm just pointing out a way to make the code a tad cleaner.
                        I'm all about clean data.

                        Something is wrong during the merge it seems. I even cut out a very small snippet of the original dataset, and it's still not merging properly. I tried your clean code:

                        Code:
                        use A.dta, clear
                        
                        keep countrya countryb 
                        
                        duplicates drop
                        
                        expand 14
                        
                        bysort countrya countryb:gen year=1999+_n
                        
                        merge countrya countryb year using A.dta
                        here is the "snippet" data:
                        Code:
                        * Example generated by -dataex-. To install: ssc install dataex
                        clear
                        input int(year ccode1 ccode2) str11 countrya str10 countryb byte(applied accepted other rejected)
                        2008 339 240 "Afghanistan" "Turkey"     1 0 0 0
                        2000 339 241 "Afghanistan" "Tajikistan" 0 0 0 1
                        2001 339 241 "Afghanistan" "Tajikistan" 6 0 0 6
                        2002 339 241 "Afghanistan" "Tajikistan" 5 0 0 5
                        2008 339 241 "Afghanistan" "Tajikistan" 1 0 0 0
                        2013 339 241 "Afghanistan" "Tajikistan" 1 0 0 0
                        end
                        I get the master data not sorted error again. Only the countries and year variables stay
                        Code:
                        * Example generated by -dataex-. To install: ssc install dataex
                        clear
                        input str11 countrya str10 countryb float year
                        "Afghanistan" "Tajikistan" 2000
                        "Afghanistan" "Tajikistan" 2001
                        "Afghanistan" "Tajikistan" 2002
                        "Afghanistan" "Tajikistan" 2003
                        "Afghanistan" "Tajikistan" 2004
                        "Afghanistan" "Tajikistan" 2005
                        "Afghanistan" "Tajikistan" 2006
                        "Afghanistan" "Tajikistan" 2007
                        "Afghanistan" "Tajikistan" 2008
                        "Afghanistan" "Tajikistan" 2009
                        "Afghanistan" "Tajikistan" 2010
                        "Afghanistan" "Tajikistan" 2011
                        "Afghanistan" "Tajikistan" 2012
                        "Afghanistan" "Tajikistan" 2013
                        etc

                        So, it's providing the years needed, but it's not merging. I'm not sure how it's not sorted correctly.

                        Comment


                        • #13
                          Sorry I didn't pick this up earlier. There's another error in the -merge- statement. In modern Stata, we specify -merge-'s as 1:1, 1:m, or m:1 (-m:m- is legal too but is dangerous as it produces mostly gibberish). When that specification is omitted, Stata's parser thinks you are using old -merge- syntax. Back then, you also had to sort both data sets on the merge key before you could run the merge. So that's why you're getting that error. It should be:

                          Code:
                          merge m:1 countrya countryb year using A.dta

                          Comment


                          • #14
                            Originally posted by Clyde Schechter View Post
                            Sorry I didn't pick this up earlier. There's another error in the -merge- statement. In modern Stata, we specify -merge-'s as 1:1, 1:m, or m:1 (-m:m- is legal too but is dangerous as it produces mostly gibberish). When that specification is omitted, Stata's parser thinks you are using old -merge- syntax. Back then, you also had to sort both data sets on the merge key before you could run the merge. So that's why you're getting that error. It should be:

                            Code:
                            merge m:1 countrya countryb year using A.dta
                            Clyde, thank you! I'm actually just learning about merging data in general, so I'm still trying to wrap my head around them all.

                            In the case of Nyugen and myself, we are using a many to one merge because data we're adding to (using A.dta) is unique, and the data we created is not unique (because it's longitudinal)?

                            Comment


                            • #15
                              In the case of Nyugen and myself, we are using a many to one merge because data we're adding to (using A.dta) is unique, and the data we created is not unique (because it's longitudinal)?
                              Precisely so.

                              Comment

                              Working...
                              X