Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Adding Extra Years to Dataset on Stata

    Good Evening,

    I have a large dataset that ends in 2006, however, I would like to add extra years to the data (up to 2012). I tried to look through the forum and online, but I haven't found much. Besides the origin, I actually do not want to add any observations (at the moment) to the data. In other words, I want to add extra rows by years and origin.If I try to do this manually on excel (or another format) it will take a lot of time seeing that there are about 500,000+ rows. So, I'm trying to see if I can find an easier way.

    Code:
    input int year str34 origin long(totalpersonspendingstartyear appliedduringyear rejected)
       . ""                           .   .   .
    2003 "China"                      0   4   0
    2004 "China"                      4   0   0
    2005 "China"                      4   2   0
    2006 "China"                      2   0   0
    In other words, I want it to look like the following:
    2003 "China" 0 4 0
    2004 "China" 4 0 0
    2005 "China" 4 2 0
    2006 "China" 2 0 0
    2007 "China"
    2008 "China"
    2009 "China"
    2010 "China"
    2011 "China"
    2012 "China"

    I tried to play around with carryforward but get "repeated time values in sample" error when I used tsset with the year.

    Regards.

  • #2
    Hi Eli,

    Welcome to Statalist. Thanks for sharing an example using dataex!

    I think the below snippet does what you want. I've taken the liberty of adding in some other countries to your dataset and have just coded their data for the totalpersons..., appliedduring..., and rejected variables as 1 in all cases (just for the sake of example). You can essentially ignore the input part of the below example though, and focus on the code.

    Code:
    clear
    input int year str34 origin long(totalpersonspendingstartyear appliedduringyear rejected)
    2003 "China"                      0   4   0
    2004 "China"                      4   0   0
    2005 "China"                      4   2   0
    2006 "China"                      2   0   0
    2003 "Japan"                      1   1   1    
    2004 "Japan"                      1   1   1
    2005 "Japan"                      1   1   1
    2006 "Japan"                      1   1   1
    2003 "Korea"                      1   1   1
    2004 "Korea"                      1   1   1
    2005 "Korea"                      1   1   1
    2006 "Korea"                      1   1   1
    2003 "Taiwan"                     1   1   1
    2004 "Taiwan"                     1   1   1
    2005 "Taiwan"                     1   1   1
    2006 "Taiwan"                     1   1   1
    2003 "Laos"                       1   1   1
    2004 "Laos"                       1   1   1
    2005 "Laos"                       1   1   1
    2006 "Laos"                       1   1   1
    end
    
    //your code starts from here
    
    expand 3, gen(expandob) //adding two more versions of every observation currently in your dataset
    sort origin expandob year //sorting by origin, then whether it's an observation we've just created, then by year
    
    ds year origin expandob, not 
    local toempty `r(varlist)' //storing all the variables that you want to be empty in a local
    foreach var of local toempty{ 
        replace `var' = . if expandob == 1 //making these variables empty
    }
    
    foreach var of varlist year origin{
        bysort origin: replace year = year[_n-1]+1 if expandob == 1 //replacing the years so they're sequential from the most recent year for each country in your original data 
    }
    
    drop expandob //dropping the expandob variable
    Good luck!

    Comment


    • #3
      Hi Chris,

      The code worked! But for some reason, the years expanded until 2029 (I believe)? I'm still trying to see how that happened, but in the mean time, I just used drop if year >2012 to clean it up.

      Also, thank you for the notes on the code. They were very helpful for understanding.
      Last edited by Eli Aguado; 19 Jan 2017, 21:37. Reason: Added a thank you

      Comment


      • #4
        Eli, I copied Chris' code into my do-editor and ran it, and it did not run the years out to 2029. I don't know what you did, but whatever you ran was not exactly Chris' code.

        That said, I'll just point out that there is no need for -bysort origin: replace year = year[_n-1]+1 if expandob == 1- to be in a loop. The loop doesn't even reference the var iterator and just repeats the same calculation 3 times, when once is all you need.

        Comment


        • #5
          Clyde is completely right. My mind was focussed on both origin and year, but obviously origin is taken care of with the expand command. Thanks Clyde!

          Comment


          • #6
            Good day to all, I am sorry if I should have asked this question separately. Please point out if that's the norm (I am new here). I am wondering if Eli's question can be extended to adding retrospective years in the dataset? That is to add 1990 to 2002 years before 2003's data? Thank you very much.

            Comment


            • #7
              This is sufficiently closely related to the topic of the thread that I think it is reasonable to ask the question here.

              Code:
              gen expander = cond(year == 2003, 2003-1990 + 1, 1)
              expand expander
              by origin (year), sort: replace year = 1990 + _n - 1 if year == 2003
              sort origin year
              
              foreach v of varlist totalpersonspendingstartyear appliedduringyear rejected {
                  replace `v' = . if year < 2003
              }
              Note that the foreach loop at the end is there to mark the newly created observations as having no actual data on the substantive variables. If the intent, however, is to carry the 2002 variable values backwards all the way to 1990 you would omit that. (But carrying backwards anything like this from 2002 to 1990 strikes me as a pretty bad idea, so I'm not recommending that.)

              Comment


              • #8
                Hi Clyde, Thank you very much for your reply. It solved my problem very nicely indeed! Thanks for the foreach part too ( I was not looking to populate the variables with the data from 2003).

                Comment


                • #9
                  Hi all,

                  I have a question closely linked to this. I am also trying to add years as extra rows to my dataset (also with no extra observations).

                  However, whereas Eli wanted to add a fixed number of years, the number of years which need to be added to my dataset vary by country. In the example below I would need to add the years 1996 and 1997 for Japan and 1998 for Korea. My actual dataset varies from 1970-2017. I have a large dataset and am looking to find a quicker way to perform this using Stata. Thanks in advance.

                  E.g.

                  Japan
                  1992
                  1993
                  1994
                  1995
                  1998


                  Korea
                  1993
                  1994
                  1995
                  1996
                  1997

                  Comment


                  • #10
                    First, you need to post an example from your Stata data set, not a hand-made tableau. Use the -dataex- command to do that. If you are running version 15.1 or a fully updated version 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

                    Next, you need to clarify your question. I might infer from the Japan example that you just want to fill in gaps in the existing years. But for Korea you are actually adding on a year beyond the range of the existing data. But you say that your full data set runs to 2017. So why do you only need to go to 1998 in Korea? And for any other country, how would somebody know which years you want to add ? At the moment it seems to require telepathy to answer your question.

                    Comment


                    • #11
                      Hi Clyde,

                      Apologies for the vagueness. It made a lot more sense in my head but I appreciate full clarity is needed in order to find a solution.

                      I mentioned in my first post that my actual dataset covers 1970-2017. I include a smaller subsample here for purposes of simplification but it perfectly replicates my problem.

                      My dependent variable 'exportedproducts' is trade data from UN COMTRADE. COMTRADE only presents years for which there is recorded data and misses the year out completely if there is no data, whereas the source for my control variables (i.e GDP) includes all years albeit with a '.' to denote the observation is missing. So that the rows line things up nicely for when I copy and paste my control variables, I want to add the missing years to the dependent variable first.

                      In the example below the period for each country should cover 2000-2010. For developed countries the the full range is almost always available. However things get a little messy for developing countries: 1) years are missing within the range (see Nigeria where 2006-2008 are missing); 2) the first/ last year is missing (see Angola where 2010 is missing); and 3) a combination of both (also see Angola where 2004, 2005, 2007 and 2010 are missing).

                      When attempting this in Excel, I made a new column for '#Years' and employed a countIF formula to count the number of years for each country and highlighted in Green when #Years = 10. Where there were missing years, I manually went in and added rows where necessary.

                      Is there a shortcut to this using Stata?

                      Thanks in advance.

                      Code:
                      * Example generated by -dataex-. To install: ssc install dataex
                      clear
                      
                      year  country exportedproducts
                      2000 "Nigeria" 434
                      2001 "Nigeria" 555
                      2002 "Nigeria" 343
                      2003 "Nigeria" 677
                      2004 "Nigeria" 565
                      2005 "Nigeria" 443
                      2009 "Nigeria" 555
                      2010 "Nigeria" 666
                      2000 "Angola" 764
                      2001 "Angola" 655
                      2002 "Angola" 566
                      2003 "Angola" 547
                      2006 "Angola" 527
                      2008 "Angola" 678
                      2009 "Angola" 658
                      end
                      Last edited by Ray Uddin; 04 Mar 2019, 02:20.

                      Comment


                      • #12
                        Well, you just shouldn't be doing this at all, now that I see the full picture. Data sets should not be created by copy and paste: it is far too easy to mess up things like alignment (as you have noted) or to inadvertently omit some rows or columns. And it leaves no audit trail of what has been done. It's simply not an acceptable way to do things for serious purposes. I would never accept analyses based on data sets created that way from people I supervise. You should be importing your data into Stata using the -import excel- (for .xls* files) or -import delimited- (for .csv files) commands (or using a program like StatTransfer) as separate data sets, and then clean them. Finally, when you -merge- them together the issue of the non-existence of observations for years with missing data in one of the data sets will be handled automatically by Stata. You won't even have to think about it.

                        If you are not familiar with the -import excel- and -import delmited- commands, you should learn them now. The PDF manual that comes with your Stata installation provides clear instructions and illustrations. Type -help import excel- in the command window. Then click on the blue link near the top of the Viewer window that opens up to go to the corresponding manual section.

                        Comment


                        • #13
                          Thanks Clyde.

                          For anyone else who may encounter this issue, I solved it as follows:

                          1. Opened the main .dta file and saved a copy under a different name. The data file in memory is then the copy file.

                          The using the following code:

                          Code:
                          keep country
                          duplicates drop country, force
                          expand 48 [since I want every country to display 48 rows for 48 years]
                          bysort country: gen year=1969+_n [since my dataset runs from 1970 to 2017]
                          merge 1:1 country year using [main dta. file]
                          save [save location]
                          This ensures you can merge the dataset with missing years with dta. files which contain full years as you are now able to attach a unique id.

                          Best,
                          Ray
                          Last edited by Ray Uddin; 04 Mar 2019, 14:54.

                          Comment


                          • #14
                            I would like to ask to extend the answer provided, if possible. What if instead of having just the year I have year and quarter variables and I need to reference to these to extend the dataset? Thank you.

                            Comment


                            • #15
                              Use the -yq()- function to to calculate a proper Stata internal format quarterly date variable, and then use that in the computations for extending the data set. (For ease of understanding outputs, it is usually advisable to apply a %tq display format to this created variable.) You may or may not want to retain the original year and quarter variables--they are sometimes useful, and sometimes not, depending on what else you will be doing.

                              Comment

                              Working...
                              X