Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Panel Data: Creating new observations

    Dear Statafriends,

    I got a tricky issue. The thing is, I have to create new observations, not variables.
    Obviously, creating new observations in Stata is neither comfortable nor does it make sense in most occassions, but this time it is necessary.

    I already found this helpful thread, but my issue is more complex: http://www.stata.com/statalist/archi.../msg00875.html


    My Data
    • I have panel data, identified by company_id and month_id.
    • uniq_company: unique value, one for each company_id
    • no_company: total number of distinct company_ids
    You can find a small subset below.
    My Dataset
    Code:
    clear
    input double company_id float(month_id uniq_company no_company)
    10001 337 10001 11306
    10001 338     . 11306
    10001 339     . 11306
    10001 340     . 11306
    10001 341     . 11306
    10001 342     . 11306
    10001 343     . 11306
    10001 344     . 11306
    10001 345     . 11306
    10001 346     . 11306
    10001 347     . 11306
    10001 348     . 11306
    10001 349     . 11306
    10002 414 10002 11306
    10002 415     . 11306
    10002 416     . 11306
    10002 417     . 11306
    10002 418     . 11306
    10002 419     . 11306
    10002 420     . 11306
    10002 421     . 11306
    10002 422     . 11306
    10002 423     . 11306
    10002 424     . 11306
    10002 425     . 11306
    10002 426     . 11306
    10002 427     . 11306
    10002 428     . 11306
    10002 429     . 11306
    10003 426 10003 11306
    10003 427     . 11306
    10003 428     . 11306
    10003 429     . 11306
    10003 430     . 11306
    10003 431     . 11306
    end
    format %tm month_id
    What do I have to do?
    Create 2 new observations, before the first month_id of each company_id.
    It does not matter what happens with uniq_company no_company, they are only support-variables,
    but I have to create two new (previous) month_id s for each company.

    Code:
    clear
    input double company_id float(month_id uniq_company no_company)
    10001 335     . 11306
    10001 336     . 11306
    10001 337 10001 11306
    10001 338     . 11306
    10001 339     . 11306
    . .     . .
    10002 412     . 11306
    10002 413     . 11306
    10002 414 10002 11306
    10002 415     . 11306
    10002 416     . 11306
    end
    format %tm month_id

    Why do I need this?
    The intuition is, that I use this list as a criterion which is merged with other datasets and defines whether to discard a specific company_id in a specific month.
    Unfortunately, the current, non-adjusted data would lag everything by 2 months, which implies that my list would tell the other dataset to exclude some months that would be necessary, creating a bias in my subsequently used data.




    My Approach up to now
    First, created the necessary number of empty cells for at least 1 month [set obs `=_N+no_company'] (how can I double that number?).
    Then I tried to fill the empty cells with company_id and month_id less 1 month. I tried every way I could think of, this seemed to be the most promising, but it just does not work.


    Code:
    sort company_id month_id
    
    generate uniq_company = .
    by company_id: replace uniq_company = company_id if _n==1
    
    egen no_company = count(uniq_company)
    
    set obs `=_N+no_company'
    
    by company_id month_id: replace company_id   = uniq_company[_n==1] if missing(company_id)
    by company_id month_id: replace month_id   = month_id[_n==1] - 1  if missing(month_id)

    Any suggestions would be highly welcome!

    Thank you,
    Carlos
    Last edited by Carlos Teigimiz; 26 May 2016, 18:21.

  • #2
    I think it's easier to create them separately and then append them to the original data:

    Code:
    clear
    input double company_id float(month_id uniq_company no_company)
    10001 337 10001 11306
    10001 338     . 11306
    10001 339     . 11306
    10001 340     . 11306
    10001 341     . 11306
    10001 342     . 11306
    10001 343     . 11306
    10001 344     . 11306
    10001 345     . 11306
    10001 346     . 11306
    10001 347     . 11306
    10001 348     . 11306
    10001 349     . 11306
    10002 414 10002 11306
    10002 415     . 11306
    10002 416     . 11306
    10002 417     . 11306
    10002 418     . 11306
    10002 419     . 11306
    10002 420     . 11306
    10002 421     . 11306
    10002 422     . 11306
    10002 423     . 11306
    10002 424     . 11306
    10002 425     . 11306
    10002 426     . 11306
    10002 427     . 11306
    10002 428     . 11306
    10002 429     . 11306
    10003 426 10003 11306
    10003 427     . 11306
    10003 428     . 11306
    10003 429     . 11306
    10003 430     . 11306
    10003 431     . 11306
    end
    format %tm month_id
    
    // SET ASIDE THE ORIGINAL DATA
    tempfile original
    save `original'
    
    // KEEP ONLY FIRST MONTH FOR EACH COMPANY ID
    by company_id (month_id), sort: keep if _n == 1
    // GENERATE TWO ADDITIONAL OBSERVATIONS FOR EACH
    expand 3
    // POPULATE month_id WITH THE TWO PRECEDING MONTHS
    by company_id (month_id), sort: replace month_id = month_id[_n-1]-1 if _n > 1
    
    // PUT IT ALL TOGETHER AND SORT IT INTO ORDER
    append using `original'
    sort company_id month_id
    And thank you for using -dataex- and code blocks; it makes working with your information so much easier!

    Comment


    • #3
      Wow, awesome code.
      Thanks so much Clyde!


      One edit: if anybody else also needs this. Do not forget - duplicates drop - to get rid of double initial values
      Last edited by Carlos Teigimiz; 26 May 2016, 19:46.

      Comment


      • #4
        Here's another way to do it in place:

        Code:
        clear
        input double company_id float(month_id uniq_company no_company)
        10001 337 10001 11306
        10001 338     . 11306
        10001 339     . 11306
        10002 414 10002 11306
        10002 415     . 11306
        10002 416     . 11306
        end
        format %tm month_id
        
        bysort company_id (month_id) : gen first = _n == 1
        expand 3 if first
        bysort company_id (month_id) : replace month_id = month_id - 2 if _n == 1
        by company_id : replace month_id = month_id - 1 if _n == 2
         
        list, sepby(company_id)
        
             +---------------------------------------------------+
             | compan~d   month_id   uniq_c~y   no_com~y   first |
             |---------------------------------------------------|
          1. |    10001    1987m12      10001      11306       1 |
          2. |    10001     1988m1      10001      11306       1 |
          3. |    10001     1988m2      10001      11306       1 |
          4. |    10001     1988m3          .      11306       0 |
          5. |    10001     1988m4          .      11306       0 |
             |---------------------------------------------------|
          6. |    10002     1994m5      10002      11306       1 |
          7. |    10002     1994m6      10002      11306       1 |
          8. |    10002     1994m7      10002      11306       1 |
          9. |    10002     1994m8          .      11306       0 |
         10. |    10002     1994m9          .      11306       0 |
             +---------------------------------------------------+
        Last edited by Nick Cox; 27 May 2016, 02:20.

        Comment


        • #5
          [QUOTE=Nick Cox;n1342828]Here's another way to do it in place:

          [CODE]
          clear
          input double company_id float(month_id uniq_company no_company)
          10001 337 10001 11306
          10001 338 . 11306
          10001 339 . 11306
          10002 414 10002 11306
          10002 415 . 11306
          10002 416 . 11306
          end
          format %tm month_id

          bysort company_id (month_id) : gen first = _n == 1
          expand 3 if first
          bysort company_id (month_id) : replace month_id = month_id - 2 if _n == 1
          by company_id : replace month_id = month_id - 1 if _n == 2

          list, sepby(company_id)

          +---------------------------------------------------+
          | compan~d month_id uniq_c~y no_com~y first |
          |---------------------------------------------------|
          1. | 10001 1987m12 10001 11306 1 |
          2. | 10001 1988m1 10001 11306 1 |
          3. | 10001 1988m2 10001 11306 1 |
          4. | 10001 1988m3 . 11306 0 |
          5. | 10001 1988m4 . 11306 0 |
          |---------------------------------------------------|
          6. | 10002 1994m5 10002 11306 1 |
          7. | 10002 1994m6 10002 11306 1 |
          8. | 10002 1994m7 10002 11306 1 |
          9. | 10002 1994m8 . 11306 0 |
          10. | 10002 1994m9 . 11306 0 |
          +---------------------------------------------------+

          ccode_1 ccode_2 year_1
          2 20 2002
          2 20 2003
          2 20 2004
          2 20 2005
          2 20 2006
          2 20 2007

          Hi Nick...me again. my question is how to expand three months afterwards? actually, in may case, My data ends at 2007. i want to expand the observations to 2010. the first line of the command should be this: bysort ccode_1 ccode_2 (year_1): gen last = _n == _N. However, i cannot figure out the second line of command. thanks in advance!

          Comment


          • #6
            Sorry, but to me #5 is a real mess. The data you give at the start don't seem to have any bearing on the final question. Three months? three years? Too much of a a puzzle. If someone else can follow what you want, well and good. Otherwise I suggest you clean up, post again and get

            Code:
            
            
            delimiters right. Use the # key in the editor to get mark-up automatically.

            Comment


            • #7
              Originally posted by Nick Cox View Post
              Sorry, but to me #5 is a real mess. The data you give at the start don't seem to have any bearing on the final question. Three months? three years? Too much of a a puzzle. If someone else can follow what you want, well and good. Otherwise I suggest you clean up, post again and get

              Code:
              
              
              delimiters right. Use the # key in the editor to get mark-up automatically.
              Hi Nick. i figured out this later on, just too tired to post it.
              Here is my code on how to expand observations from 2007 to 2008. my data ends in 2007.
              #
              ccode1 ccode2 year
              2 20 2002
              2 20 2003
              2 20 2004
              2 20 2005
              2 20 2006
              2 20 2007
              "ccode" stand for country code and 2 is for US and 20 is for Canand.
              below is my code
              ******
              bysort ccode1 ccode2 (year): gen last = _n == _N
              bysort ccode1 ccode2 (year): replace year = year +3 if _n == _N
              bysort ccode1 ccode2 (year): replace year = year +3 if _n == _N-1
              bysort ccode1 ccode2 (year): replace year = year +3 if _n == _N-2
              drop if year < 2008
              drop last
              save "/Users/yaojiong/Desktop/dyadic mid1.dta", replace
              clear
              ****
              basically, this helps me to generate observations from 2008-10.
              then, just append with the original data.
              do you think it is right? Thanks in advance.

              Comment


              • #8
                Thanks for the extra detail but I am still in a fog about what your problem is and why your code is a solution.

                Data examples are best provided with dataex: see https://www.statalist.org/forums/help#stata

                Here is some technique for adding years 2008, 2009, 2010 to a dataset with year 2002 to 2007.

                Code:
                clear 
                input ccode1    ccode2    year
                2    20    2002
                2    20    2003
                2    20    2004
                2    20    2005
                2    20    2006
                2    20    2007
                end 
                
                bysort ccode1 ccode2 (year) : gen toexpand = _n == _N
                expand 4 if toexpand 
                bysort ccode1 ccode2 (year) : replace year = year[1] + _n  - 1 
                drop toexpand 
                list, sepby(ccode*) 
                
                     +------------------------+
                     | ccode1   ccode2   year |
                     |------------------------|
                  1. |      2       20   2002 |
                  2. |      2       20   2003 |
                  3. |      2       20   2004 |
                  4. |      2       20   2005 |
                  5. |      2       20   2006 |
                  6. |      2       20   2007 |
                  7. |      2       20   2008 |
                  8. |      2       20   2009 |
                  9. |      2       20   2010 |
                     +------------------------+
                1. That said, there is no obvious point in bulking out a dataset with identifiers and a time variable unless you have other variables to add with non-missing values.

                2. Further, in your own code you drop data for before 2008, so why do that? If you just want a dataset with 3 years for each panel, there will be code to do that directly. But you don't want that, so see my code instead. Yet comment #1 still applies.

                Comment


                • #9
                  Originally posted by Nick Cox View Post
                  Thanks for the extra detail but I am still in a fog about what your problem is and why your code is a solution.

                  Data examples are best provided with dataex: see https://www.statalist.org/forums/help#stata

                  Here is some technique for adding years 2008, 2009, 2010 to a dataset with year 2002 to 2007.

                  Code:
                  clear
                  input ccode1 ccode2 year
                  2 20 2002
                  2 20 2003
                  2 20 2004
                  2 20 2005
                  2 20 2006
                  2 20 2007
                  end
                  
                  bysort ccode1 ccode2 (year) : gen toexpand = _n == _N
                  expand 4 if toexpand
                  bysort ccode1 ccode2 (year) : replace year = year[1] + _n - 1
                  drop toexpand
                  list, sepby(ccode*)
                  
                  +------------------------+
                  | ccode1 ccode2 year |
                  |------------------------|
                  1. | 2 20 2002 |
                  2. | 2 20 2003 |
                  3. | 2 20 2004 |
                  4. | 2 20 2005 |
                  5. | 2 20 2006 |
                  6. | 2 20 2007 |
                  7. | 2 20 2008 |
                  8. | 2 20 2009 |
                  9. | 2 20 2010 |
                  +------------------------+
                  1. That said, there is no obvious point in bulking out a dataset with identifiers and a time variable unless you have other variables to add with non-missing values.

                  2. Further, in your own code you drop data for before 2008, so why do that? If you just want a dataset with 3 years for each panel, there will be code to do that directly. But you don't want that, so see my code instead. Yet comment #1 still applies.
                  Thanks Nick. You code works perfectly fine. yes, i want to add other variables to the dataset. one more question, could you explain the logic of the third line of the command, especially " replace year = year[1] + _n - 1"? Thanks again for your help!

                  Comment


                  • #10
                    See any explanation of the by: prefix command such as my paper in Stata Journal 2(1) from 2002.

                    With that prefix year[1] is the first value in each group and _n runs 1 upwards within each group. So the code gives a sequence of integers as result.

                    Comment


                    • #11
                      Originally posted by Nick Cox View Post
                      Here's another way to do it in place:

                      Code:
                      clear
                      input double company_id float(month_id uniq_company no_company)
                      10001 337 10001 11306
                      10001 338 . 11306
                      10001 339 . 11306
                      10002 414 10002 11306
                      10002 415 . 11306
                      10002 416 . 11306
                      end
                      format %tm month_id
                      
                      bysort company_id (month_id) : gen first = _n == 1
                      expand 3 if first
                      bysort company_id (month_id) : replace month_id = month_id - 2 if _n == 1
                      by company_id : replace month_id = month_id - 1 if _n == 2
                      
                      list, sepby(company_id)
                      
                      +---------------------------------------------------+
                      | compan~d month_id uniq_c~y no_com~y first |
                      |---------------------------------------------------|
                      1. | 10001 1987m12 10001 11306 1 |
                      2. | 10001 1988m1 10001 11306 1 |
                      3. | 10001 1988m2 10001 11306 1 |
                      4. | 10001 1988m3 . 11306 0 |
                      5. | 10001 1988m4 . 11306 0 |
                      |---------------------------------------------------|
                      6. | 10002 1994m5 10002 11306 1 |
                      7. | 10002 1994m6 10002 11306 1 |
                      8. | 10002 1994m7 10002 11306 1 |
                      9. | 10002 1994m8 . 11306 0 |
                      10. | 10002 1994m9 . 11306 0 |
                      +---------------------------------------------------+
                      Really helpful, thanks. My question would be if it is somehow possible to grasp the last observation of a group, not the first. Lets say the observations of different groups (maybe countries) start at different years. With _n == 1 and
                      year_id + 1 (for example) we would not always end at the end of the groups observations. But if we could say _n = "last observation" one could easily add a further year (maybe to predict one year ahead). Is this somehow possible ?
                      Last edited by sladmin; 11 Apr 2022, 08:23. Reason: anonymize original poster

                      Comment


                      • #12
                        This variation presumably starts with

                        Code:
                         
                         bysort company_id (month_id) : gen last = _n == _N
                        See

                        Code:
                        help _variables
                        for basic information and nearby sections of [U] for elaboration.

                        Comment


                        • #13
                          Originally posted by Nick Cox View Post
                          This variation presumably starts with

                          Code:
                          bysort company_id (month_id) : gen last = _n == _N
                          See

                          Code:
                          help _variables
                          for basic information and nearby sections of [U] for elaboration.


                          Code:
                          egen id = group(Country) 
                          xtset id Year, yearly
                          
                          sort id Year
                          
                          bysort id (Year) : gen first = _n == _N if Year > 2018
                          expand 2 if first == 1
                          bysort id (Year) : replace Year = Year +1 if _n == _N
                          
                          replace LMR = . if Year > 2019
                          
                          gen lag_LMR = l.LMR
                          Thank you that was really helpful. So this is my line of code but if i want to execute the last command "gen lag_LMR = l.LMR" Stata says "not sorted". Everything looks sorted and also sorting again isnt helping. Could you tell me why ? Thanks in advance.

                          Comment


                          • #14
                            I think Stata is working on the principle that you have explicitly sorted your data since the last xtset. In principle you could (also) have changed the data so that xtset no longer holds I guess that if you xtset again, the calculation will then go through.

                            Comment


                            • #15
                              Originally posted by Nick Cox View Post
                              I think Stata is working on the principle that you have explicitly sorted your data since the last xtset. In principle you could (also) have changed the data so that xtset no longer holds I guess that if you xtset again, the calculation will then go through.
                              It worked. Thank you very much.

                              Comment

                              Working...
                              X