Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Generating a new row for specific year

    Hi all, I have the following panel data set with -ID YEAR CITY STATE- variables.

    For the purpose of the study, I need do the following:

    1. Identify -ID- that does not have -YEAR- 2004.
    2. Create an artificial row for this -ID- with -ID YEAR CITY STATE- that copies from whichever is the most recent entry.
    3. If there is -ID- with 2004 year entry, skip.

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input long id double year str195 city str8 state
    1004 2000 "Wood Dale" "IL"
    1004 2001 "Wood Dale" "IL"
    1004 2002 "Wood Dale" "IL"
    1004 2008 "Wood Dale" "IL"
    1004 2009 "Wood Dale" "IL"
    1004 2010 "Wood Dale" "IL"
    1005 2007 "Beaconsfield" "MA"
    1005 2008 "Beaconsfield" "MA"
    1005 2009 "Beaconsfield" "MA"
    1005 2010 "Beaconsfield" "MA"
    end
    I tried the following construction but I am not sure how to create the row I want:

    Code:
    bysort id (year): gen obs = _n
    local varid id
    * I think I need a counter variable like ...local i = 1
    
    foreach y of local varid{
        * if not have 2004
        if max(year) <2004 | min(year) >2004 {
           * what command should I have here to insert a row with year = 2004 and to keep values in -city- -state-?
        }
    }
    Any advice/help would be greatly appreciated!

  • #2
    Code:
    isid id year, sort
    by id (year), sort: egen has_2004 = max(year == 2004)
    by id (year): gen expander = cond(!has_2004 & _n == 1, 2, 1)
    expand expander
    by id (expander year), sort: replace year = 2004 if _n == _N & !has_2004
    
    isid id year, sort
    ds id year has_2004, not
    foreach v of varlist `r(varlist)' {
        by id (year): replace `v' = `v'[_n-1] if year == 2004 & !has_2004
    }
    drop expander
    NOTE: Code assumes, and verifies, that id and year uniquely identify observations in the data.
    Last edited by Clyde Schechter; 22 Feb 2024, 13:02.

    Comment


    • #3
      Code:
      clear
      input long id double year str20 city str8 state
      1004 2000 "Wood Dale" "IL"
      1004 2001 "Wood Dale" "IL"
      1004 2002 "Wood Dale" "IL"
      1004 2008 "Wood Dale" "IL"
      1004 2009 "Wood Dale" "IL"
      1004 2010 "Wood Dale" "IL"
      1005 2007 "Beaconsfield" "MA"
      1005 2008 "Beaconsfield" "MA"
      1005 2009 "Beaconsfield" "MA"
      1005 2010 "Beaconsfield" "MA"
      1006 2003 "Sometown" "VA"
      1006 2004 "Sometown" "VA"
      1006 2005 "Sometown" "VA"
      1006 2016 "Sometown" "VA"
      end
      * Create some values
      set seed 1744237
      gen x = runiform()
      
      preserve
      * 2004 indicator
      gen has_2004 = (year == 2004)
      * Select states with no 2004
      collapse (max) year has_2004, by(id city state)
      drop if has_2004 == 1
      replace year = 2004
      save temp01, replace
      restore
      
      * Append empty 2004 rows
      append using temp01
      * Copy values
      bysort state city id (year): replace x = x[_n-1] if year == 2004 & has_2004 == 0
      Outcome:

      Code:
           +----------------------------------------------------------+
           |   id   year           city   state          x   has_2004 |
           |----------------------------------------------------------|
        1. | 1004   2000      Wood Dale      IL   .0248728          . |
        2. | 1004   2001      Wood Dale      IL   .5002839          . |
        3. | 1004   2002      Wood Dale      IL   .7049134          . |
        4. | 1004   2004      Wood Dale      IL   .7049134          0 |
        5. | 1004   2008      Wood Dale      IL   .0677949          . |
        6. | 1004   2009      Wood Dale      IL   .9536643          . |
        7. | 1004   2010      Wood Dale      IL   .1169867          . |
           |----------------------------------------------------------|
        8. | 1005   2004   Beaconsfield      MA          .          0 |
        9. | 1005   2007   Beaconsfield      MA   .0696377          . |
       10. | 1005   2008   Beaconsfield      MA    .972551          . |
       11. | 1005   2009   Beaconsfield      MA    .945122          . |
       12. | 1005   2010   Beaconsfield      MA   .1402545          . |
           |----------------------------------------------------------|
       13. | 1006   2003       Sometown      VA   .4325173          . |
       14. | 1006   2004       Sometown      VA   .3057961          . |
       15. | 1006   2005       Sometown      VA   .9123544          . |
       16. | 1006   2016       Sometown      VA   .4709794          . |
           +----------------------------------------------------------+
      Crossed with #2. The solution is #2 is less convoluted.

      Comment


      • #4
        Thanks to both Clyde and Ken for help!

        A quick question to Ken -- is there a way to restrict this application of "inserting a row" to those IDs that are between years of 2000 and 2010?

        In other words, if there is an ID that do not have values in between (inclusive) 2000 and 2010, I don't want the operation to take place.

        Comment


        • #5
          Originally posted by Stephen Ch View Post
          Thanks to both Clyde and Ken for help!

          A quick question to Ken -- is there a way to restrict this application of "inserting a row" to those IDs that are between years of 2000 and 2010?

          In other words, if there is an ID that do not have values in between (inclusive) 2000 and 2010, I don't want the operation to take place.
          Just want to clarify one thing first. When you said "that copies from whichever is the most recent entry", the most recent entry also includes future data? (e.g. copy 2005's data into 2004 if 2004 is missing?)

          Comment


          • #6
            Thanks for asking the clarification question.

            I want to restrict this operation for the IDs that have entries for years between 2000 and 2010.

            For example, if ID 1004 has entries 1998, 2012, then I don't want to create a 2004 entry with id, city, and state.

            If ID 1005 has 1998, 2003, and 2012, then I do want to create a 2004 entry with id, city, and state from the 2012 city state data (e.g., it is possible city and state entries are different from 1998 or 2003).

            Originally posted by Ken Chui View Post

            Just want to clarify one thing first. When you said "that copies from whichever is the most recent entry", the most recent entry also includes future data? (e.g. copy 2005's data into 2004 if 2004 is missing?)

            Comment


            • #7
              Ken, also when you use -collapse- I am losing a lot of variables from the original data set; is there a way to preserve them?

              Originally posted by Ken Chui View Post

              Just want to clarify one thing first. When you said "that copies from whichever is the most recent entry", the most recent entry also includes future data? (e.g. copy 2005's data into 2004 if 2004 is missing?)

              Comment


              • #8
                Originally posted by Stephen Ch View Post
                Thanks for asking the clarification question.

                I want to restrict this operation for the IDs that have entries for years between 2000 and 2010.

                For example, if ID 1004 has entries 1998, 2012, then I don't want to create a 2004 entry with id, city, and state.

                If ID 1005 has 1998, 2003, and 2012, then I do want to create a 2004 entry with id, city, and state from the 2012 city state data (e.g., it is possible city and state entries are different from 1998 or 2003).

                But if an ID has 1998, 2005, and 2012, then are you planning to:
                1. Copy the 1998 data as 2004, OR
                2. Copy the 2005 data as 2004, OR
                3. Not impute for 2004 at all?
                Thanks!

                Comment


                • #9
                  Originally posted by Stephen Ch View Post
                  Ken, also when you use -collapse- I am losing a lot of variables from the original data set; is there a way to preserve them?
                  That's why it's wrapped between a pair of commands called "preserve" and "restore". If you run the whole set command on a do-file between these two commands, you'll see that your original data would not be affected.

                  Comment


                  • #10
                    Hi Ken, this is very helpful, thank you. Thanks for demystifying the observation drop concern with -collapse-!

                    To answer your question:

                    if an ID has 1998, 2005, and 2012, then since 2005 is between 2000 and 2010, I do want to create 2004 entry based on 2012 information. However, I only want to keep ID, city, and state entries in that observation granted there are other pieces of information (e.g., zip code).

                    Thanks!

                    Originally posted by Ken Chui View Post

                    But if an ID has 1998, 2005, and 2012, then are you planning to:
                    1. Copy the 1998 data as 2004, OR
                    2. Copy the 2005 data as 2004, OR
                    3. Not impute for 2004 at all?
                    Thanks!

                    Comment


                    • #11
                      Ken, there is also a possibility that the ID 1004 can be in two different locations at different or same times (just wanted to clarify)

                      Originally posted by Ken Chui View Post

                      But if an ID has 1998, 2005, and 2012, then are you planning to:
                      1. Copy the 1998 data as 2004, OR
                      2. Copy the 2005 data as 2004, OR
                      3. Not impute for 2004 at all?
                      Thanks!

                      Comment


                      • #12
                        Hi Clyde,

                        Thanks for this clean suggested code.

                        Is there a way to restrict this operation for the id-year observations that are between 2000 and 2010?

                        Consider two examples:

                        id 1002 that appears in 1999 2012
                        id 1003 that appears in 1998 2005 2006 2019

                        I would not like to apply your code to id 1002, because I see no observation between (inclusive) 2000 and 2010.

                        However, for id 1003, I have 2005 and 2006 observations. Hence, I would like to repliace the city and state entry from max(year) (i.e., most recent) of id 1003, which is 2019 entries, and make the artificial 2004 observation.



                        Originally posted by Clyde Schechter View Post
                        Code:
                        isid id year, sort
                        by id (year), sort: egen has_2004 = max(year == 2004)
                        by id (year): gen expander = cond(!has_2004 & _n == 1, 2, 1)
                        expand expander
                        by id (expander year), sort: replace year = 2004 if _n == _N & !has_2004
                        
                        isid id year, sort
                        ds id year has_2004, not
                        foreach v of varlist `r(varlist)' {
                        by id (year): replace `v' = `v'[_n-1] if year == 2004 & !has_2004
                        }
                        drop expander
                        NOTE: Code assumes, and verifies, that id and year uniquely identify observations in the data.

                        Comment


                        • #13
                          I believe the following will do what you want. It is not really tested because your -dataex- example data has no id's that would be excluded by this new condition. But I think I have it right.

                          Code:
                          by id (year), sort: egen has_2004 = max(year == 2004)
                          by id (year), sort: egen has_in_range_years = max(inrange(year, 2000, 2010))
                          by id (year): gen expander = cond(has_in_range_years &!has_2004 & _n == 1, 2, 1)
                          expand expander
                          by id (expander year), sort: replace year = 2004 if _n == _N ///
                              & has_in_range_years & !has_2004
                          
                          isid id year, sort
                          ds id year has_2004, not
                          foreach v of varlist `r(varlist)' {
                              by id (year): replace `v' = `v'[_n-1] if year == 2004 & !has_2004
                          }
                          drop expander
                          Also, note that this code replaces the variables in the newly created 2004 record with the data from the preceding year, as you originally specified in #1. I see that you have changed that to another criterion which, to be honest I did not follow. Hopefully you can fix up that part of it yourself. If you need help doing that, please explain the new source of the replacement data in clear and consistent terms, and provide some examples.

                          Comment


                          • #14
                            Hi Clyde,

                            Thanks once again! Your code works but it does not transfer the "city" and "state" information to the newly created entry for 2004.
                            Is there a way to do this?



                            Originally posted by Clyde Schechter View Post
                            I believe the following will do what you want. It is not really tested because your -dataex- example data has no id's that would be excluded by this new condition. But I think I have it right.

                            Code:
                            by id (year), sort: egen has_2004 = max(year == 2004)
                            by id (year), sort: egen has_in_range_years = max(inrange(year, 2000, 2010))
                            by id (year): gen expander = cond(has_in_range_years &!has_2004 & _n == 1, 2, 1)
                            expand expander
                            by id (expander year), sort: replace year = 2004 if _n == _N ///
                            & has_in_range_years & !has_2004
                            
                            isid id year, sort
                            ds id year has_2004, not
                            foreach v of varlist `r(varlist)' {
                            by id (year): replace `v' = `v'[_n-1] if year == 2004 & !has_2004
                            }
                            drop expander
                            Also, note that this code replaces the variables in the newly created 2004 record with the data from the preceding year, as you originally specified in #1. I see that you have changed that to another criterion which, to be honest I did not follow. Hopefully you can fix up that part of it yourself. If you need help doing that, please explain the new source of the replacement data in clear and consistent terms, and provide some examples.

                            Comment


                            • #15
                              Ken, I am trying both your way and Clyde's suggestion.

                              Is there a way to restrict your code for years between 2000 and 2010.

                              To respond to your earlier question:

                              Consider two examples:

                              id 1002 that appears in 1999 2012
                              id 1003 that appears in 1998 2005 2006 2019

                              I would not like to apply your code to id 1002, because I see no observation between (inclusive) 2000 and 2010.

                              However, for id 1003, I have 2005 and 2006 observations. Hence, I would like to repliace the city and state entry from max(year) (i.e., most recent) of id 1003, which is 2019 entries, and make the artificial 2004 observation.

                              Thanks!

                              Originally posted by Ken Chui View Post
                              Code:
                              clear
                              input long id double year str20 city str8 state
                              1004 2000 "Wood Dale" "IL"
                              1004 2001 "Wood Dale" "IL"
                              1004 2002 "Wood Dale" "IL"
                              1004 2008 "Wood Dale" "IL"
                              1004 2009 "Wood Dale" "IL"
                              1004 2010 "Wood Dale" "IL"
                              1005 2007 "Beaconsfield" "MA"
                              1005 2008 "Beaconsfield" "MA"
                              1005 2009 "Beaconsfield" "MA"
                              1005 2010 "Beaconsfield" "MA"
                              1006 2003 "Sometown" "VA"
                              1006 2004 "Sometown" "VA"
                              1006 2005 "Sometown" "VA"
                              1006 2016 "Sometown" "VA"
                              end
                              * Create some values
                              set seed 1744237
                              gen x = runiform()
                              
                              preserve
                              * 2004 indicator
                              gen has_2004 = (year == 2004)
                              * Select states with no 2004
                              collapse (max) year has_2004, by(id city state)
                              drop if has_2004 == 1
                              replace year = 2004
                              save temp01, replace
                              restore
                              
                              * Append empty 2004 rows
                              append using temp01
                              * Copy values
                              bysort state city id (year): replace x = x[_n-1] if year == 2004 & has_2004 == 0
                              Outcome:

                              Code:
                              +----------------------------------------------------------+
                              | id year city state x has_2004 |
                              |----------------------------------------------------------|
                              1. | 1004 2000 Wood Dale IL .0248728 . |
                              2. | 1004 2001 Wood Dale IL .5002839 . |
                              3. | 1004 2002 Wood Dale IL .7049134 . |
                              4. | 1004 2004 Wood Dale IL .7049134 0 |
                              5. | 1004 2008 Wood Dale IL .0677949 . |
                              6. | 1004 2009 Wood Dale IL .9536643 . |
                              7. | 1004 2010 Wood Dale IL .1169867 . |
                              |----------------------------------------------------------|
                              8. | 1005 2004 Beaconsfield MA . 0 |
                              9. | 1005 2007 Beaconsfield MA .0696377 . |
                              10. | 1005 2008 Beaconsfield MA .972551 . |
                              11. | 1005 2009 Beaconsfield MA .945122 . |
                              12. | 1005 2010 Beaconsfield MA .1402545 . |
                              |----------------------------------------------------------|
                              13. | 1006 2003 Sometown VA .4325173 . |
                              14. | 1006 2004 Sometown VA .3057961 . |
                              15. | 1006 2005 Sometown VA .9123544 . |
                              16. | 1006 2016 Sometown VA .4709794 . |
                              +----------------------------------------------------------+
                              Crossed with #2. The solution is #2 is less convoluted.

                              Comment

                              Working...
                              X