  • Generating a dummy variable for if a variable changes over time periods

    I am trying to generate a variable in panel data and I am having some issues. I wondered if you could tell me what the command is or where I can find information about creating the variable.

    I want to create a dummy variable taking 0 if the industry an individual worked in did not change over the periods; taking value 1 if an individual worked in more than 1 industry over the time period.

    Is there any advice you could give as to how to do this?

    Industry is coded from 1-12 and the data has gaps.

    Thank you in advance.

  • #2
    When you say the data has gaps, it is unclear whether you mean that the panel itself has gaps, or whether you mean that even for observations included in the panel, sometimes the industry variable is missing. I'll assume the latter. That being the case, there is the question of what you want to do when there are missing values, as you really won't know if the industry changed in the years for which you lack information. I'm going to assume that you want to indicate whether the industry variable is constant or changes for those years it is available and you will ignore what might have happened in the years it is missing. I assume you have a variable identifying individuals, call it id.

    Code:
    gen byte industry_missing = missing(industry)
    by id industry_missing (industry), sort: gen byte changed = (industry[1] != industry[_N]) if industry_missing == 0
    by id (industry_missing industry): replace changed = changed[1]
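
    To see the three statements at work, here is a minimal sketch on made-up data. The values below are invented for illustration and are not from the original poster's dataset; the names id, period and industry are assumptions.

    ```stata
    clear
    input id period industry
    1 1 8
    1 2 .
    1 3 10
    2 1 4
    2 2 4
    2 3 .
    end

    * flag observations where industry is missing
    gen byte industry_missing = missing(industry)
    * within each id, compare the smallest and largest non-missing industry code
    by id industry_missing (industry), sort: gen byte changed = (industry[1] != industry[_N]) if industry_missing == 0
    * non-missing observations sort first, so copy their result to the rest of the id
    by id (industry_missing industry): replace changed = changed[1]

    sort id period
    list, sepby(id)
    ```

    With these data, id 1 should get changed == 1 in all three observations and id 2 should get changed == 0, because the years with missing industry are ignored. An id whose industry is missing in every observation would be left with changed missing.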



    • #3
      An example:
      Code:
      clear
      set more off
      
      input ///
      id period indust
      1 1 8
      1 2 8
      1 4 10
      1 5 10
      1 6 8
      2 1 4
      2 2 4
      2 3 4
      2 4 9
      3 1 5
      3 2 5
      3 4 5
      end
      
      list, sepby(id)
      
      *-----
      
      bysort id (indust) : gen indicat = indust[1] != indust[_N]
      
      list, sepby(id)
      The strategy is simple. Sort the values of -indust- within each -id-. If the first and last observations are the same, the person has not changed industries.

      Missing values for -indust- require more code.

      See -help subscripting-, if necessary.
      Last edited by Roberto Ferrer; 24 Apr 2015, 13:27.
      You should:

      1. Read the FAQ carefully.

      2. "Say exactly what you typed and exactly what Stata typed (or did) in response. N.B. exactly!"

      3. Describe your dataset. Use list to list data when you are doing so. Use input to type in your own dataset fragment that others can experiment with.

      4. Use the advanced editing options to appropriately format quotes, data, code and Stata output. The advanced options can be toggled on/off using the A button in the top right corner of the text editor.



      • #4
        Thank you for your prompt response, Clyde. Sorry, this is my first time using this website and I have not been as precise as I should have been. The panel itself has gaps. There is no missing data for the industry variable. You are correct: I want to see whether the variable changes over time for respondents, and thus generate a variable coded 0 for those respondents who did not change industry across the time periods (so their industry code remains the same) and coded 1 for those who have different industry codes (1-12) across the time periods, i.e. they have changed industry at least once. Thank you again.



        • #5
          OK. Well, the code I posted in #2 still works in the absence of missing values for industry, but, given that there are no missings it could be simplified:

          Code:
          by id (industry), sort: gen changed = (industry[1] != industry[_N])



          • #6
            THANK YOU BOTH



            • #7
              Note also FAQ http://www.stata.com/support/faqs/da...ions-in-group/
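
              A related way to get the same indicator is to count the distinct industry codes per individual. The sketch below is one common approach using egen and is not necessarily the method the FAQ describes. The data are made up, and the names id, period, industry, tag, n_industries and changed are assumptions.

              ```stata
              clear
              input id period industry
              1 1 8
              1 2 8
              1 4 10
              2 1 4
              2 2 4
              end

              * tag exactly one observation for each distinct id-industry pair
              egen byte tag = tag(id industry)
              * number of distinct industries observed for each id
              egen n_industries = total(tag), by(id)
              * 1 if more than one industry was observed, 0 otherwise
              gen byte changed = n_industries > 1

              list, sepby(id)
              ```

              This does not depend on the sort order or on gaps in the panel, and n_industries can be useful in its own right if the number of industries matters and not just whether there was a change.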



              • #8
                This thread has been very helpful as I have a similar problem.
                However, I want to specify the industry. Using the example above, I need a dummy that takes value 1 if the individual changed from industry3 to industry10. The order is also important, the person must have worked in industry3 in year1 and in industry10 in year2. My dataset is limited to two years. Any other combination should result in the dummy taking value 0.

                My approach has been:
                Code:
                by ID (industry), sort: gen var = (industry[1] ==3 & industry[2]==10)
                by ID (industry): replace var = var[1]
                But the dummy still only indicates whether there has been a change in industry, regardless of the type of industry.
                I am new to Stata, so any hint and ideas are much appreciated! Thanks!
                Last edited by Chris Meier; 17 Jun 2016, 05:33. Reason: typo



                • #9
                  The second statement is redundant, as the first statement supplies the same result for all observations for each identifier.

                  The problem is that your sort order should be different:

                  Code:
                   
                  bysort ID (year) : gen var = industry[1] == 3 & industry[2] == 10
                  Note that as you sorted on industry within ID, the dummy also picks up changes from 10 to 3 over years 1 to 2.
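
                  A small made-up example may make the difference between the two sort orders concrete. The variable names wrong and right are hypothetical and used only for illustration.

                  ```stata
                  clear
                  input ID year industry
                  1 1 3
                  1 2 10
                  2 1 10
                  2 2 3
                  end

                  * sorted on industry within ID: [1] is always the smaller code, whatever the year
                  bysort ID (industry) : gen wrong = industry[1] == 3 & industry[2] == 10
                  * sorted on year within ID: [1] is year 1 and [2] is year 2
                  bysort ID (year) : gen right = industry[1] == 3 & industry[2] == 10

                  list, sepby(ID)
                  ```

                  Here wrong is 1 for both identifiers, while right is 1 only for ID 1, which moved from 3 to 10 in that order.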



                  • #10
                    Thanks, Nick!
                    I understand what went wrong in terms of sorting and corrected it. I have also double checked the industry labels.

                    The weird thing is that my results are not consistent. For individuals who did not change industry (ID1 and ID2 who both remained in industry1, for instance), I get var=0 for ID1 and var=1 for ID2.
                    In addition, I do get var=1 for IDs that changed industry but for a random change and not the industry3 to industry10 I am looking for.

                    Any idea what the problem might be? I know this is hard to tell without seeing the data but I have no clue how to approach this.
                    Thank you!




                    • #11
                      Indeed, why are you not showing examples where you think this happens? I can't reproduce or understand that behaviour. Can you reproduce this?

                      Code:
                      clear 
                      input ID year industry 
                      1  1  3 
                      1  2  10 
                      2  1  10 
                      2  2  3 
                      3  1  42 
                      3  2  42 
                      end 
                      bysort ID (year) : gen var = industry[1] == 3 & industry[2] == 10
                      list, sepby(ID) 
                      
                           +----------------------------+
                           | ID   year   industry   var |
                           |----------------------------|
                        1. |  1      1          3     1 |
                        2. |  1      2         10     1 |
                           |----------------------------|
                        3. |  2      1         10     0 |
                        4. |  2      2          3     0 |
                           |----------------------------|
                        5. |  3      1         42     0 |
                        6. |  3      2         42     0 |
                           +----------------------------+



                      • #12
                        Thanks
                        No, when I try to run the bysort-command I get an error message saying 'factor variables and time-series operators not allowed'



                        • #13
                          Is this a guessing game?

                          You typed something incorrectly. Tell us exactly what you typed and we should be able to explain why it was wrong.

                          (Note that it's not clear why #11 solved #10. The point of asking questions in public is that threads inform others of things that can go wrong.)



                          • #14
                            Thanks, Nick
                            I hope I can describe things in a more comprehensible way now:

                            1) Using your code to replicate the basic example in #11, I accidentally typed

                            Code:
                            clear
                            input ID year industry
                            1 1 3
                            1 2 10
                            2 1 10
                            2 2 3
                            3 1 42
                            3 2 42
                            end
                            bysort ID(year): gen var=industry[1]==3 & industry[2]==10
                            list, sepby(ID)
                            instead of the correct code:
                            Code:
                            clear
                            input ID year industry
                            1 1 3
                            1 2 10
                            2 1 10
                            2 2 3
                            3 1 42
                            3 2 42
                            end
                            bysort ID (year): gen var=industry[1]==3 & industry[2]==10
                            list, sepby(ID)
                            the only difference being the missing space between ID and (year):
                            Code:
                            bysort ID(year): gen var=industry[1]==3 & industry[2]==10
                            This produced the error message 'factor variables and time-series operators not allowed'. With the space restored, I can now replicate #11.

                            2) Resuming the initial issue, this is what I would like to achieve: var=1 indicates that the individual changed from industry3 in year1 to industry10 in year2, otherwise var=0
                            In addition, I distinguish between male (sex==1) and female (sex==2).

                            Code:
                            clear
                            input ID year industry sex
                            1 1 3 1
                            1 2 10 1
                            2 1 10 1
                            2 2 3 1
                            3 1 42 1
                            3 2 42 1
                            4 1 42 1
                            4 2 42 1
                            5 1 42 2
                            5 2 42 2
                            6 1 3 1
                            6 2 10 1
                            
                            end
                            bysort ID (year): gen var=industry[1]==3 & industry[2]==10 if sex==1
                            list, sepby(ID)
                             
                                 +----------------------------------+
                                 | ID   year   industry   sex   var |
                                 |----------------------------------|
                              1. |  1      1          3     1     1 |
                              2. |  1      2         10     1     1 |
                                 |----------------------------------|
                              3. |  2      1         10     1     0 |
                              4. |  2      2          3     1     0 |
                                 |----------------------------------|
                              5. |  3      1         42     1     0 |
                              6. |  3      2         42     1     0 |
                                 |----------------------------------|
                              7. |  4      1         42     1     0 |
                              8. |  4      2         42     1     0 |
                                 |----------------------------------|
                              9. |  5      1         42     2     . |
                             10. |  5      2         42     2     . |
                                 |----------------------------------|
                             11. |  6      1          3     1     1 |
                             12. |  6      2         10     1     1 |
                                 +----------------------------------+
                            However, in my original dataset I get the following result:

                                 +----------------------------------+
                                 | ID   year   industry   sex   var |
                                 |----------------------------------|
                              1. |  1      1          3     1     1 |
                              2. |  1      2         10     1     1 |
                                 |----------------------------------|
                              3. |  2      1         10     1     0 |
                              4. |  2      2          3     1     0 |
                                 |----------------------------------|
                              5. |  3      1         42     1     0 |
                              6. |  3      2         42     1     0 |
                                 |----------------------------------|
                              7. |  4      1         42     1     1 |
                              8. |  4      2         42     1     1 |
                                 |----------------------------------|
                              9. |  5      1         42     2     . |
                             10. |  5      2         42     2     . |
                                 |----------------------------------|
                             11. |  6      1          3     1     0 |
                             12. |  6      2         10     1     0 |
                                 +----------------------------------+


                            ID1 and ID6 have the same characteristics and meet the criteria for var=1, but they produce different results for var.
                            ID3 and ID4 also have the same characteristics but do not meet the criteria for var=1, so var=0 would be correct. Still, they produce different results for var.
                            The good news is that there are no issues with the separation by sex.

                            Any ideas would be much appreciated! Thank you!

                            Last edited by Chris Meier; 17 Jun 2016, 11:47.



                            • #15
                              You've given more details, for which thanks, but the same principle applies.

                              The results given reproducibly (code followed by results) make perfect sense. The missing result for ID 5 is because sex is 2 and such observations were excluded from the calculation by your own code.

                              Otherwise, unless you give the exact code you used to produce the second set of results, we can't comment.
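
                              For anyone who runs into similar symptoms, a few checks may help narrow things down. This is only a sketch under assumptions: it supposes that each ID should have exactly one observation for year 1 and one for year 2, which may not hold in the real dataset. The data are made up and the name var2 is hypothetical.

                              ```stata
                              clear
                              input ID year industry sex
                              1 1 3 1
                              1 2 10 1
                              4 1 42 1
                              4 2 42 1
                              end

                              * each ID-year pair should identify one observation; isid stops with an error otherwise
                              isid ID year
                              * the subscripts [1] and [2] only mean "year 1" and "year 2" if every ID has two rows
                              bysort ID : assert _N == 2

                              * variant that checks the year values explicitly instead of relying on position alone
                              bysort ID (year) : gen byte var2 = industry[1] == 3 & year[1] == 1 ///
                                  & industry[2] == 10 & year[2] == 2 if sex == 1

                              list, sepby(ID)
                              ```

                              If isid or assert fails on the real data, then industry[1] and industry[2] are not necessarily the year 1 and year 2 values. That is one possible, unconfirmed, explanation for results that look inconsistent across identifiers.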

