Generating a dummy variable for if a variable changes over time periods

Jordan Smith

Join Date: Apr 2015

Posts: 7
#1

24 Apr 2015, 13:12

I am trying to generate a variable in panel data and I am having some issues. Could you tell me what the command is, or where I can find information about creating the variable?

I want to create a dummy variable that takes the value 0 if the industry an individual worked in did not change over the periods, and the value 1 if the individual worked in more than one industry over the time period.

Is there any advice you could give as to how to do this?

Industry is coded from 1-12 and the data has gaps.

Thank you in advance.
Tags: dummy, Generate, panel, panel data
Clyde Schechter

Join Date: Apr 2014

Posts: 29796
#2

24 Apr 2015, 13:20

When you say the data has gaps, it is unclear whether you mean that the panel itself has gaps, or whether you mean that even for observations included in the panel, sometimes the industry variable is missing. I'll assume the latter. That being the case, there is the question of what you want to do when there are missing values, as you really won't know if the industry changed in the years for which you lack information. I'm going to assume that you want to indicate whether the industry variable is constant or changes for those years it is available and you will ignore what might have happened in the years it is missing. I assume you have a variable identifying individuals, call it id.

Code:

gen byte industry_missing = missing(industry)
by id industry_missing (industry), sort: gen byte changed = (industry[1] != industry[_N]) if industry_missing == 0
by id (industry_missing industry): replace changed = changed[1]
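A minimal sketch of how the three commands above behave, run on made-up data (the values below are invented for illustration and are not the poster's dataset). Individual 1 has a missing year but a constant industry otherwise, individual 2 changes industry, and individual 3 has no industry information at all.

```stata
clear
input id period industry
1 1 8
1 2 .
1 3 8
2 1 4
2 2 .
2 3 9
3 1 .
3 2 .
end

* Flag observations where industry is missing
gen byte industry_missing = missing(industry)

* Within id, compare smallest and largest code among nonmissing observations only
by id industry_missing (industry), sort: gen byte changed = (industry[1] != industry[_N]) if industry_missing == 0

* Nonmissing observations sort first within id, so copy their result to all rows of that id
by id (industry_missing industry): replace changed = changed[1]

list, sepby(id)
```

Under these assumptions, changed is 0 for individual 1, 1 for individual 2, and missing for individual 3, because there is nothing to compare when every value of industry is missing.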
Roberto Ferrer

Join Date: Apr 2014

Posts: 449
#3

24 Apr 2015, 13:25

An example:

Code:

clear
set more off

input ///
id period indust
1 1 8
1 2 8
1 4 10
1 5 10
1 6 8
2 1 4
2 2 4
2 3 4
2 4 9
3 1 5
3 2 5
3 4 5
end

list, sepby(id)

*-----

bysort id (indust) : gen indicat = indust[1] != indust[_N]
list, sepby(id)

The strategy is simple. Sort the values of -indust- within each -id-. If the first and last observations are the same, the person has not changed industries.

Missing values for -indust- require more code.

See -help subscripting-, if necessary.

Last edited by Roberto Ferrer; 24 Apr 2015, 13:27.

You should:

1. Read the FAQ carefully.

2. "Say exactly what you typed and exactly what Stata typed (or did) in response. N.B. exactly!"

3. Describe your dataset. Use list to list data when you are doing so. Use input to type in your own dataset fragment that others can experiment with.

4. Use the advanced editing options to appropriately format quotes, data, code and Stata output. The advanced options can be toggled on/off using the A button in the top right corner of the text editor.
Jordan Smith

Join Date: Apr 2015

Posts: 7
#4

24 Apr 2015, 13:36

Thank you for your prompt response, Clyde. Sorry, this is my first time using this website and I have not been as precise as I should have been. The panel itself has gaps. There is no missing data for the industry variable. You are correct: I want to see whether the variable changes over time for respondents, and thus generate a variable coded 0 for respondents who did not change industry across the time period (so their industry code stays the same) and coded 1 for those who have different industry codes (1-12) across the time periods, i.e. they changed industry at least once. Thank you again.
Clyde Schechter

Join Date: Apr 2014

Posts: 29796
#5

24 Apr 2015, 14:08

OK. Well, the code I posted in #2 still works in the absence of missing values for industry, but, given that there are no missings it could be simplified:

Code:

by id (industry), sort: gen changed = (industry[1] != industry[_N])
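A minimal sketch of an equivalent check, assuming the same variable names id and industry and no missing values of industry: the codes differ within an individual exactly when the individual's minimum and maximum codes differ. The toy data and the names ind_min, ind_max, and changed2 are made up for illustration. Gaps in the panel (here, period 3 for individual 1 and period 2 for individual 3) do not matter, because the comparison never looks at the period.

```stata
clear
input id period industry
1 1 8
1 2 8
1 4 10
2 1 4
2 2 4
3 1 5
3 3 5
end

* Sort-based version, as in the code above
by id (industry), sort: gen changed = (industry[1] != industry[_N])

* egen-based version: compare the minimum and maximum code within each id
egen ind_min = min(industry), by(id)
egen ind_max = max(industry), by(id)
gen byte changed2 = (ind_min != ind_max)

list, sepby(id)
```

Both variables should agree on every observation; the egen version leaves the sort order untouched, at the cost of two extra variables.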
Jordan Smith

Join Date: Apr 2015

Posts: 7
#6

24 Apr 2015, 15:27

THANK YOU BOTH
Nick Cox

Join Date: Mar 2014

Posts: 35211
#7

24 Apr 2015, 18:57

Note also FAQ http://www.stata.com/support/faqs/da...ions-in-group/
Chris Meier

Join Date: Jun 2016

Posts: 9
#8

17 Jun 2016, 05:19

This thread has been very helpful as I have a similar problem.
However, I want to specify the industry. Using the example above, I need a dummy that takes value 1 if the individual changed from industry3 to industry10. The order is also important: the person must have worked in industry3 in year1 and in industry10 in year2. My dataset is limited to two years. Any other combination should result in the dummy taking value 0.

My approach has been:

Code:

by ID (industry), sort: gen var = (industry[1] ==3 & industry[2]==10)
by ID (industry): replace var = var[1]

But the dummy still only indicates whether there has been a change in industry, regardless of the type of industry.
I am new to Stata, so any hint and ideas are much appreciated! Thanks!

Last edited by Chris Meier; 17 Jun 2016, 05:33. Reason: typo
Nick Cox

Join Date: Mar 2014

Posts: 35211
#9

17 Jun 2016, 06:44

The second statement is redundant, as the first statement supplies the same result for all observations for each identifier.

The problem is that your sort order should be different:

Code:

bysort ID (year) : gen var = industry[1] == 3 & industry[2] == 10

Note that as you sorted on industry within ID, the dummy also picks up changes from 10 to 3 over years 1 to 2.
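A minimal sketch of that difference, on made-up data with one individual moving from 3 to 10 and another moving from 10 to 3 (the variable names sorted_ind and sorted_year are hypothetical):

```stata
clear
input ID year industry
1 1 3
1 2 10
2 1 10
2 2 3
end

* Sorted on industry within ID: industry[1] is the smaller code, whatever the year
bysort ID (industry) : gen byte sorted_ind = industry[1] == 3 & industry[2] == 10

* Sorted on year within ID: industry[1] is the year-1 code, industry[2] the year-2 code
bysort ID (year) : gen byte sorted_year = industry[1] == 3 & industry[2] == 10

list, sepby(ID)
```

With the industry sort, both individuals are flagged; with the year sort, only the individual who moved from 3 to 10 is.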
Chris Meier

Join Date: Jun 2016

Posts: 9
#10

17 Jun 2016, 08:00

Thanks, Nick!
I understand what went wrong in terms of sorting and corrected it. I have also double checked the industry labels.

The weird thing is that my results are not consistent. For individuals who did not change industry (ID1 and ID2 who both remained in industry1, for instance), I get var=0 for ID1 and var=1 for ID2.
In addition, I do get var=1 for IDs that changed industry but for a random change and not the industry3 to industry10 I am looking for.

Any idea what the problem might be? I know this is hard to tell without seeing the data but I have no clue how to approach this.
Thank you!

Nick Cox

Join Date: Mar 2014
Posts: 35211

#11

17 Jun 2016, 08:29

Indeed, why are you not showing examples where you think this happens? I can't reproduce or understand that behaviour. Can you reproduce this?

Code:

clear 
input ID year industry 
1  1  3 
1  2  10 
2  1  10 
2  2  3 
3  1  42 
3  2  42 
end 
bysort ID (year) : gen var = industry[1] == 3 & industry[2] == 10
list, sepby(ID) 

     +----------------------------+
     | ID   year   industry   var |
     |----------------------------|
  1. |  1      1          3     1 |
  2. |  1      2         10     1 |
     |----------------------------|
  3. |  2      1         10     0 |
  4. |  2      2          3     0 |
     |----------------------------|
  5. |  3      1         42     0 |
  6. |  3      2         42     0 |
     +----------------------------+


Chris Meier

Join Date: Jun 2016

Posts: 9
#12

17 Jun 2016, 09:26

Thanks
No, when I try to run the bysort-command I get an error message saying 'factor variables and time-series operators not allowed'
Nick Cox

Join Date: Mar 2014

Posts: 35211
#13

17 Jun 2016, 09:33

Is this a guessing game?

You typed something incorrectly. Tell us exactly what you typed and we should be able to explain why it was wrong.

(Note that it's not clear why #11 solved #10. The point of asking questions in public is that threads inform others of things that can go wrong.)
Chris Meier

Join Date: Jun 2016

Posts: 9
#14

17 Jun 2016, 11:23

Thanks, Nick
I hope I can describe things in a more comprehensible way now:

1) Using your code to replicate the basic example in #11, I accidentally typed

Code:

clear
input ID year industry
1 1 3
1 2 10
2 1 10
2 2 3
3 1 42
3 2 42
end
bysort ID(year): gen var=industry[1]==3 & industry[2]==10
list, sepby(ID)

instead of the correct code:

Code:

clear
input ID year industry
1 1 3
1 2 10
2 1 10
2 2 3
3 1 42
3 2 42
end
bysort ID (year): gen var=industry[1]==3 & industry[2]==10
list, sepby(ID)

The difference is the missing [space] between ID and (year) in the first version:

Code:

bysort ID(year): gen var=industry[1]==3 & industry[2]==10

This produced the error message 'factor variables and time-series operators not allowed'. So I could replicate #11 now.

2) Resuming the initial issue, this is what I would like to achieve: var=1 indicates that the individual changed from industry3 in year1 to industry10 in year2, otherwise var=0
In addition, I distinguish between male (sex==1) and female (sex==2).

Code:

clear
input ID year industry sex
1 1 3 1
1 2 10 1
2 1 10 1
2 2 3 1
3 1 42 1
3 2 42 1
4 1 42 1
4 2 42 1
5 1 42 2
5 2 42 2
6 1 3 1
6 2 10 1
end
bysort ID (year): gen var=industry[1]==3 & industry[2]==10 if sex==1
list, sepby(ID)

     +----------------------------------+
     | ID   year   industry   sex   var |
     |----------------------------------|
  1. |  1      1          3     1     1 |
  2. |  1      2         10     1     1 |
     |----------------------------------|
  3. |  2      1         10     1     0 |
  4. |  2      2          3     1     0 |
     |----------------------------------|
  5. |  3      1         42     1     0 |
  6. |  3      2         42     1     0 |
     |----------------------------------|
  7. |  4      1         42     1     0 |
  8. |  4      2         42     1     0 |
     |----------------------------------|
  9. |  5      1         42     2     . |
 10. |  5      2         42     2     . |
     |----------------------------------|
 11. |  6      1          3     1     1 |
 12. |  6      2         10     1     1 |
     +----------------------------------+

However, in my original dataset I get the following result:

     +----------------------------------+
     | ID   year   industry   sex   var |
     |----------------------------------|
  1. |  1      1          3     1     1 |
  2. |  1      2         10     1     1 |
     |----------------------------------|
  3. |  2      1         10     1     0 |
  4. |  2      2          3     1     0 |
     |----------------------------------|
  5. |  3      1         42     1     0 |
  6. |  3      2         42     1     0 |
     |----------------------------------|
  7. |  4      1         42     1     1 |
  8. |  4      2         42     1     1 |
     |----------------------------------|
  9. |  5      1         42     2     . |
 10. |  5      2         42     2     . |
     |----------------------------------|
 11. |  6      1          3     1     0 |
 12. |  6      2         10     1     0 |
     +----------------------------------+

ID1 and ID6 have the same characteristics and meet the criteria for var=1, but they produce different results for var.
ID3 and ID4 also have the same characteristics but do not meet the criteria for var=1, so var=0 would be correct. Still, they produce different results for var.
The good news is there are no issues with the separation by sex.

Any ideas would be much appreciated! Thank you!

Last edited by Chris Meier; 17 Jun 2016, 11:47.
Nick Cox

Join Date: Mar 2014

Posts: 35211
#15

17 Jun 2016, 12:53

You've given more details, for which thanks, but the same principle applies.

The results given reproducibly (code followed by results) make perfect sense. The missing result for ID 5 is because sex is 2 and such observations were excluded from the calculation by your own code.

Otherwise, unless you give the exact code you used to produce the second set of results, we can't comment.
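A minimal sketch of the point about ID 5, on made-up data: with an if qualifier, excluded observations are left missing, whereas putting the same condition inside the expression gives them 0. Which result is wanted depends on the analysis; var_if and var_and are hypothetical names.

```stata
clear
input ID year industry sex
1 1 3 1
1 2 10 1
5 1 42 2
5 2 42 2
end

* if qualifier: observations with sex != 1 are skipped, so the new variable stays missing
bysort ID (year): gen var_if = industry[1]==3 & industry[2]==10 if sex==1

* condition inside the expression: observations with sex != 1 are evaluated and get 0
bysort ID (year): gen var_and = industry[1]==3 & industry[2]==10 & sex==1

list, sepby(ID)
```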
