Creating dummy by year

Tharshini Thangavelu

Join Date: Oct 2015

Posts: 78
#1

Creating dummy by year

17 Mar 2017, 09:37

Hi all,

I have a time serie dataset, yearly between 1993-2010 for 7 districts. Here is an extract for one of the seven districts:

year dist sumyrwrain meanyrwrain stdyrwrain sumlr meanlr shock
1993 Kadapa 86975.36 517.7104 833.2421 1449708 80539.31 0
1994 Kadapa 103504.3 616.0969 932.221 1449708 80539.31 0
1995 Kadapa 73511.12 437.5662 746.5975 1449708 80539.31 1
1996 Kadapa 104767.3 623.6147 1000.061 1449708 80539.31 0
1997 Kadapa 77549.91 461.6066 803.1562 1449708 80539.31 1
1998 Kadapa 82867.43 493.2585 869.8391 1449708 80539.31 0
1999 Kadapa 65157.68 387.8433 669.4527 1449708 80539.31 1
2000 Kadapa 78883.34 469.5437 790.1378 1449708 80539.31 1
2001 Kadapa 81103.82 482.7609 864.0387 1449708 80539.31 0
2002 Kadapa 55331.86 329.3563 597.1664 1449708 80539.31 1
2003 Kadapa 41383.3 246.3291 474.8039 1449708 80539.31 1
2004 Kadapa 75030.71 446.6114 816.4559 1449708 80539.31 1
2005 Kadapa 110613.6 658.4145 1119.453 1449708 80539.31 0
2006 Kadapa 69959.75 416.4271 680.7109 1449708 80539.31 1
2007 Kadapa 87482.21 520.7274 850.8123 1449708 80539.31 0
2008 Kadapa 75579.64 449.8788 809.5716 1449708 80539.31 1
2009 Kadapa 75579.64 449.8788 809.5716 1449708 80539.31 1
2010 Kadapa 104426.7 621.5875 951.5563 1449708 80539.31 0

I have two questions:

1. The above dataset is created by collapse command to extract total sum, mean and std for each district, yearly.
However, as you can see, the meanyrwrain and stdyrwrain is not correct. Why so?

2. I would like to create a dummy for each year between 2002-2006 to indicate whether ex: in year 2002, the district Kadapa experienced a shock or not.
What is the most efficient command ?

Thanks !!
Tags: None
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17673
#2

17 Mar 2017, 09:46

Tharshini:
-I can't say why the variabes you mentioned does not seem correct; besides, you do not include what you typed to -collapse- your original dataset;
- if you plan to create year dummies, take a look at -fvvarlist-:

Code:

i.year

will do it for you.

Kind regards,
Carlo
(Stata 19.0)
Comment
Tharshini Thangavelu

Join Date: Oct 2015

Posts: 78
#3

17 Mar 2017, 16:22

The code that I was using is the following:

collapse(sum) sumyrwrain=wrain (mean) meanyrwrain=wrain (sd)stdyrwrain=wrain, by(year dist)

I was looking for something else than i.year, where I actually see each dummy as independent variable for each year.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 29956
#4

17 Mar 2017, 17:51

Code:

. clear* . sysuse auto (1978 Automobile Data) . regress price mpg i.rep78 Source | SS df MS Number of obs = 69 -------------+---------------------------------- F(5, 63) = 4.39 Model | 149020603 5 29804120.7 Prob > F = 0.0017 Residual | 427776355 63 6790100.88 R-squared = 0.2584 -------------+---------------------------------- Adj R-squared = 0.1995 Total | 576796959 68 8482308.22 Root MSE = 2605.8 ------------------------------------------------------------------------------ price | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------- mpg | -280.2615 61.57666 -4.55 0.000 -403.3126 -157.2103 | rep78 | 2 | 877.6347 2063.285 0.43 0.672 -3245.51 5000.78 3 | 1425.657 1905.438 0.75 0.457 -2382.057 5233.371 4 | 1693.841 1942.669 0.87 0.387 -2188.274 5575.956 5 | 3131.982 2041.049 1.53 0.130 -946.7282 7210.693 | _cons | 10449.99 2251.041 4.64 0.000 5951.646 14948.34 ------------------------------------------------------------------------------

As you can see, regression output with factor variable notation gives you the detailed results for each level of the variable. There is no need to generate separate "dummy" variables for use as independent variables. It's just a waste of time and space. No only that, if you generate your own indicator variables for a regression, you won't be able to use the -margins- command later, as -margins- only understands factor-variable notation.

There are a small number of estimation commands in Stata that do not allow factor-variable notation, but most of those are archaic and have been superseded by newer commands that provide the same functionality and do support factor-variable notation. There are also a few other places where factor-variable notation is not supported--but these are unusual. For example, if you need to model random slopes of the levels of a categorical variable, you need your own indicator variables for that.

If there is some other use you have for these indicators, you can create them as follows:

Code:

tab year, gen(y)

This will generate variables y1, y2, ... that indicate the separate years. Each will come labeled to show which year it indicates.

Your -collapse- command in #3 looks correct. Why do you think that meanyrwrain and stdyrwrain in #1 are incorrect? I don't see anything wrong with them.
2 likes
Comment
Tharshini Thangavelu

Join Date: Oct 2015

Posts: 78
#5

19 Mar 2017, 13:58

The command i.year is a neat and clear command, however, I was aiming for a code which differentiate the years since I would like to control for yearly drought shock that happened during years 02,03,04,05 and using i.year gives the whole time line that includes year that is 1993-2010. Perhaps one can do a shortcut from i.year command to only control for certain years in the whole timeline.

In my regression I have dummy variable for each year that I wish to control for but I believe this gives me bug in the output, since all year dummy variables are omitted. Can this be due to the way I introduce the dummies into my regression or because my dependent variable is recoded in year 2006 and independent variables are recoded in year 2002, but I wish to control whether drought shock (binary variable) that occurred between 2002-2006 had an impact on dependent variable.

global xlist y10 y11 y12 y13 y14 i.sext ageyrt i.hhsizet i.ill i.chcast i.chlangt i.carerelt i.caredut i.regiont i.site wealth
reg aspiration $xlist, r

where y10......y14 is created by tab year, gen(y).

Your -collapse- command in #3 looks correct. Why do you think that meanyrwrain and stdyrwrain in #1 are incorrect? I don't see anything wrong with them.

Well, if you study the attached table in message #1, sumyrwrain is basically sum over months for year 1993, hence the mean should be just divided by 12: sumyrwrain/12 = 7247 which is different from 517.7104. Or I am completely off here ??
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 29956
#6

19 Mar 2017, 15:44

There is nothing inherently wrong with the code you are showing in #5. (The use of a global macro is inadvisable, but not causing this problem.) Similarly, the -collapse- command in #3 is not obviously wrong in any way. So there must be something about your data--it is not what you think it is. In particular, I notice that in every observation you show in #1 the ratio of sumyrwrain/meanyrwrain instead of being the expected 12 is, in each case, to within very small rounding error 168. The number 168 is the number of hours in a week. I wonder if instead of working with monthly data you had hourly data and collapsed it over weeks.

I do not think we can make any progress without a good example of the data you are working with. It should include all of the variables mentioned in your commands and it should include observations from all years and several districts. (The data as it was before the -collapse- command is what you need to show.)

Please do this using the -dataex- command. Run -ssc install dataex- to install it. Run -help dataex- and read the simple instructions for using it.
Comment

Tharshini Thangavelu

Join Date: Oct 2015
Posts: 78

20 Mar 2017, 03:24

The dataset in message #1 is monthly rainfall data for years 1993-2010, for each grids. In total I have 71 grids for 7 districts. The rainfall data is grid data, which means that each districts have different number of grids; a larger districts will evidently have more grids than a smaller districts. I used the collapse command so that I go from monthly data for each grid to yearly data for each districts which is what you see in message #1 for one of the districts. This implies that I aggregate monthly data into yearly and grids into districts.

As a 2end step, I am merging the collapsed dataset (yearly) with my main dataset which is a panel dataset where I follow individuals overtime with two rounds; year 2002 and 2006.

As a 3rd step, analyzing the merged data, I want to plug in dummies for each years between 2002-2005 in the regression but of course, as I pointed out in the latest message, the year dummy are omitted. I was wondering that lagging the year dummies to year 2006 because I don't have any information in the main dataset for year 2003-2005 ?

Here is an extraction of the rainfall data: the first 12 months for grid==1 (in total this district have 8 grids), district==4

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input float(grid_id year mnth) long dist double(longitude latitude Area_sqm Area_sqkm Share rain) float wrain
1 1993  1 4 76.75 13.75 26705216.7123 26.7052167122 .138897807754     0         0
1 1993  2 4 76.75 13.75 26705216.7123 26.7052167122 .138897807754     0         0
1 1993  3 4 76.75 13.75 26705216.7123 26.7052167122 .138897807754   9.8 1.3611985
1 1993  4 4 76.75 13.75 26705216.7123 26.7052167122 .138897807754    18 2.5001605
1 1993  5 4 76.75 13.75 26705216.7123 26.7052167122 .138897807754  11.9  1.652884
1 1993  6 4 76.75 13.75 26705216.7123 26.7052167122 .138897807754  23.9 3.3196576
1 1993  7 4 76.75 13.75 26705216.7123 26.7052167122 .138897807754  40.6  5.639251
1 1993  8 4 76.75 13.75 26705216.7123 26.7052167122 .138897807754  21.6 3.0001926
1 1993  9 4 76.75 13.75 26705216.7123 26.7052167122 .138897807754 149.4 20.751333
1 1993 10 4 76.75 13.75 26705216.7123 26.7052167122 .138897807754 127.6  17.72336
1 1993 11 4 76.75 13.75 26705216.7123 26.7052167122 .138897807754  10.4  1.444537
1 1993 12 4 76.75 13.75 26705216.7123 26.7052167122 .138897807754  59.7  8.292199
end
label values dist di
label def di 4 "Anantapur", modify

In total the rainfall data have 71 grids x 18 years x 12 months = 15 336 observations, when I collapse into yearly data by district I have 7 districts x 18 years =126 observations as you can see from the attached table here below. This is how it should be, however, the result, as it was mentioned in the previous tread seem to be suspicious.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input float year long dist double sumyrwrain float(meanyrwrain stdyrwrain)
1993 1  91821.81578540802  1275.303 1922.2588
1993 2  91468.26395041123  1088.908  1884.729
1993 3  86975.36002779007 517.71045  833.2421
1993 4  69254.00849664211   360.698 524.51385
1993 5  90234.15876418352  683.5921 1163.0334
1993 6  75424.47998719476  419.0249  794.2458
1993 7 63258.077629089355  2635.753  4081.613
end
label values dist di
label def di 1 "West Godavari", modify
label def di 2 "Srikakulam", modify
label def di 3 "Kadapa", modify
label def di 4 "Anantapur", modify
label def di 5 "Karimnagar", modify
label def di 6 "Mababubnagar", modify
label def di 7 "Hyderabad", modify

Last edited by Tharshini Thangavelu; 20 Mar 2017, 04:01.

Comment

Clyde Schechter

Join Date: Apr 2014

Posts: 29956
#8

20 Mar 2017, 09:06

OK, now I see the problem with your -collapse- command.

Code:

collapse(sum) sumyrwrain=wrain (mean) meanyrwrain=wrain (sd)stdyrwrain=wrain, by(year dist)

does not work as you expect because each district contains multiple grids, which have separate observations in your data. So the number of observations per year-district is not 12, it is 12 X the number of grids in the district.

So what you need to do is collapse in two steps. First you need to collapse the multiple grids within district. Not knowing exactly how these grids work and what they mean, I don't know if you should be averaging them, or adding them, or doing something else. But whatever it is, you need to aggregate up the grid-level data to the district level within each month. Then you can -collapse- that data -by(year dist)- to get appropriate yearly averages and totals.

With regard to the issue of the year indicators ("dummy" variables), you have correctly identified your problem. You have no observations for 2003-2005 in your "main" dataset. So the only observations that enter the regression analysis as those for 2002 and 2006. So the indicators for 2003-2005 must drop out because they are always zero. That leaves you with only a single indicator for 2002. Lagging the indicators will not help you: it will only change which sold indicator survives. The problem is due to the lack of information from 2003 through 2005, and the only way to resolve that is to get information for those years. If you have no information from years 2003 through 2005, there is no regression model or statistical procedure that will enable you to estimate effects in those years!
Comment
Tharshini Thangavelu

Join Date: Oct 2015

Posts: 78
#9

20 Mar 2017, 10:21

If you have no information from years 2003 through 2005, there is no regression model or statistical procedure that will enable you to estimate effects in those years!

I agree, yet, there must be possible to see whether shock that occurred in t-1 have an impact in t, although there is lack of information.

To explain; the rainfall data merged with the panel data where I follow children across rounds. The rainfall is used to construct instrument variables (IV) to check whether weather shocks before round 1 (i.e before year 2002) and between the two rounds (2002-2006) have any impact on children.

This implies, that I need to construct the IV's in such a way that information of whether there has been any shock or not before round 1 and between round 1 and 2 is transferred into children in round 2. This step I have hard time to visualize how I should construct the dataset before analyzing.
Comment
laura hman

Join Date: Jun 2021

Posts: 9
#10

21 Jun 2021, 04:47

Originally posted by Clyde Schechter View Post

[code]

Code:

tab year, gen(y)

This will generate variables y1, y2, ... that indicate the separate years. Each will come labeled to show which year it indicates.

Dear Cody, I am wondering whether it is possible to use the code "tab year, gen (y)" in a way that the following year indicator would actually be named after the year?
In the example data of Tharshini Thangavelu, the generated year indicator would have to be called year2002 year2003, year2004, year 2005 and year2006. Do you know how to alter the code so that the emerging indicator variables are not called y1, y2 etc?
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35432
#11

21 Jun 2021, 05:22

#10 Consider the following example

Code:

webuse grunfeld tab year tab year. gen(year) rename year# year#, renumber(1935)

which renames year1 to year20 as year1935 to year1954. In your case, 1935 will presumably be 2002.

That said, factor variable notation supersedes most of these devices.
2 likes
Comment

Announcement