dummy variable conditional not working for all panel IDs

Adam Klaas

Join Date: Jan 2022

Posts: 56
#1

21 Jan 2022, 15:07

Dear Stata users,

I have a panel data set of the municipalities in my country for 2000-2020. Three variables record the percentage of land in each use: Built_area, Agricultural_area (Agri_area in the commands below) and NaturalForest_area. I want to create three indicator variables, low_dev, med_dev and high_dev, so that each municipality falls into one of these categories. The problem is that some of my municipalities are not captured by any of the three. I suspect that the ranges are not set properly; I hope someone can point out the issue and suggest a solution.

These are my commands:
gen med_dev=1 if (Built_area<=32.71& Built_area>10.13 & Agri_area>33.03 & Agri_area<56.32 & NaturalForest_area<15.18) & !missing(Built_area,Agri_area,NaturalForest_area)

gen high_dev=1 if (Built_area>32.71 & Agri_area<=33.03 & NaturalForest_area<11.11) & !missing(Built_area, Agri_area, NaturalForest_area)

gen low_dev=1 if (Built_area<10.13 & Agri_area>=56.32 & NaturalForest_area>=15.18) & !missing(Built_area,Agri_area,NaturalForest_area)

Thank you to everyone for taking the time to read this; I hope you can help me find a solution.

Br,

Adam
Mike Lacy

Join Date: Apr 2014

Posts: 2416
#2

21 Jan 2022, 15:43

You are using the "and" (&) operator where I believe you want the "or" (|) operator. For example

Code:

.... (Built_area<=32.71& Built_area>10.13...

defines a condition that can never be true, as a value cannot be <= 32.71 and also > 10.13. You presumably mean "or". Also, unless you are very familiar with the order in which logical operators are evaluated in Stata, I'd suggest you use parentheses to be sure that your code does what you *mean*, rather than rely on what your code might mean in its natural language equivalent. I'd be almost sure that Stata will evaluate what you have written in a way different from what you think it does.

You'd also get more easily written and read code if you used Stata's -inrange()- function, e.g.

Code:

.... if !inrange(Built_area, 32.71,10.13) // see -help inrange()-

Finally, you are defining variables that will be 1 and missing, not 1 and 0.
See https://journals.sagepub.com/doi/10....36867X19830921 for a helpful explanation.
Nick Cox

Join Date: Mar 2014

Posts: 35696
#3

21 Jan 2022, 20:23

Mike Lacy makes excellent points. There is a typo in his code as he meant to write

Code:

!inrange(Built_area, 10.13, 32.71)
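
A small sketch of why the argument order matters, using made-up values for Built_area (an assumption, not the poster's data). inrange(z, a, b) is true when a <= z <= b, so with the limits reversed no value can be in range and the negation is true for every observation.

```stata
clear
input double Built_area
5
25
40
end

* limits reversed: no value satisfies 32.71 <= z <= 10.13
gen byte reversed = !inrange(Built_area, 32.71, 10.13)

* limits in ascending order: flags values outside [10.13, 32.71]
gen byte outside = !inrange(Built_area, 10.13, 32.71)

list
```

Here reversed is 1 on every row, while outside is 1 only for the values 5 and 40.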

I have a further reaction. How are these indicators (you say dummies) going to help analysis? You have measured predictors, which may or may not be helpful, but degrading them to indicators isn't going to add information or produce a clearer model without some really good rationale for the limits.
Bruce Weaver

Join Date: May 2014

Posts: 1132
#4

22 Jan 2022, 07:11

Maybe I need another cup of (strong) coffee, but I do not see how this code...

Code:

.... (Built_area<=32.71& Built_area>10.13...

defines a condition that can never occur. The condition is met if Built_area = 25, for example, is it not? 25 <= 32.71 and 25 > 10.13.

As far as I can tell, all 3 of Adam's -generate- commands create indicators for possible combinations of the variables. But if he wants indicators, he should (IMO) just set the variable = to the expression on the right rather than setting the variable = 1 if the expression is true. That will give him indicators with 0 in place of missing. E.g.,

Code:

clear *
input Built_area Agri_area NaturalForest_area
25 35 15
33 33 10
10 57 16
end

gen med_dev=1 if (Built_area<=32.71& Built_area>10.13 & Agri_area>33.03 & Agri_area<56.32 & NaturalForest_area<15.18) & !missing(Built_area,Agri_area,NaturalForest_area)
gen high_dev=1 if (Built_area>32.71 & Agri_area<=33.03 & NaturalForest_area<11.11) & !missing(Built_area, Agri_area, NaturalForest_area)
gen low_dev=1 if (Built_area<10.13 & Agri_area>=56.32 & NaturalForest_area>=15.18) & !missing(Built_area,Agri_area,NaturalForest_area)
list

replace med_dev= (Built_area<=32.71& Built_area>10.13 & Agri_area>33.03 & Agri_area<56.32 & NaturalForest_area<15.18) & !missing(Built_area,Agri_area,NaturalForest_area)
replace high_dev= (Built_area>32.71 & Agri_area<=33.03 & NaturalForest_area<11.11) & !missing(Built_area, Agri_area, NaturalForest_area)
replace low_dev= (Built_area<10.13 & Agri_area>=56.32 & NaturalForest_area>=15.18) & !missing(Built_area,Agri_area,NaturalForest_area)
list

Regarding inrange(): using it could go wrong here, given that in some cases there is a strict < sign rather than <=. I.e., if I understand how inrange() works,

Code:

inrange(Built_area,10.13,32.71) = Built_area >= 10.13 & Built_area <= 32.71

...and the original code shows the condition as Built_area > 10.13.
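
The difference at the boundary can be checked directly. This is a minimal sketch with made-up values; the names in_closed and in_halfopen are hypothetical and only there for illustration.

```stata
clear
input double Built_area
10.13
25
32.71
40
end

* inrange() includes both end points: 10.13 <= z <= 32.71
gen byte in_closed = inrange(Built_area, 10.13, 32.71)

* the condition in the original post excludes the lower end point
gen byte in_halfopen = Built_area > 10.13 & Built_area <= 32.71

list
```

The two indicators differ only on the row where Built_area is exactly 10.13. The variable is read as double here so that the comparison with the literal 10.13 is exact; with a float variable, comparing against float(10.13) may be the safer choice.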

Having said all that, I think Nick's "further reaction" in #3 is spot on and needs to be addressed.

Now I'll go get that coffee and wait for someone to tell me where I went off the rails.

--
Bruce Weaver
Version: Stata/MP 19.5 (Windows)
Adam Klaas

Join Date: Jan 2022

Posts: 56
#5

22 Jan 2022, 12:59

Hello Sir, I use these indicators to define subsamples and then run a panel regression on each, giving three models: one for highly developed municipalities, one for medium-developed municipalities and one for low-developed municipalities. My thesis is about the impact of supply constraints on the real estate market. I use a model somewhat similar to the one in Hilber & Vermeulen (2016).
Adam Klaas

Join Date: Jan 2022

Posts: 56
#6

22 Jan 2022, 13:02

Thank you for your reply. Even with the inrange() function it still didn't work out for me.
gen med_dev=1 if (!inrange (Built_area,32.71,10.13) & (!inrange (Agri_area,33.03,56.32))& NaturalForest_area<15.18) & !missing(Built_area,Agri_area,NaturalForest_area)
This is what I have been using, but some municipalities don't get a 1 for any of the three dummies.
Nick Cox

Join Date: Mar 2014

Posts: 35696
#7

22 Jan 2022, 13:23

Thanks for your comment in #5 which ignores my typo correction to Mike’s code. I am not familiar with the paper you allude to — please note our FAQ request to avoid minimal references — but from the sound of it I would have the same reaction to their analysis.
Mike Lacy

Join Date: Apr 2014

Posts: 2416
#8

22 Jan 2022, 15:40

Bruce and Nick were (of course) right about my mistakes; sorry for any further confusion I introduced.
Nick Cox

Join Date: Mar 2014
Posts: 35696
#9

23 Jan 2022, 05:48

You cross-posted this at https://www.reddit.com/r/stata/comme...s_in_the_same/ The people at Reddit should surely want to know about this thread here.

Please read the FAQ Advice, as every new message prompt requests that you do, specifically https://www.statalist.org/forums/help#crossposting

Not telling people about cross-posting has enormous potential to waste people's time and erode their good will. You shouldn't (want to) do that.

Thinking that a thread in one place is not giving you the answers you want is one thing, but it's still common courtesy to give a cross-reference if you try elsewhere.

Backing up here, I see several distinct issues. This echoes excellent points made by Mike Lacy and Bruce Weaver without, I hope, perpetuating typos.

1. Your code generates indicators (you say dummies) that are 1 or missing. Such indicators are useless for analysis as Stata will omit observations that have missing values from most statistical calculations.

2. Your approach creates indicator variables that have little or no obvious rationale. Needing to emulate a published analysis because a teacher tells you to would be one thing. Applying indicator variables when there is no rationale and you already have measurements is a waste of information. Even if your cut-offs have some statistical rule behind them, such as being tertiles or based on mean +/- k SD, that doesn't impart any substantive meaning. I am a geographer and have worked with land use data, but this is not elite knowledge. It's just a consequence of general knowledge to appreciate that whether the built-up area is above or below 10.13% is not a threshold that has any scientific or practical meaning. Same with your other cut-offs. I have made this point twice already and won't make it again.

I make the following guesses about your data.

3. I assume that no area can be in two or more land use categories at the same time, so the total land use is 100%. I don't assume that there aren't other land use categories, so the three shares here may sum to less than 100%.

4. Lacking a data example I made a synthetic dataset. The code below may help (a) to show technique (b) to throw light on whatever you are not understanding about your results.

Specifically note that groups is from the Stata Journal and must be installed before it can be used. It can be helpful when checking for cross-combinations of three or more variables.

4a. It may be worth creating indicators that are 1 or 0 for low, medium and high. In doing that I note that you've not been consistent about inequalities.

4b. As a cross-check I create categorical variables for low, medium and high on each named category.

4c. Finally I create indicators that I think are more or less what you are looking for. The point is that the indicators you think you need should best be defined in terms of something simpler.

5. I note that land use data can be awkward because of skewness and outliers and because zeros can be natural (thereby ruling out logarithmic transformations). But if a predictor appears awkward or there are indications of nonlinearity a square root or square transformation can sometimes help.

Code:

clear 
set obs 21 
gen Built = 5 * (_n - 1)
clonevar Agri = Built 
clonevar Forest = Built 
fillin Agri Built Forest 
drop if (Agri + Built + Forest) > 100 
drop _fillin 

* L Low M medium H high 
* using a convention that upper limits are included: the original post mixes conventions 
gen Built_L = Built <= 10.13 if Built < . 
gen Built_M = Built > 10.13 & Built <= 32.71 if Built < . 
gen Built_H = Built > 32.71 if Built < . 
gen Agri_L = Agri <= 33.03 if Agri < . 
gen Agri_M = Agri > 33.03 & Agri <= 56.32 if Agri < . 
gen Agri_H = Agri > 56.32 if Agri < . 
gen Forest_L = Forest <= 11.11 if Forest < . 
gen Forest_M = Forest > 11.11 & Forest <= 15.18 if Forest < . 
gen Forest_H = Forest > 15.18 if Forest < . 
gen Built_cat = cond(Built_L, 1, cond(Built_M, 2, 3)) if Built < . 
gen Agri_cat = cond(Agri_L, 1, cond(Agri_M, 2, 3)) if Agri < . 
gen Forest_cat = cond(Forest_L, 1, cond(Forest_M, 2, 3)) if Forest < . 

groups *_cat  

gen low_dev = Built_L & Agri_H & Forest_H 
gen med_dev = Built_M & Agri_M & (Forest_L | Forest_M)
gen high_dev = Built_H & Agri_L & Forest_L 

groups *_dev

The results here are for the synthetic dataset and clearly will differ from those for your real data.

Code:

. groups *_cat  

  +--------------------------------------------------+
  | Built_~t   Agri_cat   Forest~t   Freq.   Percent |
  |--------------------------------------------------|
  |        1          1          1      63      3.56 |
  |        1          1          2      21      1.19 |
  |        1          1          3     273     15.42 |
  |        1          2          1      45      2.54 |
  |        1          2          2      15      0.85 |
  |--------------------------------------------------|
  |        1          2          3     105      5.93 |
  |        1          3          1      63      3.56 |
  |        1          3          2      15      0.85 |
  |        1          3          3      31      1.75 |
  |        2          1          1      84      4.74 |
  |--------------------------------------------------|
  |        2          1          2      28      1.58 |
  |        2          1          3     266     15.02 |
  |        2          2          1      60      3.39 |
  |        2          2          2      20      1.13 |
  |        2          2          3      70      3.95 |
  |--------------------------------------------------|
  |        2          3          1      42      2.37 |
  |        2          3          2       6      0.34 |
  |        2          3          3       4      0.23 |
  |        3          1          1     210     11.86 |
  |        3          1          2      56      3.16 |
  |--------------------------------------------------|
  |        3          1          3     210     11.86 |
  |        3          2          1      60      3.39 |
  |        3          2          2      10      0.56 |
  |        3          2          3      10      0.56 |
  |        3          3          1       4      0.23 |
  +--------------------------------------------------+

.
.
. groups *_dev

  +------------------------------------------------+
  | low_dev   med_dev   high_dev   Freq.   Percent |
  |------------------------------------------------|
  |       0         0          0    1450     81.87 |
  |       0         0          1     210     11.86 |
  |       0         1          0      80      4.52 |
  |       1         0          0      31      1.75 |
  +------------------------------------------------+
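
One way to see which observations end up in none of the three groups, sketched with four made-up rows and the same convention that upper limits are included. The helper name n_cat is hypothetical. The last row mixes a low built share with a medium agricultural share, so no indicator picks it up.

```stata
clear
input Built Agri Forest
 5 60 20
25 40 12
40 30 10
 5 40 30
end

* three indicators, each requiring a specific combination of levels
gen byte low_dev  = Built <= 10.13 & Agri > 56.32 & Forest > 15.18
gen byte med_dev  = Built > 10.13 & Built <= 32.71 & Agri > 33.03 & Agri <= 56.32 & Forest <= 15.18
gen byte high_dev = Built > 32.71 & Agri <= 33.03 & Forest <= 11.11

* number of categories each observation falls into (no missing values here)
gen byte n_cat = low_dev + med_dev + high_dev

* observations captured by none of the three definitions
list if n_cat == 0
```

Only the fourth row is listed. With real data, the same check may show that the unclassified municipalities are simply those whose three land-use levels do not match any of the three required combinations.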

Last edited by Nick Cox; 23 Jan 2022, 05:58.


Nick Cox

Join Date: Mar 2014

Posts: 35696
#10

23 Jan 2022, 07:13

Better technique if missing values are present.

Code:

* OK flags observations with all three land-use shares present
gen OK = !missing(Built, Agri, Forest)
gen low_dev = Built_L & Agri_H & Forest_H if OK

with similar code for the other indicators.
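
A minimal sketch of why the guard matters, assuming OK is meant to flag observations where all three land-use shares are present. The data are made up and low_unguarded is a hypothetical name. In a logical expression Stata treats a missing value as true, so an unguarded indicator can come out as 1 when an input is missing.

```stata
clear
input Built Agri Forest
5 60 20
. 60 20
end

gen Built_L  = Built <= 10.13 if Built < .
gen Agri_H   = Agri > 56.32   if Agri < .
gen Forest_H = Forest > 15.18 if Forest < .

* without a guard, the missing Built_L in row 2 counts as true
gen low_unguarded = Built_L & Agri_H & Forest_H

* with the guard, row 2 is left missing instead
gen OK = !missing(Built, Agri, Forest)
gen low_dev = Built_L & Agri_H & Forest_H if OK

list
```

In the second row low_unguarded is 1 while low_dev is missing.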
Adam Klaas

Join Date: Jan 2022

Posts: 56
#11

23 Jan 2022, 13:18

Thank you very much, Dr. Nick Cox; I appreciate the effort. I am sorry that I cross-posted on Reddit; I should have informed them about this thread. I'll try the code you produced and see whether it works for my data with some adjustments. I forgot to say that only the year 2015 has values for Built_area, Agri_area and NaturalForest_area for each municipality, so I assume that land use is fairly constant over time. Therefore, if a municipality has a 1 for one of the categories (low, med, high) in 2015, it must have a 1 for that category in all other years as well.
Nick Cox

Join Date: Mar 2014

Posts: 35696
#12

23 Jan 2022, 14:46

Needing or wanting to spread values from 2015 to other years is what it is, and doesn't undermine anything else discussed so far.
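
One common way to spread a single year's value to all years of a panel is egen with max() and cond(). This is a sketch under assumptions: the panel identifier id, the variable year and the new name Built_2015 are hypothetical, and the data are made up.

```stata
clear
input id year Built_area
1 2014 .
1 2015 12.5
1 2016 .
2 2014 .
2 2015 40
2 2016 .
end

* cond() keeps the 2015 value and returns missing otherwise;
* max() ignores missings, so each municipality gets its 2015 value in every year
bysort id: egen Built_2015 = max(cond(year == 2015, Built_area, .))

list, sepby(id)
```

Indicators defined from Built_2015 are then constant within each municipality. If a municipality has no 2015 value, Built_2015 stays missing for all of its years.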