
  • dummy variable conditional not working for all panel IDs

    Dear Stata users,

    I have a panel data set of municipalities in my country for 2000-2020. Three variables give the percentage of land currently in each use: Built_area, Agri_area (agricultural) and NaturalForest_area. I want to create three variables, low_dev, med_dev and high_dev, so that each municipality falls in exactly one of these categories. The problem is that some of my municipalities are not captured by any of the categories. I suspect the ranges are not set properly; I hope someone can point out the issue and suggest a solution.


    These are my commands:
    gen med_dev=1 if (Built_area<=32.71& Built_area>10.13 & Agri_area>33.03 & Agri_area<56.32 & NaturalForest_area<15.18) & !missing(Built_area,Agri_area,NaturalForest_area)

    gen high_dev=1 if (Built_area>32.71 & Agri_area<=33.03 & NaturalForest_area<11.11) & !missing(Built_area, Agri_area, NaturalForest_area)

    gen low_dev=1 if (Built_area<10.13 & Agri_area>=56.32 & NaturalForest_area>=15.18) & !missing(Built_area,Agri_area,NaturalForest_area)


    Thank you to everyone for taking the time to read this; I hope you can help me find a solution.

    Br,

    Adam

  • #2
    You are using the "and" (&) operator where I believe you want the "or" (|) operator. For example
    Code:
    .... (Built_area<=32.71& Built_area>10.13...
    defines a condition that can never be true, as a value cannot be <= 32.71 and also > 10.13. You presumably mean "or". Also, unless you are very familiar with the order in which logical operators are evaluated in Stata, I'd suggest you use parentheses to be sure that your code does what you *mean*, rather than rely on what your code might mean in its natural language equivalent. I'd be almost sure that Stata will evaluate what you have written in a way different than what you think it does.

    You'd also get more easily written and read code if you used Stata's -inrange()- function, e.g.
    Code:
    .... if !inrange(Built_area, 32.71,10.13) // see -help inrange()-
    Finally, you are defining variables that will be 1 and missing, not 1 and 0.
    See https://journals.sagepub.com/doi/10....36867X19830921 for a helpful explanation.


    • #3
      Mike Lacy makes excellent points. There is a typo in his code as he meant to write

      Code:
       !inrange(Built_area, 10.13, 32.71)
      I have a further reaction. How are these indicators (you say dummies) going to help analysis? You have measured predictors, which may or may not be helpful, but degrading them to indicators isn't going to add information or produce a clearer model without some really good rationale for the limits.


      • #4
        Maybe I need another cup of (strong) coffee, but I do not see how this code...

        Code:
        .... (Built_area<=32.71& Built_area>10.13...
        defines a condition that can never occur. The condition is met if Built_area = 25, for example, is it not? 25 <= 32.71 and 25 > 10.13.

        As far as I can tell, all 3 of Adam's -generate- commands create indicators for possible combinations of the variables. But if he wants indicators, he should (IMO) just set the variable = to the expression on the right rather than setting the variable = 1 if the expression is true. That will give him indicators with 0 in place of missing. E.g.,

        Code:
        clear *
        input Built_area Agri_area NaturalForest_area
        25 35 15
        33 33 10
        10 57 16
        end
        
        gen med_dev=1 if (Built_area<=32.71& Built_area>10.13 & Agri_area>33.03 & Agri_area<56.32 & NaturalForest_area<15.18) & !missing(Built_area,Agri_area,NaturalForest_area)
        gen high_dev=1 if (Built_area>32.71 & Agri_area<=33.03 & NaturalForest_area<11.11) & !missing(Built_area, Agri_area, NaturalForest_area)
        gen low_dev=1 if (Built_area<10.13 & Agri_area>=56.32 & NaturalForest_area>=15.18) & !missing(Built_area,Agri_area,NaturalForest_area)
        
        list
        
        replace med_dev= (Built_area<=32.71& Built_area>10.13 & Agri_area>33.03 & Agri_area<56.32 & NaturalForest_area<15.18) & !missing(Built_area,Agri_area,NaturalForest_area)
        replace high_dev= (Built_area>32.71 & Agri_area<=33.03 & NaturalForest_area<11.11) & !missing(Built_area, Agri_area, NaturalForest_area)
        replace low_dev= (Built_area<10.13 & Agri_area>=56.32 & NaturalForest_area>=15.18) & !missing(Built_area,Agri_area,NaturalForest_area)
        
        list
        Regarding -inrange()-, its use could go wrong here, given that in some cases there is a simple < sign rather than <=. I.e., if I understand how inrange() works,

        Code:
        inrange(Built_area,10.13,32.71) = Built_area >= 10.13 & Built_area <= 32.71
        ...and the original code shows the condition as Built_area > 10.13.
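
        A minimal sketch of that boundary difference, using made-up values stored as doubles so that the comparisons at 10.13 and 32.71 are exact (with float storage, equality at the limits can fail for precision reasons). The variable names closed and half_open are only illustrative.

        Code:
        clear
        input double Built_area
        10.13
        25
        32.71
        end
        
        * inrange() includes both limits
        gen byte closed = inrange(Built_area, 10.13, 32.71)
        * the original condition excludes the lower limit
        gen byte half_open = Built_area > 10.13 & Built_area <= 32.71
        
        list
        The two indicators should differ only for the observation exactly at 10.13.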

        Having said all that, I think Nick's "further reaction" in #3 is spot on and needs to be addressed.

        Now I'll go get that coffee and wait for someone to tell me where I went off the rails.
        --
        Bruce Weaver
        Version: Stata/MP 18.5 (Windows)


        • #5
          Replying to Nick Cox (#3):
          Hello Sir, I use these indicators to form subsamples and then fit a panel regression to each, giving three models: one for highly developed municipalities, one for medium developed municipalities and one for low developed municipalities. My thesis is about the impact of supply constraints on the real estate market. I use a model somewhat similar to the one in Hilber & Vermeulen (2016).


          • #6
            Replying to Mike Lacy (#2):
            Thank you for your reply. Using the inrange() function still didn't work out for me.
            gen med_dev=1 if (!inrange (Built_area,32.71,10.13) & (!inrange (Agri_area,33.03,56.32))& NaturalForest_area<15.18) & !missing(Built_area,Agri_area,NaturalForest_area)
            This is what I have been using, but some municipalities don't get a 1 for any of the three dummies.


            • #7
              Thanks for your comment in #5 which ignores my typo correction to Mike’s code. I am not familiar with the paper you allude to — please note our FAQ request to avoid minimal references — but from the sound of it I would have the same reaction to their analysis.


              • #8
                Bruce and Nick were (of course) right about my mistakes; sorry for any further confusion I introduced.


                • #9
                  You cross-posted this at https://www.reddit.com/r/stata/comme...s_in_the_same/ The people at Reddit should surely want to know about this thread here.

                  Please read the FAQ Advice, as every new message prompt requests that you do, specifically https://www.statalist.org/forums/help#crossposting

                  Not telling people about cross-posting has enormous potential to waste people's time and erode their good will. You shouldn't (want to) do that.

                  Thinking that a thread in one place is not giving you the answers you want is one thing, but it's still common courtesy to give a cross-reference if you try elsewhere.

                  Backing up here, I see several distinct issues. This echoes excellent points made by Mike Lacy and Bruce Weaver without, I hope, perpetuating typos.


                  1. Your code generates indicators (you say dummies) that are 1 or missing. Such indicators are useless for analysis as Stata will omit observations that have missing values from most statistical calculations.

                  2. Your approach creates indicator variables that have little or no obvious rationale. Needing to emulate a published analysis because a teacher tells you to would be one thing. Applying indicator variables when there is no rationale and you already have measurements is a waste of information. Even if your cut-offs have some statistical rule behind them, such as being tertiles or based on mean +/- k SD, that doesn't impart any substantive meaning. I am a geographer and have worked with land use data, but this is not elite knowledge. It's just a consequence of general knowledge to appreciate that whether the built up area is above or below 10.13% is not a threshold that has any scientific or practical meaning. Same with your other cut-offs. I have made this point twice already and won't make it again.

                  I make the following guesses about your data.

                  3. I assume that no area can be in two or more land use categories at the same time, so the total over all land use categories is 100%. I don't assume that there aren't other land use categories, so these three percentages may sum to less than 100.

                  4. Lacking a data example I made a synthetic dataset. The code below may help (a) to show technique (b) to throw light on whatever you are not understanding about your results.

                  Specifically note that groups is from the Stata Journal and must be installed before it can be used. It can be helpful when checking for cross-combinations of three or more variables.

                  4a. It may be worth creating indicators that are 1 or 0 for low, medium and high. In doing that I note that you've not been consistent about inequalities.

                  4b. As a cross-check I create categorical variables for low, medium and high on each named category.

                  4c. Finally I create indicators that I think are more or less what you are looking for. The point is that the indicators you think you need should best be defined in terms of something simpler.

                  5. I note that land use data can be awkward because of skewness and outliers and because zeros can be natural (thereby ruling out logarithmic transformations). But if a predictor appears awkward or there are indications of nonlinearity a square root or square transformation can sometimes help.



                  Code:
                  clear 
                  set obs 21 
                  gen Built = 5 * (_n - 1)
                  clonevar Agri = Built 
                  clonevar Forest = Built 
                  fillin Agri Built Forest 
                  drop if (Agri + Built + Forest) > 100 
                  drop _fillin 
                  
                  * L Low M medium H high 
                  * using a convention that upper limits are included: the original post mixes conventions 
                  gen Built_L = Built <= 10.13 if Built < . 
                  gen Built_M = Built > 10.13 & Built <= 32.71 if Built < . 
                  gen Built_H = Built > 32.71 if Built < . 
                  gen Agri_L = Agri <= 33.03 if Agri < . 
                  gen Agri_M = Agri > 33.03 & Agri <= 56.32 if Agri < . 
                  gen Agri_H = Agri > 56.32 if Agri < . 
                  gen Forest_L = Forest <= 11.11 if Forest < . 
                  gen Forest_M = Forest > 11.11 & Forest <= 15.18 if Forest < . 
                  gen Forest_H = Forest > 15.18 if Forest < . 
                  gen Built_cat = cond(Built_L, 1, cond(Built_M, 2, 3)) if Built < . 
                  gen Agri_cat = cond(Agri_L, 1, cond(Agri_M, 2, 3)) if Agri < . 
                  gen Forest_cat = cond(Forest_L, 1, cond(Forest_M, 2, 3)) if Forest < . 
                  
                  groups *_cat  
                  
                  gen low_dev = Built_L & Agri_H & Forest_H 
                  gen med_dev = Built_M & Agri_M & (Forest_L | Forest_M)
                  gen high_dev = Built_H & Agri_L & Forest_L 
                  
                  groups *_dev

                  The results here are for the synthetic dataset and clearly will differ from those for your real data.


                  Code:
                  . groups *_cat  
                  
                    +--------------------------------------------------+
                    | Built_~t   Agri_cat   Forest~t   Freq.   Percent |
                    |--------------------------------------------------|
                    |        1          1          1      63      3.56 |
                    |        1          1          2      21      1.19 |
                    |        1          1          3     273     15.42 |
                    |        1          2          1      45      2.54 |
                    |        1          2          2      15      0.85 |
                    |--------------------------------------------------|
                    |        1          2          3     105      5.93 |
                    |        1          3          1      63      3.56 |
                    |        1          3          2      15      0.85 |
                    |        1          3          3      31      1.75 |
                    |        2          1          1      84      4.74 |
                    |--------------------------------------------------|
                    |        2          1          2      28      1.58 |
                    |        2          1          3     266     15.02 |
                    |        2          2          1      60      3.39 |
                    |        2          2          2      20      1.13 |
                    |        2          2          3      70      3.95 |
                    |--------------------------------------------------|
                    |        2          3          1      42      2.37 |
                    |        2          3          2       6      0.34 |
                    |        2          3          3       4      0.23 |
                    |        3          1          1     210     11.86 |
                    |        3          1          2      56      3.16 |
                    |--------------------------------------------------|
                    |        3          1          3     210     11.86 |
                    |        3          2          1      60      3.39 |
                    |        3          2          2      10      0.56 |
                    |        3          2          3      10      0.56 |
                    |        3          3          1       4      0.23 |
                    +--------------------------------------------------+
                  
                  .
                  .
                  . groups *_dev
                  
                    +------------------------------------------------+
                    | low_dev   med_dev   high_dev   Freq.   Percent |
                    |------------------------------------------------|
                    |       0         0          0    1450     81.87 |
                    |       0         0          1     210     11.86 |
                    |       0         1          0      80      4.52 |
                    |       1         0          0      31      1.75 |
                    +------------------------------------------------+
                  Last edited by Nick Cox; 23 Jan 2022, 06:58.
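
                  A rough sketch of a follow-up check on the synthetic data above: counting how many of the three indicators are 1 for each observation shows which observations fall in no group (1450 of the 1771 in the results above). The name n_dev is hypothetical, and the sum assumes the indicators are never missing, as is true for this synthetic dataset.

                  Code:
                  * number of development groups each observation belongs to
                  gen byte n_dev = low_dev + med_dev + high_dev
                  tabulate n_dev
                  
                  * which Built and Agri category combinations are left unclassified
                  tabulate Built_cat Agri_cat if n_dev == 0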


                  • #10
                    Better technique if missing values are present.

                    Code:
                    * flag observations with complete land use data before combining indicators
                    gen OK = !missing(Built, Agri, Forest) 
                    
                    gen low_dev = Built_L & Agri_H & Forest_H if OK
                    with similar code for the other indicators.


                    • #11
                      Replying to Nick Cox (#9):
                      Thank you very much, Dr. Nick Cox; I appreciate the effort. I am sorry that I cross-posted on Reddit; I should have informed them about this thread. I'll try the code you produced and see whether it works for my data with some adjustments. I forgot to say that only the year 2015 has values of Built_area, Agri_area and NaturalForest_area for each municipality; I am assuming that land use is fairly constant over time. So if a municipality has 1 for one of the categories (low, med, high) in 2015, it should have 1 for that category in all other years as well.


                      • #12
                        Needing or wanting to spread values from 2015 to other years is what it is, and doesn't undermine anything else discussed so far.
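
                        A minimal sketch of that spreading step, assuming hypothetical panel identifiers named municipality and year and that each municipality has land use values in 2015 only. The new variable names ending in _all are illustrative.

                        Code:
                        * copy the 2015 value of each indicator to every year of the same municipality
                        foreach v in low_dev med_dev high_dev {
                            bysort municipality: egen `v'_all = max(cond(year == 2015, `v', .))
                        }
                        Because cond() returns missing for every year other than 2015, max() within each municipality picks up just the 2015 value.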
