
  • Forced negative correlation between x variables?

    Hi all,

    I'm writing a paper that looks at the effect of fires on recreational visits to National parks and National forests. I have a panel dataset consisting of ~500 IDs (geographical units) and 10 years, and my dependent variable is a count variable so I'm using an FE Poisson model.

    I have two questions regarding forced negative correlation between independent variables of interests, and would be grateful for your advice.

    1) A fire can burn different parts of a geographical unit at very low, low, medium, or high severity. For example, a unit may be burned 3% by a fire of low severity, 5% by medium, and so on. One of my regressions looks at the effect of the proportion of the unit burned at each severity. The problem I'm having is that each unit only has one fire, and naturally if 80% of a fire is medium severity, then the very low, low and high severity portions can together make up only 20% of the fire, i.e., there is a forced mechanical correlation. I'm concerned that when I look at the effect of the whole unit burning with medium severity relative to none of it burning with medium severity, the coefficient I get is actually picking up the effect of the unit not having a high, low or very low severity fire in that area, since the shares are inherently negatively correlated. I'm assuming this is a common problem, but I can't find anything on it online, possibly because I'm not using the right search terms. Do you have suggestions on what people tend to do in these situations, or what I should look up?

    The code I'm using for this regression right now is:

    xtpoisson pud c.propburn_all_rows#c.propsev1_all_rows c.propburn_all_rows#c.propsev2_all_rows c.propburn_all_rows#c.propsev3_all_rows c.propburn_all_rows#c.propsev4_all_rows pop60 pop120 prec tmean tmax i.year, fe vce(robust)

    where propburn_all_rows is a continuous variable ranging from 0 to 1 with information on what proportion of the geographical unit is being burned by a fire, propsev1_all_rows is the proportion of the fire burned by severity 1 (very low severity), propsev2_all_rows is proportion of the fire burned by severity 2 (low severity) and so on.

    2) One of my regressions looks at the effect of fires of different ages on visits to national parks and forests. I essentially created a categorical variable called single_column_year_groups that is coded as 1 for observations that are 1-3 years after a fire, 2 for 4-6 years after a fire in the unit etc. I'm then running the following regression:

    xtpoisson pud i.single_column_year_groups pop60 pop120 prec tmean tmax i.year, fe vce(robust)

    I'm concerned about forced negative correlation between the different year groups (i.e., if the observation is in years 1-3 after fire, it's obviously not in any of the other year groups). I'm trying to understand if this is even a problem and/or if Stata is adjusting for it, because that is the case for basically any categorical variable - if you're in one category, you're not in another? I'd love any insight on how Stata adjusts for this.

    Thank you!

  • #2
    Maybe I'm missing something, but I don't see any problem here at all. When Stata estimates regressions, it uses the covariance matrix of all the predictors, so any correlations, positive or negative, among the predictors, are accounted for in the calculations.
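    For the year-group question in particular, a minimal sketch of how to see which category Stata treats as the comparison group. It assumes the dataset from the original post is in memory and already xtset, and that the pre-fire observations have a non-missing code in single_column_year_groups; how they are actually coded is not stated in the thread.

    ```stata
    * Assumption: the poster's panel data are loaded and xtset.
    * Show how many observations fall in each year group, including missing values.
    tabulate single_column_year_groups, missing

    * i. expands the variable into indicators and omits one base level.
    * The baselevels display option lists that base level in the output,
    * so every reported coefficient is a contrast against it.
    xtpoisson pud i.single_column_year_groups pop60 pop120 prec tmean tmax i.year, ///
        fe vce(robust) baselevels
    ```

    If the pre-fire observations turn out to be coded as missing, they drop out of the estimation sample and the base level becomes one of the post-fire groups, so the tabulate step is worth checking first. The /// line continuation works in a do-file, not at the interactive prompt.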

    The only potential problem I foresee here is that if all of the observations in your data set have had some fire, then the propsev* variables will always sum to 1 and there will be collinearity among them. Even that, however, is only a cosmetic problem: Stata will break the collinearity by omitting one of them for you automatically (and will give you a note telling you what it has done). And if some of your observations are about properties that haven't had any fire at all, you won't even encounter this little bump in the road, because for those observations the sum of the propsev* variables will be 0 rather than 1, so the sum is no longer constant across the data.
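    One way to check whether the shares sum to a constant is sketched below. It assumes the dataset from the original post is in memory; sev_total is a hypothetical variable name introduced only for this check.

    ```stata
    * Assumption: the poster's data are loaded; sev_total is a made-up name.
    * Row total of the severity shares; rowtotal() treats missing values as 0.
    egen double sev_total = rowtotal(propsev*_all_rows)

    * If the shares are complete, sev_total should be about 1 where a fire has
    * occurred and 0 where it has not.
    summarize sev_total

    * Count observations whose total is neither (approximately) 0 nor 1.
    count if abs(sev_total - 1) > 1e-6 & abs(sev_total) > 1e-6
    ```

    A nonzero count would suggest that some shares are missing or that a severity class is not covered by the propsev* wildcard.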

    I note that you have used the # interaction notation in your regression commands. This means that your analysis will not include the propburn_all_rows variable itself, nor any of the propsev*_all_rows variables by themselves. This is usually a mis-specified model: when you have an interaction, its constituents should usually also be included. The Stata notation to do that is the double hashtag ##.

    As an aside, you can shorten and simplify your code as follows:

    Code:
    xtpoisson pud c.propburn_all_rows##c.propsev*_all_rows pop60 pop120 prec tmean tmax i.year, fe vce(robust)



    • #3
      Dear Clyde Schechter,

      Thank you very much for your response! That's very helpful. To respond to the easier part first, thank you for the simplified code -- I didn't know I could just do propsev*! In terms of why I'm using # instead of ## -- the independent variable I want is the proportion of the geographical unit burned by severity 1, severity 2, etc. The variable propburn_all_rows has information on what proportion of the polygon is burned by a fire, and propsev1_all_rows has information on what proportion of the fire was burned at severity 1, so multiplying them gives me the proportion of the polygon as a whole that was burned at severity 1, and the same for the other severities -- and the Stata # is just an easy way for me to get that product.
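      The product that # computes can also be built by hand, as in the sketch below. It assumes the dataset from the original post is in memory and already xtset; unit_sev1 through unit_sev4 are hypothetical names for the share of the whole unit burned at each severity.

      ```stata
      * Assumption: the poster's panel data are loaded and xtset.
      * unit_sevN (made-up names) = share of unit burned x share of fire at severity N.
      forvalues s = 1/4 {
          generate double unit_sev`s' = propburn_all_rows * propsev`s'_all_rows
      }

      * Same regressors as the c.propburn_all_rows#c.propsevN_all_rows terms,
      * entered as ordinary continuous variables.
      xtpoisson pud unit_sev1 unit_sev2 unit_sev3 unit_sev4 ///
          pop60 pop120 prec tmean tmax i.year, fe vce(robust)
      ```

      The coefficients on unit_sev1 to unit_sev4 should match those on the corresponding # terms, which makes it easier to inspect or summarize the regressors directly.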

      Hm, that's useful to know. All the units in my dataset have a fire, but there are pre-fire observations too, where propsev* would sum to 0, so within each unit propsev* mostly sums to 1 and sometimes to 0. Would you then predict that Stata doesn't need to omit anything? It isn't omitting anything right now, but I'm just wondering if I'm doing something wrong.

      I am still mildly concerned because the results I'm getting are strange. For instance, it's telling me that the coefficient for the x variable 'proportion of polygon burned such that there was increased vegetation after (severity 5)' was a massive drop in visits - an even bigger drop than fire of high severities. This to me doesn't make sense and makes me suspicious that the current code I'm using isn't using the correct comparison group. Ideally I want to look at the effect of a severity 5 fire burning the polygon as compared to no fire burning the polygon, not as compared to a fire of some other severity distribution burning the polygon. Do you think my current code is accomplishing that?

      Thank you so much!



      • #4
        As a side note, the correlations between the proportions of any two severities are not all that high, but each proportion still mechanically limits what the others can be, especially when the proportion of one severity is particularly high. So the correlation between two severity classes feels somewhat unusual: when the proportion burned by a given severity class is low, it relates to another severity class like any two x variables would, and Stata would handle that normally, but if either proportion is extremely high, that automatically restricts what the other can be. I'm not sure how this affects everything.



        • #5
          Thanks for explaining why you used # instead of ##--and for your purposes, # is, indeed, correct.

          Given that you have some pre-fire observations, there will be no collinearity issue, and nothing should be omitted.

          With the code you are using, the coefficient of propburn_all_rows#propsevN_all_rows, for N = 1, 2, 3, or 4 will be your estimate of the expected difference between the logarithm of the outcome variable (pud) when a fire of severity N burns the property and when there is no fire burning the property. (The logarithm comes in because you are doing Poisson regression.)
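          Written out in terms of the conditional mean, the interpretation above looks as follows. This is a sketch under assumed notation that does not appear in the thread: $s_{N,it}$ is the product of propburn_all_rows and propsevN_all_rows, $z_{it}$ collects the other covariates and year dummies, and $\alpha_i$ is the unit fixed effect.

          ```latex
          % Assumed notation: s_{N,it} = propburn_{it} * propsev_{N,it}
          \[
          \mathrm{E}[\mathit{pud}_{it} \mid s_{it}, z_{it}, \alpha_i]
            = \alpha_i \exp\!\Big( \sum_{N} \beta_N \, s_{N,it} + z_{it}'\gamma \Big)
          \]
          % Whole unit burned at severity N versus no fire, other covariates held fixed:
          \[
          \log \mathrm{E}[\mathit{pud} \mid s_N = 1,\; s_M = 0 \ \forall M \neq N]
            - \log \mathrm{E}[\mathit{pud} \mid s_M = 0 \ \forall M] = \beta_N
          \]
          % Equivalent ratio and percentage change in expected visits:
          \[
          \frac{\mathrm{E}[\mathit{pud} \mid s_N = 1,\; s_M = 0 \ \forall M \neq N]}
               {\mathrm{E}[\mathit{pud} \mid s_M = 0 \ \forall M]} = e^{\beta_N},
          \qquad
          \%\Delta = 100 \, \big( e^{\beta_N} - 1 \big)
          \]
          ```

          For example, a coefficient of -0.5 would correspond to expected visits of about 61% of the no-fire level. Adding the irr option to xtpoisson reports the exponentiated coefficients directly.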

          "I am still mildly concerned because the results I'm getting are strange. For instance, it's telling me that the coefficient for the x variable 'proportion of polygon burned such that there was increased vegetation after (severity 5)' was a massive drop in visits - an even bigger drop than fire of high severities."
          This sentence confuses me. In the original post, you referred to 4 levels of severity, numbered 1 through 4. So why are we suddenly talking about severity 5? And if there is a severity 5 as well, would it not represent even greater severity than 1 through 4, so that its effect being larger would be quite sensible?

          Be that as it may, if the results do not accord with your expectations, the problem does not arise from the code you are using, or at least not from this aspect of the code. The possibilities to consider are:

          1. The model is simply incorrect in the first place.
          2. The data are not correct.
          3. Your expectations are not correct.



          • #6
            Dear Clyde Schechter,

            Thank you so much for helping me clearly interpret my results!

            There actually is a 5th severity class as well which denotes increased greenness - so it doesn't represent a fire of even greater severity than severity 4. I didn't mention it before because on average it burns 0.2% of geographical units -- almost negligible as compared to the other severity classes. Consequently, I wasn't even including it in my regressions, but when I ran the command as you suggested (propsev*) it was included and Stata computed a coefficient for it which was shockingly negative and large, though nowhere close to statistically significant.

            It's reassuring to hear that you think this code is doing what I want it to -- I will look into what else could be causing this strange result.

            Thank you very much!



            • #7
              "it was included and Stata computed a coefficient for it which was shockingly negative and large, though nowhere close to statistically significant."
              ...which is exactly what you would expect for a variable that only affects 0.2% of the data! With so few instances, you simply don't have the information needed to estimate that effect with any useful degree of precision. So the coefficient can be wildly large with either sign, but with enormously wide confidence intervals.
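              A quick way to see how little information there is on that class is sketched below. It assumes the dataset from the original post is in memory and that the fifth severity share is named propsev5_all_rows, following the naming pattern of the other four; the thread does not state the actual name.

              ```stata
              * Assumption: the poster's data are loaded; propsev5_all_rows is a guessed name.
              * Number of observations with any severity-5 area. Missing values sort
              * above every number in Stata, so they are excluded explicitly.
              count if propsev5_all_rows > 0 & !missing(propsev5_all_rows)

              * Distribution of the severity-5 share among those observations.
              summarize propsev5_all_rows if propsev5_all_rows > 0 & !missing(propsev5_all_rows), detail
              ```

              If only a handful of observations have a nonzero share, and the shares themselves are tiny, a large coefficient with a very wide confidence interval is what one would expect.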
