
  • Regression Error: "omitted because of collinearity"

    I am trying to run a regression with the following independent variables:
    Code:
    **Treatment: only one event attended; Control: no event attended
    gen after_event=.
    replace after_event=1 if before_attendance_count==1
    replace after_event=0 if before_attendance_count==0
    
    **the event attended can be divided into type "a", "b" and "c"
    
    gen after_a=.
    replace after_a=0 if before_attendance_count==1
    replace after_a=1 if before_attendance_count==1 & event_type=="a"
    
    gen after_b=.
    replace after_b=0 if before_attendance_count==1
    replace after_b=1 if before_attendance_count==1 & event_type=="b"
    
    gen after_c=.
    replace after_c=0 if before_attendance_count==1
    replace after_c=1 if before_attendance_count==1 & event_type=="c"
    Further we cluster by country and event (one event can be attended by different countries and the perception of the type of one event can differ by country):
    Code:
    country_event_fixedeffects
    and conduct following regressions:

    Code:
    areg dependent_variable after_event, absorb(country_event_fixedeffects)
    There is no problem with the regression above.

    However, as soon as I add a type or multiple types of the event:
    Code:
    areg dependent_variable after_event after_a, absorb(country_event_fixedeffects)
    
    *or
    areg dependent_variable after_a after_b after_c, absorb(country_event_fixedeffects)
    the after_* variables get omitted, e.g.:

    Code:
    *for the first regression above:
    
    note: after_event omitted because of collinearity

    Since the independent variables are not perfectly correlated and assuming the type of event is random, why does it show me a collinearity problem?
    Last edited by Penelope Smart; 13 Jan 2022, 15:02.

  • #2
    You have constructed the after_a, after_b, and after_c variables in a way that guarantees this behavior, because they have a non-missing value only when before_attendance_count == 1. Using just the -areg dependent_variable after_event after_a, absorb(country_event_fixedeffects)- model to illustrate the problem, we see that before_attendance_count == 1 is the same thing as after_event == 1. So the variable after_a, as constructed, is always missing whenever after_event == 0. Now, in any regression model, any observation that has a missing value for any variable in the regression command is automatically omitted from the calculations. So the only observations included in this regression will be those with after_event == 1. So after_event is, in fact, a constant in the regression estimation sample, and is therefore omitted from the regression (because a variable that doesn't vary is always collinear with the fixed effects). The same considerations apply to after_b and after_c.
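
    One way to see this directly is to rerun the regression and tabulate the two indicators over the estimation sample. This is only a sketch: it assumes the variables are still constructed as in #1, uses only the variable names given there, and relies on e(sample), which marks the observations used by the most recent estimation command.

    ```stata
    * Rerun the regression that triggers the note
    areg dependent_variable after_event after_a, absorb(country_event_fixedeffects)

    * Cross-tabulate the indicators over the observations areg actually used;
    * with the construction in #1, after_event should show a single value here
    tabulate after_event after_a if e(sample), missing

    * Compare with the full data set, where after_a is missing whenever after_event == 0
    tabulate after_event after_a, missing
    ```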

    So there are two possibilities. It may be that after_a is only meaningful or relevant in the context of after_event == 1. If that's the case, then what you have done and what Stata has given you is already what you need. You might want to make the output look neater and more thoughtful by rerunning the regressions leaving out the after_event variable yourself, rather than having Stata discard it for you. But that's just an aesthetic matter.

    The other possibility is that after_a is meaningful and relevant as a contrast both to those for which after_event == 1 but the event is not type a, and to those for which after_event == 0. In this case, you have to change your construction of after_a by eliminating the restriction to before_attendance_count == 1. Again, the same considerations apply to after_b and after_c.



    • #3
      Hello Clyde,

      Thank you so much for the thorough explanation. I am trying to find coefficients for the latter.

      I have changed the construction of the event type variables into:

      Code:
      gen after_a=0
      replace after_a=1 if before_attendance_count==1 & event_type=="a"
      replace after_a=. if before_attendance_count>1  
      
      gen after_b=0
      replace after_b=1 if before_attendance_count==1 & event_type=="b"
      replace after_b=. if before_attendance_count>1
      
      gen after_c=0
      replace after_c=1 if before_attendance_count==1 & event_type=="c"
      replace after_c=. if before_attendance_count>1
      Subjects who have attended more than one event get a missing value (they are not relevant for our analysis).

      I have rerun the regression and don't get any omitted variable issues. I think logically this should also be correct?



      • #4
        Also, a short question regarding the interpretation of the regression results:

        Code:
        **regression 1**  
        areg dependent_variable after_event, absorb(country_event_fixedeffects)  
        
        **regression 2**
        areg dependent_variable after_event after_a, absorb(country_event_fixedeffects)  
        
        **regression 3**
        areg dependent_variable after_a after_b after_c, absorb(country_event_fixedeffects)
        I get a negative but non-significant coefficient for after_event running regression 1.

        For regression 2, I get a positive coefficient for after_event (non-significant) and negative coefficient for after_a (significant at 10% level), hence, I would interpret that attending an event of type a seems to have a negative effect on the dependent variable.

        However, the coefficients of regression 3 are all non-significant (event of type a still has a negative coefficient, the other coefficients are positive).

        What conclusion could one make to make sense of this result? For regression 3 I was expecting the coefficient of after_a to be negative as well. Therefore this result seems quite odd to me.
        Last edited by Penelope Smart; 13 Jan 2022, 17:12.



        • #5
          Re #3: Yes that seems sensible.
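
          If it helps, the same construction can be written in one line per variable. This is only a sketch: after_a_alt is a hypothetical name chosen so that the existing after_a is not overwritten, and the final line checks that the two versions agree.

          ```stata
          * The logical expression evaluates to 1 or 0; the if qualifier leaves the
          * result missing when before_attendance_count exceeds 1 or is itself missing
          generate byte after_a_alt = (before_attendance_count == 1 & event_type == "a") if before_attendance_count <= 1

          * Stops with an error if the compact version differs from the variable built in #3
          assert after_a_alt == after_a
          ```

          The same pattern would apply to after_b and after_c by changing the event type.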

          Re #4: The variable after_a in regression 2 does not represent the same thing as the variable after_a in regression 3, so there is no reason to expect the coefficients to be the same, or even to resemble each other in any particular way. Given the definitions of after_a, after_b, and after_c, these are mutually exclusive and exhaustive indicators. You can think of them as "dummy" variables for event_type, with the difference being that all event types other than a, b, and c are not distinguished and just lumped together as an "other" category that serves as the reference category for this dummy variable representation of event_type.

          So what does after_a represent in regression 2? Because there is no mention of after_b or after_c, the coefficient of after_a represents the expected difference, all else equal, in dependent_variable between an observation that experienced an event of type "a" and any observation that didn't. The latter means either no event at all, or an event of type b, or an event of type c, or an event of some other type.

          By contrast, in regression 3, after_a's coefficient represents only the difference between an observation that experienced an event of type "a" and one that experienced either no event at all or an event of type other than a, b, or c.

          Given that these are different things being compared, there is no particular reason for them to agree in any way.
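
          To make the dummy-variable reading of regression 3 concrete, here is a sketch that builds a single categorical variable and uses factor-variable notation instead of three separate indicators. It assumes the after_* variables are defined as in #3; event_cat is a hypothetical name that does not appear in the original posts.

          ```stata
          * event_cat is a hypothetical name; 0 is the reference category
          * (no event, or one event of a type other than a, b, or c)
          generate byte event_cat = 0 if before_attendance_count <= 1
          replace event_cat = 1 if before_attendance_count == 1 & event_type == "a"
          replace event_cat = 2 if before_attendance_count == 1 & event_type == "b"
          replace event_cat = 3 if before_attendance_count == 1 & event_type == "c"

          * Same model as regression 3, written with factor-variable notation
          areg dependent_variable i.event_cat, absorb(country_event_fixedeffects)

          * Wald test of whether the type a and type b coefficients differ
          test 1.event_cat = 2.event_cat
          ```

          Written this way, it is easier to see that each coefficient is a contrast with the reference category only.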



          • #6
            Thank you for your explanation. If there are no other type of events except a, b, and c, does it even make sense to run the following regressions together:

            Code:
             **regression 2**
            areg dependent_variable after_event after_a, absorb(country_event_fixedeffects)  
            
             **regression 4**
            areg dependent_variable after_a, absorb(country_event_fixedeffects)
            The coefficient of after_a in regression 2 represents the difference between an observation that experienced an event of type "a" and any observation that experienced no event or an event of type b or c.

            For the coefficient in regression 4, the interpretation of the coefficient of after_a should be the same?



            • #7
              Not exactly right. Let's do the algebra. I'll ignore the fixed effects for simplicity.

              The model in regression 2 is:

              Code:
              dependent_variable = b0 + b1*after_event + b2*after_a + error
              Now, because of the way after_a is defined, it is always the case that after_event = 1 when after_a = 1, but after_event could be either 0 or 1 when after_a = 0. So, we have separate cases:
              Code:
              E(dependent_variable) = b0 + b1 + b2 when after_a = 1
              E(dependent_variable) = b0 + b1 when after_a = 0 & after_event = 1
              E(dependent_variable) = b0 when after_a = 0 & after_event = 0
              [E(...) denotes the expected value of ...]
              So, there is no clearly defined effect of after_a in this model. Sometimes it is b2, and sometimes it is b2 + b1.
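
              If you do fit regression 2, both contrasts can be recovered after estimation with -lincom-. This is only a sketch using the variable names from this thread, with b1 and b2 referring to the coefficients in the model written above.

              ```stata
              areg dependent_variable after_event after_a, absorb(country_event_fixedeffects)

              * b1 + b2: an event of type a versus no event
              lincom after_event + after_a

              * b2 alone: an event of type a versus an event of another type
              lincom after_a
              ```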

              In regression 4, things are cleaner:
              Code:
              dependent_variable = b0 + b2*after_a + error
              
              E(dependent_variable) = b0 + b2 when after_a = 1
              E(dependent_variable) = b0 when after_a = 0 (regardless of whether after_event = 0 or 1)
              So in this case, the effect of after_a is clearly just b2, and it represents the difference in expected values of the dependent variable between those observations that experienced after_a, and those that did not (meaning either they experienced no event, or an event of a different type.)

              If what you want is to contrast the expected values of the dependent variable between those observations that experienced after_a and those that experienced an event of type b or c, then that would be:
              Code:
              areg dependent_variable i.after_a if after_event == 1, absorb(country_event_fixedeffects)

