  • DiD Regression panel data unobserved fixed effects

    Dear Statalist-Users

    I am a Stata rookie and am currently trying to replicate the data and results of a regression model.
    My data set is a panel of 119 cities with 11 time stamps from 1919 to 2002. For every period (1919-1925, 1925-1933, and so on) it contains the population growth.
    I now want to run a simple difference-in-differences regression, with 20 cities in my treatment group and the remaining 99 cities in my control group.
    I wrote down the formal regression model for you:

    "popgrowth = beta*border + gamma*(border * division) + i.year + e (error term), where popgrowth is the annualized rate of population growth over the periods 1919–1925, 1925–1933, 1933–1939, 1950–1960, 1960–1970, 1970–1980, and 1980–1988 in West German city c at time t; border is a dummy equal to one when a city is a member of the treatment group of cities close to the East-West border and zero otherwise; division is a dummy equal to one when Germany is divided and zero otherwise; i.year are a full set of time dummies; and e is the error term."
    (If you need more details, the paper I want to replicate is by Redding and Sturm, titled "The Costs of Remoteness: Evidence from German Division and Reunification".)

    pop_growth is the population growth of the 119 cities and is my dependent variable.
    division is a dummy which is zero, except for the years 1950–1988 when Germany was divided, in which case it takes the value one.
    border is a dummy which is zero unless a city lies within 75 kilometers of the East-West German border, in which case it takes the value one (treatment group).
    i.year implements the full set of time dummies.

    It is also important to note that I exclude the 1939–1950 difference (as the paper does).


    Well, I have now implemented the model and the variables in Stata, but I can't even replicate the first main result for (border * division); I even get a positive coefficient (it should be negative).

    My code:

    //xtset city year

    gen border = 0
    replace border = 1 if dist_gg_border <= 75

    *tab city border

    gen division = 0
    replace division = 1 if year >= 1950 & year <= 1988

    //gen border_division = border * division

    //Regression

    reg pop_growth border##division i.year if (!(year >= 1939 & year < 1950) & !(year > 1988 & year <= 2002)), robust cluster(city)

    Also, I often see people use xtreg for panel data, with fe at the end of the command for fixed effects. When I do that, I get the same result. I have some further questions as well. In my regression I can interpret the coefficients for the set of time dummies, for example 1933, 1939 and so on. Is it normal that I can't see 1919 and 1925? My only guess is that I can't see "1919" because pop_growth is obviously missing for that year, and that "1925" is used as the reference category. But I still find it strange. My second question: the time dummy 1988 is "omitted". When I run xtreg, fe, "border" is also omitted from my regression. What does that mean?

    I get the correct number of observations, but the results just do not match those in the paper. Do you see any rookie mistake I made? I can of course post more information here. Thank you in advance for your help!

  • #2
    Okay, I fixed it now and got the results. Thank you anyway for reading.


    • #3
      -xtreg, fe- and -reg- are very different analyses, and only under special circumstances do they produce similar results. In your data, variation in population growth is coming from (at least) two different sources. One is differences among the cities: some of them will be experiencing higher overall levels of population growth than others. The other is that over time, within any given city, the population growth rate will change. -xtreg, fe- can show you only the latter, the within-city changes in population growth rate. By contrast, what -reg- looks at is a weighted average of the among-cities and within-cities variation.

      Which of these is the correct analysis depends on which aspect of variation in population growth rates is salient in your research question. You don't say what your research question is, but given that you are trying to replicate a previous study, and -xtreg, fe- seems to do that while -reg- does not, it seems that, whether you know it or not, you are trying to estimate the within-cities variation over time in population growth rate. This also goes along with the fact that you are doing a DID model: in such models we are usually studying the effect of some "treatment" on the within-cluster (within-city in your case) variation over time of some outcome variable.

      So the reason you are not getting the results you were expecting with your -reg- command is that -reg- is simply the wrong analysis for your research question.
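
      A minimal sketch of the two commands being contrasted, reusing the variable names and the sample restriction from #1. It assumes that city is a numeric identifier and that border and division have already been generated as in #1; the local macro name keep is only for illustration.

      ```stata
      * Assumes city is numeric and border/division exist as created in #1.
      xtset city year

      * Sample restriction copied from #1 (the macro name is hypothetical).
      local keep "!(year >= 1939 & year < 1950) & !(year > 1988 & year <= 2002)"

      * Pooled OLS: draws on both among-city and within-city variation.
      regress pop_growth i.border##i.division i.year if `keep', vce(cluster city)

      * City fixed effects: draws only on within-city variation over time.
      xtreg pop_growth i.border##i.division i.year if `keep', fe vce(cluster city)
      ```

      Run the lines together from a do-file so the local macro is still defined when the regressions execute. Comparing the coefficient of 1.border#1.division across the two outputs shows how much the choice of estimator matters in this particular sample.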

      As for the variables that do not show up in the analysis, let me deal first with the variable border in the -xtreg, fe- model. The variable border is a time-invariant attribute of your cities: every observation for a given city has the same value of border. A city is either a border city or it isn't. That never changes. In a fixed-effects model such as -xtreg, fe-, time-invariant variables are always omitted because their effects are not estimable. In math terms, the issue is that time-invariant variables are colinear with the panel (city, in your case) fixed effects themselves. For reasons of linear algebra, you cannot include a full set of colinear variables in any regression model. In the fixed-effects model this means that either border or one of the city fixed effects has to go. -xtreg, fe- will always preserve the panel fixed effects, so it jettisons border. Crucially, this is of no importance whatsoever: it is not a problem. The effect of being a border city on population growth during the division of Germany is given by the coefficient of 1.border#1.division, which is not omitted.
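
      The time-invariance of border can be checked directly. The sketch below again assumes a numeric city identifier and the variables from #1; it is a quick diagnostic, not part of the original analysis.

      ```stata
      * Assumes city is numeric and border/division exist as created in #1.
      xtset city year

      * Stops with an error if border ever changes within a city; silent otherwise.
      bysort city (year): assert border == border[1]

      * For a time-invariant variable, the "within" standard deviation is 0.
      xtsum border division
      ```

      A within standard deviation of zero for border is what makes it colinear with the city fixed effects, while division varies within each city and so survives, as does its interaction with border.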

      Turning to the time variables that got dropped, I fully expect that two of them should be omitted, but not exactly for the reasons you give. In particular, there is nothing in your data description that suggests that the year 1919 observations are plagued by missing values (to any greater extent than any of the other 10 years in the data). And if it is true that the 1919 observations all have missing values, then I would expect that you would see yet a third year omitted from the analysis as well. After all observations with missing values on any variable mentioned in the regression command are excluded from the estimation sample, you have to lose two of your years. One of them will be the reference category, and assuming there are some 1919 observations still in the estimation sample, and absent a specification by you of some other value, Stata will use 1919 for that. You must also lose another year because your variable division defines a subset of the time variables: all those up to and including 1988 have division = 0, and those after have division = 1. This variable is therefore colinear with the indicators arising out of i.year. The specific colinearity relationship is division = sum(indicators for all years > 1988). As mentioned earlier, when there is a group of colinear variables in any regression, one of them has to go. In this case, Stata chose 1925 for the purpose. If for aesthetic reasons you do not like that choice, there are ways of getting it to choose another, or to choose to drop division and leave the rest of the years alone. But there is no statistical reason to care: the important results are not affected. Such choices would produce different results for the year indicators and the division variable, but they do not affect 1.border#1.division, which is the key variable whose results you need.
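
      One way to see which years actually enter the estimation, and to pick the reference year yourself, is sketched below. It assumes the -xtreg, fe- model has just been run with the variables from #1, and it uses 1933 purely as an arbitrary example of a base year; that value must exist in year for the command to work.

      ```stata
      * Immediately after the regression: which years are in the estimation sample?
      tabulate year if e(sample)

      * Choose the reference year explicitly (1933 is an arbitrary example).
      xtreg pop_growth i.border##i.division ib1933.year ///
          if !(year >= 1939 & year < 1950) & !(year > 1988 & year <= 2002), ///
          fe vce(cluster city)
      ```

      Changing the base year changes the coefficients of the year indicators but should leave the coefficient of 1.border#1.division as it was.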

      As a beginner, you are not familiar with fixed effects models. I suggest you get your hands on an introductory econometrics textbook to learn more about them.

      Added: Crossed with #2.


      • #4
        Actually, I spoke too quickly in #3.
        You must also lose another year because your variable division defines a subset of the time variables: all those up to and including 1988 have division = 0, and those after have division = 1. This variable is therefore colinear with the indicators arising out of i.year. The specific colinearity relationship is division = sum(indicators for all years > 1988). As mentioned earlier, when there is a group of colinear variables in any regression, one of them has to go. In this case, Stata chose 1925 for the purpose.
        is incorrect. Removing 1925 from the year indicators would not modify the colinearity relationship in this case: one of the years > 1988 would have to be removed. So I cannot explain what O.P. observed. I cannot explain why Stata did remove 1925, and I cannot explain why Stata did not remove some other year > 1988. My hunch is that the data do not actually conform to the description given. But without recourse to example data and the complete output from -xtreg, fe-, I can't say anything more.
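
        Short of posting example data, a couple of tabulations would show whether the data match the description. This is only a sketch using the variable names and sample restriction from #1.

        ```stata
        * How does division line up with year in the rows used for estimation?
        tabulate year division ///
            if !(year >= 1939 & year < 1950) & !(year > 1988 & year <= 2002), missing

        * In which years is population growth missing?
        tabulate year if missing(pop_growth)
        ```

        The first table shows which years have division equal to 0 and which have it equal to 1, and hence which year indicators it is colinear with. The second shows whether any year drops out of the estimation sample entirely because pop_growth is missing.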

        If O.P. chooses to pursue this further, I hope the response will use the -dataex- command to show the example data. If you are running version 18, 17, 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.
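
        For instance, a post might be prepared along the following lines. The variable list is taken from #1, and the choice of the first 22 observations is arbitrary (two cities' worth if each city has 11 rows and the data are sorted by city and year).

        ```stata
        * Only needed if -dataex- is not already installed.
        ssc install dataex

        * Produce a copy-and-paste example of the relevant variables.
        dataex city year pop_growth dist_gg_border border division in 1/22
        ```

        The output of -dataex- can be pasted into the forum as is.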
