  • Testing coefficients or predictions with > 500,000 unit fixed effects

    PROBLEM:

    I am examining how mobility fell during a COVID lockdown in two types of communities, slums and non-slums. I have data on > 500,000 cell phones on a daily basis for several months. During that period, the government imposed a lockdown. For simplicity, assume that the data start on 1 March 2020 and the lockdown is 1 April - 30 April 2020, and I am only looking in those two months. I want to study if the effect of the lockdown differed across the two communities, but want to measure that in 2 ways.

    1/ Was mobility lower in non-slums than slums during the lockdown?
    2/ Did mobility decline more during lockdown in non-slums, measured as a percentage of pre-lockdown mobility?

    In addition, I want to identify changes in mobility within devices (ie, the phones). Importantly, each device is associated with a time-invariant community type (slum or non-slum).

    And to make it even more complicated, the device id variable is a string.

    I am using Stata 16.

    ATTEMPTED SOLUTION:

    Ordinarily, I would do the following:

    reghdfe mobility lockdown if slum==0, a(device_id)
    eststo nonslum
    reghdfe mobility lockdown if slum==1, a(device_id)
    eststo slum
    suest nonslum slum, vce(cl device)
    test [slum_mean]lockdown - [nonslum_mean]lockdown = 0
    testnl _b[slum_mean:lockdown]/_b[slum_mean:_cons] = _b[nonslum_mean:lockdown]/_b[nonslum_mean:_cons]


    But suest doesn't work with reghdfe.
    So then I tried reg with factors.

    egen cell = group(device_id)
    reg mobility lockdown i.cell if slum==0
    eststo nonslum
    reg mobility lockdown i.cell if slum==1
    eststo slum
    suest nonslum slum, vce(cl device)


    But I have > 500,000 devices. I can't set maxvar high enough.
    So, I try using predict instead.

    gen lockdown_slum = lockdown * slum
    reghdfe mobility lockdown lockdown_slum, a(device_id)
    predict nonslum_level if slum==0, xb

    predict nonslum_level_se if slum==0, stdp
    predict slum_level if slum==1, xb
    predict slum_level_se if slum==1, stdp

    The question is how to test the difference in predictions to test hypothesis 1. Moreover, how to test hypothesis 2?

    Any advice would be much appreciated.

  • #2
    If you have not already done so, I would start with estimating a pooled regression using -regress- with factor variable notation like this:
    Code:
    regress mobility i.lockdown##i.slum
    If the coefficient of your interaction term is statistically significant, that is evidence that the lockdown affects the mobility of phones classified as associated with the slums differently than it affects the mobility of other phones. Then the magnitude of the estimated coefficient will be interesting. Note that this approach obviates the need for -suest-.

    Also what does the distribution of your dependent variable look like? Is it right-skewed? If very few phones have zero mobility, you might use the logarithm of mobility as your dependent variable rather than raw mobility. That will enable you to interpret the coefficient of the interaction term as a percentage difference in the effect of the lockdown on mobility instead of an absolute difference. The percentage difference might, in this case, be more meaningful.
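
    As a hedged illustration of the log specification suggested above (a sketch, not tested on these data): the names ln_mobility and cell are introduced here for illustration only, and device-days with zero mobility drop out because the log is missing there.

    ```stata
    * Sketch only: ln_mobility and cell are hypothetical names, not from the original post.
    generate double ln_mobility = ln(mobility)   // missing wherever mobility <= 0
    egen long cell = group(device_id)            // numeric device id (skip if it already exists)
    regress ln_mobility i.lockdown##i.slum, vce(cluster cell)
    * The coefficient on 1.lockdown#1.slum is the slum-minus-non-slum difference
    * in the log change; 100*(exp(b)-1) converts a log change b to a percent change.
    ```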

    If these results look interesting, then you can attempt to control for other characteristics of the phones. For example, within the two groups of phones, can you use the first week of pre-lockdown data to distinguish those belonging to inhabitants who live in single family dwellings from those belonging to people living in multi-family dwellings? Or perhaps you can use the destinations of daily phone travel before the lockdown to classify all phones by how far their owners normally travel for work. Do you know which cell service provider is associated with each phone? (I'm thinking that cell provider or dwelling type might be rough proxies for household income.)

    After experimenting with these other variables, and handling any memory problems that arise when you estimate this model with -regress-, you can consider whether you really need fixed effects. Using fixed effects is likely to obscure the relationship between mobility and other interesting variables like your slum dummy and the variables I have suggested. In particular, since the phone-specific fixed effects are grouped within your "slum" and "non-slum" categories, they will probably prevent your estimating an impact of that variable.

    If you do decide you need fixed effects, before turning to -reghdfe-, you might want to experiment with Stata's -areg- command like this:

    Code:
    areg  mobility i.lockdown##i.slum , absorb(device_id)
    Like -reghdfe-, -areg- is very parsimonious with Stata's memory. Both programs accept a string variable in the -absorb()- option. -reghdfe- is essential if you have multiple fixed effects, such as -device_id-, -employer_id-, -household_id-, etc., especially if they are not nested. But for your basic problem, -areg- should work just as well. Sergio Correia, the author of -reghdfe-, uses -areg- to benchmark his program.
    Last edited by Mead Over; 09 Feb 2021, 22:58.


    • #3
      Thank you for this response, but it doesn't address my core question: how can one test whether one type of device (located in slums) shows a different change in mobility than another group of devices when the change is identified within device? Please assume I have the right econometric specification and that I have done robustness checks on the data. This includes identifying change w/o device fixed effects. I am interested in solving the Stata-specific problem.

      Areg has the problem that it does not work with suest.

      I also have the problem that my device_id is a string and with 500k device id's, I cannot use the usual approaches (outside areg and reghdfe) to use factor notation.


      • #4
        Mead Over is spot on; it appears that you just do not follow what he is doing. To understand the use of interactions in comparing coefficients, see, for example, https://www.theanalysisfactor.com/co...-coefficients/.


        I also have the problem that my device_id is a string and with 500k device id's
        See

        Code:
        help encode


        • #5
          I don't have a solution to this problem, but I am compelled to point out that these responses are exactly why I am frequently hesitant to go to Statalist with questions: so many posters assume the user doesn't know what they're doing, and/or give irrelevant answers. They're not asking for your advice on other ways to analyze the data. They want to know how, in Stata, one can analyze it THIS way. Telling this person to run a different regression is not helpful (nor are links to college freshman-level webpages about interpreting coefficients).


          • #6
            Hi Anup,
            there is actually a workaround for your problem, but it requires a few extra steps:
            See the example below
            Code:
            sysuse auto ,clear
            xtile qmpg=mpg, n(5)
            
            reg price weight i.qmpg if foreign==0
            est sto m1
            reg price weight i.qmpg if foreign==1
            est sto m2
            suest m1 m2
            
            hdfe price weight if foreign==0, abs(qmpg) gen(f0)
            hdfe price weight if foreign==1, abs(qmpg) gen(f1)
            
            reg f0price f0weight if foreign==0
            est sto m3
            reg f1price f1weight if foreign==1
            est sto m4
            
            suest m1 m2 m3 m4
            So basically, you need to "demean" all your variables by the FE (using hdfe for example), and by your "treatment" group. Then estimate the models using the demeaned variables.
            Above I provide the code using a dummy approach and using the demeaned variables. The -suest- output should show you that the coefficients from the two approaches are the same.

            This requires the user-written command hdfe (ssc install hdfe)
            Hope it helps.
            Fernando


            • #7
              I tried encode, but it gives a "too many values" error. I then tried

              egen cell = group(device_id)

              At first it did not work because of a maxvar limit. I set maxvar to 120k, the maximum for my machine and my Stata version. In addition, I cut down the data set a bit (not ideal). Then egen group worked. So then I could run

              reg mobility lockdown i.cell if slum==0

              But given there are nearly 500k fixed effects, I get a "op. sys. refuses to provide memory" error.

              At this point, a natural question is what machine and stata do I have. I have a Macbook Pro 13" M1 with 16gb RAM and 1 TB drive. I am running Stata/MP 16.1 for Mac (Apple Silicon) Revision 21 Jan 2021. I'd be happy to jump on the University server if that will eliminate these problems.

              Finally, another option is the bootstrap, I suppose. (Still working it out.) But the problem with the bootstrap is that even reghdfe takes some time to run. Doing 1000 reps will take a *long* time. Not saying I couldn't, but it's a long time.

              And to answer (anticipated) econometric questions:

              - I have already done non-FE solutions. Identification within device using FE is a robustness check. I have limited data on cell phone owner's features (other than home location), so FE picks up unobservable time-invariant differences across devices. For these two reasons, I really am interested in the FE model.

              - Ideally I want to avoid running a random effects model because I want to allow the fixed effects to be correlated with other variables not mentioned to simplify the problem.

              - While I estimate within device here, this is part of a battery of robustness tests. In others I test changes within uber H9 hexes (and there are > 100k of those.)

              - I have also looked at different approaches to the data. E.g., look at a quantile reg. Perhaps that has to be done at a neighborhood level. But those are actually harder to do with lots of fixed effects.

              I would be extremely happy to learn that I am making a mistake or missing an obvious solution. But I suspect it will be on the Stata side, not the econometric side.


              • #8
                Replying to #5: Let us clarify a few things here: In #3, the OP points out his main issue:

                Areg has the problem that it does not work with suest.
                In #2, the first code shows how to overcome this. The link in #4 follows from the appearance that the OP did not understand the first piece of advice.


                so many posters assume the user doesn't know what they're doing, and/or give irrelevant answers.
                That the advice is irrelevant is your opinion, which you provide no evidence to justify. Part of the reason for participating in a public forum such as Statalist is not only to directly answer the questions asked, but also to give opinions on what one thinks is a better approach. That you do not like that people express their opinions is neither here nor there.
                Last edited by Andrew Musau; 10 Feb 2021, 10:59.


                • #9
                  #7 You can use reghdfe using interactions. See an example in #5 of the following link and post back if you are unable to adapt the code.

                  https://www.statalist.org/forums/for...erent-outcomes




                  • #10
                    Fernando (#6), this is an interesting approach. Thank you. But I wonder if it runs into the problem that

                    reg mobility lockdown i.cell if slum==0

                    runs into a "op. sys. refuses to provide memory" error because of the large number of FE. This leads me to ask: why do I need m1 and m2 in the suest command?


                    • #11
                      Hi Anup,
                      Sorry for the confusion. What I was showing in that example is that running the regression
                      Code:
                      reg price weight i.qmpg if foreign==0
                      and
                      Code:
                      reg f0price f0weight if foreign==0
                      were equivalent

                      You do not need to run this "reg mobility lockdown i.cell if slum==0"
                      Just follow the "demeaning" process.

                      Code:
                      hdfe mobility lockdown if slum==0, abs(cell) gen(f0)
                      hdfe mobility lockdown if slum==1, abs(cell) gen(f1)
                      
                      reg f0mobility f0lockdown if slum==0
                      est sto m1
                      reg f1mobility f1lockdown if slum==1
                      est sto m2
                      
                      suest m1 m2
                      HTH
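
                      For the comparison the original question asks about, the stored results can be combined and tested roughly as follows. This is a sketch under stated assumptions: m1 and m2 come from regressing f0mobility on f0lockdown and f1mobility on f1lockdown respectively, and cell is the numeric device identifier. After -suest-, the coefficient equation of a stored -regress- result is named after the stored estimate with the suffix _mean.

                      ```stata
                      * Sketch: assumes the demeaned variables f0* and f1* created by hdfe above.
                      regress f0mobility f0lockdown if slum==0
                      estimates store m1
                      regress f1mobility f1lockdown if slum==1
                      estimates store m2
                      * Cluster on device so repeated daily observations of a phone
                      * are not treated as independent.
                      suest m1 m2, vce(cluster cell)
                      * Equality of the within-device lockdown effect across the two groups.
                      test [m1_mean]f0lockdown = [m2_mean]f1lockdown
                      ```

                      Because the demeaned variables have mean zero within each group, _cons in these regressions is zero by construction, so the ratio test sketched in #1 cannot be built from _cons here; the pre-lockdown group means would have to come from elsewhere. It is also worth checking the standard errors against -reghdfe- run on each subsample, since these regressions do not know how many means were absorbed.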
                      Last edited by FernandoRios; 10 Feb 2021, 11:13.


                      • #12
                        Andrew (#9), I am probably missing something, but here is my concern. Each device or cell phone in my sample is associated with a slum or non-slum neighborhood throughout the time period of the sample. Because of that, device_id spans the slum indicator, meaning I can't do interactions.

                        Without using Stata commands, here is the problem. I think what you are suggesting is a stacked regression implemented via interactions with the slum indicator.

                        mobility = a0 + a1 lockdown + a2 slum + a3 lockdown * slum + d_i + e

                        where d_i is a vector of device fixed effects. This cannot be estimated because d_i and slum are perfectly collinear. Perhaps the reason the stacked regression worked in your linked example is that the LHS variable was what differed in your example. So your group indicator marked which LHS variable was being used. Your absorbed variable did not span a regressor. (Though I could be wrong about this, so you should correct me.)
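
                        One way to see exactly which parameters the collinearity affects is to apply the within-device transformation to the equation above. This is a sketch: the subscripts i (device) and t (day), and the bar notation for device-level means, are added here for illustration.

                        ```latex
                        \begin{aligned}
                        \text{mobility}_{it} &= a_0 + a_1\,\text{lockdown}_t + a_2\,\text{slum}_i
                            + a_3\,\text{lockdown}_t \cdot \text{slum}_i + d_i + e_{it} \\
                        \text{mobility}_{it} - \overline{\text{mobility}}_i
                            &= a_1\,(\text{lockdown}_t - \overline{\text{lockdown}}_i)
                            + a_3\,\text{slum}_i\,(\text{lockdown}_t - \overline{\text{lockdown}}_i)
                            + (e_{it} - \bar{e}_i)
                        \end{aligned}
                        ```

                        Because slum_i does not vary within a device, a_0, a_2 and d_i all drop out of the second line, so a_2 cannot be separated from d_i. The product lockdown_t * slum_i does vary within a device, so a_1 and a_3 remain in the demeaned equation.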



                        • #13
                          Your model in #1 looks like Fernando's example in #6. Here is how you specify the model with interactions with an added variable "displacement". Am I missing something?

                          Code:
                          sysuse auto ,clear
                          xtile qmpg=mpg, n(5)
                          
                          reghdfe price weight disp if !foreign, absorb(qmpg)
                          reghdfe price weight disp if foreign, absorb(qmpg)
                          gen obs=_n
                          gen for1= 1.foreign
                          gen for0=0.foreign
                          reghdfe price i.foreign#(c.weight c.disp), a(i.qmpg#c.for0 i.qmpg#c.for1) cluster(obs)
                          test 0b.foreign#c.weight= 1.foreign#c.weight
                          Res.:

                          Code:
                          . reghdfe price weight disp if !foreign, absorb(qmpg)
                          (MWFE estimator converged in 1 iterations)
                          
                          HDFE Linear regression                            Number of obs   =         52
                          Absorbing 1 HDFE group                            F(   2,     45) =      16.56
                                                                            Prob > F        =     0.0000
                                                                            R-squared       =     0.6374
                                                                            Adj R-squared   =     0.5891
                                                                            Within R-sq.    =     0.4240
                                                                            Root MSE        =  1985.2890
                          
                          ------------------------------------------------------------------------------
                                 price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
                          -------------+----------------------------------------------------------------
                                weight |    2.85163   1.005773     2.84   0.007     .8258983    4.877362
                          displacement |   10.95874   6.659186     1.65   0.107    -2.453549    24.37103
                                 _cons |  -5947.947   2535.947    -2.35   0.023    -11055.61   -840.2874
                          ------------------------------------------------------------------------------
                          
                          Absorbed degrees of freedom:
                          -----------------------------------------------------+
                           Absorbed FE | Categories  - Redundant  = Num. Coefs |
                          -------------+---------------------------------------|
                                  qmpg |         5           0           5     |
                          -----------------------------------------------------+
                          
                          . 
                          . reghdfe price weight disp if foreign, absorb(qmpg)
                          (MWFE estimator converged in 1 iterations)
                          
                          HDFE Linear regression                            Number of obs   =         22
                          Absorbing 1 HDFE group                            F(   2,     15) =       7.27
                                                                            Prob > F        =     0.0062
                                                                            R-squared       =     0.8871
                                                                            Adj R-squared   =     0.8419
                                                                            Within R-sq.    =     0.4922
                                                                            Root MSE        =  1042.3941
                          
                          ------------------------------------------------------------------------------
                                 price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
                          -------------+----------------------------------------------------------------
                                weight |   4.121708   2.163195     1.91   0.076    -.4890323    8.732449
                          displacement |   1.181057    38.3645     0.03   0.976    -80.59094    82.95306
                                 _cons |  -3292.185   2548.395    -1.29   0.216    -8723.962    2139.591
                          ------------------------------------------------------------------------------
                          
                          Absorbed degrees of freedom:
                          -----------------------------------------------------+
                           Absorbed FE | Categories  - Redundant  = Num. Coefs |
                          -------------+---------------------------------------|
                                  qmpg |         5           0           5     |
                          -----------------------------------------------------+
                          
                          . 
                          . gen obs=_n
                          
                          . 
                          . gen for1= 1.foreign
                          
                          . 
                          . gen for0=0.foreign
                          
                          . 
                          . reghdfe price i.foreign#(c.weight c.disp), a(i.qmpg#c.for0 i.qmpg#c.for1) cluster(ob
                          > s)
                          (warning: no intercepts terms in absorb(); regression lacks constant term)
                          (MWFE estimator converged in 2 iterations)
                          
                          HDFE Linear regression                            Number of obs   =         74
                          Absorbing 2 HDFE groups                           F(   4,     60) =       7.64
                          Statistics robust to heteroskedasticity           Prob > F        =     0.0000
                                                                            R-squared       =     0.9438
                                                                            Adj R-squared   =     0.9307
                                                                            Within R-sq.    =     0.4305
                          Number of clusters (obs)     =         74         Root MSE        =  1796.5733
                          
                                                                 (Std. Err. adjusted for 74 clusters in obs)
                          ----------------------------------------------------------------------------------
                                           |               Robust
                                     price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
                          -----------------+----------------------------------------------------------------
                          foreign#c.weight |
                                 Domestic  |    2.85163   1.275207     2.24   0.029     .3008353    5.402425
                                  Foreign  |   4.121708   2.524503     1.63   0.108    -.9280494    9.171466
                                           |
                                   foreign#|
                            c.displacement |
                                 Domestic  |   10.95874   7.517157     1.46   0.150    -4.077813    25.99529
                                  Foreign  |   1.181057   40.89143     0.03   0.977    -80.61399     82.9761
                          ----------------------------------------------------------------------------------
                          
                          Absorbed degrees of freedom:
                          -----------------------------------------------------+
                           Absorbed FE | Categories  - Redundant  = Num. Coefs |
                          -------------+---------------------------------------|
                           qmpg#c.for0 |         5           0           5    ?|
                           qmpg#c.for1 |         5           0           5    ?|
                          -----------------------------------------------------+
                          ? = number of redundant parameters may be higher
                          
                          . 
                          . test 0b.foreign#c.weight= 1.foreign#c.weight
                          
                           ( 1)  0b.foreign#c.weight - 1.foreign#c.weight = 0
                          
                                 F(  1,    60) =    0.20
                                      Prob > F =    0.6550
                          
                          .


                          • #14
                            Anup may feel that @FernandoRios's example #11 and @AndrewMusau 's example in #13 are not pertinent because the variable analogous to Anup's fixed effect, -qmpg-, does not perfectly predict the dummy variable -foreign- the way Anup's cell phone -device_id- perfectly predicts his dummy variable -slum-. Here is slightly tweaked code showing that @AndrewMusau 's approach works even when the fixed effect is perfectly correlated with the dummy variable.

                            The analogy with Anup's example is:
                            price -> mobility
                            foreign -> slum
                            weight -> lockdown
                            mfg -> device_id (In the code below, the variable -mfg- is constructed to perfectly predict -foreign-)

                            Code:
                            sysuse auto ,clear
                            gen str_mfg = cond(strpos(make," "), substr(make,1,strpos(make," ")-1),make)
                            
                            encode str_mfg, gen(mfg)
                            
                            reghdfe price weight disp if !foreign, absorb(mfg)
                            reghdfe price weight disp if foreign, absorb(mfg)
                            
                            gen obs=_n
                            gen for1= 1.foreign
                            gen for0=0.foreign
                            
                            reghdfe price i.foreign#(c.weight c.disp), a(i.mfg#c.for0 i.mfg#c.for1) cluster(obs)
                            
                            test 0b.foreign#c.weight= 1.foreign#c.weight
                            This approach avoids the -suest- command by imposing the constraint that the disturbance term in the model has the same variance in both the -slum- and -!slum- groups of cell phones, a not unreasonable assumption. Furthermore, this single pooled (or stacked) approach has the advantage that the hypothesis test is nested. I think the -suest- command, on the other hand, appeals to the theory of non-nested hypothesis testing, which has always seemed to me to rest on stronger and less plausible assumptions.
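
                            Mapped back to the variable names in #1, the same pooled test might look like the sketch below. It has not been run on the actual data. The names cell, lockdown_slum, m_nonslum and m_slum are illustrative. Because every device belongs to exactly one community type, absorbing cell alone plays the role of the two group-specific sets of fixed effects above. The last step treats the pre-lockdown group means as known constants, which is an added simplification that ignores their sampling error.

                            ```stata
                            * Sketch only; variable names follow #1, helper names are hypothetical.
                            egen long cell = group(device_id)        // numeric device id (skip if it already exists)
                            generate byte lockdown_slum = lockdown * slum   // assumes both are 0/1 indicators
                            reghdfe mobility lockdown lockdown_slum, absorb(cell) vce(cluster cell)

                            * Does the within-device change differ between slum and non-slum devices?
                            test lockdown_slum

                            * Change as a share of each group's pre-lockdown mean (means treated as fixed).
                            summarize mobility if lockdown == 0 & slum == 0, meanonly
                            local m_nonslum = r(mean)
                            summarize mobility if lockdown == 0 & slum == 1, meanonly
                            local m_slum = r(mean)
                            nlcom (_b[lockdown] + _b[lockdown_slum]) / `m_slum' - _b[lockdown] / `m_nonslum'
                            ```

                            Here _b[lockdown] is the change for non-slum devices and _b[lockdown] + _b[lockdown_slum] is the change for slum devices, so -nlcom- reports the slum-minus-non-slum difference in proportional changes together with a delta-method standard error.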

                            But if this single equation approach still runs into memory management problems, I don't know what to advise. Perhaps -reghdfe-'s author @SergioCorreia could help with that.


                            • #15
                              @AnupMalani : In your post #7, you say:
                              - While I estimate within device here, this is part of a battery of robustness tests. In others I test changes within uber H9 hexes (and there are > 100k of those.)
                              This makes me wonder if it would be possible to merge your data with variables from other geospatial databases that are specific to the hex unit. Number of Uber trips or requests? Number of people in the associated census tract? If each hex is located on approximately the same number of square-meters, could the number of cell-phones with residences in the same hex be used as a proxy for population density? I'm just thinking that you might be able to get a variable or two that could represent poverty in a more interesting way than with a dummy variable. Then instead of assuming that there are only two types of people, those in the slum and those outside it, you could test that hypothesis.

                              By the way, as a health economist who has worked on infectious diseases, I find your research objective and eventual results quite interesting. As I'm sure you are aware, the question of whether the lockdown effectively restricts the mobility of people who live in the poorest part of the city reflects both the effectiveness of the lockdown policy to control Covid transmission and also the economic necessity that prevents the poorest people from "sheltering in place".

                              I'm not sure which discipline you call home, but if you are not an epidemiologist you might not be aware of the literature on "core groups". This July, 2020 op-ed by Jeffrey Klausner is a good introduction to the idea with application to COVID. And here's a paper I published applying the concept to HIV prevention. (If the Pay Wall is a problem, I can send a PDF.) In the situation you are modelling, I would be more inclined to attach the term "core group" to the most highly mobile people, regardless of where they live, rather than to the people who live in the slum. On the other hand, some measure of population density might be a better indicator of who is more vulnerable to infection, especially from people in the core group. It's interesting to ask to what degree the lockdown prevents transmission (by reducing the mobility of the most mobile) or protects the most vulnerable (by reducing mobility of those living in the densest or poorest neighborhoods).


                              Last edited by Mead Over; 11 Feb 2021, 12:53.
