Testing coefficients or predictions with > 500,000 unit fixed effects

Anup Malani

Join Date: Feb 2021

Posts: 5
#1


09 Feb 2021, 21:02

PROBLEM:

I am examining how mobility fell during a COVID lockdown in two types of communities, slums and non-slums. I have data on > 500,000 cell phones on a daily basis for several months. During that period, the government imposed a lockdown. For simplicity, assume that the data start on 1 March 2020 and the lockdown is 1 April - 30 April 2020, and I am only looking in those two months. I want to study if the effect of the lockdown differed across the two communities, but want to measure that in 2 ways.

1/ Was mobility lower in non-slums than slums during the lockdown?
2/ Did mobility decline more during lockdown in non-slums, measured as a percentage of pre-lockdown mobility?

In addition, I want to identify changes in mobility within devices (i.e., the phones). Importantly, each device is associated with a time-invariant community type (slum or non-slum).

And to make it even more complicated, the device id variable is a string.

I am using Stata 16.

ATTEMPTED SOLUTION:

Ordinarily, I would do the following:

reghdfe mobility lockdown if slum==0, a(device_id)
eststo nonslum
reghdfe mobility lockdown if slum==1, a(device_id)
eststo slum
suest nonslum slum, vce(cluster device_id)
test [slum_mean]lockdown = [nonslum_mean]lockdown
testnl _b[slum_mean:lockdown]/_b[slum_mean:_cons] = _b[nonslum_mean:lockdown]/_b[nonslum_mean:_cons]

But suest doesn't work with reghdfe.
So then I tried reg with factors.

egen cell = group(device_id)
reg mobility lockdown i.cell if slum==0
eststo nonslum
reg mobility lockdown i.cell if slum==1
eststo slum
suest nonslum slum, vce(cluster cell)

But I have > 500,000 devices. I can't set maxvar high enough.
So, I try using predict instead.

gen lockdown_slum = lockdown * slum
reghdfe mobility lockdown lockdown_slum, a(device_id)
predict nonslum_level if slum==0, xb
predict nonslum_level_se if slum==0, stdp
predict slum_level if slum==1, xb
predict slum_level_se if slum==1, stdp

The question is how to test the difference in predictions to test hypothesis 1. Moreover, how to test hypothesis 2?

Any advice would be much appreciated.
Mead Over

Join Date: Sep 2014

Posts: 110
#2

09 Feb 2021, 22:50

If you have not already done so, I would start with estimating a pooled regression using -regress- with factor variable notation like this:

Code:

regress mobility i.lockdown##i.slum

If the coefficient of your interaction term is statistically significant, that is evidence that the lockdown affects the mobility of phones classified as associated with the slums differently than it affects the mobility of other phones. Then the magnitude of the estimated coefficient will be interesting. Note that this approach obviates the need for -suest-.

Also what does the distribution of your dependent variable look like? Is it right-skewed? If very few phones have zero mobility, you might use the logarithm of mobility as your dependent variable rather than raw mobility. That will enable you to interpret the coefficient of the interaction term as a percentage difference in the effect of the lockdown on mobility instead of an absolute difference. The percentage difference might, in this case, be more meaningful.
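As a minimal sketch of the log specification suggested above, the following simulates a small panel and fits the pooled interaction model on log mobility. The data, the effect sizes, and the variable names cell, u, day and ln_mobility are all assumptions made for illustration; none of them come from the original data. With a log outcome, the interaction coefficient is the difference between the two groups' lockdown effects in log points, which is close to a percentage difference only when the effects are small.

```stata
* Simulated stand-in for the real data (all values hypothetical)
clear
set seed 12345
set obs 200                          // 200 devices
gen long cell = _n                   // numeric device identifier
gen byte slum = mod(_n, 2)           // time-invariant community type
gen double u = rnormal()             // device-level heterogeneity
expand 60                            // 60 daily observations per device
bysort cell: gen int day = _n
gen byte lockdown = day > 30         // lockdown covers days 31-60

* Lockdown lowers log mobility by 0.8 in non-slums and by 0.5 in slums
gen double mobility = exp(2 + 0.5*u - 0.8*lockdown + 0.3*lockdown*slum + 0.2*rnormal())
gen double ln_mobility = ln(mobility)

* Pooled regression; 1.lockdown#1.slum is the difference in log-point effects
regress ln_mobility i.lockdown##i.slum, vce(cluster cell)
```

Taking logs drops any observation with zero mobility, so this only makes sense when zeros are rare, as noted above.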

If these results look interesting, then you can attempt to control for other characteristics of the phones. For example, within the two groups of phones, can you use the first week of pre-lockdown data to distinguish those belonging to inhabitants who live in single family dwellings from those belonging to people living in multi-family dwellings? Or perhaps you can use the destinations of daily phone travel before the lockdown to classify all phones by how far their owners normally travel for work. Do you know which cell service provider is associated with each phone? (I'm thinking that cell provider or dwelling type might be rough proxies for household income.)

After experimenting with these other variables, and handling any memory problems that arise when you estimate this model with -regress-, you can consider whether you really need fixed effects. Using fixed effects is likely to obscure the relationship between mobility and other interesting variables like your slum dummy and the variables I have suggested. In particular, since the phone-specific fixed effects are grouped within your "slum" and "non-slum" categories, they will probably prevent your estimating an impact of that variable.

If you do decide you need fixed effects, before turning to -reghdfe-, you might want to experiment with Stata's -areg- command like this:

Code:

areg mobility i.lockdown##i.slum , absorb(device_id)

Like -reghdfe-, -areg- is very parsimonious with Stata's memory. Both programs accept a string variable in the -absorb()- option. -reghdfe- is essential if you have multiple fixed effects, such as -device_id-, -employer_id-, -household_id-, etc., especially if they are not nested. But for your basic problem, -areg- should work just as well. Sergio Correia, the author of -reghdfe-, uses -areg- to benchmark his program.

Last edited by Mead Over; 09 Feb 2021, 22:58.
Anup Malani

Join Date: Feb 2021

Posts: 5
#3

10 Feb 2021, 07:51

Thank you for this response, but it doesn't address my core question: how can one test if one type of device (located in slums) has different change in mobility than another group of devices when one is identifying change within device. Please assume I have the right econometric specification and that I have done robustness checks on the data. This includes identifying change w/o device fixed effects. I am interested in solving the Stata-specific problem.

Areg has the problem that it does not work with suest.

I also have the problem that my device_id is a string, and with 500k device IDs, I cannot use the usual approaches (outside areg and reghdfe) to use factor notation.
Andrew Musau

Join Date: Oct 2014

Posts: 10058
#4

10 Feb 2021, 08:07

Mead Over is spot on; it appears that you just do not follow what he is doing. To understand the use of interactions in comparing coefficients, see, for example, https://www.theanalysisfactor.com/co...-coefficients/.

I also have the problem that my device_id is a string and with 500k device id's

See

Code:

help encode
Stacy Rosenbaum

Join Date: Jun 2017

Posts: 10
#5

10 Feb 2021, 10:22

I don't have a solution to this problem, but I am compelled to point out that these responses are exactly why I am frequently hesitant to go to Statalist with questions: so many posters assume the user doesn't know what they're doing, and/or give irrelevant answers. They're not asking for your advice on other ways to analyze the data. They want to know how, in Stata, one can analyze it THIS way. Telling this person to run a different regression is not helpful (nor are links to college freshman-level webpages about interpreting coefficients).
FernandoRios

Join Date: Apr 2014

Posts: 2429
#6

10 Feb 2021, 10:44

Hi Anup,
there is actually a workaround for your problem, but it requires a few extra steps.
See the example below:

Code:

sysuse auto, clear
xtile qmpg=mpg, n(5)
reg price weight i.qmpg if foreign==0
est sto m1
reg price weight i.qmpg if foreign==1
est sto m2
suest m1 m2
hdfe price weight if foreign==0, abs(qmpg) gen(f0)
hdfe price weight if foreign==1, abs(qmpg) gen(f1)
reg f0price f0weight if foreign==0
est sto m3
reg f1price f1weight if foreign==1
est sto m4
suest m1 m2 m3 m4

So basically, you need to "demean" all your variables by the FE (using hdfe for example), and by your "treatment" group. Then estimate the models using the demeaned variables.
Above I provide the code using a dummy approach and using the demeaned variables. The final suest output should show you that the results are the same (once suest is used).

This requires the user-written command hdfe (ssc install hdfe)
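For readers without hdfe installed, the equivalence that the demeaning step relies on can be checked with built-in commands only. This is a sketch, not Fernando's code: the variables m_price, m_weight, dm_price and dm_weight and the scalar b_lsdv are names introduced here, and the group means are computed by hand with egen.

```stata
sysuse auto, clear
xtile qmpg = mpg, n(5)

* Dummy-variable estimate for the domestic cars
regress price weight i.qmpg if foreign==0
scalar b_lsdv = _b[weight]

* Manual within transformation: subtract the qmpg-group mean,
* computed over the domestic cars only
foreach v in price weight {
    bysort qmpg: egen double m_`v' = mean(cond(foreign==0, `v', .))
    gen double dm_`v' = `v' - m_`v' if foreign==0
}

* Same slope as the dummy-variable regression
regress dm_price dm_weight if foreign==0
display "LSDV slope: " b_lsdv "   demeaned slope: " _b[dm_weight]
```

Only the point estimates are guaranteed to match. The conventional standard errors from the second regression do not subtract the absorbed group means from the degrees of freedom, so they should not be read off directly.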
Hope it helps.
Fernando
Anup Malani

Join Date: Feb 2021

Posts: 5
#7

10 Feb 2021, 10:49

I tried encode, but it gives a "too many values" error. I then tried

egen cell = group(device_id)

At first it did not work because of a maxvar limit. I set maxvar at 120k, the max for my machine and my Stata version. In addition, I cut down the data set a bit (not ideal). Then egen group worked. So then I could run

reg mobility lockdown i.cell if slum==0

But given there are nearly 500k fixed effects, I get a "op. sys. refuses to provide memory" error.
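On the string-identifier step: as far as I know, encode fails here because a single value label can hold at most 65,536 mappings, whereas egen's group() function simply numbers the distinct strings and needs no labels. A tiny sketch with made-up IDs (the values dev1, dev2, dev3 are hypothetical):

```stata
clear
set obs 6
* Three hypothetical devices with two rows each: dev1 dev1 dev2 dev2 dev3 dev3
gen str10 device_id = "dev" + string(ceil(_n/2))

* Numeric identifier 1, 2, 3, ... in sort order of the strings;
* declared long so that large group counts are stored exactly
egen long cell = group(device_id)
list device_id cell
```

The resulting numeric variable can then be passed to absorb() in areg or reghdfe, which avoids creating one indicator per device.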

At this point, a natural question is what machine and Stata I have. I have a MacBook Pro 13" M1 with 16 GB RAM and a 1 TB drive. I am running Stata/MP 16.1 for Mac (Apple Silicon), Revision 21 Jan 2021. I'd be happy to jump on the University server if that will eliminate these problems.

Finally, another option is the bootstrap, I suppose. (Still working it out.) But the problem with the bootstrap is that even reghdfe takes some time to run. Doing 1000 reps will take a *long* time. Not saying I couldn't, but it's a long time.

And to answer (anticipated) econometric questions:

- I have already done non-FE solutions. Identification within device using FE is a robustness check. I have limited data on cell phone owner's features (other than home location), so FE picks up unobservable time-invariant differences across devices. For these two reasons, I really am interested in the FE model.

- Ideally I want to avoid running a random effects model because I want to allow the fixed effects to be correlated with other variables not mentioned to simplify the problem.

- While I estimate within device here, this is part of a battery of robustness tests. In others I test changes within uber H9 hexes (and there are > 100k of those.)

- I have also looked at different approaches to the data. E.g., look at a quantile reg. Perhaps that has to be done at a neighborhood level. But those are actually harder to do with lots of fixed effects.

I would be extremely happy to learn that I am making a mistake or missing an obvious solution. But I suspect it will be on the Stata side, not the econometric side.
Andrew Musau

Join Date: Oct 2014

Posts: 10058
#8

10 Feb 2021, 10:51

Replying to #5: Let us clarify a few things here: In #3, the OP points out his main issue:

Areg has the problem that it does not work with suest.

In #2, the first code shows how to overcome this. The link in #4 follows from the appearance that the OP did not understand the first piece of advice.

so many posters assume the user doesn't know what they're doing, and/or give irrelevant answers.

That the advice is irrelevant is your opinion, which you provide no evidence to justify. Part of the reason for participating in a public forum such as Statalist is not only to directly answer the questions asked, but also to give opinions on what one thinks is a better approach. That you do not like that people express their opinions is neither here nor there.

Last edited by Andrew Musau; 10 Feb 2021, 10:59.
Andrew Musau

Join Date: Oct 2014

Posts: 10058
#9

10 Feb 2021, 10:55

Re #7: You can use reghdfe with interactions. See an example in #5 of the following link and post back if you are unable to adapt the code.

https://www.statalist.org/forums/for...erent-outcomes
Anup Malani

Join Date: Feb 2021

Posts: 5
#10

10 Feb 2021, 11:03

Fernando (#6), this is an interesting approach. Thank you. But I wonder if it runs into the problem that

reg mobility lockdown i.cell if slum==0

runs into a "op. sys. refuses to provide memory" error because of the large number of FE. This leads me to ask: why do I need m1 and m2 in the suest command?
FernandoRios

Join Date: Apr 2014

Posts: 2429
#11

10 Feb 2021, 11:10

Hi Anup,
Sorry for the confusion. What I was showing in that example is that running the regression

Code:

reg price weight i.qmpg if foreign==0

and

Code:

reg f0price f0weight if foreign==0

were equivalent

You do not need to run this "reg mobility lockdown i.cell if slum==0"
Just follow the "demeaning" process.

Code:

hdfe mobility lockdown if slum==0, abs(cell) gen(f0)
hdfe mobility lockdown if slum==1, abs(cell) gen(f1)
reg f0mobility f0lockdown if slum==0
est sto m1
reg f1mobility f1lockdown if slum==1
est sto m2
suest m1 m2

HTH

Last edited by FernandoRios; 10 Feb 2021, 11:13.
Anup Malani

Join Date: Feb 2021

Posts: 5
#12

10 Feb 2021, 11:49

Andrew (#9), I am probably missing something, but here is my concern. Each device or cell phone in my sample is associated with a slum or non-slum neighborhood throughout the time period of the sample. Because of that, device_id spans the slum indicator, meaning I can't do interactions.

Without using Stata commands, here is the problem. I think what you are suggesting is a stacked regression implemented via interactions with the slum indicator.

mobility = a0 + a1 lockdown + a2 slum + a3 lockdown * slum + d_i + e

where d_i is a vector of device fixed effects. This cannot be estimated because d_i and slum are perfectly collinear. Perhaps the reason the stacked regression worked in your linked example is that the LHS variable was what differed in your example. So your group indicator was for which LHS variable there was. Your absorbed variable did not span a regressor. (Though I could be wrong about this, so you should correct me.)
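For reference, the within-device transformation makes explicit which terms the fixed effects remove. The notation is introduced here and is not from the thread: y_it is mobility of device i on day t, L_t is the lockdown indicator, S_i is the time-invariant slum indicator, and a bar denotes the mean over device i's observed days.

```latex
\begin{aligned}
y_{it} &= a_0 + a_1 L_t + a_2 S_i + a_3 L_t S_i + d_i + e_{it} \\
y_{it} - \bar{y}_i &= a_1 (L_t - \bar{L}_i) + a_3 S_i (L_t - \bar{L}_i) + (e_{it} - \bar{e}_i)
\end{aligned}
```

The terms a_0, a_2 S_i and d_i are constant within a device and drop out, so a_2 cannot be separated from the device effects. The product L_t S_i still varies within each slum device, so a_3 remains estimable provided slum devices are observed both before and during the lockdown.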

Andrew Musau

Join Date: Oct 2014
Posts: 10058

#13

10 Feb 2021, 12:26

Your model in #1 looks like Fernando's example in #6. Here is how you specify the model with interactions, with an added variable "displacement". Am I missing something?

Code:

sysuse auto ,clear
xtile qmpg=mpg, n(5)

reghdfe price weight disp if !foreign, absorb(qmpg)
reghdfe price weight disp if foreign, absorb(qmpg)
gen obs=_n
gen for1= 1.foreign
gen for0=0.foreign
reghdfe price i.foreign#(c.weight c.disp), a(i.qmpg#c.for0 i.qmpg#c.for1) cluster(obs)
test 0b.foreign#c.weight= 1.foreign#c.weight

Res.:

Code:

. reghdfe price weight disp if !foreign, absorb(qmpg)
(MWFE estimator converged in 1 iterations)

HDFE Linear regression                            Number of obs   =         52
Absorbing 1 HDFE group                            F(   2,     45) =      16.56
                                                  Prob > F        =     0.0000
                                                  R-squared       =     0.6374
                                                  Adj R-squared   =     0.5891
                                                  Within R-sq.    =     0.4240
                                                  Root MSE        =  1985.2890

------------------------------------------------------------------------------
       price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      weight |    2.85163   1.005773     2.84   0.007     .8258983    4.877362
displacement |   10.95874   6.659186     1.65   0.107    -2.453549    24.37103
       _cons |  -5947.947   2535.947    -2.35   0.023    -11055.61   -840.2874
------------------------------------------------------------------------------

Absorbed degrees of freedom:
-----------------------------------------------------+
 Absorbed FE | Categories  - Redundant  = Num. Coefs |
-------------+---------------------------------------|
        qmpg |         5           0           5     |
-----------------------------------------------------+

. 
. reghdfe price weight disp if foreign, absorb(qmpg)
(MWFE estimator converged in 1 iterations)

HDFE Linear regression                            Number of obs   =         22
Absorbing 1 HDFE group                            F(   2,     15) =       7.27
                                                  Prob > F        =     0.0062
                                                  R-squared       =     0.8871
                                                  Adj R-squared   =     0.8419
                                                  Within R-sq.    =     0.4922
                                                  Root MSE        =  1042.3941

------------------------------------------------------------------------------
       price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      weight |   4.121708   2.163195     1.91   0.076    -.4890323    8.732449
displacement |   1.181057    38.3645     0.03   0.976    -80.59094    82.95306
       _cons |  -3292.185   2548.395    -1.29   0.216    -8723.962    2139.591
------------------------------------------------------------------------------

Absorbed degrees of freedom:
-----------------------------------------------------+
 Absorbed FE | Categories  - Redundant  = Num. Coefs |
-------------+---------------------------------------|
        qmpg |         5           0           5     |
-----------------------------------------------------+

. 
. gen obs=_n

. 
. gen for1= 1.foreign

. 
. gen for0=0.foreign

. 
. reghdfe price i.foreign#(c.weight c.disp), a(i.qmpg#c.for0 i.qmpg#c.for1) cluster(ob
> s)
(warning: no intercepts terms in absorb(); regression lacks constant term)
(MWFE estimator converged in 2 iterations)

HDFE Linear regression                            Number of obs   =         74
Absorbing 2 HDFE groups                           F(   4,     60) =       7.64
Statistics robust to heteroskedasticity           Prob > F        =     0.0000
                                                  R-squared       =     0.9438
                                                  Adj R-squared   =     0.9307
                                                  Within R-sq.    =     0.4305
Number of clusters (obs)     =         74         Root MSE        =  1796.5733

                                       (Std. Err. adjusted for 74 clusters in obs)
----------------------------------------------------------------------------------
                 |               Robust
           price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-----------------+----------------------------------------------------------------
foreign#c.weight |
       Domestic  |    2.85163   1.275207     2.24   0.029     .3008353    5.402425
        Foreign  |   4.121708   2.524503     1.63   0.108    -.9280494    9.171466
                 |
         foreign#|
  c.displacement |
       Domestic  |   10.95874   7.517157     1.46   0.150    -4.077813    25.99529
        Foreign  |   1.181057   40.89143     0.03   0.977    -80.61399     82.9761
----------------------------------------------------------------------------------

Absorbed degrees of freedom:
-----------------------------------------------------+
 Absorbed FE | Categories  - Redundant  = Num. Coefs |
-------------+---------------------------------------|
 qmpg#c.for0 |         5           0           5    ?|
 qmpg#c.for1 |         5           0           5    ?|
-----------------------------------------------------+
? = number of redundant parameters may be higher

. 
. test 0b.foreign#c.weight= 1.foreign#c.weight

 ( 1)  0b.foreign#c.weight - 1.foreign#c.weight = 0

       F(  1,    60) =    0.20
            Prob > F =    0.6550

.


Mead Over

Join Date: Sep 2014

Posts: 110
#14

10 Feb 2021, 15:03

Anup may feel that @FernandoRios's example in #11 and @AndrewMusau's example in #13 are not pertinent because the variable analogous to Anup's fixed effect, -qmpg-, does not perfectly predict the dummy variable -foreign- the way Anup's cell phone -device_id- perfectly predicts his dummy variable -slum-. Here is slightly tweaked code showing that @AndrewMusau's approach works even when the fixed effect is perfectly correlated with the dummy variable.

The analogy with Anup's example is:
price -> mobility
foreign -> slum
weight -> lockdown
mfg -> device_id (In the code below, the variable -mfg- is constructed to perfectly predict -foreign-)

Code:

sysuse auto, clear
gen str_mfg = cond(strpos(make," "), substr(make,1,strpos(make," ")-1),make)
encode str_mfg, gen(mfg)
reghdfe price weight disp if !foreign, absorb(mfg)
reghdfe price weight disp if foreign, absorb(mfg)
gen obs=_n
gen for1= 1.foreign
gen for0=0.foreign
reghdfe price i.foreign#(c.weight c.disp), a(i.mfg#c.for0 i.mfg#c.for1) cluster(obs)
test 0b.foreign#c.weight= 1.foreign#c.weight

This approach avoids the -suest- command by imposing the constraint that the disturbance term in the model has the same variance in both the -slum- and -!slum- groups of cell phones, a not unreasonable assumption. Furthermore, this single pooled (or stacked) approach has the advantage that the hypothesis test is nested. I think the -suest- command, on the other hand, appeals to the theory of non-nested hypothesis testing, which has always seemed to me to rest on stronger and less plausible assumptions.

But if this single equation approach still runs into memory management problems, I don't know what to advise. Perhaps -reghdfe-'s author @SergioCorreia could help with that.
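Translated to the variable names in #1, the pooled approach might look like the sketch below. It runs on simulated data (the sample size, effect sizes, and the variables cell, d_i and day are assumptions for illustration) and uses the built-in -areg- so that it works without add-ons; the same varlist and absorb() should carry over to -reghdfe-. Because each device belongs to exactly one community type, absorbing the device identifier already gives each group its own set of device intercepts, so the separate for0/for1 terms used with -qmpg- are not needed here.

```stata
* Simulated panel: 200 devices observed for 60 days (all values hypothetical)
clear
set seed 2021
set obs 200
gen long cell = _n                   // numeric device identifier
gen byte slum = mod(_n, 2)           // time-invariant community type
gen double d_i = rnormal()           // device fixed effect
expand 60
bysort cell: gen int day = _n
gen byte lockdown = day > 30         // lockdown covers days 31-60

* Lockdown effect is -0.8 in non-slums and -0.5 in slums
gen double mobility = 5 + d_i - 0.8*lockdown + 0.3*lockdown*slum + rnormal(0, 0.2)

* One lockdown coefficient per community type, device fixed effects absorbed
areg mobility i.slum#c.lockdown, absorb(cell) vce(cluster cell)

* Test whether the lockdown effect is the same in slums and non-slums
test 0b.slum#c.lockdown = 1.slum#c.lockdown
```

This tests the difference in the absolute lockdown effects. The percentage comparison in #1 would need a different outcome scale or a nonlinear test, which this sketch does not attempt.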
Mead Over

Join Date: Sep 2014

Posts: 110
#15

11 Feb 2021, 12:50

@AnupMalani : In your post #7, you say:

- While I estimate within device here, this is part of a battery of robustness tests. In others I test changes within uber H9 hexes (and there are > 100k of those.)

This makes me wonder if it would be possible to merge your data with variables from other geospatial databases that are specific to the hex unit. Number of Uber trips or requests? Number of people in the associated census tract? If each hex is located on approximately the same number of square-meters, could the number of cell-phones with residences in the same hex be used as a proxy for population density? I'm just thinking that you might be able to get a variable or two that could represent poverty in a more interesting way than with a dummy variable. Then instead of assuming that there are only two types of people, those in the slum and those outside it, you could test that hypothesis.

By the way, as a health economist who has worked on infectious diseases, I find your research objective and eventual results quite interesting. As I'm sure you are aware, the question of whether the lockdown effectively restricts the mobility of people who live in the poorest part of the city reflects both the effectiveness of the lockdown policy to control Covid transmission and also the economic necessity that prevents the poorest people from "sheltering in place".

I'm not sure which discipline you call home, but if you are not an epidemiologist you might not be aware of the literature on "core groups". This July 2020 op-ed by Jeffrey Klausner is a good introduction to the idea with application to COVID. And here's a paper I published applying the concept to HIV prevention. (If the paywall is a problem, I can send a PDF.) In the situation you are modelling, I would be more inclined to attach the term "core group" to the most highly mobile people, regardless of where they live, rather than to the people who live in the slum. On the other hand, some measure of population density might be a better indicator of who is more vulnerable to infection, especially from people in the core group. It's interesting to ask to what degree the lockdown prevents transmission (by reducing the mobility of the most mobile) or protects the most vulnerable (by reducing mobility of those living in the densest or poorest neighborhoods).

Last edited by Mead Over; 11 Feb 2021, 12:53.