  • multilevel model with large data, any quick solutions?

    Dear Stata users,

    I am interested in estimating a multilevel model that includes variables at different levels (e.g., individual store-, parent firm-, industry-, and zip-code levels). The sample is fairly large, with more than 2 million observations at the store level across different firms. Is there any command that allows me to estimate a fixed-effects multilevel model quickly? Thank you.

  • #2
    My advice is to take up Russian novels. In other threads on this Forum I have occasionally referred to analyses I do that take weeks to run. This is precisely that type of situation.

    Perhaps more helpful, if they apply to your situation:

    1. It is generally not useful to include random effects at levels where the number of distinct values is small in the first place, and each level in the model greatly adds to estimation complexity and run time. If you have only a small number of industries, for example, it makes more sense to include i.industry as a covariate in the fixed-effects part of the model rather than adding a || industry: level. Doing that will also make the model run much faster (see the sketch after this list).

    2. If all your predictor variables are discrete and the model is logistic, you can get dramatic speed-ups by aggregating the data (with -collapse-, for example) so that all observations with the same values of all predictors and the outcome are combined into a single observation, and then using -melogit-'s -binomial()- option to tell it how many original observations each aggregated observation represents.

    3. In general, -mixed- is faster than any of the non-linear multilevel models. If you have a dichotomous outcome that is an artificial dichotomization of an underlying continuous variable, not only will the modeling be better if you use the underlying continuous variable as outcome instead, it will also be much faster. Even if the outcome is truly inherently discrete, if a linear probability model would be reasonable, it will be faster than logistic. If you have a count outcome, consider whether -mixed- can serve your needs just as well as -mepoisson- or -menbreg-. If using -mixed- won't seriously mis-specify the model, you will find it is faster.
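
    A minimal sketch of point 1, using hypothetical names: an outcome y, store-level covariates x1 and x2, a firm identifier firmid, and an industry variable with only a few distinct values:

    Code:
    * slower: industry included as its own random-effects level (firms nested in industries)
    mixed y x1 x2 || industry: || firmid:
    
    * faster: the few industries absorbed as fixed effects, firm kept as the random level
    mixed y x1 x2 i.industry || firmid:
    The same i.industry device carries over to -melogit- and the other me-commands; and, per point 3, if a linear (probability) model is defensible for the outcome, the -mixed- call above will also run much faster than the corresponding -melogit-.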

    • #3
      Dear Clyde,

      Your response is appreciated, and sorry it took me so long to reply; I was distracted by non-academic issues over the past few months. The problem I am facing is a bit tricky: I only have access to an aggregated outcome (say, a firm's total production during each quarter), while my focal explanatory variable is at a more granular level (say, the number of employees reported at each store each week). I am still thinking about the most appropriate way to study the relationship between human capital input (i.e., number of employees) and firm production. I can think of two options (sketched in code below):
      1. Aggregate the weekly data to quarterly data so that the outcome and explanatory variables are at the same level.
      2. Attach the end-of-quarter outcome to each weekly observation and study the relationship at the weekly level.
      I am eager to know what your recommendation would be. Thanks.
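
      A minimal data-management sketch of the two arrangements, using hypothetical file and variable names (weekly.dta holding firmid, quarter, week, and employees; firm_quarterly.dta holding firmid, quarter, and output):

      Code:
      * Option 1: aggregate the weekly predictor to the firm-quarter level,
      * then merge on the quarterly outcome
      use weekly, clear
      collapse (sum) employees, by(firmid quarter)
      merge 1:1 firmid quarter using firm_quarterly, keep(match) nogenerate
      
      * Option 2: keep the weekly observations and attach the quarterly outcome
      * to every week of that firm-quarter
      use weekly, clear
      merge m:1 firmid quarter using firm_quarterly, keep(match) nogenerate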

      • #4
        I do note that version 17 has improved the speed of at least some multi-level models; do you have version 17?

        • #5
          Originally posted by Rich Goldstein View Post
          I do note that version 17 has improved the speed of at least some multi-level models; do you have version 17?
          Thanks, I'm still using version 16. Do you mind telling me which command(s) you are referring to?

          • #6
            The only ones I have used so far are -mixed- and -melogit-; it may well be true of others as well.

            • #7
              Any suggestions on the following would be appreciated:

              The problem I am facing is a bit tricky: I only have access to an aggregated outcome (say, a firm's total production during each quarter), while my focal explanatory variable is at a more granular level (say, the number of employees reported at each store each week). I am still thinking about the most appropriate way to study the relationship between human capital input (i.e., number of employees) and firm production. I can think of two options:
              1. Aggregate the weekly data to quarterly data so that the outcome and explanatory variables are at the same level.
              2. Attach the end-of-quarter outcome to each weekly observation and study the relationship at the weekly level.
              I am eager to know what your recommendation would be. Thanks.

              • #8
                Either approach is acceptable. What is most important is to understand that they are different and that the results require different interpretations.

                In your original approach of aggregating the weekly explanatory variable up to quarters, you are smoothing out a great deal of noise, and thereby endowing the resulting variable (the mean, sum, or whatever aggregate statistic you used of the weekly employee counts) with higher reliability than the original weekly variable has. In particular, compared to the "ideal" analysis in which weekly production (which is not available in real life) is regressed on weekly employees, the effect of aggregated quarterly workforce on quarterly output will be larger, probably a good deal larger, since there are 13 weeks in a quarter.

                By contrast, following the reviewer's proposal, you will be creating a synthetic outcome variable whose low variability is illusory. Moreover, it is intuitive that in most situations a weekly workforce measure cannot be as strongly predictive of a quarterly outcome as the quarterly workforce measure would be, if only because the weekly measure's effects get diluted by the effects of the other weeks in the quarter. So this analysis of a weekly predictor against a quarterly pseudo-outcome will produce a substantial underestimate of the effect seen in a model of a quarterly predictor and a genuine quarterly outcome. Less easy to see intuitively, but not hard to derive mathematically, is that using the quarterly outcome as if it were weekly, thereby understating its variation, also results in underestimating the regression coefficients compared to what would be observed in a weekly-on-weekly analysis (were weekly outcome data available).

                So, in short, neither approach is wrong, and neither approach is ideal. Each needs to be carefully interpreted in light of the different meanings of the variables in the analyses. And whichever approach is used, its limitations should be explicitly discussed.

                • #9
                  This is extremely helpful, thank you so much Clyde!

                  • #10
                    Originally posted by Clyde Schechter View Post
                    In particular, compared to the "ideal" analysis in which weekly production (which is not available in real life) is regressed on weekly employees, the effect of aggregated quarterly workforce on quarterly output will be larger, probably a good deal larger since there are 13 weeks in a quarter.
                    Hi Clyde,
                    I wonder whether there is any way to prove that "the effect of aggregated quarterly workforce on quarterly output will be larger" compared to the effect from a weekly-on-weekly analysis? I have been reflecting on your answer and am not sure how to prove this. Thanks!

                    • #11
                      Hmm, thanks for challenging me on that. It isn't provable because it isn't true. I was thinking that the quarterly aggregates would be less noisy than the weekly data and would therefore provide stronger associations, but that is too glib. What is true, though not of much practical value, is that while on average the correlation between the weekly values of x and y will be the same as the correlation between the quarterly values of x and y, the quarterly correlations will exhibit higher variance: you will see more extremely high or extremely low quarterly correlations than weekly correlations.

                      Also true, and perhaps of some value in the context of this thread, is how correlating the weekly values of x and y compares with correlating the quarterly aggregated X (applied to each week's observation) against the weekly values of y: the correlation of the weekly variables will be appreciably stronger (in expectation) than the correlation of the aggregated quarterly X with the weekly y, because the quarterly aggregate, being constant within a quarter, has less opportunity to co-vary with y than its weekly version does. Here's an example, where the "true" correlation between weekly x and y is 0.2:

                      Code:
                      clear *
                      
                      set seed 123
                      matrix C = (1, 0.2 \ 0.2, 1)
                      
                      // frame to collect the three correlations from each replication:
                      // weekly x vs weekly y, quarterly X vs weekly y, quarterly X vs quarterly Y
                      frame create results float(r_ww r_qw r_qq)
                      
                      forvalues i = 1/1000 {
                          quietly {
                              // 520 weeks = 40 "quarters" of 13 weeks each
                              drawnorm x y, corr(C) cstorage(full) n(520)
                              gen long group = floor((_n-1)/13) + 1
                              // weekly x vs weekly y
                              corr x y
                              local topost (`r(rho)')
                              // quarterly total of x, kept only on the last week of each quarter
                              by group, sort: gen X = sum(x)
                              by group: replace X = . if _n < _N
                              corr X y
                              local topost `topost' (`r(rho)')
                              // quarterly totals of both x and y
                              collapse (sum) X = x Y = y, by(group)
                              corr X Y
                              local topost `topost' (`r(rho)')
                              frame post results `topost'
                              clear
                          }
                      }
                      
                      frame change results
                      summ r_*
                      You can play with the parameters of that and you will see that the pattern is quite consistent.
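
                      For reference, here is a short check of why the simulation comes out this way, under its own assumptions (weeks mutually independent, unit variances, within-week correlation $\rho = 0.2$). Writing $X = \sum_{j=1}^{13} x_j$ and $Y = \sum_{j=1}^{13} y_j$ for the quarterly sums,

                      $$\operatorname{Corr}(X, y_w) = \frac{\operatorname{Cov}\big(\sum_j x_j,\, y_w\big)}{\sqrt{\operatorname{Var}(X)\operatorname{Var}(y_w)}} = \frac{\rho}{\sqrt{13}} \approx 0.055, \qquad \operatorname{Corr}(X, Y) = \frac{13\rho}{\sqrt{13\cdot 13}} = \rho = 0.2,$$

                      so r_ww and r_qq should both average about 0.2, while r_qw should average about 0.055.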

                      My apologies for the original mis-statement and for the time you have wasted trying to concoct a proof of it.

                      • #12
                        Dear Clyde,

                        I wanted to express my gratitude for your informative thread, as it has provided me with valuable insights. However, I do have a quick question regarding the second potential solution you proposed. I have attempted to use the "collapse" command by referring to the help file, but I am struggling to figure out how to combine all observations with the same values of all predictors. When I omit the collapse statistic, it defaults to the mean, and I am uncertain whether the (sum) statistic is appropriate in conjunction with "melogit, binomial()". In my search for a solution, I stumbled upon the "contract" command, which appears to combine all observations with the same values of all predictors. Would it be possible to achieve the same result using "collapse"?

                        Thank you in advance for your assistance.
                        __________________________________________________

                        Cheers, Felix
                        Stata Version: MP 18.0
                        OS: Windows 11

                        • #13
                          Your question can't be answered without more detailed information about what you want to do with the combined data. For different purposes, different collapse statistics would be appropriate. Just knowing that you plan to use -melogit, binomial- doesn't determine the answer.

                          Be aware that -contract- simply gives a count of the number of observations having those specific values for all predictors. That -contract- result would give you an appropriate variable to include in the -binomial()- option of -melogit-, but might be wrong for other variables needed for your -melogit-.

                          -contract- can be emulated with -collapse- as follows:
                          Code:
                          gen guaranteed_not_missing = 1
                          collapse (count) freq = guaranteed_not_missing, by(list_of_predictors)
                          And if your data set already contains some variable that never has missing values, you don't even need to create that guaranteed_not_missing variable: just use that never-missing variable in the code instead.
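
                          For comparison, the direct -contract- call (with the same hypothetical list_of_predictors placeholder) would be:
                          Code:
                          contract list_of_predictors, freq(freq)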

                          That said, if what you want is a count of the number of observations for each combination of predictor values, -contract- will be faster if your data set is large. (-collapse- has a lot of overhead that interprets the command to figure out exactly what statistics you want, whereas -contract-, offering no choice in the matter, need not waste its time on that.)

                          But to return to your original question, if you describe in greater detail what you are trying to do, I will try to advise.

                          Added: Without retracting what I said about there being many possibilities, let me guess what you are trying to do and show you how to do it. The most common situation, though not the only one, where a question like yours might arise is this. You have a data set with a large number of observations, and you need to do a multi-level logistic regression with a dichotomous outcome variable. You also notice that the number of different combinations of the explanatory variables in the data is substantially smaller than the number of observations, so you are interested in speeding up the calculations by reducing the size of the data set, grouping observations according to the combinations of the explanatory variables and performing the -melogit- with that. If that is what you are trying to do, then the approach is:
                          Code:
                          collapse (sum) outcome (count) denominator = outcome, by(explanatory_variables random_effect_variables)
                          melogit outcome explanatory_variables || random_effects_equations, binomial(denominator)
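                          As a concrete (purely hypothetical) instantiation, with a 0/1 outcome y, discrete predictors x1 and x2, and a firm-level random intercept, that recipe becomes:
                          Code:
                          collapse (sum) y (count) denom = y, by(x1 x2 firmid)
                          melogit y x1 x2 || firmid:, binomial(denom)
                          Here the collapsed y holds the number of successes in each cell and denom the number of original observations the cell represents.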
                          Last edited by Clyde Schechter; 16 Mar 2023, 09:45.
