Stratification for multiple linear regression

Maryam Kebbe

Join Date: Apr 2020

Posts: 14
#1

Stratification for multiple linear regression

17 Apr 2020, 13:20

Hello,

I am running a multiple linear regression analysis to examine associations between blood glucose (continuous) and saturated fats (meeting recommendations: yes/no [binary]). As shown in the code below, we are adjusting for a number of variables (condensed for the purposes here).

We would like to stratify by age (continuous) and sex (binary). I tried using egen mystrata = group(age sex), however get an error that the option mystrata is not allowed when I run the command.

Code:

regress blood_glucose i.SF_energy age i.sex i.ethnicity i.education_group i.smoke i.bmi_gp, mystrata cap drop myResiduals predict myResiduals, r sktest myResiduals

I was hoping to receiving guidance on:
1. the code to correctly stratify for age and sex, and
2. whether I would need to remove "age i.sex" from the above code when doing so

Thank you!
Tags: None
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#2

17 Apr 2020, 15:27

I can show you how to do what you say you want to do here, but I think it is ill-advised to do it.

Your age variable is continuous. Within your data set, there will only be a very small number of observations corresponding to any specific value of age, and typically only half of those will be of either sex. In short, you will be creating probably hundreds of strata each with at most a handful of observations. Your residuals will be almost pure noise as a result.

Stratifying on continuous variables almost always ends badly.

With regard to your second question, you could either remove age and i.sex from the stratified analysis or leave them in. If you leave them in, because in any stratum these will not actually vary, they will be colinear with the constant term, and Stata will remove them for you.
Comment
Maryam Kebbe

Join Date: Apr 2020

Posts: 14
#3

17 Apr 2020, 17:26

Dear Clyde - thank you for your response. I should have mentioned, we are working with quite a large dataset, with several thousand participants for each specific value of age!
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#4

17 Apr 2020, 18:16

Then in that case:

Code:

capture program drop one_stratum program define one_stratum regress blood_glucose i.SF_energy i.ethnicity i.education_group i.smoke i.bmi_gp predict myResidual, resid sktest myResidual gen p_skew = r(P_skew) gen p_curt = r(P_kurt) exit end runby one_stratum, by(age sex) status

-runby- is written by Robert Picard and me, and is available from SSC. Since it sounds like you have a large data set, I have specified the -status- option, so that -runby- will give you a periodic update of how many strata have been processed and an estimate of remaining time to completion.

You don't show in #1 what results from -sktest- you want to save: if you have lots of strata, just having them printed out on the screen as you go isn't going to be useful. To illustrate the approach, here I have saved the outputs that Stata calls Pr(Skewness) and Pr(Kurtosis) when it displays results. Other outputs are available--see what -sktest- does after -return list-.

One more unsolicited critique of what you are doing: since your data set is large, with thousands of observations in each stratum, the normality of residuals is irrelevant: the central limit theorem will drive the standard normal-theory statistics to be correct even if the residuals are not even close to normal.
Comment
Leonardo Guizzetti

Join Date: Jul 2016

Posts: 2402
#5

17 Apr 2020, 18:58

Out of curiosity, why do you feel the need to stratify your model on age groups at all with a dataset this large?
Comment
Maryam Kebbe

Join Date: Apr 2020

Posts: 14
#6

17 Apr 2020, 19:02

Thanks for this, Clyde. Sounds great all around. One more thing, I might be missing something obvious here, but when I run the provided code, I only get a brief table as the output summarizing number of by-groups, by-groups with errors (0), and by-groups with no data (0), and no info on the regression output itself. Am I to do something differently, in addition to running the provided code?

(As an aside, the variables p_skew and p_curt all had . missing values as well).

Leonardo, you bring up a good point - will likely create a binary category for this instead or stratify by sex alone.

Last edited by Maryam Kebbe; 17 Apr 2020, 19:24.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#7

17 Apr 2020, 20:09

Oh, I didn't do the code to save regression results: I saw your focus on the residuals and just generated those.

Code:

capture program drop one_stratum program define one_stratum regress blood_glucose i.SF_energy i.ethnicity i.education_group i.smoke i.bmi_gp levelsof SF_energy if e(sample), local(levels) foreach l of local levels { gen b`l'_sf_energy = _b[`l'.SF_energy] gen se`l'_sf_energy = _se[`l'.SF_energy] } predict myResidual, resid sktest myResidual gen p_skew = r(P_skew) gen p_curt = r(P_kurt) exit end runby one_stratum, by(age sex) status

Now you will get the coefficients of the SF_energy indicators and their standard errors saved in the data set. You can expand that to do the same thing for any other variables in your model whose coefficients you are truly interested in.

I have no idea why you are seeing all missing values for p_siew and p_curt. When I tested the code in a fake data set, it produced those results. If you want to pursue it, post back using the -dataex- command to show example data that reproduces your problem.
Comment

Maryam Kebbe

Join Date: Apr 2020
Posts: 14

18 Apr 2020, 08:16

*For the purposes of this discussion, I am switching from stratifying by age and sex to stratifying by bmi.*

The updated code works great when I apply it to my first variable (SF_energy) stratified by bmi_gp. When I go ahead and try to apply it to the rest of my independent variables (eg, fiber [code below]), I get the error message attached (same if I use "one_stratum" or "two_stratum", not sure if that makes a difference).

Code:

capture program drop two_stratum
program define two_stratum
    regress blood_glucose i.fiber age i.sex i.ethnicity i.education_group i.smoke i.bmi_gp
  levelsof fiber if e(sample), local(levels)
  foreach l of local levels {
      gen b`l'_fiber = _b[`l'.fiber]
      gen se`l'_fiber = _se[`l'.fiber]
  }
    predict myResidual, resid
    sktest myResidual
    gen p_skew = r(P_skew)
    gen p_curt = r(P_kurt)
    exit
end

runby two_stratum, by(bmi_gp) status

Below is a data set example.

Code:

input double blood_glucose float fiber int(age_at_recruit sex) float(ethnicity education_group smoke bmi_gp)
4.852 0 47 1 1 4 1 3
5.755 1 65 1 1 4 1 1
4.699 1 63 1 1 4 1 2
5.326 1 59 1 1 4 3 2
5.298 1 60 1 1 3 1 2
8.139 1 60 1 1 4 1 1
    . 1 53 1 1 4 1 3
5.158 1 62 1 1 4 2 1
5.322 1 63 1 1 4 1 3
5.125 1 64 1 1 4 1 1

On a different note, is there a way to add other typical stata regression outputs to the -runby- code (eg, R-squared, p-value, CIs)?

Attached Files

Last edited by Maryam Kebbe; 18 Apr 2020, 08:18.

Comment

Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#9

18 Apr 2020, 11:57

Well, whatever the problem is, it doesn't arise with your example data. The code runs and reports no errors, and produces the intended output.

Here's what I suggest. Something is going wrong inside program two_stratum (it doesn't matter what name you give it as long as you use the name consistently). So add the -verbose- option to the -runby- command. That way, you will see the output of program two_stratum, and, in particular, the error messages it generates. With that information you should be able to figure out why two_stratum doesn't work with your data and modify it accordingly.
Comment
Maryam Kebbe

Join Date: Apr 2020

Posts: 14
#10

21 Apr 2020, 09:44

I figured out the error (silly on my part - just needed to use different variable names since was running the code more than once). One last thing I'd appreciate your help with - could you please advise on how to add other typical Stata regression outputs to the -runby- code (eg, R-squared, p-value, CIs)?
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#11

21 Apr 2020, 11:57

Code:

gen r_squared = e(r2)

will get you the R². For the p-values and CIs there are two ways you can go. You can calculate them from the coefficients and standard errors (and this doesn't have to be in the program that -runby- calls: you can add this code after -runby-:

Code:

ds se*_* local std_errs `r(varlist)' foreach se of local std_errs { foreach stat in b z p ci_lb ci_ub { local `stat': subinstr local se "se" "`stat'" } gen `z' = `b'/`se' gen `p' = 2*normal(-abs(`z')) gen `ci_lb' = `b' - invnormal(0.975)*`se' gen `ci_ub' = `b' + invnormal(0.975)*`se' }

(Note: This code relies on your assertion that all of your regression samples are large: it treats the t-statistic and t-distribution as z-statistic and normal distribution. For a sample size of 60 or more this would not be problematic, and is probably OK even down to 30.) If you have strata smaller than that, then you need to also have program one_stratum save the estimation sample size and replace reference to the normal distribution by the corresponding t-distribution with the appropriate degrees of freedom.)

Alternatively, within program one_stratum, after the regression you can access out the matrix r(table) and pull all of these statistics, including the coefficients and standard errors, directly from it. The drawback to this approach is that you have to access that matrix through row and column numbers which requires you to either know the appropriate row and column numbers in advance or use the macro extended functions -rownumb- and -colnumb- to pull them out. Actually, I do this often enough that I know that the coefficients are in row 1, the standard errors in row 2, the t-statistics in row 3, the p-values in row 4, and the confidence bounds in rows 5 and 6. But the column numbers will depend on the ordering of your variables.
Comment
Maryam Kebbe

Join Date: Apr 2020

Posts: 14
#12

04 May 2020, 15:50

You've been very helpful, Clyde - thank you!
Comment
filippo dare

Join Date: Sep 2021

Posts: 15
#13

26 Oct 2021, 13:26

.

Last edited by filippo dare; 26 Oct 2021, 13:33.
Comment
George Ford

Join Date: Aug 2014

Posts: 3152
#14

27 Oct 2021, 18:36

Clyde's approach is a good one. But also try asreg (ssc install asreg), which will give you all you need as new variables. It is very fast with large datasets.
1 like
Comment

Announcement