Logistic regression with a small number experiencing the event

Fifi Maither

Join Date: Aug 2023

Posts: 1
#1


24 Aug 2023, 04:56

Hi

I have a sample of 834 school-aged children. I am trying to determine the predictors of dropping out of school. I have 18 independent variables in my logistic regression (pseudo R2 of 0.34). Only 54 participants have dropped out of school (.0482143 with a standard deviation of .2143144). I am unsure about the following.

(i) Is my sample size large enough to run this type of analysis
(ii) Is a standard logistic regression appropriate - should I use any of the small-sample regressions.
(iii) Do I have too many predictors in my regression - they have been included as they are all relevant predictors according to literature
(iv) Could I look at gender differences in predictors by running separate logistic regressions for each gender (This would mean the n for dropout=yes would be 32 for females with a sample size of 442 and a pseudo R2 of 0.4646)

Any advice will be appreciated.
Andrew Musau

Join Date: Oct 2014

Posts: 10084
#2

24 Aug 2023, 06:21

Originally posted by Fifi Maither

(i) Is my sample size large enough to run this type of analysis

Yes, \(N=834\) is a big enough sample.

(ii) Is a standard logistic regression appropriate - should I use any of the small-sample regressions.

\(54/834\approx 6.5\%\) does not count as a rare event, so regular logistic regression should be fine.

(iii) Do I have too many predictors in my regression - they have been included as they are all relevant predictors according to literature

Rule of thumb for logit is that you need 15 observations for every regressor, so you are well above the threshold \((834 > 18\times15= 270)\).

Could I look at gender differences in predictors by running separate logistic regressions for each gender

You may, then combine the results with suest and use test to check whether the coefficients differ. But it will be easier to interact your independent variables with gender. The coefficients on the interaction terms will tell you whether there are gender differences.

Code:

logit dv i.female##(c.iv1 c.iv2 ...)
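For the suest route mentioned above, a minimal sketch might look like the following. It uses the nlsw88 example data shipped with Stata purely as a stand-in (union as the outcome, south as the grouping variable, age and grade as predictors); none of these names come from the original question. It also assumes that suest labels the logit equations as storedname_depvar, which is worth confirming against the header of the suest output.

```stata
* Stand-in data and variable names (assumptions, not the poster's data)
sysuse nlsw88, clear

* Fit the same model separately in each group and store the results
logit union c.age c.grade if south == 1
estimates store s1
logit union c.age c.grade if south == 0
estimates store s0

* Combine the two fits so that cross-model tests use a joint covariance matrix
suest s1 s0

* Joint test that both slopes are equal across the two groups
* (equation names assumed to be s1_union and s0_union; check the suest header)
test [s1_union]age = [s0_union]age
test [s1_union]grade = [s0_union]grade, accumulate
```

The interaction model gives the same kind of comparison in a single fit, which is why it is usually the more convenient of the two.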
3 likes
Rich Goldstein

Join Date: Mar 2014

Posts: 4439
#3

24 Aug 2023, 06:49

I disagree with Andrew Musau on one issue: the rules of thumb for logistic regression that I am familiar with concern the number of events per regressor. You have 54 events for 18 regressors, and that might cause problems. Be sure to check the results carefully; I would also recommend bootstrapping.
4 likes
Andrew Musau

Join Date: Oct 2014

Posts: 10084
#4

24 Aug 2023, 07:16

Rich is correct: it is the number of events per regressor. Sorry for the confusion.
1 like
Bruce Weaver

Join Date: May 2014

Posts: 1119
#5

24 Aug 2023, 07:25

I agree with Rich Goldstein: for logistic regression, the limiting sample size is the number of events (or non-events, if that is smaller). Frank Harrell now recommends at least 20 events per variable (EPV). See this section of his Author Checklist. (Note that Harrell's 20:1 rule of thumb is about reducing the likelihood of overfitting, not about ensuring sufficient power to detect some smallest effect size of interest.)
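To put numbers on that, here is a quick sketch of the EPV arithmetic using only the figures given in #1 (54 events, 834 observations, 18 predictors). The scalar names are made up for illustration.

```stata
* Figures taken from post #1: 54 events among 834 children, 18 predictors
scalar n_events    = 54
scalar n_nonevents = 834 - 54
scalar n_pred      = 18

* The limiting sample size is the smaller of events and non-events
scalar n_limit = min(n_events, n_nonevents)
scalar epv     = n_limit / n_pred

display "Events per variable:                " epv
display "Events needed for 18 predictors:    " 20 * n_pred
display "Predictors supported by 54 events:  " floor(n_limit / 20)
```

By this reckoning the model has 3 events per variable, against the 20 that the rule of thumb asks for.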

See also these notes on analysis of rare events by Richard Williams.

Cheers,
Bruce

--
Bruce Weaver
Email: [email protected]
Version: Stata/MP 18.5 (Windows)
2 likes
Rich Goldstein

Join Date: Mar 2014

Posts: 4439
#6

24 Aug 2023, 08:00

Note that I agree with what Bruce Weaver says in general, but I think that the 20:1 rule is too arbitrary. I do agree that overfitting is a problem, which is why I suggested the bootstrap. Another alternative is -relogit- (user-written and available at SSC).
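As a rough illustration of the bootstrap suggestion, the sketch below resamples a logit fit and reports percentile intervals. The nlsw88 data and the variables union, age, grade and south are stand-ins shipped with Stata, not the poster's data, and the number of replications and the seed are arbitrary choices.

```stata
* Stand-in data and variable names (assumptions, not the poster's data)
sysuse nlsw88, clear

* Refit the logit in 200 bootstrap resamples and collect the coefficients
bootstrap _b, reps(200) seed(2023): logit union age grade south

* Percentile-based confidence intervals from the bootstrap distribution
estat bootstrap, percentile
```

With few events and many predictors, some resamples may fail to converge or show separation, so the count of completed replications in the output is worth a look.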
Joseph Coveney

Join Date: Apr 2014

Posts: 4374
#7

24 Aug 2023, 10:08

Originally posted by Fifi Maither

(i) Is my sample size large enough to run this type of analysis

Whether your sample size is large enough depends upon what effect size you're trying to detect. With your sample size and outcome rate, you'll have only about 30% power to detect an odds ratio of 2 (which many treat as a rule-of-thumb minimum effect size for logistic regression), but you'll have about 90% power to detect an odds ratio of 5, if that magnitude fits in with your research objective. See below for the results of the simulation exercise behind these values.

(ii) Is a standard logistic regression appropriate - should I use any of the small-sample regressions.

I'd go with Andrew's reply above.

(iii) Do I have too many predictors in my regression - they have been included as they are all relevant predictors according to literature

Others have raised the spectre of overfitting, but I think the danger there is greatest with a kitchen-sink approach to variable inclusion. Yours seems more principled than that.

It does seem as if your ratio of cases to predictors is a little low: the test size (Type I error rate) runs around 5.6% or 5.7%, a little higher than the nominal 5%. The estimate of effect size (regression coefficient) is likewise a tad inflated: a median estimated odds ratio of 2.1 against a true value of 2, and of 5.8 against a true value of 5.1 (target of 5).

(iv) Could I look at gender differences in predictors by running separate logistic regressions for each gender

Interactions require a much greater sample size than main effects do, but again, if you're willing to live with the limitations on power that you're saddled with, then you can certainly explore sex-related differences in the strength of association between the predictors and the outcome.

Here's the simulation:

.
. version 18.0

.
. clear *

.
. // seedem
. set seed 264129110

.
. quietly set obs 834

. generate byte sex = _n >= 442

. generate byte out = cond(!sex, _n <= 54 - 32, _n >= 834 - 32 + 1)

. tabulate sex out

           |          out
       sex |         0          1 |     Total
-----------+----------------------+----------
         0 |       419         22 |       441
         1 |       361         32 |       393
-----------+----------------------+----------
     Total |       780         54 |       834

.
. generate byte pr0 = 0

. forvalues i = 1/17 {
  2.     local ind : display %02.0f `i'
  3.     generate double pr`ind' = 0
  4. }

.
. program define true
  1.     version 18.0
  2.     syntax , Tru(name)
  3.
.     tempname od1
  4.     summarize pr0 if out, meanonly
  5.     scalar define `od1' = r(mean) / (1 - r(mean))
  6.     summarize pr0 if !out, meanonly
  7.     scalar define `tru' = `od1' / (r(mean) / (1 - r(mean)))
  8. end

.
. program define simem, rclass
  1.     version 18.0
  2.     syntax , [Odds(integer 2)]
  3.
.     replace pr0 = rbinomial(1, cond(out, `odds' / (`odds' + 1), 1/2))
  4.     tempname tru
  5.     true , t(`tru')
  6.
.     forvalues i = 1/17 {
  7.         local ind : display %02.0f `i'
  8.         replace pr`ind' = runiform()
  9.     }
 10.     logit out i.sex##(i.pr0 c.pr??)
 11.     return scalar pow = r(table)["pvalue", "out:1.pr0"] < 0.05
 12.     return scalar siz = r(table)["pvalue", "out:c.pr01"] < 0.05
 13.     return scalar est = exp(r(table)["b", "out:1.pr0"])
 14.     return scalar tru = `tru'
 15. end

.
. program define sumem
  1.     version 18.0
  2.     syntax [if]
  3.
.     summarize pow siz `if'
  4.     centile est tru `if'
  5.
. end

.
. frame create PowerAnalysis byte(tgt pow siz) double(est tru)

.
. // Odds ratio = 2
. forvalues rep = 1/3000 {
  2.     quietly simem
  3.     frame post PowerAnalysis (2) (r(pow)) (r(siz)) (r(est)) (r(tru))
  4. }

. frame PowerAnalysis: sumem

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
         pow |      3,000         .32    .4665539          0          1
         siz |      3,000        .057    .2318813          0          1

                                                          Binom. interp.
    Variable |       Obs  Percentile    Centile        [95% conf. interval]
-------------+-------------------------------------------------------------
         est |     3,000         50    2.095647        2.048841    2.142975
         tru |     3,000         50     1.98977        1.959391    2.020619

.
. // Odds ratio = 5
. forvalues rep = 1/3000 {
  2.     quietly simem , o(5)
  3.     frame post PowerAnalysis (5) (r(pow)) (r(siz)) (r(est)) (r(tru))
  4. }

. frame PowerAnalysis: sumem if tgt == 5

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
         pow |      3,000    .9043333    .2941826          0          1
         siz |      3,000        .056    .2299601          0          1

                                                          Binom. interp.
    Variable |       Obs  Percentile    Centile        [95% conf. interval]
-------------+-------------------------------------------------------------
         est |     3,000         50    5.777989        5.552121    5.954734
         tru |     3,000         50     5.09348               5    5.182768

.
. exit

end of do-file

.
1 like
Bruce Weaver

Join Date: May 2014

Posts: 1119
#8

24 Aug 2023, 10:49

Is anyone else getting hidden characters (as shown below) when they copy & paste Joseph Coveney's simulation code in #7?

Code:

.ÿ .ÿversionÿ18.0 .ÿ .ÿclearÿ* .ÿ .ÿ//ÿseedem .ÿsetÿseedÿ264129110 etc.

I have encountered this in previous posts by Joseph, btw.

--
Bruce Weaver
Email: [email protected]
Version: Stata/MP 18.5 (Windows)
2 likes
Joseph Coveney

Join Date: Apr 2014

Posts: 4374
#9

24 Aug 2023, 17:43

You'll need to take that up with the forum administrator.

At first, the forum software respected multiple consecutive white spaces, but somehow that got toggled off, so that now all consecutive white spaces are trimmed to a single white space. So-called hard spaces, such as ANSI 160 or the Unicode equivalent, do not overcome that. This distorts the display of Stata output.

Code delimiters work fine for code, but I distinguish between code, which a reader may copy and paste verbatim to run, and output, where there is no such expectation. If a forum user wants to take output and use it as code, then he or she will need to clean it up first in a text editor, regardless of whether there is a space-holder character.
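For anyone who would rather do that clean-up inside Stata than in a text editor, here is a small sketch. It assumes the space-holder arrives either as the character displayed in #8 (code point 255) or as a hard space (code point 160); the sample string is a made-up one-liner, not a real paste.

```stata
* Hypothetical sample standing in for one line pasted from post #7
local pasted = "." + uchar(255) + "set" + uchar(255) + "seed" + uchar(255) + "264129110"

* Swap each space-holder (code point 255) and hard space (160) for a plain space
local cleaned = subinstr("`pasted'", uchar(255), " ", .)
local cleaned = subinstr("`cleaned'", uchar(160), " ", .)

display "`cleaned'"
```

The leading dot and any log line numbers would still need to be stripped by hand before the line could be run as a command.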

Again, if it bothers you, then I recommend taking it up with the administrator and asking that the forum software restore its original behavior with respect to honoring what users type.
Bruce Weaver

Join Date: May 2014

Posts: 1119
#10

25 Aug 2023, 10:13

Thank you, Joseph. I have always used code delimiters when posting output, and have not noticed any particular problems in doing so. That makes me curious to know what problems you may have encountered. However, I do sense from your response in #9 that you may have no desire to comment further, and if so, that's fine.

PS- I just checked the FAQ to see if it says anything about using code delimiters for output, and it does not (currently).

--
Bruce Weaver
Email: [email protected]
Version: Stata/MP 18.5 (Windows)