
  • Logistic regression with a small number experiencing the event

    Hi

    I have a sample of 834 school-aged children. I am trying to determine the predictors of dropping out of school. I have 18 independent variables in my logistic regression (pseudo R2 of 0.34). Only 54 participants have dropped out of school (.0482143 with a standard deviation of .2143144). I am unsure about the following.


    (i) Is my sample size large enough to run this type of analysis
    (ii) Is a standard logistic regression appropriate - should I use any of the small-sample regressions.
    (iii) Do I have too many predictors in my regression - they have been included as they are all relevant predictors according to literature
    (iv) Could I look at gender differences in predictors by running separate logistic regressions for each gender (This would mean the n for dropout=yes would be 32 for females with a sample size of 442 and a pseudo R2 of 0.4646)

    Any advice will be appreciated.







  • #2
    Originally posted by Fifi Maither
    (i) Is my sample size large enough to run this type of analysis
    Yes, \(N=834\) is a big enough sample.

    (ii) Is a standard logistic regression appropriate - should I use any of the small-sample regressions.
    \(54/834\approx 6.5\%\) does not count as a rare event, so regular logistic regression should be fine.

    (iii) Do I have too many predictors in my regression - they have been included as they are all relevant predictors according to literature
    Rule of thumb for logit is that you need 15 observations for every regressor, so you are well above the threshold \((834 > 18\times15= 270)\).

    Could I look at gender differences in predictors by running separate logistic regressions for each gender
    You may, then combine the results with suest and use test to check whether the coefficients differ. But it will be easier to interact your independent variables with gender. The coefficients on the interaction terms will tell you whether there are gender differences.

    Code:
    logit dv i.female##(c.iv1 c.iv2 ...)
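
For comparison, here is a minimal sketch of the suest route next to the interaction route. It is only an illustration: it assumes Stata's example dataset lbw (loaded with webuse, which needs internet access) as a stand-in, with low as the outcome, smoke as the grouping variable, and age and lwt as predictors. None of these names come from the original poster's data.

```stata
* Stand-in data: Stata's low-birthweight example dataset (not the dropout data)
webuse lbw, clear

* Fit the same model separately in each group and store the results
logit low age lwt if smoke == 0
estimates store nonsmoker
logit low age lwt if smoke == 1
estimates store smoker

* Combine the two sets of estimates (suest reports robust standard errors)
suest nonsmoker smoker

* Wald tests of whether each coefficient differs between the groups
test [nonsmoker_low]age = [smoker_low]age
test [nonsmoker_low]lwt = [smoker_low]lwt

* Interaction route: one model, with group differences on the interaction terms
logit low i.smoke##(c.age c.lwt)
testparm i.smoke#c.age i.smoke#c.lwt
```

The two routes should give the same point estimates for the group differences; the test statistics can differ somewhat because suest uses a robust variance estimator while logit by default does not.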




    • #3
      I disagree with Andrew Musau on one issue: the various rules of thumb for logistic regression that I am familiar with concern the number of events per regressor. You have 54 events for 18 regressors, and that might cause problems. Be sure to check the results carefully; I would also recommend bootstrapping.



      • #4
        Rich is correct, it is the number of events per regressor. Sorry for the confusion.



        • #5
          I agree with Rich Goldstein: For logistic regression, the limiting sample size is the number of events (or non-events if that is smaller). Frank Harrell now recommends at least 20 events-per-variable (EPV). See this section of his Author Checklist. (Note that Harrell's 20:1 rule of thumb is about reducing the likelihood of overfitting, not about ensuring sufficient power to detect some smallest effect size of interest.)
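
As a rough check, the arithmetic for this thread can be sketched as follows. It uses only the figures reported in #1 (54 events, 780 non-events, 18 regressors) and ignores any extra parameters from categorical predictors or interactions, which would lower the ratio further.

\[
\text{EPV} = \frac{\min(54,\ 780)}{18} = \frac{54}{18} = 3,
\qquad
\frac{54}{20} = 2.7
\]

That is, the model as described has about 3 events per variable, and a 20:1 rule would support only two or three predictors with 54 events.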

          See also these notes on analysis of rare events by Richard Williams.

          Cheers,
          Bruce
          --
          Bruce Weaver
          Email: [email protected]
          Version: Stata/MP 18.5 (Windows)



          • #6
            Note that I agree with what Bruce Weaver says in general, but I think that the 20:1 rule is too arbitrary. I do agree that overfitting is a problem, which is why I suggested the bootstrap. Another alternative is -relogit- (user-written and available at SSC).
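
A minimal sketch of what bootstrapping a logit model might look like in Stata. It assumes the auto dataset shipped with Stata as a stand-in (outcome foreign, predictors mpg and weight); the number of replications and the seed are arbitrary choices for illustration, not recommendations from this thread.

```stata
* Stand-in data: the auto dataset shipped with Stata (not the dropout data)
sysuse auto, clear

* Conventional model-based standard errors, for comparison
logit foreign mpg weight

* Same model, resampling observations with replacement;
* reps() and seed() values are arbitrary
bootstrap, reps(500) seed(2024): logit foreign mpg weight

* Percentile confidence intervals from the bootstrap replicates
estat bootstrap, percentile
```

Comparing the bootstrap standard errors and intervals with the model-based ones gives a sense of how stable the estimates are. With few events, some resamples may fail to converge, so it is worth checking how many replications completed.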



            • #7
              Originally posted by Fifi Maither
              (i) Is my sample size large enough to run this type of analysis
              Whether your sample size is large enough depends upon what effect size you're trying to detect. With your sample size and outcome rate, you'll have only about 30% power to detect an odds ratio of 2 (regarded by many as a rule-of-thumb minimum effect size for logistic regression), but you'll have about 90% power to detect an odds ratio of 5, if that magnitude fits in with your research objective. See below for the results of the simulation exercise behind these values.

              (ii) Is a standard logistic regression appropriate - should I use any of the small-sample regressions.
              I'd go with Andrew's reply above.

              (iii) Do I have too many predictors in my regression - they have been included as they are all relevant predictors according to literature
              Others have raised the spectre of overfitting, but I think the danger there is greatest with a kitchen-sink approach to variable inclusion. Yours seems more principled than that.

              It does seem as if your ratio of cases to predictors is a little low: the test size (Type I error rate) runs around 5.6% or 5.7%, a little higher than nominal. The estimate of effect size (regression coefficient) likewise is a tad inflated: a median estimated odds ratio of about 2.1 against a true value of 2.0, and about 5.8 against a true value of 5.1 (target of 5).

              (iv) Could I look at gender differences in predictors by running separate logistic regressions for each gender
              Interactions require a much greater sample size than for the main effects, but again if you're willing to live with the limitations on power that you're saddled with, then you can certainly explore sex-related differences in strength of association of the predictors and outcome.

              Here's the simulation:

              .
              . version 18.0

              .
              . clear *

              .
              . // seedem
              . set seed 264129110

              .
              . quietly set obs 834

              . generate byte sex = _n >= 442

              . generate byte out = cond(!sex, _n <= 54 - 32, _n >= 834 - 32 + 1)

              . tabulate sex out

                         |          out
                     sex |         0          1 |     Total
              -----------+----------------------+----------
                       0 |       419         22 |       441
                       1 |       361         32 |       393
              -----------+----------------------+----------
                   Total |       780         54 |       834

              .
              . generate byte pr0 = 0

              . forvalues i = 1/17 {
                2.     local ind : display %02.0f `i'
                3.     generate double pr`ind' = 0
                4. }

              .
              . program define true
                1.     version 18.0
                2.     syntax , Tru(name)
                3.
              .     tempname od1
                4.     summarize pr0 if out, meanonly
                5.     scalar define `od1' = r(mean) / (1 - r(mean))
                6.     summarize pr0 if !out, meanonly
                7.     scalar define `tru' = `od1' / (r(mean) / (1 - r(mean)))
                8. end

              .
              . program define simem, rclass
                1.     version 18.0
                2.     syntax , [Odds(integer 2)]
                3.
              .     replace pr0 = rbinomial(1, cond(out, `odds' / (`odds' + 1), 1/2))
                4.     tempname tru
                5.     true , t(`tru')
                6.
              .     forvalues i = 1/17 {
                7.         local ind : display %02.0f `i'
                8.         replace pr`ind' = runiform()
                9.     }
               10.     logit out i.sex##(i.pr0 c.pr??)
               11.     return scalar pow = r(table)["pvalue", "out:1.pr0"] < 0.05
               12.     return scalar siz = r(table)["pvalue", "out:c.pr01"] < 0.05
               13.     return scalar est = exp(r(table)["b", "out:1.pr0"])
               14.     return scalar tru = `tru'
               15. end

              .
              . program define sumem
                1.     version 18.0
                2.     syntax [if]
                3.
              .     summarize pow siz `if'
                4.     centile est tru `if'
                5.
              . end

              .
              . frame create PowerAnalysis byte(tgt pow siz) double(est tru)

              .
              . // Odds ratio = 2
              . forvalues rep = 1/3000 {
                2.     quietly simem
                3.     frame post PowerAnalysis (2) (r(pow)) (r(siz)) (r(est)) (r(tru))
                4. }

              . frame PowerAnalysis: sumem

                  Variable |        Obs        Mean    Std. dev.       Min        Max
              -------------+---------------------------------------------------------
                       pow |      3,000         .32    .4665539          0          1
                       siz |      3,000        .057    .2318813          0          1

                                                                        Binom. interp.
                  Variable |       Obs  Percentile    Centile        [95% conf. interval]
              -------------+-------------------------------------------------------------
                       est |     3,000         50    2.095647        2.048841    2.142975
                       tru |     3,000         50     1.98977        1.959391    2.020619

              .
              . // Odds ratio = 5
              . forvalues rep = 1/3000 {
                2.     quietly simem , o(5)
                3.     frame post PowerAnalysis (5) (r(pow)) (r(siz)) (r(est)) (r(tru))
                4. }

              . frame PowerAnalysis: sumem if tgt == 5

                  Variable |        Obs        Mean    Std. dev.       Min        Max
              -------------+---------------------------------------------------------
                       pow |      3,000    .9043333    .2941826          0          1
                       siz |      3,000        .056    .2299601          0          1

                                                                        Binom. interp.
                  Variable |       Obs  Percentile    Centile        [95% conf. interval]
              -------------+-------------------------------------------------------------
                       est |     3,000         50    5.777989        5.552121    5.954734
                       tru |     3,000         50     5.09348               5    5.182768

              .
              . exit

              end of do-file



              • #8
                Is anyone else getting hidden characters (as shown below) when they copy & paste Joseph Coveney's simulation code in #7?

                Code:
                .ÿ
                .ÿversionÿ18.0
                
                .ÿ
                .ÿclearÿ*
                
                .ÿ
                .ÿ//ÿseedem
                .ÿsetÿseedÿ264129110
                etc.
                I have encountered this in previous posts by Joseph, btw.
                --
                Bruce Weaver
                Email: [email protected]
                Version: Stata/MP 18.5 (Windows)



                • #9
                  You'll need to take that up with the forum administrator.

                  At first, the forum software respected multiple consecutive white spaces, but somehow that got toggled off so that now all consecutive white spaces are trimmed to a single white space. So-called hard spaces, such as ANSI 160 or the Unicode equivalent do not overcome that. This distorts the display of Stata output.

                  Code delimiters work fine for code, but I distinguish code that a reader may copy and paste verbatim to run, and output, where there is no such expectation: if a forum user wants to take output and use it as code then he or she will need to clean it up first in a text editor regardless of whether there is a space-holder character.

                  Again, if it bothers you, then I recommend taking it up with the administrator and asking that the forum software restore its original behavior with respect to honoring what users type.



                  • #10
                    Thank you, Joseph. I have always used code delimiters when posting output, and have not noticed any particular problems in doing so. That makes me curious to know what problems you may have encountered. However, I do sense from your response in #9 that you may have no desire to comment further, and if so, that's fine.

                    PS- I just checked the FAQ to see if it says anything about using code delimiters for output, and it does not (currently).
                    --
                    Bruce Weaver
                    Email: [email protected]
                    Version: Stata/MP 18.5 (Windows)
