  • logistic regression ? panel data

    Dear Stata community,

    My data is long data and I have given an example of how it looks below.

    I have 250 people with multiple observations over 1 to 25 years, recorded in long form.
    Now I would like to do a logistic regression, but I am struggling to do this in the long form. Do I need to use the "panel logistic regression" option?

    My dependent variable is binary (event 1=yes 0=no).
    variables I would like to put into my model are
    1. gender
    2. age of person at time of examination (continuous)
    3. age of X (continuous)
    4. disease 1 type (categorical)
    5. maximum ever stage for disease 2 (ordinal categorical 1 to 4)

    My problem is that we know the event is related to "age of X", with 50% of people experiencing it by 15 years. The event is also related to disease 2 stage (with stage 3/4 occurring more often with event = 1), BUT we know that as most people age they develop some stage of disease 2. Is there a way I can factor all this into my logistic regression please?

    I would be so grateful for your thoughts and sorry if my questions are basic.

    Many thanks,
    Observation  ID  age of person at time of examination  age of X  event  disease 1 type  disease 2 stage  maximum ever stage for disease 2  gender
    1            1   30                                    4         1      2               1                2                                 1
    2            1   31                                    5         1      2               1                2                                 1
    3            1   33                                    6         1      2               2                2                                 1
    4            1   35                                    8         1      2               1                2
    5            1   36                                    9         1      2               2                2
    6            2   24                                    1         0      4               1                3
    7            2   25                                    2         0      4               1                3
    8            2   26                                    3         0      4               3                3
    9            2   27                                    4         0      4               2                3
    10           3   38                                    6         1      1               3                4
    11           3   39                                    7         1      1               3                4
    12           3   42                                    11        1      1               4                4
    13           3   43                                    12        1      1               4                4
    14           3   44                                    13        1      1               4                4

  • #2
    Yes, you will need to use the panel-data version of logistic regression, -xtlogit- for this. You cannot use -logit- or -logistic- because the observations are nested within ID and are therefore not independent.

    "BUT we know that as most people age they develop some stage of disease 2. Is there a way I can factor all this into my logistic regression please?"
    Yes. The simplest approach, and what I would start with, is to include both age and disease 2 stage in the list of predictor variables of your -xtlogit- command. This approach is simple to code and assumes that the log odds of the event are a linear combination of age and disease 2 stage. More complicated relationships are possible, however, and you need to have some mechanistic understanding of how the two variables contribute to the genesis of the event outcome. There may be non-linearities involved, and possibly interaction terms. It all depends on the underlying real-world process. Also, since your disease 2 stage variable is ordinal, you will have to choose whether to treat it as discrete or as continuous: there are no ordinal variables in -xtlogit-. Again, this choice depends on your understanding of the underlying real-world data generating process.
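
    As a minimal sketch of that starting point, and assuming the data have been imported under hypothetical variable names (id, event, gender, age_exam, age_x, dis1_type and dis2_maxstage are illustrative only, not names from your data), the discrete and continuous treatments of the ordinal stage variable might be coded roughly like this:

    Code:
    * Hypothetical variable names; adapt them to the real data set
    xtset id
    * Stage entered as discrete: one indicator per level
    xtlogit event i.gender c.age_exam c.age_x i.dis1_type i.dis2_maxstage, re
    * Stage entered as continuous: a single linear term on the log-odds scale
    xtlogit event i.gender c.age_exam c.age_x i.dis1_type c.dis2_maxstage, re
    * One possible age-by-stage interaction, only if subject-matter knowledge supports it
    xtlogit event i.gender c.age_exam##i.dis2_maxstage c.age_x i.dis1_type, re

    The i. and c. prefixes are Stata's factor-variable notation for discrete and continuous predictors. Which of these specifications is appropriate remains a substantive question, not a coding one.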

    If you need more specific advice, first import your data into Stata (the tableau you show is clearly not from a Stata data set as the column headers are not legal Stata variable names). Then use the -dataex- command to show code that will enable somebody who wants to help you to create a replica of your example data. If you are running version 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.


    • #3
      Dear Clyde,

      Thank you for your detailed response. I am grateful. My data looks like this:

      Code:
      * Example generated by -dataex-. To install: ssc install dataex
      clear
      input int(observation id) byte(age_at_examination disease1_early_vs_advanced) int age_of_reservoir_months float(max_number_of_growths_ever max_stage_ever_disease1) byte(stage_for_disease1 Age_at_operation) float(Main_disease_outcome max_spigscore)
       1 1 41 .  87 5 2 . 34 0 3
       2 1 42 .  99 5 2 . 34 0 3
       3 1 43 1 115 5 2 1 34 0 3
       4 1 44 1 127 5 2 1 34 0 3
       5 1 45 1 139 5 2 2 34 0 3
       7 1 46 1 146 5 2 2 34 0 3
       8 1 47 1 158 5 2 2 34 0 3
      10 1 48 1 165 5 2 2 34 0 3
      11 1 48 1 171 5 2 2 34 0 3
      12 1 48 1 177 5 2 2 34 0 3
      13 1 49 2 183 5 2 3 34 0 3
      14 1 50 2 195 5 2 3 34 0 3
      15 1 51 2 207 5 2 3 34 0 3
      16 1 52 2 219 5 2 3 34 0 3
      17 1 52 2 225 5 2 3 34 0 3
      18 1 54 2 240 5 2 3 34 0 3
      19 1 55 2 254 5 2 3 34 0 3
      20 1 57 2 282 5 2 3 34 0 3
      21 1 59 2 298 5 2 3 34 0 3
      22 1 59 2 308 5 2 3 34 0 3
      23 2 34 1  33 1 1 0 31 0 0
      24 3 44 .  21 3 2 . 42 1 3
      25 3 47 .  52 3 2 . 42 1 3
      26 3 48 1  68 4 2 2 42 1 3
      27 3 50 2  88 5 2 3 42 1 3
      28 3 52 2 112 6 2 3 42 1 3
      29 3 53 2 124 6 2 3 42 1 3
      30 3 54 2 136 6 2 3 42 1 3
      end
      label values disease1_early_vs_advanced sp_stage
      label values max_stage_ever_disease1 sp_stage
      label def sp_stage 1 "Early", modify
      label def sp_stage 2 "Advanced", modify
      label values max_number_of_growths_ever pbnumber_cat
      label def pbnumber_cat 1 "0", modify
      label def pbnumber_cat 3 "11-20", modify
      label def pbnumber_cat 4 "21-30", modify
      label def pbnumber_cat 5 "31-50", modify
      label def pbnumber_cat 6 "51-100", modify
      label values Main_disease_outcome severity_pouch_adenoma_numbers
      label def severity_pouch_adenoma_numbers 0 "≤ 50", modify
      label def severity_pouch_adenoma_numbers 1 ">50", modify
      ------------------ copy up to and including the previous line ------------------


      • #4
        Sorry I posted my dataex example without comment. I have tried using the panel logistic regression function. However I don't know if I should be using the random effects, fixed effects or mixed effects models.

        My dependent variable is "main disease outcome" (1 = >50 vs 0 = <50).
        variables I would like to put into the model include:
        1. age at operation (column 9) (continuous)
        2. age at examination (continuous)
        3. maximum stage ever disease 1 (early vs advanced). This is a different disease from the dependent variable, but we think that those who have a higher disease burden in this variable will have an event (1) in the dependent variable. Both worsen with increasing biological age of the person, so I would like to control for this.
        4. age of reservoir

        Would you be able to suggest some code for me, please? I am worried that what I am trying is not correct.


        • #5
          First, I assume this is an observational study, not a randomized trial. On that assumption, you need to actually try it both ways, with fixed effects and with random effects. If the results come out nearly the same both ways, then use the random effects results. If they come out appreciably different, then the random effects model's assumptions are too strongly violated to allow you to rely on it.
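
          A rough sketch of trying it both ways, using the variable names from the -dataex- example in #3. The factor-variable prefixes and the stored-estimate names re_fit and fe_fit are assumptions made for illustration. Bear in mind that the fixed-effects (conditional) estimator uses only within-person variation: people whose outcome never changes drop out of the estimation, and predictors that are constant within person, such as age at operation, are omitted. In the short extract shown in #3 the outcome does not vary within id, so this sketch presumes the full data set contains people whose outcome changes over time.

          Code:
          xtset id
          * Random-effects logistic regression
          xtlogit Main_disease_outcome c.Age_at_operation c.age_at_examination i.max_stage_ever_disease1 c.age_of_reservoir_months, re
          estimates store re_fit
          * Fixed-effects (conditional) logistic regression on the same predictors
          xtlogit Main_disease_outcome c.Age_at_operation c.age_at_examination i.max_stage_ever_disease1 c.age_of_reservoir_months, fe
          estimates store fe_fit
          * Coefficients and standard errors side by side for an informal comparison
          estimates table re_fit fe_fit, b se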

          What to do next depends on your specific research question, which you have not identified in this thread. You have said what you want to put in the model, but you have not said what specific relationships are the focus of your study. Moreover, because you have longitudinal (panel) data, all of the predictor variables have both within-panel effects and between (or across) panel effects. Those effects can be different. In fact, if the fixed and random effects estimators give noticeably different results, then the within and between panel effects are different. They can be very different, even be opposite in sign. So you need to specify whether you are trying to estimate the within-panel effect or the between-panel effect.
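
          As an illustrative sketch only, one common way to see the two kinds of effect side by side is a between-within ("hybrid") specification, in which a time-varying predictor is split into its person-level mean and the deviation from that mean. It is shown here for age at examination alone. The new variable names mean_age_exam and dev_age_exam are hypothetical, and this is one possible approach, not necessarily the right one for your research question.

          Code:
          * Person-level mean of the predictor (carries the between-person contrast)
          bysort id: egen mean_age_exam = mean(age_at_examination)
          * Deviation from the person's own mean (carries the within-person contrast)
          generate dev_age_exam = age_at_examination - mean_age_exam
          xtset id
          xtlogit Main_disease_outcome c.dev_age_exam c.mean_age_exam, re

          If the coefficients of dev_age_exam and mean_age_exam differ markedly, that is a sign that the within-panel and between-panel effects are different.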

          As an aside, I notice that your outcome variable is a dichotomy with 1 >= 50 and 0 < 50. You don't say what is >= or < 50. But let me offer you the advice that using this kind of outcome variable is almost always a bad idea. It is usually wiser to use the underlying continuous variable (the one that you were splitting at 50) and use a regression model suitable for continuous variables, rather than dichotomizing and doing a logistic regression. Unless there is something truly qualitatively different about outcomes >= 50 vs < 50, some difference that is discrete and occurs abruptly at 50, creating dichotomies like this throws away information, wastes statistical power, and sometimes also introduces bias. So ask yourself, is it really true that a person with a score of 50 is radically different in some meaningful way from a person with a score of 49, but is completely equivalent to a person with a score of 100? If you cannot say that with a straight face, you should not be dichotomizing the underlying variable.
          Last edited by Clyde Schechter; 16 Oct 2019, 22:02.


          • #6
            Dear Clyde, thank you for your detailed response. I chose >50 as the sign of the event (I am counting numbers of growths) because it is clinically relevant if a person has >50. Ideally I would have liked to treat these data as continuous rather than dichotomising to <=50 or >50. However, in many cases the observations were vague: the reporting clinicians would often not use numbers and would write on the reports ">50" or "<50" or "hundreds" or "less than ten". This meant that, unfortunately, I did not have continuous data, hence the reason for dichotomizing the dependent variable. But what you say is certainly food for thought, and I will go back to my data and have a think.

            Many thanks again for all your help.


            • #7
              OK. If the data are reported coarsely in the way you describe, and if there is a widely-accepted belief that <= 50 vs > 50 is clinically meaningful, then what you are doing is reasonable.


              • #8
                Thank you.
