Panel data regression weird OR

David Meldon

Join Date: Mar 2021
Posts: 20

15 Mar 2021, 05:43

Dear Statalist community,

Thank you all in advance for taking the time to read this post and help me.

I am doing a research project focused on risk factors for stroke in a subset of patients. To do so, I gathered retrospective data from a cohort of patients and organised it into "assessments". The first assessment is the first time they started medical follow-up, and in each assessment different tests and explorations were carried out. The first problem is that not all patients have the same number of assessments, and not all patients have the same number of explorations per assessment; therefore, there is a lot of longitudinal missing data in some areas.

Once the data was collected, we organised the next steps in this way:

1. Descriptive analysis of the cohort to see important variables that might be considered as risk factors
2. Set up panel data from the longitudinal evolution of patients with the important variables and do a univariate logistic regression to pick up variables for a multivariate logistic regression
3. Generate a multivariate regression model with the variables that were significant in the previous step.

When we did the descriptive analysis we picked up some variables that changed across assessments and others that did not. In the end, I assembled the data into a panel with the variable id identifying each individual, time identifying the assessment number (1, 2, ...) and the outcome variable stroketotal (0/1). In addition, there are the variables: sex for gender; renal function (gfr); mean age at the assessment (a mean because some tests were done at different times, so I had to average the age for each assessment); presence of an autoimmune disease (autoinmunity, 0/1); presence of a specific mutation (N215S, 0/1); presence of white matter lesions on MRI (wml, 0/1/2/3); and the degree of enlargement of the left atrium (LAecho, 0/1/2/3).

It is a panel of 414 patients. About 40 of them had the stroke before the first assessment, and about 40 more had a stroke by the end of the study period. I limited the panel to 7 assessments and therefore ended up with 2,898 data rows (414 × 7).

It looks like this:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input int id byte time float(stroketotallong_ meanageass_) int gfr_ float(wml_ LAecho_ autoinmunitynum N215S) byte sex
 1 1 0 34  97 1 . 0 0 1
 1 2 0 35   . . . 0 0 1
 1 3 0 36   . . . 0 0 1
 1 4 0 37   . . . 0 0 1
 1 5 0 38   . . . 0 0 1
 1 6 0 39   . . . 0 0 1
 1 7 0  .   . . . 0 0 1
 2 1 0 27  98 0 1 0 0 1
 2 2 0 28  96 0 . 0 0 1
 2 3 0 29  98 . . 0 0 1
 2 4 0 30 104 . . 0 0 1
 2 5 0 31 103 . . 0 0 1
 2 6 0 32 105 . . 0 0 1
 2 7 0 33 102 . . 0 0 1
 4 1 0 40 112 0 0 0 0 1
 4 2 0 41 102 . 1 0 0 1
 4 3 0 42 103 . . 0 0 1
 4 4 0 43   . . . 0 0 1
 4 5 0 44   . . . 0 0 1
 4 6 0 45   . . . 0 0 1
 4 7 0  .   . . . 0 0 1
 5 1 0 41  83 1 . 0 0 1
 5 2 1  .   . . . 0 0 1
 5 3 1  .   . . . 0 0 1
 5 4 1  .   . . . 0 0 1
 5 5 1  .   . . . 0 0 1
 5 6 1  .   . . . 0 0 1
 5 7 1  .   . . . 0 0 1
 6 1 0 32  94 0 2 0 1 0
 6 2 0 34  93 0 . 0 1 0
 6 3 0 35 103 . . 0 1 0
 6 4 0 33  79 . . 0 1 0
 6 5 0 36  63 . . 0 1 0
 6 6 0 37  48 . . 0 1 0
 6 7 0 41  46 . . 0 1 0
 7 1 0 61 106 . 0 0 1 1
 7 2 0 58  98 . . 0 1 1
 7 3 0 59 100 . . 0 1 1
 7 4 0 62 102 . . 0 1 1
 7 5 0 63 111 . . 0 1 1
 7 6 0 64  93 . . 0 1 1
 7 7 0 67  99 . . 0 1 1
 8 1 0 20 115 0 1 0 0 0
 8 2 0 21 110 . . 0 0 0
 8 3 0  .   . . . 0 0 0
 8 4 0  .   . . . 0 0 0
 8 5 0  .   . . . 0 0 0
 8 6 0  .   . . . 0 0 0
 8 7 0  .   . . . 0 0 0
 9 1 0 40  92 1 0 0 0 1
 9 2 0 42  80 1 . 0 0 1
 9 3 0 44  84 . . 0 0 1
 9 4 0 45   . . . 0 0 1
 9 5 0 46   . . . 0 0 1
 9 6 0 47   . . . 0 0 1
 9 7 0  .   . . . 0 0 1
10 1 0 21 117 0 0 0 0 1
10 2 0 23 121 0 0 0 0 1
10 3 0 25 115 . . 0 0 1
10 4 0 26 129 . . 0 0 1
10 5 0 27 121 . . 0 0 1
10 6 0 28 106 . . 0 0 1
10 7 0 29  98 . . 0 0 1
11 1 0 25 131 0 0 0 0 1
11 2 0 26 120 0 1 0 0 1
11 3 0 28 132 . . 0 0 1
11 4 0 29 124 . . 0 0 1
11 5 0 30  97 . . 0 0 1
11 6 0 31 125 . . 0 0 1
11 7 0 32 106 . . 0 0 1
13 1 0 38 121 0 . 0 1 1
13 2 0 39 124 . . 0 1 1
13 3 0 41 106 . . 0 1 1
13 4 0 42 105 . . 0 1 1
13 5 0 43 114 . . 0 1 1
13 6 0 44 108 . . 0 1 1
13 7 0  .   . . . 0 1 1
14 1 1 42  95 0 0 0 0 1
14 2 1 43  89 0 0 0 0 1
14 3 1 44 107 . 0 0 0 1
14 4 1 45 105 . . 0 0 1
14 5 1 41 109 . . 0 0 1
14 6 1 46 103 . . 0 0 1
14 7 1 47 109 . . 0 0 1
15 1 0 35 108 0 0 0 0 1
15 2 0 36 108 0 0 0 0 1
15 3 0 37 102 . 0 0 0 1
15 4 0 38 113 . 0 0 0 1
15 5 0 39 105 . . 0 0 1
15 6 0 40 104 . . 0 0 1
15 7 0 41 108 . . 0 0 1
16 1 0 66 110 0 0 0 1 0
16 2 0 67  97 0 0 0 1 0
16 3 0 69 103 . . 0 1 0
16 4 0 70  93 . . 0 1 0
16 5 0 71   . . . 0 1 0
16 6 0 72   . . . 0 1 0
16 7 0  .   . . . 0 1 0
17 1 1 60  76 1 0 0 0 1
17 2 1 61  77 1 0 0 0 1
end
label values autoinmunitynum Zero
label values N215S Zero
label def Zero 0 "No", modify
label def Zero 1 "Yes", modify
label values sex Sex
label def Sex 0 "Male", modify
label def Sex 1 "Female", modify

For the univariate regression analysis I chose the random-effects (re) model, as I had time-varying variables and patients in whom the outcome did not vary. (Perhaps this assumption is wrong.) Doing so, I obtain a reasonable OR for some variables, whereas for others I get a weird value, probably driven by the small number of events in that category or vice versa.

For example, here are xtlogit models of stroke on GFR, adjusted for age and gender, and of stroke on N215S with the same adjustment variables:

Code:

xtset id

xtlogit stroketotallong_ i.sex c.meanageass_ c.gfr_, or

Random-effects logistic regression              Number of obs     =      1,838
Group variable: id                              Number of groups  =        390

Random effects u_i ~ Gaussian                   Obs per group:
                                                              min =          1
                                                              avg =        4.7
                                                              max =          7

Integration method: mvaghermite                 Integration pts.  =         12

                                                Wald chi2(3)      =      38.56
Log likelihood  = -297.21479                    Prob > chi2       =     0.0000

----------------------------------------------------------------------------------
stroketotallong_ | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-----------------+----------------------------------------------------------------
             sex |
         Female  |   .3959993   .4896565    -0.75   0.454     .0350895     4.46902
     meanageass_ |   1.416943   .0838234     5.89   0.000     1.261819    1.591137
            gfr_ |   .9731659   .0225645    -1.17   0.241     .9299302    1.018412
           _cons |   5.24e-13   1.87e-12    -7.90   0.000     4.72e-16    5.81e-10
-----------------+----------------------------------------------------------------
        /lnsig2u |   5.151202   .1560547                      4.845341    5.457064
-----------------+----------------------------------------------------------------
         sigma_u |   13.13921   1.025218                      11.27593     15.3104
             rho |      .9813   .0028637                       .974778    .9861595
----------------------------------------------------------------------------------
Note: Estimates are transformed only in the first equation.
Note: _cons estimates baseline odds (conditional on zero random effects).
LR test of rho=0: chibar2(01) = 877.78                 Prob >= chibar2 = 0.000


 xtlogit stroketotallong_ i.sex c.meanageass_ i.N215S, or nolog

Random-effects logistic regression              Number of obs     =      2,256
Group variable: id                              Number of groups  =        409

Random effects u_i ~ Gaussian                   Obs per group:
                                                              min =          1
                                                              avg =        5.5
                                                              max =          7

Integration method: mvaghermite                 Integration pts.  =         12

                                                Wald chi2(3)      =     715.32
Log likelihood  = -282.26214                    Prob > chi2       =     0.0000

----------------------------------------------------------------------------------
stroketotallong_ | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-----------------+----------------------------------------------------------------
             sex |
         Female  |   .0002462   .0002721    -7.52   0.000     .0000282    .0021489
     meanageass_ |   2.382608   .0859723    24.06   0.000     2.219926    2.557212
                 |
           N215S |
            Yes  |   4.65e-11   8.67e-11   -12.76   0.000     1.20e-12    1.80e-09
           _cons |   8.09e-24   1.68e-23   -25.64   0.000     1.39e-25    4.71e-22
-----------------+----------------------------------------------------------------
        /lnsig2u |   6.104662   .1448829                      5.820697    6.388627
-----------------+----------------------------------------------------------------
         sigma_u |   21.16462   1.533196                      18.36319    24.39342
             rho |   .9927091   .0010486                       .990338    .9945016
----------------------------------------------------------------------------------
Note: Estimates are transformed only in the first equation.
Note: _cons estimates baseline odds (conditional on zero random effects).
LR test of rho=0: chibar2(01) = 1205.64                Prob >= chibar2 = 0.000

As you can see, the OR for N215S makes no sense.

If I calculate the OR without a panel data analysis:

Code:

 tab stroketotallong_ N215S

stroketota |         N215S
    llong_ |        No        Yes |     Total
-----------+----------------------+----------
         0 |     1,586        875 |     2,461 
         1 |       367         70 |       437 
-----------+----------------------+----------
     Total |     1,953        945 |     2,898 

logit stroketotallong_ i.sex c.meanageass_ i.N215S, or nolog

Logistic regression                             Number of obs     =      2,256
                                                LR chi2(3)        =     180.27
                                                Prob > chi2       =     0.0000
Log likelihood = -885.08092                     Pseudo R2         =     0.0924

----------------------------------------------------------------------------------
stroketotallong_ | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-----------------+----------------------------------------------------------------
             sex |
         Female  |   .6185883   .0776087    -3.83   0.000     .4837367    .7910324
     meanageass_ |   1.043193   .0042768    10.31   0.000     1.034845    1.051609
                 |
           N215S |
            Yes  |   .2324415   .0380368    -8.92   0.000     .1686641    .3203353
           _cons |   .0425143   .0095265   -14.09   0.000     .0274031    .0659583
----------------------------------------------------------------------------------

I have tried to work out why this happens, but I have not been able to find out why the OR differs so much when I run xtlogit. In the end, the important question is: can I do a panel data analysis with this kind of data, or am I attempting the impossible? If I can, should I try other models (melogit? xtcloglog because the event is rare?)?

Thank you all very much for your help.

David.


David Meldon

Join Date: Mar 2021

Posts: 20
#2

19 Mar 2021, 05:05

Apologies for not stating my Stata version. I am using Stata 16.1.

Thank you!
William Lisowski

Join Date: Dec 2014

Posts: 10150
#3

19 Mar 2021, 10:50

Biostatistics is outside my area of expertise, so take these thoughts as suggestions for further exploration.

It appears to me that your outcome variable stroketotallong_ becomes 1 when the patient has had a stroke (for 40 patients this was before the first assessment) and then remains 1 for each assessment thereafter. But in that case all the time-varying measurements taken after the stroke are irrelevant - any measures that might be expected to reduce the risk of a stroke will not affect the outcome "already had a stroke" and will blunt the estimated effect of those measures.
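
One way to see how many rows this affects is to flag the assessments that come after a patient's first recorded stroke. This is only a sketch: it assumes the long-format data from post #1 are in memory, and prior_stroke and atrisk are names made up here for illustration.

```stata
* Sketch: flag assessments that come after the first recorded stroke.
* Assumes id, time and stroketotallong_ as posted in #1.
* prior_stroke counts the earlier rows (within patient) that already show a stroke.
bysort id (time): generate byte prior_stroke = sum(stroketotallong_ == 1) - (stroketotallong_ == 1)
* atrisk == 1 while the patient has no earlier stroke on record
generate byte atrisk = prior_stroke == 0
tabulate stroketotallong_ atrisk
```

Rows with atrisk == 0 carry no further information about stroke onset, so they are candidates for exclusion whatever model is chosen. Patients whose stroke preceded the first assessment still have atrisk == 1 on their first row and would need separate handling.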

You realize, do you not, that you lose a lot of observations to missing values? Your second regression, which leaves out gfr_, includes 19 additional patients and 400+ additional observations.
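
A quick way to see where the observations go is sketched below. It assumes the variables from post #1 are in memory; nmiss_gfr and nmiss_n215s are hypothetical names introduced only for this example.

```stata
* Sketch: how many rows survive listwise deletion in each model?
* Missing-value overview for the variables used in the two xtlogit calls
misstable summarize stroketotallong_ meanageass_ gfr_ N215S sex
* Number of missing values per row, separately for each model's variable list
egen byte nmiss_gfr   = rowmiss(stroketotallong_ sex meanageass_ gfr_)
egen byte nmiss_n215s = rowmiss(stroketotallong_ sex meanageass_ N215S)
* Rows with complete data are the ones each estimation command can use
count if nmiss_gfr == 0
count if nmiss_n215s == 0
```

If the data match the posted output, the two counts should agree with the "Number of obs" lines of the two xtlogit runs.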

By having individual-level effects it seems to me that you cannot estimate effects for variables that do not vary over time, such as N215S. Anything you may see for those is likely an artifact of the estimation process.

In general I think this might benefit from being cast as a survival model, not a panel data model. But a biostatistician experienced with those techniques would be the appropriate reference, not my opinion. My apologies if in fact you are a biostatistician.
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17671
#4

19 Mar 2021, 10:58

David:
as an aside to William's helpful guidance, please also note that your cross-sectional logistic regression is not that helpful: you should have at least clustered your standard errors on -panelid-, as your observations are not independent (due to the panel structure of your dataset).
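
A minimal sketch of the pooled model with patient-clustered standard errors, using the variable names from post #1 (there the panel identifier is id). It assumes the full dataset is in memory.

```stata
* Sketch: pooled logit with standard errors clustered on patient.
* Same specification as the plain logit in post #1, plus vce(cluster id).
logit stroketotallong_ i.sex c.meanageass_ i.N215S, or vce(cluster id) nolog
```

The odds ratios are the same as in the unclustered logit; only the standard errors, test statistics and confidence intervals change.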

Kind regards,
Carlo
(StataNow 18.5)
David Meldon

Join Date: Mar 2021

Posts: 20
#5

20 Mar 2021, 10:50

Thank you very much for your helpful comments.

After reading your replies, I will now focus on developing a survival model. All the problems you have highlighted really troubled me because I did not know how to improve the panel data model to estimate effects. I found myself at a dead end... and even now, I still do not know why I had not thought about survival analysis before... so they have been life-saving comments.

By the way, I am a clinician, so I have had to remember everything about biostatistics from my time in college, ages ago.

I do really appreciate your help and this wonderful community.

David.
William Lisowski

Join Date: Dec 2014

Posts: 10150
#6

20 Mar 2021, 11:17

Glad I weighed in with what little I could guess.

I'm relieved to learn you're a clinician who is a little rusty on biostatistics. I feared you were someone outside the medical professions moving into a new area, bringing their methodological hammer to bear on the task of driving an analytic screw.

https://en.wikipedia.org/wiki/Law_of_the_instrument
Weiwen Ng

Join Date: Jun 2015

Posts: 1241
#7

20 Mar 2021, 13:23

Originally posted by William Lisowski

...
It appears to me that your outcome variable stroketotallong_ becomes 1 when the patient has had a stroke (for 40 patients this was before the first assessment) and then remains 1 for each assessment thereafter. But in that case all the time-varying measurements taken after the stroke are irrelevant - any measures that might be expected to reduce the risk of a stroke will not affect the outcome "already had a stroke" and will blunt the estimated effect of those measures.
...

In fact, in the data sample presented, most patients' outcomes are all 0s or all 1s, except for patient 5, who had one 0 and the rest 1s. Basically, nearly everyone either had no stroke during the entire observation period or had a stroke recorded at every assessment.
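
Whether the same pattern holds in the full dataset can be checked along these lines. This is a sketch that assumes the data from post #1 are in memory; min_stroke, max_stroke and tag_id are hypothetical names.

```stata
* Sketch: classify patients by whether their outcome ever changes.
bysort id: egen byte min_stroke = min(stroketotallong_)
bysort id: egen byte max_stroke = max(stroketotallong_)
* tag_id marks one row per patient so that each patient is counted once
egen byte tag_id = tag(id)
* 0/0 = never a stroke, 1/1 = stroke at every assessment,
* 0/1 = outcome changes during follow-up
tabulate min_stroke max_stroke if tag_id
```

After xtset id time, xttab stroketotallong_ gives a similar between/within summary.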

When I look at the rho values in output from the full sample, they're nearly 1 for both specifications. Hence, I wonder if this feature also exists in the full dataset. If so, it might be worth checking for a data processing error first. Because fewer than 10% of the data sample had a stroke at the start of the observation period, I am not sure if this reflects reality. However, if everyone else either had a stroke at assessment 2 or never had a stroke at all, that might do it.
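
For reference, in a random-effects logit rho is the share of the total latent variance attributed to the patient-level effect, rho = sigma_u^2 / (sigma_u^2 + pi^2/3). A short check using the sigma_u values reported in post #1 (the scalar names are made up here):

```stata
* Sketch: reproduce rho from the reported sigma_u values.
* pi^2/3 is the variance of the standard logistic error term.
scalar rho_gfr   = 13.13921^2 / (13.13921^2 + _pi^2/3)
scalar rho_n215s = 21.16462^2 / (21.16462^2 + _pi^2/3)
* These should be close to the .9813 and .9927 shown in the two outputs
display rho_gfr
display rho_n215s
```

With sigma_u above 13 on the log-odds scale, almost all of the variation is between patients, which is what one would expect if the outcome rarely changes within a patient.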

If David is going to do survival analysis, I would consult someone who's actually experienced in survival modeling. The 40 patients with a stroke before the first assessment are, I think, left censored, and I'm not sure if the Cox model will simply discard them. I think it will, but I haven't actually dealt with left censored data before.

Last edited by Weiwen Ng; 20 Mar 2021, 13:34.

Be aware that it can be very hard to answer a question without sample data. You can use the dataex command for this. Type help dataex at the command line.

When presenting code or results, please use the code delimiters to format them. Use the # button on the formatting toolbar, between the " (double quote) and <> buttons.
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17671
#8

21 Mar 2021, 04:44

David:
as in my case "rust never sleeps" (just like somebody sang many years ago), I've found relief in these two helpful sources about survival analysis with Stata:
https://www.stata-press.com/books/su...introduction/;
https://www.iser.essex.ac.uk/resourc...sis-with-stata by Stephen Jenkins (regular and towering contributor to this forum).

Kind regards,
Carlo
(StataNow 18.5)
David Meldon

Join Date: Mar 2021

Posts: 20
#9

21 Mar 2021, 05:28

Thank you all!

Weiwen, thank you for your input. When I posted the example data via dataex I did not want to include too many values and, sadly, the first ones do not change throughout the follow-up time. In this disease, stroke is not a frequent complication, although we would not define it as rare. In this sense, the prevalence of stroke at both the beginning and the end of the study makes sense. In the whole dataset, excluding people who already had a stroke at the beginning, we have 10.9% new events (41 of 374), so I thought that the event was not rare and that I could use the usual statistical tests (now from a survival model perspective).

After reading your posts, I started to look for more information about survival models, and I encountered different problems. The first one is what I am going to do with the people who already had a stroke. I have never seen any study or model taking left-censored data into account, so this is new to me. All I read about the Cox model made me think that I could not use them. The second one is how I will evaluate the effect of time-varying variables and, specifically, people who switched treatment several times throughout the study. Nevertheless, I hope that after studying and reading a bit more, plus working on it in Stata, I will be able to advance step by step.
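
As a starting point for both questions, a multiple-record survival setup might look like the sketch below. It rests on several assumptions that are not from the thread: the full long-format data from post #1 are in memory, the assessment number stands in for the time scale (real dates would be better), and patients whose stroke preceded the first assessment are simply left out. prevalent is a hypothetical variable name.

```stata
* Sketch only, under the assumptions stated above.
* Mark patients whose first assessment already shows a stroke
bysort id (time): generate byte prevalent = stroketotallong_[1] == 1
* One record per assessment; with id(), follow-up ends at the first failure
stset time if !prevalent, id(id) failure(stroketotallong_ == 1)
* Check the resulting records and failures before modelling
stdescribe
* Covariates on a row apply to the interval ending at that assessment,
* so values measured at the previous assessment may be more appropriate
stcox i.sex c.meanageass_ c.gfr_
```

Because each assessment is its own record, time-varying covariates such as gfr_ enter the Cox model with their per-assessment values. Rows with missing covariates are still dropped, so the missing-data problem remains.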

Carlo, thank you very much for the resources you recommended. The first book is already on its way to my home, and I really look forward to reading it.

Finally, I did not know about the law of the hammer. Although it seems quite negative, I wish I could have at least one hammer regarding statistics...sometimes, I feel so lost that to have it would make me feel better, haha.
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17671
#10

21 Mar 2021, 05:50

David:
given that dealing with left-censored data is really demanding, this and other tricky topics about survival analysis are covered in https://www.springer.com/us/book/9780387953991.

Kind regards,
Carlo
(StataNow 18.5)
David Meldon

Join Date: Mar 2021

Posts: 20
#11

22 Mar 2021, 03:59

Thanks Carlo for your recommendation. It is good to know that there are always resources to read should the need arise.

I will go through the book introduction to survival analysis in Stata first, try to get a good feeling for what I can do with my data, and probably run some analyses without the left-censored data. Then... we will see.

Thank you very much for everything.
