Difference in differences method - dealing with unbalanced populations

Jules Chassetuillier

Join Date: Jan 2022

Posts: 3
#1

13 Jan 2022, 09:49

Hello,

I am working on a retrospective observational cohort study, and I would like to compare exposed and unexposed patients:
- The exposed patients participate in an experiment with new dental care practices. Their reference date (or "index" date) is their date of entry into the experiment.
- The unexposed patients do not participate in the experiment and are treated in the "usual way". They are matched to the exposed patients in a 1:3 ratio (3 unexposed patients for each exposed patient), based on specific socio-demographic characteristics. Their reference date is that of their matched exposed patient.

Both groups are studied from their index date to the end date of the study, or death if it occurs earlier. The patients included in the study are elderly, so it is expected that there will be a significant mortality during follow-up. However, their pathology (related to dental care) does not have a high mortality.

One of the study's outcomes is the expenditure (cost) associated with dental care consumption. These costs would be compared between the exposed and unexposed populations.
To perform this analysis, I wish to use the Difference in Differences (DiD) method, comparing the costs over periods of time around the index date:
- 6 months before vs 6 months after the index date
- 1 year before vs 1 year after the index date

I therefore have questions about applying DiD given the numbers of patients available in the different periods, both in the medical history (before the index date) and in the follow-up (after the index date):
What are the usual (or best) practices for managing fluctuations in population size from one period to the next, for both exposed and unexposed patients?

Example: we have 100 exposed and 300 unexposed patients:
- 30 exposed patients die during the year following the index date, 20 of them during the first 6 months
- 60 unexposed patients die during the year following the index date, 10 of them during the first 6 months

--> Should I only consider the patients alive at the end of the studied period? (i.e. at 6 months: 80 exposed and 290 unexposed; at 1 year: 70 exposed and 240 unexposed patients)

Is it possible for the DiD model to account for the missing data if all the patients are kept in the study? Should I assign a cost of "zero" for the missing data?

I thank you in advance for your insights and help

Kind regards.

Jules
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#2

13 Jan 2022, 13:24

Interesting and vexatious problem. The unexposed constitute 75% of the total sample due to the 1:3 matching ratio, yet they incurred only one-third of the 30 total deaths in the first 6 months. Assuming deaths are independent events, if the mortality rates in the two groups are equal, the binomial probability of this happening by chance is 1.8 x 10^-6. Something really strange is going on here. OK, rare things do happen, but I would invest a lot of time and energy making sure this data is correct, and investigating whether somehow there was a strong selection bias preferentially leading those who were nearer to death to be in the exposed group (or, worse still, whether there is something lethal about this new practice). Even at the one-year mark, there are 90 total deaths, of which only two-thirds occurred in the unexposed group--the probability of that is .047. By itself that would not disturb me much, but given what happened in the first 6 months, it is worrisome. In short, I'm very concerned about doing any analysis with this data until we understand this huge mortality discrepancy.
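
The two probabilities quoted above can be checked with a short calculation. This is a minimal sketch in Python using only the standard library, under the same assumptions as the paragraph: deaths are independent, mortality is equal in both groups, and so each death falls in the unexposed group with probability 0.75. The helper name binom_cdf is illustrative, and the quantity computed is the lower tail (that many or fewer deaths among the unexposed).

```python
from math import comb


def binom_cdf(k, n, p):
    """Return P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


# Share of the sample that is unexposed under 1:3 matching.
p_unexposed = 0.75

# First 6 months: 10 of the 30 deaths are in the unexposed group.
p_6m = binom_cdf(10, 30, p_unexposed)

# First year: 60 of the 90 deaths are in the unexposed group.
p_12m = binom_cdf(60, 90, p_unexposed)

print(f"6 months:  {p_6m:.2e}")
print(f"12 months: {p_12m:.3f}")
```

The 6-month figure comes out at roughly 1.8 x 10^-6, in line with the value above; the 12-month figure is of the order of a few percent.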

That said, assuming we can reassure ourselves that the data correctly represent what happened to these people, and we are satisfied that the imbalance in deaths is just an extremely unlikely accident, we then have to ask how to deal with it in the analysis. The first thing is to recognize that costs do actually fall to zero once a person dies. This is one of the limitations of cost analysis, at least for policy-making purposes, because one of the easiest ways to bring down costs is to kill off the people who generate the costs. That's why policy analyses usually look not just at costs but at cost-effectiveness ratios or cost-benefit analyses. Nevertheless, in this situation there is, I think, a reasonably simple solution. Cost distributions, especially in health care, tend to be highly skewed to the right, so standard linear regression analyses are inadequate anyway. I'm thinking that for your study you might address all of this by using a Poisson regression model and making use of the -exposure()- option (which is about duration of time at risk for the outcome and should not be confused with the exposed vs unexposed groups in the study).

Presumably you have observations on the same people both pre- and post- treatment. I guess for pre-treatment, each person survived the full 6 or 12 month observation window--because you wouldn't be including people who died before being treated in either group. But post-treatment, different people survived different durations. So you can use the duration of actual survival in the post-treatment observation window, and use the full duration of the observation window for pre-treatment observations, as the variable to place in the -exposure()- option of an -xtpoisson- regression. I strongly suggest that you use robust standard errors in this analysis. This analysis will enable you to estimate the rate at which expenditures accrue per month (or whatever unit of time you use for the -exposure()- option variable). Also, using the -irr- option will give you the "incidence" rate ratios instead of the Poisson coefficients, which will be more helpful for understanding and presentation.
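
To make the role of time at risk concrete, here is a rough sketch in Python (standard library only) of the crude analogue of this analysis: cost per month of observed time in each of the four group-by-period cells, and the ratio of the two post/pre rate ratios. It is not the -xtpoisson- estimate (there are no patient-level effects and no standard errors), and the function name, row layout and numbers are all made up for illustration.

```python
from collections import defaultdict


def did_rate_ratio(rows):
    """Crude difference-in-differences rate ratio.

    rows: iterable of (exposed, post, cost, months_observed),
    where exposed and post are 0/1 indicators.
    """
    cost = defaultdict(float)
    time = defaultdict(float)
    for exposed, post, c, t in rows:
        cost[(exposed, post)] += c
        time[(exposed, post)] += t
    # Cost accrued per month at risk in each cell.
    rate = {cell: cost[cell] / time[cell] for cell in cost}
    ratio_exposed = rate[(1, 1)] / rate[(1, 0)]
    ratio_unexposed = rate[(0, 1)] / rate[(0, 0)]
    return rate, ratio_exposed / ratio_unexposed


# Hypothetical data: one exposed patient who died 2 months after the
# index date, and one unexposed patient observed for the full 6 months.
rows = [
    (1, 0, 600.0, 6.0),  # exposed, pre-treatment, full window
    (1, 1, 300.0, 2.0),  # exposed, post-treatment, 2 months survived
    (0, 0, 600.0, 6.0),  # unexposed, pre-treatment
    (0, 1, 600.0, 6.0),  # unexposed, post-treatment
]
rate, irr = did_rate_ratio(rows)
print(rate[(1, 1)], irr)
```

In this toy example the exposed patient accrues 150 per month after the index date. Dividing the same 300 by the full 6-month window would give 50 per month and make costs look as if they had fallen, which is the distortion that using actual survival time avoids.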
Jules Chassetuillier

Join Date: Jan 2022

Posts: 3
#3

17 Jan 2022, 08:53

Thank you very much for your reply !

As for the example I provided (30 exposed and 60 unexposed died, etc.), this is not real data (I do not have the actual data yet, I am currently anticipating the possible scenarios that could occur). That being said, I realize that my example is probably far off from what I will get with the real data.

Assuming the data made more "sense" (few exposed people would die after index date / and way less than unexposed patients), would DiD be more appropriate ? Or do you think because cost distributions are usually skewed to the right in healthcare, it is not the best solution for those types of problems ?

As I only study the evolution of costs between exposed and unexposed patients (during the time - before and after index date), I study only costs during time, without accounting for any other feature. Basically for applying DiD, I would have for each patient a cost associated with a "period" of time (for instance 6 months before index date).

I am not very hands on with Poisson regression models, but as I understand, it requires features (like any linear model) and a time "offset" variable, and would look like this when modelling a cost: ln(cost) = b0 + b1.X1 + ... + bk.Xk + ln(t) (Xi corresponding to the different features)
--> as I only study the cost with respect to the time from beginning of study period, would the model simply look like this: ln(cost) = b0 + ln(t)?

As for applying the Poisson regression model to this problem, if the input data were in the following format:
- 1 row per patient - time (in months from beginning of study period) - cost (cumulative costs over the current month)

Patient  Exposed  Period  Cost
A        Y        1       300
A        Y        2       250
...      ...      ...     ...
A        Y        30      500
B        N        1       250
B        N        2       300
How would the pre-treatment / post-treatment observation windows (for example 6 months before vs 6 months after) be compared?

Would you recommend aggregating by "period label"? For example, for the 1st patient, who has 30 months of follow-up and an index date (exposure) at month 8, I would gather all costs from months 2 to 7 as the "6 months before index date" period, and all costs from months 8 to 13 as the "6 months after index date" period.

I hope my questions are clear

I thank you very much again.

Kind regards.

Jules
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#4

17 Jan 2022, 12:22

> Or do you think because cost distributions are usually skewed to the right in healthcare, it is not the best solution for those types of problems?

DID refers to study design really, not to analysis. The skewness of the distributions makes certain analyses (like OLS regression) questionable, but does not change the validity of the DID design.

> I am not very hands on with Poisson regression models, but as I understand, it requires features (like any linear model) and a time "offset" variable, and would look like this when modelling a cost: ln(cost) = b0 + b1.X1 + ... + bk.Xk + ln(t) (Xi corresponding to the different features)
> --> as I only study the cost with respect to the time from beginning of study period, would the model simply look like this: ln(cost) = b0 + ln(t)?

Well, no, not exactly. You still have to include a variable to designate the exposed/unexposed status, and a pre-post variable, and the interaction between them to have a DID design.
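
Written out in the notation of the question, a DID version of the model might look like the following. This is a sketch rather than a definitive specification: Exposed and Post are assumed to be 0/1 indicators, and t is each observation's time at risk (the quantity that goes in the -exposure()- option).

```latex
\ln E[\mathrm{cost}_{ij}] = b_0 + b_1\,\mathrm{Exposed}_i + b_2\,\mathrm{Post}_j
    + b_3\,(\mathrm{Exposed}_i \times \mathrm{Post}_j) + \ln(t_{ij})
```

Here i indexes patients and j the two periods. The DID estimate is b_3, and exp(b_3) is the ratio of the post/pre rate ratio among the exposed to the post/pre rate ratio among the unexposed. In a patient-level fixed-effects version, the time-invariant Exposed term is absorbed by the patient effects, but the interaction is still estimated.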

> As for applying the Poisson regression model to this problem, if the input data were in the following format:
> - 1 row per patient - time (in months from beginning of study period) - cost (cumulative costs over the current month)
>
> Patient  Exposed  Period  Cost
> A        Y        1       300
> A        Y        2       250
> ...      ...      ...     ...
> A        Y        30      500
> B        N        1       250
> B        N        2       300
>
> How would the pre-treatment / post-treatment observation windows (for example 6 months before vs 6 months after) be compared?

I don't understand the variable you are calling Period here. There should be only two periods: pre-treatment and post-treatment. Also, you need a numeric 0/1 variable for exposure to use in the calculations. So take a look at -help encode- to create that. You will also need a numeric variable to identify patients. If you have fewer than 65,000 patients, then -encode- will work for that as well. If you have more than that, read -help egen- and scroll down to the -group()- function.

> Would you recommend aggregating by "period label"? For example, for the 1st patient, who has 30 months of follow-up and an index date (exposure) at month 8, I would gather all costs from months 2 to 7 as the "6 months before index date" period, and all costs from months 8 to 13 as the "6 months after index date" period.

I'm not sure I understand this question. Maybe what you are calling "period label" is what I referred to as pre-treatment and post-treatment just above. If that's the case, yes, I would aggregate the data up to two observations per patient: one for the pre-treatment period and one for the post-treatment period. Just how you would define pre- and post-treatment isn't clear. Is it at month 8 for every patient? If not, how do you know what it is for any patient? There doesn't seem to be any variable that shows that.
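
The aggregation to two observations per patient could be sketched as follows, in Python with the standard library only. It assumes things the thread does not confirm: that each patient has a known index month, that the last month the patient was observed alive is available, and that whole months are an acceptable unit of time at risk. The function and argument names are hypothetical.

```python
def aggregate_windows(monthly_costs, index_month, window=6, last_month=None):
    """Collapse one patient's monthly costs into a pre and a post observation.

    monthly_costs: dict mapping month number -> cost in that month
    index_month:   first month of exposure (assumed known per patient)
    window:        length of each observation window, in months
    last_month:    last month observed alive; None means the full window
    """
    # Pre-treatment window: the `window` months just before the index month.
    pre_months = range(index_month - window, index_month)

    # Post-treatment window: truncated at death or end of follow-up.
    post_end = index_month + window - 1
    if last_month is not None:
        post_end = min(post_end, last_month)
    post_months = range(index_month, post_end + 1)

    def total(months):
        return sum(monthly_costs.get(m, 0.0) for m in months)

    pre = {"post": 0, "cost": total(pre_months), "months": len(pre_months)}
    post = {"post": 1, "cost": total(post_months), "months": len(post_months)}
    return pre, post


# Made-up patient: 100 per month for months 1 to 10, index date at month 8,
# last observed alive in month 10.
costs_a = {month: 100.0 for month in range(1, 11)}
pre_a, post_a = aggregate_windows(costs_a, index_month=8, last_month=10)
print(pre_a)
print(post_a)
```

With an index date at month 8, the pre-treatment window covers months 2 to 7 and the post-treatment window starts at month 8, matching the example in the question. The "months" value is what would serve as the time-at-risk variable: 6 for the pre-treatment observation and 3 for the truncated post-treatment one.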

The above advice is of a general nature. If you would like more specific help with code, I can provide that. But you will need to provide example data that includes all the relevant variables, and also you will need to make that example data usable, by using the -dataex- command to show it. If you are running version 17, 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.
Jules Chassetuillier

Join Date: Jan 2022

Posts: 3
#5

01 Mar 2022, 10:29

Hello,

Thank you very much for your reply.

Do you know of any references / articles where a Poisson regression model was used in a DiD design, especially to tackle the issue of unbalanced populations pre-treatment / post-treatment?

I have found this one: https://pubmed.ncbi.nlm.nih.gov/22998231/ where a fixed-effects Poisson model was used in a DiD design to compare changes in staffing and in quality of care in California hospitals with changes over the same time period in other hospitals (the event of interest is the implementation of California's minimum nurse staffing legislation), but the article does not emphasize the issue of unbalanced populations.

Thank you very much again.

Best regards.

Jules
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#6

01 Mar 2022, 12:01

Offhand, I don't know of any such references. Part of the reason is that this is not the type of analysis where people typically care whether the data are balanced or not, so I don't think people would generally highlight this in publication. It's just not really an issue. In fact, with modern statistical software, there are very few analyses where it makes any difference whether you have balanced data or not.