Exact matching on one categorical variable

Ashani Abayasekara

Join Date: May 2023

Posts: 106
#1

10 May 2023, 03:05

Hi,

I have a panel dataset with individuals observed over 8 years. I'm estimating an event study design to identify selected outcomes of a treated group in comparison to a control group (these groups are defined based on industries in which the individuals are employed). To make the two groups more comparable, I want to do exact matching based only on the 4-digit occupation code in each industry (a categorical variable). Is there a command to do this in Stata? So far what I have come across is coarsened exact matching, which from my understanding is more appropriate for matching on one or more continuous variables, and which does not seem to work in my setting.

I'm looking for a command that would match individuals in the two groups simply based on their 4-digit occupation code, so that I would be comparing individuals working in the same exact occupations across the two industries. I assume this can be done manually too, given that matching is on just one variable, but I am not sure how to do this either. Any help on this would be greatly appreciated. Due to privacy issues, I am unfortunately not able to share the actual dataset. What I have is the individual id, year, occupation code, treatment status (belonging to one of two industries), and outcome variables such as income and social welfare support. The occupation code is the same for a given individual across years.

Thank you very much.
Ashani
Clyde Schechter

Join Date: Apr 2014

Posts: 30097
#2

10 May 2023, 08:40

Since you do not provide example data to work with, here is an outline of the approach:

1. -preserve- the data set, and then keep only the control observations. Rename all of the variables except occupation_code by suffixing _ctrl, so that occupation_code remains available as the key for step 3. Save the result in a tempfile.
2. -restore- the original data set and now keep only the cases. Rename all of the variables except occupation_code by suffixing _case. Create a new variable called case_num by running -gen `c(obs_t)' case_num = _n-.
3. Run -joinby occupation_code using the_tempfile_you_saved_in_step_1-.
4. Set the random number seed to whatever you like.
5. -gen double shuffle1 = runiform()-. If your data set is more than a few million observations long at this point, also create a second new variable -gen double shuffle2 = runiform()-.
6. -by case_num (shuffle1), sort: keep if _n == 1-. If you had to create a shuffle2 variable, then do it -by case_num (shuffle1 shuffle2), sort:...-
7. You now have randomly matched case-control pairs with agreement on occupation code.
8. You will probably need to now -reshape- the data into long layout for further work.
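
The outline above can be sketched roughly as follows. This is only an illustration on made-up toy data: the variable names id, treat, and income1 and the seed values are assumptions, not taken from the real data set, and the renaming is done by suffixing everything and then restoring the name of the matching key.

```stata
* Toy data (entirely made up): one row per person, wide layout
clear
set seed 2023
set obs 200
gen long id = _n
gen int occupation_code = 1000 + ceil(10 * runiform())
gen byte treat = runiform() < 0.3
gen double income1 = rnormal(100, 15)

* Step 1: controls only; suffix every variable, then restore the key's name
preserve
keep if treat == 0
rename * *_ctrl
rename occupation_code_ctrl occupation_code
tempfile controls
save `controls'
restore

* Step 2: cases only; same renaming with _case, plus a case counter
keep if treat == 1
rename * *_case
rename occupation_code_case occupation_code
gen `c(obs_t)' case_num = _n

* Step 3: form every case-control pair sharing an occupation code
* (cases with no control in their occupation are dropped here)
joinby occupation_code using `controls'

* Steps 4-6: keep one randomly chosen control per case
set seed 12345
gen double shuffle1 = runiform()
by case_num (shuffle1), sort: keep if _n == 1

* Each remaining case appears exactly once
isid case_num
```

Note that in this sketch the same control can be paired with more than one case, since each case draws independently from the controls in its occupation.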

If you have difficulty implementing the approach from this outline, when you post back show example data, including some potential matches in the example. I understand your data set is confidential. We don't need to see real data. So take your data set, select some acceptable matching pairs and then change the values of all the variables (e.g. replace them with random numbers) except the occupation code and the treatment indicator. Then show that here.
Ashani Abayasekara

Join Date: May 2023

Posts: 106
#3

11 May 2023, 00:03

Hi Clyde,

Thank you very much for the detailed help. I will try this out.

In the meantime, I came across the "k2k" option in coarsened exact matching, which "forces the algorithm to create strata with equal numbers of treated and control units." I think this is what I need. However, k2k does this by randomly dropping observations from each stratum, and I therefore get different results every time I run it. I'm not certain whether randomly dropping unmatched observations is the best approach when performing exact matching. Would setting the seed be the best in this situation?

I will also go through the steps you have suggested above. My dataset is somewhat like the following, before reshaping, where income1, income2, and income3 refer to income in three consecutive years. As indicated, there are both treatment and control units belonging to a given occupation code.

Thanks for your help,
Ashani
Ashani Abayasekara

Join Date: May 2023

Posts: 106
#4

11 May 2023, 00:15

Attached is a sample dataset
Clyde Schechter

Join Date: Apr 2014

Posts: 30097
#5

11 May 2023, 08:56

Would setting the seed be the best in this situation?

Whenever you do a procedure that involves random selection of anything, it is wise to first set the random number generator seed so that the results can be reproduced.
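
As a small illustration of that point, here is a sketch on made-up data (the seed value 12345 is arbitrary): resetting the seed before each draw reproduces exactly the same random numbers.

```stata
* Toy example: the same seed gives the same random draws
clear
set obs 5

set seed 12345
gen double u1 = runiform()

set seed 12345
gen double u2 = runiform()

* The two sets of draws are identical, so this passes silently
assert u1 == u2
```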
Ashani Abayasekara

Join Date: May 2023

Posts: 106
#6

11 May 2023, 16:36

Ok, thank you. And do you think that randomly matching equal numbers of treatment and control units within each stratum (by occupation code) is a credible approach in this case?
Clyde Schechter

Join Date: Apr 2014

Posts: 30097
#7

11 May 2023, 18:30

I have no idea. That depends on things you have not disclosed (and that I might not understand even if you had).

First, are you able to find matches for all, or nearly all of the cases in this way? If there are many cases that are unmatchable, then you are working on a reduced, and probably biased, subset of the data.

Next, is this occupational code variable a really important confounder of the main relationship you are trying to study? That is, are the outcomes dependent on the occupational code, and is the treatment applied in different proportions in different occupational codes? Unless both are true, occupational code is not a confounder, and matching on it serves no purpose.
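
A rough way to look at both questions is sketched below on made-up toy data. The variable names occupation and treat follow the names used elsewhere in this thread; everything else (sample size, number of codes, treatment share) is an assumption for illustration only.

```stata
* Toy data (made up): one row per person
clear
set seed 2023
set obs 500
gen int occupation = 1000 + ceil(20 * runiform())
gen byte treat = runiform() < 0.3

* Number of treated and control units within each occupation code
bysort occupation: egen n_treat = total(treat == 1)
bysort occupation: egen n_ctrl  = total(treat == 0)

* A case is matchable only if its occupation contains at least one control
gen byte matchable = treat == 1 & n_ctrl > 0
count if treat == 1
count if matchable

* Does the share treated differ across occupation codes?
tabulate occupation treat, row
```

Comparing the two counts shows how many cases would be lost to matching, and the row percentages show whether treatment is applied in different proportions across codes.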
Ashani Abayasekara

Join Date: May 2023

Posts: 106
#8

12 May 2023, 00:53

I am able to match only around 19% of the total sample using k2k matching (5000 plus out of a total of 31000). The answer to both of your questions on the occupation variable is yes, so I think occupation code is an important confounder.

My concern is that the final results are very sensitive to the specific seed I set at the start.
Clyde Schechter

Join Date: Apr 2014

Posts: 30097
#9

12 May 2023, 09:46

My concern is that the final results are very sensitive to the specific seed I set at the start.

That is surprising. Unless your data set is tiny, or some of the variables in your data exhibit very wide variation, the properties of the random set of controls picked to pair with the cases should be similar, within the limits of sampling variation. Can you post example data (use -dataex- for this) that produces very different samples with different seeds? (And also show the exact code you are using.)

If you are running version 18, 17, 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.
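
One way to gauge how much seed-to-seed variation to expect is to repeat the random pairing under several seeds and compare a summary statistic. The sketch below does this with the manual -joinby- approach on made-up toy data, not with -cem-; the variable names, the 20 seeds, and the data-generating process are all assumptions for illustration.

```stata
* Toy data (made up): one row per person
clear
set seed 2023
set obs 2000
gen long id = _n
gen int occupation_code = 1000 + ceil(25 * runiform())
gen byte treat = runiform() < 0.2
gen double income = 100 + 0.01 * occupation_code + rnormal(0, 15)
tempfile full
save `full'

* Controls: keep only the key and the outcome
keep if treat == 0
keep occupation_code income
rename income income_ctrl
tempfile controls
save `controls'

* Cases, joined to every control in the same occupation
use `full', clear
keep if treat == 1
rename income income_case
gen `c(obs_t)' case_num = _n
joinby occupation_code using `controls'
tempfile pairs
save `pairs'

* Redo the random choice of control under 20 different seeds
forvalues s = 1/20 {
    use `pairs', clear
    set seed `s'
    gen double shuffle1 = runiform()
    by case_num (shuffle1), sort: keep if _n == 1
    gen double diff = income_case - income_ctrl
    quietly summarize diff
    display "seed `s': mean case-control difference = " %9.3f r(mean)
}
```

If the displayed means vary only modestly, the differences are within ordinary sampling variation; large swings would point to a small effective sample or very wide variation in the outcome.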
Ashani Abayasekara

Join Date: May 2023

Posts: 106
#10

12 May 2023, 20:04

Thank you very much for the explanation on posting example data. Unfortunately I can't do that either - the data I am using is accessible on a remote platform and cannot be copied externally until clearance is obtained.

My dataset has around 25,000 observations. There is some variation in the data but not overly wide.

The code I use to match the data is:

cem occupation (#0), treatment(treat) k2k

where occupation refers to the 4-digit occupation code and treat to treatment status (0 or 1).

After this command, I reshape the data and run fixed effects regressions as follows:

reshape long income, i(id)
xtreg income i.year treat12-treat18 if cem_matched==1, fe i(id) r

where treat12-treat18 are dummies where specific year dummies are interacted with the treatment status dummy variable.
Clyde Schechter

Join Date: Apr 2014

Posts: 30097
#11

13 May 2023, 05:42

I'm afraid I can't help you. -cem- is a community-contributed command that I am not familiar with.

Also, the structure of your data seems a bit unusual: you have gone long on income, which makes sense. But you have a series of treatment variables treat12-treat18, and I don't know what that means or how it works. I would ordinarily expect to see a single treatment variable that (perhaps) varies over time.

Between an unfamiliar command and the absence of data to work with, I don't think I can figure this one out. Perhaps somebody else following the thread can and will chime in.
Ashani Abayasekara

Join Date: May 2023

Posts: 106
#12

14 May 2023, 05:40

Thank you very much for all your efforts to help, I appreciate it. I opted for -cem- as it seemed capable of doing what I wanted and is much simpler than doing it manually. I will keep you posted on any further developments.

The series of treatment variables refers to yearly dummies (for the years 2012-2018) interacted with the treatment dummy. We consider several years compared to a base year, following an event study framework, to see how outcomes of the treatment group compare to those of the control group in each year relative to a base year.
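
For readers following along, the same kind of year-by-treatment interactions can also be written with factor-variable notation instead of hand-made dummies. The sketch below uses made-up toy data; the years 2011-2018, the choice of 2011 as the base year, and the simulated effect are assumptions for illustration, not details of the actual study.

```stata
* Toy panel (made up): 300 people observed for 8 years
clear
set seed 2023
set obs 300
gen long id = _n
gen byte treat = runiform() < 0.4
expand 8
bysort id: gen int year = 2010 + _n   // years 2011-2018
gen double income = 100 + 5 * treat * (year >= 2015) + rnormal(0, 10)

xtset id year

* Year effects plus year-by-treatment interactions, with 2011 as the base year.
* The main effect of treat is time-invariant, so the fixed effects absorb it.
xtreg income ib2011.year##i.treat, fe vce(cluster id)
```

Each interaction coefficient is then the treated-minus-control difference in that year relative to the base year.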