How to write a matching algorithm?

Max Crichton

Join Date: Nov 2015

Posts: 19
#1


12 Nov 2015, 09:12

Dear forum members,

First of all, thank you for taking the time and maybe being able to help me.

I am facing the following challenge: I am trying to replicate and extend the paper "Estimating the effect of smoking on birth outcomes using a matched panel data set" by J. Abrevaya (2006). Birth data for the US are available from NCHS as separate files for each year (I am using '99 - '04). Since the data contain no unique identifier for each mother (e.g. a Social Security number), Abrevaya implements a "matching algorithm" to identify mothers with several births over the years.

Frankly speaking, I do not know how to write the algorithm. I have prepared the data for every single year (separate cross-sectional data sets) and would like to write an algorithm such that the data from the single years are merged into a new panel data set according to some specific matching criteria.

In a first algorithm by Abrevaya, the matching criteria are "Mother's state of birth", "Mother's race", "Child's state of birth", "Child's county of birth", and "Child's city of birth".
Is it possible to merge the data from the single years into a new panel data set, given the matching criteria? Or should I merge all the data into a (very large) new dataset and then apply the matching criteria to sort out the data I don't want to use (and if so, how could I do that)?

Thank you very much in advance for your help and comments, I am very much looking forward to them.

Best,

Max

Last edited by Max Crichton; 12 Nov 2015, 09:24.
Tags: panel data, Time Series
Mike Lacy

Join Date: Apr 2014

Posts: 2404
#2

12 Nov 2015, 14:55

I'm thinking something must be missing here. What I'm taking to be your situation is that you have data records for the same women in different years, and you'd like to identify all records that belong to a given person. I can't imagine how the variables you mention could come close to uniquely identifying one person: For example, there must be thousands of women born in NY state, who are white, and who had a child in Bronx county/city NY at two different points in time. So, I must be misunderstanding something here, and I suspect others would as well. Could you explain a bit?
(By the way: I'm not as reactive to informal article citations as are some participants here, but I'd recommend you check the FAQ on this subject.)

Regards, Mike
wbuchanan

Join Date: Mar 2014

Posts: 1361
#3

12 Nov 2015, 17:25

You may want to check out the program cem in the SSC archives. It sounds like you are trying to do some form of fuzzy matching, but it isn't completely clear.
Max Crichton

Join Date: Nov 2015

Posts: 19
#4

13 Nov 2015, 05:32

Hello Mike, hello wbuchanan,

Thank you for your replies. Mike, the algorithm I described in my first post is a baseline selection and is not aiming at uniquely identifying a woman and her births over the years. I will try to do that with a more advanced one using more criteria. In Abrevaya's paper there also is a longer discussion about how precise the applied algorithms are and to what extent one can rely on the matches as being "true". (Also, for his study he only uses white women for example)
The more advanced algorithms I would like to apply would in addition to the above mentioned criteria include "Mother's years of education", "Mother's marital status", if she is married then "Father's race" and so on.
In principle you did understand my problem correctly. At the moment I am struggling with the coding in Stata, i.e. how to combine the datasets of each single year into one panel data set and how to sort this panel data set given a range of matching criteria. For example, I would like to code something in Stata that does the following (please excuse my everyday formulation here): If "Mother's state of birth" and "Mother's race" and so on are the same, I would like to group these observations and give them a "unique" identifier in the newly constructed panel data set.

I hope this made it a bit clearer for you. Thanks also to wbuchanan for the cem tip, I am looking into it now.

Once again, thanks for taking the time and helping me out, I am really looking forward to your replies.

Best,

Max

P.S.: Thanks also for mentioning the citation issue, I changed it. (As a newbie here in this forum I did not fully read the FAQ in the beginning)
Mike Lacy

Join Date: Apr 2014

Posts: 2404
#5

13 Nov 2015, 06:50

Perhaps -append- is all you need. The following would result in a data set that is "long" with respect to the identifiers you mentioned. The code might look something like this:

Code:

append using "file1999.dta" "file2000.dta" ......., ///
    generate(FileYear) keep(RelevantVarList)
recode FileYear (1 = 1999) (2 = 2000) .....   // append does not know this
* egen's group() takes a space-separated variable list, not commas
egen Crude_Group = group(MotherState MotherCounty MotherRace .....)
sort Crude_Group

Regards, Mike
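
To see what the -append- plus -egen group()- approach produces, here is a minimal sketch on made-up data. The two tiny yearly files, the variable names MotherState and MotherRace, and all values are hypothetical stand-ins, not the NCHS files. The sketch starts from an empty dataset so that the first appended file receives FileYear == 1, and it uses -replace- instead of -recode- only for brevity.

```stata
* Toy illustration only: two made-up yearly files stand in for the real ones.
clear
input byte(MotherState MotherRace)
1 1
2 1
end
tempfile y1999
save "`y1999'"

clear
input byte(MotherState MotherRace)
1 1
1 2
end
tempfile y2000
save "`y2000'"

* Start from an empty dataset so the first appended file gets FileYear == 1.
clear
append using "`y1999'" "`y2000'", generate(FileYear)
replace FileYear = 1998 + FileYear   // 1 -> 1999, 2 -> 2000

* One group number per distinct combination of the matching variables.
egen Crude_Group = group(MotherState MotherRace)
sort Crude_Group FileYear
list, sepby(Crude_Group)
```

Records that share all matching variables end up in the same Crude_Group regardless of which yearly file they came from. That is a necessary condition for belonging to the same mother, not a sufficient one.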
Max Crichton

Join Date: Nov 2015

Posts: 19
#6

16 Nov 2015, 16:42

Hi Mike,

Thank you very much for your help. The code you proposed is working fine! I just have one follow-up question: I would like to add further restrictions to the grouping. Now that I have grouped the observations using the above-mentioned criteria, I would like to differentiate within the groups in order to obtain "unique identifiers". I would like to do that by putting a restriction on the age order, i.e. that a mother who was say 28 in 1998 at the birth of her first child can be between 29 and 31 (depending on her birthday) in 1999 when she has a (potential) second birth.

Summing this up, I would like to include a restriction in the sense that the age in later years of the dataset has to be higher than the one linked to the first birth. Do you have any suggestion on how to code that?

Thank you guys so much already, you have been a great help!

Best,

Max
Mike Lacy

Join Date: Apr 2014

Posts: 2404
#7

16 Nov 2015, 21:46

Your description is relatively opaque to me, as your use of terminology is different than mine.
I'm guessing you mean you would like to detect records within Crude_Group for which MotherAge at year2 is less than or equal to MotherAge at year1, where year1 < year2. A start on this problem would be the following:

Code:

* bysort sorts the data first; plain -by- would fail on unsorted data
bysort Crude_Group (year): gen not_same_person = 1 if (MotherAge <= MotherAge[_n-1]) & (year > year[_n-1])

I have not thought this through completely, but I suspect this might not work perfectly.
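
A small made-up example may help show what this flag does and does not catch. The values below are hypothetical; only the variable names follow the code above.

```stata
* Toy data: group 1 contains a record whose age drops in a later year.
clear
input byte Crude_Group int year byte MotherAge
1 1999 28
1 2000 29
1 2001 27
2 1999 31
2 2001 33
end

* Flag a record if the mother is not older than in the previous record
* of the same group although the year is later.
bysort Crude_Group (year): gen byte not_same_person = 1 ///
    if (MotherAge <= MotherAge[_n-1]) & (year > year[_n-1])

list, sepby(Crude_Group)
```

Only the 2001 record in group 1 is flagged. Note that each record is compared only with the record immediately before it in the sort order, so when a group holds several women who gave birth in the same year, the comparison depends on how those ties happen to be ordered.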
Max Crichton

Join Date: Nov 2015

Posts: 19
#8

17 Nov 2015, 03:50

Thanks Mike, this is working again. I end up with a group of women where the above mentioned criteria match each other. The last problem I am facing now is to create a unique ID for each woman tracking her births over the years. I am trying to end up with something like this

ID | #birth | MotherAge | Year .......

1 | 1 | 27 | 9
1 | 2 | 28 | 10
2 | 1 | 25 | 8
2 | 2 | 26 | 9
2 | 3 | 27 | 10
3 | 1 | 31 | 10
3 | 2 | 32 | 11

I just can't figure out how to code in Stata to define a unique ID for each mother within the groups which are defined by the code from your latest post. Would you maybe have any suggestions for that?

Thanks so much again, you have been great help already!
Mike Lacy

Join Date: Apr 2014

Posts: 2404
#9

17 Nov 2015, 07:12

Presuming you are going to treat the observations with not_same_person == 1 as unknown, you might try:

Code:

gen id = Crude_Group if (not_same_person != 1)
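
For the birth counter in the layout requested in post #8, one possible sketch is to number the records within each id by year. The data below are made up, and birth_no is a hypothetical variable name. This assumes each remaining group really contains a single woman, which the crude grouping alone does not guarantee.

```stata
* Toy data: the last record was flagged as inconsistent beforehand.
clear
input byte Crude_Group int year byte(MotherAge not_same_person)
1 1999 27 .
1 2000 28 .
2 1998 25 .
2 1999 26 .
2 2000 24 1
end

gen id = Crude_Group if (not_same_person != 1)

* Number the births within each id in chronological order;
* flagged records keep a missing id and a missing counter.
bysort id (year): gen birth_no = _n if !missing(id)

list id birth_no MotherAge year, sepby(id)
```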
Max Crichton

Join Date: Nov 2015

Posts: 19
#10

17 Nov 2015, 07:33

Thanks Mike! Great work you are doing, you have no idea how much you are helping me! I would be completely lost without you! Since I am working with a large dataset, this unfortunately does not give me unique identifiers. Would you have another suggestion for creating an ID variable based on consecutive steps?

For example: Assign the ID 1 to a woman with

first birth in 1999, at age 29, with 13 years of education
(potential) second birth in 2000, at age 30, with 13 years of education

The problem I am facing is to create a unique link for every woman where the matching criteria make sense and to drop all the other possible combinations where the matching criteria are violated. I can't find a way to do that, maybe the -merge- command?

Best,

Max
Mike Lacy

Join Date: Apr 2014

Posts: 2404
#11

17 Nov 2015, 09:29

I find it too difficult to guess what the problems are based on your description. For example, from your most recent post, your difficulty could be that you don't know about the -drop- command, but I presume that is not true.

Consequently, I would ask that per the FAQ (http://www.statalist.org/forums/help) you create a concrete data example and post it here; I no longer want to respond to purely verbal descriptions of your data problem. See the material in the FAQ about posting data examples and using -dataex- to do this. Making a good data example is not easy, I understand, but if you can do that, we can help you efficiently and effectively.

Max Crichton

Join Date: Nov 2015
Posts: 19

#12

17 Nov 2015, 09:57

Hi Mike, thanks for the suggestion and sorry about the whole confusion. This is my merged, "unpaneled" data set.

Code:

 use "/Users/maxcrichton/Desktop/Uni/Master/Semester 3/Advanced Microeconometrics/Pa
> per/Data/natlmerged.dta", clear

. describe, numbers

Contains data from /Users/maxcrichton/Desktop/Uni/Master/Semester 3/Advanced Microeco
> nometrics/Paper/Data/natlmerged.dta
  obs:     1,160,832                          
 vars:            27                          17 Nov 2015 09:14
 size:    82,419,072                          (_dta has notes)
-------------------------------------------------------------------------------------
    variable  storage   display    value
      name      type    format     label      variable label
-------------------------------------------------------------------------------------
    1. dmage     byte    %8.0g                 Age of Mother
    2. mrace     byte    %8.0g                 Race of Mother
    3. dmeduc    byte    %8.0g                 Education of Mother
    4. dmar      byte    %8.0g                 Marital Status of Mother
    5. mplbir    long    %8.0g                 Place of Birth of Mother
    6. nlbnl     byte    %8.0g                 Number of Live Births Now Living
    7. frace     byte    %8.0g                 Race of Father
    8. year      int     %8.0g                 Year of Birth
    9. gestat    byte    %8.0g                 Gestation - Detail in Weeks
   10. dbirwt    int     %8.0g                 Birthweight
   11. cigar     byte    %8.0g                 Average Number of Cigarettes Per Day
   12. stateres  long    %8.0g      stateres1
                                              NCHS State of Residence
   13. births~e  long    %8.0g      birthstate
                                              NCHS State of Occurence
   14. birthc~y  long    %8.0g      birthcounty
                                              NCHS County of Occurence
   15. smoke     float   %9.0g                 
   16. male      float   %9.0g                 
   17. married   float   %9.0g                 
   18. hsgrad    byte    %8.0g                 dmeduc == 1 2 3 4 5 6 7 8 9 10 11 12
   19. somecoll  byte    %8.0g                 dmeduc == 13 14 15
   20. collgrad  byte    %8.0g                 dmeduc == 16 17 18 19 20
   21. agesq     float   %9.0g                 
   22. black     float   %9.0g                 
   23. adeqco~2  float   %9.0g                 
   24. adeqco~3  float   %9.0g                 
   25. novisit   float   %9.0g                 
   26. pretri2   float   %9.0g                 
   27. pretri3   float   %9.0g                 
-------------------------------------------------------------------------------------

As you suggested, I ran the following code:

Code:

use "/Users/maxcrichton/Desktop/Uni/Master/Semester 3/Advanced Microeconometrics/Paper/Data/natlmerged.dta"

egen match = group(mplbir mrace birthstate birthcounty)

sort match year 

by match (year): gen not_same_person = 1 if (dmage >= dmage[_n-1]) & (year > year[_n-1])
(1,305,693 missing values generated)

drop if not_same_person == 1
(29,697 observations deleted)

This is what the data looks like afterwards:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input byte(nlbnl dmage dmeduc) int year float(match not_same_person)
1 40 16  9 1 .
0 26 14  9 1 .
2 21 15  9 1 .
1 27 16  9 1 .
0 22 14  9 1 .
3 27 13  9 1 .
2 26 10  9 1 .
1 22 11  9 1 .
0 25 14  9 1 .
0 20 12  9 1 .
1 28 14  9 1 .
1 28 13  9 1 .
2 28 13  9 1 .
1 26 13  9 1 .
0 17 12  9 1 .
0 19 13 10 1 .
2 28 12 10 1 .
0 32 15 10 1 .
0 25 14 10 1 .
2 29 13 10 1 .
end

As you can see, match == 1 for many observations, i.e. no unique identifier is created. Do you have any suggestions on how to create a unique identifier that traces a mother's births over the years, also taking into account the above-mentioned restrictions on age and birth order in consecutive years?

Thank you very much for your help!


Mike Lacy

Join Date: Apr 2014

Posts: 2404
#13

18 Nov 2015, 20:03

All I could suggest is to create the match variable using more variables than just (mplbir mrace birthstate birthcounty). Since you have not done this, I presume you have some reason for not doing it, so I don't know what further to suggest.
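
As a rough sketch of that suggestion, the grouping below adds mother's education (dmeduc, from the -describe- output in post #12) to the match and then checks whether any group still has more than one record in the same year. The data are made up, and match2, n_same_year, max_same_year, and ambiguous are hypothetical names. Two assumptions are built in: that education does not change between births, and that one woman contributes at most one record per year (twins or two births in one calendar year would violate this).

```stata
* Made-up records; variable names follow the -describe- output in post #12.
clear
input long mplbir byte(mrace dmeduc dmage) int year
36 1 13 29 1999
36 1 13 30 2000
36 1 16 25 1999
36 1 16 31 1999
36 1 12 22 2000
end

* Finer grouping: add education to the matching variables.
egen match2 = group(mplbir mrace dmeduc)

* A group with several records in one year cannot be a single woman
* under the assumptions stated above, so mark it as ambiguous.
bysort match2 year: gen n_same_year = _N
bysort match2: egen max_same_year = max(n_same_year)
gen byte ambiguous = max_same_year > 1

list, sepby(match2)
```

Groups marked ambiguous still need further matching variables, or have to be set aside, before match2 can serve as a mother identifier.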
