Merging two datasets with repeated individuals

Juliana Cunha

Join Date: Jan 2024

Posts: 18
#1

Merging two datasets with repeated individuals

09 Mar 2024, 08:29

Hello, please, could anyone help me?
I have two data sets. The first one has data when the person committed a crime. Some individuals committed more than one crime in different dates per year.

Code:

* Example generated by -dataex-. For more info, type help dataex clear input float(id yearcrime category monthcrime daycrime) 1 2000 1 3 23 1 2001 1 5 2 4 2000 2 8 5 4 2001 1 12 14 4 2002 3 1 17 5 2003 3 4 1 8 2003 2 5 6 9 2002 1 7 22 9 2003 1 9 15 28 2000 1 6 13 end

The second dataset brings id, year when the person took a test, year, month and day when the person got ill. There are people who never got ill, people who got ill, people who never committed crime, people who committed more than one in one given year and some people who committed crimes in several years.

Code:

* Example generated by -dataex-. For more info, type help dataex clear input float(id yeartest sick datestartedillness score) 1 2000 1 14695 100 1 2001 0 . 101 1 2002 0 . 256 2 2000 0 . 233 2 2001 0 . 99 3 2000 0 . 298 3 2001 1 15078 89 3 2002 0 . 167 4 2000 1 14945 243 5 2001 0 . 156 5 2002 0 . 128 6 2000 0 . 257 7 2000 0 . 234 7 2001 1 15226 194 7 2002 0 . 206 7 2003 0 . 108 8 2000 0 . 87 8 2001 1 15021 165 8 2002 0 . 271 9 2003 0 . 172 end format %td datestartedillness

The data that contains test scores and illness have names repeating over the years, but not all years. so this is more like a panel at the school level.
I need to create the crime variable (y) at the individual level using the date at which they started to be ill , creating a dummy for committing any crime in one 365 days and 730 days after that. I also need to create another variable for count_crime to count the number of crimes each individual committed 365 and 730 days after being ill. Many thanks in advance!
Tags: None

Clyde Schechter

Join Date: Apr 2014
Posts: 29792

09 Mar 2024, 11:24

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input float(id yearcrime category monthcrime daycrime)
 1 2000 1  3 23
 1 2001 1  5  2
 4 2000 2  8  5
 4 2001 1 12 14
 4 2002 3  1 17
 5 2003 3  4  1
 8 2003 2  5  6
 9 2002 1  7 22
 9 2003 1  9 15
28 2000 1  6 13
end
tempfile crimes
save `crimes'

* Example generated by -dataex-. For more info, type help dataex
clear
input float(id yeartest sick datestartedillness score)
1 2000 1 14695 100
1 2001 0     . 101
1 2002 0     . 256
2 2000 0     . 233
2 2001 0     .  99
3 2000 0     . 298
3 2001 1 15078  89
3 2002 0     . 167
4 2000 1 14945 243
5 2001 0     . 156
5 2002 0     . 128
6 2000 0     . 257
7 2000 0     . 234
7 2001 1 15226 194
7 2002 0     . 206
7 2003 0     . 108
8 2000 0     .  87
8 2001 1 15021 165
8 2002 0     . 271
9 2003 0     . 172
end
format %td datestartedillness
tempfile illnesses
save `illnesses'


use `crimes'
gen crimedate = mdy(monthcrime, daycrime, yearcrime)
assert missing(crimedate) == missing(monthcrime, daycrime, yearcrime)
drop monthcrime daycrime yearcrime
format crimedate %td
tempfile crimes_with_dates
save `crimes_with_dates'

use `illnesses', clear
gen `c(obs_t)' obs_no = _n
frame put _all if !missing(datestartedillness), into(working)
frame working {
    gen lower = datestartedillness + 1
    gen upper = datestartedillness + 730
    replace lower = 1 if missing(datestartedillness)
    replace upper = 0 if missing(datestartedillness)
    format lower upper %td
    rangejoin crimedate lower upper using `crimes_with_dates', by(id)
    by obs_no, sort: egen crimes_365 = total(crimedate <= datestartedillness + 365)
    by obs_no: egen crimes_730 = total(crimedate <= datestartedillness + 730)
    gen byte any_crime_365 = crimes_365 > 0 & !missing(crimes_365)
    gen byte any_crime_730 = crimes_730 > 0 & !missing(crimes_730)
    by obs_no, sort: keep if _n == 1
    drop crimedate
}
frlink 1:1 obs_no, frame(working)
frget crimes_* any_crime_*, from(working)

Note: -rangejoin- is written by Robert Picard and is available from SSC. To use it, you must also install -rangestat-, by Robert Picard, Nick Cox, and Roberto Ferrer, also available from SSC.

I have put the original two data sets into tempfiles in this code, but there is no need for you to do that--you can just use the original permanent data sets.

Comment

Juliana Cunha

Join Date: Jan 2024

Posts: 18
#3

09 Mar 2024, 14:00

Thank you so much prof. Clyde. It worked very well.

But 'category' variable disappears. And I am now trying to create new variables: crimes_365,
crimes_730
any_crime_365
any_crime_730
------------------------------
, but per category as well. Number of crimes per category per person within the same time thresholds.
I tried something like this: * Assuming `category` is part of my working frame after the rangejoin I would do:

Code:

by obs_no category, sort: egen category_crimes_365 = total(crimedate <= datestartedillness + 365) by obs_no category: egen category_crimes_730 = total(crimedate <= datestartedillness + 730)

And I wanted to create all these new variables per category. But I failed.
Please, how could I do it? Many thanks!

Last edited by Juliana Cunha; 09 Mar 2024, 14:04.
Comment

Clyde Schechter

Join Date: Apr 2014
Posts: 29792

09 Mar 2024, 14:26

but per category as well. Number of crimes per category per person within the same time thresholds.

You didn't say that in your original post. But it's just a small modification.

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input float(id yearcrime category monthcrime daycrime)
 1 2000 1  3 23
 1 2001 1  5  2
 4 2000 2  8  5
 4 2001 1 12 14
 4 2002 3  1 17
 5 2003 3  4  1
 8 2003 2  5  6
 9 2002 1  7 22
 9 2003 1  9 15
28 2000 1  6 13
end
tempfile crimes
save `crimes'

* Example generated by -dataex-. For more info, type help dataex
clear
input float(id yeartest sick datestartedillness score)
1 2000 1 14695 100
1 2001 0     . 101
1 2002 0     . 256
2 2000 0     . 233
2 2001 0     .  99
3 2000 0     . 298
3 2001 1 15078  89
3 2002 0     . 167
4 2000 1 14945 243
5 2001 0     . 156
5 2002 0     . 128
6 2000 0     . 257
7 2000 0     . 234
7 2001 1 15226 194
7 2002 0     . 206
7 2003 0     . 108
8 2000 0     .  87
8 2001 1 15021 165
8 2002 0     . 271
9 2003 0     . 172
end
format %td datestartedillness
tempfile illnesses
save `illnesses'


use `crimes'
gen crimedate = mdy(monthcrime, daycrime, yearcrime)
assert missing(crimedate) == missing(monthcrime, daycrime, yearcrime)
drop monthcrime daycrime yearcrime
format crimedate %td
tempfile crimes_with_dates
save `crimes_with_dates'

use `illnesses', clear
gen `c(obs_t)' obs_no = _n
tempfile original
save `original'

keep if !missing(datestartedillness)
gen lower = datestartedillness + 1
gen upper = datestartedillness + 730
format lower upper %td
rangejoin crimedate lower upper using `crimes_with_dates', by(id)
by obs_no category, sort: egen crimes_365 = total(crimedate <= datestartedillness + 365)
by obs_no category: egen crimes_730 = total(crimedate <= datestartedillness + 730)
gen byte any_crime_365 = crimes_365 > 0 & !missing(crimes_365)
gen byte any_crime_730 = crimes_730 > 0 & !missing(crimes_730)
by obs_no category, sort: keep if _n == 1
drop crimedate

merge m:1 obs_no using `original', assert(match using) nogenerate

//    OPTIONALLY, CREATE WIDE DATA LAYOUT FOR CRIME CATEGORIES
rename crimes* crimes_*_cat_
rename any_crime* any_crime*_cat_
mvencode category, mv(0)

reshape wide crimes_*_cat_ any_crime*_cat_, i(obs_no) j(category)
drop *_cat_0 obs_no lower upper
order *crime*, last
isid id yeartest, sort

This code, up through the -merge- will create what you ask. There will now be multiple observations, one per category of crime committed within the specified timeframes, for each original observation. If you prefer to have the category-specific results "side by side" in the data, then you can run the rest of the code to get that. Which way is better really depends on what you're going to do with the results.

Note: Assumes that the variable category never takes the value 0. If that's wrong, the code will break at the -mvencode- command. If that happens, replace 0, there and in the -drop- command two lines later, by some other non-negative integer which is not a possible value of category.

Comment

Juliana Cunha

Join Date: Jan 2024

Posts: 18
#5

12 Mar 2024, 13:48

Thank you very much Prof. Clyde. It ran smoothly. However, when I run a fixed effects model (year and school_id) for

Code:

number of crimes = ill + controls+ uit

where ill takes 1 if the person got sick and 0 otherwise, the results drop the variable ill and keep just the obs of people who got ill. I would like to be able to run this model keeping all my observations including people who got ill or not and who committed a crime or not. Currently, the regression is running with a small sample of people who got ill, rather than comparing every one. Please, would you have a solution for this? Many thanks!
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 29792
#6

12 Mar 2024, 14:32

Well, this is not possible with your data/study design. For the people who got ill, you have a reference date, datestartedillness, that enables you to define the 365 and 730 day periods during which you ascertain the number of crimes. But for the people who didn't get ill, you have no comparable reference date, so there is no way to establish what are the right 365 and 730 day periods over which to count crimes.

The usual approach to a situation like this in epidemiology is to create matched pairs: for each person who got ill, identify another person who didn't get ill but has the same* relevant characteristics as the person who did. Then you assign to the non-ill person the same reference date as the datestartedillness of the ill person that they are matched with. From there you do the crime counting, and then when you get to your regression you will need to do a multi-level regression that accounts for the matched pair data as well as the repeated observations within individual people.

*When I say the "same relevant characteristics" that is shorthand for a variety of approaches that can be used. One can pick out a small number of characteristics that are very strongly related to the illness event and then require exact agreement between the ill and non-ill person on those. You can't do that for a large number of characteristics, because you will find that most people can't be matched exactly to anyone on a large number of attributes. So if many attributes ought to be considered, then various kinds of approximate agreement can be used (coarsened exact matching, or caliper matching). Or one might use propensity score matching, which creates an "index" of the matching variables and matches approximately on that index--which in effect matches on "overall similarity" without requiring really close agreement on any particular attribute. There is also the issue of which characteristics are "relevant."

This is a complicated subject. The choice of a best matching technique depends very much on substantive considerations about the variables to be matched and the other variables that will be used along with the matched pairs in the ultimate analysis. Moreover there may be preferences for different approaches in different fields. And, it may be that there are other approaches altogether that might be more suitable for achieving your specific research goals. So I think you are best served by speaking to an expert in your own field about how to approach this. When you have settled on an approach, you can always come back to Statalist for help implementing it in Stata if needed.
Comment
Juliana Cunha

Join Date: Jan 2024

Posts: 18
#7

12 Mar 2024, 14:42

Thank you so much for the thorough explanation Prof. Clyde!
Comment
Juliana Cunha

Join Date: Jan 2024

Posts: 18
#8

13 Mar 2024, 12:13

Hi Prof. Clyde,
I am trying

Code:

use maindata,clear logit ill $xi, vce(cl school_id) predict p_score psmatch2 ill, pscore(p_score) neighbor(1) caliper(0.05) save matched_dataset,replace

I have not included all the controls and haven't controlled for any individual or school id. I have 6 million obs of people not ill (ill =0 ) and 55k people who got ill and have the date of first symptoms (ill = 1). I tried with caliper 0.01, but got all _n1 = missing. Also tried with caliper 0.05 and with even fewer controls and also got no match. Do you have any advice to solve this, please?
Many thanks in advance.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 29792
#9

13 Mar 2024, 14:06

I don't use -psmatch2- myself, and am not very familiar with its workings, so I can't help you with that. The way I do propensity score matching for a single control, nearest on propensity score looks like this:

Code:

logit ill relevant_variables predict score preserve keep if !ill tempfile controls save `controls' restore keep if ill local caliper 0.05 // OR WHATEVER YOU WANT rangejoin score -.05 .05 using `controls' gen delta = abs(score - score_U) by id (delta), sort: keep if _n == 1 gen `c(obs_t)' pair_num = _n ds pair_num delta *_U, not rename `r(varlist)' =_case rename *_U *_ctrl ds *_case local stubs `r(varlist)' local stubs: subinstr local stubs "_case" "", all reshape long `stubs', i(pair_num) j(case_ctrl) string

[/code]

That should leave you with a data set of matched pairs. The observations belonging to the same pair will have the same value for pair_num. And the ill ones will have case_ctrl == "case", whereas their matched controls have case_ctrl == "ctrl".

-rangejoin- is written by Robert Picard and is available from SSC. To use it, you must also install -rangestat-, by Robert Picard, Nick Cox, and Roberto Ferrer, also available from SSC.

Note: Lacking a suitable data set to try this code out on, it's done by trying to keep track in my head of what's happening. So there may be errors in this code. If you have trouble adapting it to your real data set, be sure to show a suitable -dataex- example when you post back.
Comment
Juliana Cunha

Join Date: Jan 2024

Posts: 18
#10

13 Mar 2024, 15:04

Many thanks prof. Clyde. It ran until

Code:

local caliper 0.05

Code:

rangejoin score -.05 .05 using `controls'

did not run as I get a message about insufficient memory - segmentsize.
I have around 6.3million obs.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 29792
#11

13 Mar 2024, 15:52

Oh, that's a real problem. With that large a data set, you could probably require exact matching on some variables. Doing that would enable you to reduce the memory required. But as a pure caliper match, -rangejoin- is the most economical approach to this problem, and if you have insufficient memory for it, then you have no alternative but to find a machine with more memory to run this.

If, however, you can find a categorical variable with several categories to use for exact matching, then:

Code:

rangejoin score -`caliper' `caliper' using `controls', by(exact_match_variable)

will reduce the demands on memory, and will then give you the nearest match on score, within a caliper of radius `caliper' conditional on exact matching on the exact match variable.
Comment
Juliana Cunha

Join Date: Jan 2024

Posts: 18
#12

14 Mar 2024, 03:07

Many thanks prof Clyde. I will try it.
Comment
Tushita Agnihotri

Join Date: Mar 2024

Posts: 1
#13

14 Mar 2024, 14:16

Has anyone worked on IHDS individual level merging?
Comment

Juliana Cunha

Join Date: Jan 2024
Posts: 18

#14

19 Mar 2024, 16:23

Hi prof. Clyde Schechter , please, could you help me with another merge? I am stuck on it.

I have two datasets as follows:

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input float(id yearcrime monthcrime daycrime typecrime crimedate)
1 2000  8 15 1 14837
1 2000  8 15 2 14837
1 2000 12 25 3 14969
1 2002  3  3 1 15402
1 2003  5 31 4 15856
2 2001  7  8 2 15164
2 2002  9 12 1 15595
3 2000  6 24 4 14785
4 2002  3 17 1 15416
4 2002  4 15 2 15445
5 2003  2  2 3 15738
6 2000  1  9 4 14618
6 2003 10 10 1 15988
6 2004  2  3 3 16104
6 2005  9  9 2 16688
9 2002  3  6 3 15405
end
format %td crimedate

and

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input float(id yearexam yearill monthill dayill score illdate)
 1 2000 2000 11  3 100 14917
 1 2001    .  .  . 150     .
 1 2002    .  .  . 111     .
 2 2000    .  .  . 123     .
 2 2001 2001  5  2 200 15097
 2 2002    .  .  . 214     .
 2 2003    .  .  . 203     .
 2 2004    .  .  . 302     .
 2 2005    .  .  . 136     .
 3 2001 2001  7  6 222 15162
 4 2001    .  .  . 158     .
 4 2002    .  .  . 178     .
 4 2003    .  .  . 228     .
 4 2004    .  .  . 311     .
 5 2000    .  .  . 197     .
 5 2001 2001  2 17 106 15023
 5 2002    .  .  . 147     .
 5 2003    .  .  . 241     .
 5 2004    .  .  . 299     .
 6 2002 2002  3  3 321 15402
 6 2003    .  .  . 139     .
 6 2004    .  .  . 284     .
13 2002 2002  5 23 123 15483
13 2005    .  .  . 214     .
end
format %td illdate

I want to count the number of crimes committed per person per year (and also in total per person) after getting ill. Also, if there are duplicates id crimedate, then I drop. If the person was never ill but committed a crime, I will still count as having committed a crime. If the person did not commit ill or not ill, then crime_count will receive 0 towards the final count. And exclude from the final sample the number of crimes committed by the person before getting ill.
That's what I am trying to do but it is obviously wrong:

Code:

use crimes, clear
br
sort id crimedate
duplicates report id crimedate
duplicates drop id crimedate,force
sort id crimedate
merge m:m id using ill

And that's what I get, which I know is wrong.

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input float(id yearcrime monthcrime daycrime typecrime crimedate yearexam yearill monthill dayill score illdate) byte _merge
 1 2000  8 15 1 14837 2000 2000 11  3 100 14917 3
 1 2000 12 25 3 14969 2001    .  .  . 150     . 3
 1 2002  3  3 1 15402 2002    .  .  . 111     . 3
 1 2003  5 31 4 15856 2002    .  .  . 111     . 3
 2 2001  7  8 2 15164 2000    .  .  . 123     . 3
 2 2002  9 12 1 15595 2001 2001  5  2 200 15097 3
 3 2000  6 24 4 14785 2001 2001  7  6 222 15162 3
 4 2002  3 17 1 15416 2001    .  .  . 158     . 3
 4 2002  4 15 2 15445 2002    .  .  . 178     . 3
 5 2003  2  2 3 15738 2000    .  .  . 197     . 3
 6 2000  1  9 4 14618 2002 2002  3  3 321 15402 3
 6 2003 10 10 1 15988 2003    .  .  . 139     . 3
 6 2004  2  3 3 16104 2004    .  .  . 284     . 3
 6 2005  9  9 2 16688 2004    .  .  . 284     . 3
 9 2002  3  6 3 15405    .    .  .  .   .     . 1
 2 2002  9 12 1 15595 2002    .  .  . 214     . 3
 2 2002  9 12 1 15595 2003    .  .  . 203     . 3
 2 2002  9 12 1 15595 2004    .  .  . 302     . 3
 2 2002  9 12 1 15595 2005    .  .  . 136     . 3
 4 2002  4 15 2 15445 2003    .  .  . 228     . 3
 4 2002  4 15 2 15445 2004    .  .  . 311     . 3
 5 2003  2  2 3 15738 2001 2001  2 17 106 15023 3
 5 2003  2  2 3 15738 2002    .  .  . 147     . 3
 5 2003  2  2 3 15738 2003    .  .  . 241     . 3
 5 2003  2  2 3 15738 2004    .  .  . 299     . 3
13    .  .  . .     . 2002 2002  5 23 123 15483 2
13    .  .  . .     . 2005    .  .  . 214     . 2
end
format %td crimedate
format %td illdate
label values _merge _merge
label def _merge 1 "Master only (1)", modify
label def _merge 2 "Using only (2)", modify
label def _merge 3 "Matched (3)", modify

Comment

Clyde Schechter

Join Date: Apr 2014

Posts: 29792
#15

19 Mar 2024, 17:33

You could know that that code is wrong without even running it: -merge m:m- is always wrong.*

I'm afraid I cannot figure out from your description what result you want to get. Please work out, by hand, what the results would look like for id's 1 and 2 in these examples, and post back showing that.

*I'm being very, very, very slightly hyperbolic here. I've been using Stata almost every day since 1994, and in all that time I have only once encountered a situation where -merge m:m- was an acceptable solution. Even then, there was a better way to do it with other commands. -merge m:m- creates a data salad that is almost never useful for any purpose. For practical purposes, it is always wrong.
Comment

Announcement