Creating a variable to flag subsequent appearances of a Person ID if specific conditions are present

Richard James

Join Date: Nov 2024

Posts: 10
#1

10 Feb 2025, 13:15

Dear forum members,

I have a large administrative dataset with observations of offences, clustered within arrests (CUSTODYNUMBER), clustered within people (PERSONID). Many PERSONIDs have multiple entries: a PERSONID can have several observations under the same CUSTODYNUMBER if the arrest is for numerous offences on the same date, and also distinct CUSTODYNUMBERs if they are arrested again on a different date. The date of the CUSTODYNUMBER is held in the variable EarliestDisposalDate_. I would like to identify PERSONIDs that reappear under the following conditions and create a new variable (labelled 'REOFFENDS ON BAIL'):

No if:
- The person (PERSONID) is released on bail (EARLIESTDISPOSALCorrected ==1) but does not reappear again later in the dataset with a distinct CUSTODYNUMBER.

Yes if:
- The person (PERSONID) is released on bail (EARLIESTDISPOSALCorrected ==1) and that PERSONID appears again in the dataset with a distinct CUSTODYNUMBER later in time (so a later EarliestDisposalDate_).

I'm struggling to write code that achieves this. I have used the following, but it is not producing accurate results (I'm finding observations in the dataset that satisfy the conditions but aren't being flagged):

sort PERSONID EarliestDisposalDate_

gen first_arrest_date = .
gen first_custody_number = ""

bysort PERSONID (EarliestDisposalDate_): replace first_arrest_date = EarliestDisposalDate_ if _n == 1
bysort PERSONID (EarliestDisposalDate_): replace first_custody_number = CUSTODYNUMBER if _n == 1

gen released_on_bail = 0
bysort PERSONID (EarliestDisposalDate_): replace released_on_bail = EARLIESTDISPOSALCorrected if _n == 1

gen reoffends_on_bail = 0 // Default: No reoffense

bysort PERSONID (EarliestDisposalDate_): replace reoffends_on_bail = 1 if released_on_bail == 1 & CUSTODYNUMBER != first_custody_number & EarliestDisposalDate_ > first_arrest_date

bysort PERSONID (EarliestDisposalDate_): replace reoffends_on_bail = max(reoffends_on_bail)

label var reoffends_on_bail "1 = Reoffends on Bail, 0 = No reoffense after bail"
And here is a dataex example of the data (relevant variables only, PERSONID and CUSTODYNUMBER anonymised):
input str8 PERSONID str10 CUSTODYNUMBER long EARLIESTDISPOSALCorrected double EarliestDisposalDate_
"[12345]" "678910" 2 21650
"12345" "678910" 2 21650
"12345" "678910" 2 21650
"12345" "678910" 2 21650
"12345" "678910" 2 21650
"12345" "10111213" 2 21680
"12345" "10111213" 2 21680
"12345" "10111213" 2 21680
"12345" "10111213" 2 21680
"12345" "10111213" 2 21688
end
format %td EarliestDisposalDate_
label values EARLIESTDISPOSALCorrected FIXEDEARLYDIS
label def FIXEDEARLYDIS 1 "Pre-ChargeBail", modify
label def FIXEDEARLYDIS 2 "ReleaseUnderInvestigation", modify
label def FIXEDEARLYDIS 3 "Charge", modify
label def FIXEDEARLYDIS 5 "NoFurtherAction", modify
label def FIXEDEARLYDIS 6 "NoChargeDecision/Misc", modify
Any help gratefully received!
Clyde Schechter

Join Date: Apr 2014

Posts: 29818
#2

10 Feb 2025, 15:05

If I can assume that for any given PERSONID CUSTODYNUMBER, the value of EARLIESTDISPOSALCorrected will be the same for each offense it contains, then I believe the following does what you ask. The code begins by verifying this assumption in the data. If that command returns an error message, your data is not suitable for this approach, so do not execute the rest of the code, as it will not produce correct answers.

Code:

// VERIFY SUITABILITY OF DATA FOR THIS CODE
by PERSONID CUSTODYNUMBER (EARLIESTDISPOSALCorrected), sort: ///
    assert EARLIESTDISPOSALCorrected[1] == EARLIESTDISPOSALCorrected[_N]

// CALCULATE THE REQUESTED VARIABLE
by PERSONID (EarliestDisposalDate_ CUSTODYNUMBER), sort: ///
    gen byte reoffended_on_bail = (CUSTODYNUMBER[_N] != CUSTODYNUMBER[1]) ///
    & EARLIESTDISPOSALCorrected[1] == 1

This code is untested because the example data provided does not offer any PERSONID with more than one CUSTODYNUMBER, nor does it have any instance where the person was released on bail (EARLIESTDISPOSALCorrected == 1). If this code does not do what you want, when posting back, please provide a different example data set which contains these phenomena, as well as some instances where the code fails, so that troubleshooting and revision will be possible.

Last edited by Clyde Schechter; 10 Feb 2025, 15:09.
Richard James

Join Date: Nov 2024

Posts: 10
#3

12 Feb 2025, 09:22

Hi Clyde,

Many thanks for coming back to me on this. Alas, the EARLIESTDISPOSALCorrected is not the same for every offence - the same PERSONID's CUSTODYNUMBER can have multiple UniqueOffenceIDs each with their own type of EARLIESTDISPOSALCorrected.

I have now done a proper dataex in the hope this will help; apologies for the first time around (numbers anonymised).

* Example generated by -dataex-. For more info, type help dataex
clear
input str8 PERSONID str10 CUSTODYNUMBER str24 UniqueOffenceID str10 EarliestDisposalDate_str long EARLIESTDISPOSALCorrected
"10000228" "166062923" "163408" "24sep2021" 1
"10000228" "166062923" "163390" "24sep2021" 1
"10001344" "3407663148" "289840" "20jun2023" 1
"10001344" "3407663148" "321301" "06nov2023" 2
"10001888" "1526586208" "175601" "29nov2021" 6
"10002466" "1735071043" "4561" "29apr2019" 2
"10003068" "4241789031" "163363" "24sep2021" 2
"10007403" "2311609212" "1605" "11apr2019" 2
"10007403" "2311609212" "1602" "11apr2019" 2
"10007403" "2311609212" "1589" "11apr2019" 2
"10007403" "2311609212" "1604" "11apr2019" 2
"10007403" "2311609212" "1603" "11apr2019" 2
"10007403" "814838362" "6432" "11may2019" 2
"10007403" "814838362" "6474" "11may2019" 2
"10007403" "814838362" "6473" "11may2019" 2
"10007403" "814838362" "6475" "11may2019" 2
"10007403" "1198769735" "7810" "19may2019" 2
"10007403" "1198769735" "7809" "19may2019" 2
"10007403" "2311609212" "23680" "20aug2019" 5
"10007403" "3576721941" "72995" "16may2020" 1
"10007403" "3433557915" "79854" "19jun2020" 1
"10007403" "3253655041" "94468" "31aug2020" 1
"10007403" "3253655041" "94469" "31aug2020" 1
"10007403" "3253655041" "94470" "31aug2020" 1
"10007403" "3253655041" "94467" "31aug2020" 1
"10007403" "3253655041" "94360" "31aug2020" 1
"10007403" "2990124287" "100686" "03oct2020" 1
"10007403" "2990124287" "100640" "03oct2020" 1
"10007403" "2990124287" "100641" "03oct2020" 1
"10007403" "2990124287" "100642" "03oct2020" 1
"10007403" "4504661912" "119107" "15jan2021" 3
"10007403" "4504661912" "119108" "16jan2021" 2
"10007403" "942345646" "125007" "19feb2021" 3
"10007403" "942345646" "124956" "19feb2021" 2
"10007403" "1062731082" "266704" "03mar2023" 6
"10007403" "3253655041" "266747" "04mar2023" 3
"10007403" "3253655041" "266745" "04mar2023" 3
"10007403" "3253655041" "266746" "04mar2023" 3
"10007403" "3253655041" "266748" "04mar2023" 3
"10007894" "203148731" "23396" "19aug2019" 3
end
label values EARLIESTDISPOSALCorrected FIXEDEARLYDIS
label def FIXEDEARLYDIS 1 "Pre-ChargeBail", modify
label def FIXEDEARLYDIS 2 "ReleaseUnderInvestigation", modify
label def FIXEDEARLYDIS 3 "Charge", modify
label def FIXEDEARLYDIS 5 "NoFurtherAction", modify
label def FIXEDEARLYDIS 6 "NoChargeDecision/Misc", modify

I am trying to flag PERSONIDs and create a new variable (labelled 'REOFFENDS ON BAIL') coded as:

0/'No' if:
- The person (PERSONID) is released on bail (EARLIESTDISPOSALCorrected ==1) but does not reappear again later in the dataset with a distinct CUSTODYNUMBER.

1/'Yes' if:
- The person (PERSONID) is released on bail (EARLIESTDISPOSALCorrected ==1) and that PERSONID appears again in the dataset with a distinct CUSTODYNUMBER later in time (so a later EarliestDisposalDate_).

So in the dataex example above, '10007403' should be flagged, but not '10000228'.

Thanks again!
Richard
Clyde Schechter

Join Date: Apr 2014

Posts: 29818
#4

12 Feb 2025, 10:40

OK, your data exhibits some patterns that, based on my naive interpretations of the variables, I didn't think were possible. So I have tried to write this code in a way that is robust to pretty much every possibility. To do that, I've had to fill in some gaps in the definition provided for reoffending after being released on bail.

Specifically, since a person may be released on bail for some charges but not for others in the same custody event, I am assuming that the person is considered "released on bail" from a given custody if he is released on bail for any of the charges in that custody.

Next, I'm assuming that you are not just interested in reoffending after release on bail from the person's first custody event, but rather you want to identify any custody where the person is released on bail and then has a later custody event, even if the one from which he/she was released on bail is, itself, not their first custody.

Finally, there is the issue of what constitutes a "later" custody event. I notice that the variable EarliestDisposalDate_str can vary for different charges in the same custody. That makes it possible, in principle at least, that the intervals from first to last EarliestDisposalDate_str in two different custodies could overlap. So it wouldn't necessarily be clear whether a given custody where the person is released on bail precedes or follows another custody. To resolve this ambiguity, I go by the chronologically first value of EarliestDisposalDate_str in a custody to order the custodies in time.

With those additions to your definition:

Code:

// CONVERT DATE VARIABLE FROM STRING TO NUMERIC STATA DATE VARIABLE
gen earliest_disposal_date = daily(EarliestDisposalDate_str, "DMY")
assert missing(earliest_disposal_date) == missing(EarliestDisposalDate_str)
format earliest_disposal_date %td

// IDENTIFY CUSTODY NUMBERS WHERE PERSON WAS RELEASED ON BAIL FOR AT LEAST
// ONE OFFENSE
label define boolean 0 "No" 1 "Yes"
by PERSONID CUSTODYNUMBER, sort: egen byte released_on_bail = ///
    max(EARLIESTDISPOSALCorrected == 1)
label values released_on_bail boolean

// SEQUENCE THE CUSTODY NUMBERS CHRONOLOGICALLY BY THEIR FIRST DISPOSAL DATE
by PERSONID CUSTODYNUMBER: egen first_disposal_date = ///
    min(earliest_disposal_date)
format first_disposal_date %td
by PERSONID (first_disposal_date CUSTODYNUMBER), sort: ///
    gen int seq = sum(CUSTODYNUMBER != CUSTODYNUMBER[_n-1])
by PERSONID: gen int n_custodies = seq[_N]

// CALCULATE REOFFENDING AFTER BAIL
by PERSONID seq, sort: gen byte reoffends_after_bail:boolean = released_on_bail ///
    & n_custodies > seq
Richard James

Join Date: Nov 2024

Posts: 10
#5

12 Feb 2025, 12:23

Hello Clyde,

Many thanks for investing the time and effort to make sense of this and write the code. Your assumptions are absolutely right and well articulated. I am, however, struggling with the code (especially as a relative amateur at Stata). Here are the responses I'm getting from Stata when I plug in the code:

. gen earliest_disposal_date = daily(EarliestDisposalDate_str, "DMY") [NOTE: the variable is named 'EarliestDisposalDate_' but even when I replace it with that, the code still produces the error message]
EarliestDisposalDate_str not found
r(111);

. assert missing(earliest_disposal_date) == missing(EarliestDisposalDate_str)
earliest_disposal_date not found
r(111);

. format earliest_disposal_date %td
variable earliest_disposal_date not found
r(111);

.
. label define boolean 0 "No" 1 "Yes"

. by PERSONID CUSTODYNUMBER, sort: egen byte released_on_bail =
unknown egen function ()
r(133);

. max(EARLIESTDISPOSALCorrected == 1)
command max is unrecognized
r(199);

. label values released_on_bail boolean
variable released_on_bail not found
r(111);

.
. by PERSONID CUSTODYNUMBER: egen first_disposal_date =
unknown egen function ()
r(133);

. min(earliest_disposal_date)
command min is unrecognized
r(199);

. format first_disposal_date %td
variable first_disposal_date not found
r(111);

. by PERSONID (first_disposal_date CUSTODYNUMBER), sort:
variable first_disposal_date not found
r(111);

. gen int seq = sum(CUSTODYNUMBER != CUSTODYNUMBER[_n-1])

. by PERSONID: gen int n_custodies = seq[_N]
(427,796 missing values generated)

.
. by PERSONID seq, sort: gen byte reoffends_after_bail:boolean = released_on_bail
released_on_bail not found
r(111);

. & n_custodies > seq
& is not a valid command name
r(199)
Any thoughts very gratefully received!
Richard
Clyde Schechter

Join Date: Apr 2014

Posts: 29818
#6

12 Feb 2025, 13:41

I can respond to some of these problems, but not all of them.

Most of the error messages you are getting arise because, it seems, you are typing this code line by line into the Command window. The rest are "downstream" of that: the command refers to a variable that was never created, because the command creating it failed when entered in the Command window. This code cannot be run that way. You must run it from a do-file. Take the code from #4, copy it into the Stata do-file editor, and then run it from there. (If you are not familiar with running code from a do-file, clicking on the rightmost icon in the toolbar will accomplish that.) By the way, when getting help here on Statalist, it is always safest to assume that the code people provide is intended to be run from a do-file, and that the do-file should be run in its entirety without interruption, not line by line or "paragraph" by "paragraph".
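
To illustrate the point about do-files: the `///` at the end of a line continues the command onto the next line, and Stata honours that only when the code runs from a do-file. Typed into the Command window, each physical line is treated as a separate command, which is what produces messages such as "unknown egen function ()". Here is a minimal sketch using the auto data shipped with Stata; the variable name max_price is only an illustration and is not part of the thread's data.

```stata
// Run this from the do-file editor, not the Command window.
// Note: -sysuse, clear- replaces whatever data is in memory.
sysuse auto, clear

// One command spread over two physical lines via the /// continuation
by foreign, sort: egen max_price = ///
    max(price)

list foreign price max_price in 1/5
```

Run as a whole from the do-file editor, the two lines of the egen command are read as a single command.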

As for

gen earliest_disposal_date = daily(EarliestDisposalDate_str, "DMY") [NOTE: the variable is named 'EarliestDisposalDate_' but even when I replace it with that, the code still produces the error message]
EarliestDisposalDate_str not found
r(111);

in the example data you showed, there is a variable named EarliestDisposalDate_str. Apparently, it is not called that in your full data set. It is not a good idea to change things like that when posting example data here. People who respond are going to take your example data at face value and write code to work with it. The code in #4 works with the example data in #3 and produces correct results. Still, if you make minor name changes in the data set, and make corresponding changes in the code, it should still run. I cannot identify a reason why -gen earliest_disposal_date = daily(EarliestDisposalDate_, "DMY")- should produce an error message about a variable not being found if your data set actually contains that variable. It may be that your actual variable name is not exactly EarliestDisposalDate_: remember that Stata variable names are case-sensitive. What I can envision more easily is that in your real data set, there is a variable named EarliestDisposalDate_ but it is not a string variable, whereas in the example data EarliestDisposalDate_str is a string variable. If your date variable is not actually a string variable resembling the one shown in your example, then that line of code may be simply unnecessary, or a different conversion may be needed. What is needed is a Stata internal format date variable that represents the earliest disposal date. How you get that depends on what variable you actually have and what kind of information about the dates it contains. Had your original example data actually contained that, we would not have this problem.
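
One way to find out which case applies is to check the storage type before converting. The sketch below is only an illustration under stated assumptions: it builds a two-row toy data set whose date variable is a string, and the local macro datevar is a hypothetical helper, so substitute your real data and variable name.

```stata
// Toy data standing in for the real file (assumption: string dates like "24sep2021")
clear
input str9 EarliestDisposalDate_
"24sep2021"
"20jun2023"
end

local datevar EarliestDisposalDate_

// Show the storage type and display format
describe `datevar'

// -confirm string variable- returns a nonzero code if the variable is not a string
capture confirm string variable `datevar'
if _rc == 0 {
    // String: convert to a Stata daily date
    gen earliest_disposal_date = daily(`datevar', "DMY")
}
else {
    // Already numeric: no conversion needed, just copy it
    gen earliest_disposal_date = `datevar'
}
format earliest_disposal_date %td
list
```

If -describe- reports a numeric type with a %td format, the variable is already a Stata date and the conversion step can be skipped.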

I think the simplest way forward is for you to post a new data example that is actually directly taken from your real data set using -dataex-. Begin by loading your data set into Stata. Then using appropriate -drop- or -keep- commands select a reasonable subset of the data that includes content that illustrates the various complexities of the data. (In this last respect, the data set you showed in #3 was excellent.) Then just run -dataex-, and copy/paste that -dataex- output into the Forum editor. Do not transform the variables or change their names or data types. I can adapt the code in #4 to work with it--though you will still need to run it from the do-file editor, not from the Command line.
Richard James

Join Date: Nov 2024

Posts: 10
#7

12 Feb 2025, 14:26

Hello Clyde,

Thank you for your patience and careful explanation.

First, thank you for introducing me to the 'do-file' function which, embarrassingly, I did not know. It's a game changer!

Second, thank you for alerting me to the changes. I thought I'd faithfully copied my dataex example across, but obviously I had not. On this point "If your date variable is not actually a string variable resembling the one shown in your example, then that line of code may be simply unnecessary, or a different conversion may be needed." This is entirely right, the variable is already numeric.

Third, when I produce the dataex output below, my EarliestDisposalDate_ appears as the numbers below (which I think is Stata calculating from 1 Jan 1960), but in my dataset in Stata it appears as DayMonthYear (e.g. 22547 as 24sep2021).

If you're not sick of my posting (!), I would be very grateful for an adapted code, which I will then run through my new friend, 'do-file'. Here is the dataex output, copied without alteration:

* Example generated by -dataex-. For more info, type help dataex
clear
input str8 PERSONID str10 CUSTODYNUMBER str24 UniqueOffenceID double EarliestDisposalDate_ long EARLIESTDISPOSALCorrected
"10000228" "166062923" "163408" 22547 1
"10000228" "166062923" "163390" 22547 1
"10001343" "3407663148" "321301" 23320 2
"10001343" "3407663148" "289840" 23181 1
"10001888" "1526586208" "175601" 22613 6
"10002467" "1735071043" "4561" 21668 2
"10003069" "4241789031" "163363" 22547 2
"10007403" "2990124287" "100640" 22191 1
"10007403" "2311609212" "1603" 21650 2
"10007403" "2311609212" "1605" 21650 2
"10007403" "3253655041" "266745" 23073 3
"10007403" "1062731082" "266704" 23072 6
"10007403" "4504661912" "119107" 22295 3
"10007403" "3433557915" "79854" 22085 1
"10007403" "2990124287" "100686" 22191 1
"10007403" "3253655041" "266748" 23073 3
"10007403" "2311609212" "1589" 21650 2
"10007403" "942345646" "124956" 22330 2
"10007403" "814838362" "6432" 21680 2
"10007403" "1198769735" "7809" 21688 2
"10007403" "3253655041" "94360" 22158 1
"10007403" "814838362" "6474" 21680 2
"10007403" "942345646" "125007" 22330 3
"10007403" "3253655041" "94469" 22158 1
"10007403" "3253655041" "94467" 22158 1
"10007403" "4504661912" "119108" 22296 2
"10007403" "2311609212" "23680" 21781 5
"10007403" "3253655041" "94470" 22158 1
"10007403" "2990124287" "100642" 22191 1
"10007403" "1198769735" "7810" 21688 2
end
format %td EarliestDisposalDate_
label values EARLIESTDISPOSALCorrected FIXEDEARLYDIS
label def FIXEDEARLYDIS 1 "Pre-ChargeBail", modify
label def FIXEDEARLYDIS 2 "ReleaseUnderInvestigation", modify
label def FIXEDEARLYDIS 3 "Charge", modify
label def FIXEDEARLYDIS 5 "NoFurtherAction", modify
label def FIXEDEARLYDIS 6 "NoChargeDecision/Misc", modify

Last edited by Richard James; 12 Feb 2025, 14:31.
Clyde Schechter

Join Date: Apr 2014

Posts: 29818
#8

12 Feb 2025, 16:09

Code:

// IDENTIFY CUSTODY NUMBERS WHERE PERSON WAS RELEASED ON BAIL FOR AT LEAST
// ONE OFFENSE
label define boolean 0 "No" 1 "Yes"
by PERSONID CUSTODYNUMBER, sort: egen byte released_on_bail = ///
    max(EARLIESTDISPOSALCorrected == 1)
label values released_on_bail boolean

// SEQUENCE THE CUSTODY NUMBERS CHRONOLOGICALLY BY THEIR FIRST DISPOSAL DATE
by PERSONID CUSTODYNUMBER: egen first_disposal_date = ///
    min(EarliestDisposalDate_)
format first_disposal_date %td
by PERSONID (first_disposal_date CUSTODYNUMBER), sort: ///
    gen int seq = sum(CUSTODYNUMBER != CUSTODYNUMBER[_n-1])
by PERSONID: gen int n_custodies = seq[_N]

// CALCULATE REOFFENDING AFTER BAIL
by PERSONID seq, sort: gen byte reoffends_after_bail:boolean = released_on_bail ///
    & n_custodies > seq

This works with the new -dataex- output and produces correct results there.
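
The variable reoffends_after_bail is defined at the level of the custody event. If a single flag per PERSONID is wanted as well, one possible follow-up is to take the maximum within person. This is a sketch on toy data, not part of the tested code; the names person_reoffends and one_per_person are hypothetical.

```stata
// Toy data: person A has one flagged custody, person B has none
clear
input str8 PERSONID byte reoffends_after_bail
"A" 0
"A" 1
"B" 0
end

// Person-level flag: 1 if any of the person's rows is flagged
by PERSONID, sort: egen byte person_reoffends = max(reoffends_after_bail)

// Mark one row per person so each person is listed once
egen byte one_per_person = tag(PERSONID)
list PERSONID person_reoffends if one_per_person
```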

but in my dataset in Stata it appears as DayMonthYear (e.g. 22547 as 24sep2021).

This is the best way to represent dates in Stata. The appearance of 22547 as 24sep2021 is due to the variable having been given display format %td. (Notice that this is also reflected in the -dataex- output--look at the command immediately following -end-.) You correctly note that Stata represents dates by counting the number of days from 1Jan1960. It also offers many ways to display these dates when they appear in -list-, -display- or -browse- results. See -help datetime formats- to learn more about this.
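
A quick way to see the relationship between the stored number and its display is to use -display- with the value from the example above. This is a small sketch; the third line uses one of the alternative display formats described in -help datetime formats-.

```stata
// The stored value is a count of days since 01jan1960; %td controls how it is shown
display %td 22547            // shows 24sep2021
display td(24sep2021)        // shows 22547
display %tdDD/NN/CCYY 22547  // same date, a different display format
```

The underlying number never changes; only the display format does.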

If you're not sick of my posting (!)

Sometimes I sign off from threads if the discussion veers into territory where I simply don't have the knowledge needed to continue. And I have, once or twice, signed off after explaining the same thing in as many ways as I can think of and still not succeeding in making it clear. But my general approach here is formed by my own experience as a Stata user. When I started using it, in 1994, I was on the steeper part of the learning curve. I learned much of what I know about Stata from others on Statalist, which was then an email listserv out of Harvard School of Public Health. If I start to feel frustrated with a thread, I remind myself that we were all beginners once, and that but for the help of others on this Forum, I might not have progressed beyond that level. So I want to "pay it forward."

So, no, I'm not sick of your posting. If for some reason this code does not work correctly in your full data, do post back and show a new -dataex- example which illustrates whatever problems you are encountering with it.
Richard James

Join Date: Nov 2024

Posts: 10
#9

13 Feb 2025, 15:39

Hello Clyde,

Many thanks for re-writing the code: it worked perfectly! I'm very pleased.

Your general approach is very kind indeed and, who am I to say, but it seems like you've more than paid it forward in your many thousands of responses here. The Statalist forum is an immensely helpful resource that I call on regularly.

Having played around with the dataset today, I'm posting back to call on your generosity once more, to see if I might refine my variable (and your code) a little further. My aim is to add one qualifier: exactly the same criteria as in the original code, BUT now flagging only those cases where the subsequent reoffending takes place before the LatestDisposalDate_ of the CUSTODYNUMBER for which the person was released on bail. In other words, the PERSONID has a CUSTODYNUMBER with EARLIESTDISPOSALCorrected == 1 and reappears with a new CUSTODYNUMBER at a subsequent time that falls within the LatestDisposalDate_ of the CUSTODYNUMBER with EARLIESTDISPOSALCorrected == 1. Does that make sense?

Here is the dataex with LatestDisposalDate_ now included too.

clear
input str8 PERSONID str10 CUSTODYNUMBER str24 UniqueOffenceID double(EarliestDisposalDate_ LatestDisposalDate_) long EARLIESTDISPOSALCorrected
"10000228" "166062923" "163408" 22547 22603 1
"10000228" "166062923" "163390" 22547 22603 1
"10001343" "3407663148" "321301" 23320 23320 2
"10001343" "3407663148" "289840" 23181 23320 1
"10001888" "1526586208" "175601" 22613 22613 6
"10002467" "1735071043" "4561" 21668 21761 2
"10003069" "4241789031" "163363" 22547 22944 2
"10007403" "2990124287" "100642" 22191 22408 1
"10007403" "3253655041" "94360" 22158 23072 1
"10007403" "3253655041" "94468" 22158 23072 1
"10007403" "2990124287" "100641" 22191 23289 1
"10007403" "2311609212" "1589" 21650 21781 2
"10007403" "3433557915" "79854" 22085 22386 1
"10007403" "942345646" "125007" 22330 22330 3
"10007403" "2311609212" "1605" 21650 21781 2
"10007403" "3576721942" "72995" 22051 22461 1
"10007403" "4504661912" "119108" 22296 22540 2
"10007403" "2311609212" "1602" 21650 21784 2
"10007403" "1198769735" "7809" 21688 22047 2
"10007403" "1062731082" "266704" 23072 23073 6
"10007403" "2311609212" "23680" 21781 21781 5
"10007403" "2990124287" "100640" 22191 22408 1
"10007403" "3253655041" "94467" 22158 23072 1
"10007403" "3253655041" "94469" 22158 23072 1
"10007403" "2990124287" "100686" 22191 23289 1
"10007403" "3253655041" "266745" 23073 23073 3
"10007403" "3253655041" "266747" 23073 23073 3
"10007403" "3253655041" "94470" 22158 23072 1
"10007403" "814838362" "6432" 21680 22229 2
"10007403" "4504661912" "119107" 22295 22295 3
end
format %td EarliestDisposalDate_
format %td LatestDisposalDate_
label values EARLIESTDISPOSALCorrected FIXEDEARLYDIS
label def FIXEDEARLYDIS 1 "Pre-ChargeBail", modify
label def FIXEDEARLYDIS 2 "ReleaseUnderInvestigation", modify
label def FIXEDEARLYDIS 3 "Charge", modify
label def FIXEDEARLYDIS 5 "NoFurtherAction", modify
label def FIXEDEARLYDIS 6 "NoChargeDecision/Misc", modify

Thank you!

Last edited by Richard James; 13 Feb 2025, 15:41.

Clyde Schechter

Join Date: Apr 2014
Posts: 29818

#10

13 Feb 2025, 16:22

This is a bit more complicated than the original request. Also, the new condition is not entirely clear. How do we determine whether the subsequent custody event is within the range of dates you specify? First, there is no "the" LatestDisposalDate_ for a custody, because each charge has its own latest disposal date and this can vary among the charges in a custody. Similarly, how do we know what "the" date of the subsequent arrest is, given that the earliest and latest disposal dates vary among the charges in the subsequent custody? Here's what I assume in the following code: the date range for the custody in which bail is granted runs from the chronologically first EarliestDisposalDate_ to the chronologically final LatestDisposalDate_ of the CUSTODYNUMBER. And the date of the candidate reoffending custody is taken to be its chronologically first EarliestDisposalDate_.

With those assumptions, the following works correctly in the example provided in #9.

Code:

//    IDENTIFY CUSTODY NUMBERS WHERE PERSON WAS RELEASED ON BAIL FOR AT LEAST
//    ONE OFFENSE
label define boolean    0    "No"    1    "Yes"
by PERSONID CUSTODYNUMBER, sort: egen byte released_on_bail = ///
    max(EARLIESTDISPOSALCorrected == 1)
label values released_on_bail boolean

//    IDENTIFY THE LAST LATEST DISPOSAL DATE FOR THE CUSTODY
by PERSONID CUSTODYNUMBER: egen last_disposal_date = ///
    max(LatestDisposalDate_)
    
//    SEQUENCE THE CUSTODY NUMBERS CHRONOLOGICALLY BY THEIR FIRST DISPOSAL DATE
by PERSONID CUSTODYNUMBER: egen first_disposal_date = ///
    min(EarliestDisposalDate_)
format first_disposal_date %td
by PERSONID (first_disposal_date CUSTODYNUMBER), sort: ///
    gen int seq = sum(CUSTODYNUMBER != CUSTODYNUMBER[_n-1])
    
preserve
keep PERSONID first_disposal_date seq
duplicates drop
tempfile custodies
save `custodies'

restore
gen `c(obs_t)' obs_no = _n
rangejoin first_disposal_date 1 last_disposal_date using `custodies', ///
    by(PERSONID)
replace first_disposal_date_U = . if seq >= seq_U
by obs_no, sort: egen byte offense_within_date_range = ///
    max(!missing(first_disposal_date_U))
drop *_U
by obs_no, sort: keep if _n == 1
drop obs_no
gen byte reoffended_after_bail = released_on_bail & offense_within_date_range

Note: -rangejoin- is written by Robert Picard and is available from SSC. To use it, you must also install -rangestat-, by Robert Picard, Nick Cox, and Roberto Ferrer, also from SSC.
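
If these packages are not yet installed, they can be obtained from SSC with the standard -ssc install- command (an internet connection is assumed):

```stata
// Install both user-written packages from SSC (one-time step)
ssc install rangestat
ssc install rangejoin
```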


Richard James

Join Date: Nov 2024

Posts: 10
#11

19 Feb 2025, 11:02

Hello Clyde,

Apologies for not making the conditions clearer, but your assumptions were correct. The code worked just as I hoped it would. Thanks so much, you've really helped me out here!

Richard