Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Merging two datasets with repeated individuals

    Hello, please, could anyone help me?
    I have two data sets. The first one has data when the person committed a crime. Some individuals committed more than one crime in different dates per year.
    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input float(id yearcrime category monthcrime daycrime)
     1 2000 1  3 23
     1 2001 1  5  2
     4 2000 2  8  5
     4 2001 1 12 14
     4 2002 3  1 17
     5 2003 3  4  1
     8 2003 2  5  6
     9 2002 1  7 22
     9 2003 1  9 15
    28 2000 1  6 13
    end
    The second dataset brings id, year when the person took a test, year, month and day when the person got ill. There are people who never got ill, people who got ill, people who never committed crime, people who committed more than one in one given year and some people who committed crimes in several years.
    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input float(id yeartest sick datestartedillness score)
    1 2000 1 14695 100
    1 2001 0     . 101
    1 2002 0     . 256
    2 2000 0     . 233
    2 2001 0     .  99
    3 2000 0     . 298
    3 2001 1 15078  89
    3 2002 0     . 167
    4 2000 1 14945 243
    5 2001 0     . 156
    5 2002 0     . 128
    6 2000 0     . 257
    7 2000 0     . 234
    7 2001 1 15226 194
    7 2002 0     . 206
    7 2003 0     . 108
    8 2000 0     .  87
    8 2001 1 15021 165
    8 2002 0     . 271
    9 2003 0     . 172
    end
    format %td datestartedillness
    The data that contains test scores and illness have names repeating over the years, but not all years. so this is more like a panel at the school level.
    I need to create the crime variable (y) at the individual level using the date at which they started to be ill , creating a dummy for committing any crime in one 365 days and 730 days after that. I also need to create another variable for count_crime to count the number of crimes each individual committed 365 and 730 days after being ill. Many thanks in advance!

  • #2
    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input float(id yearcrime category monthcrime daycrime)
     1 2000 1  3 23
     1 2001 1  5  2
     4 2000 2  8  5
     4 2001 1 12 14
     4 2002 3  1 17
     5 2003 3  4  1
     8 2003 2  5  6
     9 2002 1  7 22
     9 2003 1  9 15
    28 2000 1  6 13
    end
    tempfile crimes
    save `crimes'
    
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input float(id yeartest sick datestartedillness score)
    1 2000 1 14695 100
    1 2001 0     . 101
    1 2002 0     . 256
    2 2000 0     . 233
    2 2001 0     .  99
    3 2000 0     . 298
    3 2001 1 15078  89
    3 2002 0     . 167
    4 2000 1 14945 243
    5 2001 0     . 156
    5 2002 0     . 128
    6 2000 0     . 257
    7 2000 0     . 234
    7 2001 1 15226 194
    7 2002 0     . 206
    7 2003 0     . 108
    8 2000 0     .  87
    8 2001 1 15021 165
    8 2002 0     . 271
    9 2003 0     . 172
    end
    format %td datestartedillness
    tempfile illnesses
    save `illnesses'
    
    
    use `crimes'
    gen crimedate = mdy(monthcrime, daycrime, yearcrime)
    assert missing(crimedate) == missing(monthcrime, daycrime, yearcrime)
    drop monthcrime daycrime yearcrime
    format crimedate %td
    tempfile crimes_with_dates
    save `crimes_with_dates'
    
    use `illnesses', clear
    gen `c(obs_t)' obs_no = _n
    frame put _all if !missing(datestartedillness), into(working)
    frame working {
        gen lower = datestartedillness + 1
        gen upper = datestartedillness + 730
        replace lower = 1 if missing(datestartedillness)
        replace upper = 0 if missing(datestartedillness)
        format lower upper %td
        rangejoin crimedate lower upper using `crimes_with_dates', by(id)
        by obs_no, sort: egen crimes_365 = total(crimedate <= datestartedillness + 365)
        by obs_no: egen crimes_730 = total(crimedate <= datestartedillness + 730)
        gen byte any_crime_365 = crimes_365 > 0 & !missing(crimes_365)
        gen byte any_crime_730 = crimes_730 > 0 & !missing(crimes_730)
        by obs_no, sort: keep if _n == 1
        drop crimedate
    }
    frlink 1:1 obs_no, frame(working)
    frget crimes_* any_crime_*, from(working)
    Note: -rangejoin- is written by Robert Picard and is available from SSC. To use it, you must also install -rangestat-, by Robert Picard, Nick Cox, and Roberto Ferrer, also available from SSC.

    I have put the original two data sets into tempfiles in this code, but there is no need for you to do that--you can just use the original permanent data sets.

    Comment


    • #3
      Thank you so much prof. Clyde. It worked very well.

      But 'category' variable disappears. And I am now trying to create new variables: crimes_365,
      crimes_730
      any_crime_365
      any_crime_730
      ------------------------------
      , but per category as well. Number of crimes per category per person within the same time thresholds.
      I tried something like this: * Assuming `category` is part of my working frame after the rangejoin I would do:
      Code:
      by obs_no category, sort: egen category_crimes_365 = total(crimedate <= datestartedillness + 365)
      by obs_no category: egen category_crimes_730 = total(crimedate <= datestartedillness + 730)
      And I wanted to create all these new variables per category. But I failed.
      Please, how could I do it? Many thanks!
      Last edited by Juliana Cunha; 09 Mar 2024, 14:04.

      Comment


      • #4
        but per category as well. Number of crimes per category per person within the same time thresholds.
        You didn't say that in your original post. But it's just a small modification.

        Code:
        * Example generated by -dataex-. For more info, type help dataex
        clear
        input float(id yearcrime category monthcrime daycrime)
         1 2000 1  3 23
         1 2001 1  5  2
         4 2000 2  8  5
         4 2001 1 12 14
         4 2002 3  1 17
         5 2003 3  4  1
         8 2003 2  5  6
         9 2002 1  7 22
         9 2003 1  9 15
        28 2000 1  6 13
        end
        tempfile crimes
        save `crimes'
        
        * Example generated by -dataex-. For more info, type help dataex
        clear
        input float(id yeartest sick datestartedillness score)
        1 2000 1 14695 100
        1 2001 0     . 101
        1 2002 0     . 256
        2 2000 0     . 233
        2 2001 0     .  99
        3 2000 0     . 298
        3 2001 1 15078  89
        3 2002 0     . 167
        4 2000 1 14945 243
        5 2001 0     . 156
        5 2002 0     . 128
        6 2000 0     . 257
        7 2000 0     . 234
        7 2001 1 15226 194
        7 2002 0     . 206
        7 2003 0     . 108
        8 2000 0     .  87
        8 2001 1 15021 165
        8 2002 0     . 271
        9 2003 0     . 172
        end
        format %td datestartedillness
        tempfile illnesses
        save `illnesses'
        
        
        use `crimes'
        gen crimedate = mdy(monthcrime, daycrime, yearcrime)
        assert missing(crimedate) == missing(monthcrime, daycrime, yearcrime)
        drop monthcrime daycrime yearcrime
        format crimedate %td
        tempfile crimes_with_dates
        save `crimes_with_dates'
        
        use `illnesses', clear
        gen `c(obs_t)' obs_no = _n
        tempfile original
        save `original'
        
        keep if !missing(datestartedillness)
        gen lower = datestartedillness + 1
        gen upper = datestartedillness + 730
        format lower upper %td
        rangejoin crimedate lower upper using `crimes_with_dates', by(id)
        by obs_no category, sort: egen crimes_365 = total(crimedate <= datestartedillness + 365)
        by obs_no category: egen crimes_730 = total(crimedate <= datestartedillness + 730)
        gen byte any_crime_365 = crimes_365 > 0 & !missing(crimes_365)
        gen byte any_crime_730 = crimes_730 > 0 & !missing(crimes_730)
        by obs_no category, sort: keep if _n == 1
        drop crimedate
        
        merge m:1 obs_no using `original', assert(match using) nogenerate
        
        //    OPTIONALLY, CREATE WIDE DATA LAYOUT FOR CRIME CATEGORIES
        rename crimes* crimes_*_cat_
        rename any_crime* any_crime*_cat_
        mvencode category, mv(0)
        
        reshape wide crimes_*_cat_ any_crime*_cat_, i(obs_no) j(category)
        drop *_cat_0 obs_no lower upper
        order *crime*, last
        isid id yeartest, sort
        This code, up through the -merge- will create what you ask. There will now be multiple observations, one per category of crime committed within the specified timeframes, for each original observation. If you prefer to have the category-specific results "side by side" in the data, then you can run the rest of the code to get that. Which way is better really depends on what you're going to do with the results.

        Note: Assumes that the variable category never takes the value 0. If that's wrong, the code will break at the -mvencode- command. If that happens, replace 0, there and in the -drop- command two lines later, by some other non-negative integer which is not a possible value of category.

        Comment


        • #5
          Thank you very much Prof. Clyde. It ran smoothly. However, when I run a fixed effects model (year and school_id) for
          Code:
          number of crimes = ill + controls+ uit
          where ill takes 1 if the person got sick and 0 otherwise, the results drop the variable ill and keep just the obs of people who got ill. I would like to be able to run this model keeping all my observations including people who got ill or not and who committed a crime or not. Currently, the regression is running with a small sample of people who got ill, rather than comparing every one. Please, would you have a solution for this? Many thanks!

          Comment


          • #6
            Well, this is not possible with your data/study design. For the people who got ill, you have a reference date, datestartedillness, that enables you to define the 365 and 730 day periods during which you ascertain the number of crimes. But for the people who didn't get ill, you have no comparable reference date, so there is no way to establish what are the right 365 and 730 day periods over which to count crimes.

            The usual approach to a situation like this in epidemiology is to create matched pairs: for each person who got ill, identify another person who didn't get ill but has the same* relevant characteristics as the person who did. Then you assign to the non-ill person the same reference date as the datestartedillness of the ill person that they are matched with. From there you do the crime counting, and then when you get to your regression you will need to do a multi-level regression that accounts for the matched pair data as well as the repeated observations within individual people.

            *When I say the "same relevant characteristics" that is shorthand for a variety of approaches that can be used. One can pick out a small number of characteristics that are very strongly related to the illness event and then require exact agreement between the ill and non-ill person on those. You can't do that for a large number of characteristics, because you will find that most people can't be matched exactly to anyone on a large number of attributes. So if many attributes ought to be considered, then various kinds of approximate agreement can be used (coarsened exact matching, or caliper matching). Or one might use propensity score matching, which creates an "index" of the matching variables and matches approximately on that index--which in effect matches on "overall similarity" without requiring really close agreement on any particular attribute. There is also the issue of which characteristics are "relevant."

            This is a complicated subject. The choice of a best matching technique depends very much on substantive considerations about the variables to be matched and the other variables that will be used along with the matched pairs in the ultimate analysis. Moreover there may be preferences for different approaches in different fields. And, it may be that there are other approaches altogether that might be more suitable for achieving your specific research goals. So I think you are best served by speaking to an expert in your own field about how to approach this. When you have settled on an approach, you can always come back to Statalist for help implementing it in Stata if needed.

            Comment


            • #7
              Thank you so much for the thorough explanation Prof. Clyde!

              Comment


              • #8
                Hi Prof. Clyde,
                I am trying
                Code:
                use maindata,clear
                logit ill $xi, vce(cl school_id)
                predict p_score
                psmatch2 ill, pscore(p_score) neighbor(1) caliper(0.05) 
                save matched_dataset,replace
                I have not included all the controls and haven't controlled for any individual or school id. I have 6 million obs of people not ill (ill =0 ) and 55k people who got ill and have the date of first symptoms (ill = 1). I tried with caliper 0.01, but got all _n1 = missing. Also tried with caliper 0.05 and with even fewer controls and also got no match. Do you have any advice to solve this, please?
                Many thanks in advance.

                Comment


                • #9
                  I don't use -psmatch2- myself, and am not very familiar with its workings, so I can't help you with that. The way I do propensity score matching for a single control, nearest on propensity score looks like this:
                  Code:
                  logit ill relevant_variables
                  predict score
                  preserve
                  keep if !ill
                  tempfile controls
                  save `controls'
                  
                  restore
                  keep if ill
                  local caliper 0.05 // OR WHATEVER YOU WANT
                  rangejoin score -.05 .05 using `controls'
                  gen delta = abs(score - score_U)
                  by id (delta), sort: keep if _n == 1
                  
                  gen `c(obs_t)' pair_num = _n
                  ds pair_num delta *_U, not
                  rename `r(varlist)' =_case
                  rename *_U *_ctrl
                  ds *_case
                  local stubs `r(varlist)'
                  local stubs: subinstr local stubs "_case" "", all
                  reshape long `stubs', i(pair_num) j(case_ctrl) string
                  [/code]

                  That should leave you with a data set of matched pairs. The observations belonging to the same pair will have the same value for pair_num. And the ill ones will have case_ctrl == "case", whereas their matched controls have case_ctrl == "ctrl".

                  -rangejoin- is written by Robert Picard and is available from SSC. To use it, you must also install -rangestat-, by Robert Picard, Nick Cox, and Roberto Ferrer, also available from SSC.

                  Note: Lacking a suitable data set to try this code out on, it's done by trying to keep track in my head of what's happening. So there may be errors in this code. If you have trouble adapting it to your real data set, be sure to show a suitable -dataex- example when you post back.

                  Comment


                  • #10
                    Many thanks prof. Clyde. It ran until
                    Code:
                     local caliper 0.05
                    Code:
                     rangejoin score -.05 .05 using `controls'
                    did not run as I get a message about insufficient memory - segmentsize.
                    I have around 6.3million obs.

                    Comment


                    • #11
                      Oh, that's a real problem. With that large a data set, you could probably require exact matching on some variables. Doing that would enable you to reduce the memory required. But as a pure caliper match, -rangejoin- is the most economical approach to this problem, and if you have insufficient memory for it, then you have no alternative but to find a machine with more memory to run this.

                      If, however, you can find a categorical variable with several categories to use for exact matching, then:
                      Code:
                      rangejoin score -`caliper' `caliper' using `controls', by(exact_match_variable)
                      will reduce the demands on memory, and will then give you the nearest match on score, within a caliper of radius `caliper' conditional on exact matching on the exact match variable.

                      Comment


                      • #12
                        Many thanks prof Clyde. I will try it.

                        Comment


                        • #13
                          Has anyone worked on IHDS individual level merging?

                          Comment


                          • #14
                            Hi prof. Clyde Schechter , please, could you help me with another merge? I am stuck on it.

                            I have two datasets as follows:
                            Code:
                            * Example generated by -dataex-. For more info, type help dataex
                            clear
                            input float(id yearcrime monthcrime daycrime typecrime crimedate)
                            1 2000  8 15 1 14837
                            1 2000  8 15 2 14837
                            1 2000 12 25 3 14969
                            1 2002  3  3 1 15402
                            1 2003  5 31 4 15856
                            2 2001  7  8 2 15164
                            2 2002  9 12 1 15595
                            3 2000  6 24 4 14785
                            4 2002  3 17 1 15416
                            4 2002  4 15 2 15445
                            5 2003  2  2 3 15738
                            6 2000  1  9 4 14618
                            6 2003 10 10 1 15988
                            6 2004  2  3 3 16104
                            6 2005  9  9 2 16688
                            9 2002  3  6 3 15405
                            end
                            format %td crimedate
                            and

                            Code:
                            * Example generated by -dataex-. For more info, type help dataex
                            clear
                            input float(id yearexam yearill monthill dayill score illdate)
                             1 2000 2000 11  3 100 14917
                             1 2001    .  .  . 150     .
                             1 2002    .  .  . 111     .
                             2 2000    .  .  . 123     .
                             2 2001 2001  5  2 200 15097
                             2 2002    .  .  . 214     .
                             2 2003    .  .  . 203     .
                             2 2004    .  .  . 302     .
                             2 2005    .  .  . 136     .
                             3 2001 2001  7  6 222 15162
                             4 2001    .  .  . 158     .
                             4 2002    .  .  . 178     .
                             4 2003    .  .  . 228     .
                             4 2004    .  .  . 311     .
                             5 2000    .  .  . 197     .
                             5 2001 2001  2 17 106 15023
                             5 2002    .  .  . 147     .
                             5 2003    .  .  . 241     .
                             5 2004    .  .  . 299     .
                             6 2002 2002  3  3 321 15402
                             6 2003    .  .  . 139     .
                             6 2004    .  .  . 284     .
                            13 2002 2002  5 23 123 15483
                            13 2005    .  .  . 214     .
                            end
                            format %td illdate

                            I want to count the number of crimes committed per person per year (and also in total per person) after getting ill. Also, if there are duplicates id crimedate, then I drop. If the person was never ill but committed a crime, I will still count as having committed a crime. If the person did not commit ill or not ill, then crime_count will receive 0 towards the final count. And exclude from the final sample the number of crimes committed by the person before getting ill.
                            That's what I am trying to do but it is obviously wrong:
                            Code:
                            use crimes, clear
                            br
                            sort id crimedate
                            duplicates report id crimedate
                            duplicates drop id crimedate,force
                            sort id crimedate
                            merge m:m id using ill
                            And that's what I get, which I know is wrong.
                            Code:
                            * Example generated by -dataex-. For more info, type help dataex
                            clear
                            input float(id yearcrime monthcrime daycrime typecrime crimedate yearexam yearill monthill dayill score illdate) byte _merge
                             1 2000  8 15 1 14837 2000 2000 11  3 100 14917 3
                             1 2000 12 25 3 14969 2001    .  .  . 150     . 3
                             1 2002  3  3 1 15402 2002    .  .  . 111     . 3
                             1 2003  5 31 4 15856 2002    .  .  . 111     . 3
                             2 2001  7  8 2 15164 2000    .  .  . 123     . 3
                             2 2002  9 12 1 15595 2001 2001  5  2 200 15097 3
                             3 2000  6 24 4 14785 2001 2001  7  6 222 15162 3
                             4 2002  3 17 1 15416 2001    .  .  . 158     . 3
                             4 2002  4 15 2 15445 2002    .  .  . 178     . 3
                             5 2003  2  2 3 15738 2000    .  .  . 197     . 3
                             6 2000  1  9 4 14618 2002 2002  3  3 321 15402 3
                             6 2003 10 10 1 15988 2003    .  .  . 139     . 3
                             6 2004  2  3 3 16104 2004    .  .  . 284     . 3
                             6 2005  9  9 2 16688 2004    .  .  . 284     . 3
                             9 2002  3  6 3 15405    .    .  .  .   .     . 1
                             2 2002  9 12 1 15595 2002    .  .  . 214     . 3
                             2 2002  9 12 1 15595 2003    .  .  . 203     . 3
                             2 2002  9 12 1 15595 2004    .  .  . 302     . 3
                             2 2002  9 12 1 15595 2005    .  .  . 136     . 3
                             4 2002  4 15 2 15445 2003    .  .  . 228     . 3
                             4 2002  4 15 2 15445 2004    .  .  . 311     . 3
                             5 2003  2  2 3 15738 2001 2001  2 17 106 15023 3
                             5 2003  2  2 3 15738 2002    .  .  . 147     . 3
                             5 2003  2  2 3 15738 2003    .  .  . 241     . 3
                             5 2003  2  2 3 15738 2004    .  .  . 299     . 3
                            13    .  .  . .     . 2002 2002  5 23 123 15483 2
                            13    .  .  . .     . 2005    .  .  . 214     . 2
                            end
                            format %td crimedate
                            format %td illdate
                            label values _merge _merge
                            label def _merge 1 "Master only (1)", modify
                            label def _merge 2 "Using only (2)", modify
                            label def _merge 3 "Matched (3)", modify

                            Comment


                            • #15
                              You could know that that code is wrong without even running it: -merge m:m- is always wrong.*

                              I'm afraid I cannot figure out from your description what result you want to get. Please work out, by hand, what the results would look like for id's 1 and 2 in these examples, and post back showing that.

                              *I'm being very, very, very slightly hyperbolic here. I've been using Stata almost every day since 1994, and in all that time I have only once encountered a situation where -merge m:m- was an acceptable solution. Even then, there was a better way to do it with other commands. -merge m:m- creates a data salad that is almost never useful for any purpose. For practical purposes, it is always wrong.

                              Comment

                              Working...
                              X