Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Calculating conditional row average across different variables

    Please consider the example toy data below. The data is contrived but for our purposes each row represents an individual recently released from jail. The month they are released is the variable "start_modate." All these individuals are re-incarcerated k number of months later (variable "length"). The month they are re-incarcerated is the variable "end_modate." I have monthly data about reported crime in the area where they reside after release. The variables that start with the prefix "v_" measure the crime rate. The suffix for all the "v_" variables denotes the month during which the crime rate was measured.

    I wish to calculate a rowwise average for each individual over only the period they were living in the community. In other words, I wish to average the crime rate over only the months each individual is not incarcerated, which are the months between "start_modate" and "end_modate" (inclusive).
    I have tried looping over each row and using the -ereplace- package to calculate a conditional average using -rowmean-, but this was terribly inefficient and slow. The real data has over 16 million rows so looping over each row in this fashion was not suitable.

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str7(start_modate end_modate) byte duration float(v_690 v_691 v_692 v_693 v_694 v_695 v_696 v_697 v_698 v_699 v_700)
    "2019m6"  "2019m11"  5 72.499 82.165 80.554 73.304 53.971 53.971 69.784  74.71 57.469 59.111 64.037
    "2017m10" "2018m4"   6 12.356 12.846 17.923 15.048 13.824 15.109 12.906 16.549  15.87 16.364 13.955
    "2018m6"  "2019m1"   7 26.306 32.305 35.997 31.151 23.537 20.998 19.711  22.56 23.273 23.035 23.985
    "2018m11" "2019m7"   8  8.841  7.073 10.609 17.682 14.145  7.073 13.076 14.944 14.944   9.34 14.944
    "2017m9"  "2018m6"   9 12.356 12.846 17.923 15.048 13.824 15.109 12.906 16.549  15.87 16.364 13.955
    "2019m2"  "2019m12" 10 21.182 21.182 17.331 19.257  15.79  13.48  7.232   5.53  8.083  5.956  5.956
    end
    Thank you for your time; I sincerely appreciate your attention and any suggestions you may have.

  • #2
    Well, you don't say what months 690, 691, etc. are. I'm going to assume that 690 corresponds to 2017m7, 691 to 2017m8, etc. because that is the way Stata would represent those months numerically. But I fear the assumption may be wrong because, for example start_modate and end_modate in the very first row define an interval that is well beyond the last date represented in the v's.

    As with most data management tasks in Stata, this is very difficult to do in wide layout, but is simplicity itself with long data layout. And probably whatever else you're going to do with this data will also be better (or only possibly) done in long layout.

    Code:
    //   CONVERT *_MODATE TO NUMERIC DATE VARIABLES
    foreach v of varlist *_modate {
        gen _`v' = monthly(`v', "YM")
        assert missing(_`v') == missing(`v')
        format _`v' %tm
        drop `v'
        rename _`v' `v'
    }
    
    //  GO LONG
    gen long obs_no = _n
    reshape long v_, i(obs_no) j(mdate)
    format mdate %tm
    
    //  CALCULATE THE DESIRED MEAN CRIME RATE
    by obs_no, sort: egen wanted = mean(cond(inrange(mdate, start_modate, end_modate), v_, .))

    Comment


    • #3
      Thank you Clyde. I apologize for the incorrect dataex. I made an error creating the extract, but your assumption pertaining to the "_modate" variables was correct; I mistakenly did not include all the necessary "v_" variables.
      Your solution to go long is helpful and works well. Much appreciated

      Comment

      Working...
      X