Calculating conditional row average across different variables

Konrad Franco

Join Date: Mar 2016

Posts: 14
#1

Calculating conditional row average across different variables

08 Apr 2021, 17:06

Please consider the example toy data below. The data is contrived but for our purposes each row represents an individual recently released from jail. The month they are released is the variable "start_modate." All these individuals are re-incarcerated k number of months later (variable "length"). The month they are re-incarcerated is the variable "end_modate." I have monthly data about reported crime in the area where they reside after release. The variables that start with the prefix "v_" measure the crime rate. The suffix for all the "v_" variables denotes the month during which the crime rate was measured.

I wish to calculate a rowwise average for each individual over only the period they were living in the community. In other words, I wish to average the crime rate over only the months each individual is not incarcerated, which are the months between "start_modate" and "end_modate" (inclusive).
I have tried looping over each row and using the -ereplace- package to calculate a conditional average using -rowmean-, but this was terribly inefficient and slow. The real data has over 16 million rows so looping over each row in this fashion was not suitable.

Code:

* Example generated by -dataex-. For more info, type help dataex clear input str7(start_modate end_modate) byte duration float(v_690 v_691 v_692 v_693 v_694 v_695 v_696 v_697 v_698 v_699 v_700) "2019m6" "2019m11" 5 72.499 82.165 80.554 73.304 53.971 53.971 69.784 74.71 57.469 59.111 64.037 "2017m10" "2018m4" 6 12.356 12.846 17.923 15.048 13.824 15.109 12.906 16.549 15.87 16.364 13.955 "2018m6" "2019m1" 7 26.306 32.305 35.997 31.151 23.537 20.998 19.711 22.56 23.273 23.035 23.985 "2018m11" "2019m7" 8 8.841 7.073 10.609 17.682 14.145 7.073 13.076 14.944 14.944 9.34 14.944 "2017m9" "2018m6" 9 12.356 12.846 17.923 15.048 13.824 15.109 12.906 16.549 15.87 16.364 13.955 "2019m2" "2019m12" 10 21.182 21.182 17.331 19.257 15.79 13.48 7.232 5.53 8.083 5.956 5.956 end

Thank you for your time; I sincerely appreciate your attention and any suggestions you may have.
Tags: None
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#2

08 Apr 2021, 17:58

Well, you don't say what months 690, 691, etc. are. I'm going to assume that 690 corresponds to 2017m7, 691 to 2017m8, etc. because that is the way Stata would represent those months numerically. But I fear the assumption may be wrong because, for example start_modate and end_modate in the very first row define an interval that is well beyond the last date represented in the v's.

As with most data management tasks in Stata, this is very difficult to do in wide layout, but is simplicity itself with long data layout. And probably whatever else you're going to do with this data will also be better (or only possibly) done in long layout.

Code:

// CONVERT *_MODATE TO NUMERIC DATE VARIABLES foreach v of varlist *_modate { gen _`v' = monthly(`v', "YM") assert missing(_`v') == missing(`v') format _`v' %tm drop `v' rename _`v' `v' } // GO LONG gen long obs_no = _n reshape long v_, i(obs_no) j(mdate) format mdate %tm // CALCULATE THE DESIRED MEAN CRIME RATE by obs_no, sort: egen wanted = mean(cond(inrange(mdate, start_modate, end_modate), v_, .))
Comment
Konrad Franco

Join Date: Mar 2016

Posts: 14
#3

08 Apr 2021, 18:31

Thank you Clyde. I apologize for the incorrect dataex. I made an error creating the extract, but your assumption pertaining to the "_modate" variables was correct; I mistakenly did not include all the necessary "v_" variables.
Your solution to go long is helpful and works well. Much appreciated
Comment

Announcement

Calculating conditional row average across different variables

Comment

Comment