Hello,
I am working with a dataset (100 variables and 130,666 observations) that contains information on construction activity, specifically commencement and completion data. Each commencement notice can refer to multiple buildings, whereas a completion certificate is issued for each finished building. Therefore, each new completion certificate adds a new row to the dataset, containing the duplicated commencement data together with the new completion data.
For clarity, I am using Stata 14.1.
I used the following code to generate a duplicate variable which is 0 if the observation is unique, 1 if the observation is the first duplicate, 2 if the observation is the second duplicate, etc.
A subset of the data, including the dup variable, is shown below.
As we can see, there are two rows for "CN0005488KE" under the cn_number variable (commencement id), each with a different code for ccc_number (completion id). What I would like to do is add up the values reported for ccc_units_completed across the duplicated rows, e.g., the number of units completed for "CN0005488KE" would be 3.
I am open to suggestions about the best method to do this, but I was thinking a loop of some description might work best. The idea would be to begin summing the ccc_units_completed variable when dup = 1, continue while the next dup value is greater than the previous one, and stop when it is not. When dup = 0, I would just want ccc_units_completed as reported, without any alteration.
Code:
sort cn_number
quietly by cn_number: gen dup = cond(_N==1, 0, _n)
Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input str11 cn_number str15 ccc_number int ccc_units_completed float dup2
"CN0000154DL" "CC00000013072DL" 1 0
"CN0005488KE" "CC00000006290KE" 2 1
"CN0005488KE" "CC00000006287KE" 1 2
"CN0000156OY" "CC00000004916OY" 2 0
"CN0000157GY" "CC00000000393GY" 1 0
"CN0006136SD" "CC00000004315SD" 1 1
"CN0006136SD" "CC00000004317SD" 1 2
"CN0006136SD" "CC00000004316SD" 1 3
"CN0006251KE" "CC00000003442KE" 1 1
"CN0006251KE" "CC00000003437KE" 1 2
"CN0006251KE" "CC00000003440KE" 1 3
"CN0006251KE" "CC00000003438KE" 1 4
"CN0000162DR" "CC00000026345DR" 1 0
end
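To make the desired result concrete, here is a minimal sketch of the kind of output I am after, using a few rows from the example data. It is only an illustration under assumptions: total_units and running_units are hypothetical variable names that do not exist in my dataset, and a by-group total may or may not be the best method. The sketch does not use an explicit loop or the dup variable; it relies on by-groups of cn_number instead.

```stata
* Sketch only: total_units and running_units are hypothetical names,
* not variables in the original dataset.
clear
input str11 cn_number str15 ccc_number int ccc_units_completed
"CN0000154DL" "CC00000013072DL" 1
"CN0005488KE" "CC00000006290KE" 2
"CN0005488KE" "CC00000006287KE" 1
"CN0006136SD" "CC00000004315SD" 1
"CN0006136SD" "CC00000004317SD" 1
"CN0006136SD" "CC00000004316SD" 1
end

* Group total, repeated on every row of the same commencement notice.
* For a unique cn_number this equals ccc_units_completed as reported.
bysort cn_number: egen total_units = total(ccc_units_completed)

* Running sum within each commencement notice, ordered by ccc_number;
* the last row of each group holds the same value as total_units.
bysort cn_number (ccc_number): gen running_units = sum(ccc_units_completed)

list cn_number ccc_number ccc_units_completed total_units running_units, sepby(cn_number)
```

With these rows, total_units would be 3 for "CN0005488KE" and for "CN0006136SD", and 1 for "CN0000154DL".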