Summing values of duplicate observations within the same variable

Allan Kelly

Join Date: Sep 2022

Posts: 5
#1

Summing values of duplicate observations within the same variable

19 Sep 2022, 10:31

Hello,

I am working with a dataset (100 variables & 130,666 observations) which contains information on construction activity, specifically, commencement and completion data. Each commencement notice can refer to multiple buildings, however, completion certificates are issued for each finished building. Therefore, when each new completion certificate is added the dataset it creates a new row in the dataset which contains the duplicated commencement data and new completion data.

For clarity, I am using Stata 14.1

I used the following code to generate a duplicate variable which is 0 if the observation is unique, 1 if the observation is the first duplicate, 2 if the observation is the second duplicate, etc.

Code:

sort cn_number quietly by cn_number: gen dup = cond(_N==1, 0, _n)

A subset of the data, including the dup variable is below.

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input str11 cn_number str15 ccc_number int ccc_units_completed float dup2 "CN0000154DL" "CC00000013072DL" 1 0 "CN0005488KE" "CC00000006290KE" 2 1 "CN0005488KE" "CC00000006287KE" 1 2 "CN0000156OY" "CC00000004916OY" 2 0 "CN0000157GY" "CC00000000393GY" 1 0 "CN0006136SD" "CC00000004315SD" 1 1 "CN0006136SD" "CC00000004317SD" 1 2 "CN0006136SD" "CC00000004316SD" 1 3 "CN0006251KE" "CC00000003442KE" 1 1 "CN0006251KE" "CC00000003437KE" 1 2 "CN0006251KE" "CC00000003440KE" 1 3 "CN0006251KE" "CC00000003438KE" 1 4 "CN0000162DR" "CC00000026345DR" 1 0 end

As we can see there is four iterations of "CN0005488KE" under the cn_number variable (commencement id) and four different codes for ccc_number (completion id). What I would like to do is to add the values reported for ccc_units_completed for the duplicated values e.g., the number of units completed for "CN0005488KE" would be 3.

I am open to suggestions about the best method to do this but was thinking a loop of some description might work best. If I could begin summing the ccc_units_completed variable when dup = 1 and continue if the next dup value is greater than the previous, stopping when it is not. When dup = 0, I would just want the ccc_units_completed as reported without any alteration.
Tags: duplicate, loop
Clyde Schechter

Join Date: Apr 2014

Posts: 30101
#2

19 Sep 2022, 10:39

Looping is unnecessarily complicated for this purpose.

Code:

by cn_number (dup2), sort: egen total_units_completed = total(ccc_units_completed)

is all you need.

Thank you for using -dataex- in your very first post.
Comment
Jared Greathouse

Join Date: Sep 2021

Posts: 2170
#3

19 Sep 2022, 11:43

I would do

Code:

qbys cn_number (dup2): egen total_units_completed2 = total(ccc_units_completed)

but tomato tomata. My only question here is why do we need to keep duplicates? This is typically undesirable, right?

EDIT: Nevermind, I see (I think) the reason. Also, thanks for using dataex here.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30101
#4

19 Sep 2022, 11:46

Well, the duplicate observations are not complete duplicates: they have different completion certificate numbers. If that information is still needed for other purposes, then the original data must be retained.

If, in fact, there is no further need for that information, then Jared Greathouse is right: we can get rid of them. And in that case, the simplest way to do the whole thing is:

Code:

collapse (sum) total_units_completed = ccc_units_completed, by(cn_number)
Comment
Jared Greathouse

Join Date: Sep 2021

Posts: 2170
#5

19 Sep 2022, 11:52

You're right, they were only duplicates in one sense (the first variable). But yeah either way, collapse what was I was thinking of. In fact, in case you really don't need anything aside from these numbers, I STRONGLY recommend gcollapse. This version of collapse is lightyears faster than Stata's collapse, and since you've over 100k observations, you'll want to work quickly.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30101
#6

19 Sep 2022, 12:04

Actually, my experience with these commands is that the difference between -collapse- and -gcollapse- is not really noticeable in data sets of the order of magnitude of 10⁵ observations. There is some overhead involved in -gcollapse- that is not overcome until we get to substantially larger data sets than that. Moreover, the current version of -collapse- is much faster than some earlier versions and can even leave -gcollapse- in the dust on some very large data sets:

Code:

. clear* r; t=0.02 11:01:46 . set obs 200000000 Number of observations (_N) was 0, now 200,000,000. r; t=0.00 11:01:46 . set seed 1234 r; t=0.00 11:01:46 . gen long group = mod(_n, 1000000) r; t=2.22 11:01:49 . gen summand = runiform() r; t=6.36 11:01:55 . . set rmsg on r; t=0.00 11:01:55 . preserve r; t=3.14 11:01:58 . collapse (sum) total_result = summand, by(group) r; t=16.03 11:02:14 . restore r; t=3.42 11:02:18 . gcollapse (sum) total_result = summand, by(group) r; t=29.40 11:02:47

-gcollapse- does have some nice additional features that -collapse- lacks, accessible with options. But none of those look to me to be relevant to O.P.'s situation.

Last edited by Clyde Schechter; 19 Sep 2022, 12:09.
Comment
Allan Kelly

Join Date: Sep 2022

Posts: 5
#7

22 Sep 2022, 02:47

Thank you Clyde and Jared your commands worked a treat.

Jared Greathouse - yes, the duplicates are not duplicates on every variable. I had already removed observations which are duplicates on every variable. See code below if anyone needs to use it in the future;

Code:

unab vlist: _all sort `vlist' quietly by `vlist': gen dup = cond(_N==1, 0, _n) drop if dup > 1

Thanks also to both of you for the additional information on -collapse- and -gcollapse-. They should come in useful in the future
Comment
Jared Greathouse

Join Date: Sep 2021

Posts: 2170
#8

22 Sep 2022, 05:43

Yep, and the other one that is better is greshape. I said gcollapse, when I was thinking of greshape. Stata's reshape is slow as molasses, to put it very politely, and this one (in my experience) can come in handy sometimes. Another gtool you may come to like
Comment

Rajdeep Chaudhuri

Join Date: Apr 2024
Posts: 36

25 Dec 2024, 01:25

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str14 State byte StateCode str22 DistrictArea int(DistrictCode Year) byte(Murder Infanticide dup)
"Assam"          18 "Barpeta"       280 2014  2 0 0
"Assam"          18 "Baksa"         616 2014  0 0 0
"Assam"          18 "Bongaigaon"    281 2014  0 0 0
"Assam"          18 "Cachar"        282 2014  1 0 0
"Assam"          18 "Chirang"       612 2014  4 0 0
"Assam"          18 "Darrang"       283 2014  0 0 0
"Assam"          18 "Dhemaji"       284 2014  2 0 0
"Assam"          18 "Dhubri"        285 2014  0 0 0
"Assam"          18 "Dibrugarh"     286 2014  0 0 0
"Assam"          18 "Goalpara"      287 2014  0 0 0
"Assam"          18 "Golaghat"      288 2014  0 0 0
"Assam"          18 "Guwahati City" 618 2014  0 0 0
"Assam"          18 "Hailakandi"    289 2014  0 0 0
"Assam"          18 "Jorhat"        290 2014  0 0 0
"Assam"          18 "Kamrup"        291 2014  1 0 0
"Assam"          18 "Karbi Anglong" 292 2014  0 0 0
"Assam"          18 "Karimganj"     293 2014  0 0 0
"Assam"          18 "Kokrajhar"     294 2014  0 0 0
"Assam"          18 "Lakhimpur"     295 2014  0 0 0
"Assam"          18 "Morigaon"      296 2014  0 0 0
"Assam"          18 "Dima Hasao"    299 2014  1 0 0
"Assam"          18 "Nagaon"        297 2014  2 0 0
"Assam"          18 "Nalbari"       298 2014  0 0 0
"Assam"          18 "Sibsagar"      300 2014  5 0 0
"Assam"          18 "Sonitpur"      301 2014  2 0 0
"Assam"          18 "Tinsukia"      302 2014  2 0 0
"Assam"          18 "Udalguri"      617 2014  0 0 0
"Bihar"          10 "Araria"        188 2014  0 0 0
"Bihar"          10 "Arwal"         611 2014  0 0 0
"Bihar"          10 "Aurangabad"    189 2014  0 0 0
"Bihar"          10 "Bagaha"        211 2014  0 0 0
"Bihar"          10 "Banka"         190 2014  0 0 0
"Bihar"          10 "Begusarai"     191 2014  6 0 0
"Bihar"          10 "Bhabhua"       200 2014  0 0 0
"Bihar"          10 "Bhagalpur"     192 2014  0 0 1
"Bihar"          10 "Bhojpur"       193 2014  5 0 0
"Bihar"          10 "Buxar"         194 2014  0 0 0
"Bihar"          10 "Darbhanga"     195 2014  1 0 0
"Bihar"          10 "Gaya"          196 2014  7 0 0
"Bihar"          10 "Gopalganj"     197 2014  0 0 0
"Bihar"          10 "Jamui"         198 2014  0 0 0
"Bihar"          10 "Jehanabad"     199 2014  0 0 0
"Bihar"          10 "Katihar"       201 2014  0 1 0
"Bihar"          10 "Khagaria"      202 2014  0 0 0
"Bihar"          10 "Kishanganj"    203 2014  0 0 0
"Bihar"          10 "Lakhisarai"    204 2014  3 0 0
"Bihar"          10 "Madhepura"     205 2014  0 0 0
"Bihar"          10 "Madhubani"     206 2014  0 0 0
"Bihar"          10 "Motihari"      213 2014  0 0 0
"Bihar"          10 "Munger"        207 2014  0 0 0
"Bihar"          10 "Muzaffarpur"   208 2014  0 0 0
"Bihar"          10 "Nalanda"       209 2014  0 0 0
"Bihar"          10 "Naugachia"     192 2014  1 0 1
"Bihar"          10 "Nawadah"       210 2014  3 0 0
"Bihar"          10 "Patna"         212 2014 12 0 2
"Bihar"          10 "Purnea"        214 2014  6 0 0
"Bihar"          10 "Rohtas"        215 2014 10 0 0
"Bihar"          10 "Saharsa"       216 2014  2 0 0
"Bihar"          10 "Samastipur"    217 2014  0 0 0
"Bihar"          10 "Saran"         218 2014  2 0 0
"Bihar"          10 "Sheikhpura"    219 2014  0 0 0
"Bihar"          10 "Sheohar"       220 2014  0 0 0
"Bihar"          10 "Sitamarhi"     221 2014  0 0 0
"Bihar"          10 "Siwan"         222 2014  2 0 0
"Bihar"          10 "Supaul"        223 2014  0 0 0
"Bihar"          10 "Vaishali"      224 2014  1 0 0
"Chhattisgarh"   22 "Balod"         646 2014  1 0 0
"Chhattisgarh"   22 "Balodbazar"    644 2014  2 0 0
"Chhattisgarh"   22 "Balrampur"     649 2014  5 0 0
"Chhattisgarh"   22 "Bemetra"       650 2014  1 0 0
"Chhattisgarh"   22 "Bilaspur"      375 2014  1 0 0
"Chhattisgarh"   22 "Bizapur"       636 2014  1 0 0
"Chhattisgarh"   22 "Dantewara"     376 2014  1 0 0
"Chhattisgarh"   22 "Dhamtari"      377 2014  1 0 0
"Chhattisgarh"   22 "Durg"          378 2014  0 0 0
"Chhattisgarh"   22 "Gariyaband"    645 2014  0 0 0
"Chhattisgarh"   22 "Jagdalpur"     374 2014  1 1 0
"Chhattisgarh"   22 "Janjgir"       379 2014  5 0 0
"Chhattisgarh"   22 "Jashpur"       380 2014  4 0 0
"Chhattisgarh"   22 "Kabirdham"     382 2014  4 0 0
"Chhattisgarh"   22 "Kanker"        381 2014  3 1 0
"Chhattisgarh"   22 "Kondagaon"     643 2014  1 1 0
"Chhattisgarh"   22 "Korba"         383 2014  3 1 0
"Chhattisgarh"   22 "Koriya"        384 2014  0 0 0
"Chhattisgarh"   22 "Mahasamund"    385 2014  2 0 0
"Chhattisgarh"   22 "Mungali"       647 2014  1 0 0
"Chhattisgarh"   22 "Narayanpur"    637 2014  0 0 0
"Chhattisgarh"   22 "Raigarh"       386 2014  3 0 0
"Chhattisgarh"   22 "Raipur"        387 2014  4 1 1
"Chhattisgarh"   22 "Rajnandgaon"   388 2014  1 0 0
"Chhattisgarh"   22 "Sarguja"       389 2014  1 0 0
"Chhattisgarh"   22 "Sukma"         642 2014  0 0 0
"Chhattisgarh"   22 "Surajpur"      648 2014  4 0 0
"Madhya Pradesh" 23 "Agar"          667 2014  0 0 0
"Madhya Pradesh" 23 "Alirajpur"     639 2014  0 0 0
"Madhya Pradesh" 23 "Anuppur"       390 2014  1 0 0
"Madhya Pradesh" 23 "Ashok Nagar"   391 2014  3 0 0
"Madhya Pradesh" 23 "Balaghat"      392 2014  2 0 0
"Madhya Pradesh" 23 "Barwani"       393 2014  2 0 0
"Madhya Pradesh" 23 "Betul"         394 2014  6 0 0
end

Here I have duplicates on DistrictCode. I want the variables murder and infanticide to be summed up for every duplicate. Eg. In this example dataset, i have a duplicate for DistrictCode = 192. In this case I want murder and infanticide to be summed up. So murder should be 1 and infanticide should be 0.
I apologise for not being able to provide a clearer example. In short, I want the observations for all variables(Murder onwards) to be summed up for every duplicate value of DistrictCode.
Any help would be appreciated.
I tried

Code:

foreach v of varlist Murder-TotalIPCCrimesAgainstChildre{
    by DistrictCode (dup), sort : egen `v'1= total(`v')
    drop `v'
    ren `v'1 `v'
}

This doesn't get rid of the duplicates, and I am not sure whether

Code:

drop if dup!=0

is a correct way to go about it

Note: Apologies if the question is too basic. I am really new to this software

Last edited by Rajdeep Chaudhuri; 25 Dec 2024, 01:29.

Comment

Clyde Schechter

Join Date: Apr 2014

Posts: 30101
#10

25 Dec 2024, 12:16

The easiest way to do this is:

Code:

collapse (sum) Murder Infanticide, by(DistrictCode State StateCode)

Notes:
1. In your example data, there is no TotalIPCCrimesAgainstChildre variable. Assuming your real data set does have more variables than Murder and Infanticide to be summed, and that they are properly represented by Murder-TotalIPCCrimesAgainstChildre, replace Murder Infanticide in the -collapse- command by Murder-TotalIPCCrimesAgainstChildre.
2. About the variables in the -by()- option, the key one is, of course, DistrictCode. The other variables mentioned there are added so that they are carried along in the resulting data set: without them being mentioned there, they would be lost. Of course, you may or may not need them, and if you don't, you can remove all but DistrictCode.
3. Notice that DistrictArea is not included in the -by()- option. That's because, unlike State and StateCode, DistrictArea can take on more than one value for the same DistrictCode*, so carrying along one of them would be misleading, and, more to the point, -collapse- would actually leave you with separate observations for each DistrictArea if we included it in -by()-. It doesn't do that for State and StateCode because those are constant within DistrictCode.

*Actually, in the example data this only happens once: District Code 192 is associated once with District Area Bhagalpur, and elsewhere with Naugachia. Since this happens only once, I wonder if it is a data error?
Comment

Rajdeep Chaudhuri

Join Date: Apr 2024
Posts: 36

#11

25 Dec 2024, 22:39

Thanks so much, Clyde for the valuable feedback. I was wondering if under -by()- , could we include time variables as well? Like in the more comprehensive dataset

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input int(id year) str16 state_name byte state_code str24 district_name int district_code str23 registeration_circles byte(murder culpable_homicide_not_amounting_ causing_death_by_negligence dowry_deaths)
1820 2019 "Himachal Pradesh"  2 "Bilaspur"         15 "Bilaspur"         0 0  0 0
 227 2021 "Himachal Pradesh"  2 "Bilaspur"         15 "Bilaspur"         0 0  0 0
2615 2018 "Himachal Pradesh"  2 "Bilaspur"         15 "Bilaspur"         0 0  0 0
1071 2020 "Himachal Pradesh"  2 "Bilaspur"         15 "Bilaspur"         0 0  0 0
2616 2018 "Himachal Pradesh"  2 "Chamba"           16 "Chamba"           0 0  0 0
1821 2019 "Himachal Pradesh"  2 "Chamba"           16 "Chamba"           0 0  0 0
 228 2021 "Himachal Pradesh"  2 "Chamba"           16 "Chamba"           0 0  0 0
3319 2017 "Himachal Pradesh"  2 "Chamba"           16 "Chamba"           1 0  0 0
1072 2020 "Himachal Pradesh"  2 "Chamba"           16 "Chamba"           0 0  0 0
3320 2017 "Himachal Pradesh"  2 "Hamirpur"         17 "Hamirpur"         0 0  0 0
 229 2021 "Himachal Pradesh"  2 "Hamirpur"         17 "Hamirpur"         0 0  0 0
1822 2019 "Himachal Pradesh"  2 "Hamirpur"         17 "Hamirpur"         0 0  0 0
2617 2018 "Himachal Pradesh"  2 "Hamirpur"         17 "Hamirpur"         0 0  0 0
1073 2020 "Himachal Pradesh"  2 "Hamirpur"         17 "Hamirpur"         0 0  0 0
 230 2021 "Himachal Pradesh"  2 "Kangra"           18 "Kangra"           1 1  0 0
1823 2019 "Himachal Pradesh"  2 "Kangra"           18 "Kangra"           0 1  0 0
2618 2018 "Himachal Pradesh"  2 "Kangra"           18 "Kangra"           0 0  9 0
3321 2017 "Himachal Pradesh"  2 "Kangra"           18 "Kangra"           0 0  0 0
1074 2020 "Himachal Pradesh"  2 "Kangra"           18 "Kangra"           0 0  0 0
3322 2017 "Himachal Pradesh"  2 "Kinnaur"          19 "Kinnaur"          0 0  0 0
1075 2020 "Himachal Pradesh"  2 "Kinnaur"          19 "Kinnaur"          0 0  0 0
2619 2018 "Himachal Pradesh"  2 "Kinnaur"          19 "Kinnaur"          0 0  0 0
1824 2019 "Himachal Pradesh"  2 "Kinnaur"          19 "Kinnaur"          0 0  1 0
 231 2021 "Himachal Pradesh"  2 "Kinnaur"          19 "Kinnaur"          0 0  0 0
1825 2019 "Himachal Pradesh"  2 "Kullu"            20 "Kullu"            0 0  0 0
 232 2021 "Himachal Pradesh"  2 "Kullu"            20 "Kullu"            0 0  0 0
2620 2018 "Himachal Pradesh"  2 "Kullu"            20 "Kullu"            1 0  0 0
1076 2020 "Himachal Pradesh"  2 "Kullu"            20 "Kullu"            0 0  0 0
3323 2017 "Himachal Pradesh"  2 "Kullu"            20 "Kullu"            0 0  0 0
1077 2020 "Himachal Pradesh"  2 "Lahul And Spiti"  21 "Lahul And Spiti"  0 0  0 0
 233 2021 "Himachal Pradesh"  2 "Lahul And Spiti"  21 "Lahaul And Spiti" 0 0  0 0
1826 2019 "Himachal Pradesh"  2 "Lahul And Spiti"  21 "Lahaul And Spiti" 0 0  0 0
2621 2018 "Himachal Pradesh"  2 "Lahul And Spiti"  21 "Lahul And Spiti"  0 0  0 0
3324 2017 "Himachal Pradesh"  2 "Lahul And Spiti"  21 "Lahul And Spiti"  0 0  0 0
2622 2018 "Himachal Pradesh"  2 "Mandi"            22 "Mandi"            0 0  0 0
 234 2021 "Himachal Pradesh"  2 "Mandi"            22 "Mandi"            0 0  5 0
3325 2017 "Himachal Pradesh"  2 "Mandi"            22 "Mandi"            0 0  0 0
1827 2019 "Himachal Pradesh"  2 "Mandi"            22 "Mandi"            1 0 10 0
1078 2020 "Himachal Pradesh"  2 "Mandi"            22 "Mandi"            0 0  0 0
3843 2021 "Himachal Pradesh"  2 "Shimla"           23 "GRP"              0 0  0 0
3910 2020 "Himachal Pradesh"  2 "Shimla"           23 "GRP"              0 0  0 0
3326 2017 "Himachal Pradesh"  2 "Shimla"           23 "Shimla"           0 0  0 0
3908 2020 "Himachal Pradesh"  2 "Shimla"           23 "CID"              0 0  0 0
3952 2019 "Himachal Pradesh"  2 "Shimla"           23 "GRP"              0 0  0 0
1079 2020 "Himachal Pradesh"  2 "Shimla"           23 "Shimla"           0 1  2 0
4049 2017 "Himachal Pradesh"  2 "Shimla"           23 "CID"              0 0  0 0
4051 2017 "Himachal Pradesh"  2 "Shimla"           23 "Cyber"            0 0  0 0
3844 2021 "Himachal Pradesh"  2 "Shimla"           23 "Cyber"            0 0  0 0
 235 2021 "Himachal Pradesh"  2 "Shimla"           23 "Shimla"           0 0  0 0
3909 2020 "Himachal Pradesh"  2 "Shimla"           23 "Cyber"            0 0  0 0
2623 2018 "Himachal Pradesh"  2 "Shimla"           23 "Shimla"           3 0  4 0
3951 2019 "Himachal Pradesh"  2 "Shimla"           23 "CID"              0 0  0 0
4004 2018 "Himachal Pradesh"  2 "Shimla"           23 "Cyber"            0 0  0 0
4002 2018 "Himachal Pradesh"  2 "Shimla"           23 "CID"              0 0  0 0
1828 2019 "Himachal Pradesh"  2 "Shimla"           23 "Shimla"           0 0  0 0
4050 2017 "Himachal Pradesh"  2 "Shimla"           23 "GRP"              0 0  0 0
3842 2021 "Himachal Pradesh"  2 "Shimla"           23 "CID"              0 0  0 0
3953 2019 "Himachal Pradesh"  2 "Shimla"           23 "Cyber"            0 0  0 0
4003 2018 "Himachal Pradesh"  2 "Shimla"           23 "GRP"              0 0  0 0
3327 2017 "Himachal Pradesh"  2 "Sirmaur"          24 "Sirmaur"          0 0  0 0
1829 2019 "Himachal Pradesh"  2 "Sirmaur"          24 "Sirmaur"          0 1  0 0
 236 2021 "Himachal Pradesh"  2 "Sirmaur"          24 "Sirmaur"          0 0  0 0
2624 2018 "Himachal Pradesh"  2 "Sirmaur"          24 "Sirmaur"          0 0  0 0
1080 2020 "Himachal Pradesh"  2 "Sirmaur"          24 "Sirmaur"          1 0  0 0
1081 2020 "Himachal Pradesh"  2 "Solan"            25 "Solan"            0 0  0 0
1819 2019 "Himachal Pradesh"  2 "Solan"            25 "Baddi"            0 0  0 0
3318 2017 "Himachal Pradesh"  2 "Solan"            25 "Solan"            0 0  2 0
 226 2021 "Himachal Pradesh"  2 "Solan"            25 "Baddi"            0 0  0 0
2625 2018 "Himachal Pradesh"  2 "Solan"            25 "Solan"            0 0  1 0
 237 2021 "Himachal Pradesh"  2 "Solan"            25 "Solan"            1 0  0 0
1830 2019 "Himachal Pradesh"  2 "Solan"            25 "Solan"            0 0  0 0
1831 2019 "Himachal Pradesh"  2 "Una"              26 "Una"              0 0  0 0
1082 2020 "Himachal Pradesh"  2 "Una"              26 "Una"              0 0  2 0
2626 2018 "Himachal Pradesh"  2 "Una"              26 "Una"              0 0  1 0
3328 2017 "Himachal Pradesh"  2 "Una"              26 "Una"              0 0  2 0
 238 2021 "Himachal Pradesh"  2 "Una"              26 "Una"              1 0  1 0
2493 2018 "Bihar"            10 "Araria"          188 "Araria"           0 0  0 0
 925 2020 "Bihar"            10 "Araria"          188 "Araria"           0 0  0 0
1682 2019 "Bihar"            10 "Araria"          188 "Araria"           0 0  0 0
  80 2021 "Bihar"            10 "Araria"          188 "Araria"           0 0  0 0
3196 2017 "Bihar"            10 "Araria"          188 "Araria"           0 0  0 0
3198 2017 "Bihar"            10 "Aurangabad"      189 "Aurangabad"       2 1  2 0
  82 2021 "Bihar"            10 "Aurangabad"      189 "Aurangabad"       0 0  2 0
2495 2018 "Bihar"            10 "Aurangabad"      189 "Aurangabad"       0 0  0 0
1684 2019 "Bihar"            10 "Aurangabad"      189 "Aurangabad"       0 0  0 0
 927 2020 "Bihar"            10 "Aurangabad"      189 "Aurangabad"       0 0  0 0
1686 2019 "Bihar"            10 "Banka"           190 "Banka"            0 0  0 0
  84 2021 "Bihar"            10 "Banka"           190 "Banka"            0 0  0 0
2497 2018 "Bihar"            10 "Banka"           190 "Banka"            0 0  0 0
3200 2017 "Bihar"            10 "Banka"           190 "Banka"            0 0  0 0
 929 2020 "Bihar"            10 "Banka"           190 "Banka"            0 0  0 0
  85 2021 "Bihar"            10 "Begusarai"       191 "Begusarai"        0 0  0 0
3201 2017 "Bihar"            10 "Begusarai"       191 "Begusarai"        0 0  0 0
1687 2019 "Bihar"            10 "Begusarai"       191 "Begusarai"        0 0  0 0
 930 2020 "Bihar"            10 "Begusarai"       191 "Begusarai"        0 0  0 0
2498 2018 "Bihar"            10 "Begusarai"       191 "Begusarai"        0 0  0 0
 109 2021 "Bihar"            10 "Bhagalpur"       192 "Naugachia"        0 0  0 0
1711 2019 "Bihar"            10 "Bhagalpur"       192 "Naugachia"        0 0  0 0
2500 2018 "Bihar"            10 "Bhagalpur"       192 "Bhagalpur"        1 0  0 3
1690 2019 "Bihar"            10 "Bhagalpur"       192 "Bhagalpur"        0 0  0 0
end
format %tdnn/dd/CCYY year

Here you can see that in 2019 we have two cases of district_code = 192. Now for every year, I would want one value of 192. If there are more than 1 for any year, then the variables from murder onwards need to be summed across those values of district_code.
Now I don't need the registration circle variable in the final analysis, but I require the district_code state_code and year as well.

Originally posted by Clyde Schechter View Post

*Actually, in the example data this only happens once: District Code 192 is associated once with District Area Bhagalpur, and elsewhere with Naugachia. Since this happens only once, I wonder if it is a data error?

Not really, since here data is collected at a more disaggregated level than district(registration circles), but I want to focus on the district level only since the disaggregation is only for some districts.

Comment

Clyde Schechter

Join Date: Apr 2014

Posts: 30101
#12

26 Dec 2024, 07:52

As it happens, in this example, the District Name is always consistent within District Code, so it can be brought into the final result without disrupting the aggregation process. Registration Circles, a you point out, however, do vary within District Code. So I come up with:

Code:

collapse (sum) murder-dowry_deaths, by(year district_code district_name state_name state_code)

If, in the full, correct data, district_name can be inconsistent within a district code, then you must remove district_name from the -by()- option in order to get complete aggregation to the year-district_code level.
Comment
Rajdeep Chaudhuri

Join Date: Apr 2024

Posts: 36
#13

03 Jan 2025, 22:48

Thanks Clyde Schechter for the valuable inputs. I used the code

Code:

collapse (sum) murder-dowry_deaths, by(year district_code state_name state_code)

and it worked.
Just as a quick question: The collapse command would keep all the variables that I specify in the -by()- option and drop the others right? In other words, if I use the collapse option, then I lose the variables and the observations. If I use

Code:

preserve collapse (sum) murder-dowry_deaths, by(year district_code state_name state_code) foreach var of varlist murder-dowry_deaths{ reg `var' i.trt##i.post i.state_code estimates store `var' } restore

This should bring back the original dataset back right?

trt is a dummy for treatment group and post is a dummy for year after 2018.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30101
#14

04 Jan 2025, 10:09

Yes, the -collapse- command overwrites the original data set, so you lose any variables that were not mentioned in the -collapse- command. Wrapping the -collapse- and subsequent commands between -preserve- and -restore- does permit you to do those calculations and then return to the original data set.
Comment

Announcement