Calculating difference between observations and sum

Ke Gong

Join Date: May 2017

Posts: 30
#1

Calculating difference between observations and sum

14 Sep 2019, 10:46

Dear experts,

I'm wondering whether you can help me. I wanted to calculate the age similarity based on below data:

1. Calculate the age difference between persons first. For example, the age difference between person 1 and person 2 is 9. the age difference between person 1 and person 3 is 1.
2. Sum the absolute age difference for person 1, which is 9+1 = 10. This absolute summed age difference is the age similarity that I want to gather for each person.

person age

1 61

1 61

1 61

2 52

2 52

2 52

3 60

3 60

3 60

So, the ideal output should be like this:
person age age similarity

1 61 10

1 61 10

1 61 10

2 52 17

2 52 17

2 52 17

3 60 13

3 60 13

3 60 13

Could you please show me how I can use Stata syntax to accomplish the two steps? Thank you very much!
Tags: None
Clyde Schechter

Join Date: Apr 2014

Posts: 29962
#2

14 Sep 2019, 11:15

Either I don't understand what you want, or there is an error in your example. For person 3, at age 60, the difference from person 1 (age 61) is 1 and from person 2 (age 52) is 8, so the total should be 9 not 13.

Assuming that your example is in error and you meant the answer to be 9 for person 3, the following code will do it:

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input byte(person age) 1 61 1 61 1 61 2 52 2 52 2 52 3 60 3 60 3 60 end preserve duplicates drop tempfile copy save `copy' rename age age1 rename person person1 cross using `copy' drop if person == person1 by person, sort: egen similarity = total(abs(age-age1)) keep person similarity duplicates drop save `"`copy'"', replace restore merge m:1 person using `copy'

In the future, when showing data examples, please use the -dataex- command to do so. If you are running version 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.
Comment
Ke Gong

Join Date: May 2017

Posts: 30
#3

14 Sep 2019, 12:31

Clyde Schechter

Thank you so much, Clyde. You are right. The age similarity for person 3 should be 9, not 13.

When I ran cross using `copy', Stata says "sum of expand values exceed 2,147,483,620 observations". But I only have 90,000 obs in the dataset. Do you have any idea why this happens?
Comment
Ke Gong

Join Date: May 2017

Posts: 30
#4

14 Sep 2019, 12:55

Clyde Schechter

I think I get what's happening. The cross using function leads to 90,000 x 90,000 observations. That's why the number of observations exceeds limit. Maybe I should use append or merge instead of cross using.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 29962
#5

14 Sep 2019, 13:02

Yes. When you run -cross- you (attempt to) create a new data set that contains all possible pairs of observations between the data set in memory and the using data set. If each contains 90,000 observations then the -cross- will contain 90,000², which is much larger than you can work with in Stata (and probably much larger than you can work with on your computer in any software.)

Here's a workaround. It handles one person at a time, so that -cross- will create a data set with 90,000 observations, not 90,000². I should warn you, however, that with a data set this size, this will be a bit time consuming. So make sure you have something to do while it's running. (The -status- option will cause Stata to give you a progress report every few minutes as this runs, so at least you will know you are making progress. But it is best to keep yourself occupied during long runs like this.)

Code:

preserve duplicates drop save copy, replace rename person person1 rename age age1 capture program drop one_person program define one_person cross using copy egen similarity = total(abs(age-age1)) keep person1 similarity keep in 1 exit end runby one_person, by(person1) status rename person1 person save copy, replace restore merge m:1 person using copy

Notes: -runby- is written by Robert Picard and me, and is available from SSC. Note that in this code, copy is not a tempfile. You can't pass a tempfile name into the program called by -runby-. So just make sure that your working directory doesn't already have a file named copy.dta, or, at least not one whose contents you need to keep. If you do have a copy.dta file already that you need to keep, pick some filename other than copy that doesn't clash with any existing filenames.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 29962
#6

14 Sep 2019, 13:12

#5 crossed with #4, where Ke Gong anticipated my explanation.

But

Maybe I should use append or merge instead of cross using.

will not solve be satisfactory. They will eliminate the error message, but they will not produce a data configuration that can be used to solve the original problem. It is not just a matter of throwing the original and copy data sets together any old way. Crucial to the logic of the code is that each person's observation in one data set be paired with every person's observations in the other. Neither -merge- nor -append- will accomplish that.

So there is no getting around the need for doing what -cross- does, namely forming a cartesian product of data sets. The solution in #5 will work because it does this with one person's data at a time, calculates the total absolute difference, and then discards surplus observations and builds up a new answer data set one person at a time. (The one person at a time parts are what -runby- does.)
Comment
Ke Gong

Join Date: May 2017

Posts: 30
#7

14 Sep 2019, 17:39

Clyde Schechter

Great thanks! It takes too long to calculate if i keep the 90,000 obs. So I collapsed the data by year and person. It should take much less time. I'm showing the exact data including year below. Basically I just want to calculate the age similairty within the same year. For example, person "3013454" is 53 years old in 2007, so I just want to calculate the age differences between him and the others within 2007, and then sum the differences for him as his age similarity in 2007. Should I add something like "by year, sort:"? The exact data I have is shown below:

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input int year long person float age 2007 3013454 53 2007 30347252 54 2007 30347413 62 2007 30347425 63 2007 30347506 54 2008 3013454 54 2008 30347413 63 2008 30347425 64 2008 30347506 55 2008 30347604 66 end
Comment

Clyde Schechter

Join Date: Apr 2014
Posts: 29962

14 Sep 2019, 17:59

It requires somewhat more extensive modifications to the code than that:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input int year long person float age
2007  3013454 53
2007 30347252 54
2007 30347413 62
2007 30347425 63
2007 30347506 54
2008  3013454 54
2008 30347413 63
2008 30347425 64
2008 30347506 55
2008 30347604 66
end

preserve
duplicates drop
save copy, replace

rename person person1
rename age age1
rename year year1

capture program drop one_person
program define one_person
    cross using copy
    keep if year == year1
    egen similarity = total(abs(age-age1))
    keep person1 year1 similarity
    keep in 1
    exit
end

runby one_person, by(person1 year1) status


rename person1 person
rename year1 year
save copy, replace
restore
merge m:1 person year using copy

Changes shown in bold face.

Comment

Ke Gong

Join Date: May 2017

Posts: 30
#9

14 Sep 2019, 19:00

Clyde Schechter

Great! Your code solved my problem perfectly. Thank you so much!! Have a wonderful weekend!!
Comment
Shashi Mishra

Join Date: Oct 2019

Posts: 4
#10

16 Mar 2021, 09:19

Clyde Schechter , Can you please help me form the stata command for following,.. i am not able to formulate it. please

I am using a data set which is in this form
householdid Personid religion gender expenditure
101 1 1 1 100
102 1 1 1 200
102 2 1 1 300
103 1 2 1 300
103 2 2 2 500
104 1 1 2 100
104 2 1 2 50

and so on

Say gender 1- boy 2 is girl
religion 1 is hindu 2- muslims
I need command help to find:

1. How many HH s have 2 sons and how many have 1 daughter and 1 son

2. difference between the two child consumption expenditure if a family has more than one child?
There are families with even 3 or more child but i need to focus more on 2 child part

Please help
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 29962
#11

16 Mar 2021, 10:04

1. How many HH s have 2 sons and how many have 1 daughter and 1 son

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input int householdid byte(personid religion gender) int expenditure 101 1 1 1 100 102 1 1 1 200 102 2 1 1 300 103 1 2 1 300 103 2 2 2 500 104 1 1 2 100 104 2 1 2 50 end label values religion religion label def religion 1 "Hindu", modify label def religion 2 "Muslim", modify label values gender gender label def gender 1 "Male", modify label def gender 2 "Female", modify by householdid, sort: egen n_boys = total(gender == "Male":gender) by householdid: egen n_girls = total(gender == "Female":gender) egen household_flag = tag(householdid) count if n_boys == 2 & household_flag count if n_boys == 1 & n_girls == 1 & household_flag

2. difference between the two child consumption expenditure if a family has more than one child?

I do not understand the question. What does "two child consumption expenditure" mean? Is it the sum of the values of the variable expenditure for the two children? And if there are more than two children, which two should be included in this calculation? Or is it something else altogether? Please clarify.

In the future, when showing data examples, please use the -dataex- command to do so. If you are running version 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.
Comment
Shashi Mishra

Join Date: Oct 2019

Posts: 4
#12

16 Mar 2021, 10:30

Dear Clyde Schechter sir, I am grateful for your prompt reply.

I am looking into analyzing if the household differentiates among their own children with regards to consumption expenditure based on gender and birth order of the child.
[say if a household has two boys vis a vis a house with one boy and one girl,is there any difference in the difference in expenditure in second house hold?]
. further is there a difference in difference across religious groups?

I am only be focusing on families with two children only( so 3 groups (son- son) (son- daughter) (daughter -daughter)
however is there any way to capture birth order (age information is available with me) -

sorry for too many questions at once . also i shall install -dataex- i have stata 14

Thank you
Comment

Clyde Schechter

Join Date: Apr 2014
Posts: 29962

#13

16 Mar 2021, 10:43

Since you will only be focusing on households with exactly two children, it will simplify things greatly to remove other households from the data set. Then you might proceed something like this:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input int householdid byte(personid religion gender) int expenditure
101 1 1 1 100
102 1 1 1 200
102 2 1 1 300
103 1 2 1 300
103 2 2 2 500
104 1 1 2 100
104 2 1 2  50
end
label values religion religion
label def religion 1 "Hindu", modify
label def religion 2 "Muslim", modify
label values gender gender
label def gender 1 "Male", modify
label def gender 2 "Female", modify

//  RESTRICT DATA TO HOUSEHOLDS WITH EXACTLY TWO CHILDREN
by householdid, sort: keep if _N == 2
egen household_flag = tag(householdid)

by householdid, sort: egen n_boys = total(gender == "Male":gender)
by householdid: egen n_girls = total(gender == "Female":gender)

by householdid: egen total_expenditure = total(expenditure)

count if n_boys == 2 & household_flag
count if n_boys == 1 & n_girls == 1 & household_flag

egen hh_type = group(n_boys n_girls), label
tabstat total_expenditure if household_flag, by(hh_type) statistics(N mean sd)

The comparison of household expenditure across religion would be done similarly, with religion instead of hh_type in the -by()- option of the last command.

If you have age information, you can get birth order as follows:

Code:

by hhid (age), sort: gen birth_order = _N-_n + 1

If you are going to do hypothesis tests or want to look at several effects simultaneously, then you will need to learn to use commands like -regress- (or perhaps for household expenditures -poisson- might be more appropriate).

Announcement