Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Calculating difference between observations and sum

    Dear experts,

    I'm wondering whether you can help me. I wanted to calculate the age similarity based on below data:

    1. Calculate the age difference between persons first. For example, the age difference between person 1 and person 2 is 9. the age difference between person 1 and person 3 is 1.
    2. Sum the absolute age difference for person 1, which is 9+1 = 10. This absolute summed age difference is the age similarity that I want to gather for each person.
    person age
    1 61
    1 61
    1 61
    2 52
    2 52
    2 52
    3 60
    3 60
    3 60
    So, the ideal output should be like this:
    person age age similarity
    1 61 10
    1 61 10
    1 61 10
    2 52 17
    2 52 17
    2 52 17
    3 60 13
    3 60 13
    3 60 13
    Could you please show me how I can use Stata syntax to accomplish the two steps? Thank you very much!

  • #2
    Either I don't understand what you want, or there is an error in your example. For person 3, at age 60, the difference from person 1 (age 61) is 1 and from person 2 (age 52) is 8, so the total should be 9 not 13.

    Assuming that your example is in error and you meant the answer to be 9 for person 3, the following code will do it:

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input byte(person age)
    1 61
    1 61
    1 61
    2 52
    2 52
    2 52
    3 60
    3 60
    3 60
    end
    
    preserve
    duplicates drop
    tempfile copy
    save `copy'
    
    rename age age1
    rename person person1
    cross using `copy'
    drop if person == person1
    by person, sort: egen similarity = total(abs(age-age1))
    keep person similarity
    duplicates drop
    save `"`copy'"', replace
    
    restore
    merge m:1 person using `copy'
    In the future, when showing data examples, please use the -dataex- command to do so. If you are running version 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

    Comment


    • #3
      Clyde Schechter

      Thank you so much, Clyde. You are right. The age similarity for person 3 should be 9, not 13.

      When I ran cross using `copy', Stata says "sum of expand values exceed 2,147,483,620 observations". But I only have 90,000 obs in the dataset. Do you have any idea why this happens?

      Comment


      • #4
        Clyde Schechter

        I think I get what's happening. The cross using function leads to 90,000 x 90,000 observations. That's why the number of observations exceeds limit. Maybe I should use append or merge instead of cross using.

        Comment


        • #5
          Yes. When you run -cross- you (attempt to) create a new data set that contains all possible pairs of observations between the data set in memory and the using data set. If each contains 90,000 observations then the -cross- will contain 90,0002, which is much larger than you can work with in Stata (and probably much larger than you can work with on your computer in any software.)

          Here's a workaround. It handles one person at a time, so that -cross- will create a data set with 90,000 observations, not 90,0002. I should warn you, however, that with a data set this size, this will be a bit time consuming. So make sure you have something to do while it's running. (The -status- option will cause Stata to give you a progress report every few minutes as this runs, so at least you will know you are making progress. But it is best to keep yourself occupied during long runs like this.)

          Code:
          preserve
          duplicates drop
          save copy, replace
          
          rename person person1
          rename age age1
          
          capture program drop one_person
          program define one_person
              cross using copy
              egen similarity = total(abs(age-age1))
              keep person1 similarity
              keep in 1
              exit
          end
          
          runby one_person, by(person1) status
          
          
          rename person1 person
          save copy, replace
          restore
          merge m:1 person using copy
          Notes: -runby- is written by Robert Picard and me, and is available from SSC. Note that in this code, copy is not a tempfile. You can't pass a tempfile name into the program called by -runby-. So just make sure that your working directory doesn't already have a file named copy.dta, or, at least not one whose contents you need to keep. If you do have a copy.dta file already that you need to keep, pick some filename other than copy that doesn't clash with any existing filenames.

          Comment


          • #6
            #5 crossed with #4, where Ke Gong anticipated my explanation.

            But
            Maybe I should use append or merge instead of cross using.
            will not solve be satisfactory. They will eliminate the error message, but they will not produce a data configuration that can be used to solve the original problem. It is not just a matter of throwing the original and copy data sets together any old way. Crucial to the logic of the code is that each person's observation in one data set be paired with every person's observations in the other. Neither -merge- nor -append- will accomplish that.

            So there is no getting around the need for doing what -cross- does, namely forming a cartesian product of data sets. The solution in #5 will work because it does this with one person's data at a time, calculates the total absolute difference, and then discards surplus observations and builds up a new answer data set one person at a time. (The one person at a time parts are what -runby- does.)

            Comment


            • #7
              Clyde Schechter

              Great thanks! It takes too long to calculate if i keep the 90,000 obs. So I collapsed the data by year and person. It should take much less time. I'm showing the exact data including year below. Basically I just want to calculate the age similairty within the same year. For example, person "3013454" is 53 years old in 2007, so I just want to calculate the age differences between him and the others within 2007, and then sum the differences for him as his age similarity in 2007. Should I add something like "by year, sort:"? The exact data I have is shown below:

              Code:
              * Example generated by -dataex-. To install: ssc install dataex
              clear
              input int year long person float age
              2007  3013454 53
              2007 30347252 54
              2007 30347413 62
              2007 30347425 63
              2007 30347506 54
              2008  3013454 54
              2008 30347413 63
              2008 30347425 64
              2008 30347506 55
              2008 30347604 66
              
              end

              Comment


              • #8
                It requires somewhat more extensive modifications to the code than that:

                Code:
                * Example generated by -dataex-. To install: ssc install dataex
                clear
                input int year long person float age
                2007  3013454 53
                2007 30347252 54
                2007 30347413 62
                2007 30347425 63
                2007 30347506 54
                2008  3013454 54
                2008 30347413 63
                2008 30347425 64
                2008 30347506 55
                2008 30347604 66
                end
                
                preserve
                duplicates drop
                save copy, replace
                
                rename person person1
                rename age age1
                rename year year1
                
                capture program drop one_person
                program define one_person
                    cross using copy
                    keep if year == year1
                    egen similarity = total(abs(age-age1))
                    keep person1 year1 similarity
                    keep in 1
                    exit
                end
                
                runby one_person, by(person1 year1) status
                
                
                rename person1 person
                rename year1 year
                save copy, replace
                restore
                merge m:1 person year using copy
                Changes shown in bold face.

                Comment


                • #9
                  Clyde Schechter

                  Great! Your code solved my problem perfectly. Thank you so much!! Have a wonderful weekend!!

                  Comment


                  • #10
                    Clyde Schechter , Can you please help me form the stata command for following,.. i am not able to formulate it. please

                    I am using a data set which is in this form
                    householdid Personid religion gender expenditure
                    101 1 1 1 100
                    102 1 1 1 200
                    102 2 1 1 300
                    103 1 2 1 300
                    103 2 2 2 500
                    104 1 1 2 100
                    104 2 1 2 50

                    and so on

                    Say gender 1- boy 2 is girl
                    religion 1 is hindu 2- muslims
                    I need command help to find:

                    1. How many HH s have 2 sons and how many have 1 daughter and 1 son

                    2. difference between the two child consumption expenditure if a family has more than one child?
                    There are families with even 3 or more child but i need to focus more on 2 child part



                    Please help

                    Comment


                    • #11
                      1. How many HH s have 2 sons and how many have 1 daughter and 1 son
                      Code:
                      * Example generated by -dataex-. To install: ssc install dataex
                      clear
                      input int householdid byte(personid religion gender) int expenditure
                      101 1 1 1 100
                      102 1 1 1 200
                      102 2 1 1 300
                      103 1 2 1 300
                      103 2 2 2 500
                      104 1 1 2 100
                      104 2 1 2  50
                      end
                      label values religion religion
                      label def religion 1 "Hindu", modify
                      label def religion 2 "Muslim", modify
                      label values gender gender
                      label def gender 1 "Male", modify
                      label def gender 2 "Female", modify
                      
                      by householdid, sort: egen n_boys = total(gender == "Male":gender)
                      by householdid: egen n_girls = total(gender == "Female":gender)
                      
                      egen household_flag = tag(householdid)
                      
                      count if n_boys == 2 & household_flag
                      count if n_boys == 1 & n_girls == 1 & household_flag
                      2. difference between the two child consumption expenditure if a family has more than one child?
                      I do not understand the question. What does "two child consumption expenditure" mean? Is it the sum of the values of the variable expenditure for the two children? And if there are more than two children, which two should be included in this calculation? Or is it something else altogether? Please clarify.

                      In the future, when showing data examples, please use the -dataex- command to do so. If you are running version 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

                      Comment


                      • #12
                        Dear Clyde Schechter sir, I am grateful for your prompt reply.

                        I am looking into analyzing if the household differentiates among their own children with regards to consumption expenditure based on gender and birth order of the child.
                        [say if a household has two boys vis a vis a house with one boy and one girl,is there any difference in the difference in expenditure in second house hold?]
                        . further is there a difference in difference across religious groups?


                        I am only be focusing on families with two children only( so 3 groups (son- son) (son- daughter) (daughter -daughter)
                        however is there any way to capture birth order (age information is available with me) -

                        sorry for too many questions at once . also i shall install -dataex- i have stata 14

                        Thank you

                        Comment


                        • #13
                          Since you will only be focusing on households with exactly two children, it will simplify things greatly to remove other households from the data set. Then you might proceed something like this:

                          Code:
                          * Example generated by -dataex-. To install: ssc install dataex
                          clear
                          input int householdid byte(personid religion gender) int expenditure
                          101 1 1 1 100
                          102 1 1 1 200
                          102 2 1 1 300
                          103 1 2 1 300
                          103 2 2 2 500
                          104 1 1 2 100
                          104 2 1 2  50
                          end
                          label values religion religion
                          label def religion 1 "Hindu", modify
                          label def religion 2 "Muslim", modify
                          label values gender gender
                          label def gender 1 "Male", modify
                          label def gender 2 "Female", modify
                          
                          //  RESTRICT DATA TO HOUSEHOLDS WITH EXACTLY TWO CHILDREN
                          by householdid, sort: keep if _N == 2
                          egen household_flag = tag(householdid)
                          
                          by householdid, sort: egen n_boys = total(gender == "Male":gender)
                          by householdid: egen n_girls = total(gender == "Female":gender)
                          
                          by householdid: egen total_expenditure = total(expenditure)
                          
                          count if n_boys == 2 & household_flag
                          count if n_boys == 1 & n_girls == 1 & household_flag
                          
                          egen hh_type = group(n_boys n_girls), label
                          tabstat total_expenditure if household_flag, by(hh_type) statistics(N mean sd)
                          The comparison of household expenditure across religion would be done similarly, with religion instead of hh_type in the -by()- option of the last command.

                          If you have age information, you can get birth order as follows:
                          Code:
                          by hhid (age), sort: gen birth_order = _N-_n + 1
                          If you are going to do hypothesis tests or want to look at several effects simultaneously, then you will need to learn to use commands like -regress- (or perhaps for household expenditures -poisson- might be more appropriate).

                          Comment

                          Working...
                          X