  • Calculating household size

    Hello all,

    I would like to calculate certain characteristics (household size, other earners in the household, per capita household income) at the household level in certain sectors in the economy.

    However, I am having two problems.

    Calculating household size

    Key Variables: uqnr (unique household identifier)
    personnr (unique person identifier)

    Now using the advice in this thread (http://www.statalist.org/forums/foru...survey-dataset) I did the following:

    Code:
    sort uqnr personnr
    by uqnr: generate hhsize=_N
    However, this counts the same member multiple times (i.e. household member no. 2 is counted more than once). This is due to the nature of the data, which come from a rotating panel survey (e.g. Wave 1 = 100 new people, Wave 2 = 75 from Wave 1 + 25 new people, Wave 3 = 50 from Wave 2 + 50 new people, etc.).

    Question: How do I go about creating the correct household size - obviously I want the "personnr" value to be unique within the household (uqnr).


    Calculating other earners in the household

    Key Variables: uqnr (unique household identifier)
    personnr (unique person identifier)
    status (1=employed, 2=unemployed)

    Basically, I'd like to investigate how many people are earning wages (wages > 0) in households where a person is earning a wage in a "minimum wage" sector. I've already identified the various sectors (variable "secdetcat", with values ranging from 1 to 8).

    So I guess the first step would be to identify households where there is an earner in a minimum wage sector.

    Second step - identify other earners in the household. How do I go about the second step?

    Regards


  • #2
    Your first question seems very muddled to me.

    Either household size is the number of people in the household (and it's incidental that individuals may belong to one or more households) or you need to define household size in some other precise way so that we can suggest code.

    You can't just say: but that is wrong, as several people are counted repeatedly. What would be right?

    Equally, "obviously I want the "personnr" value to be unique within the household (uqnr)". What does that mean? If you have a situation in which a given personnr is repeated within a household (i.e. there are two or more observations with the same uqnr and personnr), then you have a problem with duplicates. If that is not so, then personnr is unique (meaning, occurs at most once) within a household and there is no problem.

    If what you mean is that no personnr should be repeated across households, i.e. you want uniqueness within the dataset, then as said you need to tell us how to assign each personnr unequivocally to the correct household to which they belong.
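Whether a dataset has repeated (uqnr, personnr) pairs can be checked directly. The following is a minimal sketch on made-up data (the values and the variable name dup are assumptions for illustration), using Stata's duplicates command:

```stata
clear
input uqnr personnr
1 1
1 2
1 2
2 1
end

* report how many (uqnr, personnr) pairs occur more than once
duplicates report uqnr personnr

* dup = 0 for pairs that occur once, 1 for pairs that occur twice, and so on
duplicates tag uqnr personnr, generate(dup)

list, sepby(uqnr)
```

If dup is 0 everywhere, personnr is already unique within households and _N per household is a valid head count.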

    The second question is addressed in various places, e.g.

    http://www.stata.com/support/faqs/da...ies/index.html

    http://www.stata-journal.com/article...article=dm0055

    http://www.stata-journal.com/article...article=dm0075
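As a rough illustration of the "any" and "count of others" idiom for the second question, here is a sketch on made-up data. It assumes one observation per person, a wage variable called wages (the name is an assumption), and, purely hypothetically, that secdetcat values 1 to 4 are the minimum wage sectors:

```stata
clear
input uqnr personnr wages secdetcat
1 1 500 3
1 2 800 7
1 3   0 .
2 1 900 7
2 2   . .
end

* earner: positive, nonmissing wages (missing counts as > 0 in Stata, so exclude it)
gen byte earner = wages > 0 & !missing(wages)

* hypothetical: sectors 1 to 4 stand in for the minimum wage sectors
gen byte mwearner = earner & inrange(secdetcat, 1, 4)

* household flag: 1 if any member is an earner in a minimum wage sector
egen byte hh_mw = max(mwearner), by(uqnr)

* earners in the household, then earners other than the person themselves
egen nearners = total(earner), by(uqnr)
gen nother = nearners - earner

list, sepby(uqnr)
```

Restricting later summaries with if hh_mw == 1 then selects the households of interest.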
    Last edited by Nick Cox; 30 Mar 2015, 07:33.


    • #3
      Hi Nick,

      Thanks for the response. Household size = number of people in the household.

      My problem is the same people are being counted as additional household members. This is clearly wrong.

      The screenshot below demonstrates the problem. From Row 3 to Row 11, the value of the "hhsize" variable is 9. Clearly this is incorrect when looking at the "uqnr" and "personnr" columns...

      [Attached screenshot: Stata.png]

      You are correct when you say the problem is "([...] there are two or more observations with the same uqnr and personnr)." However, those are not duplicate observations - they just interviewed the same member more than once.

      Thanks for the links regarding the second question - very helpful!

      • #4
        Please see the FAQ Advice on not using screenshots (Section 12). I can't read that easily and I certainly can't copy and paste to see what the issue is.

        If you have duplicates on your two identifiers then an appropriate count is

        Code:
          
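        * flag the first observation of each distinct person within a household
        bysort uqnr personnr: gen hhsize2 = _n == 1
        * running count of distinct persons within the household
        by uqnr: replace hhsize2 = sum(hhsize2)
        * copy the household total to every observation
        by uqnr: replace hhsize2 = hhsize2[_N]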
        See also http://www.stata-journal.com/article...article=dm0042
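An equivalent way to count distinct persons uses the egen functions tag() and total(). This is a sketch on made-up data; the variable names pickone and hhsize3 are assumptions chosen for illustration:

```stata
clear
input uqnr personnr
1 1
1 1
1 2
2 1
2 1
2 1
end

* tag() marks exactly one observation per distinct (uqnr, personnr) pair
egen byte pickone = tag(uqnr personnr)

* summing the tags within a household counts its distinct persons
egen hhsize3 = total(pickone), by(uqnr)

list, sepby(uqnr)
```

Here household 1 has two distinct persons and household 2 has one, however often each person was interviewed.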
        Last edited by Nick Cox; 30 Mar 2015, 08:16.


        • #5
          I guess there is a wave ID variable in the data that is not shown. If so, the following code will compute the household size (for a particular household ID at a particular point in time).

          Code:
          clear
          input year id persnr
          2001 1 1
          2001 1 2
          2001 2 1
          2001 3 1
          2001 3 2
          2001 3 3
          2011 1 1
          2011 1 2
          2011 2 1
          2011 2 2
          2011 3 1
          2011 3 2
          2011 3 3
          2011 3 4
          end
          list, sepby(year id)
          bysort year id: gen hhsize = _N
          list, sepby(year id)
          Produces:

          Code:
               +-----------------------------+
               | year   id   persnr   hhsize |
               |-----------------------------|
            1. | 2001    1        1        2 |
            2. | 2001    1        2        2 |
               |-----------------------------|
            3. | 2001    2        1        1 |
               |-----------------------------|
            4. | 2001    3        1        3 |
            5. | 2001    3        2        3 |
            6. | 2001    3        3        3 |
               |-----------------------------|
            7. | 2011    1        1        2 |
            8. | 2011    1        2        2 |
               |-----------------------------|
            9. | 2011    2        1        2 |
           10. | 2011    2        2        2 |
               |-----------------------------|
           11. | 2011    3        1        4 |
           12. | 2011    3        2        4 |
           13. | 2011    3        3        4 |
           14. | 2011    3        4        4 |
               +-----------------------------+
          If there is no wave ID (for example if someone appended all data, conveniently forgetting to add a file identifier), then you will have to recover that first.
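If the original per-wave files are still available, one way to recover a wave identifier is to redo the append with the generate() option, which records the source file of each observation. This is a sketch with two made-up temporary files; the names wave1, wave2 and wave are assumptions:

```stata
* two made-up wave files, saved as temporary datasets
clear
input id persnr
1 1
1 2
end
tempfile wave1
save `wave1'

clear
input id persnr
1 1
1 2
1 3
end
tempfile wave2
save `wave2'

* generate() codes the data in memory as 0 and the first appended file as 1
use `wave1', clear
append using `wave2', generate(wave)
replace wave = wave + 1

* household size within each wave
bysort wave id: gen hhsize = _N
list, sepby(wave id)
```

This only works when the separate files can be appended again; it cannot reconstruct waves from a file that has already lost them.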

          Best, Sergiy Radyakin


          • #6
            Hi Sergiy

            Thanks. Yes, a wave variable occurred to me; however, there isn't one. Nick's code worked perfectly, though.

            However, I am still a bit stuck on my second question (calculating the number of other employed people in a household).

            Following on from advice from here (http://www.stata.com/support/faqs/da...ies/index.html)

            Code:
            egen nemployed=total(status==1), by(uqnr)
            replace nemployed = nemployed - (status==1)
            However, I come unstuck because, as we established previously, the same individual within the same household can be interviewed more than once (same uqnr and personnr), so Stata double-counts them. They are not duplicates, though: they were simply interviewed at different times, and there is no wave variable.
            Last edited by Chris Rooney; 31 Mar 2015, 03:14.


            • #7
              So the code needs to be made sharper. You need to specify somehow which instance of the repeated occurrences is to be selected. However, if status varies among those instances, then how are you going to choose? Also, suppose two individuals both occur twice: A appears variously with status 0 and 1 and so does B. Depending on which two observations you select you can get 0, 1 or 2 instances of status 1. The problem is combinatorial.
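To make the choice concrete, here is a sketch of one possible rule, which is an assumption and not necessarily the right one for this survey: count a person as employed if they were employed at any interview, and count each person once. The data and the names everemp, pickone and nother are made up:

```stata
clear
input uqnr personnr status
1 1 1
1 1 2
1 2 2
1 2 2
1 3 1
end

* assumed rule: a person counts as employed if employed at any interview
egen byte everemp = max(status == 1), by(uqnr personnr)

* one tagged observation per person, so that each person is counted once
egen byte pickone = tag(uqnr personnr)
egen nemployed = total(pickone * everemp), by(uqnr)

* other employed members: exclude the person themselves
gen nother = nemployed - everemp

list, sepby(uqnr personnr)
```

A different rule (for example, status at the first or last interview) would need some variable that orders the interviews.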
              Last edited by Nick Cox; 31 Mar 2015, 03:29.


              • #8
                Good point. I will investigate the dataset more thoroughly to determine whether there is a variable which indicates the different waves. There is definitely not a variable called wave, but it could be named something else.


                • #9
                  Suppose that there is some variable wavebyanyothername that does what you want. Then you just use it to select.

                  Code:
                  forval nw = 1/5 {
                     egen nemployed`nw' = total(status==1 & wavebyanyothername == `nw'), by(uqnr)
                     replace nemployed`nw' = nemployed`nw' - (status==1 & wavebyanyothername == `nw')
                  }
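If one count per observation is enough (the number of other employed members in that observation's own wave), a variant without the loop is to add the wave variable to by(). This is a sketch on made-up data, assuming a wave variable exists and is called wavebyanyothername:

```stata
clear
input uqnr personnr wavebyanyothername status
1 1 1 1
1 2 1 1
1 1 2 1
1 2 2 2
end

* employed members within each household-wave combination
egen nemployed = total(status == 1), by(uqnr wavebyanyothername)

* other employed members: exclude the person themselves
gen nother = nemployed - (status == 1)

list, sepby(uqnr wavebyanyothername)
```

Unlike the loop, this yields a single variable rather than one variable per wave.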
                  Last edited by Nick Cox; 31 Mar 2015, 04:42.
