how to turn household level data into individual level data

Martin von Brandt

Join Date: Jan 2020

Posts: 21
#1

how to turn household level data into individual level data

09 Feb 2022, 16:42

Hi everyone,

I have a household level data where parents give information on themselves and their children. I would like to turn this data into children level data. However, I have not a lot of clue about how to do it. I appreciate if someone can help me with this.

Below, you can see 3 variables I got from the data. I have mergeid (id of parents), w7_ch_gender1 (gender of children), and w7_ch_number (number of children that parents have). So first, I would like to create a unique id number for each children and then I would like to transpose the data in a way that it will be at children level. parents id will be repeated depending on how many children they have. I am not sure if i am clear but i will demonstrate what I exactly want in the second example:

mergeid w7_ch_gender1 w7_ch_number
AT-014640-01 Male 2
AT-014640-02 Male 2
AT-015615-01 Female 2
AT-015935-01 Female 1
AT-017298-01 Female 2
AT-017821-02 Female 3

Example of what I would like to achieve:
So mergeid is doubled if both parents answered the questions so is the number of children (row-wise). There is a unique id number for each child.

mergeid number of children childid gender
AT-000674-01 2 19701 female
AT-000674-01 2 19731 female
AT-001492-02 6 19702 male
AT-001492-01 6 19702 male
AT-001492-02 6 19732 female
AT-001492-01 6 19732 female
AT-001492-01 6 19742 male
AT-001492-02 6 19742 male
AT-001492-01 6 19801 female
AT-001492-02 6 19801 female
AT-001492-01 6 19822 male
AT-001492-02 6 19822 male
AT-001492-02 6 19831 male
AT-001492-01 6 19831 male

Can someone help me with the coding? Thanks in advance!
Tags: None
William Lisowski

Join Date: Dec 2014

Posts: 10150
#2

09 Feb 2022, 18:27

It would be much more helpful if the examples of what you want were based on the examples of what you have. In particular your example of what you want seems to be based on data for two families - parent AT-000674-01, a single parent with two female children, and parents AT-001492-02 and AT-001492-01, a couple with 4 male and two female children. You do not show us what you started with for the parent data. You tell us you have w7_ch_gender1 - do you also have w7_ch_gender2, w7_ch_gender3, ... one for each child? And how exactly do you want the childid created - they seem to follow a complicated pattern within each family?

Finally, some advice that I see you have not been given in response to your previous questions.

Please take a few moments to review the Statalist FAQ linked to from the top of the page, as well as from the Advice on Posting link on the page you used to create your post. Note especially sections 9-12 on how to best pose your question. It is particularly helpful to copy commands and output from your Stata Results window and paste them into your Statalist post using code delimiters [CODE] and [/CODE], and to use the dataex command to provide sample data, as described in section 12 of the FAQ.

Be sure to use the dataex command to show example data. If you are running version 17, 16 or a fully updated version 15.1 or 14.2, dataex is already part of your official Stata installation. If not, run ssc install dataex to get it. Either way, run help dataex and read the simple instructions for using it. dataex will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

To be concrete, to help you with the code we will, for example, need to know if the w7_ch_gender* variables are a string variables containing "male" or "female" or numeric variables with value labels. This is the sort of information that dataex reveals.

By default dataex will output the first 100 observations of every variable, but the output of help dataex shows that you can use if clauses and variable lists to limit your variables and observations

Code:

dataex mergeid w7_ch_number w7_ch_gender1 w7_ch_gender2 w7_ch_gender3 w7_ch_gender4 w7_ch_gender5 w7_ch_gender6 /// if inlist(mergeid, "AT-000674-01", "AT-001492-02", "AT-001492-01")

(assuming I've guessed correctly about the variables that matter).

The output of dataex will look something like the following.

Code:

----------------------- copy starting from the next line ----------------------- [CODE] * Example generated by -dataex-. To install: ssc install dataex clear input int(x1 x2 x3) float x4 int x5 byte x6 4195 24 1 2 10 0 10371 16 3 3.5 17 0 4647 28 3 2 11 0 ... 5079 24 4 2.5 8 1 8129 21 4 2.5 8 1 4296 21 3 2.5 16 1 end label values x6 yesno label def yesno 0 "No", modify label def yesno 1 "Yes", modify [/CODE] ------------------ copy up to and including the previous line ------------------

In your dataex output you will select the lines between, but not including, "copy starting from the next line" and "copy up to and including the previous line" and then paste that into your reply. The result presented in your post will look something like the following.

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input int(x1 x2 x3) float x4 int x5 byte x6 4195 24 1 2 10 0 10371 16 3 3.5 17 0 4647 28 3 2 11 0 ... 5079 24 4 2.5 8 1 8129 21 4 2.5 8 1 4296 21 3 2.5 16 1 end label values x6 yesno label def yesno 0 "No", modify label def yesno 1 "Yes", modify

Last edited by William Lisowski; 09 Feb 2022, 18:37.
4 likes
Comment
Martin von Brandt

Join Date: Jan 2020

Posts: 21
#3

10 Feb 2022, 03:48

Dear Mr. Lisowski thank you for your answer. I did not give enough information. So as you guessed, I have w7_ch_gender1 variable for each child and it goes up to w7_ch_gender20. This is the same for every single child variable. E.g. I have 20 columns for marital status of children, employment status of children etc. In the dataex look, you will see:

mergeid: id of parents
w6_ch001_: number of children that parents have
w6_ch005_1 - w6_ch005_10: gender of children (1: male, 2: female) and this variable actually goes till w6_ch005_20 but I just got it till the 10th child.

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input str12 mergeid byte(w6_ch001_ w6_ch005_1 w6_ch005_2 w6_ch005_3 w6_ch005_4 w6_ch005_5 w6_ch005_6 w6_ch005_7 w6_ch005_8 w6_ch005_9 w6_ch005_10) "AT-000674-01" 2 1 1 . . . . . . . . "AT-001492-01" 6 2 2 2 1 2 1 . . . . "AT-001881-01" 3 1 1 2 . . . . . . . end

I would like to turn this household level dataset into child level dataset. I would like to have a unique id variable for each child so I can track them throughout the waves. This id number can be a random number, If you think it is possible. In the example above information on children variables are given horizontally. I would like to have it vertically like below.

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input str12 mergeid float(childid w6_ch001_ gndr_child) "AT-000674-01" 19701 2 1 "AT-000674-01" 19731 2 1 "AT-001492-01" 19702 6 2 "AT-001492-01" 19732 6 2 "AT-001492-01" 19742 6 2 "AT-001492-02" 19742 6 2 "AT-001492-01" 19801 6 1 "AT-001492-01" 19822 6 2 "AT-001492-01" 19831 6 1 "AT-001881-01" 19522 3 2 "AT-001881-01" 19561 3 1 "AT-001881-01" 19651 3 1 end

Is this a bit clearer? I appreciate if you can help me with this.
Thanks!
Comment
William Lisowski

Join Date: Dec 2014

Posts: 10150
#4

10 Feb 2022, 07:00

Thank you. You are apparently a few time zones to the east of me, and I have commitments for a few hours, so I will not be able to provide example code until later, although someone else may see this and step in.

But let me say this. The solution will certainly be built around the use of Stata's reshape command, specifically reshape long. If you are unfamiliar with this command, the output of help reshape is a starting point, but I would advise reading the full PDF documentation linked to from the start of the help output. It is a crucial command for data management and is fairly subtle, so seeing detailed examples is an aid to understanding it. This will be good background for you to understand the solution that is proposed, or may even allow you to take the first steps on your own.
Comment
Martin von Brandt

Join Date: Jan 2020

Posts: 21
#5

10 Feb 2022, 09:17

Thanks a lot for your answer. I am in CET time zone. I might be answering you a bit late as a result of that. I appreciate if you can help me with the coding and will be waiting for your answer.
Meanwhile, I'll check reshape command.
Comment
William Lisowski

Join Date: Dec 2014

Posts: 10150
#6

10 Feb 2022, 10:34

I'm working on this now and I see a problem in what you ask.

I would like to have a unique id variable for each child so I can track them throughout the waves.

That will not be possible from what you have given us.

As yourself: what happens in wave 2 if the two adults AT-001492-01 and AT-001492-02 in household AT-001492 in wave 1 separate, and 3 of the children go with one adult and the other three children go with the other adult? One of them will need to have a new household identifier, how will you recognize that these children are the same as three of the children in AT-001492 in the previous wave?

And this depends very much on the identifier assigned to the household, and how households are tracked in the survey across waves. In the longitudinal surveys I have worked with, there is no attempt to track "households" or "families" across waves, because the form and dissolve over time. Individuals are tracked over time with a constant identifier; households are not tracked over time, even when the membership does not change.

See the discussion, including my examples at post #2 and especially that of Stephen Jenkins at post #5, in the topic at

https://www.statalist.org/forums/for...a-household-id

The bottom line is that if you want to reliably track specific people - adults or children - over time, the survey - which knows who is who - needs to provide a cross-wave individual identifier; trying to infer a cross-wave individual identifier from the data is not remotely reliable.

I'll continue to work on the reshape example but wanted to alert you to this issue.

Last edited by William Lisowski; 10 Feb 2022, 10:42.
Comment

William Lisowski

Join Date: Dec 2014
Posts: 10150

10 Feb 2022, 12:04

Here is the reshape example. I have not created a child id as discussed in the previous post, but this shows how to apply reshape to your example data.

Code:

reshape long w6_ch005_, i(mergeid) j(child)
drop if child>w6_ch001_
rename w6_ch005_ gndr_child
list, abbreviate(12) sepby(mergeid)

Code:

. list, abbreviate(12) sepby(mergeid)

     +-----------------------------------------------+
     |      mergeid   child   w6_ch001_   gndr_child |
     |-----------------------------------------------|
  1. | AT-000674-01       1           2            1 |
  2. | AT-000674-01       2           2            1 |
     |-----------------------------------------------|
  3. | AT-001492-01       1           6            2 |
  4. | AT-001492-01       2           6            2 |
  5. | AT-001492-01       3           6            2 |
  6. | AT-001492-01       4           6            1 |
  7. | AT-001492-01       5           6            2 |
  8. | AT-001492-01       6           6            1 |
     |-----------------------------------------------|
  9. | AT-001881-01       1           3            1 |
 10. | AT-001881-01       2           3            1 |
 11. | AT-001881-01       3           3            2 |
     +-----------------------------------------------+

Comment

Martin von Brandt

Join Date: Jan 2020

Posts: 21
#8

11 Feb 2022, 09:30

Thanks a lot for your answer but I still have some questions.

If you look at the example below, AT-000674-01 has two children and as a result of that mergeid doubled. However, if you look at the next mergeid, we have AT-001492-01 and AT-001492-02 (i assume they are husband and wife) and they have 6 children. For the 6 children AT-001492-01 have, the code created 6 rows but for the other parent (AT-001492-02) I see 20 rows. I do not understand the logic behind this? Normally this person did not answer this question so shouldn't it be missing as well? Or 6 rows with missing values?

Another question is that when I copy-paste the code you wrote to another wave, Stata says it could not find 'child' variable. May I ask what the reason for this be?

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input str12 mergeid byte(child gndr_child) "AT-000674-01" 1 1 "AT-000674-01" 2 1 "AT-001492-01" 1 2 "AT-001492-01" 2 2 "AT-001492-01" 3 2 "AT-001492-01" 4 1 "AT-001492-01" 5 2 "AT-001492-01" 6 1 "AT-001492-02" 1 . "AT-001492-02" 2 . "AT-001492-02" 3 . "AT-001492-02" 4 . "AT-001492-02" 5 . "AT-001492-02" 6 . "AT-001492-02" 7 . "AT-001492-02" 8 . "AT-001492-02" 9 . "AT-001492-02" 10 . "AT-001492-02" 11 . "AT-001492-02" 12 . "AT-001492-02" 13 . "AT-001492-02" 14 . "AT-001492-02" 15 . "AT-001492-02" 16 . "AT-001492-02" 17 . "AT-001492-02" 18 . "AT-001492-02" 19 . "AT-001492-02" 20 . end label values gndr_child gender label def gender 1 "Male", modify label def gender 2 "Female", modify
Comment

William Lisowski

Join Date: Dec 2014
Posts: 10150

11 Feb 2022, 10:16

For your second first question, you do not show the input that produced this output, so I am left to guess that for AT-001492-02 the value of w6_ch001_ was missing, unlike in post #1, where it the output showed values for both AT-001492-01 and AT-001492-02. So I have added the missing data to your example data and changed the code accordingly to drop all observations where w6_ch001_ is missing. Stata treats a missing value as larger than any non-missing value, so the comparison with child doesn't work.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str12 mergeid byte(w6_ch001_ w6_ch005_1 w6_ch005_2 w6_ch005_3 w6_ch005_4 w6_ch005_5 w6_ch005_6 w6_ch005_7 w6_ch005_8 w6_ch005_9 w6_ch005_10)
"AT-000674-01" 2 1 1 . . . . . . . .
"AT-001492-01" 6 2 2 2 1 2 1 . . . .
"AT-001492-02" . . . . . . . . . . .
"AT-001881-01" 3 1 1 2 . . . . . . .
end
reshape long w6_ch005_, i(mergeid) j(child)
drop if w6_ch001_==.
drop if child&gt;w6_ch001_
rename w6_ch005_ gndr_child
list, abbreviate(12) sepby(mergeid)

Code:

. reshape long w6_ch005_, i(mergeid) j(child)
(j = 1 2 3 4 5 6 7 8 9 10)

Data                               Wide   ->   Long
-----------------------------------------------------------------------------
Number of observations                4   ->   40          
Number of variables                  12   ->   4           
j variable (10 values)                    ->   child
xij variables:
  w6_ch005_1 w6_ch005_2 ... w6_ch005_10   ->   w6_ch005_
-----------------------------------------------------------------------------

. drop if w6_ch001_==.
(10 observations deleted)

. drop if child>w6_ch001_ 
(19 observations deleted)

. rename w6_ch005_ gndr_child

. list, abbreviate(12) sepby(mergeid)

     +-----------------------------------------------+
     |      mergeid   child   w6_ch001_   gndr_child |
     |-----------------------------------------------|
  1. | AT-000674-01       1           2            1 |
  2. | AT-000674-01       2           2            1 |
     |-----------------------------------------------|
  3. | AT-001492-01       1           6            2 |
  4. | AT-001492-01       2           6            2 |
  5. | AT-001492-01       3           6            2 |
  6. | AT-001492-01       4           6            1 |
  7. | AT-001492-01       5           6            2 |
  8. | AT-001492-01       6           6            1 |
     |-----------------------------------------------|
  9. | AT-001881-01       1           3            1 |
 10. | AT-001881-01       2           3            1 |
 11. | AT-001881-01       3           3            2 |
     +-----------------------------------------------+

.

For your second question, I can't even begin to guess what the problem is given what little evidence you have provided.

You must copy the commands and output from your Stata Results window and paste them into your next reply here using code delimiters [CODE] and [/CODE], in the same way that I have presented the output above. And do be sure to tell us what the names of the gender variables in wave 7 are - from post #1 above, they appear to be different than they are in wave #6.

Comment

Martin von Brandt

Join Date: Jan 2020

Posts: 21
#10

13 Feb 2022, 16:07

Dear Mr. Lisowski, thank you so much for you time and patience. With your help, I managed to do what I wanted to do!
Comment

Announcement