Combining rows which have the same variables.

JR Rea

Join Date: Feb 2022

Posts: 18
#1

Combining rows which have the same variables.

14 Feb 2022, 13:24

After combining numerous datasets I need to initially combine the 2 rows where PNUM00=1 so that they become 1 row. Thereafter, I need to change to a 'wide' format where each family identifier (MCSID) has only one row with contains all the relevant information. Would appreciate any advice.
Tags: None
Clyde Schechter

Join Date: Apr 2014

Posts: 29795
#2

14 Feb 2022, 16:10

In what sense do you want to "combine" those rows? In the screenshot of data you show all of the variables that are visible have missing values in the second such row, except for PNUM00 and CNUM00. Is that always the case where you have pairs of rows with PNUM00 == 0? If not, when there is non-missing data in both rows, what do you want to happen if they have disagreeing values?

Please post back with clarifications, and, with usable example data. That means using the -dataex- command. If you are running version 17, 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

Going forward, please always use -dataex- to show example data.
Comment
JR Rea

Join Date: Feb 2022

Posts: 18
#3

14 Feb 2022, 17:04

Thank you for replying. I have attempted to post a selected number of variables from the combined dataset. Within each family (MCSID) there is a PNUM00 relating to a member of the family who is not the cohort child (CNUM00). Each cohort member has their own line, followed by a line/s for the corresponding PNUM00. I have selected all those with a PNUM00==1 who represent the mother of the child. I am trying to achieve one line for each PNUM00==1 within the family, and then finally create 1 line for each MCSID which includes both the MCSID PNUM00 and CNUM00. There will be missing data as the combined dataset crosses 4 sweeps, reducing in size each time.

dataex MCSID PNUM00 AELIG00 ARESP00 CNUM00 AHCSEX00 AHPSEX00 AHCREL00 GPNTCO00 GPNTLP00 APACQU00 APVCQU00 DCWRSD00 DCMATHS7SA

----------------------- copy starting from the next line -----------------------
[CODE]
* Example generated by -dataex-. For more info, type help dataex
clear
input str7 MCSID int PNUM00 byte(AELIG00 ARESP00 CNUM00 AHCSEX00 AHPSEX00 AHCREL00 GPNTCO00 GPNTLP00 APACQU00 APVCQU00) int(DCWRSD00 DCMATHS7SA)
"M10016V" 100 -1 -1 1 1 -1 96 . . . . 136 130
"M10016V" 1 1 1 . -1 2 7 9 -1 6 96 . .
"M10016V" 1 . . -1 . . . 9 -1 6 96 . .
"M10024V" 100 -1 -1 1 1 -1 96 . . . . 109 108
"M10024V" 1 1 1 . -1 2 7 . . 2 96 . .
"M10032V" 100 -1 -1 1 1 -1 96 . . . . . .
"M10032V" 1 1 1 . -1 2 7 . . 96 96 . .
"M10044Z" 100 -1 -1 1 2 -1 96 . . . . -1 -1
"M10044Z" 1 . . -1 . . . -1 -9 96 96 . .
"M10044Z" 1 1 1 . -1 2 7 -1 -9 96 96 . .
"M10052Z" 1 1 1 . -1 2 7 . . 5 5 . .
"M10052Z" 100 -1 -1 1 1 -1 96 . . . . 90 93
"M10058F" 1 1 1 . -1 2 7 . . 1 1 . .
"M10058F" 100 -1 -1 1 1 -1 96 . . . . . .
"M10062B" 1 1 1 . -1 2 7 . . 3 95 . .
"M10062B" 1 . . -1 . . . . . 3 95 . .
"M10062B" 100 -1 -1 1 2 -1 96 . . . . 116 98
"M10068H" 1 . . -1 . . . 3 -1 5 6 . .
"M10068H" 100 -1 -1 1 1 -1 96 . . . . 140 121
"M10068H" 1 1 1 . -1 2 7 3 -1 5 6 . .
"M10083G" 1 1 1 . -1 2 7 3 -1 5 96 . .
"M10083G" 1 . . -1 . . . 3 -1 5 96 . .
"M10083G" 100 -1 -1 1 2 -1 96 . . . . 119 112
"M10113V" 1 1 1 . -1 2 7 . . 5 96 . .
"M10113V" 100 -1 -1 1 1 -1 96 . . . . 101 112
"M10113V" 1 . . -1 . . . . . 5 96 . .
"M10134A" 1 . . -1 . . . . . 4 3 . .
"M10134A" 1 1 1 . -1 2 7 . . 4 3 . .
"M10134A" 100 -1 -1 1 2 -1 96 . . . . 85 94
"M10146E" 1 . . -1 . . . 3 -1 2 1 . .
"M10146E" 100 -1 -1 1 2 -1 96 . . . . 96 126
"M10146E" 1 1 1 . -1 2 7 3 -1 2 1 . .
"M10153D" 1 1 1 . -1 2 7 2 -1 6 96 . .
"M10153D" 1 . . -1 . . . 2 -1 6 96 . .
"M10153D" 100 -1 -1 1 2 -1 96 . . . . 126 132
"M10158J" 100 -1 -1 1 2 -1 96 . . . . 119 113
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 29795
#4

14 Feb 2022, 17:26

OK, within your example data in #3, it is always the case that the records of a give MCSID with PNUM00 == 1 all agree, except for possibly missing values. I am assuming this is the case throughout your full data set. If it is not, your results will be incorrect. So this code starts by verifying this assumption

Code:

// VERIFY CONSISTENCY OF DATA foreach v of varlist AELIG00-DCMATHS7SA { by MCSID PNUM00 (`v'), sort: assert inlist(`v', `v'[1], .) } // REDUCE MULTIPLE OBSERVATIONS TO SINGLETONS USING NON-MISSING VALUES WHERE THEY EXIST collapse (firstnm) AELIG00-DCMATHS7SA, by(MCSID PNUM00) // WIDE LAYOUT, AS REQUESTED reshape wide AELIG00-DCMATHS7SA, i(MCSID) j(PNUM00)

does what you ask.

Make sure you have a good reason for going to wide layout. Bear in mind that most data management and analysis tasks in Stata are easier, or only possible, with long data.
Comment
JR Rea

Join Date: Feb 2022

Posts: 18
#5

15 Feb 2022, 01:41

Thank you, when I ran the code it gave the following error.

930 contradictions in 4,766 observations
assertion is false

Does this mean there has been a fault in the merges and how could I investigate where the problem might be?
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 29795
#6

15 Feb 2022, 11:47

So this tells you that you have 930 instances of observations having the same MCSID and PNUM00 which have conflicting values on some variable(s). So you have to first identify these:

Code:

gen flag = 0 foreach v of varlist AELIG00-DCMATHS7SA { // USE THE FULL LIST OF VARIABLES IN THE REAL DATA SET HERE by MCSID PNUM00 (`v'), sort: replace flag = 1 if !inlist(`v', `v'[1], .) } by MCSID PNUM00 (flag), sort: replace flag = flag[_N] browse if flag

Now all of the offending observations will appear in the browser. From that point you have to look at the contradictory numbers and figure out, a) how they got there, and b) what to do about them.

It may be that they exist only because the data management that created this data set was flawed and these contradictory data are erroneous. In that case, you have to go back through the data management that created the data set, fix all the errors that led to this situation, and regenerate a corrected data set. If the data set creation was not in your purview, then you need to contact the person who provided it to you, call these errors to their attention, and have them create a corrected data set for you.

It may be that these contradictory values do not represent errors, and that you did not properly understand the nature of the data when you set out to do what you proposed in #1. In that case, you need a new plan. Perhaps there is a way to single out from among the conflicting values one which is correct and write code to retain only that one. Or it may be that the conflicting values can be combined in some way that you can calculate. Evidently this all depends on the meaning of the data itself, and only you can figure that part out.
Comment
JR Rea

Join Date: Feb 2022

Posts: 18
#7

15 Feb 2022, 12:27

Thank you, I'll work through all your suggestions, much appreciated.
Comment

Announcement

Combining rows which have the same variables.

Comment

Comment

Comment

Comment

Comment

Comment