Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Combining rows which have the same variables.

    After combining numerous datasets I need to initially combine the 2 rows where PNUM00=1 so that they become 1 row. Thereafter, I need to change to a 'wide' format where each family identifier (MCSID) has only one row with contains all the relevant information. Would appreciate any advice.

    Click image for larger version

Name:	Screenshot 2022-02-14 at 18.05.38.png
Views:	1
Size:	398.4 KB
ID:	1650066

  • #2
    In what sense do you want to "combine" those rows? In the screenshot of data you show all of the variables that are visible have missing values in the second such row, except for PNUM00 and CNUM00. Is that always the case where you have pairs of rows with PNUM00 == 0? If not, when there is non-missing data in both rows, what do you want to happen if they have disagreeing values?

    Please post back with clarifications, and, with usable example data. That means using the -dataex- command. If you are running version 17, 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

    Going forward, please always use -dataex- to show example data.

    Comment


    • #3
      Thank you for replying. I have attempted to post a selected number of variables from the combined dataset. Within each family (MCSID) there is a PNUM00 relating to a member of the family who is not the cohort child (CNUM00). Each cohort member has their own line, followed by a line/s for the corresponding PNUM00. I have selected all those with a PNUM00==1 who represent the mother of the child. I am trying to achieve one line for each PNUM00==1 within the family, and then finally create 1 line for each MCSID which includes both the MCSID PNUM00 and CNUM00. There will be missing data as the combined dataset crosses 4 sweeps, reducing in size each time.

      dataex MCSID PNUM00 AELIG00 ARESP00 CNUM00 AHCSEX00 AHPSEX00 AHCREL00 GPNTCO00 GPNTLP00 APACQU00 APVCQU00 DCWRSD00 DCMATHS7SA

      ----------------------- copy starting from the next line -----------------------
      [CODE]
      * Example generated by -dataex-. For more info, type help dataex
      clear
      input str7 MCSID int PNUM00 byte(AELIG00 ARESP00 CNUM00 AHCSEX00 AHPSEX00 AHCREL00 GPNTCO00 GPNTLP00 APACQU00 APVCQU00) int(DCWRSD00 DCMATHS7SA)
      "M10016V" 100 -1 -1 1 1 -1 96 . . . . 136 130
      "M10016V" 1 1 1 . -1 2 7 9 -1 6 96 . .
      "M10016V" 1 . . -1 . . . 9 -1 6 96 . .
      "M10024V" 100 -1 -1 1 1 -1 96 . . . . 109 108
      "M10024V" 1 1 1 . -1 2 7 . . 2 96 . .
      "M10032V" 100 -1 -1 1 1 -1 96 . . . . . .
      "M10032V" 1 1 1 . -1 2 7 . . 96 96 . .
      "M10044Z" 100 -1 -1 1 2 -1 96 . . . . -1 -1
      "M10044Z" 1 . . -1 . . . -1 -9 96 96 . .
      "M10044Z" 1 1 1 . -1 2 7 -1 -9 96 96 . .
      "M10052Z" 1 1 1 . -1 2 7 . . 5 5 . .
      "M10052Z" 100 -1 -1 1 1 -1 96 . . . . 90 93
      "M10058F" 1 1 1 . -1 2 7 . . 1 1 . .
      "M10058F" 100 -1 -1 1 1 -1 96 . . . . . .
      "M10062B" 1 1 1 . -1 2 7 . . 3 95 . .
      "M10062B" 1 . . -1 . . . . . 3 95 . .
      "M10062B" 100 -1 -1 1 2 -1 96 . . . . 116 98
      "M10068H" 1 . . -1 . . . 3 -1 5 6 . .
      "M10068H" 100 -1 -1 1 1 -1 96 . . . . 140 121
      "M10068H" 1 1 1 . -1 2 7 3 -1 5 6 . .
      "M10083G" 1 1 1 . -1 2 7 3 -1 5 96 . .
      "M10083G" 1 . . -1 . . . 3 -1 5 96 . .
      "M10083G" 100 -1 -1 1 2 -1 96 . . . . 119 112
      "M10113V" 1 1 1 . -1 2 7 . . 5 96 . .
      "M10113V" 100 -1 -1 1 1 -1 96 . . . . 101 112
      "M10113V" 1 . . -1 . . . . . 5 96 . .
      "M10134A" 1 . . -1 . . . . . 4 3 . .
      "M10134A" 1 1 1 . -1 2 7 . . 4 3 . .
      "M10134A" 100 -1 -1 1 2 -1 96 . . . . 85 94
      "M10146E" 1 . . -1 . . . 3 -1 2 1 . .
      "M10146E" 100 -1 -1 1 2 -1 96 . . . . 96 126
      "M10146E" 1 1 1 . -1 2 7 3 -1 2 1 . .
      "M10153D" 1 1 1 . -1 2 7 2 -1 6 96 . .
      "M10153D" 1 . . -1 . . . 2 -1 6 96 . .
      "M10153D" 100 -1 -1 1 2 -1 96 . . . . 126 132
      "M10158J" 100 -1 -1 1 2 -1 96 . . . . 119 113


      Comment


      • #4
        OK, within your example data in #3, it is always the case that the records of a give MCSID with PNUM00 == 1 all agree, except for possibly missing values. I am assuming this is the case throughout your full data set. If it is not, your results will be incorrect. So this code starts by verifying this assumption

        Code:
        // VERIFY CONSISTENCY OF DATA
        foreach v of varlist AELIG00-DCMATHS7SA {
            by MCSID PNUM00 (`v'), sort: assert inlist(`v', `v'[1], .)
        }
        
        // REDUCE MULTIPLE OBSERVATIONS TO SINGLETONS USING NON-MISSING VALUES WHERE THEY EXIST
        collapse (firstnm) AELIG00-DCMATHS7SA, by(MCSID PNUM00)
        
        // WIDE LAYOUT, AS REQUESTED
        reshape wide AELIG00-DCMATHS7SA, i(MCSID) j(PNUM00)
        does what you ask.

        Make sure you have a good reason for going to wide layout. Bear in mind that most data management and analysis tasks in Stata are easier, or only possible, with long data.

        Comment


        • #5
          Thank you, when I ran the code it gave the following error.

          930 contradictions in 4,766 observations
          assertion is false

          Does this mean there has been a fault in the merges and how could I investigate where the problem might be?

          Comment


          • #6
            So this tells you that you have 930 instances of observations having the same MCSID and PNUM00 which have conflicting values on some variable(s). So you have to first identify these:

            Code:
            gen flag = 0
            foreach v of varlist AELIG00-DCMATHS7SA { // USE THE FULL LIST OF VARIABLES IN THE REAL DATA SET HERE
                by MCSID PNUM00 (`v'), sort: replace flag = 1 if !inlist(`v', `v'[1], .)
            }
            by MCSID PNUM00 (flag), sort: replace flag = flag[_N]
            browse if flag
            Now all of the offending observations will appear in the browser. From that point you have to look at the contradictory numbers and figure out, a) how they got there, and b) what to do about them.

            It may be that they exist only because the data management that created this data set was flawed and these contradictory data are erroneous. In that case, you have to go back through the data management that created the data set, fix all the errors that led to this situation, and regenerate a corrected data set. If the data set creation was not in your purview, then you need to contact the person who provided it to you, call these errors to their attention, and have them create a corrected data set for you.

            It may be that these contradictory values do not represent errors, and that you did not properly understand the nature of the data when you set out to do what you proposed in #1. In that case, you need a new plan. Perhaps there is a way to single out from among the conflicting values one which is correct and write code to retain only that one. Or it may be that the conflicting values can be combined in some way that you can calculate. Evidently this all depends on the meaning of the data itself, and only you can figure that part out.


            Comment


            • #7
              Thank you, I'll work through all your suggestions, much appreciated.

              Comment

              Working...
              X