
  • Sorting data: filling in missing data, duplicating data and adding an additional variable

    Hi

    I'm attempting to clean and organise a large set of data that has been presented in an unusual way; I've included an example. It's slightly difficult to explain, so thanks in advance for your help!

    I have an identifying code for a person "identity", and multiple rows of data for each person which are shown by the "count" variable. For each person there is an "event" code. I would like to fill in the missing data in the event column, using the event code which corresponds to the same person.

    Each "event" is unique to one person (you'll never have the same event code for a different person), but a person can have more than one event (more than one event code corresponding to a person).

    If they have more than one event all the remaining data for the corresponding identity should be duplicated.

    The position of the event code should also be noted as an additional variable. To help make it clearer, I've also included an example of the outcome I'm looking for!


    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input float(event count) str8 identity float(var4 var5)
         . 1 "g19937g"   37   67
         . 2 "g19937g"   54 3832
    256789 3 "g19937g"   23  653
         . 4 "g19937g"   34    3
         . 5 "g19937g"   35   45
         . 6 "g19937g"   14    6
    347652 1 "d18793"    75   67
         . 2 "d18793"    56   34
         . 3 "d18793"    23    1
         . 4 "d18793"    55  568
         . 6 "d18793"    87  134
    456736 7 "d18793"     3    .
         . 8 "d18793"    58   76
         . 9 "d18793"  2337  567
         . 1 "e34863"    23  223
         . 2 "e34863"    14  787
         . 3 "e34863"    76  342
    156782 4 "e34863"   745  753
         . 5 "e34863"   135   26
    end
    So it should go on to look like the data below.

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input float(newevent event count) str8 identity float(var4 var5)
    0 256789 1 "g19937g"   37   67
    0 256789 2 "g19937g"   54 3832
    1 256789 3 "g19937g"   23  653
    0 256789 4 "g19937g"   34    3
    0 256789 5 "g19937g"   35   45
    0 256789 6 "g19937g"   14    6
    1 347652 1 "d18793"    75   67
    0 347652 2 "d18793"    56   34
    0 347652 3 "d18793"    23    1
    0 347652 4 "d18793"    55  568
    0 347652 6 "d18793"    87  134
    0 347652 7 "d18793"     3    .
    0 347652 8 "d18793"    58   76
    0 347652 9 "d18793"  2337  567
    0 456736 1 "d18793"    75   67
    0 456736 2 "d18793"    56   34
    0 456736 3 "d18793"    23    1
    0 456736 4 "d18793"    55  568
    0 456736 6 "d18793"    87  134
    1 456736 7 "d18793"     3    .
    0 456736 8 "d18793"    58   76
    0 456736 9 "d18793"  2337  567
    0 156782 1 "e34863"    23  223
    0 156782 2 "e34863"    14  787
    0 156782 3 "e34863"    76  342
    1 156782 4 "e34863"   745  753
    0 156782 5 "e34863"   135   26
    end
    Thank you.

  • #2
    Code:
    preserve
    
    keep identity event count
    drop if missing(event)
    duplicates drop
    rename count count2
    tempfile events
    save `events'
    
    restore
    drop event
    joinby identity using `events'
    isid identity event count, sort
    gen newevent = (count == count2)
    drop count2
    order newevent event, first
    Note: You cannot run this code line by line. Because of the -preserve- and -restore- commands and use of tempfiles, it must be run in one fell swoop.
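    For anyone unfamiliar with -joinby-: it forms all pairwise combinations of observations that share the same value of identity, which is what duplicates the remaining data for identities with more than one event. A minimal toy sketch (the variables x and y are made up purely for illustration):

    Code:
    * person "a" has 2 rows in memory and 2 rows in the using file,
    * so -joinby- produces 2 x 2 = 4 rows for "a"
    clear
    input str1 identity x
    "a" 1
    "a" 2
    end
    preserve
    clear
    input str1 identity y
    "a" 10
    "a" 20
    end
    tempfile right
    save `right'
    restore
    joinby identity using `right'
    list, sepby(identity)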

    • #3
      Thank you, that's brilliant!

      • #4
        Hi Clyde

        Your solution worked very well on the example dataset, but when I come to use it on the main dataset I get the following error:

        isid identity event count, sort
        "variables identity event count do not uniquely identify the observations"


        I've double-checked the data and there are no duplicates of this nature prior to running the code. There are lots of additional variables, but I can't see how this would influence things.
        I've tried to work out what's going on and have re-run the code with the following additional line prior to -isid identity event count, sort-:

        duplicates drop identity event count, force


        This seems to do the job, but I wonder if I'm missing something? Apologies, I can't share the output directly: the data are confidential, and for this work I access the data and Stata through a secure server.

        Many thanks for your help!

        • #5
          This seems to do the job, but I wonder if I'm missing something?
          As do I. And a clue that something is going seriously wrong is that you used the -force- option on your -duplicates drop- command. If you can run -duplicates drop identity event count- without the -force- option, then you can be sure that you are not losing any information. But if you actually need the -force- option to avoid an error message and a break, then it means that you not only have multiple instances of some combination(s) of identity, event, and count, but that those instances disagree about the values of other variables. When you use -force-, one of the multiples is arbitrarily picked for keeping, and the choice is not even reproducible if you rerun the code. So that's big trouble.

          Unfortunately, without example data, I cannot directly troubleshoot for you. What I recommend you do is start over, and immediately after the joinby command run:

          Code:
          duplicates tag identity event count, gen(flag)
          browse if flag
          That will enable you to see what the multiple observations are, and perhaps then you can figure out where they are coming from and what to do about it.
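          If -browse- is awkward on the secure server, a hedged alternative is to -list- the flagged groups so the conflicting count2 values appear side by side (a sketch; run it at the same point, immediately after -joinby-):

          Code:
          sort identity event count
          list identity event count count2 if flag, sepby(identity event count)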

          • #6
            Thanks Clyde, this is very helpful.

            The only variable that differs at that stage seems to be the new "count2" column. Everything else throughout the dataset is identical; for a portion of the rows there is a single extra row whose count2 value is n+1 relative to the corresponding row above. This seems rather odd?

            When I drop this duplicate, it seems to function fine.

            I'm not sure whether the platform through which I have to access the data and Stata is causing the issue.

            • #7
              Well, again, it's hard to do this flying blind. But I think what you're telling me in #6 is that after -joinby- you have some observations that agree on identity, event, and count, but have different values of count2. Now, by the way the code works, event comes from the temporary `events' data set, not the original data. And because the construction of `events' includes -duplicates drop-, and the only variables at that point are identity, event, and count, there cannot be duplicate combinations of identity, event, and count in there.

              So the duplicates must be arising from the original data set. But that would imply that the original data set contains observations that have the same values of identity, event, and count--which you said in #1 should not happen. So I think you need to go back all the way to the start and, before you run any of the code from #2, run
              Code:
              duplicates tag identity event count, gen(flag)
              browse if flag
              to find those. Those are, I believe, the source of the problem, and you will need to fix that error in the data.
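              One quick way to tell whether the flagged observations are harmless full-row duplicates or real conflicts is to compare two -duplicates report- runs (a sketch using only official -duplicates- subcommands, run on the original data):

              Code:
              * surplus over all variables vs. surplus over the key alone:
              * if the two counts differ, some observations agree on
              * identity, event, and count but disagree elsewhere
              duplicates report
              duplicates report identity event count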
