Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Combining duplicate observation

    Hello,

    I feel like this should be easy, but I cannot figure it out myself based on the help files or answers to similar questions.

    I have observations on countries, but some countries have duplicate observations. For example, Hong Kong is is my dataset three times. For some variables the save values are in every observation and for others only one observation has a value and the others are missing. To clarify, this looks like this (although I have many more countries and variables):
    ISO var1 var2 var3 var4 var5 var6
    HKG 4 5 1 . . .
    HKG 4 5 . 2 . .
    HKG 4 5 . . 3 .

    I need to combine these three observations for Hong Kong into one observation:
    ISO var1 var2 var3 var4 var5 var6
    HKG 4 5 1 2 3 .

    So, there are three situations: (1) all duplicate observations have the same value for a variable, (2) only one observation has a value and the others are missing and (3) all are missing.

    One thing I though about was to use:
    Code:
    egen var4_mean = mean(var4), by ISO
    replace var4 = var4_mean if ISO == "HKG"
    drop var4_mean
    The problem with this is that I have many observations, so I think there should be an easier way of doing this than looping over all the variables.

    Any help is much appreciated!

    Best,
    Alexander

  • #2
    Code:
    collapse (firstnm) var1-var6, by(ISO)
    That said, are you 100% certain that when a country has more than one non-missing value for a variable those non-missing values all agree? I don't doubt that your data are supposed to be that way. But real world data sets often have nasty surprises in them. Have you verified this? If it were my data, I would do this first:

    Code:
    foreach v of varlist var1-var6 {
    by ISO (`v'), sort: assert `v' == `v'[1] if !missing(`v')
    }
    Last edited by Clyde Schechter; 08 Feb 2018, 15:40. Reason: Fix damaged code block.

    Comment


    • #3
      Thank! I actually did it by looping over the variables, which wasn't as much work as I expected.

      Comment


      • #4
        Hello,

        I have a dataset of about 38000 observations panel data. I did

        duplicates list ID ticker year

        to identify the duplicates. As Clyde mentioned in the post I verified if the duplicates agree in value before proceeding. Actually they do not agree in at least one of my 12 cases. I don't know how to proceed now in the most accurate way because I can't judge which observation would be the "right" one. Otherwise I would just have used the codes mentioned in your post...

        What would you propose in this case?

        I appreciate any help!

        Best,
        Romi

        Comment


        • #5
          Actually, the first thing I would do is investigate why you have conflicting observations. It seems that you do not expect that to happen, so there is probably some error in the data management that led up to the creation of your data set. So you should review that code to look for errors if you were the one who put the data together. If somebody else did that and just gave you the data set to work with, you should confer with them about why this has happened and let them fix the problem.

          Now, it may turn out that the problem is that your expectation is wrong and all of these conflicting observations are in some sense correct. In that case, you need some rule for either selecting one, or for combining them into a single observation (perhaps taking the mean or median of conflicting values or something like that.) As for which approach is the best to take, that is not a statistical question: it's a question of the science in your domain, and you need help from one of your colleagues about that.

          Comment

          Working...
          X