  • delete duplicates

    Hello,

    I'm a bit desperate and need your help with my master's thesis. I'm really sorry if this is a stupid question or the wrong place to post (it's my first post in a forum ever).

    This is my problem:

    - anchor data set: Wave 2-6 -> works
    [screenshot: 18814514_457868401222451_6008569011304543352_o.jpg]

    - parenting data set: wave 2-6 → Duplicates in wave 6
    [screenshot: parenting.jpg]


    - child data set: Wave 2-6 → Duplicates in wave 6
    [screenshot: kinder.jpg]

    How is it possible to delete the duplicates? (For each id there should be only one child id per wave.)
    After this, my plan is to merge the anchor data, parenting data, and child data. I read something about dropping duplicates: "duplicates drop id wave, force", but I'm not sure about it at all.
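
    [Editor's note: a minimal sketch of the plan described above, assuming each data set is keyed by id and wave; the file names anchor, parenting, child, and child_nodup are placeholders, not the poster's actual files.]
    Code:
    * inspect the surplus observations before dropping anything
    use child, clear
    duplicates report id wave

    * safe only if the surplus observations are exact copies on all variables
    duplicates drop
    save child_nodup, replace

    * then combine the three data sets on the common key
    use anchor, clear
    merge 1:1 id wave using parenting, nogenerate
    merge 1:1 id wave using child_nodup
    Note that merge 1:1 will fail if id and wave still do not uniquely identify observations after the drop, which is itself a useful check.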

    Thanks in advance

    Guest
    Last edited by sladmin; 28 Jan 2019, 09:31. Reason: anonymize original poster

  • #2
    Well
    Code:
    duplicates drop
    by itself is always safe, assuming there aren't supposed to be any observations in the data set that are complete duplicates of each other.

    But
    Code:
    duplicates drop id wave, force
    may be dangerous. If, in the child data set, all of the observations that share the same values of id and wave also agree on all the other variables, then this is a reasonable way to proceed. But if they disagree on other variables, this command arbitrarily selects one of the several records to retain and throws away the others. That selection is not replicable: if you re-run the same code you may get a different selection, and subsequent calculations involving any variables that differed among those records will then produce different results.

    So before you worry about how to eliminate these records, worry about why they are there in the first place. If they are pure duplicates on all variables, then -duplicates drop- will handle it. But if they disagree on other variables, you need to find out why. Are all of the records but one errors? If so, how do you identify which is the correct one? Or perhaps the records are not errors at all: all of them are correct but reflect different times or aspects of the situation, and for your purposes you need to combine the information in them in some way, e.g. averaging, or taking the largest value, or something like that. Either way, you need to resolve these issues before you proceed.
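
    [Editor's note: one way to carry out that check, as a sketch; the variable name dup is arbitrary.]
    Code:
    * how many observations are duplicates on all variables?
    duplicates report

    * how many are duplicates on the key alone?
    duplicates report id wave

    * flag and inspect the records that share id and wave
    duplicates tag id wave, generate(dup)
    list if dup > 0, sepby(id wave)
    If the two reports differ, then records sharing id and wave disagree somewhere else, and -duplicates drop id wave, force- would discard information.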

    Comment


    • #3
      Originally posted by Guest View Post
      I read something about dropping duplicates: "duplicates drop id wave, force" but I'm not sure at all?!
      Try the duplicates command and compare your data before and after.

      Originally posted by Guest View Post
      it's my first post in a forum ever
      Please read the FAQ, it has much useful advice for new Statalist members.
      Last edited by sladmin; 28 Jan 2019, 09:31. Reason: anonymize original poster

      Comment


      • #4
        I can testify, as the putative author of duplicates, that the option name force was intended as a deliberate reminder that you are being brutal to your data. It seems that it is widely not taken seriously enough. I wonder whether people should be obliged to type out the whole of an option name like irealisethisisdamaging.

        Comment


        • #5
          Dear Stata members
          I would like to remove the non-disjoint groups from my sample. By non-disjoint groups, I mean those entities (in my case, co_code) that appear in both categories. See the following data example:

          Code:
          clear
          input long co_code byte list_unlist int year float(epu wanted group)
               3 0 2004 4.08252  1 0
               3 0 2011 4.73533  1 0
               3 0 2012 5.20827  1 0
               3 0 2013 5.14044  1 0
               3 0 2014 4.76683  1 0
               3 0 2015 4.59869  1 0
               3 0 2016 4.24885  1 0
               3 0 2018 4.13127  1 0
               3 0 2019 4.03485  1 0
              15 0 2004 4.08252  2 0
             289 1 2004 4.08252 13 1
             289 1 2005 4.27638 13 1
             289 1 2006 3.86614 13 1
             289 1 2009 4.95791 13 1
             289 1 2010 4.61611 13 1
             289 1 2011 4.73533 13 1
             289 1 2012 5.20827 13 1
             289 1 2013 5.14044 13 1
             289 1 2014 4.76683 13 1
             289 1 2015 4.59869 13 1
             289 1 2016 4.24885 13 1
             289 1 2017 4.33206 13 1
             289 1 2018 4.13127 13 1
             289 1 2019 4.03485 13 1
             328 0 2005 4.27638  3 0
             328 0 2006 3.86614  3 0
             328 0 2007 4.00362  3 0
             328 0 2008 4.27017  3 0
             328 0 2009 4.95791  3 0
             328 0 2010 4.61611  3 0
             337 0 2004 4.08252  4 0
             337 0 2005 4.27638  4 0
             337 0 2006 3.86614  4 0
             337 0 2007 4.00362  4 0
             337 0 2008 4.27017  4 0
             337 0 2009 4.95791  4 0
             337 0 2010 4.61611  4 0
             337 0 2011 4.73533  4 0
             337 0 2012 5.20827  4 0
             337 0 2013 5.14044  4 0
          136909 0 2005 4.27638  5 0
          136909 0 2006 3.86614  5 0
          136909 0 2010 4.61611  5 0
          136909 0 2011 4.73533  5 0
          136909 0 2012 5.20827  5 0
          136911 0 2005 4.27638  6 0
          136911 0 2006 3.86614  6 0
          136911 0 2007 4.00362  6 0
          136911 0 2008 4.27017  6 0
          137490 0 2004 4.08252  7 0
          137490 0 2005 4.27638  7 0
          137490 0 2006 3.86614  7 0
          137490 0 2007 4.00362  7 0
          137490 0 2008 4.27017  7 0
          137490 0 2011 4.73533  7 0
          137490 0 2012 5.20827  7 0
          137490 0 2013 5.14044  7 0
          137490 0 2014 4.76683  7 0
          137490 0 2015 4.59869  7 0
          137490 0 2016 4.24885  7 0
          137490 0 2017 4.33206  7 0
          137490 0 2018 4.13127  7 0
          137490 0 2019 4.03485  7 0
          137495 0 2004 4.08252  8 0
          137495 0 2005 4.27638  8 0
          137495 0 2012 5.20827  8 0
          137495 0 2013 5.14044  8 0
          137495 0 2014 4.76683  8 0
          137495 0 2015 4.59869  8 0
          137495 0 2016 4.24885  8 0
          137495 0 2017 4.33206  8 0
          137495 0 2018 4.13127  8 0
          137495 0 2019 4.03485  8 0
          137511 0 2004 4.08252  9 0
          137511 0 2005 4.27638  9 0
          137511 1 2006 3.86614 14 1
          137511 1 2007 4.00362 14 1
          137511 1 2008 4.27017 14 1
          137511 1 2009 4.95791 14 1
          137511 1 2010 4.61611 14 1
          137511 1 2011 4.73533 14 1
          137511 1 2012 5.20827 14 1
          137511 1 2013 5.14044 14 1
          137511 1 2014 4.76683 14 1
          137511 1 2015 4.59869 14 1
          137511 1 2016 4.24885 14 1
          137511 1 2017 4.33206 14 1
          137511 1 2018 4.13127 14 1
          137511 1 2019 4.03485 14 1
          137512 0 2004 4.08252 10 0
          137512 0 2005 4.27638 10 0
          137512 0 2006 3.86614 10 0
          137512 0 2007 4.00362 10 0
          137512 0 2008 4.27017 10 0
          137512 0 2009 4.95791 10 0
          137512 0 2010 4.61611 10 0
          137512 0 2011 4.73533 10 0
          137512 0 2012 5.20827 10 0
          137554 1 2004 4.08252 15 1
          137554 1 2005 4.27638 15 1
          end
          
          .  tab co_code group
          
                     |         group
             co_code |         0          1 |     Total
          -----------+----------------------+----------
                   3 |         9          0 |         9
                  15 |         1          0 |         1
                 289 |         0         14 |        14
                 328 |         6          0 |         6
                 337 |        10          0 |        10
              136909 |         5          0 |         5
              136911 |         4          0 |         4
              137490 |        14          0 |        14
              137495 |        10          0 |        10
              137511 |         2         14 |        16 
              137512 |         9          0 |         9
              137554 |         0          2 |         2
          -----------+----------------------+----------
               Total |        70         30 |       100
          Here co_code 137511 appears in both groups, hence is non-disjoint. I would like to do three things:
          1. Count the co_codes that are in both groups.
          2. Delete the co_codes that are in both groups.
          3. Tag or identify those non-disjoint co_codes so that I can exclude them if needed; for instance, code such non-disjoint groups as 1 so that I can avoid them when running regressions.

          I have a large data set, so -tab- won't work and I can't inspect those observations in the original data.
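
          [Editor's note: a sketch of one approach. Within each co_code, group takes a single value exactly when its minimum and maximum agree, so sorting by group within co_code and comparing the first and last observations tags the non-disjoint entities without -tab-. The variable names nondisjoint and one_per_code are arbitrary.]
          Code:
          * flag every observation of a co_code that appears in both groups
          bysort co_code (group): gen byte nondisjoint = group[1] != group[_N]

          * 1. count such co_codes (one tagged observation per co_code)
          egen byte one_per_code = tag(co_code)
          count if one_per_code & nondisjoint

          * 3. keep the indicator to exclude them later, e.g. regress ... if !nondisjoint
          * 2. or delete them outright
          drop if nondisjoint
          With the example data above, the count should be 1 (co_code 137511).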

          Any help in this regard will be greatly appreciated.
          Last edited by lal mohan kumar; 29 Apr 2021, 00:43.

          Comment
