Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Cleaning dataset by finding duplicates among different variables

    Hello,

    I hope this is not a redundant question (I'm rather new to Stata) but I haven't yet found an attractive solution to my problem.

    I need to identify (and later drop) observations that have a value in one variable (ID_1) which is equal to another observation's value of a different variable (ID_2) under the condition that both observations have the same date and the matched observation has a specific value in an indicator variable (either C or W). I would like to generate a dummy variable that ticks the corresponding observations first before I drop them.

    Here, the simplified data:
    Firm ID_1 ID_2 Date Indicator
    1 28896 07. Feb 03 T
    2 29105 07. Feb 03 T
    1 29106 28896 07. Feb 03 W
    3 29126 07. Feb 03 T
    3 29254 07. Feb 03 N
    4 29283 07. Feb 03 T
    3 29288 29254 07. Feb 03 C

    For example, I would like to the dummy variable to mark the first observation with ID_1=28896 because the third line item's ID_2 matches that ID_1 while it has the same date and also the Indicator equals "W" instead of "T". The same applies to the fifth observation which should be marked due to the last line item.

    I am sorry, I cannot provide a code, I have tried different approaches using forvalue and duplicates but nothing worked properly.

    Thanks and best regards,
    Fred

  • #2
    This is a good task for Robert Picard's -rangejoin-, which you can get from SSC. In order to use -rangejoin- you must also install -rangestat- (Also by Robert Picard, with Nick Cox & Roberto Ferrer) from SSC.

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input byte firm int(id_1 id_2) float date str1 indicator
    1 28896     . 15743 "T"
    2 29105     . 15743 "T"
    1 29106 28896 15743 "W"
    3 29126     . 15743 "T"
    3 29254     . 15743 "N"
    4 29283     . 15743 "T"
    3 29288 29254 15743 "C"
    end
    format %td date
    
    preserve
    keep if !missing(id_2) & inlist(indicator, "C", "W")
    rename id_2 matching_id
    tempfile id2s
    save `id2s'
    restore
    rangejoin matching_id id_1 id_1 using `id2s', by(date)
    gen byte has_match = !missing(indicator_U)
    In the future, please use -dataex- to give example data, as I have done here. Your example HTML table was cumbersome to import into Stata because it required converting the date variable to a Stata internal format date variable. Also, it is impossible to know from your table whether indicator is a string variable or a numeric variable with a value label. I guessed it's a string. If it's not, you'll have to modify the code above to treat it as a numeric variable. When you use -dataex- (also by Robert Picard and also available from SSC) you enable those who want to help you to instantly create a completely faithful replica of your Stata example, that includes all of the subtleties of storage types and labeling, with a simple copy/paste operation. Please use it every time you post example data in the future.

    Comment


    • #3
      Hello Clyde,

      thanks a lot, this works perfectly! The indicator is indeed a string (even the IDs were but I destringed those).

      Next time, I will use dataex. I even tried it this time before posting but my inexperience led me to believe that it would be arranged more clearly like that, now I know.

      Best
      Fred

      Comment

      Working...
      X