
  • Dealing with (technical Stata) duplicates that are not real duplicates

    Hi,

    I imported data and merged it with other (panel) data. Reshaping the data resulted in this message: "There are multiple observations on the same Date within ID_S."

    I guess it is based on the fact that multiple (and additive) values occur on the same day in the dataset. You can see it in the sample (the value of 163.607 appears twice). Stata is technically showing this as a duplicate.

    I don't know how to deal with it, as duplicate dates cause problems.

    Any idea is highly appreciated.


    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input int Date double Debt_amount str7 ID_S double(Equity_amount Repurchase_amount) str9 Status
    16821       . "U295141" 163.607      . ""         
    16821       . "U295141" 163.607      . ""         
    16824       . "U904391"       . 98.269 "Completed"
    16824       . "U30709W"   16.08      . ""         
    16825       . "F922029" 738.082      . ""         
    16826 605.801 "D882362"       .      . ""         
    end
    format %td Date
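
    For reference, a quick way to inspect which observations Stata flags as duplicates (a sketch against the sample above; the variable name dup is my own):

    Code:
    * count the groups that share the same ID_S and Date
    duplicates report ID_S Date
    * flag every observation that has at least one twin
    duplicates tag ID_S Date, generate(dup)
    list if dup > 0, clean noobs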


  • #2
    Well if your "technical" duplicates are really completely duplicated (i.e. all of the variables contain exactly the same values) you can eliminate them with
    Code:
    duplicates drop
    But, before you do that, I would suggest you go back and review the data management process that led you to this point. -merge- does not create duplicate observations for no reason. Does your example data include all of the variables in the real data set? If not, it is likely that these "technical duplicates" actually contain different values for some variables, and then you cannot proceed to -reshape- the data until you resolve those conflicts (or change the -reshape- command's -i()- and -j()- specification by adding a new variable that reliably distinguishes the "technical duplicates" from each other).
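
    As a sketch of that workaround (the variable name seq is illustrative; the exact -reshape- call depends on your data):

    Code:
    * number the observations within each ID_S-Date group
    bysort ID_S Date: gen seq = _n
    * seq can then be added to -reshape-'s i() option to
    * distinguish the "technical duplicates" from each other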

    Moreover, even if the observations really are complete duplicates, -merge-ing does not create true duplicate observations out of thin air. The fact that they are there, but you don't expect them, suggests to me either that your expectations are incorrect and you are asking -reshape- to do the impossible, or something went wrong in the data management that led up to this point. If the latter, then there may be other errors as well that you should correct before moving ahead.

    Finally, why do you want to -reshape- this data? It is already in a nice long layout for panel data analysis (except for the duplication of ID-time combinations). If you go to wide layout you will just make further analysis more difficult (or even impossible), as most Stata commands are optimized for use with long data. The wide layout is really useful only for creating visual displays of the data, certain graphing commands, and a handful of analysis commands. So if you are tempted to go to wide layout just because you are accustomed to seeing data that way in spreadsheets, think a bit more carefully: Stata is not a spreadsheet, and wide data is more cumbersome to work with in Stata. (That said, if there should be no duplications of ID and time in your data at this point, you still need to fix that even if you don't have -reshape- nagging you about it.)



    • #3
      First, the obligatory warning. The experienced users here generally agree that, with few exceptions, Stata makes it much more straightforward to accomplish complex analyses using a long layout of your data rather than a wide layout of the same data. The problem you have encountered is one example of that. So think hard about why you want a wide layout.

      Next, a problem. You don't tell us how you were trying to reshape your data; in particular, you don't show us the command you ran. That makes it difficult to know if the solution I recommend will be helpful in your case. Do take a few moments to review the Statalist FAQ linked to from the top of the page. Note especially sections 9-12 on how to best pose your question. The more you help others understand your problem, the more likely others are to be able to help you solve your problem.

      With that said, since you describe your duplicated data as additive values, perhaps collapsing the duplicated observations, adding up the values to create daily totals, will solve your problem. Clyde's post assumes that, because the value of Equity_amount is identical on both observations, they represent true duplicates. That is what I would expect as well, but since I cannot fully understand what you are trying to accomplish, I will take your word that the two observations for U295141 with an equity amount of 163.607 represent two separate but identical transactions that can be combined into a single transaction for the day.
      Code:
      * Example generated by -dataex-. To install: ssc install dataex
      clear
      input int Date double Debt_amount str7 ID_S double(Equity_amount Repurchase_amount) str9 Status
      16821       . "U295141" 163.607      . ""        
      16821       . "U295141" 163.607      . ""        
      16824       . "U904391"       . 98.269 "Completed"
      16824       . "U30709W"   16.08      . ""        
      16825       . "F922029" 738.082      . ""        
      16826 605.801 "D882362"       .      . ""        
      end
      format %td Date
      
      collapse (sum) Debt_amount Equity_amount Repurchase_amount (firstnm) Status, by(ID_S Date)
      list, clean abbreviate(20) noobs
      Code:
      . list, clean abbreviate(20) noobs
      
               Date      ID_S   Debt_amount   Equity_amount   Repurchase_amount      Status  
          25jan2006   D882362       605.801               0                   0              
          24jan2006   F922029             0         738.082                   0              
          20jan2006   U295141             0         327.214                   0              
          23jan2006   U30709W             0           16.08                   0              
          23jan2006   U904391             0               0              98.269   Completed
      Last edited by William Lisowski; 04 Mar 2019, 13:48.



      • #4
        Thanks William, you are right. The best way is certainly to recheck whether these are duplicates or just 2 debt issues on one day (my assumption and the reason for this question). Thanks for the code, which is very helpful to me.

        And both of you are right that the question is not really specific and leaves a lot of room for interpretation.

        Also @Clyde, the reason for reshaping to wide is another variable called Datatypes (with its values in Q). These are generally balance sheet values that I want to reshape in order to run a panel regression (something I have never done). The example below shows only a small subset of the data, but the most relevant variables for the reshaping process.



        Code:
        * Example generated by -dataex-. To install: ssc install dataex
        clear
        input int Date str7 ID_S str62 name str61 Description str9 Datatypes str18 Q double(Equity_amount Debt_amount Repurchase_amount)
        17776 "U905686" "RENEW HOLDINGS - TOT RETURN IND"                          "RENEW HOLDINGS PLC - TOTAL ASSETS (U.S.$)"   "WC07230A"  ""       . . .
        19340 "U953535" "PARITY GROUP - TOT RETURN IND"                            "PARITY GROUP PLC - NET DEBT"                 "WC18199A"  "5420"   . . .
        16653 "U946023" "FAUPEL DEAD - T/O BY 946858 - TOT RETURN IND"             ""                                            ""          ""       . . .
        14693 "U943663" "SERCO GROUP - TOT RETURN IND"                             ""                                            ""          ""       . . .
        15078 "U974585" "NORTHAMBER - TOT RETURN IND"                              ""                                            ""          ""       . . .
        15984 "F997836" "VICAT - TOT RETURN IND"                                   ""                                            ""          ""       . . .
        15510 "U882671" "CHARTERHOUSE COMMS. DEAD - 21/04/08 - TOT RETURN IND"     ""                                            ""          ""       . . .
        20443 "U991575" "M P EVANS GROUP - TOT RETURN IND"                         "M.P. EVANS - ENTERPRISE VALUE"               "WC18100A"  "224064" . . .
        16394 "U904649" "SAGE GROUP - TOT RETURN IND"                              ""                                            ""          ""       . . .
        16581 "U906124" "PITTARD - TOT RETURN IND"                                 ""                                            ""          ""       . . .
        15320 "U910394" "HAZLEWOOD FOODS DEAD - DEAD 15/02/01 - TOT RETURN IND"    ""                                            ""          ""       . . .
        16615 "U135511" "DOMNICK HUNTER DEAD - T/O BY 905150 - TOT RETURN IND"     ""                                            ""          ""       . . .
        16013 "U910450" "LEEDS GROUP - TOT RETURN IND"                             ""                                            ""          ""       . . .
        18228 "D695996" "BASLER - TOT RETURN IND"                                  "BASLER AG - EBIT & DEPRECIATION"             "WC18198A," "647"    . . .
        15613 "U907481" "WIRELESS GROUP DEAD - 24/10/16 - TOT RETURN IND"          ""                                            ""          ""       . . .
        14668 "U681943" "SPS.& OUTDR.MEDIA INTL. DEAD - 16-06-00 - TOT RETURN IND" ""                                            ""          ""       . . .
        15679 "U943674" "PRIME PEOPLE - TOT RETURN IND"                            ""                                            ""          ""       . . .
        20066 "U943417" "TITON HOLDINGS - TOT RETURN IND"                          "TITON HOLDINGS PLC - RESEARCH & DEVELOPMENT" "WC01201A"  ""       . . .
        15935 "U882297" "MEARS GROUP - TOT RETURN IND"                             ""                                            ""          ""       . . .
        20425 "U991575" "M P EVANS GROUP - TOT RETURN IND"                         "M.P. EVANS - ENTERPRISE VALUE"               "WC18100A"  "224064" . . .
        end
        format %td Date

        I tried the code

        Code:
        reshape wide Datatypes, i(ID_S) j(Date)

        which produced the error telling me about the "duplicates" (caused by some double values for debt or equity on the same date for the same company). This could well be true, as a company is not restricted to one debt issue a day.

        If you have an idea on how to run the panel regression using the Datatypes, it would be very helpful (just in principle). I want the panel to be semiannual, while the dates are daily.
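
        One possible route (a sketch only; it assumes Q holds numeric values as strings, that averaging within a half-year is appropriate, and that the Datatypes codes are cleaned so they can serve as variable-name suffixes):

        Code:
        * drop rows with no datatype code, and remove stray characters
        drop if Datatypes == ""
        replace Datatypes = subinstr(Datatypes, ",", "", .)
        * convert Q from string to numeric
        destring Q, generate(Qnum)
        * half-yearly period from the daily date
        gen half = hofd(Date)
        format half %th
        * one value per firm, half-year, and datatype (mean is an assumption)
        collapse (mean) Qnum, by(ID_S half Datatypes)
        * spread the datatype codes into columns: QnumWC18100A, QnumWC18199A, ...
        reshape wide Qnum, i(ID_S half) j(Datatypes) string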



        • #5
          Perhaps I am not understanding what you want to do, but panel regression in Stata can only be done with long data, not wide.

          One other thing you will need to do to prepare for panel data regression is create a numeric panel identifier. Your variable ID_S is a string. If there are fewer than 65,000 distinct values of ID_S, you can do that with the -encode- command (-help encode-). If more, then you will need to use -egen numeric_panel_id = group(ID_S), label-.
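
          For example (a sketch; panel_id is an illustrative name, and the time variable you -xtset- depends on how you aggregate the daily dates):

          Code:
          * numeric panel identifier from the string ID_S
          encode ID_S, generate(panel_id)
          * with more than about 65,000 distinct values, use instead:
          * egen long panel_id = group(ID_S), label
          * then declare the panel structure
          xtset panel_id Date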



          • #6
            Clyde: sorry for the confusion, I will provide a better explanation in future posts.

            Nevertheless, just look at the last dataex extract above that I posted. You see the WC18100A or the WC18199A. These must (amongst others) be used as VARIABLES in the regression, with the value given by Q. The data must therefore be transformed to wide, so that each ID_S has its own variables WC18100A and so on. Of course, each of these variables in its own right needs to be long for the regression to work.

            Thanks for the remark on the string. Very helpful.



            • #7
              Nevertheless, just look at the last dataex extract above that I posted. You see the WC18100A or the WC18199A. These must (amongst others) be used as VARIABLES in the regression, with the value given by Q. The data must therefore be transformed to wide, so that each ID_S has its own variables WC18100A and so on. Of course, each of these variables in its own right needs to be long for the regression to work.
              Yes, you are right. I had not understood that this was what you were doing.



              • #8
                Thanks for the great support! You cannot imagine, how much I learn through this forum and how much I appreciate the help of people like you, Clyde.

                By the way, I was working around the reshape problem using the quarterly database from which I want to make the panel regression later on:

                https://www.statalist.org/forums/for...atastream-data


                The duplicates (from the issue yesterday) were indeed "double values", as they referred to companies raising money in SEVERAL markets on the same date. Another reminder to me that the data always needs to be understood and managed accordingly, so that no "garbage in, garbage out" problem can arise.

