
  • Dealing with (technical Stata) duplicates that are not real duplicates

    Hi,

    I imported data and merged it with other (panel) data. Reshaping the data resulted in this message: "There are multiple observations on the same Date within ID_S."

    I guess it is based on the fact that multiple (and additive) values occur on the same day in the dataset. You can see it in the sample (the value of 163.607 appears twice). Stata is technically showing this as a duplicate.

    I don't know how to deal with it, as duplicate dates cause problems.

    Any idea is highly appreciated.


    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input int Date double Debt_amount str7 ID_S double(Equity_amount Repurchase_amount) str9 Status
    16821       . "U295141" 163.607      . ""         
    16821       . "U295141" 163.607      . ""         
    16824       . "U904391"       . 98.269 "Completed"
    16824       . "U30709W"   16.08      . ""         
    16825       . "F922029" 738.082      . ""         
    16826 605.801 "D882362"       .      . ""         
    end
    format %td Date
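
    For reference, a quick way to inspect which observations Stata flags as duplicates (a sketch against the sample above; the variable name dup is my own):

    Code:
    * count the groups that share the same ID_S and Date
    duplicates report ID_S Date
    * flag every observation that has at least one twin
    duplicates tag ID_S Date, generate(dup)
    list if dup > 0, clean noobs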


  • #2
    Well if your "technical" duplicates are really completely duplicated (i.e. all of the variables contain exactly the same values) you can eliminate them with
    Code:
    duplicates drop
    But, before you do that, I would suggest you go back and review the data management process that led you to this point. -merge- does not create duplicate observations for no reason. Does your example data include all of the variables in the real data set? If not, it is likely that these "technical duplicates" actually contain different values for some variables, and then you cannot proceed to -reshape- the data until you resolve those conflicts (or change the -reshape- command's -i()- and -j()- specification by adding a new variable that reliably distinguishes the "technical duplicates" from each other).
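
    As a sketch of that workaround (the variable name seq is illustrative; the exact -reshape- call depends on your data):

    Code:
    * number the observations within each ID_S-Date group
    bysort ID_S Date: gen seq = _n
    * seq can then be added to -reshape-'s i() option to
    * distinguish the "technical duplicates" from each other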

    Moreover, even if the observations really are complete duplicates, -merge-ing does not create true duplicate observations out of thin air. The fact that they are there, but you don't expect them, suggests to me either that your expectations are incorrect and you are asking -reshape- to do the impossible, or something went wrong in the data management that led up to this point. If the latter, then there may be other errors as well that you should correct before moving ahead.

    Finally, why do you want to -reshape- this data? It is already in a nice long layout for panel data analysis (except for the duplication of ID-time combinations). If you go to wide layout you will just make further analysis more difficult (or even impossible), as most Stata commands are optimized for use with long data. The wide layout is really useful only for creating visual displays of the data, certain graphing commands, and a handful of analysis commands. So if you are tempted to go to wide layout just because you are accustomed to seeing data that way in spreadsheets, think a bit more carefully: Stata is not a spreadsheet, and wide data is more cumbersome to work with in Stata. (That said, if there should be no duplications of ID and time in your data at this point, you still need to fix that even if you don't have -reshape- nagging you about it.)



    • #3
      First, the obligatory warning. The experienced users here generally agree that, with few exceptions, Stata makes it much more straightforward to accomplish complex analyses using a long layout of your data rather than a wide layout of the same data. The problem you have encountered is one example of that. So think hard about why you want a wide layout.

      Next, a problem. You don't tell us how you were trying to reshape your data; in particular, you don't show us the command you ran. That makes it difficult to know if the solution I recommend will be helpful in your case. Do take a few moments to review the Statalist FAQ linked to from the top of the page. Note especially sections 9-12 on how to best pose your question. The more you help others understand your problem, the more likely others are to be able to help you solve your problem.

      With that said, since you describe your duplicated data as additive values, perhaps collapsing the duplicated observations, adding up the values to create daily totals, will solve your problem. Clyde's post assumes that, because the value of Equity_amount is identical on both observations, they represent true duplicates. That is what I would expect as well, but since I cannot fully understand what you are trying to accomplish, I will take your word that the two observations for U295141 with an equity amount of 163.607 represent two separate but identical transactions that can be combined into a single transaction for the day.
      Code:
      * Example generated by -dataex-. To install: ssc install dataex
      clear
      input int Date double Debt_amount str7 ID_S double(Equity_amount Repurchase_amount) str9 Status
      16821       . "U295141" 163.607      . ""        
      16821       . "U295141" 163.607      . ""        
      16824       . "U904391"       . 98.269 "Completed"
      16824       . "U30709W"   16.08      . ""        
      16825       . "F922029" 738.082      . ""        
      16826 605.801 "D882362"       .      . ""        
      end
      format %td Date
      
      collapse (sum) Debt_amount Equity_amount Repurchase_amount (firstnm) Status, by(ID_S Date)
      list, clean abbreviate(20) noobs
      Code:
      . list, clean abbreviate(20) noobs
      
               Date      ID_S   Debt_amount   Equity_amount   Repurchase_amount      Status  
          25jan2006   D882362       605.801               0                   0              
          24jan2006   F922029             0         738.082                   0              
          20jan2006   U295141             0         327.214                   0              
          23jan2006   U30709W             0           16.08                   0              
          23jan2006   U904391             0               0              98.269   Completed
      Last edited by William Lisowski; 04 Mar 2019, 13:48.



      • #4
        Thanks William, you are right. The best way is certainly to recheck whether these are duplicates or just 2 debt issues on one day (my assumption and the reason for this question). Thanks for the code, which is very helpful to me.

        And both of you are right that the question is not really specific and leaves a lot of room for interpretation.

        Also @Clyde, the reason for reshaping to wide is another variable called Datatypes (with its values in Q). These are generally balance sheet values that I want to reshape in order to run a panel regression (something I have never done). The example below shows only a small subset of the data, but the most relevant variables for the reshaping process.



        Code:
        * Example generated by -dataex-. To install: ssc install dataex
        clear
        input int Date str7 ID_S str62 name str61 Description str9 Datatypes str18 Q double(Equity_amount Debt_amount Repurchase_amount)
        17776 "U905686" "RENEW HOLDINGS - TOT RETURN IND"                          "RENEW HOLDINGS PLC - TOTAL ASSETS (U.S.$)"   "WC07230A"  ""       . . .
        19340 "U953535" "PARITY GROUP - TOT RETURN IND"                            "PARITY GROUP PLC - NET DEBT"                 "WC18199A"  "5420"   . . .
        16653 "U946023" "FAUPEL DEAD - T/O BY 946858 - TOT RETURN IND"             ""                                            ""          ""       . . .
        14693 "U943663" "SERCO GROUP - TOT RETURN IND"                             ""                                            ""          ""       . . .
        15078 "U974585" "NORTHAMBER - TOT RETURN IND"                              ""                                            ""          ""       . . .
        15984 "F997836" "VICAT - TOT RETURN IND"                                   ""                                            ""          ""       . . .
        15510 "U882671" "CHARTERHOUSE COMMS. DEAD - 21/04/08 - TOT RETURN IND"     ""                                            ""          ""       . . .
        20443 "U991575" "M P EVANS GROUP - TOT RETURN IND"                         "M.P. EVANS - ENTERPRISE VALUE"               "WC18100A"  "224064" . . .
        16394 "U904649" "SAGE GROUP - TOT RETURN IND"                              ""                                            ""          ""       . . .
        16581 "U906124" "PITTARD - TOT RETURN IND"                                 ""                                            ""          ""       . . .
        15320 "U910394" "HAZLEWOOD FOODS DEAD - DEAD 15/02/01 - TOT RETURN IND"    ""                                            ""          ""       . . .
        16615 "U135511" "DOMNICK HUNTER DEAD - T/O BY 905150 - TOT RETURN IND"     ""                                            ""          ""       . . .
        16013 "U910450" "LEEDS GROUP - TOT RETURN IND"                             ""                                            ""          ""       . . .
        18228 "D695996" "BASLER - TOT RETURN IND"                                  "BASLER AG - EBIT & DEPRECIATION"             "WC18198A," "647"    . . .
        15613 "U907481" "WIRELESS GROUP DEAD - 24/10/16 - TOT RETURN IND"          ""                                            ""          ""       . . .
        14668 "U681943" "SPS.& OUTDR.MEDIA INTL. DEAD - 16-06-00 - TOT RETURN IND" ""                                            ""          ""       . . .
        15679 "U943674" "PRIME PEOPLE - TOT RETURN IND"                            ""                                            ""          ""       . . .
        20066 "U943417" "TITON HOLDINGS - TOT RETURN IND"                          "TITON HOLDINGS PLC - RESEARCH & DEVELOPMENT" "WC01201A"  ""       . . .
        15935 "U882297" "MEARS GROUP - TOT RETURN IND"                             ""                                            ""          ""       . . .
        20425 "U991575" "M P EVANS GROUP - TOT RETURN IND"                         "M.P. EVANS - ENTERPRISE VALUE"               "WC18100A"  "224064" . . .
        end
        format %td Date

        I tried the code

        Code:
        reshape wide Datatypes, i(ID_S) j(Date)

        which produced the error telling me about the "duplicates" (caused by some double values for debt or equity on the same date for the same company). This could well be true, as a company is not restricted to one debt issue a day.

        If you have an idea on how to run the panel regression using the Datatypes, it would be very helpful (just in principle). I want the panel to be semiannual, while the dates are daily.
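
        One possible route (a sketch only; it assumes Q holds numeric values as strings, that averaging within a half-year is appropriate, and that the Datatypes codes are cleaned so they can serve as variable-name suffixes):

        Code:
        * drop rows with no datatype code, and remove stray characters
        drop if Datatypes == ""
        replace Datatypes = subinstr(Datatypes, ",", "", .)
        * convert Q from string to numeric
        destring Q, generate(Qnum)
        * half-yearly period from the daily date
        gen half = hofd(Date)
        format half %th
        * one value per firm, half-year, and datatype (mean is an assumption)
        collapse (mean) Qnum, by(ID_S half Datatypes)
        * spread the datatype codes into columns: QnumWC18100A, QnumWC18199A, ...
        reshape wide Qnum, i(ID_S half) j(Datatypes) string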



        • #5
          Perhaps I am not understanding what you want to do, but panel regression in Stata can only be done with long data, not wide.

          One other thing you will need to do to prepare for panel data regression is create a numeric panel identifier. Your variable ID_S is a string. If there are fewer than 65,000 distinct values of ID_S, you can do that with the -encode- command (-help encode-). If more, then you will need to use -egen numeric_panel_id = group(ID_S), label-.
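
          For example (a sketch; panel_id is an illustrative name, and the time variable you -xtset- depends on how you aggregate the daily dates):

          Code:
          * numeric panel identifier from the string ID_S
          encode ID_S, generate(panel_id)
          * with more than about 65,000 distinct values, use instead:
          * egen long panel_id = group(ID_S), label
          * then declare the panel structure
          xtset panel_id Date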



          • #6
            Clyde: sorry for the confusion, I will provide a better explanation in future posts.

            Nevertheless, just look at the last dataex extract above that I posted. You see the WC18100A or the WC18199A. These must (amongst others) be used as VARIABLES in the regression, with the value given by Q. The data must therefore be transformed to wide, so that each ID_S has its own variables WC18100A and so on. Of course, each of these variables in its own right needs to be long for the regression to work.

            Thanks for the remark on the string. Very helpful.



            • #7
              Nevertheless, just look at the last dataex extract above that I posted. You see the WC18100A or the WC18199A. These must (amongst others) be used as VARIABLES in the regression, with the value given by Q. The data must therefore be transformed to wide, so that each ID_S has its own variables WC18100A and so on. Of course, each of these variables in its own right needs to be long for the regression to work.
              Yes, you are right. I had not understood that this was what you were doing.



              • #8
                Thanks for the great support! You cannot imagine, how much I learn through this forum and how much I appreciate the help of people like you, Clyde.

                By the way, I was working around the reshape problem using the quarterly database from which I want to make the panel regression later on:

                https://www.statalist.org/forums/for...atastream-data


                The duplicates (from the issue yesterday) were indeed "double values", as they referred to companies raising money in SEVERAL markets on the same date. Another reminder to me that the data always needs to be understood and managed accordingly, so that no "garbage in, garbage out" problem can arise.

