
  • Comparing two different datasets

    Hi Stata users,

    I have two different datasets for two different groups with mostly the same information.
    This is the relevant data for both groups.

    Group 1 (Companies that have been taken over)
    Code:
    input int YearMatch str30 CompanyName str4 SIC double AT str8 Attitude double(CSHO DT OIBDP SALES PRCC_F)
    2000 "Ultramar Diamond Shamrock Corp" "2911"   5988.4 "Friendly"  83.352   1661.5   1066.4  14396.6  30.875
    2003 "META Group Inc"                 "8732"   86.712 "Friendly"  13.657     .389    1.348  122.485     6.3
    1997 "Hein-Werner Corp"               "3559"   37.348 "Friendly"   2.771        .    2.646   39.037   7.312
    2001 "Budget Group Inc"               "7514" 4469.505 "Friendly"  37.255  3292.08  712.693  2160.73     .89
    2002 "SangStat Medical Corp"          "2834"  192.437 "Friendly"  26.443   20.269   11.114  120.057    11.3
    1995 "Meta Software Inc"              "7372"   35.095 "Friendly"     9.8        .    6.514   25.281   16.75
    1996 "Studio Plus Hotels Inc"         "7011"   146.24 "Friendly"  12.529        .    9.612   23.065   15.75
    1995 "Research Medical Inc"           "3841"   46.545 "Friendly"   9.408        .   12.061   34.024      23
    2003 "Commonwealth Industries Inc"    "3353"  379.326 "Friendly"  16.011  125.205    28.25  918.396   10.07
    1997 "Sofamor Danek Group Inc"        "3842"  385.657 "Friendly"  25.182        .  106.302  312.902   65.25
    2002 "Corvas International Inc"       "2834"   96.593 "Friendly"  27.591   12.558  -20.794     .142    1.55
    1997 "Mecklermedia Corp"              "2721"   51.781 "Friendly"   8.517        .    3.468   55.193  23.187
    1997 "Learning Co Inc"                "7372"  416.791 "Friendly"  48.869        . -330.466  392.438  16.062
    2000 "C-bridge Internet Solutions"    "7372"    98.46 "Friendly"  21.175     .265   -4.451    82.97  3.9063
    2004 "Micro Therapeutics Inc"         "3841"   61.238 "Friendly"  48.425        0  -17.601   35.725    3.75
    1997 "Whitehall Corp"                 "3721"   48.599 "Friendly"    5.53        .   -8.654   65.791      18
    2006 "C-COR Inc"                      "3663"  313.129 "Friendly"  48.008   36.265   -8.251  262.526    7.72
    2003 "TippingPoint Technologies Inc"  "3577"   43.373 "Friendly"   7.336    1.003  -14.921    5.771   24.02
    2005 "Manugistics Group Inc"          "7372"  413.311 "Friendly"  84.161  178.076   14.199  176.192    1.86
    2000 "Louis Dreyfus Natural Gas"      "1311" 1501.965 "Friendly"  43.689  606.909   352.99  489.703 45.8125
    1995 "Scanforms Inc"                  "2761"   16.693 "Friendly"   3.547        .    4.068   24.519     2.5
    2003 "Lannett Co Inc"                 "2834"   29.063 "Friendly"  20.026    3.098   20.261   42.487
    Group 2 (Benchmark: companies that have not been taken over)
    Code:
     
    input int DataYearFiscal str30 CompanyName int SIC double(AT CSHO DT OIBDP SALES PRCC_F)
    1992 "AAR CORP"                    5080  365.151  15.901       .  23.251   382.78   13.5
    1993 "AAR CORP"                    5080  417.626  15.906       .  29.106  407.754 14.375
    1994 "AAR CORP"                    5080  425.814  15.961       .  34.766  451.395  15.25
    1995 "AAR CORP"                    5080  437.846  15.998       .  42.557   504.99 22.125
    1996 "AAR CORP"                    5080  529.584  18.204       .  55.177  589.328     31
    1999 "AAR CORP"                    5080  740.998  26.865 180.876  89.031 1024.333 13.875
    2002 "AAR CORP"                    5080  686.621  31.851 232.914  30.745  606.337    4.5
    2003 "AAR CORP"                    5080  709.292  32.245 250.162  47.491  651.958   9.58
    2004 "AAR CORP"                    5080   732.23  32.586 229.494  61.774  747.848  16.04
    2005 "AAR CORP"                    5080  978.819  36.654 320.704  91.877  897.284  24.08
    1992 "ABS INDUSTRIES INC"          3460   41.976   2.526       .   7.907   51.407  14.25
    1993 "ABS INDUSTRIES INC"          3460   63.997   5.052       .  10.064    72.36  12.75
    1994 "ABS INDUSTRIES INC"          3460   93.811   5.052       .  13.001   92.122  13.75
    1992 "ACF INDUSTRIES HOLDING CORP" 3743 1706.454    .015       . 173.367  320.148      .
    1993 "ACF INDUSTRIES HOLDING CORP" 3743 1665.921    .015       . 150.525  348.429      .
    1994 "ACF INDUSTRIES HOLDING CORP" 3743   1815.1    .015       .     137    380.5      .
    1995 "ACF INDUSTRIES HOLDING CORP" 3743   2015.8    .015       .     157    407.2      .
    1996 "ACF INDUSTRIES HOLDING CORP" 3743   2218.6    .015       .   183.4    486.8      .
    1997 "ACF INDUSTRIES HOLDING CORP" 3743   3181.3    .015       .   176.4    501.9      .
    1998 "ACF INDUSTRIES HOLDING CORP" 3743   3257.3    .015       .   188.6    536.5      .
    1999 "ACF INDUSTRIES HOLDING CORP" 3743   3563.4    .015  1977.1     186    472.4      .
    2000 "ACF INDUSTRIES HOLDING CORP" 3743   3794.5    .015  2042.9   178.7    443.8      .
    I would like to know whether the variable OIBDP (firm performance) is significantly different between the two groups. I would also like to pair the observations of the two groups based on Year, SIC, and a range of 20% of AT.

    I was thinking of doing the Wilcoxon signed-rank test. However, from searching the internet on how to do this, I got the impression that I have to put the two datasets into one dataset.

    Question 1: Is there a way to compare two different datasets, or should I really combine the two datasets using "append" or "merge"?

    Question 2: If I have to combine both datasets with append, I was thinking of making a dummy that is set to 1 if the observation belongs to group 1 and 0 if it belongs to group 2. But do I have to give the variables exactly the same names in both datasets, and put them in the same order, in order to append? And how can I change the names and order of the variables in dataset 2 while working in the same do-file as dataset 1, and afterwards go back to dataset 1 to append?
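    To illustrate what I mean in Question 2, here is a sketch of the append-with-dummy workflow I have in mind (the dataset names are placeholders, and I am not sure this is the right approach):

    Code:
    * load group 2, harmonize the year variable name, flag the group, save temporarily
    use dataset2, clear
    rename DataYearFiscal YearMatch
    generate byte takeover = 0
    tempfile benchmark
    save `benchmark'

    * load group 1, flag the group, and append the benchmark firms
    use dataset1, clear
    generate byte takeover = 1
    append using `benchmark'
    (As far as I understand, -append- matches variables by name rather than by position, so the ordering of the variables should not matter, but the names and storage types of shared variables do have to agree.)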

    Question 3: If I have to combine both datasets with merge, I should name the variable OIBDP differently for the two groups, right?

    Question 4: How do I match the observations of both groups based on Year, SIC, and a range of 20% of AT, while doing the Wilcoxon signed-rank test?


    Thank you for helping in advance!


  • #2
    Let's start with Question 4, because I think the answer to it will make questions 1 through 3 irrelevant.

    This is a job for -rangejoin-. Written by Robert Picard, it is available from SSC. I believe you will also have to install -rangestat- at the same time in order for -rangejoin- to be operable. (-rangestat- is written by Robert Picard, Nick Cox, and Roberto Ferrer, and is also available on SSC).
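    For reference, both packages are installed from SSC in the usual way:

    Code:
    ssc install rangejoin
    ssc install rangestat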

    So you would put the data sets together something like this:

    Code:
    use dataset1, clear
    gen lower = AT*0.8                    // lower bound: 80% of AT
    gen upper = AT*1.2                    // upper bound: 120% of AT
    clonevar DataYearFiscal = YearMatch   // align year variable name with dataset2
    rangejoin AT lower upper using dataset2, by(SIC DataYearFiscal)
    (There is no variable called Year in either data set. I'm assuming that you mean you want to match YearMatch in dataset1 with DataYearFiscal in dataset2. If that is not the intent, then modify the code to deal with whatever variable(s) provide the information about the Year you are interested in.)
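    One other caveat, based on the example data shown: SIC is a string (str4) in the first data set but numeric (int) in the second, and the -by()- variables must have the same storage type in both files for the join to work. If your real files match the examples, convert one of them first, for instance:

    Code:
    destring SIC, replace
    in the first data set (or -tostring SIC, replace- in the second) before running -rangejoin-.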

    Now, this will leave you with a data set that matches each observation in dataset1 with every observation in dataset2 that has the same values of SIC and "Year" and where the AT value is between 80% and 120% of the value in dataset1. You may not want to keep all of these pairings. Sometimes people want to match just 1 observation from dataset2 to each observation in dataset1 (or n such observations for some pre-specified n). In that case you will need to write additional code to select which of the pairings you want to keep, implementing whatever criteria you feel are applicable.

    In fact, if your purpose is to test equality of OIBDP, then you must reduce this to one pair for each observation in dataset1, because the signed-rank test requires that each observation be independent of the others. Selecting one pair for each at random might be a good way to do this, or you might select the one that is the closest match on some other variable. That's a substantive issue you need to decide.

    One of the things -rangejoin- does is attach a _U suffix to each variable in dataset2 that shares a name with a variable in dataset1. (If you don't like a _U suffix for this purpose you can use the -prefix()- or -suffix()- options to get whatever modification you like.)





    • #3
      Dear Clyde,
      Thank you for responding!
      Unfortunately, even after installing rangestat, Stata does not recognize "rangejoin":
      Code:
       ***Combining dataset targets to peers
      . ssc install rangestat
      checking rangestat consistency and verifying not already installed...
      all files already exist and are up to date.
      
      . rangejoin AT lower upper using "C:\Users\LSand\Desktop\Datasets Thesis\Datasets Subquestion 1\Pe
      > ers Compustat Q1.dta", by(sic DataYearFiscal)
      command rangejoin is unrecognized
      r(199);
      And if this works, how should I select one pair for each at random? I was thinking of dropping duplicates of CompanyName; or is there a better command?



      • #4
        Sorry Clyde, it does indeed work; I forgot to install rangejoin as well! Thank you!



        • #5
          how should I select one pair for each at random?
          So, you need an identifier for each observation in dataset1. If there is already such a variable (or group of variables) use that. If not, then you need to create one before you do the -rangejoin-. Let's assume that unique identifier is called unique_identifier.

          Code:
          gen double shuffle1 = runiform()   // random sort key
          gen double shuffle2 = runiform()   // second random key to break any ties
          by unique_identifier (shuffle1 shuffle2), sort: keep if _n == 1   // keep one pairing at random
          This will leave you with a 1:1 matched set (unless there are observations for which no match could be found).
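          Once you have the 1:1 matched pairs in a single data set, the paired Wilcoxon signed-rank test asked about in #1 can then be run directly. Assuming dataset2's variables carry -rangejoin-'s default _U suffix, it would be something like:

          Code:
          signrank OIBDP = OIBDP_U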

