  • Unique String Values Across Variables and Duplicate Observations for an ID Variable

    Hi! I'm trying to count the unique number of string values across the variables mainx and x1-x9. My dataset has an ID variable, and there are often duplicates of ID. Across each ID, I want to know how many unique values there are for mainx and x1-x9. Here's an example of some fake data. In this example, I would want unique to equal 3, because there are 3 unique values of x across the ID. The dataset is very large, and there are up to 100 duplicates for each ID, so looping through each observation isn't a great option. However, I could expand the data so that a duplicate is created where mainx equals each non-missing x1-x9 for that ID, and then get the unique values for mainx across ID. Perhaps there's a more efficient way, though.

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str4 id str3(mainx x1 x2)
    "2019" "ABC" "DEF" "GHI"
    "2019" "ABC" "."   "."  
    "2019" "DEF" "GHI" "ABC"
    end

  • #2
    Your task is complicated by the awkward naming of your x variables, and by your use of "." as a code for something you don't want to count, instead of the empty string, which is Stata's missing value for strings. So the first task is to clear those obstacles away. After that, it's not terribly hard.

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str4 id str3(mainx x1 x2)
    "2019" "ABC" "DEF" "GHI"
    "2019" "ABC" "."   "."  
    "2019" "DEF" "GHI" "ABC"
    end
    
    rename mainx x0
    
    gen long obs_no = _n
    reshape long x, i(obs_no) j(_j)
    replace x = "" if x == "."
    by id x, sort: gen x_count = (_n == 1 & !missing(x))
    by id (x): replace x_count = sum(x_count)
    by id: replace x_count = x_count[_N]
    // replace x = "." if x == "" // OPTIONAL
    reshape wide
    // rename x0 xmain // OPTIONAL
    If you are attached to your original variable names and coding, you can uncomment the two optional commands.

    You also might want to consider skipping the final -reshape wide- and staying in long layout. Not only will that save time (-reshape- is slow in large data sets), it will also leave your data set in a layout that is more amenable to most things you might do in subsequent analysis.
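    A condensed alternative, assuming the -nvals()- function from the community-contributed -egenmore- package (SSC) is available, would replace the tag-and-sum steps with a single -egen- call; do check how -nvals()- treats missing values on your own data before relying on it:

    Code:
    ssc install egenmore             // community-contributed; provides nvals()
    rename mainx x0
    gen long obs_no = _n
    reshape long x, i(obs_no) j(_j)
    replace x = "" if x == "."       // recode "." to Stata's string missing
    egen x_count = nvals(x), by(id)  // distinct values of x within id; verify handling of ""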



    • #3
      Your example raises various questions, such as

      1. an identifier that isn't

      2. use of "." to mean missing, which is easy to work with but isn't Stata standard

      I recommend the term distinct over unique, since unique still carries the primary meaning of occurring precisely once. For more on that, see https://www.stata-journal.com/sjpdf....iclenum=dm0042
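      The -distinct- command itself (community-contributed; -ssc install distinct-) reports the number of distinct values of a variable. As a sketch, once the data are reshaped long and the "." codes dropped as in the code below, something like

      Code:
      ssc install distinct
      distinct x

      should report 3 distinct values for this one-id example; with several ids you would want a per-id count, as computed below.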

      This works with your example:

      Code:
      . * Example generated by -dataex-. To install: ssc install dataex
      . clear
      
      . input str4 id str3(mainx x1 x2)
      
                  id      mainx         x1         x2
        1. "2019" "ABC" "DEF" "GHI"
        2. "2019" "ABC" "."   "."  
        3. "2019" "DEF" "GHI" "ABC"
        4. end
      
      . save whateveryouwant, replace  
      file whateveryouwant.dta saved
      
      . gen long obs = _n
      
      . rename mainx x0
      
      . reshape long x , i(obs)
      (note: j = 0 1 2)
      
      Data                               wide   ->   long
      -----------------------------------------------------------------------------
      Number of obs.                        3   ->       9
      Number of variables                   5   ->       4
      j variable (3 values)                     ->   _j
      xij variables:
                                     x0 x1 x2   ->   x
      -----------------------------------------------------------------------------
      
      . drop if x == "."
      (2 observations deleted)
      
      . keep id x
      
      . duplicates drop
      
      Duplicates in terms of all variables
      
      (4 observations deleted)
      
      . bysort id : gen wanted = _N
      
      . collapse wanted, by(id)
      .
      . merge 1:m id using whateveryouwant
      
          Result                           # of obs.
          -----------------------------------------
          not matched                             0
          matched                                 3  (_merge==3)
          -----------------------------------------
      
      .
      . list
      
           +-------------------------------------------------+
           |   id   wanted   mainx    x1    x2        _merge |
           |-------------------------------------------------|
        1. | 2019        3     ABC   DEF   GHI   matched (3) |
        2. | 2019        3     ABC     .     .   matched (3) |
        3. | 2019        3     DEF   GHI   ABC   matched (3) |
           +-------------------------------------------------+
      EDIT: Crossed with Clyde's #2. Whatever we tell you twice may be right.



      • #4
        Hi everyone, I have a similar case. I need to merge the dataset below with another using directorid and companyid, or an egen function of these variables, but directorid and companyid are often duplicated in the dataset. I tried the command reshape wide directorid companyid qualification awarddate, i(obs) j(id_director), but Stata returns "variable id_director takes on too many values".
        id_director is the result of the command egen id_director = group(directorid companyid qualification), which I used to try to reduce those "many values".

        Can someone help me reshape and then collapse this dataset to enable the merge?

        Code:
        * Example generated by -dataex-. To install: ssc install dataex
        clear
        input str244 directorname str128 companyname str244(qualification fulltextdescription) double(directorid companyid) long(awarddate obs) float id_director
        "Rose Araos"                         "St Columban's College"                                                  "MBA"                                              "" 2088110 2423889     . 1220393 1217881
        "Marc Higgins"                       "St Columban's College"                                                  "Certified"                                        "" 1356726 2423889  9497  570125  568957
        "Cameron Daniel"                     "St Columban's College"                                                  "Graduated"                                        "" 1682772 2423889 15706  820975  819313
        "Jan Stejskal"                       "Česká Advokátní Komora (CAK) (Czech Bar Association)"                   "Certified"                                        "" 1954435 1093660 19571 1140518 1138193
        "Doctor Josef Berger"                "Český institut pro marketing (CIMA) os (Czech Institute of Marketing)"  "MS"                                               ""  446996 2420098  1096  120316  120131
        "Tunç Akyurt"                        "İstanbul Erkek Lisesi (IEL)"                                            "Graduated"                                        "" 1087546 2000613 10593  410199  409391
        "Professor Doctor Metin Cakmakci"    "İstanbul Erkek Lisesi (IEL)"                                            "Diploma"                                          "" 1646190 2000613  5114  776871  775297
        "Stéphane Taïeb"                     "'Università di Parigi (University of Paris XII)"                        "Masters Degree"                                   "" 2184815 2786875 12784 1317403 1314614
        "Professor Doctor Philippe Leboulch" "'Università di Parigi (University of Paris XII)"                        "MD"                                               "" 1248604 2786875     .  463063  462163
        "Stéphane Taïeb"                     "'Università di Parigi (University of Paris XII)"                        "DESS (Diplôme d'études supérieures spécialisées)" "" 2184815 2786875 13149 1317402 1314613
        "Kristine Youman"                    "17th Judicial Circuit Court of Winnebago County"                         "Certified Mediator"                               "" 2134985 2443245     . 1269092 1266469
        "Corbin Rowley"                      "1st Marine Division Association"                                         "Certified"                                        "" 2196299 1373607     . 1326921 1324094
        "David Ronn"                         "20/20 Leadership"                                                        "Graduated"                                        "" 1611422 2405953     .  720873  719382
        "Lisa Parker"                        "21/64"                                                                   "Certified"                                        "" 1361820 2030042     .  573133  571957
        "Sean Lentner"                       "2X Software"                                                             "Certified"                                        "" 1991196 2148861 18263 1176503 1174087
        "Doctor Fang Xiangming"              "2nd Affiliated Hospital of Zhejiang University School of Medicine (The)" "MD"                                               "" 1758128 2024413  8036  912507  910672
        "Doctor Xiaohua Xin"                 "2nd Affiliated Hospital of Zhejiang University School of Medicine (The)" "MD"                                               "" 2150407 2024413     . 1284396 1281700
        "Doctor Xiaohua Xin"                 "2nd Affiliated Hospital of Zhejiang University School of Medicine (The)" "MS"                                               "" 2150407 2024413     . 1284397 1281701
        "Doctor John Jiang"                  "304th Hospital"                                                          "Postdoctoral Fellow"                              ""  761046 1231181     .  261679  261258
        "R Akhter"                           "3COM CORP (De-listed 04/2010)"                                           "Certified"                                        "" 1255603     275     .  469951  469041
        end
        format %d awarddate
        Att.,
        Lucas



        • #5
          I'd encourage you to first clarify for us and for yourself what you want to happen here in ordinary language, not Stata-ese. What Stata commands are relevant depends on the conceptual and actual structure of your data, and what your purposes are. And, you have not told us anything about the structure of your second file, and what its intended relation is to your first file.

          My guesses about your situation would be:

          1) You have files in which a combination of directorid and companyid identifies observations. That is, data was recorded for some sample of directors when they were working at each of several different companies for which they worked. (If directors were studied at only one company at which they worked, then the situation is simpler, i.e., directorid identifies observations.) Note that, on this assumption, creating id_director by including educational qualification does not do anything for you, as I suspect that an individual director does not have more than one educational qualification--it's possible, but seems unlikely.

          2) Your second data file contains other variables about these directors when they were at particular companies, and you want to add those other variables onto each observation in your original file.

          3) You want to in some way analyze outcomes for particular companies, and those outcomes might depend on who was the director. Or, you want to analyze outcomes for directors, with those outcomes possibly depending on the company at which they were working.

          General comment:

          There might be reasons why a -reshape- would be relevant, but I can't see anything in what you have presented that gives me a definite idea in that direction. It appears that you have repeated observations on directors, which would indicate a so-called long shape, but that is a separate issue from the merge. What kind of shape you want would depend on what kinds of questions you want to address with your data.



          • #6
            Hi Mike, I will explain. I'm working on six dataset extracts from BoardEx (WRDS). These datasets contain detailed information about directors and executives of several US companies, such as directorname, directorid, nationality, salary, bonus, rolename, boardname, boardrole, etc.

            I'm trying to merge these datasets with a unique ID, but directorid and companyid (common to these datasets) are often duplicated, which prevents me from executing a 1:1 or 1:m merge. How can I do these merges?



            • #7
              Telling us that your data came from BoardEx doesn't happen to help you very much here, as even among the fraction of Statalist participants who are economists or financial analysts, only a fraction know that data set. Your explanation therefore only speaks to a small fraction of the people who could help you. What matters is what your unit of analysis is and what the structure of your data is. It appears that your unit of analysis is the individual director, but these directors and their outcomes are sometimes observed for multiple companies (OK), and these director/company ids are sometimes observed in multiple contexts (which is also OK, but what distinguishes those multiple observations: years? roles? just plain mistakes? ...). You know the answer to this, but most of us here who could help you don't.

              That being said: The crux of the matter here is what combination of variables identifies your observations in your files. Your example shows multiple observations in your master file for some combinations of directorid and companyid. That suggests that your master file will have to be the "m" in m:1. This presumes that your using file has only one observation for each combination of directorid and companyid. If that's true, then what you would need is simply something like:
              Code:
              merge 1:1 directorid companyid using "YourUsingFile.dta" ....
              However, if there are multiple observations in your using file for each such combination, you would need to explain to us how it is that you would want those observations to be matched up with observations in your master file. There might be some other variable (e.g., year) that distinguishes those multiples in one or both files and perhaps that variable needs to be one of the keys. Or, it might be that you want to form observations for all possible matching pairs from the master and the using file, which requires something other than a merge. You might be tempted to think that an m:m merge is what you want, but that's extremely unlikely.
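              For the all-possible-matching-pairs case, the relevant command is -joinby-, which forms one observation for every pairing of a master observation with a using observation that share the same key values. A sketch, with the file name as a placeholder:

              Code:
              joinby directorid companyid using "YourUsingFile.dta"

              By default -joinby- drops observations whose key values appear in only one file; see its -unmatched()- option if you need to keep them.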



              • #8
                Correcting a typographical error, the preceding code should contain m:1, not 1:1:
                Code:
                merge m:1 directorid companyid using "YourUsingFile.dta" ....
