Help with counting unique combinations of variables

Reeves Coursey

Join Date: May 2019

Posts: 6
#1

Help with counting unique combinations of variables

23 Jul 2019, 15:28

Hi,

I have a dataset that looks like this. The real data set has 100,000 observations and around 4000 firms and 800 job categories.

Firm JobCategory year city
A 1 2019 NYC
A 1 2019 BOS
A 1 2019 ATL
A 2 2018 NYC
B 2 2018 NYC
B 3 2019 NYC
C 3 2018 NYC

I need to count how many unique rivals each firm has. Firm X has a rival if another firm has the same job category in the same city. For example, Firm A has 1 rival- Firm B, Firm B has 2 rivals- Firm A,B, and Firm C has 1 rival-Firm B. Is there a way to write code that will give me a variable that contains how many rivals each firm has? I keep double-counting rivals and even counting the own firm as a rival with my code.

Thanks!

Best,
Reeves

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input str1 firm float(JobCategory year) str3 city "A" 1 2019 "NYC" "A" 1 2019 "BOS" "A" 1 2019 "ATl" "A" 2 2018 "NYC" "B" 2 2018 "NYC" "B" 3 2019 "NYC" "C" 3 2018 "NYC" "D" 4 2019 "ATL" "D" 4 2019 "BOS" "D" 1 2018 "SLC" "D" 2 2018 "ATL" "D" 1 2019 "BOS" end
Tags: None
Mike Lacy

Join Date: Apr 2014

Posts: 2404
#2

23 Jul 2019, 17:22

I'm leaving aside consideration of year, the relevance of which you did not describe. I believe the following works for your sample data, although I suppose there might be complexities that occur in the actual data that would cause problems. This approach will be slow and will lead to a very large file if there are a large number of firms in some cities.

Code:

// Reduce to one observation per firm/city/JobCategory duplicates drop firm city JobCategory, force // // Copy the file to use for pairing rename firm firm1 preserve tempfile temp rename firm1 firm2 save `temp' restore // All pairs of firms within city/JobCategory joinby city JobCategory using `temp' drop if firm1 == firm2 // can't be own rival // bysort city JobCategory: gen int NumRivals = _N-1
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35433
#3

23 Jul 2019, 17:56

I don't follow this, or Mike Lacy's approach. First, trivial, but ATl is presumably a typo for ATL.

Second, firms compete in the "same job category in the same city" -- nothing said about years.

The number of rivals is then as I understand it the number of different firms in the same job category in the same city -- in the same year? -- MINUS one, as the number of rivals is just the number of the other firms.

That looks like

Code:

egen tag = tag(firm JobCategory city year) egen competitors = total(tag), by(JobCategory city year)

and subtract 1.

If you want to lump years together, then year does not enter the calculations above.

The data example doesn't seem effective, as each firm is present at most in one job category, city, year combination.

Hobby-horse: Unique means occurring once only. Distinct is far the better word. See e.g. https://www.stata-journal.com/sjpdf....iclenum=dm0042

Last edited by Nick Cox; 23 Jul 2019, 18:38.
Comment
Reeves Coursey

Join Date: May 2019

Posts: 6
#4

23 Jul 2019, 18:04

Hi Mike Lacy,

Thank you so much for your help. This gets me closer than I have been able to get to myself. Below is the output. The code produces the correct statement that firm A has 5 other matches in city 1 and JobCatergory 1, but they should only have 2 rivals, W and L. Also, firm A is shown to have another rival in city 1 and JobCategory 57, but because the rival is again firm W, firm A should only show 2 rivals. I appreciate your help!

Reeves

NumRivals city JobCatergory firm1
5 1 1 A
5 1 1 L
5 1 1 W
5 1 1 A
5 1 1 L
5 1 1 W
1 1 57 W
1 1 57 A
Comment
Mike Lacy

Join Date: Apr 2014

Posts: 2404
#5

23 Jul 2019, 18:30

What Nick Cox suggests is the intelligent way to implement what I was thinking. (When anyone says pairs, I think -joinby-, unnecessarily in this case.)

The disagreement with my numbers that Reeves notes confuses me, as I don't get the same result that he does. Perhaps he was reporting a result from the actual rather than the -dataex- data.

One comment: I do think it is necessary (in the actual data) to remove duplicates of firm/city/JobCategory so as to avoid a firm being treated as its own rival. That problem does not appear in the example data.
Comment
Reeves Coursey

Join Date: May 2019

Posts: 6
#6

23 Jul 2019, 18:38

Hi Nick Cox ,

Thanks for catching the mistakes and thanks a lot for your help. First time posting and I can see how I need to improve future post. Your assumption about year is correct. I have been experimenting with egen tag to solve this problem and I think your method is close. I ran the code and am getting output with different numbers of competitors for the same firm. Below is a sample of the output. Do you know how to fix this issue? Thanks!

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input float rivals str32 firm float(JobCategory city) double year 3 "A" 141 13 2008 2 "A" 142 10 2008 4 "A" 141 10 2009 2 "A" 534 10 2008 1 "A" 435 10 2008 4 "A" 141 10 2008 4 "A" 141 40 2009 4 "A" 141 40 2009 5 "A" 71 10 2009 2 "A" 142 10 2008 4 "A" 141 10 2008 4 "A" 141 10 2008 9 "A" 193 10 2008 9 "A" 71 10 2008 9 "A" 71 10 2008 end

Best,
Reeves
Comment
Reeves Coursey

Join Date: May 2019

Posts: 6
#7

23 Jul 2019, 18:43

Hi Mike Lacy

Your code worked perfectly for the sample data. Thanks! The output I responded to your code with was from the actual dataset and that's where I ran into issues like you suspected there would be. Going to experiment with a combination of your methods and Nick's. Thanks!

Reeves
Comment
Romalpa Akzo

Join Date: Oct 2017

Posts: 369
#8

23 Jul 2019, 19:50

Not like Nick, I do follow the flow of Mike: utilizing -joinby- seems a good choice for the issue. But to my notice, the last line of Mike's code does not give out the counting that Reeves is seeking for.

Below code, a combination of Mike's & Nick's (as mentioned by Reeves in #6) might be a solution, which, while adding the wanted counting, also tries to keep the original data structure.

Code:

tempfile a save `a', replace ren firm competitor gen long original_order = _n joinby JobCategory city using `a' egen tag = tag(firm competitor) if firm != competitor egen Wanted = total(tag), by(firm) bys original_order: keep if _n==1 drop tag competitor original_order
2 likes
Comment
Reeves Coursey

Join Date: May 2019

Posts: 6
#9

24 Jul 2019, 15:04

Hi Romalpa Akzo

Thank you. This worked seamlessly!

Can you think of any way to write code that would list the names of each firm's competitors?

Thanks!

Last edited by Reeves Coursey; 24 Jul 2019, 15:08.
Comment

Romalpa Akzo

Join Date: Oct 2017
Posts: 369

#10

24 Jul 2019, 21:23

Code:

tempfile a
save `a', replace

ren firm competitor
joinby JobCategory city using `a'

bys firm competitor: keep if _n == 1 & firm != competitor

list firm competitor, sepby(firm)

Comment

Reeves Coursey

Join Date: May 2019

Posts: 6
#11

05 Aug 2019, 21:15

Romalpa Akzo Thank you so much!
Comment

Announcement

Help with counting unique combinations of variables

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment