Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Help with counting unique combinations of variables

    Hi,

    I have a dataset that looks like this. The real data set has 100,000 observations and around 4000 firms and 800 job categories.

    Firm JobCategory year city
    A 1 2019 NYC
    A 1 2019 BOS
    A 1 2019 ATL
    A 2 2018 NYC
    B 2 2018 NYC
    B 3 2019 NYC
    C 3 2018 NYC

    I need to count how many unique rivals each firm has. Firm X has a rival if another firm has the same job category in the same city. For example, Firm A has 1 rival- Firm B, Firm B has 2 rivals- Firm A,B, and Firm C has 1 rival-Firm B. Is there a way to write code that will give me a variable that contains how many rivals each firm has? I keep double-counting rivals and even counting the own firm as a rival with my code.

    Thanks!

    Best,
    Reeves


    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str1 firm float(JobCategory year) str3 city
    "A" 1 2019 "NYC"
    "A" 1 2019 "BOS"
    "A" 1 2019 "ATl"
    "A" 2 2018 "NYC" 
    "B" 2 2018 "NYC"
    "B" 3 2019 "NYC" 
    "C" 3 2018 "NYC"
    "D" 4 2019 "ATL"
    "D" 4 2019 "BOS"
    "D" 1 2018 "SLC"
    "D" 2 2018 "ATL"
    "D" 1 2019 "BOS"
    end

  • #2
    I'm leaving aside consideration of year, the relevance of which you did not describe. I believe the following works for your sample data, although I suppose there might be complexities that occur in the actual data that would cause problems. This approach will be slow and will lead to a very large file if there are a large number of firms in some cities.
    Code:
    // Reduce to one observation per firm/city/JobCategory
    duplicates drop firm city JobCategory, force
    //
    // Copy the file to use for pairing
    rename firm firm1
    preserve
    tempfile temp
    rename firm1 firm2
    save `temp'
    restore
    // All pairs of firms within city/JobCategory
    joinby city JobCategory using `temp'
    drop if firm1 == firm2 // can't be own rival
    //
    bysort city JobCategory: gen int NumRivals = _N-1

    Comment


    • #3
      I don't follow this, or Mike Lacy's approach. First, trivial, but ATl is presumably a typo for ATL.

      Second, firms compete in the "same job category in the same city" -- nothing said about years.

      The number of rivals is then as I understand it the number of different firms in the same job category in the same city -- in the same year? -- MINUS one, as the number of rivals is just the number of the other firms.

      That looks like

      Code:
      egen tag = tag(firm JobCategory city year)
      egen competitors = total(tag), by(JobCategory city year)
      and subtract 1.

      If you want to lump years together, then year does not enter the calculations above.

      The data example doesn't seem effective, as each firm is present at most in one job category, city, year combination.

      Hobby-horse: Unique means occurring once only. Distinct is far the better word. See e.g. https://www.stata-journal.com/sjpdf....iclenum=dm0042
      Last edited by Nick Cox; 23 Jul 2019, 18:38.

      Comment


      • #4
        Hi Mike Lacy,

        Thank you so much for your help. This gets me closer than I have been able to get to myself. Below is the output. The code produces the correct statement that firm A has 5 other matches in city 1 and JobCatergory 1, but they should only have 2 rivals, W and L. Also, firm A is shown to have another rival in city 1 and JobCategory 57, but because the rival is again firm W, firm A should only show 2 rivals. I appreciate your help!

        Reeves

        NumRivals city JobCatergory firm1
        5 1 1 A
        5 1 1 L
        5 1 1 W
        5 1 1 A
        5 1 1 L
        5 1 1 W
        1 1 57 W
        1 1 57 A

        Comment


        • #5
          What Nick Cox suggests is the intelligent way to implement what I was thinking. (When anyone says pairs, I think -joinby-, unnecessarily in this case.)

          The disagreement with my numbers that Reeves notes confuses me, as I don't get the same result that he does. Perhaps he was reporting a result from the actual rather than the -dataex- data.

          One comment: I do think it is necessary (in the actual data) to remove duplicates of firm/city/JobCategory so as to avoid a firm being treated as its own rival. That problem does not appear in the example data.

          Comment


          • #6
            Hi Nick Cox ,

            Thanks for catching the mistakes and thanks a lot for your help. First time posting and I can see how I need to improve future post. Your assumption about year is correct. I have been experimenting with egen tag to solve this problem and I think your method is close. I ran the code and am getting output with different numbers of competitors for the same firm. Below is a sample of the output. Do you know how to fix this issue? Thanks!

            Code:
            * Example generated by -dataex-. To install: ssc install dataex
            clear
            input float rivals str32 firm float(JobCategory city) double year
            3 "A" 141 13 2008
            2 "A" 142 10 2008
            4 "A" 141 10 2009
            2 "A" 534 10 2008
            1 "A" 435 10 2008
            4 "A" 141 10 2008
            4 "A" 141 40 2009
            4 "A" 141 40 2009
            5 "A"  71 10 2009
            2 "A" 142 10 2008
            4 "A" 141 10 2008
            4 "A" 141 10 2008
            9 "A" 193 10 2008
            9 "A"  71 10 2008
            9 "A"  71 10 2008
            end
            Best,
            Reeves

            Comment


            • #7
              Hi Mike Lacy

              Your code worked perfectly for the sample data. Thanks! The output I responded to your code with was from the actual dataset and that's where I ran into issues like you suspected there would be. Going to experiment with a combination of your methods and Nick's. Thanks!

              Reeves

              Comment


              • #8
                Not like Nick, I do follow the flow of Mike: utilizing -joinby- seems a good choice for the issue. But to my notice, the last line of Mike's code does not give out the counting that Reeves is seeking for.

                Below code, a combination of Mike's & Nick's (as mentioned by Reeves in #6) might be a solution, which, while adding the wanted counting, also tries to keep the original data structure.
                Code:
                tempfile a
                save `a', replace
                
                ren firm competitor
                gen long original_order = _n
                
                joinby JobCategory city using `a'
                
                egen tag = tag(firm competitor) if firm != competitor
                egen Wanted = total(tag), by(firm)
                
                bys original_order: keep if _n==1
                drop tag competitor original_order

                Comment


                • #9
                  Hi Romalpa Akzo

                  Thank you. This worked seamlessly!

                  Can you think of any way to write code that would list the names of each firm's competitors?

                  Thanks!
                  Last edited by Reeves Coursey; 24 Jul 2019, 15:08.

                  Comment


                  • #10
                    Code:
                    tempfile a
                    save `a', replace
                    
                    ren firm competitor
                    joinby JobCategory city using `a'
                    
                    bys firm competitor: keep if _n == 1 & firm != competitor
                    
                    list firm competitor, sepby(firm)

                    Comment


                    • #11
                      Romalpa Akzo Thank you so much!

                      Comment

                      Working...
                      X