Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Creating Variable based on NAIC Digit Level match for Firms

    I have a list of firms that I am comparing to each other using NAIC (Industry ID - 6 digits). I want to compare the NAIC code of each firm_id to each firm_id2 on each digit of the 6-digit level NAIC code. If the digit is the same, I am saying they are similar at the "X-level."

    With respect to a portion of the code below:
    cross using `temp'

    I get the following error - r(459) "sum of expand values exceed 2,147,483,620. The dataset may not contain more than 2,147,483,620 observations." That is because there are duplicate firms- The data looks at a firm's deal activity by year (1990-2016) so it is possible the firm is listed 26 times (See firm_id=3 is listed 4 times)... there are 373,088 unique firms but a total of 1,210,053 observations. I believe this is where the error is coming from...

    Code:
    // Example data
     clear input firm_id M_Acq_Naic str2 M_Acq_Reg
    1 511210 "JP"
    2 236116 "EU"
    3 451120 "AM"
    3 451120 "AM"
    3 451120 "AM"
    3 451120 "AM"
    4 441110 "AM"
    4 441110 "AM"
    5 811310 "EU"
    6 221119 "EU"
    7 813212 "JP" end
    
    // NAIC codes should be strings, especially for current purposes.
    tostring M_Acq_Naic*, replace
    
    // Make a file to pair with itself.
    preserve
    tempfile temp
    rename * *_2  
    save `temp' restore
    //
    rename * *_1
    cross using `temp'  // the workhorse here
    //
    drop if  (firm_id_1 == firm_id_2) // no self pairs
    
    // Create indicator variables for digit matches on NAIC codes
    forval i = 1/6 {    
    gen PairSameDigit`i' = substr(M_Acq_Naic_1,`i',1) == substr(M_Acq_Naic_2,`i', 1)
    }
    
    // Drop duplicate firm pairs
    gen min = min(firm_id_1, firm_id_2)
    gen max = max(firm_id_1, firm_id_2)
    bysort min max: keep if _n ==1
    drop min max
    Thanks in advance for your help!
    Last edited by Maddie Adelman; 27 Nov 2018, 11:11.

  • #2
    New readers should note that this code was provided by Mike Lacy in a topic mistakenly started on the Mata Forum at

    https://www.statalist.org/forums/for...atch-for-firms

    where the original post #1 contained only the sample data with no mention of the fact that the actual data contains multiple years of data for each firm, which then led to Mike's suggested code copied above.

    It is possible that your problem can be solved changing
    Code:
    cross using `temp'
    to
    Code:
    joinby year using `temp'
    where year is the name of whatever you call the date variable in your dataset. Lacking appropriate sample data, this suggestion is untested. In particular, I think the next-to-final line may need to change to
    Code:
    bysort year min max: keep if _n ==1
    or something similar.

    Comment


    • #3
      My question would be what is your reason for this similarity index? How similar firm 1 is to firm 5 differs to how similar firm 2 is to firm 7. If you have to show all possible combinations, you will have nearly as many variables as observations. If your interest is to simply do a spot check, I would approach the problem differently


      Code:
      *REPRODUCE A DATA SET SIMILAR TO YOURS 
      set obs 373088
      set seed 2018
      gen firm_id= _n
      gen M_Acq_Naic= runiformint(100000,999999)
      
      *GEN INDICATORS FOR PRESENCE OF A DIGIT
      forval i=0/9{
      gen has`i'=.
      }
      
      forval i=0/9{
      replace has`i'= cond(regexm(string(M_Acq_Naic), "`i'"), 1, .)
      }

      Code:
      *SPOTCHECK FIRM 1, 2 &3
      . list if inlist(firm_id, 1,2,3)
      
              +------------------------------------------------------------------------------------------+
              | firm_id   M_Acq_~c   has0   has1   has2   has3   has4   has5   has6   has7   has8   has9 |
              |------------------------------------------------------------------------------------------|
           1. |       1     898540      1      .      .      .      1      1      .      .      1      1 |
           2. |       2     209465      1      .      1      .      1      1      1      .      .      1 |
           3. |       3     401536      1      1      .      1      1      1      1      .      .      . |
              +------------------------------------------------------------------------------------------+
      So here, all have 0, 4 and 5.



      Comment


      • #4
        William Lisowski thank you for providing clarification about the original code and giving credit where credit is due. I will follow your lead in future posts.

        To clarify, the topic I am trying to address is following behavior between firms. If one firm engages in Mergers and Acquisitions in a given year (while also analyzing the number of deals), will another firm follow? I am planning to use the independent variable of similarity in a regression to test whether firm i and firm j are in the same industry. The variables I have are year, firm_id, NAIC code, region, nation, industry concentration level, among others (the dataset has 426 variables total). I have uniquely identified each observation by firm_id, year and the company they targeted for M&A deal activity.

        To determine the same industry (similarity), I am looking at the 6-digit North American Industry Classification System (NAICS) code. Each digit in the code is part of a series of progressively narrower categories. The first two digits of the code refers to the economic sector of a firm, the third digit designates the subsector, the fourth digit refers to the industry group, the fifth digit designates the NAICS industry, and the sixth digit designates the national industry.

        Comment


        • #5
          I am planning to use the independent variable of similarity in a regression to test whether firm i and firm j are in the same industry.
          It sounds to me that you need to create a variable that identifies industries in your data set and then use this to create indicators in your analysis.

          To determine the same industry (similarity), I am looking at the 6-digit North American Industry Classification System (NAICS) code.
          This classification is taken from https://www.census.gov/cgi-bin/sssd/...rch?chart=2017

          2017 NAICS
          The following table provides detailed information on the structure of NAICS.

          Sector Description
          11 Agriculture, Forestry, Fishing and Hunting
          21 Mining, Quarrying, and Oil and Gas Extraction
          22 Utilities
          23 Construction
          31-33 Manufacturing
          42 Wholesale Trade
          44-45 Retail Trade
          48-49 Transportation and Warehousing
          51 Information
          52 Finance and Insurance
          53 Real Estate and Rental and Leasing
          54 Professional, Scientific, and Technical Services
          55 Management of Companies and Enterprises
          56 Administrative and Support and Waste Management and Remediation Services
          61 Educational Services
          62 Health Care and Social Assistance
          71 Arts, Entertainment, and Recreation
          72 Accommodation and Food Services
          81 Other Services (except Public Administration)
          92 Public Administration

          You can create the industry variable as follows

          Code:
          input firm_id M_Acq_Naic str2 M_Acq_Reg
          1 511210 "JP"
          2 236116 "EU"
          3 451120 "AM"
          3 451120 "AM"
          3 451120 "AM"
          3 451120 "AM"
          4 441110 "AM"
          4 441110 "AM"
          5 811310 "EU"
          6 221119 "EU"
          7 813212 "JP" 
          end
          gen sector= real(substr(string(M_Acq_Naic), 1, 2))
          replace sector=31 if inlist(sector, 32, 33)
          replace sector=44 if sector==45
          replace sector=48 if sector==49
          label define sector 11 "Agriculture, Forestry, Fishing and Hunting"///
          21 "Mining, Quarrying, and Oil and Gas Extraction" 22 "Utilities"///
          23 "Construction" 31 "Manufacturing" 42    "Wholesale Trade" 44 "Retail Trade"///
          48 "Transportation and Warehousing" 51 "Information" 52 "Finance and Insurance"///
          53 "Real Estate and Rental and Leasing" 54 "Professional, Scientific, and Technical Services"///
          55 "Management of Companies and Enterprises"///
          56 "Administrative and Support and Waste Management and Remediation Services"///
          61 "Educational Services" 62 "Health Care and Social Assistance"///
          71 "Arts, Entertainment, and Recreation" 72 "Accommodation and Food Services"///
          81 "Other Services (except Public Administration)" 92 "Public Administration"
          label values sector sector
          This results in the following for your data example

          Code:
           list, clean
          
                 firm_id   M_Acq_~c   M_Acq_~g                                          sector  
            1.         1     511210         JP                                     Information  
            2.         2     236116         EU                                    Construction  
            3.         3     451120         AM                                    Retail Trade  
            4.         3     451120         AM                                    Retail Trade  
            5.         3     451120         AM                                    Retail Trade  
            6.         3     451120         AM                                    Retail Trade  
            7.         4     441110         AM                                    Retail Trade  
            8.         4     441110         AM                                    Retail Trade  
            9.         5     811310         EU   Other Services (except Public Administration)  
           10.         6     221119         EU                                       Utilities  
           11.         7     813212         JP   Other Services (except Public Administration)

          Now, provided that you have enough intra-industry variation (or enough firms within industries), you can do analyses at the industry level and also do comparisons of your coefficients across industries.

          Comment


          • #6
            Thank you all for the suggestions! I was able to run my code on small number of observations, but when I try to expand my code to the full set of observations (648, 812) I keep getting an error with my

            Code:
            cross using `temp'
            and my

            Code:
            joinby Year_2 using `temp'
            I get the following error - r(459) "sum of expand values exceed 2,147,483,620. The dataset may not contain more than 2,147,483,620 observations." That is because there are duplicate firms- The data looks at a firm's deal activity by year (1990-2016) so it is possible the firm is listed 26 times... there are 232,064 unique firms.

            Any advice is appreciated! Thanks!

            Comment


            • #7
              If there are N=373,088 distinct firms, then there are N*(N-1) = 139,190,551,776 possible ordered pairs of distinct firms. You are using Stata/SE or Stata/IC, which limits the number of observations in a dataset to 2,147,483,647, so what you hope to accomplish by creating all pairs of distinct firms is not possible for you. While upgrading to Stata/MP would remove that limit, it would require substantial amounts of memory on your system.

              The advice from Andrew Musau in post #5 seems a more appropriate approach to your problem, more feasible and more substantively correct.

              Comment

              Working...
              X