Creating Variable based on NAIC Digit Level match for Firms

Maddie Adelman

Join Date: Aug 2018

Posts: 10
#1

Creating Variable based on NAIC Digit Level match for Firms

27 Nov 2018, 11:08

I have a list of firms that I am comparing to each other using NAIC (Industry ID - 6 digits). I want to compare the NAIC code of each firm_id to each firm_id2 on each digit of the 6-digit level NAIC code. If the digit is the same, I am saying they are similar at the "X-level."

With respect to a portion of the code below:
cross using `temp'

I get the following error - r(459) "sum of expand values exceed 2,147,483,620. The dataset may not contain more than 2,147,483,620 observations." That is because there are duplicate firms- The data looks at a firm's deal activity by year (1990-2016) so it is possible the firm is listed 26 times (See firm_id=3 is listed 4 times)... there are 373,088 unique firms but a total of 1,210,053 observations. I believe this is where the error is coming from...

Code:

// Example data clear input firm_id M_Acq_Naic str2 M_Acq_Reg 1 511210 "JP" 2 236116 "EU" 3 451120 "AM" 3 451120 "AM" 3 451120 "AM" 3 451120 "AM" 4 441110 "AM" 4 441110 "AM" 5 811310 "EU" 6 221119 "EU" 7 813212 "JP" end // NAIC codes should be strings, especially for current purposes. tostring M_Acq_Naic*, replace // Make a file to pair with itself. preserve tempfile temp rename * *_2 save `temp' restore // rename * *_1 cross using `temp' // the workhorse here // drop if (firm_id_1 == firm_id_2) // no self pairs // Create indicator variables for digit matches on NAIC codes forval i = 1/6 { gen PairSameDigit`i' = substr(M_Acq_Naic_1,`i',1) == substr(M_Acq_Naic_2,`i', 1) } // Drop duplicate firm pairs gen min = min(firm_id_1, firm_id_2) gen max = max(firm_id_1, firm_id_2) bysort min max: keep if _n ==1 drop min max

Thanks in advance for your help!

Last edited by Maddie Adelman; 27 Nov 2018, 11:11.
Tags: None
William Lisowski

Join Date: Dec 2014

Posts: 10150
#2

27 Nov 2018, 11:36

New readers should note that this code was provided by Mike Lacy in a topic mistakenly started on the Mata Forum at

https://www.statalist.org/forums/for...atch-for-firms

where the original post #1 contained only the sample data with no mention of the fact that the actual data contains multiple years of data for each firm, which then led to Mike's suggested code copied above.

It is possible that your problem can be solved changing

Code:

cross using `temp'

to

Code:

joinby year using `temp'

where year is the name of whatever you call the date variable in your dataset. Lacking appropriate sample data, this suggestion is untested. In particular, I think the next-to-final line may need to change to

Code:

bysort year min max: keep if _n ==1

or something similar.
Comment

Andrew Musau

Join Date: Oct 2014
Posts: 9774

27 Nov 2018, 12:39

My question would be what is your reason for this similarity index? How similar firm 1 is to firm 5 differs to how similar firm 2 is to firm 7. If you have to show all possible combinations, you will have nearly as many variables as observations. If your interest is to simply do a spot check, I would approach the problem differently

Code:

*REPRODUCE A DATA SET SIMILAR TO YOURS 
set obs 373088
set seed 2018
gen firm_id= _n
gen M_Acq_Naic= runiformint(100000,999999)

*GEN INDICATORS FOR PRESENCE OF A DIGIT
forval i=0/9{
gen has`i'=.
}

forval i=0/9{
replace has`i'= cond(regexm(string(M_Acq_Naic), "`i'"), 1, .)
}

Code:

*SPOTCHECK FIRM 1, 2 &3
. list if inlist(firm_id, 1,2,3)

        +------------------------------------------------------------------------------------------+
        | firm_id   M_Acq_~c   has0   has1   has2   has3   has4   has5   has6   has7   has8   has9 |
        |------------------------------------------------------------------------------------------|
     1. |       1     898540      1      .      .      .      1      1      .      .      1      1 |
     2. |       2     209465      1      .      1      .      1      1      1      .      .      1 |
     3. |       3     401536      1      1      .      1      1      1      1      .      .      . |
        +------------------------------------------------------------------------------------------+

So here, all have 0, 4 and 5.

Comment

Maddie Adelman

Join Date: Aug 2018

Posts: 10
#4

29 Nov 2018, 11:24

William Lisowski thank you for providing clarification about the original code and giving credit where credit is due. I will follow your lead in future posts.

To clarify, the topic I am trying to address is following behavior between firms. If one firm engages in Mergers and Acquisitions in a given year (while also analyzing the number of deals), will another firm follow? I am planning to use the independent variable of similarity in a regression to test whether firm i and firm j are in the same industry. The variables I have are year, firm_id, NAIC code, region, nation, industry concentration level, among others (the dataset has 426 variables total). I have uniquely identified each observation by firm_id, year and the company they targeted for M&A deal activity.

To determine the same industry (similarity), I am looking at the 6-digit North American Industry Classification System (NAICS) code. Each digit in the code is part of a series of progressively narrower categories. The first two digits of the code refers to the economic sector of a firm, the third digit designates the subsector, the fourth digit refers to the industry group, the fifth digit designates the NAICS industry, and the sixth digit designates the national industry.
Comment

Andrew Musau

Join Date: Oct 2014
Posts: 9774

29 Nov 2018, 16:19

I am planning to use the independent variable of similarity in a regression to test whether firm i and firm j are in the same industry.

It sounds to me that you need to create a variable that identifies industries in your data set and then use this to create indicators in your analysis.

To determine the same industry (similarity), I am looking at the 6-digit North American Industry Classification System (NAICS) code.

This classification is taken from https://www.census.gov/cgi-bin/sssd/...rch?chart=2017

2017 NAICS
The following table provides detailed information on the structure of NAICS.

Sector Description
11 Agriculture, Forestry, Fishing and Hunting
21 Mining, Quarrying, and Oil and Gas Extraction
22 Utilities
23 Construction
31-33 Manufacturing
42 Wholesale Trade
44-45 Retail Trade
48-49 Transportation and Warehousing
51 Information
52 Finance and Insurance
53 Real Estate and Rental and Leasing
54 Professional, Scientific, and Technical Services
55 Management of Companies and Enterprises
56 Administrative and Support and Waste Management and Remediation Services
61 Educational Services
62 Health Care and Social Assistance
71 Arts, Entertainment, and Recreation
72 Accommodation and Food Services
81 Other Services (except Public Administration)
92 Public Administration

You can create the industry variable as follows

Code:

input firm_id M_Acq_Naic str2 M_Acq_Reg
1 511210 "JP"
2 236116 "EU"
3 451120 "AM"
3 451120 "AM"
3 451120 "AM"
3 451120 "AM"
4 441110 "AM"
4 441110 "AM"
5 811310 "EU"
6 221119 "EU"
7 813212 "JP" 
end
gen sector= real(substr(string(M_Acq_Naic), 1, 2))
replace sector=31 if inlist(sector, 32, 33)
replace sector=44 if sector==45
replace sector=48 if sector==49
label define sector 11 "Agriculture, Forestry, Fishing and Hunting"///
21 "Mining, Quarrying, and Oil and Gas Extraction" 22 "Utilities"///
23 "Construction" 31 "Manufacturing" 42    "Wholesale Trade" 44 "Retail Trade"///
48 "Transportation and Warehousing" 51 "Information" 52 "Finance and Insurance"///
53 "Real Estate and Rental and Leasing" 54 "Professional, Scientific, and Technical Services"///
55 "Management of Companies and Enterprises"///
56 "Administrative and Support and Waste Management and Remediation Services"///
61 "Educational Services" 62 "Health Care and Social Assistance"///
71 "Arts, Entertainment, and Recreation" 72 "Accommodation and Food Services"///
81 "Other Services (except Public Administration)" 92 "Public Administration"
label values sector sector

This results in the following for your data example

Code:

 list, clean

       firm_id   M_Acq_~c   M_Acq_~g                                          sector  
  1.         1     511210         JP                                     Information  
  2.         2     236116         EU                                    Construction  
  3.         3     451120         AM                                    Retail Trade  
  4.         3     451120         AM                                    Retail Trade  
  5.         3     451120         AM                                    Retail Trade  
  6.         3     451120         AM                                    Retail Trade  
  7.         4     441110         AM                                    Retail Trade  
  8.         4     441110         AM                                    Retail Trade  
  9.         5     811310         EU   Other Services (except Public Administration)  
 10.         6     221119         EU                                       Utilities  
 11.         7     813212         JP   Other Services (except Public Administration)

Now, provided that you have enough intra-industry variation (or enough firms within industries), you can do analyses at the industry level and also do comparisons of your coefficients across industries.

Comment

Maddie Adelman

Join Date: Aug 2018

Posts: 10
#6

11 Dec 2018, 21:55

Thank you all for the suggestions! I was able to run my code on small number of observations, but when I try to expand my code to the full set of observations (648, 812) I keep getting an error with my

Code:

cross using `temp'

and my

Code:

joinby Year_2 using `temp'

I get the following error - r(459) "sum of expand values exceed 2,147,483,620. The dataset may not contain more than 2,147,483,620 observations." That is because there are duplicate firms- The data looks at a firm's deal activity by year (1990-2016) so it is possible the firm is listed 26 times... there are 232,064 unique firms.

Any advice is appreciated! Thanks!
Comment
William Lisowski

Join Date: Dec 2014

Posts: 10150
#7

12 Dec 2018, 07:05

If there are N=373,088 distinct firms, then there are N*(N-1) = 139,190,551,776 possible ordered pairs of distinct firms. You are using Stata/SE or Stata/IC, which limits the number of observations in a dataset to 2,147,483,647, so what you hope to accomplish by creating all pairs of distinct firms is not possible for you. While upgrading to Stata/MP would remove that limit, it would require substantial amounts of memory on your system.

The advice from Andrew Musau in post #5 seems a more appropriate approach to your problem, more feasible and more substantively correct.
Comment

Announcement