Building Industry Classification based on similarity scores

Hans Nostiz

Join Date: Nov 2018

Posts: 15
#1

04 Nov 2018, 10:57

Dear all,

I would like to build an industry classification based on Hoberg and Phillips' TNIC (Text-based Network Industry Classifications).

The available data from their website looks like this:

year: ranges from 1997 to 2015
gvkey1, gvkey2: main firm identifier in Compustat (for firms x and y)
score: similarity measure for the pair of companies

year gvkey1 gvkey2 score
1996 1004 1210 0.0693
1996 1004 1849 0.0061
1996 1004 1988 0.0056
1996 1004 2033 0.0595
1996 1004 2049 0.0232
1996 1004 2285 0.038
1996 1004 2519 0.0082
1996 1004 2598 0.0168
1996 1004 2771 0.0022
1996 1004 3178 0.0068
1996 1004 3216 0.0354
1996 1004 3416 0.0099
1996 1004 4091 0.0226
1996 1004 4223 0.0185
1996 1004 4279 0.0077
1996 1004 4460 0.0017
1996 1004 5567 0.1097
1996 1004 6500 0.0046
1996 1004 7085 0.031

The similarity measure indicates how similar the two companies are. I would now like to classify the companies into annual industries based on a similarity threshold. A company may only be listed in one industry per year.

Does anyone know how to do it in Stata?
Phil Bromiley

Join Date: Apr 2014

Posts: 4348
#2

05 Nov 2018, 15:43

You didn't get a quick answer. You'll increase your chances of a useful answer by following the FAQ on asking questions: provide Stata code in code delimiters, readable Stata output, and sample data using dataex.

So, you have similarity scores between firms and want to group the firms into industries? The first place to start would be with Hoberg & Phillips: how did they identify industries? You seem to have firm-pair data, but you describe H&P as industry classifications. If you can tell us specifically what procedure you want to use, we can probably help you better. There are multivariate classification tools in the multivariate statistics portion of Stata; see the PDF documentation. I don't know how well these work with many groups and many firms.
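A minimal sketch of one way to read the original question (a similarity threshold, with each firm in exactly one industry per year): treat every pair at or above the threshold as linked, and call each connected group of firms an industry. This is not the Hoberg and Phillips procedure. The 0.05 cutoff and the function name are made up for illustration, and it is written in Python rather than Stata only to show the logic. The sample rows are taken from the first post.

```python
from collections import defaultdict


def classify_by_threshold(rows, threshold):
    """Group firms into industries within each year.

    rows: iterable of (year, gvkey1, gvkey2, score) tuples.
    Two firms share an industry if a chain of pairs with
    score >= threshold connects them (connected components).
    Returns {year: {gvkey: industry_id}}, where industry_id is the
    smallest gvkey in the group.
    """
    parent = {}  # union-find forest over (year, gvkey) nodes

    def find(node):
        # Follow parent links to the root, shortening the path as we go.
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    for year, g1, g2, score in rows:
        a, b = (year, g1), (year, g2)
        parent.setdefault(a, a)
        parent.setdefault(b, b)
        if score >= threshold:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                # Keep the smaller node as root so labels are deterministic.
                if root_b < root_a:
                    root_a, root_b = root_b, root_a
                parent[root_b] = root_a

    industries = defaultdict(dict)
    for year, gvkey in list(parent):
        industries[year][gvkey] = find((year, gvkey))[1]
    return dict(industries)


# A few rows from the sample data in the first post.
sample = [
    (1996, 1004, 1210, 0.0693),
    (1996, 1004, 1849, 0.0061),
    (1996, 1004, 2033, 0.0595),
    (1996, 1004, 5567, 0.1097),
]
result = classify_by_threshold(sample, threshold=0.05)  # hypothetical cutoff
for gvkey, industry in sorted(result[1996].items()):
    print(gvkey, industry)
```

One caveat with this approach: links chain together, so two firms can land in the same industry even if their own pairwise score is below the threshold, and a low cutoff can merge most firms into one large group.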
Hans Nostiz

Join Date: Nov 2018

Posts: 15
#3

06 Nov 2018, 07:49

Hi Phil,
many thanks for your response. I'll take a closer look at your recommendation, the multivariate statistics portion of Stata.
David Benson

Join Date: Oct 2018

Posts: 489
#4

06 Nov 2018, 13:00

Hi Hans, this is probably going to be more complicated than you expected.

So, for those coming afterwards:
The Hoberg-Phillips textual analysis data can be downloaded at http://hobergphillips.usc.edu/

They also have links to working paper versions of all of their papers there.

I believe the text-based industry classification is discussed in the following paper:
Gerard Hoberg and Gordon Phillips, "Text-Based Network Industries and Endogenous Product Differentiation," Journal of Political Economy 124, no. 5 (October 2016): 1423-1465. https://doi.org/10.1086/688176
The details of how to build their industry classification are in Appendix B (I've posted a screenshot of it below).

You might also find this paper useful (at least to start):
“Product Market Synergies and Competition in Mergers and Acquisitions: A Text-Based Analysis”
The Review of Financial Studies, Volume 23, Issue 10, 1 October 2010, Pages 3773–3811 https://doi.org/10.1093/rfs/hhq053

I believe that the "Supplementary Material" listed for one of those papers has the SAS commands to create the industry classifications.

This is from the Product Market Synergies paper in Review of Financial Studies (p. 3783):
Product Similarity (10 Nearest): For a given firm i, this variable is the average pairwise similarity (using the local dictionary) between firm i and its ten most similar rivals j. The closest rivals are the ten firms with the highest local similarity to i.

Broad Similarity (All Firms): For a given firm i, this variable is the average similarity (using the broad dictionary) between firm i and all other firms j in the sample

If you want to check your work, on p. 3786 they say:
Product similarities are bounded in the range [0,1]. The average broad similarity (across all firms) is 0.022 (or 2.2%). The average local similarity between a firm and its ten closest neighbors is considerably higher at 20.1%

Last edited by David Benson; 06 Nov 2018, 13:06.
David Benson

Join Date: Oct 2018

Posts: 489
#5

06 Nov 2018, 13:13

So, that was a long-winded way to suggest a few options (others might suggest more):
1. With the data you have right now, it would be easy to identify the 10 closest firms and calculate the similarity to them (and to all firms in the data). You could do this by year, for a 5-year period, or over the entire time frame of the sample.

2. You could get the SAS commands from the Journal of Political Economy paper and run them yourself (if you know SAS), or find someone who does.

3. Contact Hoberg or Phillips to see if someone has created Stata code to create them.
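For option 1, the calculation could look roughly like the sketch below: for each firm-year, average the k highest scores and also average all listed scores. It is written in Python only to show the logic, the function name is hypothetical, and it assumes every pair appears in both directions so that grouping on gvkey1 picks up all of a firm's rivals. The "all" average covers only the pairs present in the file, so it matches an all-firms average only if the file lists every pair.

```python
from collections import defaultdict
import heapq


def similarity_summaries(rows, k=10):
    """Summarise pairwise scores per firm-year.

    rows: iterable of (year, gvkey1, gvkey2, score) tuples.
    Returns {(year, gvkey1): (mean of the k highest scores,
                              mean of all listed scores)}.
    If a firm has fewer than k listed rivals, all of them are used.
    """
    scores = defaultdict(list)
    for year, g1, _g2, score in rows:
        scores[(year, g1)].append(score)

    summaries = {}
    for key, values in scores.items():
        nearest = heapq.nlargest(k, values)  # the k closest rivals
        summaries[key] = (
            sum(nearest) / len(nearest),
            sum(values) / len(values),
        )
    return summaries


# A few rows from the sample data in the first post; k=2 keeps it small.
sample_rows = [
    (1996, 1004, 1210, 0.0693),
    (1996, 1004, 1849, 0.0061),
    (1996, 1004, 5567, 0.1097),
]
nearest_mean, overall_mean = similarity_summaries(sample_rows, k=2)[(1996, 1004)]
print(round(nearest_mean, 4), round(overall_mean, 4))
```

To pool over a 5-year period or the whole sample instead of by year, the grouping key would change from (year, gvkey1) to the period or to gvkey1 alone.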
Hans Nostiz

Join Date: Nov 2018

Posts: 15
#6

08 Nov 2018, 10:06

Hi David,

I'm speechless. Thank you very much for this extraordinarily great response. I'll take your advice and try it out!
Ray Mond

Join Date: Jan 2020

Posts: 1
#7

27 Jan 2020, 05:51

Hi Hans,

Were you able to write Stata code for the industry classification? I am rather new to coding, and a little input would help tremendously.
Ricardo Lopez

Join Date: Feb 2021

Posts: 2
#8

15 Apr 2021, 08:13

Hi,

Just to clarify, Appendix B in the JPE paper shows how Hoberg and Phillips compute FICs (fixed industry classifications). This data is available to download from the Hoberg and Phillips website: https://hobergphillips.tuck.dartmouth.edu/
There is really no reason to try to build this on your own since the data is freely available.