
  • Building Industry Classification based on similarity scores

    Dear all,

    I would like to build an industry classification based on the Hoberg and Phillips TNIC (Text-based Network Industry Classifications) data.

    The available data from their website looks like this:

    Year: range from 1997 to 2015
    gvkey: main firm identifier in compustat (for firm x and y)
    score: Similarity measure of both companies

    year gvkey1 gvkey2 score
    1996 1004 1210 0.0693
    1996 1004 1849 0.0061
    1996 1004 1988 0.0056
    1996 1004 2033 0.0595
    1996 1004 2049 0.0232
    1996 1004 2285 0.038
    1996 1004 2519 0.0082
    1996 1004 2598 0.0168
    1996 1004 2771 0.0022
    1996 1004 3178 0.0068
    1996 1004 3216 0.0354
    1996 1004 3416 0.0099
    1996 1004 4091 0.0226
    1996 1004 4223 0.0185
    1996 1004 4279 0.0077
    1996 1004 4460 0.0017
    1996 1004 5567 0.1097
    1996 1004 6500 0.0046
    1996 1004 7085 0.031

    The similarity measure indicates how similar the two companies are. I would now like to group the companies into industries, year by year, based on a similarity threshold. A company may belong to only one industry in any given year.

    Does anyone know how to do this in Stata?
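
    For readers who want a starting point: one simple reading of "classify by a similarity threshold, one industry per firm per year" is to link two firms whenever their score meets the threshold and to treat each connected group of linked firms as an industry. Below is a minimal sketch of that idea in Python rather than Stata. It is not the Hoberg-Phillips procedure. The function name `classify_by_threshold`, the threshold of 0.05, and the choice to label each industry by its smallest gvkey are all assumptions made for illustration.

```python
def classify_by_threshold(pairs, threshold):
    """Assign each firm exactly one industry per year.

    pairs: iterable of (year, gvkey1, gvkey2, score) rows.
    Two firms are linked when score >= threshold; every connected group
    of linked firms within a year becomes one industry.
    Returns {(year, gvkey): industry_id}, where industry_id is the
    smallest gvkey in the group (an arbitrary labelling choice).
    """
    parent = {}

    def find(node):
        # Walk up to the root, shortening the path along the way.
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Keep the smaller (year, gvkey) as root so labels are stable.
            if root_b < root_a:
                root_a, root_b = root_b, root_a
            parent[root_b] = root_a

    for year, gvkey1, gvkey2, score in pairs:
        # Register both firms even if the pair falls below the threshold.
        for node in ((year, gvkey1), (year, gvkey2)):
            parent.setdefault(node, node)
        if score >= threshold:
            union((year, gvkey1), (year, gvkey2))

    return {node: find(node)[1] for node in parent}


# A few rows taken from the sample data in the post.
rows = [
    (1996, 1004, 1210, 0.0693),
    (1996, 1004, 1849, 0.0061),
    (1996, 1004, 2033, 0.0595),
    (1996, 1004, 5567, 0.1097),
]
industries = classify_by_threshold(rows, threshold=0.05)  # 0.05 is hypothetical
for (year, gvkey), industry in sorted(industries.items()):
    print(year, gvkey, industry)
```

    Note that this kind of grouping chains firms together: if A is close to B and B is close to C, all three end up in one industry even when A and C are not similar. A high threshold keeps the groups small; a low one can merge most of the sample into a single industry.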

  • #2
    You didn't get a quick answer. You'll increase your chances of a useful answer by following the FAQ on asking questions - provide Stata code in code delimiters, readable Stata output, and sample data using dataex.

    So, you have similarity scores between pairs of firms and want to group the firms into industries? The first place to start would be with Hoberg & Phillips - how did they identify industries? You seem to have firm-pair data, but you describe H&P as an industry classification. If you can tell us specifically what procedure you want to use, we can probably help you better. There are multivariate classification tools in the multivariate statistics portion of Stata - see the PDF documentation. I don't know how well these work with many groups and many firms.



    • #3
      Hi Phil,
      Many thanks for your response. I'll take a closer look at your recommendation, the multivariate statistics portion of Stata.



      • #4
        Hi Hans, this is probably going to be more complicated than you expected.

        So, for those coming afterwards:
        • The Hoberg-Phillips textual analysis data can be downloaded at http://hobergphillips.usc.edu/
        • They also have links to working paper versions of all of their papers there.
        I believe the text-based industry classification is discussed in the following paper:
        Gerard Hoberg and Gordon Phillips, "Text-Based Network Industries and Endogenous Product Differentiation," Journal of Political Economy 124, no. 5 (October 2016): 1423-1465. https://doi.org/10.1086/688176
        • The details of how to build their industry classification are in Appendix B (I've posted a screenshot of it below)
        You might also find this paper useful (at least to start):
        “Product Market Synergies and Competition in Mergers and Acquisitions: A Text-Based Analysis”
        The Review of Financial Studies, Volume 23, Issue 10, 1 October 2010, Pages 3773–3811 https://doi.org/10.1093/rfs/hhq053

        I believe that the "Supplementary Material" listed for one of those papers has the SAS commands to create the industry classifications.



        This is from the Product Market Synergies paper in Review of Financial Studies (p. 3783):
        Product Similarity (10 Nearest): For a given firm i, this variable is the average pairwise similarity (using the local dictionary) between firm i and its ten most similar rivals j. The closest rivals are the ten firms with the highest local similarity to i.

        Broad Similarity (All Firms): For a given firm i, this variable is the average similarity (using the broad dictionary) between firm i and all other firms j in the sample


        If you want to check your work, on p. 3786 they say:
        Product similarities are bounded in the range [0,1]. The average broad similarity (across all firms) is 0.022 (or 2.2%). The average local similarity between a firm and its ten closest neighbors is considerably higher at 20.1%.
        Last edited by David Benson; 06 Nov 2018, 13:06.



        • #5
          So, that was a long-winded way to suggest a few options (others might suggest more):
          1. With the data you have right now, it would be easy to identify each firm's 10 closest firms and calculate its average similarity to them (and to all firms in the data). You could do this by year, for a 5-year period, or over the entire time frame of the sample.

          2. You could get the SAS commands from the Journal of Political Economy paper and run them yourself (if you know SAS), or find someone who does.

          3. Contact Hoberg or Phillips to see if someone has created Stata code to create them.
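
          To make option 1 concrete, here is a minimal sketch in Python (not Stata) that computes, for each firm-year, the average score over its k highest-scoring rivals and the average over all rivals listed for it. The function name `rival_similarity` is made up for this example. It assumes every pair appears in the file in both directions, so that all of a firm's rivals show up with that firm as gvkey1; if each pair is listed only once, the mirrored rows would have to be added first.

```python
import heapq
from collections import defaultdict


def rival_similarity(pairs, k=10):
    """Summarise similarity scores per firm-year.

    pairs: iterable of (year, gvkey1, gvkey2, score) rows.
    Returns {(year, gvkey1): (mean of the k highest scores,
                              mean of all listed scores)}.
    If a firm has fewer than k rivals, all of them are used.
    """
    scores = defaultdict(list)
    for year, gvkey1, _gvkey2, score in pairs:
        scores[(year, gvkey1)].append(score)

    summary = {}
    for firm_year, values in scores.items():
        top_k = heapq.nlargest(k, values)
        summary[firm_year] = (
            sum(top_k) / len(top_k),
            sum(values) / len(values),
        )
    return summary


# A few rows taken from the sample data in the first post.
rows = [
    (1996, 1004, 1210, 0.0693),
    (1996, 1004, 1849, 0.0061),
    (1996, 1004, 2033, 0.0595),
    (1996, 1004, 5567, 0.1097),
]
# k=3 only because the toy sample is tiny; the paper's measure uses ten rivals.
summary = rival_similarity(rows, k=3)
nearest_mean, overall_mean = summary[(1996, 1004)]
print(nearest_mean, overall_mean)
```

          The second number is an average over the pairs present in the file only. If the downloaded file omits low-similarity pairs, it will not match an average taken over all other firms in the sample.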



          • #6
            Hi David,

            I'm speechless. Thank you very much for this extraordinarily great response. I'll take your advice and try it out!



            • #7
              Hi Hans,

              Were you able to write Stata code for the industry classification? I am rather a newbie at coding, and a little input would help tremendously.



              • #8
                Hi,

                Just to clarify, Appendix B in the JPE paper shows how Hoberg and Phillips compute FICs (fixed industry classifications). These data are available to download from the Hoberg and Phillips website: https://hobergphillips.tuck.dartmouth.edu/
                There is really no reason to try to build this on your own, since the data are freely available.

