  • Problem with calculating the HHI (Herfindahl-Hirschman Index)

    Dear Stata community,

    I am new to the forum and to Stata, too. I am trying to calculate the HHI (Herfindahl-Hirschman Index) in my panel data set in Stata 16.1. To give you an idea of what my data structure and the relevant variables look like, please see the following example:

    The firm identifier is gvkey, sale is the net sales of the firm, sic is the 4-digit SIC code, and fyear is the fiscal year.


    [CODE]
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str6 gvkey double fyear str16 sic double sale
    "001001" 1984 "5812" 32.007
    "001001" 1985 "5812" 53.798
    "001003" 1983 "5712" 13.793
    "001003" 1984 "5712" 13.829
    "001003" 1986 "5712" 36.308
    "001003" 1987 "5712" 37.356
    "001003" 1988 "5712" 32.808
    "001004" 1980 "5080" 132.482
    "001004" 1981 "5080" 175.924
    "001004" 1982 "5080" 155.006
    "001004" 1983 "5080" 177.762
    "001004" 1984 "5080" 218.946
    "001004" 1985 "5080" 248.012
    "001004" 1986 "5080" 298.192
    "001004" 1987 "5080" 347.64
    "001004" 1988 "5080" 406.36
    "001004" 1989 "5080" 444.875
    "001004" 1990 "5080" 466.542
    "001004" 1991 "5080" 422.657
    "001004" 1992 "5080" 382.78
    "001004" 1993 "5080" 407.754
    "001005" 1980 "3724" 23.382
    "001005" 1981 "3724" 35.921
    "001007" 1980 "3652" 9.262
    "001007" 1981 "3652" 7.261
    "001007" 1982 "3652" 4.993
    "001007" 1983 "3652" 3.839
    "001008" 1985 "3577" .705
    "001009" 1982 "3460" 36.01
    "001009" 1983 "3460" 18.753
    "001009" 1984 "3460" 21.019
    "001009" 1985 "3460" 20.507
    "001009" 1986 "3460" 19.266
    "001009" 1987 "3460" 19.55
    "001009" 1988 "3460" 28.419
    end
    [/CODE]



    I started with the following code:


    ssc install hhi


    hhi sale, by (sic fyear)

    sort sic fyear




    However, I got the error "Negative values in varlist". Therefore, I changed the code to:


    ssc install hhi


    replace sale = . if sale < 0

    drop if sale == .

    hhi sale, by (sic fyear)

    sort sic fyear




    The code runs and I get the following result:


    [CODE]
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str6 gvkey double fyear str16 sic double sale float hhi_sale
    "008596" 1980 "0100" 405.884 .440715
    "009391" 1980 "0100" 5.073 .440715
    "010390" 1980 "0100" 25.022 .440715
    "010884" 1980 "0100" 3762.579 .440715
    "002099" 1980 "0100" 87.565 .440715
    "001266" 1980 "0100" 12.517 .440715
    "010971" 1980 "0100" 165.766 .440715
    "002812" 1980 "0100" 1733.501 .440715
    "010802" 1980 "0100" 79.928 .440715
    "010884" 1981 "0100" 4058.385 .6711329
    "010971" 1981 "0100" 247.284 .6711329
    "009391" 1981 "0100" 5.478 .6711329
    "010802" 1981 "0100" 76.542 .6711329
    "001266" 1981 "0100" 12.346 .6711329
    "002099" 1981 "0100" 100.759 .6711329
    "010390" 1981 "0100" 20.977 .6711329
    "008596" 1981 "0100" 477.995 .6711329
    "001266" 1982 "0100" 9.955 .4613737
    "002099" 1982 "0100" 110.442 .4613737
    "010802" 1982 "0100" 78.755 .4613737
    "002812" 1982 "0100" 1823.232 .4613737
    "010390" 1982 "0100" 20.854 .4613737
    "008596" 1982 "0100" 557.398 .4613737
    "009391" 1982 "0100" 5.596 .4613737
    "009062" 1982 "0100" 4.527 .4613737
    "010971" 1982 "0100" 222.384 .4613737
    "010390" 1983 "0100" 23.553 .4112865
    "010884" 1983 "0100" 3360.441 .4112865
    "006275" 1983 "0100" .424 .4112865
    "008596" 1983 "0100" 505.434 .4112865
    "001266" 1983 "0100" 8.877 .4112865
    "009062" 1983 "0100" 4.496 .4112865
    "002812" 1983 "0100" 1551.725 .4112865
    "010971" 1983 "0100" 274.672 .4112865
    "009391" 1983 "0100" 2.908 .4112865
    "002099" 1983 "0100" 111.037 .4112865
    "003702" 1984 "0100" 1.731 .40012285
    "010390" 1984 "0100" 22.14 .40012285
    "010971" 1984 "0100" 258.357 .40012285
    "002812" 1984 "0100" 1520.088 .40012285
    end
    [/CODE]


    I am really unsure whether my code is correct and whether I really get the result I am looking for. Could one of you have a look and check whether the procedure is right?

    Why do I write hhi sale, by (sic fyear) and not just hhi sale, by (sic)? That is, why do I include fyear in my HHI? And what is my result telling me now that fyear is included?

    I would really appreciate your help as I am quite lost.

    Thank you already in advance.

    Shakira

  • #2
    I think there are multiple issues going on here. First, having negative sales does not make sense. Please check the integrity of your data. Where did you get it from, and how did you import it? Look at the negative values and try to identify where the problem lies.
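
    As a starting point for that check, here is a minimal sketch. It uses the variable names from #1, but the four observations (including the negative one) are made up purely for illustration; anyneg is a hypothetical helper name.

    ```stata
    * Toy data: invented values, one deliberately negative sale
    clear
    input str6 gvkey int fyear str4 sic double sale
    "A00001" 1984 "5812" 32.007
    "A00002" 1984 "5812" -4.5
    "A00003" 1984 "5812" 53.798
    "A00001" 1985 "5812" 40
    end

    * How many negative values are there, and which firm-years are they?
    count if sale < 0
    list gvkey fyear sic sale if sale < 0

    * Flag every industry-year that contains at least one negative value
    egen byte anyneg = max(sale < 0), by(sic fyear)
    tabulate fyear anyneg
    ```

    Looking at the flagged industry-years first shows how much of the sample any later exclusion would touch.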

    Second, hhi (SSC) appears to date from 2012 and may be outdated. You can find the definition of the index at:
    https://www.justice.gov/atr/herfindahl-hirschman-index

    Basically, you are supposed to calculate each company's share of sales within each sic and fyear, express it as a percentage (so that 30% becomes 30, for example), square it, and add the squares up to get the index value.

    Your index values are all less than 1 so something appears wrong with the hhi package too.

    The reason you specify by (sic fyear) and not only by (sic) is that you need a definition of the year as well as of the industry. That way, each index value is for one SIC code in a given year.

    So, when you get the right data, calculate the individual share of each company in a given industry and year, multiply the result by 100, then square it, and sum the squared values by sic and fyear to get the index value for each industry in each year.
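
    Those steps can be sketched with official commands only. The data below are invented, and share_pct and hhi_pct are hypothetical variable names; this assumes no negative or missing sales.

    ```stata
    * Toy data: two industries in one year, invented sales
    clear
    input float(sic fyear sale)
    1 2019 50
    1 2019 30
    1 2019 20
    2 2019 60
    2 2019 40
    end

    * Percentage share of each firm within its industry-year (0 to 100)
    egen share_pct = pc(sale), by(sic fyear)

    * Sum of squared percentage shares within each industry-year
    egen hhi_pct = total(share_pct^2), by(sic fyear)

    list, sepby(sic)
    ```

    For industry 1 the shares are 50, 30 and 20, so the index is 2500 + 900 + 400 = 3800; for industry 2 it is 3600 + 1600 = 5200.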


    • #3
      This measure is often discussed on Statalist, occasionally under other names that are more nearly historically correct (several people got there long before Hirschman or Herfindahl), or that avoid the treacherous territory of trying to decide who first thought of summing squared probabilities as a measure of unevenness. Thus repeat rate, match probability, and quadratic entropy are all terms that avoid arbitrary and cryptic attributions to particular scholars.

      There are several different questions here.

      One is that hhi (SSC) refuses to deal with negative values, which makes sense to me. So, you ignore them. I don't have a different solution, but if I were grading your paper/reading your thesis/reviewing your submission for a learned journal I would want to see a discussion of why negative sales arise in the first place and what ignoring them means for the analysis.

      Although I've published commands in this territory too -- typically as producing several such measures all at once -- if this is all you want, you hardly need a community-contributed command, as it's at most two lines of official code, calculating probabilities and then adding their squares.

      If you're unsure of what you're doing, work with a toy example in which you can check the calculations independently. Here's one.

      Code:
      * Example generated by -dataex-. To install: ssc install dataex
      clear
      input float(sic year sale)
      1 2019 10
      1 2019  5
      1 2019  5
      1 2019  5
      1 2020 10
      1 2020 10
      1 2020 10
      1 2020 10
      end
      
      egen prop = pc(sale), by(sic year) prop
      
      egen hhi = total(prop^2), by(sic year)
      
      egen prop2 = pc(sale), by(sic) prop
      
      egen hhi2 = total(prop2^2), by(sic)
      
      list, sepby(year)
      
           +------------------------------------------------------+
           | sic   year   sale   prop   hhi      prop2       hhi2 |
           |------------------------------------------------------|
        1. |   1   2019     10     .4   .28   .1538462   .1360947 |
        2. |   1   2019      5     .2   .28   .0769231   .1360947 |
        3. |   1   2019      5     .2   .28   .0769231   .1360947 |
        4. |   1   2019      5     .2   .28   .0769231   .1360947 |
           |------------------------------------------------------|
        5. |   1   2020     10    .25   .25   .1538462   .1360947 |
        6. |   1   2020     10    .25   .25   .1538462   .1360947 |
        7. |   1   2020     10    .25   .25   .1538462   .1360947 |
        8. |   1   2020     10    .25   .25   .1538462   .1360947 |
           +------------------------------------------------------+
      This example is also intended to answer your other question. Working by(sic year) gives a separate index for each group and year. In 2019 the proportions of the total were 0.4 and 0.2 (3 times); their squares are 0.16 and 0.04 (3 times), so arithmetic gives you 0.28 as the measure. In 2020 (1/4) squared, 4 times, gives you (1/16) four times and so 0.25.

      Working by(sic) lumps together all years. I don't want to try the arithmetic mentally, but as those examples check out, I am confident that the code is correct.
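
      For anyone who does want to check the pooled figure, a short sketch: across both years there are five sales of 10 and three of 5, so the total is 65. The scalar name hhi_pooled is hypothetical.

      ```stata
      * Pooled over 2019 and 2020: five sales of 10, three of 5, total 65
      scalar hhi_pooled = 5*(10/65)^2 + 3*(5/65)^2
      display %9.7f scalar(hhi_pooled)
      ```

      The result should agree with hhi2 in the listing, up to float rounding.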

      Although not shown here, hhi gives the same results, although at first glance it offers no choice whatsoever on what the result variable is called. Looking inside hhi shows that it centres on the same two-liner as above, which is fine by me. (I wrote the first version of egen, pc() but even then was just standing on Bill Gould's shoulders, I guess.)

      My guess would be that most people doing this want very much to know how unevenness varies by year as well as sector, but I can't know more about your project than you do.

      Hope that helps.

      Note: I've just seen Oscar's helpful #2, most of which makes very similar points. But using proportions rather than percents is utterly fine as a convention.
      Last edited by Nick Cox; 22 Dec 2020, 07:10.


      • #4
        Nick Cox Apparently, the current definition of hhi (by DOJ) includes multiplication by 100, so the range is between 1-10000. That is news to me but that is also something to be checked to make sure the textbook formula matches the answer.


        • #5
          I don't see that what the DOJ [of the United States?] uses in its own publications -- or even what it recommends (you don't give a link or reference, so I am guessing) -- is binding on anybody else who doesn't want it to be.

          I see that the upper limit with a percent convention is 10000 (a monopoly recording 100% of sales) but the lower limit is arbitrarily close to zero.
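
          To illustrate the lower limit, here is a sketch with a made-up market of 200 equally sized firms. With n equal firms the measure is 1/n as a proportion, or 10000/n on the percent convention, so it shrinks towards zero as n grows.

          ```stata
          * Hypothetical market: 200 firms, all with the same sales
          clear
          set obs 200
          generate sale = 1

          egen prop = pc(sale), prop      // each firm's proportion is 1/200
          egen hhi = total(prop^2)        // 200 * (1/200)^2 = 1/200
          generate hhi_pct = 10000 * hhi  // percent convention

          display hhi[1] "   " hhi_pct[1]
          ```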

          Still, I hope that no-one here needs advice on how to multiply or divide by 10000.


          • #6
            Dear Nick,

            Thanks a lot for your explanation. Your example helped a lot in understanding what exactly the HHI calculates. However, two more questions arose:
            1) Did I understand correctly that it makes no difference whether I use by(sic year) or by(year sic)? Either way round, it basically means the same?
            2) Do you know an explanation for the negative sales in the data set? Or does it depend on our data set?

            Thank you so much for your help!

            Shakira


            • #7
              Glad it helped.

              1) Correct. It means exactly the same.

              2) Sorry, but I know nothing about your dataset. This measure is straightforward so long as there are no negative values. Note that zeros just drop out of the calculation.
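
              Both points can be checked on a toy example. This is only a sketch: the data are invented (including one zero), and the variable names p1 to p3 and h1 to h3 are hypothetical.

              ```stata
              * Toy data with one zero sale
              clear
              input float(sic year sale)
              1 2019 10
              1 2019  5
              1 2019  0
              2 2019  8
              2 2019  2
              1 2020 10
              1 2020 10
              end

              * 1) The order of variables in by() does not matter
              egen p1 = pc(sale), by(sic year) prop
              egen h1 = total(p1^2), by(sic year)
              egen p2 = pc(sale), by(year sic) prop
              egen h2 = total(p2^2), by(year sic)
              assert abs(h1 - h2) < 1e-6

              * 2) Excluding zeros leaves the index unchanged
              egen p3 = pc(sale) if sale > 0, by(sic year) prop
              egen h3 = total(p3^2), by(sic year)
              assert abs(h1 - h3) < 1e-6

              list, sepby(sic year)
              ```

              A zero contributes nothing to the group total and has a squared share of zero, which is why h1 and h3 agree.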


              • #8
                Shakira Foster I am getting similar results. Could you sort out the issue?


                • #9
                  I have the same problem. The negative sales may be inter-segment sales or losses on sales. It is hard to tell. Maybe the only option is to ignore those firms.
