Problem with calculating the HHI (Herfindahl-Hirschman Index)

Shakira Foster

Join Date: Dec 2020

Posts: 5
#1


22 Dec 2020, 04:40

Dear Stata community,

I am new to the forum and to Stata, too. I am trying to calculate the HHI (Herfindahl-Hirschman Index) in my panel data set in Stata 16.1. To give you an idea of what my data structure and the relevant variables look like, please see the following example:

The firm identifier is gvkey, sale is the firm's net sales, sic is the 4-digit SIC code, and fyear is the fiscal year.

[CODE]

* Example generated by -dataex-. To install: ssc install dataex

clear

input str6 gvkey double fyear str16 sic double sale

"001001" 1984 "5812" 32.007

"001001" 1985 "5812" 53.798

"001003" 1983 "5712" 13.793

"001003" 1984 "5712" 13.829

"001003" 1986 "5712" 36.308

"001003" 1987 "5712" 37.356

"001003" 1988 "5712" 32.808

"001004" 1980 "5080" 132.482

"001004" 1981 "5080" 175.924

"001004" 1982 "5080" 155.006

"001004" 1983 "5080" 177.762

"001004" 1984 "5080" 218.946

"001004" 1985 "5080" 248.012

"001004" 1986 "5080" 298.192

"001004" 1987 "5080" 347.64

"001004" 1988 "5080" 406.36

"001004" 1989 "5080" 444.875

"001004" 1990 "5080" 466.542

"001004" 1991 "5080" 422.657

"001004" 1992 "5080" 382.78

"001004" 1993 "5080" 407.754

"001005" 1980 "3724" 23.382

"001005" 1981 "3724" 35.921

"001007" 1980 "3652" 9.262

"001007" 1981 "3652" 7.261

"001007" 1982 "3652" 4.993

"001007" 1983 "3652" 3.839

"001008" 1985 "3577" .705

"001009" 1982 "3460" 36.01

"001009" 1983 "3460" 18.753

"001009" 1984 "3460" 21.019

"001009" 1985 "3460" 20.507

"001009" 1986 "3460" 19.266

"001009" 1987 "3460" 19.55

"001009" 1988 "3460" 28.419

end

[/CODE]

I started with the following code:

ssc install hhi

hhi sale, by (sic fyear)

sort sic fyear

However, I got the error „Negative values in varlist“. Therefore I changed the code to:

ssc install hhi

replace sale = . if sale < 0

drop if sale == .

hhi sale, by (sic fyear)

sort sic fyear

The code runs and I get the following result:

[CODE]

* Example generated by -dataex-. To install: ssc install dataex

clear

input str6 gvkey double fyear str16 sic double sale float hhi_sale

"008596" 1980 "0100" 405.884 .440715

"009391" 1980 "0100" 5.073 .440715

"010390" 1980 "0100" 25.022 .440715

"010884" 1980 "0100" 3762.579 .440715

"002099" 1980 "0100" 87.565 .440715

"001266" 1980 "0100" 12.517 .440715

"010971" 1980 "0100" 165.766 .440715

"002812" 1980 "0100" 1733.501 .440715

"010802" 1980 "0100" 79.928 .440715

"010884" 1981 "0100" 4058.385 .6711329

"010971" 1981 "0100" 247.284 .6711329

"009391" 1981 "0100" 5.478 .6711329

"010802" 1981 "0100" 76.542 .6711329

"001266" 1981 "0100" 12.346 .6711329

"002099" 1981 "0100" 100.759 .6711329

"010390" 1981 "0100" 20.977 .6711329

"008596" 1981 "0100" 477.995 .6711329

"001266" 1982 "0100" 9.955 .4613737

"002099" 1982 "0100" 110.442 .4613737

"010802" 1982 "0100" 78.755 .4613737

"002812" 1982 "0100" 1823.232 .4613737

"010390" 1982 "0100" 20.854 .4613737

"008596" 1982 "0100" 557.398 .4613737

"009391" 1982 "0100" 5.596 .4613737

"009062" 1982 "0100" 4.527 .4613737

"010971" 1982 "0100" 222.384 .4613737

"010390" 1983 "0100" 23.553 .4112865

"010884" 1983 "0100" 3360.441 .4112865

"006275" 1983 "0100" .424 .4112865

"008596" 1983 "0100" 505.434 .4112865

"001266" 1983 "0100" 8.877 .4112865

"009062" 1983 "0100" 4.496 .4112865

"002812" 1983 "0100" 1551.725 .4112865

"010971" 1983 "0100" 274.672 .4112865

"009391" 1983 "0100" 2.908 .4112865

"002099" 1983 "0100" 111.037 .4112865

"003702" 1984 "0100" 1.731 .40012285

"010390" 1984 "0100" 22.14 .40012285

"010971" 1984 "0100" 258.357 .40012285

"002812" 1984 "0100" 1520.088 .40012285

end

[/CODE]

I am really unsure whether my code is correct and whether I am getting the result I am looking for. Could someone have a look and check if the procedure is right?

Why do I write hhi sale, by (sic fyear) and not just hhi sale, by (sic)? In other words, why do I include fyear in my HHI? And what does the result tell me now that fyear is included?

I would really appreciate your help as I am quite lost.

Thank you already in advance.

Shakira
Oscar Ozfidan

Join Date: Sep 2018

Posts: 257
#2

22 Dec 2020, 05:45

I think there are multiple issues going on here. First, having negative sales does not make sense. Please check the integrity of your data. Where did you get it from and how did you import it? Look at the negative values and try to identify where the problem lies.

Second, hhi appears to date from 2012 and may not have been updated since. You can find the definition of the index at:
https://www.justice.gov/atr/herfindahl-hirschman-index

Basically, you calculate each company's percentage share of sales within sic and fyear (so that 30% is entered as 30, for example), square it, and add the squares up to get the index value.

Your index values are all less than 1 so something appears wrong with the hhi package too.

The reason you are supposed to specify by (sic fyear) and not only by (sic) is that you need a definition not only of the industry but also of the year. The index values are then for a SIC code in a given year.

So, when you get the right data, calculate the individual share of each company in a given industry and year, multiply the result by 100, then square it, and sum the squared values by sic and fyear to get the index value for each year in an industry.
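The steps just described can be sketched with official commands only. This is a minimal illustration on made-up numbers, not the original poster's data; the variable names share_pct and hhi_pct are assumptions chosen for the example.

```stata
clear
input str4 sic int fyear double sale
"0100" 1980 60
"0100" 1980 30
"0100" 1980 10
"0100" 1981 50
"0100" 1981 50
end

* percent share of each firm within its industry-year (pc() returns 0-100 by default)
egen double share_pct = pc(sale), by(sic fyear)

* sum of squared percent shares within each industry-year
egen double hhi_pct = total(share_pct^2), by(sic fyear)

* 1980: 60^2 + 30^2 + 10^2 = 4600; 1981: 50^2 + 50^2 = 5000
list, sepby(sic fyear)
```

On this percent convention the index is at most 10000, which is reached when one firm has all the sales in an industry-year.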
Nick Cox

Join Date: Mar 2014

Posts: 35436
#3

22 Dec 2020, 06:07

This measure is often discussed on Statalist, occasionally under other names that are more nearly historically correct (several people got there long before Hirschman or Herfindahl), or that avoid the treacherous territory of trying to decide who first thought of summing squared probabilities as a measure of unevenness. Thus repeat rate, match probability, and quadratic entropy are all terms that avoid arbitrary and cryptic attributions to particular scholars.

There are several different questions here.

One is that hhi (SSC) refuses to deal with negative values, which makes sense to me. So, you ignore them. I don't have a different solution, but if I were grading your paper/reading your thesis/reviewing your submission for a learned journal I would want to see a discussion of why negative sales arise in the first place and what ignoring them means for the analysis.
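Before dropping anything, it may help to see how many negative values there are and where they sit. The sketch below uses a tiny made-up dataset with the same variable names as in #1; the gvkey values and the indicator name neg_sale are hypothetical.

```stata
clear
input str6 gvkey int fyear str4 sic double sale
"000001" 1980 "0100" 100
"000002" 1980 "0100" -5
"000003" 1980 "0100" 50
"000002" 1981 "0100" 20
end

* how many observations have negative or missing sales?
count if sale < 0
count if missing(sale)

* inspect the offending observations before deciding what to do with them
list gvkey fyear sic sale if sale < 0

* flag them instead of overwriting, so the original values stay available
generate byte neg_sale = sale < 0 if !missing(sale)
tabulate fyear neg_sale
```

Keeping an indicator rather than replacing values makes it easier to report how many firm-years were excluded and from which years.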

Although I've published commands in this territory too -- typically as producing several such measures all at once -- if this is all you want, you hardly need a community-contributed command, as it's at most two lines of official code, calculating probabilities and then adding their squares.
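For reference, the calculation described here can be written as follows, where $s_i$ denotes the sales of firm $i$ among the $n$ firms in a group (the symbols are introduced only for this note):

```latex
p_i = \frac{s_i}{\sum_{j=1}^{n} s_j},
\qquad
H = \sum_{i=1}^{n} p_i^{2},
\qquad
\frac{1}{n} \le H \le 1 .
```

With shares expressed as percents instead of proportions, the same quantity is simply $10000\,H$.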

If you're unsure of what you're doing, work with a toy example in which you can check the calculations independently. Here's one:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input float(sic year sale)
1 2019 10
1 2019  5
1 2019  5
1 2019  5
1 2020 10
1 2020 10
1 2020 10
1 2020 10
end

egen prop = pc(sale), by(sic year) prop
egen hhi = total(prop^2), by(sic year)
egen prop2 = pc(sale), by(sic) prop
egen hhi2 = total(prop2^2), by(sic)

list, sepby(year)

     +------------------------------------------------------+
     | sic   year   sale   prop   hhi      prop2       hhi2 |
     |------------------------------------------------------|
  1. |   1   2019     10     .4   .28   .1538462   .1360947 |
  2. |   1   2019      5     .2   .28   .0769231   .1360947 |
  3. |   1   2019      5     .2   .28   .0769231   .1360947 |
  4. |   1   2019      5     .2   .28   .0769231   .1360947 |
     |------------------------------------------------------|
  5. |   1   2020     10    .25   .25   .1538462   .1360947 |
  6. |   1   2020     10    .25   .25   .1538462   .1360947 |
  7. |   1   2020     10    .25   .25   .1538462   .1360947 |
  8. |   1   2020     10    .25   .25   .1538462   .1360947 |
     +------------------------------------------------------+

This example is also intended to answer your other question. Working by(sic year) gives separate indexes for each group and year. In 2019 the proportions of the total were 0.4 and 0.2 (3 times); their squares are 0.16 and 0.04 (3 times), so arithmetic gives you 0.28 as the measure. In 2020 (1/4) squared, 4 times, gives you (1/16) four times and so 0.25.

Working by(sic) lumps together all years. I don't want to try the arithmetic mentally, but as those examples check out, I am confident that the code is correct.
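The pooled figure can also be checked with display rather than mentally. This is only a sketch of the arithmetic for the toy data: pooled sales are 25 + 40 = 65, with five observations of 10 and three of 5.

```stata
* shares of the pooled total for sales of 10 and of 5
display %9.7f 10/65
display %9.7f 5/65

* five squared shares of 10/65 plus three squared shares of 5/65
display %9.7f 5*(10/65)^2 + 3*(5/65)^2
```

The three results, 0.1538462, 0.0769231 and 0.1360947, agree with the prop2 and hhi2 columns in the listing.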

Although not shown here, hhi gives the same results, although at first glance it offers no choice whatsoever on what the result variable is called. Looking inside hhi shows that it centres on the same two-liner as above, which is fine by me. (I wrote the first version of egen, pc() but even then was just standing on Bill Gould's shoulders, I guess.)

My guess would be that most people doing this want very much to know how unevenness varies by year as well as sector, but I can't know more about your project than you do.

Hope that helps.

Note: I've just seen Oscar's helpful #2, most of which makes very similar points. But using proportions rather than percents is utterly fine as a convention.

Last edited by Nick Cox; 22 Dec 2020, 06:10.
Oscar Ozfidan

Join Date: Sep 2018

Posts: 257
#4

22 Dec 2020, 06:17

Nick Cox Apparently, the current definition of hhi (by DOJ) includes multiplication by 100, so the range is between 1-10000. That is news to me but that is also something to be checked to make sure the textbook formula matches the answer.
Nick Cox

Join Date: Mar 2014

Posts: 35436
#5

22 Dec 2020, 06:53

I don't see that what the DOJ [of the United States?] uses in its own publications -- or even what it recommends (you don't give a link or reference, so I am guessing) -- is binding on anybody else who doesn't want it to be.

I see that the upper limit with a percent convention is 10000 (a monopoly recording 100% of sales) but the lower limit is arbitrarily close to zero.

Still, I hope that no-one here needs advice on how to multiply or divide by 10000.
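For completeness, a minimal sketch of moving between the two conventions, reusing the toy data from #3; the variable name hhi_pct is an assumption made for this example.

```stata
clear
input float(sic year sale)
1 2019 10
1 2019  5
1 2019  5
1 2019  5
1 2020 10
1 2020 10
1 2020 10
1 2020 10
end

* proportion convention: the index lies in (0, 1]
egen prop = pc(sale), by(sic year) prop
egen hhi = total(prop^2), by(sic year)

* percent convention: the same index rescaled to (0, 10000]
generate hhi_pct = 10000 * hhi

list sic year sale hhi hhi_pct, sepby(year)
```

Here 0.28 and 0.25 become roughly 2800 and 2500; the ranking of groups is unchanged, so which scale to report is a matter of convention.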
Shakira Foster

Join Date: Dec 2020

Posts: 5
#6

22 Dec 2020, 13:25

Dear Nick,

Thanks a lot for your explanation. Your example helped a lot in understanding what exactly the HHI calculates. However, two more questions arose:
1) Did I understand correctly that it makes no difference whether I use by(sic year) or by(year sic)? Either way round, it means the same?
2) Do you know an explanation for the negative sales in the data set? Or does it depend on our data set?

Thank you so much for your help!

Shakira
Nick Cox

Join Date: Mar 2014

Posts: 35436
#7

22 Dec 2020, 15:55

Glad it helped.

1) Correct. It means exactly the same.

2) Sorry, but I know nothing about your dataset. This measure is straightforward so long as there are no negative values. Note that zeros just drop out of the calculation.
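Both points can be checked on a small made-up example. The sketch below is illustrative only; the variable names hhi_sy, hhi_ys, prop_pos and hhi_pos are assumptions, and reldif() is used with a tolerance so that the comparison does not depend on the order of summation.

```stata
clear
input float(sic year sale)
1 2019 10
1 2019  5
1 2019  5
1 2019  0
1 2020 10
1 2020 10
end

egen double prop = pc(sale), by(sic year) prop

* 1) the order of variables in by() does not change the groups
egen double hhi_sy = total(prop^2), by(sic year)
egen double hhi_ys = total(prop^2), by(year sic)
assert reldif(hhi_sy, hhi_ys) < 1e-12

* 2) a zero contributes nothing: shares over positive sales give the same index
egen double prop_pos = pc(sale) if sale > 0, by(sic year) prop
egen double hhi_pos = total(prop_pos^2), by(sic year)
assert reldif(hhi_sy, hhi_pos) < 1e-12

* 2019: .5^2 + .25^2 + .25^2 = .375; 2020: .5^2 + .5^2 = .5
list, sepby(year)
```

Both assert commands pass silently when the statements hold, so no output from them is the expected outcome.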
Anuradha Saikia

Join Date: Aug 2020

Posts: 153
#8

07 Feb 2022, 04:14

Shakira Foster I am getting similar results. Were you able to sort out the issue?
Lumuse Musena

Join Date: Feb 2022

Posts: 1
#9

08 Feb 2022, 17:17

I have the same problem. The negative sales may reflect inter-segment sales or losses on sales. It is hard to tell. Maybe the only option is to ignore those firms.
