entropyetc available from SSC

Nick Cox

Join Date: Mar 2014

Posts: 35404
#1

21 Nov 2016, 12:24

With thanks as always to Kit Baum, I flag that a new program entropyetc is available from SSC for Stata 11.2 up.

The name entropyetc should be parsed "entropy, etc." and flags that it calculates Shannon entropy as one of a bundle of loosely related measures of diversity (concentration, inequality, heterogeneity, impurity, ...: the list of near synonyms in many literatures goes on and on).

There are now many user-written Stata programs in what appears to be the same territory, but in fact most seem written with income inequality in mind, so that the data arrive as incomes for groups or individuals and that variable is treated as it comes. Such programs usually carry across to any additive variable.

In contrast, entropyetc is one of a smaller group of programs whose main focus is diversity (the same comment about near synonyms applies) for categorical variables.

The difference is sometimes slurred over. If an input variable is categorical, it can't be added usefully (or meaningfully). For a recent thread raising this point see http://www.statalist.org/forums/foru...milarity-index

What can be added are frequencies, or more generally abundances, of categories.

A near equivalent to entropyetc is divcat from Dirk Enzmann, also on SSC, announced at http://www.statalist.org/forums/foru...ailable-on-ssc

The nearest equivalent is, however, my own ineq from 1998 (also on SSC). ineq assumes that the data are already summarized in terms of frequencies or other measures of abundance. Besides that, between 1998 and 2016 Stata has shifted and different coding is now both possible and natural. Rather than rewriting ineq drastically, it seemed better on reflection to leave the older program untouched. (Despite comments elsewhere, I am mindful of the need for old programs to remain accessible to the extent that that may be useful.)

I've tried to write entropyetc so that it is easy to clone and to modify (which doesn't mean: to plagiarize!). One detail that is in fact central to the design is that most of these measures boil down to about one line of Mata: the main contribution of a program is to make it easy, or at least easier, to collate results from several different groups. I hope to write further on that in due course.
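
To make the "about one line of Mata" point concrete, here is a minimal sketch. It is not the code inside entropyetc: it tabulates rep78 in the auto dataset and computes each measure directly from the vector of proportions. The matrix name freq and the Mata names f, p, H and simpson are illustrative choices, not names taken from the program.

```stata
sysuse auto, clear

* one-way table; matcell() leaves the category frequencies in a Stata matrix
quietly tabulate rep78, matcell(freq)

mata:
f = st_matrix("freq")          // column vector of frequencies (missing rep78 excluded)
p = f :/ sum(f)                // proportions
H = -sum(p :* ln(p))           // Shannon entropy
simpson = sum(p:^2)            // Simpson index, the sum of squared proportions
(H, exp(H), simpson, 1/simpson)
end
```

If the sketch is right, the four numbers should agree with the "all" row of the first example (1.358, 3.888, 0.297, 3.369). Each measure is indeed one line once p exists; the rest is bookkeeping.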

Here are a couple of examples. First, we treat rep78 from the auto dataset as a categorical variable. Second, we look at diversity of occupations within industries in the nlsw88 dataset and underline that results can be put into variables and become data for later analysis.

Code:

. sysuse auto
(1978 Automobile Data)

. entropyetc rep78

----------------------------------------------------------------------
    Group |  Shannon H      exp(H)     Simpson   1/Simpson     dissim.
----------+-----------------------------------------------------------
      all |      1.358       3.888       0.297       3.369       0.296
----------------------------------------------------------------------

. entropyetc rep78, by(foreign)

----------------------------------------------------------------------
    Group |  Shannon H      exp(H)     Simpson   1/Simpson     dissim.
----------+-----------------------------------------------------------
 Domestic |      1.201       3.323       0.383       2.612       0.363
  Foreign |      1.004       2.730       0.388       2.579       0.457
----------------------------------------------------------------------

. webuse nlsw88
(NLSW, 1988 extract)

. entropyetc occupation, by(industry) gen(2=numeq)

------------------------------------------------------------------------------------
                  Group |  Shannon H      exp(H)     Simpson   1/Simpson     dissim.
------------------------+-----------------------------------------------------------
  Ag/Forestry/Fisheries |      1.646       5.186       0.239       4.188       0.534
                 Mining |      0.562       1.755       0.625       1.600       0.846
           Construction |      1.399       4.050       0.353       2.832       0.597
          Manufacturing |      1.470       4.348       0.316       3.167       0.575
 Transport/Comm/Utility |      1.484       4.411       0.342       2.922       0.556
 Wholesale/Retail Trade |      1.740       5.698       0.214       4.681       0.554
Finance/Ins/Real Estate |      1.206       3.340       0.355       2.818       0.707
    Business/Repair Svc |      1.579       4.849       0.277       3.608       0.588
      Personal Services |      1.597       4.937       0.243       4.107       0.599
  Entertainment/Rec Svc |      1.712       5.538       0.218       4.587       0.516
  Professional Services |      1.590       4.902       0.219       4.558       0.612
  Public Administration |      1.195       3.304       0.404       2.473       0.701
------------------------------------------------------------------------------------

. egen tag = tag(industry)

. graph dot (asis) numeq if tag, over(industry, sort(1) descending) linetype(line)

Mike Lacy

Join Date: Apr 2014

Posts: 2404
#2

21 Nov 2016, 15:10

Thanks to Nick for pulling these together.

As it happens, I've been working with these measures recently and have noted, as Nick has, that various user-written programs for them exist. One additional point: for any measure that is a differentiable function of the multinomial p[i]s (e.g., the Simpson index), a relatively simple Delta Method standard error estimate can be calculated, of the form:

se(M) = sqrt(dM * S * dM'), where

M is the measure of interest, dM is a row vector in which dM[i] is the derivative of M with respect to p[i],
and S is the variance-covariance matrix of the multinomial p[i], with S[i,j] = -p[i] * p[j]/N for i != j and S[i,i] = p[i] * (1 - p[i])/N.

This approximation is easy to program, and, in my experience, works surprisingly well as compared to bootstrap results even with modest sized (say N = 150) samples. (There are cases, though, in which the approximation fails badly, e.g., for the Simpson index when the distribution is nearly uniform.)
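
As an illustration of how little code the formula needs, here is a minimal sketch for the Simpson index M = SUM p[i]^2, for which dM[i] = 2 p[i]. It assumes simple random sampling from a multinomial distribution and uses rep78 from the auto dataset purely as example data; the names freq, f, p, dM, S and se are illustrative.

```stata
sysuse auto, clear
quietly tabulate rep78, matcell(freq)

mata:
f  = st_matrix("freq")'            // row vector of category frequencies
N  = sum(f)
p  = f :/ N                        // estimated multinomial proportions
M  = sum(p:^2)                     // Simpson index
dM = 2 :* p                        // dM[i] = derivative of M with respect to p[i]
S  = (diag(p) - p' * p) :/ N       // p[i](1 - p[i])/N on the diagonal, -p[i]p[j]/N elsewhere
se = sqrt(dM * S * dM')            // delta-method standard error
(M, se)
end
```

One detail the sketch makes visible: each row of S sums to zero, so when all p[i] are equal dM is a constant vector and dM * S * dM' is exactly zero. That is consistent with the approximation breaking down for the Simpson index near a uniform distribution.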
Nick Cox

Join Date: Mar 2014

Posts: 35404
#3

22 Nov 2016, 02:21

Mike: Thanks for that. My program doesn't support any kind of error calculation. I'd imagine bootstrapping as an alternative, but someone would perhaps best be advised to write a wrapper that returns the specific quantity of interest.
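
A wrapper of the kind described might look like the following minimal sketch. The program name simpson_r and the returned result r(simpson) are hypothetical, and the choice of 200 replications and the seed are arbitrary; the wrapper computes the Simpson index itself rather than calling entropyetc.

```stata
capture program drop simpson_r
program simpson_r, rclass
    // hypothetical wrapper: returns the Simpson index of one variable in r(simpson)
    syntax varname
    tempname f s
    quietly tabulate `varlist', matcell(`f')
    mata: st_numscalar("`s'", sum((st_matrix("`f'") :/ sum(st_matrix("`f'"))):^2))
    return scalar simpson = scalar(`s')
end

sysuse auto, clear
keep if !missing(rep78)          // so that every resample draws from the same observations
bootstrap simpson = r(simpson), reps(200) seed(2016): simpson_r rep78
```

bootstrap may warn that the command does not set e(sample); dropping the missing values beforehand is one way to keep the resampling restricted to the observations actually used.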
Nick Cox

Join Date: Mar 2014

Posts: 35404
#4

24 Nov 2018, 03:57

entropyetc is now updated on SSC, thanks to Kit Baum. The main change is that I was a little dissatisfied with the internals, although users would have to work very hard to see any difference in the results. But this also makes public a fix made some time ago in my private files: previously entropyetc would fail with a string variable fed to by(). I noticed that for myself some time ago, but River Huang recently flagged the difficulty, making a public update a good idea.
River Huang

Join Date: Mar 2016

Posts: 1903
#5

24 Nov 2018, 17:08

Dear Nick, Many thanks for the updates.

Ho-Chuan (River) Huang
Stata 17.0, MP(4)
Biswa bhusan

Join Date: Jan 2018

Posts: 60
#6

29 May 2019, 13:21

Dear Nick, Thank you for your updates. I have estimated entropy using entropyetc. However, my data are time series, so I need to check the stability of the different entropy measures (particularly Shannon H) over time in a rolling-window framework, with an increment of 1 period between successive windows. Then, as a second step, I want to plot the entropy values on the y-axis against time on the x-axis. Could you please help with writing the code? For data, you may use sysuse tsline2.dta

Thank you
Nick Cox

Join Date: Mar 2014

Posts: 35404
#7

29 May 2019, 13:29

I can't see your code to help you with it.

Here's some strategic help. Use rangestat (SSC) for the rolling part and take Mata code from entropyetc and marry the two.

Good luck!
Biswa bhusan

Join Date: Jan 2018

Posts: 60
#8

30 May 2019, 01:29

Dear Nick, Thank you for your help. Here is the code I have used:

Code:

webuse lutkepohl2
tsset qtr
rolling, window(10): entropyetc dln_inv

However, Stata reports:

too many values
an error occurred when rolling executed entropyetc
r(134);

Similarly, I have used other examples you have suggested, as follows:

Code:

webuse grunfeld, clear
rangestat (entropyetc) invest , interval(year -6 0) by(company)

However, Stata reports:

<istmt>: 3499 entropyetc() not found
r(3499);

So I would welcome your suggestions. An example using the lutkepohl2 time series data rather than panel data would be easier for me to follow.

Thank you
Nick Cox

Join Date: Mar 2014

Posts: 35404
#9

30 May 2019, 01:38

Sorry, but there is some caprice in my answering at length, briefly, or at all on Statalist -- and here I can't answer at length given other commitments and I can only answer briefly.

For the record, I never suggested

Code:

rangestat (entropyetc) invest , interval(year -6 0) by(company)

and it's clear from studying the help for rangestat that that can't possibly work. You'll need to write some extra code, as I can't see that any one-liner will do what you want.

Nick Cox

Join Date: Mar 2014
Posts: 35404

#10

31 May 2019, 02:55

Finding some time to think about this I found it easier to work with rangerun (SSC).

Clearly you will need to change whatever, nobs and H to other variable names as needed or wished for your problem. The example uses windows of (at most) length 7 ending with the current observation, but again your choice is likely to be different. I show that results match those from entropyetc for the first and last complete windows in the toy dataset.

Code:

clear 
input whatever  
1 
2
3
4
5
6
7 
end 
expand whatever 
sort whatever 
gen t = 1989 + _n 

list 

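capture program drop shannon_h 
program shannon_h 
    // rangerun runs this program on the observations in each window
    tempname p 
    // frequencies of each distinct value within the window
    tab whatever, matcell(`p') 
    gen nobs = r(N) 
    mata: p = st_matrix("`p'") 
    // convert frequencies to proportions
    mata: p = p :/ sum(p) 
    // Shannon entropy H = -SUM p ln p
    mata: st_numscalar("H", -sum(p :* ln(p))) 
    gen H = scalar(H) 
end 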

rangerun shannon_h, int(t -6 0) 

qui entropyetc whatever in 1/7, gen(1=Hfirst) 

qui entropyetc whatever in -7/L, gen(1=Hlast) 
 
list , sep(7)

     +------------------------------------------------------+
     | whatever      t   nobs          H     Hfirst   Hlast |
     |------------------------------------------------------|
  1. |        1   1990      1          0   1.277034       . |
  2. |        2   1991      2   .6931472   1.277034       . |
  3. |        2   1992      3   .6365142   1.277034       . |
  4. |        3   1993      4   1.039721   1.277034       . |
  5. |        3   1994      5    1.05492   1.277034       . |
  6. |        3   1995      6   1.011404   1.277034       . |
  7. |        4   1996      7   1.277034   1.277034       . |
     |------------------------------------------------------|
  8. |        4   1997      7   1.078992          .       . |
  9. |        4   1998      7   1.004242          .       . |
 10. |        4   1999      7   .6829081          .       . |
 11. |        5   2000      7   .9556999          .       . |
 12. |        5   2001      7   .9556999          .       . |
 13. |        5   2002      7   .6829081          .       . |
 14. |        5   2003      7   .6829081          .       . |
     |------------------------------------------------------|
 15. |        5   2004      7   .5982696          .       . |
 16. |        6   2005      7   .7963116          .       . |
 17. |        6   2006      7   .5982696          .       . |
 18. |        6   2007      7   .6829081          .       . |
 19. |        6   2008      7   .6829081          .       . |
 20. |        6   2009      7   .5982696          .       . |
 21. |        6   2010      7   .4101163          .       . |
     |------------------------------------------------------|
 22. |        7   2011      7   .4101163          .       0 |
 23. |        7   2012      7   .5982696          .       0 |
 24. |        7   2013      7   .6829081          .       0 |
 25. |        7   2014      7   .6829081          .       0 |
 26. |        7   2015      7   .5982696          .       0 |
 27. |        7   2016      7   .4101163          .       0 |
 28. |        7   2017      7          0          .       0 |
     +------------------------------------------------------+

Nick Cox

Join Date: Mar 2014
Posts: 35404

#11

12 Jan 2024, 17:27

Thanks as ever to Kit Baum, entropyetc has now been updated on SSC. The latest version is version 3.

The previous version remains in the package as entropyetc2. The 2 arises because it was version 2.

Longish story short, the previous version ran into problems when users were working with thousands of categories. The program was using various commands and features that had various limits in various versions of Stata, namely Stata matrices, tabulate and tabdisp. There isn't a general fix or work-around short of re-writing the program with different syntax and different internal code. So, generate is now used for new variables and list for display.

A fresh look at the problem led me to drop the dissimilarity index and introduce calculation of the number of distinct categories.

To give some flavour, here are the results of running the examples in the help file. The graph is not shown here but is similar to that in #1.

Code:

. sysuse auto, clear 
(1978 automobile data)

. entropyetc rep78, list

  +-----------------------------------------------------------+
  |       distinct   Shannon H   exp(H)   Simpson   1/Simpson |
  |-----------------------------------------------------------|
  | all          5       1.358    3.888     0.297       3.369 |
  +-----------------------------------------------------------+

. entropyetc rep78, list by(foreign)

  +----------------------------------------------------------------+
  |  foreign   distinct   Shannon H   exp(H)   Simpson   1/Simpson |
  |----------------------------------------------------------------|
  | Domestic          5       1.201    3.323     0.383       2.612 |
  |  Foreign          3       1.004    2.730     0.388       2.579 |
  +----------------------------------------------------------------+

. 
. webuse nlsw88
(NLSW, 1988 extract)

. entropyetc occupation, list by(industry) gen(3=numeq)

  +-------------------------------------------------------------------------------+
  |                industry   distinct   Shannon H   exp(H)   Simpson   1/Simpson |
  |-------------------------------------------------------------------------------|
  |   Ag/Forestry/Fisheries          7       1.646    5.186     0.239       4.188 |
  |                  Mining          2       0.562    1.755     0.625       1.600 |
  |            Construction          7       1.399    4.050     0.353       2.832 |
  |           Manufacturing          8       1.470    4.348     0.316       3.167 |
  |  Transport/Comm/Utility          8       1.484    4.411     0.342       2.922 |
  |-------------------------------------------------------------------------------|
  |  Wholesale/Retail trade         10       1.740    5.698     0.214       4.681 |
  | Finance/Ins/Real estate          5       1.206    3.340     0.355       2.818 |
  |     Business/Repair svc          9       1.579    4.849     0.277       3.608 |
  |       Personal services          8       1.597    4.937     0.243       4.107 |
  |   Entertainment/Rec svc          7       1.712    5.538     0.218       4.587 |
  |-------------------------------------------------------------------------------|
  |   Professional services          7       1.590    4.902     0.219       4.558 |
  |   Public administration          9       1.195    3.304     0.404       2.473 |
  +-------------------------------------------------------------------------------+
(18 missing values generated)

. egen tag = tag(industry)

. graph dot (asis) numeq if tag, over(industry, sort(1) descending) ysc(alt) linetype(line) lines(lc(gs8) lw(vthin))

. 
. webuse grunfeld, clear

. entropyetc company [w=invest], list by(year)
(analytic weights assumed)

  +------------------------------------------------------------+
  | year   distinct   Shannon H   exp(H)   Simpson   1/Simpson |
  |------------------------------------------------------------|
  | 1935         10       1.606    4.985     0.286       3.502 |
  | 1936         10       1.584    4.875     0.283       3.535 |
  | 1937         10       1.620    5.052     0.273       3.666 |
  | 1938         10       1.730    5.640     0.242       4.134 |
  | 1939         10       1.667    5.294     0.265       3.772 |
  |------------------------------------------------------------|
  | 1940         10       1.601    4.957     0.280       3.569 |
  | 1941         10       1.652    5.217     0.263       3.798 |
  | 1942         10       1.606    4.985     0.277       3.605 |
  | 1943         10       1.597    4.938     0.285       3.507 |
  | 1944         10       1.660    5.260     0.276       3.622 |
  |------------------------------------------------------------|
  | 1945         10       1.698    5.465     0.266       3.757 |
  | 1946         10       1.660    5.259     0.267       3.742 |
  | 1947         10       1.709    5.523     0.250       4.005 |
  | 1948         10       1.732    5.654     0.240       4.160 |
  | 1949         10       1.683    5.379     0.259       3.862 |
  |------------------------------------------------------------|
  | 1950         10       1.644    5.178     0.272       3.672 |
  | 1951         10       1.712    5.540     0.248       4.034 |
  | 1952         10       1.693    5.435     0.257       3.895 |
  | 1953         10       1.614    5.025     0.292       3.424 |
  | 1954         10       1.532    4.627     0.337       2.966 |
  +------------------------------------------------------------+

Maarten Buis

Join Date: Mar 2014

Posts: 3426
#12

13 Jan 2024, 01:36

Why did you choose to drop the dissimilarity index?

---------------------------------
Maarten L. Buis
University of Konstanz
Department of history and sociology
box 40
78457 Konstanz
Germany
http://www.maartenbuis.nl
---------------------------------
Nick Cox

Join Date: Mar 2014

Posts: 35404
#13

13 Jan 2024, 02:33

I've lost interest in it in this context. Also, it's awkward to calculate as you need to keep track of zeros in categories that might have occurred but didn't in some subset.

All the measures included are related to the family SUM p^a (ln 1/p)^b for probabilities p and different a and b, which will become more prominent if I ever write up a longer paper on this topic.
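
To see how the family covers the reported measures, here is a minimal Mata sketch. The function name pfamily() and the proportions are made up for illustration. With a = 0, b = 0 the sum counts the distinct categories; a = 1, b = 1 gives Shannon H; a = 2, b = 0 gives the Simpson index. The other two columns, exp(H) and 1/Simpson, are transformations of those.

```stata
capture mata: mata drop pfamily()

mata:
// hypothetical helper: SUM p^a (ln 1/p)^b over the categories
real scalar pfamily(real vector p, real scalar a, real scalar b)
{
    return(sum((p:^a) :* (ln(1 :/ p)):^b))
}

p = (0.5, 0.3, 0.2)     // made-up proportions summing to 1
pfamily(p, 0, 0)        // a = 0, b = 0: number of distinct categories
pfamily(p, 1, 1)        // a = 1, b = 1: Shannon H
pfamily(p, 2, 0)        // a = 2, b = 0: Simpson
end
```

For these proportions the three results should be 3, roughly 1.030 and 0.38.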

Last edited by Nick Cox; 13 Jan 2024, 02:36.
Nick Cox

Join Date: Mar 2014

Posts: 35404
#14

24 Jun 2024, 01:31

Thanks to Kit Baum as always, entropyetc has been updated on SSC to fix a small bug tickled by Paris Rira as explained at #13 of https://www.statalist.org/forums/for...ith-entropyetc
Jonathan Horowitz

Join Date: Apr 2015

Posts: 95
#15

12 Jul 2024, 12:39

Dear Nick (or others who have used this):

I was looking at the Simpson index in the example in the first post and I'm having trouble interpreting it, because I always thought that "1" was "infinite diversity" and "0" was "no diversity". But it looks like this is flipped from my expectations, so it's more like the Herfindahl-Hirschman index? I mostly noticed this because I thought for sure Mining would have far fewer occupations, but then I realized I might be looking at it the wrong way.

Thanks for any time you can spend on this--it looks like exactly what I need.

Best,
Jonathan
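
A quick arithmetic check may help with this question. The shares 0.75 and 0.25 below are hypothetical, chosen only because they reproduce the Mining row (2 distinct occupations in #11, Simpson 0.625, 1/Simpson 1.600); the actual frequencies are not shown in the thread. On that reading, the Simpson column is the sum of squared proportions, so it behaves like a Herfindahl-Hirschman concentration measure: values near 1 mean little diversity. Its complement and its reciprocal run in the direction expected in the question.

```stata
* hypothetical shares for two occupations, chosen to match the Mining row
scalar simpson = 0.75^2 + 0.25^2

display simpson          // sum of squared shares: higher means more concentrated
display 1 - simpson      // complement: higher means more diverse
display 1 / simpson      // reciprocal: higher means more diverse
```

The three values are .625, .375 and 1.6, so the high Simpson value for Mining does indicate few, unevenly used occupations.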