With thanks as always to Kit Baum, I flag that a new program entropyetc is available from SSC for Stata 11.2 up.
The name entropyetc should be parsed "entropy, etc." and flags that it calculates Shannon entropy as one of a bundle of loosely related measures of diversity (concentration, inequality, heterogeneity, impurity, ...: the list of near synonyms in many literatures goes on and on).
There are now many user-written Stata programs in what appears to be the same territory, but in fact most seem written with income inequality in mind, so that the data arrive as incomes for groups or individuals and that variable is treated as it comes. Such programs usually carry across to any additive variable.
In contrast, entropyetc is one of a smaller group of programs whose main focus is diversity (the same comment about near synonyms applies) of categorical variables.
The difference is sometimes slurred over. If an input variable is categorical, it can't be added usefully (or meaningfully). For a recent thread raising this point see http://www.statalist.org/forums/foru...milarity-index
What can be added are frequencies, or more generally abundances, of categories.
A near equivalent to entropyetc is divcat from Dirk Enzmann, also on SSC, announced at http://www.statalist.org/forums/foru...ailable-on-ssc
The nearest equivalent is, however, my own ineq from 1998 (also on SSC). ineq assumes that the data are already summarized in terms of frequencies or other measures of abundance. Besides that, between 1998 and 2016 Stata has shifted and different coding is now both possible and natural. Rather than rewriting ineq drastically, it seemed better on reflection to leave the older program untouched. (Despite comments elsewhere, I am mindful of the need for old programs to remain accessible to the extent that that may be useful.)
I've tried to write entropyetc so that it is easy to clone and to modify (which doesn't mean: to plagiarize!). One detail that is in fact central to the design is that most of these measures boil down to about one line of Mata: the main contribution of a program is to make it easy, or at least easier, to collate results from several different groups. I hope to write further on that in due course.
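To make the "about one line of Mata" remark concrete, here is a minimal sketch. It is not the code inside entropyetc, just an illustration under stated assumptions: the frequencies are typed in by hand (they are the counts of rep78 = 1, ..., 5 in the auto dataset), the names f, p, H, D and dissim are made up for the example, and the dissimilarity line assumes the comparison is with equal shares 1/S, which is a reading consistent with the rep78 result rather than a quoted definition.

```stata
mata:
// frequencies of rep78 = 1, ..., 5 in auto.dta (missing values excluded)
f = (2, 8, 30, 18, 11)

// proportions: this is the step that makes sense for a categorical variable
p = f :/ sum(f)

// each measure is then about one line
H      = -sum(p :* ln(p))                    // Shannon entropy (natural logs)
D      = sum(p:^2)                           // Simpson: sum of squared proportions
dissim = sum(abs(p :- 1/cols(p))) / 2        // assumed: dissimilarity from equal shares

// display Shannon H, exp(H), Simpson, 1/Simpson, dissimilarity
H, exp(H), D, 1/D, dissim
end
```

To three decimal places the row displayed should agree with the "all" row that entropyetc rep78 reports. Note that a category with zero frequency would need special handling in the entropy line, since ln(0) is missing in Mata; the sketch sidesteps that because every category here is observed.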
Here are a couple of examples. First, we treat rep78 from the auto dataset as a categorical variable. Second, we look at diversity of occupations within industries in the nlsw88 dataset and underline that results can be put into variables and become data for later analysis.
Code:
. sysuse auto
(1978 Automobile Data)

. entropyetc rep78

----------------------------------------------------------------------
Group     |  Shannon H      exp(H)     Simpson   1/Simpson     dissim.
----------+-----------------------------------------------------------
all       |      1.358       3.888       0.297       3.369       0.296
----------------------------------------------------------------------

. entropyetc rep78, by(foreign)

----------------------------------------------------------------------
Group     |  Shannon H      exp(H)     Simpson   1/Simpson     dissim.
----------+-----------------------------------------------------------
Domestic  |      1.201       3.323       0.383       2.612       0.363
Foreign   |      1.004       2.730       0.388       2.579       0.457
----------------------------------------------------------------------

. webuse nlsw88
(NLSW, 1988 extract)

. entropyetc occupation, by(industry) gen(2=numeq)

------------------------------------------------------------------------------------
Group                   |  Shannon H      exp(H)     Simpson   1/Simpson     dissim.
------------------------+-----------------------------------------------------------
Ag/Forestry/Fisheries   |      1.646       5.186       0.239       4.188       0.534
Mining                  |      0.562       1.755       0.625       1.600       0.846
Construction            |      1.399       4.050       0.353       2.832       0.597
Manufacturing           |      1.470       4.348       0.316       3.167       0.575
Transport/Comm/Utility  |      1.484       4.411       0.342       2.922       0.556
Wholesale/Retail Trade  |      1.740       5.698       0.214       4.681       0.554
Finance/Ins/Real Estate |      1.206       3.340       0.355       2.818       0.707
Business/Repair Svc     |      1.579       4.849       0.277       3.608       0.588
Personal Services       |      1.597       4.937       0.243       4.107       0.599
Entertainment/Rec Svc   |      1.712       5.538       0.218       4.587       0.516
Professional Services   |      1.590       4.902       0.219       4.558       0.612
Public Administration   |      1.195       3.304       0.404       2.473       0.701
------------------------------------------------------------------------------------

. egen tag = tag(industry)

. graph dot (asis) numeq if tag, over(industry, sort(1) descending) linetype(line)