  • use of ineq command across multiple variables?

    hi everyone,

    for the past few weeks I have studied the entropyetc and ineq commands.
    I have found out that ineq is the appropriate command for the type of data that I am working with. However, I have found an issue I don't seem able to solve.

    I am working with a large dataset of about 5,000 rows, but to explain the issue I do not need to use it. I will use instead a dummy dataset that reproduces the structure of my data, and the problem I am trying to solve.

    To hopefully make things clearer, I would like to first show a simpler structure I was able to work with.
    I start with a dataset of three variables: species, year and abundance.
    I then use ineq to calculate (successfully) the shannon index, by year

    the "shannon" variable was generated with this code:

    Code:
    ineq abundance, by (year) genent(shannon)
    
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str18 species str4 year byte abundance float shannon
    "conttontail rabbit" "2019" 13 2.2446363
    "fox squirrel"       "2019"  7 2.2446363
    "gray fox"           "2019" 11 2.2446363
    "house cat"          "2019"  5 2.2446363
    "march rabbit"       "2019"  3 2.2446363
    "opossum"            "2019" 22 2.2446363
    "otter"              "2019"  2 2.2446363
    "raccoon"            "2019" 20 2.2446363
    "red fox"            "2019"  4 2.2446363
    "spotted skunk"      "2019"  3 2.2446363
    "striped skunk"      "2019" 15 2.2446363
    "wild cat"           "2019"  7 2.2446363
    "conttontail rabbit" "2020" 14 2.2341623
    "fox squirrel"       "2020"  7 2.2341623
    "gray fox"           "2020" 10 2.2341623
    "house cat"          "2020"  5 2.2341623
    "march rabbit"       "2020"  3 2.2341623
    "opossum"            "2020" 22 2.2341623
    "otter"              "2020"  1 2.2341623
    "raccoon"            "2020" 20 2.2341623
    "red fox"            "2020"  5 2.2341623
    "spotted skunk"      "2020"  3 2.2341623
    "striped skunk"      "2020" 14 2.2341623
    "wild cat"           "2020"  7 2.2341623
    "conttontail rabbit" "2021" 13 2.2438474
    "fox squirrel"       "2021"  7 2.2438474
    "gray fox"           "2021" 11 2.2438474
    "house cat"          "2021"  5 2.2438474
    "march rabbit"       "2021"  2 2.2438474
    "opossum"            "2021" 21 2.2438474
    "otter"              "2021"  2 2.2438474
    "raccoon"            "2021" 21 2.2438474
    "red fox"            "2021"  4 2.2438474
    "spotted skunk"      "2021"  4 2.2438474
    "striped skunk"      "2021" 14 2.2438474
    "wild cat"           "2021"  7 2.2438474
    end
    The problem arises when the data structure becomes slightly more complicated, as in the example below.

    I want to calculate the entropy index separately for each year within each region. So I should end up with 4 different indicators:
    one for alaska 2019, one for alaska 2020, one for minnesota 2019 and one for minnesota 2020.

    But I cannot find the right code for it. I looked through the ineq help file, and tested a few different options, without success.

    I tried something like the below:
    Code:
    ineq abundance, by (region year) genent(shannon)
    which returns an error. I also tried

    Code:
    ineq abundance, by (year) by (region) genent(shannon)
    or even swapping year and region around in the code, but it did not produce the results I would expect.

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str9 region str4 year str18 species byte abundance
    "alaska"    "2019" "conttontail rabbit" 13
    "alaska"    "2019" "fox squirrel"        7
    "alaska"    "2019" "gray fox"           11
    "alaska"    "2019" "house cat"           5
    "alaska"    "2019" "march rabbit"        3
    "alaska"    "2019" "opossum"            22
    "alaska"    "2019" "otter"               2
    "alaska"    "2019" "raccoon"            20
    "alaska"    "2019" "red fox"             4
    "alaska"    "2019" "spotted skunk"       3
    "alaska"    "2019" "striped skunk"      15
    "alaska"    "2019" "wild cat"            7
    "alaska"    "2020" "conttontail rabbit" 14
    "alaska"    "2020" "fox squirrel"        7
    "alaska"    "2020" "gray fox"           10
    "alaska"    "2020" "house cat"           5
    "alaska"    "2020" "march rabbit"        3
    "alaska"    "2020" "opossum"            22
    "alaska"    "2020" "otter"               1
    "alaska"    "2020" "raccoon"            20
    "alaska"    "2020" "red fox"             5
    "alaska"    "2020" "spotted skunk"       3
    "alaska"    "2020" "striped skunk"      14
    "alaska"    "2020" "wild cat"            7
    "minnesota" "2019" "conttontail rabbit" 10
    "minnesota" "2019" "fox squirrel"        5
    "minnesota" "2019" "gray fox"           14
    "minnesota" "2019" "house cat"           3
    "minnesota" "2019" "march rabbit"        7
    "minnesota" "2019" "opossum"            22
    "minnesota" "2019" "otter"               4
    "minnesota" "2019" "raccoon"            28
    "minnesota" "2019" "red fox"             2
    "minnesota" "2019" "spotted skunk"      13
    "minnesota" "2019" "striped skunk"      23
    "minnesota" "2019" "wild cat"            7
    "minnesota" "2020" "conttontail rabbit" 12
    "minnesota" "2020" "fox squirrel"        6
    "minnesota" "2020" "gray fox"           15
    "minnesota" "2020" "house cat"           4
    "minnesota" "2020" "march rabbit"        9
    "minnesota" "2020" "opossum"            23
    "minnesota" "2020" "otter"               2
    "minnesota" "2020" "raccoon"            24
    "minnesota" "2020" "red fox"             4
    "minnesota" "2020" "spotted skunk"      14
    "minnesota" "2020" "striped skunk"      24
    "minnesota" "2020" "wild cat"            8
    end

    I would appreciate any suggestions/comments

    Many thanks
    Nicola


  • #2
    -ineq- is a user submitted command developed in 1998 by Nick Cox. Please note that you are asked to tell us whether or not a command is from a third party. I had trouble finding a reference to this command online, and it is not clear whether Nick still plans to maintain this command 25 years later. From the help file, it looks like this should be the correct syntax:

    Code:
    ineq abundance, by(region year) genent(shannon)
    But I get an error as well in Stata 17:

    Code:
    too many variables specified
    option c() incorrectly specified
    This line appears to result in the first by() option being ignored:

    Code:
    ineq abundance, by (year) by (region) genent(shannon)
    Here is a possible work around. It appears to work correctly with your example data:

    Code:
    gen byvar = region + "_" + year
    ineq abundance, by(byvar) genent(shannon)
    Code:
    . ineq abundance, by(byvar) genent(shannon)
    
    -----------------------------------------------------------------------
        group |          byvar           freq        Simpson        entropy
    ----------+------------------------------------------------------------
            1 |    alaska_2019             12          0.124          2.245
            2 |    alaska_2020             12          0.125          2.234
            3 | minnesota_2019             12          0.127          2.230
            4 | minnesota_2020             12          0.117          2.275
    -----------------------------------------------------------------------
    
    --------------------------
        group |        dissim.
    ----------+---------------
            1 |          0.307
            2 |          0.304
            3 |          0.308
            4 |          0.273
    --------------------------
    Edit: should probably show you how I generated byvar. See above.
    Last edited by Daniel Schaefer; 09 Nov 2023, 14:17.



    • #3
      Hi Daniel,

      This is a work-around that I had not thought of, thank you very much.
      Also, thanks for pointing out we have to mention if the code is from a third party.


      And ...unfortunately, I seem to have found another obstacle.

      My dataset has 76,000 observations.
      I generated a new variable as you suggested, and when I run the command I get an error of too many values (see below)

      Do you know if there may be a way to solve this too?


      thanks
      Nicola

      Code:
      [P]     error . . . . . . . . . . . . . . . . . . . . . . . .  Return code 134
              too many values
              1) You attempted to encode a string variable that takes on
              more than 65,536 unique values.  2) You attempted to tabulate
              a variable or pair of variables that take on too many values.
              If you specified two variables, try interchanging them.
              3) You issued a graph command using the by option.  The
              by-variable takes on too many different values to construct
              a readable chart.



      • #4
        Hmmm. I guess if all you care about is the shannon index, then you can always just calculate it by hand. See below:

        Code:
        * Example generated by -dataex-. For more info, type help dataex
        clear
        input str9 region str4 year str18 species byte abundance
        "alaska"    "2019" "conttontail rabbit" 13
        "alaska"    "2019" "fox squirrel"        7
        "alaska"    "2019" "gray fox"           11
        "alaska"    "2019" "house cat"           5
        "alaska"    "2019" "march rabbit"        3
        "alaska"    "2019" "opossum"            22
        "alaska"    "2019" "otter"               2
        "alaska"    "2019" "raccoon"            20
        "alaska"    "2019" "red fox"             4
        "alaska"    "2019" "spotted skunk"       3
        "alaska"    "2019" "striped skunk"      15
        "alaska"    "2019" "wild cat"            7
        "alaska"    "2020" "conttontail rabbit" 14
        "alaska"    "2020" "fox squirrel"        7
        "alaska"    "2020" "gray fox"           10
        "alaska"    "2020" "house cat"           5
        "alaska"    "2020" "march rabbit"        3
        "alaska"    "2020" "opossum"            22
        "alaska"    "2020" "otter"               1
        "alaska"    "2020" "raccoon"            20
        "alaska"    "2020" "red fox"             5
        "alaska"    "2020" "spotted skunk"       3
        "alaska"    "2020" "striped skunk"      14
        "alaska"    "2020" "wild cat"            7
        "minnesota" "2019" "conttontail rabbit" 10
        "minnesota" "2019" "fox squirrel"        5
        "minnesota" "2019" "gray fox"           14
        "minnesota" "2019" "house cat"           3
        "minnesota" "2019" "march rabbit"        7
        "minnesota" "2019" "opossum"            22
        "minnesota" "2019" "otter"               4
        "minnesota" "2019" "raccoon"            28
        "minnesota" "2019" "red fox"             2
        "minnesota" "2019" "spotted skunk"      13
        "minnesota" "2019" "striped skunk"      23
        "minnesota" "2019" "wild cat"            7
        "minnesota" "2020" "conttontail rabbit" 12
        "minnesota" "2020" "fox squirrel"        6
        "minnesota" "2020" "gray fox"           15
        "minnesota" "2020" "house cat"           4
        "minnesota" "2020" "march rabbit"        9
        "minnesota" "2020" "opossum"            23
        "minnesota" "2020" "otter"               2
        "minnesota" "2020" "raccoon"            24
        "minnesota" "2020" "red fox"             4
        "minnesota" "2020" "spotted skunk"      14
        "minnesota" "2020" "striped skunk"      24
        "minnesota" "2020" "wild cat"            8
        end
        
        * calculate the shannon index
        bysort region year: gen double total = sum(abundance)
        by region year: gen double proportion = abundance / total[_N]
        by region year: gen shannon = proportion * ln(proportion)
        by region year: replace shannon = sum(shannon)
        by region year: replace shannon = -shannon[_N]
        Code:
        * Test against -ineq- results.
        gen byvar = region + "_" + year
        ineq abundance, by(byvar) genent(shannon_ineq)
        
        assert shannon == shannon_ineq
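
        A note on the comparison above: exact equality between two floating-point results can be fragile, because the trace of ineq later in this thread shows it creating its working variables without a storage type (so they are float by default), whereas the hand calculation keeps the proportions in double. The sketch below is only an illustration on a trimmed subset of the example data; the variable names p, h_double and h_float are hypothetical. It compares results with reldif() and a tolerance instead of ==.

        ```stata
        clear
        input str9 region str4 year byte abundance
        "alaska"    "2019" 13
        "alaska"    "2019"  7
        "alaska"    "2019" 11
        "minnesota" "2019" 10
        "minnesota" "2019"  5
        "minnesota" "2019" 14
        end

        * group totals and proportions, held in double precision
        egen double total = total(abundance), by(region year)
        gen double p = abundance / total

        * Shannon entropy per group, in double precision
        egen double h_double = total(-p * ln(p)), by(region year)

        * the same values stored with float precision (hypothetical stand-in
        * for a result produced by a command that stores floats)
        gen float h_float = h_double

        * compare with a tolerance rather than exact equality
        assert reldif(h_double, h_float) < 1e-6
        ```

        The tolerance of 1e-6 is an assumption chosen to sit comfortably above float rounding error; a tighter or looser value may suit other data.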



        • #5
          Daniel Schaefer is correct in identifying ineq as a community-contributed command (the terminology now recommended by StataCorp), first written by me in 1998 and accessible at SSC.

          Rummaging through my files I found a 2016 version which I never made public. A positive reason for letting ineq be was just in case someone had used it and wished to revisit or anyone else noticed that use and wished to play with it. But it's quite hard to find given the profusion of ways in which people talk about and work with measures of inequality. It's hard to distinguish from other commands to do with inequality. Back in 1998 if I recollect correctly no community-contributed command name could be longer than 8 characters and although I had scope for 4 more characters the name inequal was already taken and inequality was beyond reach.

          A bigger deal for me by far was as explained at https://www.statalist.org/forums/for...lable-from-ssc

          ineq came to seem awkward, especially for the ways in which I wanted to use it, and so I wrote entropyetc , also on SSC. That command name is more distinctive, whatever else anyone might think about it.

          So, where are we? Nicola Cenacchi and Daniel in #1 and #2 have helpfully identified a bug in ineq. It can't cope with two or more variables in by() even though the help implies that it can. The program doesn't fail until there is an attempt to show 6 (or more) variables as columns with tabdisp. That bug can be worked around by generating a composite variable, as Daniel flags.

          I don't yet understand the problem reported in #3. It may also be another problem with tabdisp. See

          Code:
          help limits
          in your Stata to see whether you are in effect asking for a table that is enormous. Also please

          Code:
          set trace on 
          set tracedepth 1
          before running ineq to see where the command crashes.

          As of now I can't see any advantage of ineq over entropyetc provided that you are using Stata 11.2 or a later version.

          The material below shows the correspondence between ineq and entropyetc for the data example in #1 (thanks!) . You don't need a work-around for two by() variables as entropyetc applies that internally.

          If the problem in #3 bites with entropyetc I will think about what to do. Note that the number of observations is not what might bite but the number of distinct results that are wanted.
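
          One way to check how many distinct results are being requested, before tabulating anything, is to count the groups directly. This is a minimal sketch on a trimmed subset of the example data, using only official commands; the variable name onepergroup is hypothetical.

          ```stata
          clear
          input str9 region str4 year byte abundance
          "alaska"    "2019" 13
          "alaska"    "2019"  7
          "alaska"    "2020" 14
          "minnesota" "2019" 10
          "minnesota" "2020" 12
          end

          * tag exactly one observation in each region-year group
          egen onepergroup = tag(region year)

          * the number of tagged observations is the number of table rows wanted
          count if onepergroup
          assert r(N) == 4
          display "distinct region-year groups: " r(N)
          ```

          On the full dataset, the same count tells you the number of rows a table of results would need, which is the quantity that matters here rather than the number of observations.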

          Pedantry corner: should be "cottontail rabbit", I guess.


          Code:
          * Example generated by -dataex-. For more info, type help dataex
          clear
          input str9 region str4 year str18 species byte abundance
          "alaska"    "2019" "conttontail rabbit" 13
          "alaska"    "2019" "fox squirrel"        7
          "alaska"    "2019" "gray fox"           11
          "alaska"    "2019" "house cat"           5
          "alaska"    "2019" "march rabbit"        3
          "alaska"    "2019" "opossum"            22
          "alaska"    "2019" "otter"               2
          "alaska"    "2019" "raccoon"            20
          "alaska"    "2019" "red fox"             4
          "alaska"    "2019" "spotted skunk"       3
          "alaska"    "2019" "striped skunk"      15
          "alaska"    "2019" "wild cat"            7
          "alaska"    "2020" "conttontail rabbit" 14
          "alaska"    "2020" "fox squirrel"        7
          "alaska"    "2020" "gray fox"           10
          "alaska"    "2020" "house cat"           5
          "alaska"    "2020" "march rabbit"        3
          "alaska"    "2020" "opossum"            22
          "alaska"    "2020" "otter"               1
          "alaska"    "2020" "raccoon"            20
          "alaska"    "2020" "red fox"             5
          "alaska"    "2020" "spotted skunk"       3
          "alaska"    "2020" "striped skunk"      14
          "alaska"    "2020" "wild cat"            7
          "minnesota" "2019" "conttontail rabbit" 10
          "minnesota" "2019" "fox squirrel"        5
          "minnesota" "2019" "gray fox"           14
          "minnesota" "2019" "house cat"           3
          "minnesota" "2019" "march rabbit"        7
          "minnesota" "2019" "opossum"            22
          "minnesota" "2019" "otter"               4
          "minnesota" "2019" "raccoon"            28
          "minnesota" "2019" "red fox"             2
          "minnesota" "2019" "spotted skunk"      13
          "minnesota" "2019" "striped skunk"      23
          "minnesota" "2019" "wild cat"            7
          "minnesota" "2020" "conttontail rabbit" 12
          "minnesota" "2020" "fox squirrel"        6
          "minnesota" "2020" "gray fox"           15
          "minnesota" "2020" "house cat"           4
          "minnesota" "2020" "march rabbit"        9
          "minnesota" "2020" "opossum"            23
          "minnesota" "2020" "otter"               2
          "minnesota" "2020" "raccoon"            24
          "minnesota" "2020" "red fox"             4
          "minnesota" "2020" "spotted skunk"      14
          "minnesota" "2020" "striped skunk"      24
          "minnesota" "2020" "wild cat"            8
          end
          
          egen which = group(region year), label 
          
          ineq abundance, by(which)
          
          entropyetc species [fw=abundance], by(which)
          
          entropyetc species [fw=abundance], by(region year)

          Code:
          . ineq abundance, by(which)
          
          --------------------------------------------------------------------------------------------------------------
              group | group(region year)                freq             Simpson             entropy             dissim.
          ----------+---------------------------------------------------------------------------------------------------
                  1 |        alaska 2019                  12               0.124               2.245               0.307
                  2 |        alaska 2020                  12               0.125               2.234               0.304
                  3 |     minnesota 2019                  12               0.127               2.230               0.308
                  4 |     minnesota 2020                  12               0.117               2.275               0.273
          --------------------------------------------------------------------------------------------------------------
          
          . 
          . entropyetc species [fw=abundance], by(which)
          
          ---------------------------------------------------------------------------
                   Group |  Shannon H      exp(H)     Simpson   1/Simpson     dissim.
          ---------------+-----------------------------------------------------------
             alaska 2019 |      2.245       9.437       0.124       8.041       0.307
             alaska 2020 |      2.234       9.339       0.125       7.985       0.304
          minnesota 2019 |      2.230       9.298       0.127       7.889       0.308
          minnesota 2020 |      2.275       9.732       0.117       8.536       0.273
          ---------------------------------------------------------------------------
          
          . 
          . entropyetc species [fw=abundance], by(region year)
          
          ---------------------------------------------------------------------------
                   Group |  Shannon H      exp(H)     Simpson   1/Simpson     dissim.
          ---------------+-----------------------------------------------------------
             alaska 2019 |      2.245       9.437       0.124       8.041       0.307
             alaska 2020 |      2.234       9.339       0.125       7.985       0.304
          minnesota 2019 |      2.230       9.298       0.127       7.889       0.308
          minnesota 2020 |      2.275       9.732       0.117       8.536       0.273
          ---------------------------------------------------------------------------



          • #6
            Two extra points:

            1. You're not dependent on there being a suitable community-contributed command. You can just calculate Shannon entropy directly with official commands. (Detail: With the method here, any term that is 0 ln (1/0) just drops out of the calculation. Stata does that because the result is taken to be missing, and missings are ignored in totals; but that's equivalent to using the strong convention that 0 ln (1/0) is taken to be 0, which is justifiable more rigorously, with the consequent zeros making no difference to totals.)

            2. If tabdisp chokes on what you're trying to tabulate, then list should always work. In principle list is willing to list the entire dataset; it will just take time and space to do that.
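
            The detail in point 1 can be checked with a tiny made-up example (three hypothetical abundances, one of them zero; the variable name term is also hypothetical). The zero abundance gives a missing term, which egen's total() ignores, so the entropy is that of two equally abundant categories, namely ln(2).

            ```stata
            clear
            input byte abundance
            0
            5
            5
            end

            egen long total = total(abundance)

            * individual terms: the zero abundance yields missing, not zero
            gen double term = (abundance / total) * ln(total / abundance)
            assert missing(term) if abundance == 0

            * total() ignores the missing term, equivalent to treating it as 0
            egen double entropy = total((abundance / total) * ln(total / abundance))
            assert reldif(entropy, ln(2)) < 1e-12

            list abundance term entropy, noobs
            ```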


            Code:
            * Example generated by -dataex-. For more info, type help dataex
            clear
            input str9 region str4 year str18 species byte abundance
            "alaska"    "2019" "conttontail rabbit" 13
            "alaska"    "2019" "fox squirrel"        7
            "alaska"    "2019" "gray fox"           11
            "alaska"    "2019" "house cat"           5
            "alaska"    "2019" "march rabbit"        3
            "alaska"    "2019" "opossum"            22
            "alaska"    "2019" "otter"               2
            "alaska"    "2019" "raccoon"            20
            "alaska"    "2019" "red fox"             4
            "alaska"    "2019" "spotted skunk"       3
            "alaska"    "2019" "striped skunk"      15
            "alaska"    "2019" "wild cat"            7
            "alaska"    "2020" "conttontail rabbit" 14
            "alaska"    "2020" "fox squirrel"        7
            "alaska"    "2020" "gray fox"           10
            "alaska"    "2020" "house cat"           5
            "alaska"    "2020" "march rabbit"        3
            "alaska"    "2020" "opossum"            22
            "alaska"    "2020" "otter"               1
            "alaska"    "2020" "raccoon"            20
            "alaska"    "2020" "red fox"             5
            "alaska"    "2020" "spotted skunk"       3
            "alaska"    "2020" "striped skunk"      14
            "alaska"    "2020" "wild cat"            7
            "minnesota" "2019" "conttontail rabbit" 10
            "minnesota" "2019" "fox squirrel"        5
            "minnesota" "2019" "gray fox"           14
            "minnesota" "2019" "house cat"           3
            "minnesota" "2019" "march rabbit"        7
            "minnesota" "2019" "opossum"            22
            "minnesota" "2019" "otter"               4
            "minnesota" "2019" "raccoon"            28
            "minnesota" "2019" "red fox"             2
            "minnesota" "2019" "spotted skunk"      13
            "minnesota" "2019" "striped skunk"      23
            "minnesota" "2019" "wild cat"            7
            "minnesota" "2020" "conttontail rabbit" 12
            "minnesota" "2020" "fox squirrel"        6
            "minnesota" "2020" "gray fox"           15
            "minnesota" "2020" "house cat"           4
            "minnesota" "2020" "march rabbit"        9
            "minnesota" "2020" "opossum"            23
            "minnesota" "2020" "otter"               2
            "minnesota" "2020" "raccoon"            24
            "minnesota" "2020" "red fox"             4
            "minnesota" "2020" "spotted skunk"      14
            "minnesota" "2020" "striped skunk"      24
            "minnesota" "2020" "wild cat"            8
            end
            
            replace species = subinstr(species, "conttontail", "cottontail", .)
            
            egen long total = total(abundance), by(region year)
            
            egen double entropy = total((abundance / total) * ln(total/abundance)), by(region year)
            
            format entropy %4.3f 
            
            tabdisp region year, c(entropy)
            
            egen tag = tag(region year)
            
            list region year entropy if tag, noobs

            Code:
            . tabdisp region year, c(entropy)
            
            ------------------------
                      |     year    
               region |  2019   2020
            ----------+-------------
               alaska | 2.245  2.234
            minnesota | 2.230  2.275
            ------------------------
            
            . list region year entropy if tag, noobs 
            
              +----------------------------+
              |    region   year   entropy |
              |----------------------------|
              |    alaska   2019     2.245 |
              |    alaska   2020     2.234 |
              | minnesota   2019     2.230 |
              | minnesota   2020     2.275 |
              +----------------------------+



            • #7
              Thanks, Nick, for your thoughtful advice (as always) and for your continued support for this community-contributed command. It looks like I can reproduce the issue reported in #3.

              Code:
              clear
              
              local num_species = 100
              local num_years = 1000
              local num_regions = 10
              local num_obs = `num_species' * `num_years' * `num_regions'
              
              set obs `num_obs'
              
              egen species = seq(), f(1) t(`num_species')
              egen year = seq(), f(1) t(`num_years') b(`num_species')
              egen region = seq(), f(1) t(`num_regions') b(`=`num_species' * `num_years'')
              gen abundance = runiformint(1, 100)
              egen byvar = group(region year), label
              
              set trace on 
              set tracedepth 1
              
              capture noisily ineq abundance, by(byvar) genent(shannon_ineq)
              
              set trace off
              Code:
              . capture noisily ineq abundance, by(byvar) genent(shannon_ineq)
                --------------------------------------------------------------- begin ineq ---
                - version 5.0
                - local varlist "max(1)"
                - local if "opt"
                - local in "opt"
                - local options "BY(string) Format(string) Numeq"
                - local options "`options' GENEnt(string) GENSim(string) GENDiss(string) *"
                = local options "BY(string) Format(string) Numeq GENEnt(string) GENSim(string)
              >  GENDiss(string) *"
                - local weight "fweight aweight noprefix"
                - parse "`*'"
                = parse "abundance, by(byvar) genent(shannon_ineq)"
                - tempvar touse sum prop propsq plnp group first freq diss
                - mark `touse' `if' `in'
                = mark __000000  
                - markout `touse' `varlist'
                = markout __000000 abundance
                - capture assert `varlist' >= 0 if `touse'
                = capture assert abundance >= 0 if __000000
                - if _rc {
                  di in r "`varlist' has negative values"
                  exit 411
                  }
                - if "`by'" != "" {
                = if "byvar" != "" {
                - unabbrev `by'
                = unabbrev byvar
                - local by "$S_1"
                = local by "byvar"
                - }
                - sort `touse' `by'
                = sort __000000 byvar
                - if "`exp'" != "" { local exp "* `exp'" }
                = if "" != "" { local exp "* " }
                - qui {
                - by `touse' `by' : gen double `sum' = sum(`varlist' `exp') if `touse'
                = by __000000 byvar : gen double __000001 = sum(abundance ) if __000000
                - by `touse' `by' : replace `sum' = `sum'[_N] if `touse'
                = by __000000 byvar : replace __000001 = __000001[_N] if __000000
                - gen `prop' = (`varlist' `exp') / `sum'
                = gen __000002 = (abundance ) / __000001
                - gen `propsq' = `prop'^2
                = gen __000003 = __000002^2
                - by `touse' `by' : replace `propsq' = sum(`propsq')
                = by __000000 byvar : replace __000003 = sum(__000003)
                - by `touse' `by' : replace `propsq' = `propsq'[_N]
                = by __000000 byvar : replace __000003 = __000003[_N]
                - if "`numeq'" != "" { replace `propsq' = 1 / `propsq' }
                = if "" != "" { replace __000003 = 1 / __000003 }
                - gen `plnp' = cond(`prop'==0, 0, `prop' * ln(`prop'))
                = gen __000004 = cond(__000002==0, 0, __000002 * ln(__000002))
                - by `touse' `by' : replace `plnp' = sum(`plnp')
                = by __000000 byvar : replace __000004 = sum(__000004)
                - by `touse' `by' : replace `plnp' = -`plnp'[_N]
                = by __000000 byvar : replace __000004 = -__000004[_N]
                - if "`numeq'" != "" { replace `plnp' = exp(`plnp') }
                = if "" != "" { replace __000004 = exp(__000004) }
                - by `touse' `by' : gen byte `first' = _n == 1 & `touse'
                = by __000000 byvar : gen byte __000006 = _n == 1 & __000000
                - gen int `group' = sum(`first')
                = gen int __000005 = sum(__000006)
                - gen str1 `freq' = ""
                = gen str1 __000007 = ""
                - by `touse' `by' : replace `freq' = string(_N)
                = by __000000 byvar : replace __000007 = string(_N)
                - by `touse' `by' : gen `diss' = sum(abs(`prop' - 1 / _N))
                = by __000000 byvar : gen __000008 = sum(abs(__000002 - 1 / _N))
                - by `touse' `by' : replace `diss' = `diss'[_N] / 2
                = by __000000 byvar : replace __000008 = __000008[_N] / 2
                - }
                - label var `group' "group"
                = label var __000005 "group"
                - if "`by'" == "" {
                = if "byvar" == "" {
                  label def `group' 1 all
                  label val `group' `group'
                  }
                - label var `propsq' "Simpson"
                = label var __000003 "Simpson"
                - label var `plnp' "entropy"
                = label var __000004 "entropy"
                - label var `freq' "freq"
                = label var __000007 "freq"
                - label var `diss' "dissim."
                = label var __000008 "dissim."
                - if "`format'" == "" { local format "%4.3f" }
                = if "" == "" { local format "%4.3f" }
                - tabdisp `group' if `first', c(`by' `freq' `propsq' `plnp' `diss') `options' 
              > format(`format')
                = tabdisp __000005 if __000006, c(byvar __000007 __000003 __000004 __000008)  
              > format(%4.3f)
              too many values
                ----------------------------------------------------------------- end ineq ---
              Looks like the problem line is the -tabdisp- command. I also decided to do a bit of a parameter search:

              Code:
              * collect one row of results per parameter combination in a separate frame
              frame change default
              capture frame drop results
              frame create results int num_species int num_years int num_regions byte failed
              foreach num_species of numlist 10(10)100 {
                  display "num_species: `num_species'"
                  foreach num_years of numlist 10(10)100 {
                      display "num_years: `num_years'"
                      foreach num_regions of numlist 10(10)100 {
                          * build a balanced species x year x region dataset
                          clear
                          local num_obs = `num_species' * `num_years' * `num_regions'
                          qui set obs `num_obs'
                          egen species = seq(), f(1) t(`num_species')
                          egen year = seq(), f(1) t(`num_years') b(`num_species')
                          egen region = seq(), f(1) t(`num_regions') b(`=`num_species' * `num_years'')
                          gen abundance = runiformint(1, 100)
                          * one by-group per region-year combination
                          egen byvar = group(region year), label
                          capture ineq abundance, by(byvar) genent(shannon_ineq)
                          * failed is 1 if ineq exited with any error, 0 otherwise
                          frame post results (`num_species') (`num_years') (`num_regions') (_rc != 0)
                      }
                  }
              }
              frame change results
              Warning: that takes a while to run. It looks to me like you are correct; the size of the table is indeed the issue in #3. Judging by which runs failed, the limit on the number of rows (that is, the number of by-groups) must be between 3,000 and 3,200.
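
              A minimal sketch of how those bounds can be read off the results frame, assuming that the number of table rows equals the number of distinct by-groups, which in the simulation above is num_years * num_regions. The variable name num_groups is introduced here purely for illustration.

              ```stata
              frame results {
                  * rows in the table = number of distinct (region, year) groups
                  generate long num_groups = num_years * num_regions
                  * largest group count for which ineq ran without error
                  summarize num_groups if !failed, meanonly
                  display "largest passing:  " r(max)
                  * smallest group count for which ineq exited with an error
                  summarize num_groups if failed, meanonly
                  display "smallest failing: " r(min)
              }
              ```

              The two displayed values bracket the limit; a finer grid of num_years and num_regions would be needed to pin it down exactly.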


              • #8
                Daniel Schaefer Thanks very much for #7 and indeed for #4, which wasn't visible when I started writing #5, which I guess took about an hour one way or another.

                I have only just noticed #4 which makes the same point as I made in #6. The code solutions are closer than they may seem as egen internally uses code very similar to #4.

                My mind boggles a bit at anyone wanting to see a table with 10000 elements, but neither ineq nor entropyetc offers an alternative listing. I guess I will rewrite entropyetc to use list and perhaps fix the SSC documentation for ineq to point to entropyetc.
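
                For anyone who only wants to inspect results without going through tabdisp, here is a rough sketch using official commands. It assumes that a variable shannon already exists and is constant within groups defined by region and year (names borrowed from the simulation in #7); first is a hypothetical helper variable. It is an illustration only, not the planned rewrite of entropyetc.

                ```stata
                * flag exactly one observation per (region, year) group
                egen byte first = tag(region year)
                * list one row per group, with a separator line between regions
                sort region year
                list region year shannon if first, noobs sepby(region)
                drop first
                ```

                As far as I know, list prints one line per selected observation and is not subject to the row limit that stops tabdisp here.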


                • #9
                  That sounds like a reasonable way to move forward. I also doubt anyone would want to look through a table with 10000 elements, but I suppose someone (like the OP) might want to generate a new variable that summarizes a measure of entropy over a large number of clusters. Such a person might not really care to look at a table, seeing the table as just a side effect of creating the new variable. Perhaps such a person would generally be better off calculating that new variable themselves. I really don't know what the expected use case is for entropyetc, though, let alone whether any of this is relevant.


                  • #10
                    Dear Daniel and Nick,

                    thank you very much for digging into my question.
                    I have to devote some time to understand all the points you are making, but perhaps a couple of clarifications may help.

                    My actual dataset is over 70,000 observations. Each observation is a pixel of 50 x 50 km on the world map. And, to give you a little more context, you can think of my large dataset as one where the "species" you saw in the dummy dataset I created are replaced by types of land cover (corn, wheat, forest, savanna, tundra, etc.). Each pixel contains a certain number of land covers, similar to how the regions in my example contain various species.

                    Now, I do not need to generate a table with the entropy indices (whether Shannon, Simpson or others). In fact, when I calculate Shannon using ineq I start my command with "quietly", and I generate a new variable with the Shannon value.
                    That is the value that I then plot on a map, color coded, to see how that measure changes across geographies and across years (2005 and 2050 in my case).

                    One additional piece of information: I successfully used ineq in the past with about half that number of observations, i.e., around 40,000. This is because initially I was running ineq first on all the 2005 data, and then on all the 2050 data. I then plotted my results successfully on a map.

                    I can still do that ... and go back to running each year separately.
                    But I am trying to learn and improve my coding and be able to deal with more complex datasets, which is the reason for my original question on using ineq over multiple variables.

                    You have provided thoughts/ideas on how to answer my questions, so I will carefully go through your answers and get back to you.

                    For the moment, many thanks again for your time, much appreciated.
                    Nicola


                    • #11
                      Nicola Cenacchi Thanks for the thanks. For your purposes I recommend just using official commands as explained in #4 and #6.
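
                      As a rough sketch of what a calculation with official commands can look like (this is not a copy of the code in #4 or #6): it assumes variables region, year and abundance as in the simulation in #7, with one observation per category within each region and year. The names total, p and shannon are chosen here for illustration, and natural logarithms are used.

                      ```stata
                      * total abundance within each (region, year) group
                      bysort region year: egen double total = total(abundance)
                      * share of each category within its group
                      generate double p = abundance / total
                      * Shannon index H = -sum(p * ln(p)); terms with p = 0 evaluate to
                      * missing, which egen's total() ignores, matching 0 * ln(0) = 0
                      bysort region year: egen double shannon = total(-p * ln(p))
                      ```

                      Nothing here builds a table, so the number of groups is limited only by the size of the dataset. With real data, replace region year by whatever variables define the groups, for example a pixel identifier and year.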

                      Meanwhile I have some work to do making entropyetc more flexible. I may well hit ineq or its documentation, at least to flag if not to fix some limitations.


                      • #12
                        See https://www.statalist.org/forums/for...lable-from-ssc for announcement of an update.

                        It was remiss of me not to thank Nicola Cenacchi and Daniel Schaefer for their questions identifying problems that could arise with large numbers of categories. That will be done in the next version of the help file.
