
  • egen bug? Failure to upgrade variable type.

    In Stata 15.1, some parts of egen apparently do not take the variable type into account, which can result in data loss and errors. For example:

    Code:
    clear
    set obs `=2^24 + 10'
    gen long x = _n
    egen group = group(x)
    format %21.0gc x group
    l in `=_N - 10' / `=_N'
    We can see that the output variable "group" does not have the requisite levels, because floats cannot accurately represent integers past 2^24. The source is lines 35-37 of _ggroup.ado, which essentially do
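    The precision limit behind this can be checked outside Stata. A minimal numpy sketch, assuming (correctly for Stata) that its 4-byte float matches IEEE single precision, i.e. numpy's float32; this illustrates the rounding behavior, not egen itself:

```python
import numpy as np

# Stata's default "float" is a 4-byte IEEE single, like numpy's float32.
# A single has a 24-bit significand, so integers are exact only up to
# 2^24 = 16,777,216; past that, consecutive integers start to collide.
limit = 2 ** 24

assert np.float32(limit - 1) != np.float32(limit)      # still distinct below 2^24
assert np.float32(limit + 1) == np.float32(limit)      # 2^24 + 1 rounds down to 2^24
assert np.float32(limit + 3) == np.float32(limit + 4)  # collisions continue above
```

    This is exactly why the example above, with 2^24 + 10 observations, produces group ids that repeat instead of counting all the way up.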

    Code:
    bys x: gen `type' group = (_n == 1)
    replace group = sum(group)
    When the user does not specify a type, it defaults to c(type), which in turn defaults to "float". However, the output of "group" will always be integer-valued, hence the following would be more appropriate when the user does not pass a type:

    Code:
    bys x: gen byte group = (_n == 1)
    replace group = sum(group)
    This correctly upgrades the variable type to "int" or "long" and also saves memory if the sum does not overflow. Another example:

    Code:
    clear
    set obs 10
    gen double x = 1e50
    egen y = sum(x)
    l
    collapse (sum) y = x
    l
    In this case, collapse correctly sums x but egen gives y as all missing.
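    The second example fails by overflow rather than rounding. A short numpy sketch of the same single-vs-double distinction (again an illustration of IEEE float behavior, not of Stata's internals):

```python
import numpy as np

# egen creates the result as a float (single precision) by default, while
# collapse computes the sum in double precision. 1e50 fits comfortably in
# a double but exceeds the single-precision maximum (about 3.4e38), so
# the float result overflows; Stata shows such out-of-range values as missing.
big = 1e50
assert np.isfinite(np.float64(big))  # double: representable, the sum works
assert np.isinf(np.float32(big))     # single: overflows to +inf
```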
    Last edited by Mauricio Caceres; 06 Mar 2018, 17:20.

  • #2
    In this regard, egen behaves identically to generate (as I would expect of a command whose name is meant to suggest extensions to generate), creating a new numeric variable with the storage characteristics specified by the user, or by a previous set default command, or lacking that, as a float.

    In contrast, the replace command does not allow the user to explicitly specify a new storage type (the recast command is reserved for that) but it will promote the storage type if required by the new data. Note though that while promotion will change integer variables (byte, int, or long) into an appropriate floating point type (float or double), and will promote integer types into larger integer types, it will not promote a float into a double in any event.



    • #3
      Interesting; I wondered if it was a design decision, and it would make sense that it was meant to mimic generate. I can see that "gen group = x" and "gen y = sum(x)" run into similar problems in my examples.

      I still find it odd because the commands are already not always consistent. "egen, group" involves a "replace", so, as you point out, "egen byte group = group(x)" will upgrade "group" from byte to long, while "gen byte group = x" will give you a lot of missing values.

      Anyway, it's good to think that there is some reason behind it. Thanks for the reply.



      • #4
        I think the pivot here is that the default default [repetition intended] variable type is float. Every now and again someone gets very agitated about this being a very bad idea -- often with some good grounds as it just takes some of their data not to fit into that type for this to bite the unwary -- to which the simplest answers are

        * If you want double, you can change the default! Feel free! Just watch your datasets inflate!

        * Your data are that precise? (Only the other day I was reading that the population of a certain area was 34,567,891 or some such and thinking What? All counted without error and no changes since the date you're not even giving? Sometimes avoiding spurious precision, among the first rudiments of numeracy, seems among the skills not taught any more.)

        I know, as a contributor to egen, that I've often consciously wired byte into the code as the result type when the results can only be 0 and 1 (or 0, 1, and missing), and even ignored the user-specified type if there was one. That underscores the importance of programmers thinking about the range of results a program could produce.

        This is the other end, however (very big numbers; or just possibly, very small numbers too) -- and everyone is right, I guess.

        1. It is considered a user error not to specify the variable type you really need. It's explicit in the syntax that you can do that, and it will sometimes be true that you must.

        2. At the same time, some egen functions can more commonly produce very large numbers than was true when they were first written, and it would be a good idea for StataCorp to revisit the code.

        In fact, from users' meetings and elsewhere, it seems clear that this is all part of the agenda in the territory of speed-ups and better performance with big datasets. But there are clearly no promises about what will be in Stata 16. Or 17, or ... infinity. My own request is just user-written functions in Stata, letting egen fade away slowly.

        Note. This is a corner for not very good jokes about bite and byte. It's the lack of doubles that is biting, if anything.



        • #5
          I don't think the default default (right, twice, because you can set a different default) being float is necessarily a bad idea. My confusion was more with it not being obvious that in some places variable types get upgraded and in others they don't (e.g. "replace" vs. "gen", or "collapse" vs. "egen"). Especially when the output of a function is meant to be a certain type (an integer in the case of "egen, group").

          Originally posted by Nick Cox View Post
          1. It is considered a user error not to specify the variable type you really need. It's explicit in the syntax that you can do that, and it will sometimes be true that you must.
          It seems to be the norm not to require the user to specify variable types. I can't think of many alternatives to Stata that are harsh on the user when a type is not specified (maybe Julia?). While specifying types is certainly good practice, if the expectation is that the user will not specify one, then we should program functions with that in mind, right?

          Anyway, egen is certainly a nice idea but I'd also second it fading away in lieu of user-written functions.



          • #6
            Following up on Nick's comment, I would argue that on the specific topic of "egen group" there is absolutely no benefit in defaulting to -float-. If you want a default, just default to -long-, which takes the same space (4 bytes) but offers much more precision (float starts giving incorrect results past 2^24, about 16.8 million groups, while -long- works up to about 2 billion groups).
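            The same-space, different-precision claim is easy to verify with numpy, treating Stata's float as IEEE float32 and its long as a 4-byte signed integer (a simplification: Stata's long actually tops out slightly below 2^31 - 1 because it reserves the largest values for missing codes):

```python
import numpy as np

# float32 and int32 (Stata's long) both take 4 bytes per value, but
# their exact-integer ranges differ: float32 is exact only up to 2^24,
# while int32 counts exactly all the way to 2^31 - 1.
assert np.dtype(np.float32).itemsize == np.dtype(np.int32).itemsize == 4
assert np.float32(2**24 + 1) == np.float32(2**24)  # float: group ids collide
assert int(np.int32(2**31 - 1)) == 2**31 - 1       # long: still exact
```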

            The reason I think that's important is that both -ftools- and -gtools- try to mimic egen's behavior, but there are places where that behavior is, at the very least, suboptimal.



            • #7
              What does group() do with very large values?



              • #8
                When there are more than 2^24 groups, "egen, group" will not have enough levels unless the result is a long or a double, so this can come up in especially large data sets (I recall working with 150M observations and 30M groups, so it's not just a theoretical concern).

