  • How does Stata choose numeric types for generating new variables?

    I believe that I've read an explanation for this at some point, but I can't find it, so if someone knows where it's explained, I'd appreciate a nudge in that direction.

    The issue I'm running into concerns how Stata decides which storage type to assign to a new variable. I have a variable of long numeric identifiers, and the problem can be reproduced like this.

    Code:
    clear
    set obs 1
    gen x = 1366548281
    gen long y = 1366548281
    format %15.0gc x y
    If I then go into the data editor and look at the values:
    x              y
    1,366,548,224  1,366,548,281
    The values are not the same: x is a float and y is a long.

    If I then run

    Code:
    gen z = y
    format %15.0gc z
    z is a float equal to 1,366,548,224, which obviously isn't the number I want.

    My question is: how does Stata decide which numeric type to use when generating a new variable if no type is specified? My problem arose because I assumed that Stata would generate z as a long (like y) but it generates it as a float instead. Also, when it generates the float, why does the number change from what I typed in? Since these are identifiers, a change of even a single digit is very problematic. I finally figured out the issue for my current project, but I'd be interested in knowing what is happening behind the curtain for future reference.

  • #2
    Stata's manual clarifies:
    generate [type] newvar[:lblname] = exp [if] [in]

    If no type is specified, the new variable type is determined by the type of result returned by =exp. A float variable (or a double, according to set type) is created if the result is numeric, and a string variable is created if the result is a string.

    New numeric variables are created as doubles by default if you change the default with set type (Specify default storage type assigned to new variables):

    set type {float|double} [, permanently]
    The numeric value changes from what you typed because it doesn't fit into float's precision. A float holds about 7 significant digits and stores integers exactly only up to 16,777,216, so a 10-digit value is rounded to the nearest number a float can represent. If that rounding is not tolerable, specify double as the storage type (or long, for integers of this size). You can't have both compact storage and high precision. Either go for precision and settle for doubles, or go for compactness and tolerate rounding.
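
    As a rough check of the rounding described here, the sketch below applies Stata's float() function, which rounds a value to float precision, to the ID from the question. Between 2^30 and 2^31 consecutive floats are 128 apart, so the ID lands on the nearest multiple of 128. The variable names are illustrative, and the sketch assumes the default type is still float.

    ```stata
    * Round the ID to float precision without creating a variable
    * (displays 1,366,548,224)
    display %15.0fc float(1366548281)

    clear
    set obs 1
    * No type given: stored with the default type (float assumed here)
    generate x = 1366548281
    * Explicit long and double: both store the integer exactly
    generate long y = 1366548281
    generate double d = 1366548281
    format %15.0fc x y d
    list x y d

    * x holds the rounded value; long and double agree with each other
    assert x == float(1366548281)
    assert y == d
    assert x != y
    ```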

    Your case is rather straightforward, since you can see the RHS and immediately identify the problem. In some cases the problem is less apparent, such as when the RHS is the result of the clock() function, which returns a count of milliseconds too large for a float. There the rounding can shift a timestamp by seconds or even about a minute, which is almost always not good.
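
    A minimal sketch of the clock() case, using a made-up timestamp and illustrative variable names, and assuming the default type is float. The two variables will generally display different times, and the last line shows the size of the rounding error in seconds.

    ```stata
    clear
    set obs 1
    * No type given: the millisecond count is rounded to float on storage
    generate t_float = clock("12nov2014 07:17:05", "DMYhms")
    * Explicit double: the millisecond count is stored exactly
    generate double t_double = clock("12nov2014 07:17:05", "DMYhms")
    format %tc t_float t_double
    list t_float t_double
    * Rounding error in seconds
    display (t_float[1] - t_double[1]) / 1000
    ```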

    Having double as the default type for generate would be a much safer option. IMHO Stata should get rid of the rudimentary set type setting, while still allowing users to specify the type directly and settle for floats if necessary.
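
    For anyone who wants that behaviour in their own session, a short sketch of switching the default with set type. It assumes the previous default was float and restores it at the end; the variable name is illustrative.

    ```stata
    * Make double the default storage type for this session
    set type double
    clear
    set obs 1
    * No type given, so x is now created as a double
    generate x = 1366548281
    assert x == 1366548281
    describe x
    * Restore the original default (assumed to be float)
    set type float
    ```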

    Your IDs should use longs if that is sufficient, or strings, especially if leading zeroes are essential. In some cases you can create your own IDs for the project from the original dataset IDs, and store the crosswalk file somewhere. This is especially helpful if the IDs are long and cumbersome, and it also prevents anyone from relying on information embedded in them, such as "6th to 8th digits are district code".
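
    The options above might look like the following sketch. All variable names and the crosswalk file name are hypothetical; the point is only that the type has to be stated (or copied with clonevar) for the ID to survive intact.

    ```stata
    clear
    set obs 1
    generate long y = 1366548281

    * Copy a long without losing the type: state the type, or use clonevar
    generate long z1 = y
    clonevar z2 = y
    assert z1 == y & z2 == y

    * Alternatively keep the ID as a string
    generate str10 id_str = string(y, "%10.0f")
    assert id_str == "1366548281"

    * Build a compact project ID from the original one
    egen long project_id = group(y)

    * Hypothetical file name; uncomment to write the crosswalk
    * save id_crosswalk, replace
    ```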

    Best, Sergiy Radyakin


    • #3
      I am puzzled by the implication that the documentation for this matter is elusive. The point is explained up front in the help for generate, which is the command being used here. With a numeric argument, the type defaults to float.

      It's fair to say that this question divides experienced and even expert users right down the middle. Sergiy certainly qualifies as an experienced and expert user, but here I disagree with him. Using double as default type would cause more problems than it solved for learners. For every variable that needs the user to specify long or double, there are many more, including indicator and categorical variables, for which a default of double would just waste memory and bloat dataset size unnecessarily.
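
      A small sketch of the memory point, with illustrative variable names: an indicator created without a type takes the default type, while an explicit byte is as small as possible. compress demotes variables to the smallest type that loses no information, so it recovers the space afterwards either way.

      ```stata
      clear
      set obs 1000
      * An indicator created with the default type and with an explicit byte
      generate flag_default = mod(_n, 2)
      generate byte flag_byte = mod(_n, 2)
      describe flag_default flag_byte
      * Demote every variable to the smallest type that holds it exactly
      compress
      describe flag_default flag_byte
      ```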

      Stata is here, as elsewhere, paying the user a compliment by assuming that you mean what you say and that you care enough to read the documentation to understand what you are doing. It is naturally arguable that the decision-making Stata makes on your behalf should be extended here, but that's a different question. It is certainly true that generate doesn't stick to the stance "if you don't say otherwise, you mean float" if you present it with a string argument.


      • #4
        Thank you both for your help and explanations. Following Sergiy's comments on float's precision, I found this blog post by William Gould that clarifies the precision of variable types to be quite helpful: http://blog.stata.com/2011/06/17/pre...-again-part-i/
        Last edited by David Muhlestein; 12 Nov 2014, 07:17.
