  • A problem when Stata functions return missing instead of an error?

    One of my colleagues recently discovered the following.

    We discussed how to proceed and decided to give the community fair warning about one of the more subtle aspects of Stata.
    The reason for our decision is that this is more a matter of design in Stata than an actual error.

    Consider the dataset:
    Code:
    clear
    input str6 Name str10 datestr
     Lena 09112008
     Laila 09112013
    end
    
    gen birthday = date( datestr , "DMY")
    drop datestr
    format %td birthday
    Now compare the -codebook- command with the -count- command:
    Code:
    . codebook Name if birthday < date("31122009","DMY")
    
    ----------------------------------------------------------------------------------------------------------------------------------------
    Name                                                                                                                         (unlabeled)
    ----------------------------------------------------------------------------------------------------------------------------------------
    
                      type:  string (str6), but longest is str5
    
             unique values:  2                        missing "":  0/2
    
                tabulation:  Freq.  Value
                                 1  "Laila"
                                 1  "Lena"
    
    . count if birthday < date("31122009","DMY")
        1
    Why does -codebook- include both observations, whereas -count- includes only one?

    Because codebook.ado starts like this:
    Code:
    *! version 1.5.1  26jan2012
    program codebook, rclass
        version 8.1, born(09sep2003) missing
    And the behavior of the date() function changed from Stata version 10 onwards:
    Code:
    . version 8: dis date("31122009","DMY")
    .
    . version 9: dis date("31122009","DMY")
    .
    . version 10: dis date("31122009","DMY")
    18262
    . version 11: dis date("31122009","DMY")
    18262
    . version 12: dis date("31122009","DMY")
    18262
    In other cases things go well, for instance:
    Code:
    . version 8:dis date("31.12.2009","DMY")
    18262
    . version 12:dis date("31.12.2009","DMY")
    18262
    So, depending on which version of Stata is in effect and what input you use, you'll get different results.
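
    A minimal sketch of how the cutoff could be written so that it should not depend on how date() parses an undelimited string. The first line relies on the assumption that mdy(), which takes numeric arguments, is not affected by the version setting; the second uses the delimited form shown above to work under both version 8 and version 12.
    Code:
    * Numeric arguments, so no string parsing is involved (assumption: mdy() is version-independent)
    codebook Name if birthday < mdy(12, 31, 2009)
    * Delimited date string, which date() parsed under version 8 as well
    codebook Name if birthday < date("31.12.2009", "DMY")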


    It might have been better to return an error, as happens for instance here:
    Code:
    . dis datt("31.12.2009","DMY")
    unknown function datt()
    r(133);
    This becomes very problematic when combined with the practice of setting different Stata versions in subroutines written at different times, and of course with the convention of treating a missing numerical value as an actual (very large) number.
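
    To illustrate why a silently missing cutoff selects everything, and one possible guard against it, here is a small do-file sketch. The local macro name cutoff is only illustrative; the idea is to compute the cutoff once under the caller's version, fail loudly if it is missing, and pass the number itself on.
    Code:
    * Missing sorts above every nonmissing number, so this displays 1
    display 17845 < .
    * Compute the cutoff once, under the version currently in effect
    local cutoff = date("31122009", "DMY")
    * Stop with an error instead of continuing with a missing cutoff
    assert !missing(`cutoff')
    * The command now receives a plain number, not a function call
    count if birthday < `cutoff' & !missing(birthday)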

    The same problem also occurs, e.g., for:
    Code:
    di =mdy(200, 121, 2001)
    .
    We did some more digging and found:
    Code:
    . codebook Name if birthday > date("31122009","DMY")
    no observations
    r(2000);
    Using trace we found:
    Code:
      - Codebook_vars `varlist' `if' `in' , `mv' `quiet' `notes' tabulate(`tabulate') `lnopt'
      = Codebook_vars Name if birthday > date("31122009","DMY")  ,    tabulate(9)
        --------------------------------------------------------------------------------------------------- begin codebook.Codebook_vars ---
        - syntax [varlist] [if] [in] [, Tabulate(integer 9) Mv Notes quiet languages(str) ]
        - marksample touse, novarlist
        - qui count if `touse'
        = qui count if __000001
        - if r(N) == 0 {
        - error 2000
    no observations
          }
    We feared that -syntax-/-marksample- might not evaluate conditions, and the functions within them, properly.
    But that we can't check.
    If that is the case, however, the problem could be quite widespread in Stata.
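
    A rough way to probe this from the outside might be two otherwise identical test programs that differ only in their version statement. The names chk_old and chk_new are hypothetical, the -syntax- and -marksample- lines are copied from the trace above, and version 12 of course requires Stata 12 or later.
    Code:
    capture program drop chk_old
    program chk_old
        version 8.1
        syntax [varlist] [if] [in]
        marksample touse, novarlist
        count if `touse'
    end

    capture program drop chk_new
    program chk_new
        version 12
        syntax [varlist] [if] [in]
        marksample touse, novarlist
        count if `touse'
    end

    * Same condition, different version in effect inside each program
    chk_old if birthday < date("31122009","DMY")
    chk_new if birthday < date("31122009","DMY")
    If the two counts differ, that would suggest the if-expression is evaluated under the version set inside the program rather than the caller's version.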

    The moral of this is that you have to check your conditions and Stata functions very carefully.
    Verify that you actually get what you want.

    That's all, folks
    Kind regards

    nhb

  • #2
    I wouldn't say that Stata treats very large numbers as missing values so much as that StataCorp reserves bit sequences at the upper limit of what each storage type can hold for missing values. So values > 100 of type byte are treated as missing because StataCorp reserved those last 27 integers for missing data of type byte. Because they did that, the data handling in Stata is pretty amazing. We can treat data as missing generically or can assign meaning to specific types of missingness via extended missing values. The bigger issue that you point out about the handling of date types is definitely worth a mention and was something I'd never noticed myself either.
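
    A short sketch of the ordering this implies, using only display and the missing() function; each line should display 1.
    Code:
    * Ordering: all nonmissing numbers < . < .a < ... < .z
    display 100 < .
    display . < .a
    display .a < .z
    * missing() is true for system and extended missing values alike
    display missing(.a)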



    • #3
      Try daily() not date() in this case.



      • #4
        Why use daily()? The documentation says: daily(s1,s2[,Y]) is a synonym for date(s1,s2[,Y]).
        Kind regards

        nhb



        • #5
          The idea was that it was introduced after Stata 8.1 and so would not behave differently under version control.

          On the other hand I just tried it and it appears that daily() is merely a wrapper for the same code and so that does not help.
          Last edited by Nick Cox; 17 Nov 2015, 02:47.



          • #6
            See this recent thread for an example where this version control issue bites.
