Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Conversion of string to number changes number

    Hi,

    I am having trouble when Stata converts a variable from numeric to a string, as it appears to be incrementing the number by 1.

    I have two variables that each hold dates. If one holds a date, the other is blank. Both date variables are numeric variables that are not yet converted to function as dates in Stata. I want a single variable that holds the date info from both variables, and use egen rowtotal. I then use tostring to convert the combined variable to a string so that I can use the Stata's date commands.

    Code:
    . desc EnrollmentEnd GraduationDate
    
    Variable      Storage   Display    Value
        name         type    format    label      Variable label
    -------------------------------------------------------------------------------------------------
    EnrollmentEnd   long    %10.0g                Enrollment End
    GraduationDate  long    %10.0g                Graduation Date
    
    . egen date_grad_last=rowtotal(EnrollmentEnd GraduationDate),missing
    (1,322 missing values generated)
    
    . tostring date_grad_last, gen(stdate_grad_last)
    stdate_grad_last generated as str8
    
    . format %15.0g date_grad_last
    Below are the inputs and output on a case level. The first and second columns are original variables that hold dates. The third column holds the combined variable (that results from egen rowtotal above), and the last column displays the results of the tostring command. Note that the last digit in the cases below change from "31" to "32" in case 2, and from "01" to "00" in case 14. These are problematic as these entries are dates and 00 and 32 generate errors

    Code:
    . list EnrollmentEnd GraduationDate date_grad_last stdate_grad_last in 2
    
         +-------------------------------------------+
         | Enroll~d   Gradua~e   date_g~t   stdate~t |
         |-------------------------------------------|
      2. |        .       20160531   20160532   20160532 |
         +-------------------------------------------+
    
         +-------------------------------------------+
         | Enroll~d      Gradua~e   date_g~t   stdate~t |
         |-------------------------------------------|
     14. | 20100601          .         20100600   20100600 |
         +-------------------------------------------+
    I use the "date" command to convert the last column (string) above to a date variable. As you can see below, the conversion for these cases with 'out of bounds' days ("00" and "32"), and the result is a missing entry.

    Code:
    gen dtdate_grad_last= date(stdate_grad_last, "YMD")
    format %td dtdate_grad_last
    
    . list EnrollmentEnd GraduationDate date_grad_last stdate_grad_last dtdate_grad_last in 2
    
         +------------------------------------------------------+
         | Enroll~d   Gradua~e   date_g~t   stdate~t   dtdate~t |
         |------------------------------------------------------|
      2. |        .        20160531   20160532 20160532          . |
         +------------------------------------------------------+
    
    . list EnrollmentEnd GraduationDate date_grad_last stdate_grad_last dtdate_grad_last in 14
    
         +------------------------------------------------------+
         | Enroll~d   Gradua~e   date_g~t   stdate~t   dtdate~t |
         |------------------------------------------------------|
     14. | 20100601          .      20100600   20100600          . |
         +------------------------------------------------------+
    Is there a way to avoid the incrementing of number in conversion to string?

    Thanks!

  • #2
    This is a precision issue. By default, when you create a new variable in Stata with -egen- or -gen-, it is stored as a -float- data type. But -float- does not have enough mantissa bits to hold all of those digits, so you are getting a truncation error. If you change your command to:
    Code:
    egen long date_grad_last=rowtotal(EnrollmentEnd GraduationDate),missing
    Stata will store the result as a long data type, which has enough bits for the purpose.

    Added: You might note that as your -describe- output shows, the original EnrollmentEnd and GraduationDate variables are stored as long. There's a reason for that!
    Last edited by Clyde Schechter; 11 Mar 2022, 14:27.

    Comment


    • #3
      A different point is that the sum of two dates in the present century expressed as integers of the order of 20 million has no obvious use or interpretation. Or what I am missing? If the point is that in practice one value is always missing using max() or min() to select the non-missing value seems more direct.
      Last edited by Nick Cox; 11 Mar 2022, 15:15.

      Comment


      • #4
        Since at least one of the two variables is always missing, the following would do the job without bothering with rowtotal, which is just misleading.
        Code:
        clonevar dtdate_grad_last = GraduationDate
        replace dtdate_grad_last = EnrollmentEnd if missing(dtdate_grad_last)
        and as a side benefit ensures that dtdate_grad_last is stored in the same way as GraduationDate - that is, as a long - and with the same format, if it had been given a meaningful format.

        Also, note that this code will work is the two variables are string variables, as long as the missing value is "" - that is, the 0-length empty string that is treated by Stata as a missing string value.

        But really, after importing data, step 1 should be to convert numeric values that are stored in string variables to numeric variables using destring, and step 2 is to convert to a Stata date variable anything that contains a date using the datetime functions - which is especially convenient if they were originally stored as strings.

        Comment


        • #5
          tostring is doing what you wanted, and fine by me. I am down as the original author, but no royalties there.

          But you can convert to string on the fly without needing to create new variable(s).

          Here is a demonstration for an individual value but the principle carries over to generate or replace for variables.

          Code:
          . di  %td   daily(string(20160531, "%8.0f"), "YMD")
          31may2016
          In this case, and others, you do need to specify the display format.

          Comment

          Working...
          X