Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Import CSV dates

    I am completely new to Stata. I have imported a dataset of approximately 50 columns by 500,000 rows. About eight of the columns are dates in the format YMD , eg 2015-05-14. The dataset imports nicely but all of the dates are in red, I assume this means they are strings.
    From the manual I have tried

    generate eventdate = date(admidate, "YMD")
    format eventdate %td

    I am not sure if I am on the right track here, - I assumed that I need to convert the string to a date format that Stata understood?. This comment seems to go through but the data remains the same.
    When I look at the date it is still red, – and when I look at the variable properties under format it says it is still %10s.

    Can anyone get me on the right track please.

  • #2
    you generated a new variable and that is the one you need to look at; what you did will not change the old variable in any way

    Comment


    • #3
      aah, thank you. I see. I tried to do this a different way using the menu bar, as I now can see how to change the foormat and learn the commands from this video
      Data>Create or change data>Create new variable
      create
      date(date2,"DMY")

      but Stata complained that "date2 not found"

      Why didn't Stata create it ?

      Comment


      • #4
        date2 has to be an existing variable for that to work. (Or a scalar -- but ignore that.) Stata is in effect saying "I don't know what that is".

        Comment


        • #5
          Not arguing, just trying to understand, - when I used command 'generate eventdate' Stata generated it, - presumably it is a known variable and different to one I have named ?

          Comment


          • #6
            Also, the video I followed was published by Stata and the variable they produced was a new user defined variable generated in the menu bar generated command.

            Comment


            • #7
              Not arguing either, but I don't understand what the problem is yet. Many people find videos helpful in learning and that's fine but they're awfully difficult to write about. But like Rich in #2 I am not even clear that you have a problem. From what you said you have a date variable now created in which case you can move on to whatever is next.

              Comment


              • #8
                re: #5 - I don't fully understand what you say but you can easily check whether it was generated via, e.g.,
                Code:
                describe eventdate
                to the extent that I do understand what I wrote, you appear to be confused - Stata should have, and according to you did, generate this new variable eventdate but you then ignored it

                Comment


                • #9
                  To Rich's diagnosis in post #8, let me add some advice I give to members who identify themselves as new users of Stata.

                  I'm sympathetic to you as a new user of Stata - there is a lot to absorb. I'd like to encourage you to take a step back from your immediate tasks.

                  When I began using Stata in a serious way, I started, as have others here, by reading my way through the Getting Started with Stata manual relevant to my setup. Chapter 18 then gives suggested further reading, much of which is in the Stata User's Guide, and I worked my way through much of that reading as well. There are a lot of examples to copy and paste into Stata's do-file editor to run yourself, and better yet, to experiment with changing the options to see how the results change.

                  All of these manuals are included as PDFs in the Stata installation and are accessible from within Stata - for example, through the PDF Documentation section of Stata's Help menu. The objective in doing the reading was not so much to master Stata - I'm still far from that goal - as to be sure I'd become familiar with a wide variety of important basic techniques, so that when the time came that I needed them, I might recall their existence, if not the full syntax, and know how to find out more about them in the help files and PDF manuals.

                  Stata supplies exceptionally good documentation that amply repays the time spent studying it - there's just a lot of it. The path I followed surfaces the things you need to know to get started in a hurry and to work effectively.

                  The problem with looking at random videos, or reading random help files, is that without an overall appreciation of how Stata is designed, it's easy to miss that what you were doing was creating a new variable with a new name that you would use going forward, not somehow magically changing the old variable with the old name from a string variable to a numeric variable.

                  Comment


                  • #10
                    Thanks William, sage advice. I have been reading the pdf manual. All I want to do to get started is import my data and tidy it up. Starting with the dates.
                    I want to turn columns with dates ( 2015-05-14 ) into a format Stata can understand and preferably formatted as DMY as I am in the Uk.
                    Seems simple enough but my first attempts all told me to create a new file formatted as I wanted. I tried two different methods.

                    1.Using the commands from the manual (just because it was an example I could understand)

                    generate admidate2 = date(admidate, "YMD")
                    format admidate2 %td
                    This has created a new column (eventdate)
                    This turns the date into 14may2015, which is not quite what I wanted.

                    2.Using the drop down
                    Data>Create or change data>Create new variable
                    create
                    date(admidate2,"DMY")

                    This results in an error
                    generate admidate2 = date(admidate2,"DMY")
                    admidate2 not found
                    r(111);

                    So, A few issues.
                    1. Why do the two methods differ in their success ( ie using dropdown doesnt work) when they use the same commands,
                    2. Why can i not just modify the date file in situ?,
                    3. I have 8 date fields, I dont want to create all the extra work of new columns editing the titles, dropping old columns etc.
                    Last edited by Nick Phillips; 19 Mar 2021, 13:48. Reason: typo

                    Comment


                    • #11
                      It is impossible to troubleshoot your problem with the drop-down approach because nobody can see how you filled in the boxes in the window. Suffice it to say, the command that Stata created, -generate admitdate2 = date(admitdate2, "DMY")- suggests to me that you attempt to create a variable named admitdate2, while specifying as its source the as yet non-existent admitdate2. That's my guess.

                      That said, if you have 8 date variables that are strings and you want to convert them to Stata internal format numerical date variables, you can write a quick loop. The hardest part of that is just putting the names of the 8 existing string variables into a local macro. Sometimes it's easy because they are named systematically, e.g. they all end in date and are the only variables which do so. Sometimes they are just 8 haphazard names with no common pattern. In that case you just have list them out. I'll assume the latter scenario. To reduce the tedium, I'll just illustrate with four such variables:

                      Code:
                      local to_convert date_of_birth marriage_date dt_first_child when_divorced
                      
                      foreach v of varlist `to_convert' {
                          gen _`v' = daily(`v', "YMD")
                          assert missing(`v') == missing(_`v')
                          format _`v' %td
                          order _`v', before(`v')
                          drop `v'
                          rename _`v' `v'
                      }
                      Several things to note about this code:

                      1. If it runs successfully, you will have replaced all of your original date variables with Stata internal format date variables,with the same names and in the same locations as the original string versions, and with the %td display format applied.

                      2. It may not run successfully. In large data sets there are usually errors, and date variables are particularly prone to error. It is not uncommon to encounter "dates" like 31 June, or 29 Feb in a non-leap year, etc. Stata's -date()- function (date() and daily() are the same function and are interchangeable in all respects) responds to those by returning missing value. You can also get missing values if Stata encounters a date that is valid but does not conform to the format specified ("YMD") in this case. In other words, given that the command calls for "YMD," if, interspersed among dates of that format, there is a value like "June 30, 2015," that, too, will trigger a missing value for the result. This is the rationale behind the -assert- command. If there are no invalid or non-conforming dates in the original data, then the only missing values for the new variable will be those cases where the original variable had a missing value (e.g. perhaps not everybody ha a divorce in the example shown). But if there are some in the data, the -assert- will fail. You will be able to tell which variable is at fault, because it will be the last one to have been converted to numeric before the code breaks, and a separate variable having that name preceded by underscore will still be in memory. If you then browse the observations for which that variable and the underscore-prefixed variable differ in missingness, you will see the offending observations and can do something about them.

                      3. No example data was provided, so this code is untested. It may contain typos or other errors.

                      Comment


                      • #12
                        Nick -
                        Back to the post #10. What you fail to understand in that post is the difference between the value of a numeric variable and how it is formatted for display.
                        Code:
                        . * example data
                        . list, noobs
                        
                          +------------+
                          |   admidate |
                          |------------|
                          | 2015-05-14 |
                          +------------+
                        
                        . * create a new variable (not "column") admidate2 (not "evendate")
                        . * with a Stata Internal Format numeric version of the date
                        . generate admidate2 = date(admidate, "YMD")
                        
                        . list, noobs
                        
                          +-----------------------+
                          |   admidate   admida~2 |
                          |-----------------------|
                          | 2015-05-14      20222 |
                          +-----------------------+
                        
                        . * it needs a format to display it as a date
                        . format admidate2 %td
                        
                        . list, noobs
                        
                          +------------------------+
                          |   admidate   admidate2 |
                          |------------------------|
                          | 2015-05-14   14may2015 |
                          +------------------------+
                        
                        . * that's not the display I want, so I'll change the format
                        . format admidate2 %tdCCYY-NN-DD
                        
                        . list, noobs
                        
                          +-------------------------+
                          |   admidate    admidate2 |
                          |-------------------------|
                          | 2015-05-14   2015-05-14 |
                          +-------------------------+
                        
                        . * that was the display I want, this is intentionally ridiculous
                        . format admidate2 %14.3fc
                        
                        . list, noobs
                        
                          +-------------------------+
                          |   admidate    admidate2 |
                          |-------------------------|
                          | 2015-05-14   20,222.000 |
                          +-------------------------+
                        
                        . * that's four different ways of displaying the same numeric value 20222.
                        More reading that you need to do.

                        Stata's "date and time" variables are complicated and there is a lot to learn. If you have not already read the very detailed Chapter 24 (Working with dates and times) of the Stata User's Guide PDF, do so now. If you have, it's time for a refresher. After that, the help datetime documentation will usually be enough to point the way. You can't remember everything; even the most experienced users end up referring to the help datetime documentation or back to the manual for details. But at least you will get a good understanding of the basics and the underlying principles. An investment of time that will be amply repaid. And amongst all the reading, be sure to read the details on creating display formats for SIF date values.

                        So, A few issues.
                        1. Why do the two methods differ in their success ( ie using dropdown doesnt work) when they use the same commands,
                        2. Why can i not just modify the date file in situ?,
                        3. I have 8 date fields, I dont want to create all the extra work of new columns editing the titles, dropping old columns etc.
                        1. The reason the menu approach failed is because you told it to "create a new variable" admidate2 using the variable admidate2 which being new does yet ot exist. Had it existed, the command would have failed because admidate2 cannot be a new variable since it already exists.
                        2. Why can't my Mini Cooper S hold three adults in the back seat? (Or even one, for that matter.) The answer to both questions is, that is the result of design decisions made by their creators.
                        3. Clyde's code in post #11 addressed that.

                        Comment


                        • #13
                          Clyde, William - this is very helpful and when I can get away from the dayjob (I am a surgeon) I will explore your advice and get back. I have been using python (pandas) up until now and the datetime object is equally challenging. I want to use Stata for many other advantages and reasons.
                          Last edited by Nick Phillips; 21 Mar 2021, 08:29.

                          Comment

                          Working...
                          X