Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Preserving leading zeros upon import of file (without importing all variables as string)

    Dear community,

    Is there a simply way how to import a .csv file that preserves leading zeros?

    I have a data set that allows partial times, and upon import, a time of 00:10 and a time of 10: _ _ , they both comes out at 10 when I import as .csv, which means I cannot distinguish between partial times and times during the night.

    I found the following page: https://stackoverflow.com/questions/...variable-names,
    but this imports all variables as string, which is something I do not want. Also because the dataset keeps evolving over time and I do not know in advance which variables it will contain on which I can apply destring. I would simply like to specify the variables that need to be imported as string.

    Would you happen to know if there is a way to do this?

    Thank you for your thoughts,

    Best wishes,

    Moniek

  • #2
    You can specify the column number that you want imported as a string using the -stringcols()- option. For example, if this is column #4

    Code:
    import delimited "myfile.txt", stringcols(4) clear

    Comment


    • #3
      Thank you Andrew for thinking along, I appreciate it!

      Is there still some easier way that allows for automation? The reason I ask is because my datasets change over time (e.g. in case of a database amendment), so the column numbers of the affected variables will also change over time. As I work on regulatory studies, the code needs to be validated and I do not want to bring out a new version with each database amendment, where I have to manually call each data set (I receive > 40 each week) and update the column number. This does not sound like an efficient way how to preserve leading zeros.

      Best wishes,

      Moniek
      Last edited by Moniek Bresser; 19 Sep 2023, 05:34. Reason: Update: removal of incorrect information

      Comment


      • #4
        Are you sure that your times have colons? Stata will not read

        00:10
        as a number unless it translates it to a time variable. I can suggest a workaround, but I do not think that you have described your problem properly. First, import a sample converting all variables to strings and again the same sample with no added options. Present both here using the dataex command.
        Last edited by Andrew Musau; 19 Sep 2023, 05:02.

        Comment


        • #5
          Originally posted by Moniek Bresser View Post
          I found out that R does preserve leading zeros in .csv files
          I am curious: How does R do that if not by treating the variable as a string? As far as I know, there is no such thing as leading zeros in numric variables in R, either.

          I think your best bet is to import all variables as string, then identify leading zeros. This can be automated.

          Comment


          • #6
            Apologies Andrew,

            Let me clarify, a time variable that is entered as 00:10 (in the database system), will export as 0010, but upon import into Stata, the leading zeros are removed. We also have partial times in the database system, entered as 10: in the database system and which exports as 10 and remains 10 upon import into Stata.

            Also, I corrected the statement about R preserving zeros, this is incorrect, apologies.

            Comment


            • #7
              Dear Daniel,

              Thank you for the idea, this seems like the best option for now. I still might have actual string variables (recorded as string in the system) that could start with "0", for example for free text variables. I appreciate this is not very likely, but still ideally I would want to call which variables I want to import as string.

              Sorry, to clarify about R: Upon import, you can call which variable to import as string, which I like better than calling the column number.

              So just to be 100% clear: It is not possible to indicate which variable to import as string in Stata, included in the import command? Only column number is possible?

              Thank you so much for the feedback,

              Moniek

              Comment


              • #8
                Well, the mapping from variable names to column number is available after import; e.g., as:

                Code:
                mata : st_varindex("varname")

                Comment


                • #9
                  Assuming that the variables with leading digits are always 4 digits in length as below, the following should select such variables.

                  *MYFILE.TXT
                  Code:
                  apples oranges pears tomatoes peaches
                  1 1355 green 50000 1111
                  2 0122 red 22000 2222
                  0 0055 orange 1300 0500
                  5 1333 yellow 2222 1300
                  2 1400 blue 1111 2222
                  *CODE
                  Code:
                  import delimited "myfile.txt", delimiter(space) stringcols(_all) clear
                  local cols
                  local i 0
                  foreach var of varlist *{
                      local ++i
                      qui count if ustrregexm(trim(itrim(`var')),  "^0[0-9]{3}")
                      if r(N)>0{
                          local cols `cols' `i'
                      }
                  }
                  import delimited "myfile.txt", delimiter(space) stringcols(`cols') clear
                  Res.:

                  Code:
                  Contains data
                   Observations:             5                  
                      Variables:             5                  
                  -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
                  Variable      Storage   Display    Value
                      name         type    format    label      Variable label
                  -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
                  apples          byte    %8.0g                
                  oranges         str4    %9s                  
                  pears           str6    %9s                  
                  tomatoes        long    %12.0g                
                  peaches         str4    %9s                  
                  -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
                  Sorted by:
                       Note: Dataset has changed since last saved.

                  Comment


                  • #10
                    Wouldn't it be simpler to import once, then apply destring to all variables that do not have leading zeros?

                    Code:
                    import delimited ...
                    foreach var of varlist _all {
                        
                        qui count if ustrregexm(trim(itrim(`var')),  "^0[0-9]{3}")
                        if ( !r(N) ) capture noisily destring `var' , replace
                    }
                    Last edited by daniel klein; 19 Sep 2023, 07:42.

                    Comment


                    • #11
                      Originally posted by daniel klein View Post
                      Wouldn't it be simpler to import once, then apply destring to all variables that do not have leading zeros?
                      Should be so.

                      Comment


                      • #12
                        I agree that this is easiest!. However, destringing all variables that do not have leading zeros is based on the assumptions that there will not be variables that take string in the database and that someone has entered text in the database that starts with a zero. Yes, very unlikely, but I would prefer the code to be bullet proof and not based on assumptions that may or may not hold.

                        Comment


                        • #13
                          Dear Daniel,

                          As I have never used Mata before: I tried your code and it indeed nicely gives me the column index of the respective time variable. Would you be so kind to share how I can save this mata column index into a Stata local?

                          Thank you and best wishes,

                          Moniek

                          Comment


                          • #14
                            You can manipulate Stata locals from within Mata via

                            Code:
                            st_local()
                            However, Mata takes data types seriously. What I mean is that st_varindex() returns a real scalar; you need to transform this into a string before you can store it in a Stata local:

                            Code:
                            mata : st_local(index, strofreal(st_varindex("varname")))
                            This will put the position of variable varname into local macro index.

                            Comment


                            • #15
                              Thank you Daniel,

                              I really appreciate your help.

                              Can I check with you if there is anything additional that I need installed? When I use the original statement, I indeed see that column index at a scalar,


                              Code:
                              ]import delimited "${data_raw}/test_file.csv", clear    
                              mata : st_varindex("imptimel")
                              But when I include st_local and try to generate the local macro index, I receive the following message that index cannot be found:

                              <istmt>: 3499 index not found

                              Code:
                              import delimited "${data_raw}/test_file.csv", clear    
                              mata: st_local(index, strofreal(st_varindex("varname")))
                              Thank you and best wishes,

                              Moniek

                              Comment

                              Working...
                              X