
  • how to find variables within a dataset without opening it

    Dear Statalist,

    I'm trying to write an ado that searches all datasets within a given directory for variables whose names are either given or can be determined using regular expressions.
    In the end I would simply like to pass a list to the ado, which can contain concrete variable names as well as regular expressions, or at least abbreviated variable names (such as *var, va*r or var*).
    I would know how to do it relatively easily if I opened the data sets, but the problem is that I would like to avoid exactly that because I am dealing with very, very large data sets and otherwise the runtime would simply be too long.
    So the challenge is to do it without fully opening any dataset.

    I was thinking about looping over something like
    Code:
    describe var* using dataset, varlist
    but unfortunately the returned r(varlist) does not contain only the variables that match var*, as I would have expected, but rather all the variables in the data set. I'm fairly sure there must be a simple way to do this, but I can't find it and would be very grateful for any help.

    To give everyone an easy-to-follow example:
    Code:
    // saving auto.dta filepaths to make the example executable for everyone
    quietly: sysuse auto
    local auto `r(fn)'
    clear
    
    describe t* using `"`auto'"', varlist
    // so the result shows the two variables trunk and turn
    
    display "`r(varlist)'"
    // but r(varlist) contains ALL variables in the dataset
    So my question is how to store the above result in a local.

    Thank you for your help in advance
    Benno





  • #2
    You could load just a single observation of the data set. That is faster than loading all of it:
    use file in 1
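
    A minimal sketch of how this could be used to collect the matching names, assuming the documented subset syntax -use in 1 using filename- and the wildcard t* from the example in #1. The local name matched is only illustrative:
    Code:
    // save the path to auto.dta so that the example runs for everyone
    quietly: sysuse auto
    local auto `r(fn)'
    
    // load only the first observation; all variables are still defined
    use in 1 using `"`auto'"', clear
    
    // expand the wildcard against the variables now in memory
    capture unab matched : t*
    if _rc local matched    // no match: leave the local empty
    
    display "`matched'"
    -unab- exits with an error when nothing matches, hence the -capture-.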



    • #3
      Hello Fernando,
      I really like the idea and that could solve my problem. Thank you very much.
      I knew there was an easy way, but sometimes you're stuck ;-)

      All the best,
      Benno



      • #4
        FWIW, in my projects that accumulate a large number of .dta files (most of them), I create an additional .dta file that I call whats_where.dta. It is basically an append of the results of -describe- on each of the other data sets, with an additional variable naming the .dta file, sorted on variable name. This serves as a handy index of variables that I use to remind myself of where to find particular variables. It can be kept open in a separate frame, and it can be searched for a given variable by exact name, or by wildcard using the -strmatch()- function.

        That's not quite the same functionality as having an ado that searches the directory in real time and returns a list, but I find it satisfactory for my needs. In part this works for me because my usual workflow on a project begins with the creation of all of the working data sets, followed by analysis coming later. Creation of new data sets, other than temporary files, once analysis has begun is uncommon for me.
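
        A rough sketch of how such an index could be built, assuming all data sets sit in the current working directory, there is at least one of them, and every file has at least one observation. The variable name filename and the overall layout are illustrative assumptions, not necessarily how whats_where.dta is actually built:
        Code:
        // list all .dta files in the current directory
        local files : dir "." files "*.dta"
        
        // start from an empty index
        clear
        tempfile index
        save `index', emptyok
        
        foreach f of local files {
            if "`f'" == "whats_where.dta" continue    // skip the index itself
            use in 1 using "`f'", clear
            describe, replace clear                  // one row per variable
            generate filename = "`f'"
            append using `index'
            save `index', replace
        }
        
        sort name filename
        save whats_where, replace
        
        // wildcard lookup with strmatch()
        list name filename if strmatch(name, "t*")
        Here -describe, replace- turns the variable descriptions into a data set with one row per variable, which is why each file is first loaded with a single observation.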
        Last edited by Clyde Schechter; 14 Jun 2024, 11:04.



        • #5
          Hello Clyde,
          that sounds like an interesting approach too. To produce Scientific Use Files for our research institution, we typically have to merge hundreds of individual data sets from surveys. I wrote the ado to ensure consistency between repeated surveys and to quickly find the corresponding variables in the individual data sets when errors or incorrect codings turn up. Thanks to Fernando's tip, everything works perfectly and very quickly. The preliminary version is now usable for me. Now I just need to add some understandable error messages and return values.
          But thank you again for your comment.



          • #6
            Depending on how complex you want the selection, you can use strmatch() in Mata and work with describe's returned variable list:

            Code:
            // saving auto.dta filepaths to make the example executable for everyone
            quietly: sysuse auto
            local auto `r(fn)'
            clear
            
            describe t* using `"`auto'"', varlist
            // so the result shows the two variables trunk and turn
            
            
            mata {
                
                r_varlist = tokens(st_global("r(varlist)"))
                
                selected_varlist = select(r_varlist,strmatch(r_varlist,"t*"))
                
                st_local("selected_varlist", invtokens(selected_varlist))
                
            }
            
            
            display "`r(varlist)'"
            // but r(varlist) contains ALL variables in the dataset
            
            display "`selected_varlist'"
            // this contains only the selected variable names
            This way, you won't even have to load the first observation of the datasets.

            Edit: re-reading the initial post, Mata has regular expressions, too, of course. You are not confined to strmatch().
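
            The same selection can also be sketched in plain Stata, without Mata, by looping over r(varlist) and testing each name with regexm(). The pattern ^t and the local names all and selected are illustrative:
            Code:
            quietly: sysuse auto
            local auto `r(fn)'
            clear
            
            // read the variable names from the file without loading any data
            quietly describe using `"`auto'"', varlist
            local all `r(varlist)'
            
            // keep only the names that match the regular expression
            local selected
            foreach v of local all {
                if regexm("`v'", "^t") {
                    local selected `selected' `v'
                }
            }
            
            display "`selected'"
            Replacing regexm("`v'", "^t") with strmatch("`v'", "t*") gives wildcard matching instead.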



            • #7
              Hi Daniel,

              I haven't worked with Mata yet and will therefore stick with the usual Stata on-board tools for now. But if you have a good tip on how to get started with Mata quickly, something like "Mata for dummies", please let me know.



              • #8
                Originally posted by Benno Schoenberger:
                But if you have a good tip on how to quickly get started with mata like "mata for dummies" please let me know.
                Not sure about quickly, but The Mata Book is the best I have read on Mata.
                Last edited by daniel klein; 18 Jun 2024, 08:40.
