how to find variables within a dataset without opening it

Benno Schoenberger

Join Date: Apr 2024

Posts: 61
#1

14 Jun 2024, 02:24

Dear Statalist,

I'm trying to write an ado that searches all datasets within a given directory for variables whose names are either given or can be determined using regular expressions.
In the end I would simply like to pass a list to the ado, which can contain concrete variable names as well as regular expressions or at least abbreviated variable names (such as *var, va*r or var*).
I would know how to do it relatively easily if I opened the data sets, but the problem is that I would like to avoid exactly that because I am dealing with very, very large data sets and otherwise the runtime would simply be too long.
So the challenge is to do it without fully opening any dataset.

I was thinking about looping over something like

Code:

describe var* using dataset, varlist

but unfortunately the returned r(varlist) does not contain only the variables that match var*, as I would have expected, but rather all the variables in the data set. I'm sure there must be a fairly simple way to do this, but I can't find it and would be very grateful for any help.

To give everyone an easy-to-follow example:

Code:

// saving auto.dta filepaths to make the example executable for everyone
quietly: sysuse auto
local auto `r(fn)'
clear

describe t* using `"`auto'"', varlist
// so the result shows the two variables trunk and turn

display "`r(varlist)'"
// but local varlist contains ALL variables in the dataset

So my question is how to store the above result in a local.

Thank you for your help in advance
Benno
FernandoRios

Join Date: Apr 2014

Posts: 2469
#2

14 Jun 2024, 02:30

You could open the data with only a single observation. That is faster than loading the whole file:
use file in 1
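
A minimal sketch of how this tip could be used in a loop over a folder. The folder (here simply the working directory) and the search patterns are placeholders chosen for illustration and are not from the thread. It relies on unab to expand each wildcard pattern against the variables of the file just loaded.

```stata
// Assumptions: the folder and the patterns are placeholders for illustration
local dir "`c(pwd)'"              // folder to search (here: working directory)
local patterns "t* *price* mpg"   // variable names or wildcard patterns

local files : dir "`dir'" files "*.dta"
foreach f of local files {
    // load only the first observation; skip files that cannot be read this way
    capture quietly use "`dir'/`f'" in 1, clear
    if _rc continue
    local found
    foreach p of local patterns {
        // unab exits with an error when nothing matches, hence capture
        capture unab hit : `p'
        if !_rc local found `found' `hit'
    }
    local found : list uniq found
    if "`found'" != "" display as text "`f': " as result "`found'"
}
clear
```

Note that unab also resolves unambiguous abbreviations when variable abbreviation is switched on, so a plain name in the pattern list may match a longer variable name.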
Benno Schoenberger

Join Date: Apr 2024

Posts: 61
#3

14 Jun 2024, 02:34

Hello Fernando,
I really like the idea and that could solve my problem. Thank you very much.
I knew there was an easy way, but sometimes you're stuck ;-)

All the best,
Benno
Clyde Schechter

Join Date: Apr 2014

Posts: 30097
#4

14 Jun 2024, 10:01

FWIW, in my projects that accumulate a large number of .dta files (most of them), I create an additional .dta file that I call whats_where.dta. It is basically an append of the results of -describe- on each of the other data sets, with an additional variable naming the .dta file, sorted on variable name. This serves as a handy index of variables that I use to remind myself of where to find particular variables. It can be kept open in a separate frame, and it can be searched for a given variable by exact name, or by wildcard using the -strmatch()- function.

That's not quite the same functionality as having an ado that searches the directory in real time and returns a list, but I find it satisfactory for my needs. In part this works for me because my usual workflow on a project begins with the creation of all of the working data sets, followed by analysis coming later. Creation of new data sets, other than temporary files, once analysis has begun is uncommon for me.
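
For readers who want to try such an index, here is a rough sketch of how it might be built. It is not Clyde's actual code. The folder name mydata is hypothetical and is assumed to contain at least one .dta file, and the sketch loads one observation per file so that describe with the replace option can turn the description into a dataset.

```stata
// Assumption: "mydata" is a hypothetical folder holding at least one .dta file
local dir "mydata"
local files : dir "`dir'" files "*.dta"

tempfile index
local first 1
foreach f of local files {
    if "`f'" == "whats_where.dta" continue   // do not index the index itself
    quietly use "`dir'/`f'" in 1, clear
    // replace the data in memory with one row per variable (name, type, ...)
    quietly describe, replace clear
    generate file = "`f'"
    if !`first' append using "`index'"
    quietly save "`index'", replace
    local first 0
}
sort name file
save "`dir'/whats_where.dta", replace

// look up variables by wildcard
list name file if strmatch(name, "t*")
```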

Last edited by Clyde Schechter; 14 Jun 2024, 10:04.
Benno Schoenberger

Join Date: Apr 2024

Posts: 61
#5

17 Jun 2024, 02:30

Hello Clyde,
that sounds like an interesting approach too. To produce Scientific Use Files for our research institution, we typically have to merge hundreds of individual data sets from surveys. I wrote the ado to ensure consistency between repeated surveys and to quickly find the corresponding variables in the individual data sets in the event of errors or incorrect codings. Thanks to Fernando's tip, everything works perfectly and very quickly. The preliminary version is now usable for me. Now I just need to add some understandable error messages and return values.
But thank you again for your comment.

daniel klein

Join Date: Mar 2014

Posts: 3848
#6

17 Jun 2024, 03:11

Depending on how complex you want the selection, you can use strmatch() in Mata and work with describe's returned variable list:

Code:

// saving auto.dta filepaths to make the example executable for everyone
quietly: sysuse auto
local auto `r(fn)'
clear

describe t* using `"`auto'"', varlist
// so the result shows the two variables trunk and turn


mata {
    
    r_varlist = tokens(st_global("r(varlist)"))
    
    selected_varlist = select(r_varlist,strmatch(r_varlist,"t*"))
    
    st_local("selected_varlist", invtokens(selected_varlist))
    
}


display "`r(varlist)'"
// but local varlist contains ALL variables in the dataset

display "`selected_varlist'"
// this contains only the selected variable names

This way, you won't even have to load the first observation of the datasets.

Edit: re-reading the initial post, Mata has regular expressions, too, of course. You are not confined to strmatch().
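
As a hedged illustration of that remark, the same selection could be written with Mata's regexm() in place of strmatch(). The pattern ^(t|m) and the local name regex_varlist are made up for this example, which again uses auto.dta so that it can be run as is.

```stata
// locate auto.dta so the example is executable for everyone
quietly sysuse auto
local auto `r(fn)'
clear

// r(varlist) holds ALL variable names of the file; nothing is loaded
quietly describe using `"`auto'"', varlist

mata {
    r_varlist = tokens(st_global("r(varlist)"))

    // illustrative pattern: names starting with t or m
    hits = select(r_varlist, regexm(r_varlist, "^(t|m)"))

    st_local("regex_varlist", invtokens(hits))
}

display "`regex_varlist'"
```

Only the pattern changes compared with the strmatch() version, so wildcard and regular-expression entries of a search list could be handled by the same few lines.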


Benno Schoenberger

Join Date: Apr 2024

Posts: 61
#7

18 Jun 2024, 06:50

Hi Daniel,

I haven't worked with mata yet and will therefore stick with the usual Stata on-board tools for now. But if you have a good tip on how to quickly get started with mata like "mata for dummies" please let me know.
daniel klein

Join Date: Mar 2014

Posts: 3848
#8

18 Jun 2024, 07:03

Originally posted by Benno Schoenberger

But if you have a good tip on how to quickly get started with mata like "mata for dummies" please let me know.

Not sure about quickly, but The Mata Book is the best I have read on Mata.

Last edited by daniel klein; 18 Jun 2024, 07:40.
