  • Is there a way to get the data type of a variable without loading the dataset into memory?

    I have an ado program that merges (in a complicated way, not just the "merge" command) two large data sets, and I'd like the program to perform type checking on each dataset before beginning the merge. Specifically, given a variable name and a "using" dataset, I'd like to get the data type of that variable in that dataset. For the dataset in memory, this is easily done with macro extended functions, but for a dataset stored on disk, I can't find a way to do this without manually parsing the dta format.

    Obviously, this information is available to Stata without loading the dataset because it's built into the dta format and commands like describe can access it. I don't want to load a 300 GB dataset into memory just to check the data types, and I'd prefer not to have to parse the text of "describe using ..." manually.
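
    For reference, the in-memory check I have in mind is the -type- extended macro function. A minimal sketch (using the auto data just as an example):

    Code:
    sysuse auto, clear
    local vartype : type price
    display "`vartype'"    // int -- but this only works for the dataset in memory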
    Last edited by Michael Anbar; 02 Jun 2016, 14:19.

  • #2
    I am not aware of any way to do this in Stata. If somebody else knows of a way, I hope they'll chime in here, because I have confronted this situation in my own work (although I have never handled a 300 GB dataset, so for me it's more a matter of convenience). If nobody has a solution, I'd suggest adding it to the Wish List for Stata 15 thread.

    • #3
      Code:
      describe using
      reports on types. If you want more, I guess you need to open and close a log and parse its contents.
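
      For instance, restricted to a single variable (the file name here is just a placeholder); as far as I can tell the storage type is only displayed, not left behind in r(), hence the suggestion to parse a log:

      Code:
      describe price using "bigdata.dta"
      return list    // r(N), r(k), etc. -- no storage types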

      Or you could read the first part of the .dta file and extract the information, but I would stick with the above.

      EDIT: That didn't really add to what was said previously. My mind wandered.
      Last edited by Nick Cox; 02 Jun 2016, 16:52.

      • #4
        If you want more, I guess you need to open and close a log and parse its contents.
        That's what I've done when I've needed the information. What I think Michael and I are both hoping for is something like a change to -describe using- so that this kind of information would be accessible in some more convenient way. What would work very well for me is if -describe using- had a -replace- option, just like that of -describe-. Then I could save the new data set and pull the information easily from there without having to write parsing code.
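
        A rough, untested sketch of that kind of log parsing ("bigdata.dta" and "myvar" are placeholders; it assumes the variable name starts its own line in the -describe- output):

        Code:
        * log the -describe using- output to a temporary text file, then scan it
        tempname fh lh
        tempfile dlog
        quietly log using "`dlog'.log", text name(`lh')
        describe myvar using "bigdata.dta"
        quietly log close `lh'
        * read the log line by line; the storage type is the second word
        * on the line that begins with the variable name
        file open `fh' using "`dlog'.log", read text
        file read `fh' line
        while r(eof) == 0 {
            if `"`: word 1 of `line''"' == "myvar" {
                local vartype : word 2 of `line'
            }
            file read `fh' line
        }
        file close `fh'
        display "myvar is stored as `vartype'"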

        Now, I don't know if that solution would work for Michael, because he might object to having to -preserve- his original data set before bringing the -describe using- results into memory if it, too, is of the order of 300 GB.

        • #5
          Withdrawn. Bad answer.
          Last edited by Steve Samuels; 02 Jun 2016, 16:45.
          Steve Samuels
          Statistical Consulting
          [email protected]

          Stata 14.2

          • #6
            Originally posted by Clyde Schechter

            That's what I've done when I've needed the information. What I think Michael and I are both hoping for is something like a change to -describe using- so that this kind of information would be accessible in some more convenient way. What would work very well for me is if -describe using- had a -replace- option, just like that of -describe-. Then I could save the new data set and pull the information easily from there without having to write parsing code.

            Now, I don't know if that solution would work for Michael, because he might object to having to -preserve- his original data set before bringing the -describe using- results into memory if it, too, is of the order of 300 GB.
            I don't think this will work for me, especially since on Windows, -preserve- seems to write to disk, and that's a major performance bottleneck even on an SSD. I was hoping there was a way to use a macro extended function, or even a Mata function, that would work like "local vartype: type <varname>" except on a dataset on disk.
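
            (For the in-memory case, the Mata counterpart of that extended macro function is st_vartype(); a minimal sketch is below. What seems to be missing is a version that takes a filename rather than operating on the data in memory.)

            Code:
            sysuse auto, clear
            mata: st_local("vartype", st_vartype(st_varindex("price")))
            display "`vartype'"    // int, but again only for the dataset in memory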

            Originally posted by Clyde Schechter
            If nobody has a solution, I'd suggest adding it to the Wish List for Stata 15 thread.
            Is StataCorp usually responsive to feature requests of this sort? I've only been using Stata since version 11, but from what I've seen of the wishlist threads, the features that get implemented (if any) are those that attract market share, not those that help programmers who are diving deeper into the internals.



            Last edited by Michael Anbar; 02 Jun 2016, 17:17.

            • #7
              Michael Anbar While helping user programmers is great, if a feature is only valuable to a small portion of the user community, it wouldn't make sense for StataCorp to invest time, effort, and financial resources in an additional feature that doesn't benefit the greatest number of current and future users. One possible alternative would be to parse the .dta header and grab the data from there. As an extended macro function this would perform poorly, given the high penalty on I/O operations, and if the results were cached it would risk tying up system resources in ways that could hurt performance elsewhere. I'm assuming the files were created by someone at some point, so perhaps they could send you a file specification/layout that you could read into memory and parse instead?
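
              For the header-parsing route, a rough sketch for the current XML-style layout (formats 117/118, as documented in -help dta-; "bigdata.dta" is a placeholder, and the 70-byte read assumes the <release> and <byteorder> tags sit at the very start of the file):

              Code:
              * peek at the first few bytes of the file without loading the data
              tempname fh
              file open `fh' using "bigdata.dta", read binary
              file read `fh' %70s header
              file close `fh'
              if regexm(`"`header'"', "<release>([0-9]+)</release>") local release = regexs(1)
              if regexm(`"`header'"', "<byteorder>(MSF|LSF)</byteorder>") local border = regexs(1)
              display "format `release', byte order `border'"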

              • #8
                One problem with parsing the .dta file header is that the header layout changes from version to version of Stata. So code written against the version 14 format may well fail once we hit version 15 or 16.

                • #9
                  It also depends on the underlying processor architecture of the machine that wrote the file and the machine that is reading the file. For example, if the file was written by a big endian machine and was being read by a little endian machine, the order of the bytes would need to be swapped.

                  • #10
                    For example, if the file was written by a big endian machine and was being read by a little endian machine, the order of the bytes would need to be swapped.
                    Why exactly would this be a problem? The byte order of the .dta file is recorded somewhere in the header. Otherwise Stata would also choke on such things.

                    Best
                    Daniel

                    Edit

                    And for "somewhere in the header" read: in the second byte. For more

                    Code:
                    help dta
                    Edit 2

                    That is true for Stata versions < 13 ..., but see Clyde's comment above.
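
                    A matching sketch for that older, pre-13 binary layout, where the first byte is the release number and the second byte is the byte order (1 = HILO, 2 = LOHI); "old.dta" is just a placeholder:

                    Code:
                    tempname fh
                    file open `fh' using "old.dta", read binary
                    file read `fh' %1bu release   // e.g. 115 for Stata 12
                    file read `fh' %1bu border    // 1 = HILO (big endian), 2 = LOHI (little endian)
                    file close `fh'
                    display "format `release', byte order `border'"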
                    Last edited by daniel klein; 06 Jun 2016, 11:50.

                    • #11
                      If the machine reading the file is big endian and the file was written on a little endian system, the program would need to swap the byte order; if the file was written on a big endian machine and is being read on a big endian machine, no swap is needed. I never said it was a problem, but it is another factor that needs to be considered in the programming, and I'm not sure whether there is a Stata command that returns the endianness of the system.

                      • #12
                        Indeed, it is something that needs consideration. I was just thinking that, compared to Clyde's point, it is a minor inconvenience. Anyone interested in writing such a thing should see the Mata function byteorder().
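
                        Something like this, as a quick check of the machine Stata itself is running on (not of any file):

                        Code:
                        mata: st_local("machine", strofreal(byteorder()))
                        * byteorder() returns 1 on HILO (big endian) machines and 2 on LOHI (little endian) ones
                        display cond(`machine' == 1, "this machine is HILO", "this machine is LOHI")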

                        Best
                        Daniel

                        • #13
                          Daniel Klein, thanks for pointing out that Mata function. I definitely was not aware of it previously and assume others may not have known about it either.
