use first x variables from a file without naming the variable names

Klaudia Erhardt

Join Date: Mar 2015

Posts: 74
#1

24 Apr 2018, 08:01

Is there a way to open a file with use ... using and address the variables not by their names (or wildcarded names) but by their position in the file, for instance variable number 1 through variable number 100?

The background of my question: I have written syntax to process a host of Stata files from a directory. Some of them have a lot of variables, which increases the runtime enormously. I want to open those files in a loop, using the first x variables, then the next x variables, and so on until the end.

I know I could open the file, extract the list of all variables, and use that list. But I wonder whether there is a way to address parts of the variables in the use <varlist> using <datafile> command if you have no knowledge at all of the variable names in that file. If that were possible, the syntax I write could be used by any user of our panel data (SOEP), even if he/she uses a Stata version that cannot open files with more than 2,047 variables.
Tags: data management, syntax
daniel klein

Join Date: Mar 2014

Posts: 3824
#2

24 Apr 2018, 08:13

Look at usesome (SSC) for an attempt. The code is buggy and requires an update, but I have not gotten around to doing this. Will write a bit more on the problems later ...

Best
Daniel
Robert Picard

Join Date: Mar 2014

Posts: 1536
#3

24 Apr 2018, 08:54

You can get the list of variables in a Stata dataset without having to load it into memory using describe. Here's an example where I make a list of the first 3 variables of an online dataset and then load only those variables.

Code:

describe using http://www.stata-press.com/data/r15/states, varlist
ret list

local vlist `r(varlist)'
local first3
forvalue i = 1/3 {
    local v = word("`vlist'",`i')
    local first3 `first3' `v'
}
dis "`first3'"

use `first3' using http://www.stata-press.com/data/r15/states, clear
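The same idea can be extended to the loop described in #1, walking through a file x variables at a time. The following is only a minimal sketch: the macro names file, chunk, and part are made up for illustration, and the chunk size of 3 is arbitrary. It uses the extended macro function `: word # of` instead of the word() string function, on the assumption that this is the safer choice for very long variable lists in older Stata releases, where string expressions were limited in length.

```stata
* Hypothetical placeholders: set file and chunk to suit your data.
local file  "http://www.stata-press.com/data/r15/states"
local chunk 3

* Read the variable names without loading the data.
describe using "`file'", varlist
local vlist `r(varlist)'
local nvars : word count `vlist'

* Walk through the file `chunk' variables at a time.
forvalues start = 1(`chunk')`nvars' {
    local stop = min(`start' + `chunk' - 1, `nvars')

    * Collect the names at positions `start' through `stop'.
    local part
    forvalues i = `start'/`stop' {
        local v : word `i' of `vlist'
        local part `part' `v'
    }

    * Load only this block of variables.
    use `part' using "`file'", clear
    * ... process the block here ...
}
```

The last block is simply shorter when the number of variables is not a multiple of the chunk size. Like the example above, this still depends on r(varlist) fitting into a macro.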

Last edited by Robert Picard; 24 Apr 2018, 09:18.
Klaudia Erhardt

Join Date: Mar 2015

Posts: 74
#4

24 Apr 2018, 09:15

Robert Picard

Thank you very much! While it does not answer my question about how to use variables by their index, it solves my problem anyway. I did not know that I can use describe without loading the file into memory. I just tested it with a small Stata version, and it works.

daniel klein

Thank you for answering. I had a look at your ado file but did not understand how I could incorporate parts of it into my own syntax. I don't want to ask the prospective and unknown users of my application to install an ado file before they can use it.
My problem was solved by Robert's answer, but hopefully the Stata developers will add the possibility to address variables by their position in the file in the near future.
daniel klein

Join Date: Mar 2014

Posts: 3824
#5

24 Apr 2018, 10:11

Originally posted by Klaudia Erhardt

daniel klein

My problem was solved by Robert's answer

I would not be so sure about that. Try creating a Stata dataset with 32,000 (actually you can only create 31,999 [well, nowadays 119,999]) variables with long names in Stata SE or MP, then run describe with the varlist option in Stata IC; it will choke and throw error 103, because the maximum length of macros such as r(varlist) is limited according to c(maxvar). You can even stick with Stata SE or MP. Run this

Code:

clear all
set maxvar 32000
forvalues j = 1/31999 {
    generate longvariablename`j' = 42
}
save toolarge.dta
clear
set maxvar 2048
describe using toolarge.dta , varlist

to get

Code:

(output omitted)
longvariablename31999 float %9.0g
-------------------------------------------------------------------------------
Sorted by:
too many variables
r(103);

This might be called a bug, but it is why you will have to read the variable names from the dta file itself under those circumstances, which is what usesome tries to do (and does not do correctly with dataset labels or in Stata versions > 13).
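A possible alternative that keeps the names out of one long macro is the replace option of describe, which turns the description into a dataset with one observation per variable, including the variables position and name. The following is an untested sketch under that assumption; the positions 1 through 100 and the macro name part are only examples, and whether describe using with replace avoids error 103 on a file as large as toolarge.dta in every flavor and version is not verified here.

```stata
* Replace the data in memory with a description of the file:
* one observation per variable, with -position- and -name-.
describe using toolarge.dta, replace clear

* Keep only the variables at positions 1 through 100.
keep if inrange(position, 1, 100)

* Build a short varlist from the remaining observations.
local part
forvalues i = 1/`=_N' {
    local part `part' `=name[`i']'
}

* Load just those variables from the original file.
use `part' using toolarge.dta, clear
```

Only the names of the selected block end up in a macro, so the macro stays short no matter how many variables the file contains.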

Best
Daniel
Klaudia Erhardt

Join Date: Mar 2015

Posts: 74
#6

25 Apr 2018, 03:05

You are completely right, Daniel: the maximum macro length is a problem with this solution.

Indeed, my syntax could not be run with Small Stata, which has a maximum macro length of 13,400. The list of varnames of our biggest file (3,170 variables at the moment) is 26,105 characters long, including blanks. There is enough room until we reach the limit of Stata IC, which is 165,200 characters, though.

The main reason why I want to segment the files I am processing is the runtime issue. I noticed that processing 680 cross-sectional files with more than 60,000 variables overall takes only 1 1/2 hours, while processing one of the files with 3,100 variables (and 620,000 obs) takes 4.5 hours!
I have to experiment to find out whether segmenting the observations or the variables yields the lower runtime.

Anyway, I think it would be very useful to be able to address variables in a file by their index instead of by their name.
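Such an experiment could be set up with Stata's timer command. The following is a rough sketch only: the file name bigfile.dta, the block sizes, and the macro names are hypothetical, and the actual processing would have to go inside each timed section.

```stata
* Hypothetical file name and block sizes; adjust to the data at hand.
local file "bigfile.dta"

* Variable names, read without loading the data.
describe using "`file'", varlist
local vlist `r(varlist)'

* Names of the first 100 variables (fewer if the file is smaller).
local part
forvalues i = 1/100 {
    local v : word `i' of `vlist'
    local part `part' `v'
}

timer clear

* Timer 1: a block of observations, all variables.
* (Fails if the file has fewer than 100,000 observations.)
timer on 1
use in 1/100000 using "`file'", clear
* ... processing goes here ...
timer off 1

* Timer 2: a block of variables, all observations.
timer on 2
use `part' using "`file'", clear
* ... processing goes here ...
timer off 2

* Show the elapsed time for both timers.
timer list
```

The two blocks are not the same size in cells, so the timings are only comparable after scaling them to the full file.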