  • use first x variables from a file without naming the variable names

    Is there a way to load variables from a file by addressing them not by their names (or wildcarded names) but by their position in the file, for instance variable number 1 through variable number 100?

    The background of my question: I have written syntax to process a host of Stata files from a directory. Some of them have a lot of variables, which increases the runtime considerably. I want to open those files in a loop, loading the first x variables, then the next x variables, and so on until the end.

    I know I could open the file, extract the list of all variables, and use that list. But I wonder whether there is a way to address a subset of the variables in the use <varlist> using <datafile> command if you have no knowledge at all of the variable names in that file. If that were possible, the syntax I write could be used by any user of our panel data (SOEP), even if he/she uses a Stata version that cannot open files with more than 2,047 variables.

  • #2
    Look at usesome (SSC) for an attempt. The code is buggy and requires an update, but I have not gotten around to doing this. Will write a bit more on the problems later ...

    Best
    Daniel



    • #3
      You can get the list of variables in a Stata dataset without having to load it into memory using describe. Here's an example where I make a list of the first 3 variables of an online dataset and then load only those variables.

      Code:
      describe using http://www.stata-press.com/data/r15/states, varlist
      ret list
      local vlist `r(varlist)'
      
      local first3
      forvalue i = 1/3 {
          local v = word("`vlist'",`i')
          local first3 `first3' `v'
      }
      dis "`first3'"
      
      use `first3' using http://www.stata-press.com/data/r15/states, clear
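
      The same idea extends to the loop described in #1, where the file is loaded in successive blocks of x variables. What follows is only a minimal sketch: the block size x is a hypothetical value, the macro names are made up for illustration, and the file is the example dataset used above. It picks words with the extended macro function ": word # of", on the assumption that this avoids string-expression length limits in older Stata versions.

      ```stata
      * Sketch: load a file in blocks of x variables (x is a hypothetical block size)
      local file "http://www.stata-press.com/data/r15/states"
      local x 3

      * Read the variable names without loading the data
      describe using "`file'", varlist
      local vlist `r(varlist)'
      local k : word count `vlist'

      forvalues start = 1(`x')`k' {
          * Last position in this block, capped at the number of variables
          local stop = min(`start' + `x' - 1, `k')

          * Collect the names at positions start..stop
          local block
          forvalues i = `start'/`stop' {
              local v : word `i' of `vlist'
              local block `block' `v'
          }

          use `block' using "`file'", clear
          * ... process this block of variables here ...
      }
      ```

      The last block may hold fewer than x variables. This approach still depends on the complete variable list fitting into one macro.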
      Last edited by Robert Picard; 24 Apr 2018, 09:18.



      • #4
        Robert Picard

        Thank you very much. While it does not answer my question on how to use variables by their index, it solves my problem anyway! I did not know that describe can be used without loading the file into memory. I just tested it with a small Stata version, and it works.

        daniel klein

        Thank you for answering. I had a look at your ado file but did not understand how I could incorporate parts of it into my own syntax. I don't want to ask the prospective and unknown users of my application to install an ado file before they can use it.
        My problem was solved by Robert's answer, but hopefully the Stata developers will add the possibility to address variables by their position in the file in the near future.



        • #5
          Originally posted by Klaudia Erhardt
          daniel klein

          My problem was solved by Robert's answer
          I would not be sure about that. Try creating a Stata dataset with 32,000 variables with long names in Stata SE or MP (actually you can only create 31,999; well, nowadays 119,999), then run describe with the varlist option in Stata IC. It will choke and throw error 103, because the maximum length of macros such as r(varlist) is limited according to c(maxvar). You can even stick with Stata SE or MP. Run this

          Code:
          clear all
          set maxvar 32000
          forvalues j = 1/31999 {
              generate longvariablename`j' = 42
          }
          save toolarge.dta
          
          clear
          set maxvar 2048
          describe using toolarge.dta , varlist
          to get

          Code:
          (output omitted)
          longvariablename31999
                          float   %9.0g                
          -------------------------------------------------------------------------------
          Sorted by:
          too many variables
          r(103);
          This might be called a bug, but it is why you will have to read the variable names directly from the dta file under those circumstances. That is what usesome tries to do (and does not do correctly for files with dataset labels or in Stata versions > 13).
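
          If you want to stay with describe anyway, the syntax can at least detect this situation instead of stopping with an error. A minimal sketch, assuming the toolarge.dta file from the example above; the macro names and the message text are made up for illustration:

          ```stata
          * Sketch: guard against r(103) before relying on r(varlist)
          local file "toolarge.dta"

          capture describe using "`file'", varlist
          local rc = _rc

          if `rc' == 103 {
              * The variable list does not fit; another approach is needed
              display as error "too many variables in `file' for describe, varlist"
          }
          else if `rc' != 0 {
              * Any other problem: reissue the original error
              error `rc'
          }
          else {
              local vlist `r(varlist)'
              local len : length local vlist
              display "list uses `len' of `c(macrolen)' characters"
          }
          ```

          The final display compares the length of the list with c(macrolen), the macro length limit of the running Stata.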

          Best
          Daniel



          • #6
            You are completely right, Daniel, that the maximum macro length is a problem with this solution.

            Indeed, my syntax could not be run with Small Stata, which has a maximum macro length of 13,400 characters. The list of variable names of our biggest file (3,170 variables at the moment) is 26,105 characters long, including blanks. There is enough room before we reach the limit of Stata IC at 165,200 characters, though.

            The main reason why I want to segment the files I am processing is the runtime issue. I noticed that processing 680 cross-sectional files with more than 60,000 variables overall takes only 1 1/2 hours, while processing a single file with 3,100 variables (and 620,000 obs) takes 4.5 hours!
            I have to experiment to see whether segmenting the observations or the variables yields the lower runtime.

            Anyway, I think it would be very useful to be able to address variables in a file by their index instead of by their name.
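
            For the observation side of that experiment, something like the following could serve as a starting point. It is only a sketch: the file name and the block size are hypothetical, and it assumes that describe using leaves the number of observations in r(N).

            ```stata
            * Sketch: load a file in blocks of observations
            local file "bigfile.dta"      // hypothetical file name
            local step 100000             // hypothetical block size

            * Number of observations, read without loading the data
            describe using "`file'", short
            local N = r(N)

            forvalues start = 1(`step')`N' {
                * Last observation in this block, capped at N
                local stop = min(`start' + `step' - 1, `N')

                use in `start'/`stop' using "`file'", clear
                * ... process this block of observations here ...
            }
            ```

            Wrapping each variant in timer on and timer off would show which kind of segmentation is faster for a given file.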
