  • Standard approach for converting a variable found in several datasets from string to int

    We need to append annual data to a large dataset and are running into memory constraints (the final dataset is ~22GB and our server has 24GB of RAM). Historically, we have encoded strings to int to decrease dataset size, but the value labels differ per dataset, and this practice yielded data-quality issues post-merge. This phenomenon is already documented on Statalist.
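
    To make the failure mode concrete, here is a minimal demo (with made-up citation values) of why per-dataset encode produces incompatible codes: encode numbers the values alphabetically within each dataset, so the same string can get a different integer in each file.

    Code:

    * hypothetical demo: same string, different integer per dataset
    clear
    input str10 citation
    "Brown2001"
    "Adams1999"
    end
    encode citation, gen(cit_num)    // Adams1999=1, Brown2001=2
    l

    clear
    input str10 citation
    "Brown2001"
    "Chen2005"
    end
    encode citation, gen(cit_num)    // Brown2001=1, Chen2005=2
    l                                // Brown2001 is now coded 1, not 2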

    Essentially, I would like to expand Eric's work (shown below) into a more generic solution that would let us convert strings to integers across datasets and so shrink our datasets. Please share any feedback, advice, alternatives, or things to consider. We process a lot of data with Stata, and our use of encode has been sloppy. I am a new Stata user but have previously worked in Python, SAS, SQL, and PowerShell.

    The generic function I have in mind has the following characteristics; a rough interface sketch follows the list.

    Input
    • Dataset(s) with matching string variables
    • Optional: list of string variables to generate/apply labels for
    • Optional: name of a do-file containing the latest value-label definitions

    Output
    • Do-file containing a fully specified value label for each string variable
    • Updated dataset(s) with strings converted to integers and value labels attached

    Requirements
    • Automatically generate value labels for all string variables in a dataset unless specific variables are listed
    • Automatically apply value labels for all strings, or only the specified strings, in the specified datasets
    • Give int-like variables logical value labels during encoding
    • Check for an existing do-file containing value labels and append new value labels to it, so variables are not recoded year-to-year
    • Throw an error when appending and the value labels don't match
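
    As a starting point, here is a hypothetical skeleton of the wrapper's interface (str2int, vars(), and labelfile() are illustrative names, not existing commands); the real work inside the loop would follow Eric's solution below:

    Code:

    capture program drop str2int
    program define str2int
        syntax anything(name=files) [, VARs(string) LABelfile(string)]
        foreach f of local files {
            use "`f'.dta", clear
            local towork "`vars'"
            if "`towork'" == "" {
                ds, has(type string)          // default: every string variable
                local towork "`r(varlist)'"
            }
            di as text "`f': would label-encode `towork'"
            * ...generate or append labels, recode, save...
        }
    end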


    Code:

    **solution**

    //1. build a lookup table of all distinct citation values//
    clear
    save "lookuptable.dta", emptyok replace
    foreach file in using master   /* put all your files here */  {
        append using "`file'.dta", keep(citation)
    }
    duplicates drop
    g citation_number = _n                // one integer per distinct citation
    l
    save "lookuptable.dta", replace
    cap which labmask
    if _rc ssc install labutil, replace
    labmask citation_number, values(citation) lblname(cit)   // option is values(), not value()
    la save cit using "citationlabels.do", replace



    //2. rebuild each file with citation_number in place of citation//
    clear
    save "final.dta", emptyok replace

    foreach file in using master  /* put all your files here */  {
        u "`file'.dta", clear
        merge 1:1 citation using "lookuptable.dta"   // use m:1 if a citation can repeat within a file
        drop if _m != 3                   // discard lookup rows not in this file
        drop _m
        drop citation                     // keep only the integer version
        sa "`file'_encoded.dta", replace
        append using "final.dta"
        sa "final.dta", replace
    }

    //3. apply the saved labels to citation_number//
    cap which fre
    if _rc ssc install fre, replace
    l in 1/5                              // spot-check a few rows
    fre citation_number                   // unlabeled integers
    do "citationlabels.do"
    lab val citation_number cit
    fre citation_number                   // now shows the citation text
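
    What Eric's code does not yet cover is the last two requirements: reusing last year's label do-file and refusing to recode. A rough sketch of that check, run after step 1 builds "lookuptable.dta" (note that uselabel may truncate very long labels, so treat this as an assumption to verify on real citations):

    Code:

    cap confirm file "citationlabels.do"
    if !_rc {
        do "citationlabels.do"            // restores label -cit- from last year
        uselabel cit, clear               // one row per (value, label) pair
        rename value old_number
        rename label citation
        merge 1:1 citation using "lookuptable.dta", keep(match) nogen
        qui count if old_number != citation_number
        if r(N) > 0 {
            di as err "existing value labels would be recoded"
            exit 459
        }
    }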

  • #2
    You didn't get a quick answer. It is very hard to evaluate such programs without the data to run them on. It is also hard to know what would happen, since (as Clyde noted in the previous exchange you linked to) there are often idiosyncratic problems that vary from data file to data file.

    If it is just annual data, then you may have few enough files to pay attention to each one individually. Also, I thought Stata was willing to use virtual memory on the standard storage device; while this is slower, it might suffice if you only need to put the data together once before pulling a smaller data set to analyze (see the sketch below). As a new user, you have chosen a tough problem to program.
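
    For the "pulling a smaller data set" step, -use- can read a subset of variables and observations straight from disk, so the full ~22GB file never has to sit in memory at once (year and outcome are hypothetical variable names):

    Code:

    use citation_number year outcome if year >= 2015 using "final.dta", clear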
