We need to append annual data to a large dataset and are running into memory constraints (the final dataset is ~22GB and our server has 24GB RAM). Historically, we have encoded strings to int to decrease dataset size but the value labels differ per dataset and this practice yielded data quality issues post-merge. This phenomenon is already documented on Statalist.
Essentially, I would like to expand Eric's work into a more generic solution that would allow us to convert strings to integers across datasets to decrease our dataset size. Please provide any feedback, advice, alternatives or things to consider. We process a lot of data with Stata and our use of encode has been sloppy. I am a new Stata user but have worked in Python, SAS, SQL and Powershell previously.
Input
Output
Requirements
Essentially, I would like to expand Eric's work into a more generic solution that would allow us to convert strings to integers across datasets to decrease our dataset size. Please provide any feedback, advice, alternatives or things to consider. We process a lot of data with Stata and our use of encode has been sloppy. I am a new Stata user but have worked in Python, SAS, SQL and Powershell previously.
Input
- Dataset(s) with matching string variables
- Optional: list of string variables to generate/apply labels for
Output
- .Do file containing fully specified value label for each string variable
- Updated dataset(s) with strings converted to integers with attached value labels
Requirements
- Automatically generate value labels for all string variables in a dataset unless variables are specified
- Automatically apply value labels for "all strings" or "specified strings" in specified datasets
- Give int-like variables logical value labels during encode
- Check for an existing do file containing value labels and append new value labels to it so variables are nor recoded year-to-year
- Throw-error when appending and value labels don't match
Code:
**solution** //1. mk lookup table of values// clear save "lookuptable.dta", emptyok replace foreach file in using master /* put all your files here */ { append using "`file'.dta", keep(citation) } duplicates drop g citation_number = _n l save "lookuptable.dta", emptyok replace cap which labmask if _rc ssc install labutil, replace labmask citation_number, value(citation) lblname(cit) la save cit using "citationlabels.do" , replace //2. mk final tbl w. citation_number, not citation// clear save "final.dta", emptyok replace foreach file in using master /* put all your files here */ { u "`file'.dta", clear merge 1:1 citation using "lookuptable.dta" drop if _m!=3 drop _m drop citation sa "`file'_encoded.dta", replace append using "final.dta" sa "final.dta", replace } //3. apply labels to citation_number// l fre citation do "citationlabels.do" lab val citation_number cit fre citation