  • Standard approach for converting a variable found in several datasets from string to int

    We need to append annual data to a large dataset and are running into memory constraints (the final dataset is ~22GB and our server has 24GB of RAM). Historically, we have encoded strings to int to decrease dataset size, but the value labels differ per dataset, and this practice yielded data-quality issues post-merge. This phenomenon is already documented on Statalist.
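
    To make the failure mode concrete, here is a minimal demo (with made-up citation values) of why per-dataset encode produces incompatible codes: encode numbers the values alphabetically within each dataset, so the same string can get a different integer in each file.

    Code:

    * hypothetical demo: same string, different integer per dataset
    clear
    input str10 citation
    "Brown2001"
    "Adams1999"
    end
    encode citation, gen(cit_num)    // Adams1999=1, Brown2001=2
    l

    clear
    input str10 citation
    "Brown2001"
    "Chen2005"
    end
    encode citation, gen(cit_num)    // Brown2001=1, Chen2005=2
    l                                // Brown2001 is now coded 1, not 2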

    Essentially, I would like to expand Eric's work (shown below) into a more generic solution that would let us convert strings to integers across datasets and so shrink our datasets. Please share any feedback, advice, alternatives, or things to consider. We process a lot of data with Stata, and our use of encode has been sloppy. I am a new Stata user but have previously worked in Python, SAS, SQL, and PowerShell.

    The generic function I have in mind has the following characteristics; a rough interface sketch follows the list.

    Input
    • Dataset(s) with matching string variables
    • Optional: list of string variables to generate/apply labels for
    • Optional: name of a do-file containing the latest value-label definitions

    Output
    • Do-file containing a fully specified value label for each string variable
    • Updated dataset(s) with strings converted to integers and value labels attached

    Requirements
    • Automatically generate value labels for all string variables in a dataset unless specific variables are listed
    • Automatically apply value labels for all strings, or only the specified strings, in the specified datasets
    • Give int-like variables logical value labels during encoding
    • Check for an existing do-file containing value labels and append new value labels to it, so variables are not recoded year-to-year
    • Throw an error when appending and the value labels don't match
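
    As a starting point, here is a hypothetical skeleton of the wrapper's interface (str2int, vars(), and labelfile() are illustrative names, not existing commands); the real work inside the loop would follow Eric's solution below:

    Code:

    capture program drop str2int
    program define str2int
        syntax anything(name=files) [, VARs(string) LABelfile(string)]
        foreach f of local files {
            use "`f'.dta", clear
            local towork "`vars'"
            if "`towork'" == "" {
                ds, has(type string)          // default: every string variable
                local towork "`r(varlist)'"
            }
            di as text "`f': would label-encode `towork'"
            * ...generate or append labels, recode, save...
        }
    end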


    Code:

    **solution**

    //1. build a lookup table of all distinct citation values//
    clear
    save "lookuptable.dta", emptyok replace
    foreach file in using master   /* put all your files here */  {
        append using "`file'.dta", keep(citation)
    }
    duplicates drop
    g citation_number = _n                // one integer per distinct citation
    l
    save "lookuptable.dta", replace
    cap which labmask
    if _rc ssc install labutil, replace
    labmask citation_number, values(citation) lblname(cit)   // option is values(), not value()
    la save cit using "citationlabels.do", replace



    //2. rebuild each file with citation_number in place of citation//
    clear
    save "final.dta", emptyok replace

    foreach file in using master  /* put all your files here */  {
        u "`file'.dta", clear
        merge 1:1 citation using "lookuptable.dta"   // use m:1 if a citation can repeat within a file
        drop if _m != 3                   // discard lookup rows not in this file
        drop _m
        drop citation                     // keep only the integer version
        sa "`file'_encoded.dta", replace
        append using "final.dta"
        sa "final.dta", replace
    }

    //3. apply the saved labels to citation_number//
    cap which fre
    if _rc ssc install fre, replace
    l in 1/5                              // spot-check a few rows
    fre citation_number                   // unlabeled integers
    do "citationlabels.do"
    lab val citation_number cit
    fre citation_number                   // now shows the citation text
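
    What Eric's code does not yet cover is the last two requirements: reusing last year's label do-file and refusing to recode. A rough sketch of that check, run after step 1 builds "lookuptable.dta" (note that uselabel may truncate very long labels, so treat this as an assumption to verify on real citations):

    Code:

    cap confirm file "citationlabels.do"
    if !_rc {
        do "citationlabels.do"            // restores label -cit- from last year
        uselabel cit, clear               // one row per (value, label) pair
        rename value old_number
        rename label citation
        merge 1:1 citation using "lookuptable.dta", keep(match) nogen
        qui count if old_number != citation_number
        if r(N) > 0 {
            di as err "existing value labels would be recoded"
            exit 459
        }
    }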

  • #2
    You didn't get a quick answer. It is very hard to evaluate such programs without the data to run them on. It is also hard to know what would happen, since (as Clyde noted in the previous exchange you linked to) there are often idiosyncratic problems that vary from data file to data file.

    If it is just annual data, then you may have few enough files to pay attention to each one individually. Also, I thought Stata was willing to use virtual memory on the standard storage device; while this is slower, it might suffice if you only need to put the data together once before pulling a smaller data set to analyze (see the sketch below). As a new user, you have chosen a tough problem to program.
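
    For the "pulling a smaller data set" step, -use- can read a subset of variables and observations straight from disk, so the full ~22GB file never has to sit in memory at once (year and outcome are hypothetical variable names):

    Code:

    use citation_number year outcome if year >= 2015 using "final.dta", clear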
