Hello everyone,
I am using Stata 14.2 on macOS 10.11.6. I am working with data that has multiple fixed format-text files for multiple years and I would like to read this data into Stata. I wrote a dictionary to use with infile or infix and I wrote a do file that loops through the folder that contains each year and each text file held within each file and then appends them into a consolidated file for each year. These files for each year are later appended and cleaned for use.
My problem is that despite Stata reading in the data from the text files, some observations incorrectly have missing values for variables "tempmm", "tpayq1", and "tanpay". To be clear, the text files do in fact have values for these variables for the affected observations but after being read into Stata they are treated as missing. Sometimes an observation has missing values for all three and other times just one or two of them. You can see a demonstration of this clearly by looking at the first row of data in each of the text and Stata examples below.
These three variables are numeric and have a width of 12 columns in the text files. If they do not fill all twelve columns the column spaces preceding the values are filled with zeros or blanks. Initially, my code used -infile- in conjunction with my dictionary file but upon searching I believe I read something about there potentially being an issue in dealing with those preceding zeros and blanks and so swapped to the -infix- command. The problem persisted. I've tried searching through the forum, as well as many internet searches, and reading the documentation for the commands I used. My apologies if I've missed something obvious.
Any help would be greatly appreciated.
Many thanks,
Taylor.
Here's a sample of the data from the text file:
Here's the dictionary:
Here's the do file to read the text files into Stata:
Here's what those same rows from the above text file looks like in Stata:
I am using Stata 14.2 on macOS 10.11.6. I am working with data that has multiple fixed format-text files for multiple years and I would like to read this data into Stata. I wrote a dictionary to use with infile or infix and I wrote a do file that loops through the folder that contains each year and each text file held within each file and then appends them into a consolidated file for each year. These files for each year are later appended and cleaned for use.
My problem is that despite Stata reading in the data from the text files, some observations incorrectly have missing values for variables "tempmm", "tpayq1", and "tanpay". To be clear, the text files do in fact have values for these variables for the affected observations but after being read into Stata they are treated as missing. Sometimes an observation has missing values for all three and other times just one or two of them. You can see a demonstration of this clearly by looking at the first row of data in each of the text and Stata examples below.
These three variables are numeric and have a width of 12 columns in the text files. If they do not fill all twelve columns the column spaces preceding the values are filled with zeros or blanks. Initially, my code used -infile- in conjunction with my dictionary file but upon searching I believe I read something about there potentially being an issue in dealing with those preceding zeros and blanks and so swapped to the -infix- command. The problem persisted. I've tried searching through the forum, as well as many internet searches, and reading the documentation for the commands I used. My apologies if I've missed something obvious.
Any help would be greatly appreciated.
Many thanks,
Taylor.
Here's a sample of the data from the text file:
Code:
11001 ---- 000000029241000000046932000000197877001971001156000341000224000148000058000027000011000005000001000001000000000000000000 11000 23001 11001 07-- B000000000000000000000000000000000000000015000011000002000000000002000000000000000000000000000000000000000000000000000000 11000 23001 11001 0700 B000000000000000000000000000000000000000015000011000002000000000002000000000000000000000000000000000000000000000000000000 11000 23001 11001 0720 A000000000000000000000000000000000000000002000002000000000000000000000000000000000000000000000000000000000000000000000000 11000 23001 11001 0740 000000000015000000000021000000000099000004000003000001000000000000000000000000000000000000000000000000000000000000000000 11000 23001 11001 0750 A000000000000000000000000000000000000000002000002000000000000000000000000000000000000000000000000000000000000000000000000 11000 23001 11001 0780 000000000069000000000104000000000480000007000004000001000000000002000000000000000000000000000000000000000000000000000000 11000 23001 11001 10-- A000000000000000000000000000000000000000002000001000001000000000000000000000000000000000000000000000000000000000000000000 11000 23001 11001 1400 A000000000000000000000000000000000000000002000001000001000000000000000000000000000000000000000000000000000000000000000000 11000 23001 11001 1440 A000000000000000000000000000000000000000002000001000001000000000000000000000000000000000000000000000000000000000000000000 11000 23001
Here's the dictionary:
Code:
dictionary { _lines(1) _line(1) _column(1) str2 stateid %2s "State Code" _column(3) str3 countyid %3s "County Code" _column(6) str1 blank1 %1s "Blank 1" _column(7) str4 siccode2 %4s "SIC Code" _column(11) str1 blank2 %1s "Blank 2" _column(12) str1 tflag %1s "Data Suppression Flag" _column(13) int tempmm %12f "Total Mid-March Employmees" _column(25) int tpayq1 %12f "Total First Quarter Payroll ($1,000)" _column(37) int tanpay %12f "Total Annual Payroll ($1,000)" _column(49) int testab %6f "Total Number of Establishments" _column(55) int ctyemp1 %6f "Number of Establishments: Employment Size Class: 1-4 Employees" _column(61) int ctyemp2 %6f "Number of Establishments: Employment Size Class: 5-9 Employees" _column(67) int ctyemp3 %6f "Number of Establishments: Employment Size Class: 10-19 Employees" _column(73) int ctyemp4 %6f "Number of Establishments: Employment Size Class: 20-49 Employees" _column(79) int ctyemp5 %6f "Number of Establishments: Employment Size Class: 50-99 Employees" _column(85) int ctyemp6 %6f "Number of Establishments: Employment Size Class: 100-249 Employees" _column(91) int ctyemp7 %6f "Number of Establishments: Employment Size Class: 250-499 Employees" _column(97) int ctyemp8 %6f "Number of Establishments: Employment Size Class: 500-999 Employees" _column(103) int ctyemp9 %6f "Number of Establishments: Employment Size Class: 1000 or more Employees" _column(109) int ctyemp10 %6f "Number of Establishments: Employment Size Class: 1000-1499 Employees" _column(115) int ctyemp11 %6f "Number of Establishments: Employment Size Class: 1500-2499 Employees" _column(121) int ctyemp12 %6f "Number of Establishments: Employment Size Class: 2500-4999 Employees" _column(127) int ctyemp13 %6f "Number of Establishments: Employment Size Class: 5000 or more Employees" _column(133) int blank3 %1f "Blank 3" _column(134) int ssastat2 %2f "SSA State Code" _column(136) int ssacty2 %3f "SSA County Code" _column(139) int blank4 %1f "Blank 4" _column(140) str2 fipstate %2s "FIPS State Code" _column(142) str3 fipscty2 %3s "FIPS County Code" _column(145) int blank5 %6f "Blank 5" }
Code:
local path "/Users/taylorwright/Dropbox/My RA Work/Abel's 2017 R&R/CBP Files/Dictionary 1" local folderList : dir "`path'" dirs "19*" local yr=74 * This loops through each of the folders (with the title being the year) in the directory foreach folder of local folderList { cd "`path'/19`yr'" local fileList : dir "`path'/19`yr'" files "RG029.CBP`yr'.T2*.txt" * This loops through each file in each folder (this is needed because data is split into geographical divisions foreach file of local fileList { infile using "/Users/taylorwright/Dropbox/My RA Work/Abel's 2017 R&R/Dictionaries/CBP_dict_1974-1986.dct", using(`file') save `file'.dta, replace clear } * This appends the data files from each geographical division into one data file for the entire year use RG029.CBP`yr'.T2I1.txt.dta foreach num of numlist 2/9 { append using RG029.CBP`yr'.T2I`num'.txt.dta, } cd "/Users/taylorwright/Dropbox/My RA Work/Abel's 2017 R&R/Cleaned Data/" save 19`yr'.dta, replace local yr=`yr'+1 clear }
Code:
* Example generated by -dataex-. To install: ssc install dataex clear input str2 stateid str3 countyid int(tempmm tpayq1 tanpay) "11" "001" 29241 . . "11" "001" 0 0 0 "11" "001" 0 0 0 "11" "001" 0 0 0 "11" "001" 15 21 99 "11" "001" 0 0 0 "11" "001" 69 104 480 "11" "001" 0 0 0 "11" "001" 0 0 0 "11" "001" 0 0 0 end
Comment