Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Reading fixed-format text file generating missing values for some variables

    Hello everyone,

    I am using Stata 14.2 on macOS 10.11.6. I am working with data that has multiple fixed format-text files for multiple years and I would like to read this data into Stata. I wrote a dictionary to use with infile or infix and I wrote a do file that loops through the folder that contains each year and each text file held within each file and then appends them into a consolidated file for each year. These files for each year are later appended and cleaned for use.

    My problem is that despite Stata reading in the data from the text files, some observations incorrectly have missing values for variables "tempmm", "tpayq1", and "tanpay". To be clear, the text files do in fact have values for these variables for the affected observations but after being read into Stata they are treated as missing. Sometimes an observation has missing values for all three and other times just one or two of them. You can see a demonstration of this clearly by looking at the first row of data in each of the text and Stata examples below.

    These three variables are numeric and have a width of 12 columns in the text files. If they do not fill all twelve columns the column spaces preceding the values are filled with zeros or blanks. Initially, my code used -infile- in conjunction with my dictionary file but upon searching I believe I read something about there potentially being an issue in dealing with those preceding zeros and blanks and so swapped to the -infix- command. The problem persisted. I've tried searching through the forum, as well as many internet searches, and reading the documentation for the commands I used. My apologies if I've missed something obvious.

    Any help would be greatly appreciated.

    Many thanks,

    Taylor.

    Here's a sample of the data from the text file:
    Code:
    11001 ----  000000029241000000046932000000197877001971001156000341000224000148000058000027000011000005000001000001000000000000000000 11000 23001
    11001 07-- B000000000000000000000000000000000000000015000011000002000000000002000000000000000000000000000000000000000000000000000000 11000 23001
    11001 0700 B000000000000000000000000000000000000000015000011000002000000000002000000000000000000000000000000000000000000000000000000 11000 23001
    11001 0720 A000000000000000000000000000000000000000002000002000000000000000000000000000000000000000000000000000000000000000000000000 11000 23001
    11001 0740  000000000015000000000021000000000099000004000003000001000000000000000000000000000000000000000000000000000000000000000000 11000 23001
    11001 0750 A000000000000000000000000000000000000000002000002000000000000000000000000000000000000000000000000000000000000000000000000 11000 23001
    11001 0780  000000000069000000000104000000000480000007000004000001000000000002000000000000000000000000000000000000000000000000000000 11000 23001
    11001 10-- A000000000000000000000000000000000000000002000001000001000000000000000000000000000000000000000000000000000000000000000000 11000 23001
    11001 1400 A000000000000000000000000000000000000000002000001000001000000000000000000000000000000000000000000000000000000000000000000 11000 23001
    11001 1440 A000000000000000000000000000000000000000002000001000001000000000000000000000000000000000000000000000000000000000000000000 11000 23001

    Here's the dictionary:
    Code:
    dictionary {
    
    _lines(1)
    
    _line(1)
    
    
    _column(1)        str2    stateid        %2s        "State Code"
    _column(3)        str3    countyid    %3s        "County Code"
    _column(6)        str1     blank1        %1s        "Blank 1"
    _column(7)        str4    siccode2    %4s        "SIC Code"
    _column(11)        str1    blank2        %1s        "Blank 2"
    _column(12)     str1     tflag        %1s        "Data Suppression Flag"
    _column(13)        int        tempmm        %12f    "Total Mid-March Employmees"
    _column(25)        int     tpayq1        %12f    "Total First Quarter Payroll ($1,000)"
    _column(37)        int     tanpay        %12f    "Total Annual Payroll ($1,000)"
    _column(49)        int        testab        %6f        "Total Number of Establishments"
    _column(55)        int        ctyemp1        %6f        "Number of Establishments: Employment Size Class: 1-4 Employees"
    _column(61)        int        ctyemp2        %6f        "Number of Establishments: Employment Size Class: 5-9 Employees"
    _column(67)        int        ctyemp3        %6f        "Number of Establishments: Employment Size Class: 10-19 Employees"
    _column(73)        int        ctyemp4        %6f        "Number of Establishments: Employment Size Class: 20-49 Employees"
    _column(79)        int        ctyemp5        %6f        "Number of Establishments: Employment Size Class: 50-99 Employees"
    _column(85)        int        ctyemp6        %6f        "Number of Establishments: Employment Size Class: 100-249 Employees"
    _column(91)        int        ctyemp7        %6f        "Number of Establishments: Employment Size Class: 250-499 Employees"
    _column(97)        int        ctyemp8        %6f        "Number of Establishments: Employment Size Class: 500-999 Employees"
    _column(103)    int        ctyemp9        %6f        "Number of Establishments: Employment Size Class: 1000 or more Employees"
    _column(109)    int        ctyemp10    %6f        "Number of Establishments: Employment Size Class: 1000-1499 Employees"
    _column(115)    int        ctyemp11    %6f        "Number of Establishments: Employment Size Class: 1500-2499 Employees"
    _column(121)    int        ctyemp12    %6f        "Number of Establishments: Employment Size Class: 2500-4999 Employees"
    _column(127)    int        ctyemp13    %6f        "Number of Establishments: Employment Size Class: 5000 or more Employees"
    _column(133)    int        blank3        %1f        "Blank 3"
    _column(134)    int        ssastat2    %2f        "SSA State Code"
    _column(136)    int        ssacty2        %3f        "SSA County Code"
    _column(139)    int        blank4        %1f        "Blank 4"
    _column(140)    str2    fipstate    %2s        "FIPS State Code"
    _column(142)    str3    fipscty2    %3s        "FIPS County Code"
    _column(145)    int        blank5        %6f        "Blank 5"
    
    }
    Here's the do file to read the text files into Stata:
    Code:
    local path "/Users/taylorwright/Dropbox/My RA Work/Abel's 2017 R&R/CBP Files/Dictionary 1"
    local folderList : dir  "`path'" dirs "19*"
    local yr=74
    
    * This loops through each of the folders (with the title being the year) in the directory
    foreach folder of local folderList {
    
    cd "`path'/19`yr'"
    
    local fileList : dir "`path'/19`yr'" files "RG029.CBP`yr'.T2*.txt"
    
    * This loops through each file in each folder (this is needed because data is split into geographical divisions
    foreach file of local fileList {
        infile using "/Users/taylorwright/Dropbox/My RA Work/Abel's 2017 R&R/Dictionaries/CBP_dict_1974-1986.dct", using(`file')
        save `file'.dta, replace    
        clear
    }
    * This appends the data files from each geographical division into one data file for the entire year
    use RG029.CBP`yr'.T2I1.txt.dta
        foreach num of numlist 2/9    {
            append using RG029.CBP`yr'.T2I`num'.txt.dta,
        }
        cd "/Users/taylorwright/Dropbox/My RA Work/Abel's 2017 R&R/Cleaned Data/"
        save 19`yr'.dta, replace
    local yr=`yr'+1
    clear
    }
    Here's what those same rows from the above text file looks like in Stata:
    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str2 stateid str3 countyid int(tempmm tpayq1 tanpay)
    "11" "001" 29241   .   .
    "11" "001"     0   0   0
    "11" "001"     0   0   0
    "11" "001"     0   0   0
    "11" "001"    15  21  99
    "11" "001"     0   0   0
    "11" "001"    69 104 480
    "11" "001"     0   0   0
    "11" "001"     0   0   0
    "11" "001"     0   0   0
    end

  • #2
    In your dictionary, the variables tpayq1 and tanpay are read into an int variable type with a 12 character wide format:
    Code:
    _column(25)        int     tpayq1        %12f    "Total First Quarter Payroll ($1,000)"
    _column(37)        int     tanpay        %12f    "Total Annual Payroll ($1,000)"
    The first observation in your data example, starting at column 25 is
    Code:
    000000046932000000197877
    If you try to store 46932 and 197877 in an int type variable, you get a missing value. See help data_types.

    Comment


    • #3
      Originally posted by Robert Picard View Post
      In your dictionary, the variables tpayq1 and tanpay are read into an int variable type with a 12 character wide format:
      Code:
      _column(25) int tpayq1 %12f "Total First Quarter Payroll ($1,000)"
      _column(37) int tanpay %12f "Total Annual Payroll ($1,000)"
      The first observation in your data example, starting at column 25 is
      Code:
      000000046932000000197877
      If you try to store 46932 and 197877 in an int type variable, you get a missing value. See help data_types.
      Thanks for the prompt reply and advice Robert!

      Based on the help file, I should instead alter the dictionary to read these variables as doubles and that should solve the problem. Is that the correct interpretation?

      Cheers,

      Taylor.

      Comment


      • #4
        Yes, that's the safest approach. You can always use compress later on to make the dataset more compact.

        Comment


        • #5
          Perfect. I feel a bit embarrassed that the problem was so simple (and of course self-inflicted), but I do feel a bit relieved that the solution is simple.

          Thank you very much for your help Robert.

          Comment

          Working...
          X