
  • Reading in a 10GB .csv dataset with not enough memory on my machine

    Hello everyone,
    I need help reading a 10GB .csv dataset on a machine with very limited physical memory. I believe I am having trouble with the dictionary.

    I am trying to read in the FCC dataset https://opendata.fcc.gov/Wireline/Fi...2019/whue-6pnt, which I've downloaded to a USB drive. I need only the CA state data. Because I have only about 10GB of usable memory on my machine (and I can't change that today), I want to infile only the rows for the state I need and only the variables I need.

    I can see the data if I run
    import delim using "Fixed_Broadband_Deployment_Data__December_2019.cs v" , rowr(1:100) clear

    but can't run
    infile using fccdict in 1, clear

    as I get the following:

    'LogicalReco' cannot be read as a number for logicalrecordnumber[1]
    'rdNumber,Pr' cannot be read as a number for providerid[1]
    'oviderID,FR' cannot be read as a number for frn[1]
    'rName,DBA' cannot be read as a number for censusblockfipscode[1]
    'Name,Ho' cannot be read as a number for consumer[1]
    'ldingCo' cannot be read as a number for maxadvertiseddownstreamspeedmbps[1]
    'mpanyNam' cannot be read as a number for maxadvertisedupstreamspeedmbps[1]
    (1 observation read)

    .


    . list

    +-------------------------------------------------------------------------------+
    | logica~r provid~d frn state census~e consumer m~down~s m~upst~s |
    |-------------------------------------------------------------------------------|
    1. | . . . N, . . . . |
    +-------------------------------------------------------------------------------+



    with the dictionary file fccdict.dct being

    dictionary using "Fixed_Broadband_Deployment_Data__December_2019.cs v"{
    * for fcc data
    *
    *1 logicalrecord~r long %12.0g Logical Record Number
    *2 providerid long %12.0g Provider ID
    *3 frn long %12.0g FRN
    *4 providername str46 %46s Provider Name
    *5 dbaname str39 %39s DBA Name
    *6 holdingcompan~e str46 %46s Holding Company Name
    *7 holdingcompan~r long %12.0g Holding Company Number
    *8 holdingcompan~l str46 %46s Holding Company Final
    *9 state str2 %9s State
    *10 censusblockfi~e double %10.0g Census Block FIPS Code
    *11 technologycode byte %8.0g Technology Code
    *12consumer byte %8.0g Consumer
    *13 m~downstreams~s int %8.0g Max Advertised Downstream Speed (mbps)
    *14m~upstreamspe~s float %9.0g Max Advertised Upstream Speed (mbps)
    *15business byte %8.0g Business
    *
    _first(1)
    _lines(1)
    long logicalrecordnumber %12.0g
    long providerid %12.0g
    long frn %12.0g
    str2 state %9s
    double censusblockfipscode %10.0g
    byte consumer %8.0g
    int maxadvertiseddownstreamspeedmbps %8.0g
    float maxadvertisedupstreamspeedmbps %9.0g

    }



    Doing
    import delim using "Fixed_Broadband_Deployment_Data__December_2019.cs v"

    just gets Stata stuck, and according to Stata I need a lot of memory. I am on my work computer, so getting more memory is a whole bureaucratic affair that will have to wait until Monday, and I need to process my dataset today.

    THANKS IN ADVANCE!

    Information about my machine:

    Stata/MP 14.2 for Windows (64-bit x86-64)
    Revision 29 Jan 2018
    Copyright 1985-2015 StataCorp LLC

    Total physical memory: 16594036 KB
    Available physical memory: 9371188 KB

    538-user 4-core Stata network perpetual license:

    Licensed to: Stata/MP 14 (4 cores)

  • #2
    I am confused at the start. When I follow the link you give to the data source

    https://opendata.fcc.gov/Wireline/Fi...2019/whue-6pnt

    and then follow the link on that page to the data download page

    https://www.fcc.gov/general/broadban...a-fcc-form-477

    there are separate state-level datasets available, which would seem to meet your need to restrict your data to California.



    • #3
      [Crossed in the ether with the preceding, so perhaps the following is irrelevant.]

      I have only rarely used -infile-, but my guess here would be that the csv file has variable names on its first line, and infile is trying to read that header line with your numeric formats and failing (the fragments in your error messages, such as 'LogicalReco' and 'rdNumber,Pr', look like pieces of that header). I'd suggest using _first(2) rather than _first(1) in your dictionary file and see if that helps.
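
      For example, the top of the revised dictionary would look something like this (untested; the variable list is unchanged from your post):

      Code:
      dictionary using "Fixed_Broadband_Deployment_Data__December_2019.csv" {
          * start reading at line 2, skipping the header row of variable names
          _first(2)
          _lines(1)
          long   logicalrecordnumber              %12.0g
          long   providerid                       %12.0g
          long   frn                              %12.0g
          str2   state                            %9s
          double censusblockfipscode              %10.0g
          byte   consumer                         %8.0g
          int    maxadvertiseddownstreamspeedmbps %8.0g
          float  maxadvertisedupstreamspeedmbps   %9.0g
      }
      Note that, as far as I know, a dictionary reads fields from the left in file order; if you also need to pass over columns you do not want, look at the _skip() directive in [D] infile (fixed format).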

      If adjusting the dictionary file doesn't work, and no one else with better knowledge comes in here to help, I can suggest a crude solution, which is to use -import delimited- to read pieces of the file, keep only the "CA" records from each one, and append each piece onto a whole file. I think something like the following untested code would work:

      Code:
      // File has "73eM" observations according to the website.  
      // Read it 500,000 observations at a time.
      // Stop at 800,000,000 to be sure to be big enough.
      // I don't recall if -import delimited- will complain
      // about reading a row number that is too large.
      clear
      tempfile whole
      save `whole', emptyok
      forval start= 1(500000)800000000
         local stop = `start' + 500000 - 1
         di "working on rows `i' to `stop'"
          import delim using "Fixed_Broadband_Deployment_Data__December_2019.cs v" , ///
            rowr(`i':`stop' ) clear
         keep if state == "CA"  
         append using `whole'
         save `whole'
      }  
      clear
      use `whole'
      Finally, there is a user-written utility called -chunky- aimed at problems like yours. Take a look at -ssc describe chunky-; it might be even easier.
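
      Something like this will get you started (I'm only listing the commands I'm sure of; -chunky-'s own help file documents its splitting options):

      Code:
      * install and read about -chunky- from SSC
      ssc describe chunky
      ssc install chunky
      help chunky
      * after -chunky- splits the big .csv into pieces, loop over the pieces with
      * -import delimited-, keep only the "CA" rows, and append, as in the code above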

      Last edited by Mike Lacy; 21 Feb 2021, 08:57.



      • #4
        Super helpful, both!
        Mike Lacy -- thank you, super useful. I am discovering -chunky-; I stumbled on it and am experimenting with how large a chunk -import delim- can then read, which seems to be as much as 1.3 GB. This code is super helpful!

        William Lisowski -- thanks for pointing me to it; I think I need to show code that can read the huge dataset and then keep only what I need...

        I'll report back, but I think I am getting there. Stata is slow as s*$%



        • #5
          The fact remains that if you are trying to load data into memory that exceeds your capacity, the operating system will necessarily page to disk, which is far, far slower than operations performed in memory. This is true of Stata and any other application, because it's a limitation of your workstation.



          • #6
            Yeah, I know -- work pays for a perpetual Stata MP license and then gives us crappy computers.



            • #7
              Why not follow the advice in #2?
              Code:
              copy "https://transition.fcc.gov/form477/BroadbandData/Fixed/Dec19/Version%201/CA-Fixed-Dec2019.zip" .
              unzipfile "CA-Fixed-Dec2019.zip"
              import delimited using "CA-Fixed-Dec2019-v1.csv" , clear
              save "CA-Fixed-Dec2019.dta" 
              describe
              datasignature set
              Code:
              Contains data from CA-Fixed-Dec2019.dta
                obs:     5,141,304                          
               vars:            15                          21 Feb 2021 19:10
              ------------------------------------------------------------------------------------------------------------------------------------------------
                            storage   display    value
              variable name   type    format     label      variable label
              ------------------------------------------------------------------------------------------------------------------------------------------------
              logrecno        long    %8.0g                 LogRecNo
              provider_id     long    %12.0g                Provider_Id
              frn             long    %12.0g                FRN
              providername    str123  %123s                 ProviderName
              dbaname         str57   %57s                  DBAName
              holdingcompan~e str54   %54s                  HoldingCompanyName
              hoconum         long    %12.0g                HocoNum
              hocofinal       str54   %54s                  HocoFinal
              stateabbr       str2    %9s                   StateAbbr
              blockcode       double  %10.0g                BlockCode
              techcode        byte    %8.0g                 TechCode
              consumer        byte    %8.0g                 Consumer
              maxaddown       double  %8.0g                 MaxAdDown
              maxadup         double  %8.0g                 MaxAdUp
              business        byte    %8.0g                 Business
              ------------------------------------------------------------------------------------------------------------------------------------------------
              Sorted by: 
              
              . datasignature set
                5141304:15(99452):1377333074:155390271       (data signature set)
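
              If you then need only the subset of variables from #1, a further trimming step (the trimmed filename below is just an example):

              Code:
              use "CA-Fixed-Dec2019.dta", clear
              keep logrecno provider_id frn stateabbr blockcode consumer maxaddown maxadup
              compress
              save "CA-Fixed-Dec2019-trimmed.dta", replace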



              • #8
                Bjarte Aagnes -- Elegant!



                • #9
                  I am happy to see that -chunky- is still helping folks solve problems with big datasets. It was written in the days of 32-bit OS machines, which generally had a 1GB dataset limit or less, so one had to get creative.

                  Often when -infile- or -insheet- is used on a data or CSV file, you end up with far larger data types than you need. It is sometimes helpful to -infile- or -insheet- one chunk and then run -compress- to see whether the data will fit into smaller data types.
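
                  For example (the chunk filename here is just an illustration):

                  Code:
                  * read one chunk, then see how much -compress- shrinks the storage types
                  insheet using "chunk0001.csv", comma names clear
                  describe, short
                  compress
                  describe, short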

                  Anna kindly pointed out that I had an error in my example code for turning the text files into data files. The proper code example is below, to be run after -chunky-:

                  Code:
                  foreach in_fn in `s(filelist)' {
                      clear
                      insheet using `"`in_fn'"' , comma names
                      // create a saving filename from the input filename minus the extension, using regular expressions
                      if regexm(`"`in_fn'"', "(.*)[.].*$") == 1 {
                          local save_fn = regexs(1)
                          save `"`save_fn'"' , replace
                      }
                      else {
                          display `"{err: Cannot extract savename from `in_fn'}"'
                          error 198
                      }
                  }

                  One could then -net install dsconcat- and use -dsconcat- to assemble the fragments after getting a macro filelist:

                  Code:
                  local chunks : dir . files "[myfiles]*.dta"
                  dsconcat `chunks', dsname(set)

