
  • Reading in a 10GB .csv dataset with not enough memory on my machine

    Hello everyone,
    I need help reading a 10GB .csv dataset on a machine with very limited physical memory. I believe I am having trouble with the dictionary.

    I am trying to read in the FCC dataset https://opendata.fcc.gov/Wireline/Fi...2019/whue-6pnt, which I've downloaded to a USB drive. I need only the CA state data. Because I have only about 10GB of usable memory on my machine (and I can't change that today), I want to infile only the rows for the state I need and only the variables I need.

    I can see the data if I run
    import delim using "Fixed_Broadband_Deployment_Data__December_2019.cs v" , rowr(1:100) clear

    but can't run
    infile using fccdict in 1, clear

    as I get the following:

    'LogicalReco' cannot be read as a number for logicalrecordnumber[1]
    'rdNumber,Pr' cannot be read as a number for providerid[1]
    'oviderID,FR' cannot be read as a number for frn[1]
    'rName,DBA' cannot be read as a number for censusblockfipscode[1]
    'Name,Ho' cannot be read as a number for consumer[1]
    'ldingCo' cannot be read as a number for maxadvertiseddownstreamspeedmbps[1]
    'mpanyNam' cannot be read as a number for maxadvertisedupstreamspeedmbps[1]
    (1 observation read)

    .


    . list

    +-------------------------------------------------------------------------------+
    | logica~r provid~d frn state census~e consumer m~down~s m~upst~s |
    |-------------------------------------------------------------------------------|
    1. | . . . N, . . . . |
    +-------------------------------------------------------------------------------+



    with the dictionary file fccdict.dct being

    dictionary using "Fixed_Broadband_Deployment_Data__December_2019.cs v"{
    * for fcc data
    *
    *1 logicalrecord~r long %12.0g Logical Record Number
    *2 providerid long %12.0g Provider ID
    *3 frn long %12.0g FRN
    *4 providername str46 %46s Provider Name
    *5 dbaname str39 %39s DBA Name
    *6 holdingcompan~e str46 %46s Holding Company Name
    *7 holdingcompan~r long %12.0g Holding Company Number
    *8 holdingcompan~l str46 %46s Holding Company Final
    *9 state str2 %9s State
    *10 censusblockfi~e double %10.0g Census Block FIPS Code
    *11 technologycode byte %8.0g Technology Code
    *12consumer byte %8.0g Consumer
    *13 m~downstreams~s int %8.0g Max Advertised Downstream Speed (mbps)
    *14m~upstreamspe~s float %9.0g Max Advertised Upstream Speed (mbps)
    *15business byte %8.0g Business
    *
    _first(1)
    _lines(1)
    long logicalrecordnumber %12.0g
    long providerid %12.0g
    long frn %12.0g
    str2 state %9s
    double censusblockfipscode %10.0g
    byte consumer %8.0g
    int maxadvertiseddownstreamspeedmbps %8.0g
    float maxadvertisedupstreamspeedmbps %9.0g

    }



    Doing
    import delim using "Fixed_Broadband_Deployment_Data__December_2019.cs v"

    just gets Stata stuck, and according to Stata I need a lot of memory. I am on my work computer, so getting more memory is a whole bureaucratic affair that will have to wait until Monday, and I need to process my dataset today.

    THANKS IN ADVANCE!

    Information about my machine:

    Stata/MP 14.2 for Windows (64-bit x86-64)
    Revision 29 Jan 2018
    Copyright 1985-2015 StataCorp LLC

    Total physical memory: 16594036 KB
    Available physical memory: 9371188 KB

    538-user 4-core Stata network perpetual license:

    Licensed to: Stata/MP 14 (4 cores)

  • #2
    I am confused at the start. When I follow the link you give to the data source

    https://opendata.fcc.gov/Wireline/Fi...2019/whue-6pnt

    and then follow the link on that page to the data download page

    https://www.fcc.gov/general/broadban...a-fcc-form-477

    there are separate state-level datasets available, which would seem to meet your need to restrict your data to California.



    • #3
      [Crossed in the ether with the preceding, so perhaps the following is irrelevant.]

      I have only rarely used -infile-, but my guess here would be that the csv file has variable names on its first line, and infile is trying to read that header line with your numeric formats and failing (the fragments in your error messages, such as 'LogicalReco' and 'rdNumber,Pr', look like pieces of that header). I'd suggest using _first(2) rather than _first(1) in your dictionary file and see if that helps.
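
      For example, the top of the revised dictionary would look something like this (untested; the variable list is unchanged from your post):

      Code:
      dictionary using "Fixed_Broadband_Deployment_Data__December_2019.csv" {
          * start reading at line 2, skipping the header row of variable names
          _first(2)
          _lines(1)
          long   logicalrecordnumber              %12.0g
          long   providerid                       %12.0g
          long   frn                              %12.0g
          str2   state                            %9s
          double censusblockfipscode              %10.0g
          byte   consumer                         %8.0g
          int    maxadvertiseddownstreamspeedmbps %8.0g
          float  maxadvertisedupstreamspeedmbps   %9.0g
      }
      Note that, as far as I know, a dictionary reads fields from the left in file order; if you also need to pass over columns you do not want, look at the _skip() directive in [D] infile (fixed format).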

      If adjusting the dictionary file doesn't work, and no one else with better knowledge comes in here to help, I can suggest a crude solution, which is to use -import delimited- to read pieces of the file, keep only the "CA" records from each one, and append each piece onto a whole file. I think something like the following untested code would work:

      Code:
      // File has "73eM" observations according to the website.  
      // Read it 500,000 observations at a time.
      // Stop at 800,000,000 to be sure to be big enough.
      // I don't recall if -import delimited- will complain
      // about reading a row number that is too large.
      clear
      tempfile whole
      save `whole', emptyok
      forval start= 1(500000)800000000
         local stop = `start' + 500000 - 1
         di "working on rows `i' to `stop'"
          import delim using "Fixed_Broadband_Deployment_Data__December_2019.cs v" , ///
            rowr(`i':`stop' ) clear
         keep if state == "CA"  
         append using `whole'
         save `whole'
      }  
      clear
      use `whole'
      Finally, there is a user-written utility called -chunky- aimed at problems like yours. Take a look at -ssc describe chunky-; it might be even easier.
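
      Something like this will get you started (I'm only listing the commands I'm sure of; -chunky-'s own help file documents its splitting options):

      Code:
      * install and read about -chunky- from SSC
      ssc describe chunky
      ssc install chunky
      help chunky
      * after -chunky- splits the big .csv into pieces, loop over the pieces with
      * -import delimited-, keep only the "CA" rows, and append, as in the code above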

      Last edited by Mike Lacy; 21 Feb 2021, 08:57.



      • #4
        Super helpful, both!
        Mike Lacy -- thank you, super useful. I am discovering -chunky-; I stumbled on it and am experimenting with how large a chunk -import delim- can then read, which seems to be as much as 1.3 GB. This code is super helpful!

        William Lisowski -- thanks for pointing me to it; I think I need to show code that can read the huge dataset and then keep only what I need...

        I'll report back, but I think I am getting there. Stata is slow as s*$%



        • #5
          The fact remains that if you are trying to load data into memory that exceeds your capacity, the operating system will necessarily page to disk, which is far, far slower than operations performed in memory. This is true of Stata and any other application, because it's a limitation of your workstation.



          • #6
            Yeah, I know -- work pays for a perpetual Stata MP license and then gives us crappy computers.



            • #7
              Why not follow the advice in #2?
              Code:
              copy "https://transition.fcc.gov/form477/BroadbandData/Fixed/Dec19/Version%201/CA-Fixed-Dec2019.zip" .
              unzipfile "CA-Fixed-Dec2019.zip"
              import delimited using "CA-Fixed-Dec2019-v1.csv" , clear
              save "CA-Fixed-Dec2019.dta" 
              describe
              datasignature set
              Code:
              Contains data from CA-Fixed-Dec2019.dta
                obs:     5,141,304                          
               vars:            15                          21 Feb 2021 19:10
              ------------------------------------------------------------------------------------------------------------------------------------------------
                            storage   display    value
              variable name   type    format     label      variable label
              ------------------------------------------------------------------------------------------------------------------------------------------------
              logrecno        long    %8.0g                 LogRecNo
              provider_id     long    %12.0g                Provider_Id
              frn             long    %12.0g                FRN
              providername    str123  %123s                 ProviderName
              dbaname         str57   %57s                  DBAName
              holdingcompan~e str54   %54s                  HoldingCompanyName
              hoconum         long    %12.0g                HocoNum
              hocofinal       str54   %54s                  HocoFinal
              stateabbr       str2    %9s                   StateAbbr
              blockcode       double  %10.0g                BlockCode
              techcode        byte    %8.0g                 TechCode
              consumer        byte    %8.0g                 Consumer
              maxaddown       double  %8.0g                 MaxAdDown
              maxadup         double  %8.0g                 MaxAdUp
              business        byte    %8.0g                 Business
              ------------------------------------------------------------------------------------------------------------------------------------------------
              Sorted by: 
              
              . datasignature set
                5141304:15(99452):1377333074:155390271       (data signature set)
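
              If you then need only the subset of variables from #1, a further trimming step (the trimmed filename below is just an example):

              Code:
              use "CA-Fixed-Dec2019.dta", clear
              keep logrecno provider_id frn stateabbr blockcode consumer maxaddown maxadup
              compress
              save "CA-Fixed-Dec2019-trimmed.dta", replace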



              • #8
                Bjarte Aagnes -- Elegant!



                • #9
                  I am happy to see that -chunky- is still helping folks solve problems with big datasets. It was written in the days of 32-bit OS machines, which generally had a 1GB dataset limit or less, so one had to get creative.

                  Often when -infile- or -insheet- is used on a data or CSV file, you end up with far larger data types than you need. It is sometimes helpful to -infile- or -insheet- one chunk and then run -compress- to see whether the data will fit into smaller data types.
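
                  For example (the chunk filename here is just an illustration):

                  Code:
                  * read one chunk, then see how much -compress- shrinks the storage types
                  insheet using "chunk0001.csv", comma names clear
                  describe, short
                  compress
                  describe, short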

                  Anna kindly pointed out that I had an error in my example code for turning the text files into data files. The proper code example is below, to be run after -chunky-:

                  Code:
                  foreach in_fn in `s(filelist)' {
                      clear
                      insheet using `"`in_fn'"' , comma names
                      // create a saving filename from the input filename minus the extension, using regular expressions
                      if regexm(`"`in_fn'"', "(.*)[.].*$") == 1 {
                          local save_fn = regexs(1)
                          save `"`save_fn'"' , replace
                      }
                      else {
                          display `"{err: Cannot extract savename from `in_fn'}"'
                          error 198
                      }
                  }

                  One could then -net install dsconcat- and use -dsconcat- to assemble the fragments after getting a macro filelist:

                  Code:
                  local chunks : dir . files "[myfiles]*.dta"
                  dsconcat `chunks', dsname(set)

