
  • Importing large *.XML files on Mac

    Hello everyone,

    I'm trying to import large *.xml files (around 250MB; version 1.0, ISO-8859-1 encoding; see below for full file header) to Stata 12 on a Mac (I could also access a Windows machine with Stata 14 if that would make a difference).

    I already tried xmluse and usexmlex without success.

    I also tried to first import the files into Excel and then bring them into Stata. This, however, proved to be a) very tedious (Excel for Mac doesn't support XML import, so I first have to slice the files, import them into Excel on one of my university's PCs (32-bit), and then bring them to my laptop) and b) prone to mistakes (I get different column names in Excel and then end up with too many columns in Stata -- I could correct those mistakes manually, but the risk of not catching all errors seems too high).

    Is there any good way to import those files directly into Stata?

    ------------------------------
    The header of the files reads:
    Code:
    <?xml version="1.0" encoding="ISO-8859-1" standalone="yes"?><pdv_liste>
    If it is of any help, I downloaded the data here (website in French):
    http://www.prix-carburants.gouv.fr/rubrique/opendata/
    ------------------------------

    I would be very grateful for any help!

    Thank you in advance!

    Best,
    Patrick

  • #2
    When data is made available in XML or JSON format, this usually means that the data is not rectangular (variables and observations), so you will usually have to do some work to get it into Stata. Fortunately, both formats are plain text and can easily be read. Since your files are very large, you can peek at the contents using Stata's type command. For example, I downloaded the 2015 data from http://donnees.roulez-eco.fr/opendata/annee/2015. Here are the first 20 lines:

    Code:
    . type PrixCarburants_annuel_2015.xml, lines(20)
    <?xml version="1.0" encoding="ISO-8859-1" standalone="yes"?><pdv_liste>
    <pdv id="1000001" latitude="4620114" longitude="519791" cp="01000" pop="R">
      <adresse>596 AVENUE DE TREVOUX</adresse>
      <ville>SAINT-DENIS-L&#xE8;S-BOURG</ville>
      <ouverture debut="01:00:00" fin="01:00:00" saufjour=""/>
      <services>
        <service>Automate CB</service>
        <service>Vente de gaz domestique</service>
        <service>Station de gonflage</service>
      </services>
      <prix nom="Gazole" id="1" maj="2015-01-02T11:01:45" valeur="1141"/>
      <prix nom="Gazole" id="1" maj="2015-01-03T09:01:42" valeur="1141"/>
      <prix nom="Gazole" id="1" maj="2015-01-07T10:01:44" valeur="1141"/>
      <prix nom="Gazole" id="1" maj="2015-01-08T10:01:06" valeur="1115"/>
      <prix nom="Gazole" id="1" maj="2015-01-09T10:01:19" valeur="1115"/>
      <prix nom="Gazole" id="1" maj="2015-01-10T09:01:33" valeur="1099"/>
      <prix nom="Gazole" id="1" maj="2015-01-14T09:01:23" valeur="1088"/>
      <prix nom="Gazole" id="1" maj="2015-01-16T14:01:10" valeur="1074"/>
      <prix nom="Gazole" id="1" maj="2015-01-24T09:01:38" valeur="1074"/>
      <prix nom="Gazole" id="1" maj="2015-01-27T12:01:56" valeur="1088"/>
    .
    Since this is a text file, you can simply read it into Stata using the following code:
    Code:
    * read the whole XML file for 2015
    infix str line 1-244 using "PrixCarburants_annuel_2015.xml", clear
    compress
    leftalign    // to install, type ssc install leftalign
    The XML tag that identifies gas stations is "<pdv>". You can use the following code to create a gas station identifier. Since most of the data is about prices, I split the price data from the station data.
    Code:
    * Group observations per <pdv> tag (gas station)
    gen pdvid = sum(strpos(line,"<pdv "))
    
    * split data about gas station and prices
    preserve
    keep if strpos(line,"<prix ")
    save "prix2015.dta", replace
    restore
    drop if strpos(line,"<prix ")
    save "pdv2015.dta", replace
    There are many ways to extract the data from each line; I used regular expressions. Here's how to get the price data:
    Code:
    use "prix2015.dta", clear
    gen prix_nom = regexs(1) if regexm(line,`"<prix nom="([^"]+)""')
    gen prix_id = regexs(1) if regexm(line,`"<prix .+ id="([^"]+)""')
    gen prix_maj = regexs(1) if regexm(line,`"<prix .+ maj="([^"]+)""')
    gen prix_valeur = regexs(1) if regexm(line,`"<prix .+ valeur="([^"]+)""')
    
    assert !mi(prix_nom, prix_id, prix_maj, prix_valeur)
    save "prix2015vars.dta", replace
    The gas station data is more involved, but you can use the same strategy:
    Code:
    use "pdv2015.dta", clear
    
    * drop first line
    list line in 1
    drop in 1
    
    * extract data
    gen pdv_id = regexs(1) if regexm(line,`"<pdv id="([^"]+)""')
    gen double pdv_lat = real(regexs(1)) / 100000 if regexm(line,`"<pdv .+ latitude="([^"]+)""')
    gen double pdv_lon = real(regexs(1)) / 100000 if regexm(line,`"<pdv .+ longitude="([^"]+)""')
    gen pdv_cp = regexs(1) if regexm(line,`"<pdv .+ cp="([^"]+)""')
    gen pdv_pop = regexs(1) if regexm(line,`"<pdv .+ pop="([^"]+)""')
    
    * some addresses are split across two observations
    gen pdv_adresse = regexs(1) if regexm(line,`"<adresse>(.+)</adresse>"')
    replace pdv_adresse = regexs(1) if regexm(line,`"<adresse>(.+)"') & mi(pdv_adresse)
    replace pdv_adresse = regexs(1) if regexm(line,`"(.+)</adresse>"') & mi(pdv_adresse)
    
    gen pdv_ville = regexs(1) if regexm(line,`"<ville>(.+)</ville>"')
    
    gen pdv_ouvert_debut = regexs(1) if regexm(line,`"<ouverture debut="([^"]+)""')
    gen pdv_ouvert_fin = regexs(1) if regexm(line,`"<ouverture .+ fin="([^"]+)""')
    gen pdv_ouvert_sauf = regexs(1) if regexm(line,`"<ouverture .+ saufjour="([^"]+)""')
    
    gen pdv_service = regexs(1) if regexm(line,`"<service>(.+)</service>"')
    
    gen pdv_rupt_id = regexs(1) if regexm(line,`"<rupture id="([^"]+)""')
    gen pdv_rupt_nom = regexs(1) if regexm(line,`"<rupture .+ nom="([^"]+)""')
    gen pdv_rupt_debut = regexs(1) if regexm(line,`"<rupture .+ debut="([^"]+)""')
    gen pdv_rupt_fin = regexs(1) if regexm(line,`"<rupture .+ fin="([^"]+)""')
    
    gen pdv_ferm_type = regexs(1) if regexm(line,`"<fermeture type="([^"]+)""')
    gen pdv_ferm_debut = regexs(1) if regexm(line,`"<fermeture .+ debut="([^"]+)""')
    gen pdv_ferm_fin = regexs(1) if regexm(line,`"<fermeture .+ fin="([^"]+)""')
    
    * drop lines with no data
    gen pdvn = 0
    ds pdv_*
    foreach v of varlist `r(varlist)' {
        replace pdvn = pdvn + !mi(`v')
    }
    
    tab line if pdvn == 0
    drop if pdvn == 0
    
    save "pdv2015vars.dta", replace
    I had to tweak the address extraction because a few addresses were split across two lines. The code at the end checks that all variables have been extracted and that the remaining lines contain only empty or closing tags with no data to capture. Here are the results of the tabulate:

    Code:
    . tab line if pdvn == 0
    
                                       line |      Freq.     Percent        Cum.
    ----------------------------------------+-----------------------------------
                                     </pdv> |     12,333       23.06       23.06
                                </services> |     10,848       20.29       43.35
                               <fermeture/> |     10,637       19.89       63.25
                                    <prix/> |      2,601        4.86       68.11
                                 <rupture/> |      4,719        8.83       76.94
                                <services/> |      1,485        2.78       79.71
                                 <services> |     10,848       20.29      100.00
    ----------------------------------------+-----------------------------------
                                      Total |     53,471      100.00
    There's still some work to be done, but this should give you a good starting point. How to proceed depends a lot on what you are trying to do.
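
    For example, here's a rough sketch of one possible next step for the price file. Note that the scaling of valeur is my assumption from eyeballing the data, so verify it before relying on it.
    Code:
    * sketch: convert the extracted price strings to usable types
    use "prix2015vars.dta", clear

    * valeur appears to be in thousandths of a euro (1141 = 1.141) -- verify!
    gen double valeur = real(prix_valeur) / 1000

    * turn the maj timestamp into a Stata datetime
    gen double maj = clock(subinstr(prix_maj, "T", " ", 1), "YMDhms")
    format maj %tc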



    • #3
      If you're familiar with Java, there are a fair number of libraries for handling XML data that support queries against the files (e.g., they represent the structure of the document and let you descend and ascend different nodes). Even in cases where the data are rectangular, there doesn't seem to be a great way of parsing things automatically, since type specification isn't required by the standard. If anything, you could probably write something in Mata that would more closely resemble the structure of the data (e.g., a collection of objects with properties), but there isn't going to be an easy way to rectangularize the data.
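
      For concreteness, here's a minimal Mata sketch of what such an object might look like (purely hypothetical -- XmlNode is not an existing class, and a real parser would still have to build and traverse a tree of these):
      Code:
      mata:
      // hypothetical sketch: one node of an XML tree
      class XmlNode {
          string scalar                           tag         // element name, e.g. "pdv"
          string rowvector                        attrnames   // attribute names
          string rowvector                        attrvalues  // attribute values
          pointer(class XmlNode scalar) rowvector children    // nested child elements
      }
      end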



      • #4
        Thank you so much, Robert!

        This seems to work like a charm. I will now try to do it for all the datasets.

        I just saw that you are also the author of the - geodist - and - geonear - packages. I will be using those with the data as well, so thank you very much for those, too :-)

        Best,
        Patrick
        Last edited by Patrick Florian; 11 Feb 2016, 04:49.



        • #5
          To all the database gods,

          If your data is rectangular, please make it available in plain text, comma or tab delimited (or even Excel), seriously. If not, please take a few minutes to think about those who do not wish to simply query or modify your sacred data but want to analyze it. Perhaps there's a way to make two or more rectangular datasets. The example in #2 shows that the price data could be stored in a single dataset, with one observation per price (over 3 million of them) and an identifier for the gas station. A second dataset could carry all the gas station info and would connect to the price dataset via the gas station identifier (see the sketch below). Efficient, compact, and eminently analyzable.
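
          As a rough sketch, using the file and variable names from #2 (and glossing over details such as addresses split across lines), the link-up could look like this:
          Code:
          * sketch: one observation per station, then link prices to stations
          use "pdv2015vars.dta", clear
          collapse (firstnm) pdv_id pdv_lat pdv_lon pdv_cp pdv_pop pdv_ville, by(pdvid)
          save "pdv2015stations.dta", replace

          use "prix2015vars.dta", clear
          merge m:1 pdvid using "pdv2015stations.dta", keep(match master) nogenerate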

          Thank you for your consideration, Robert



          • #6
            Robert Picard I'm not sure it is always easy or practical for DBAs and/or database developers to rectangularize the data. In a strictly relational DB it is probably not too difficult, but how should they handle object databases where objects can contain varying numbers of elements? That's great for storage, but it makes the data a pain (and costly) to use for analysis. In this particular context I completely agree with you, but the combination of Stata's representation of data (a single, strictly rectangular collection) and the variety of ways to represent data elsewhere (schemaless formats, etc.) will continue to bother all of us for a while. Hopefully I'll be able to get some stuff put together to make it a bit easier to work with JSON input (I'm just finishing a parser for Google's geocoding API this morning), but this is definitely an area where the Java API and the different tools available in the Java world (e.g., Spark, Shark, Flume) could be helpful.

