Hello!
Stata version: 16.1
Issue: Don't know how to approach the cleaning of this dataset as all the relevant information are stored in strings
Hello everyone.
I'm working with a dataset on gasoline prices I retrieved in the french government website https://www.prix-carburants.gouv.fr/rubrique/opendata/.
The format of the downloaded data is xml.
After the following comands:
The dataset contains daily, retail, gasoline prices charged by the Universe of the French Gas-stations.
I am left with a dataset that looks like this:
My objective is to end up with a panel dataset with:
gas-station- id as the individual identifier
the latest daily hour as the time identifier (e.g. if I have maj="2023-01-02T06:05:21" and maj="2023-01-02T11:37:53", I would pick the second one and call it 2023-01-02)
all the gas station characteristics stored in the variables (e.g. latitude="4620100" longitude="519800", <adresse>596 AVENUE DE TREVOUX</adresse>, <ville>SAINT-DENIS-LèS-BOURG</ville>)
It is the first time I work with such a messy dataset and any guidance would be helpful.
Stata version: 16.1
Issue: Don't know how to approach the cleaning of this dataset as all the relevant information are stored in strings
Hello everyone.
I'm working with a dataset on gasoline prices I retrieved in the french government website https://www.prix-carburants.gouv.fr/rubrique/opendata/.
The format of the downloaded data is xml.
After the following comands:
Code:
ssc install leftalign // Run only once ******************************************************************************** **# Load data ******************************************************************************** *Load the data clear infix str line 1-700 using ..\Raw\PrixCarburants_annuel_2023.xml compress leftalign
The dataset contains daily, retail, gasoline prices charged by the Universe of the French Gas-stations.
I am left with a dataset that looks like this:
Code:
* Example generated by -dataex-. For more info, type help dataex clear input str104 line `"<pdv id="1000001" latitude="4620100" longitude="519800" cp="01000" pop="R">"' "<adresse>596 AVENUE DE TREVOUX</adresse>" "<ville>SAINT-DENIS-LèS-BOURG</ville>" `"<horaires automate-24-24="">"' `"<jour id="1" nom="Lundi" ferme="1"/>"' `"<jour id="2" nom="Mardi" ferme="1"/>"' `"<jour id="3" nom="Mercredi" ferme="1"/>"' `"<jour id="4" nom="Jeudi" ferme="1"/>"' `"<jour id="5" nom="Vendredi" ferme="1"/>"' `"<jour id="6" nom="Samedi" ferme="1"/>"' `"<jour id="7" nom="Dimanche" ferme="1"/>"' "</horaires>" "<services>" "<service>Station de gonflage</service>" "<service>Vente de gaz domestique (Butane, Propane)</service>" "<service>DAB (Distributeur automatique de billets)</service>" "</services>" `"<prix nom="Gazole" id="1" maj="2023-01-02T07:53:26" valeur="1.867"/>"' `"<prix nom="Gazole" id="1" maj="2023-01-05T09:33:37" valeur="1.877"/>"' `"<prix nom="Gazole" id="1" maj="2023-01-09T14:51:49" valeur="1.875"/>"' `"<prix nom="Gazole" id="1" maj="2023-01-11T09:23:54" valeur="1.859"/>"' `"<prix nom="Gazole" id="1" maj="2023-01-13T09:07:40" valeur="1.862"/>"' `"<prix nom="Gazole" id="1" maj="2023-01-16T09:34:16" valeur="1.885"/>"' `"<pdv id="1000002" latitude="4621842" longitude="522767" cp="01000" pop="R">"' "<adresse>16 Avenue de Marboz</adresse>" "<ville>BOURG-EN-BRESSE</ville>" `"<horaires automate-24-24="">"' `"<jour id="1" nom="Lundi" ferme=""/>"' `"<jour id="2" nom="Mardi" ferme=""/>"' `"<jour id="3" nom="Mercredi" ferme=""/>"' `"<jour id="4" nom="Jeudi" ferme=""/>"' `"<jour id="5" nom="Vendredi" ferme=""/>"' `"<jour id="6" nom="Samedi" ferme=""/>"' `"<jour id="7" nom="Dimanche" ferme=""/>"' "</horaires>" "<services>" "<service>Vente de gaz domestique (Butane, Propane)</service>" "<service>DAB (Distributeur automatique de billets)</service>" "</services>" `"<prix nom="Gazole" id="1" maj="2023-01-02T06:05:21" valeur="1.859"/>"' `"<prix nom="Gazole" id="1" maj="2023-01-02T11:37:53" valeur="1.799"/>"' `"<prix nom="Gazole" id="1" maj="2023-01-05T11:24:03" valeur="1.859"/>"' end
My objective is to end up with a panel dataset with:
gas-station- id as the individual identifier
the latest daily hour as the time identifier (e.g. if I have maj="2023-01-02T06:05:21" and maj="2023-01-02T11:37:53", I would pick the second one and call it 2023-01-02)
all the gas station characteristics stored in the variables (e.g. latitude="4620100" longitude="519800", <adresse>596 AVENUE DE TREVOUX</adresse>, <ville>SAINT-DENIS-LèS-BOURG</ville>)
It is the first time I work with such a messy dataset and any guidance would be helpful.
Comment