Dear all
I am working on a national dataset of up to 12m observations (individuals), which will need to frequently be aggregated to a smaller number of four types of nested higher level units (i.e. let us say neighbourhoods, local authorities, counties and regions). The data will need to be collapsed using the higher level units in the "by()" specification of the -collapse- command.
Given the size of the databases (3gb+) data storage efficiency is important.
In the dataset the smallest level of geographical unit (a neighbourhood) is defined by a unique 8 digit code as thus RRCCLLNN, where
NN is the two digit neighbourhood code from 01 to 99
LL is the two digit local authority code from 01 to 99
CC is the two digit county code from 01 to 99
RR is the two digit region code from 01 to 99
There are three other variables in the dataset used as geographical identifiers at other levels: local authorities are defined by a separate variable RRCCLL, likewise for counties (RRCC) and regions (RR). This creates four geographical identifiers which contain no extra information than one of them (the neighbourhood one, which contains all higher level codes).
I will need to aggregate individual level data using the -collapse- command to each of these four levels. For storage efficiency it would be preferable to only have a single geographical identifier than all four.
I think this will only be possible if it is possible to -collapse- data by part of the string. I would like some help from the expert community on whether this is possible.
I have tried the following command, which did not work, but hopefully illustrates what I would like to achieve. Assuming the variable "geographical_unit" is of the structure RRCCLLNN:
collapse (sum) cases, by(regexm(geographical_unit, "^[0-9][0-9][0-9][0-9][0-9][0-9]"))
which in my mind (but not in Stata) should return the sum of "cases" according to the first 6 digits of the "geographical_unit" variable i.e. according to RRCCLL i.e. at Local Authority level.
Stata returned an r101 error "factor variables and time-seies operators not allowed in option by()
I also tried
collapse (sum) cases, by(substr(geographical_unit,1,6))
with comparable failure.
Any help would be very much appreciated and apologies if this topic has been previously covered, I did inspect the old mailing lists first.
You might consider this a trivial storage issue as it appears from my simplified description, because I am only trying to reduce my dataset by 3 variables. However, individuals are coded to their neighbourhood in 30 separate years, so the actual storage gain is 30 geographical identifier variables vs 120, across up to 12m records (individuals).
Yours, with thanks
James Kirkbride
UCL
I am working on a national dataset of up to 12m observations (individuals), which will need to frequently be aggregated to a smaller number of four types of nested higher level units (i.e. let us say neighbourhoods, local authorities, counties and regions). The data will need to be collapsed using the higher level units in the "by()" specification of the -collapse- command.
Given the size of the databases (3gb+) data storage efficiency is important.
In the dataset the smallest level of geographical unit (a neighbourhood) is defined by a unique 8 digit code as thus RRCCLLNN, where
NN is the two digit neighbourhood code from 01 to 99
LL is the two digit local authority code from 01 to 99
CC is the two digit county code from 01 to 99
RR is the two digit region code from 01 to 99
There are three other variables in the dataset used as geographical identifiers at other levels: local authorities are defined by a separate variable RRCCLL, likewise for counties (RRCC) and regions (RR). This creates four geographical identifiers which contain no extra information than one of them (the neighbourhood one, which contains all higher level codes).
I will need to aggregate individual level data using the -collapse- command to each of these four levels. For storage efficiency it would be preferable to only have a single geographical identifier than all four.
I think this will only be possible if it is possible to -collapse- data by part of the string. I would like some help from the expert community on whether this is possible.
I have tried the following command, which did not work, but hopefully illustrates what I would like to achieve. Assuming the variable "geographical_unit" is of the structure RRCCLLNN:
collapse (sum) cases, by(regexm(geographical_unit, "^[0-9][0-9][0-9][0-9][0-9][0-9]"))
which in my mind (but not in Stata) should return the sum of "cases" according to the first 6 digits of the "geographical_unit" variable i.e. according to RRCCLL i.e. at Local Authority level.
Stata returned an r101 error "factor variables and time-seies operators not allowed in option by()
I also tried
collapse (sum) cases, by(substr(geographical_unit,1,6))
with comparable failure.
Any help would be very much appreciated and apologies if this topic has been previously covered, I did inspect the old mailing lists first.
You might consider this a trivial storage issue as it appears from my simplified description, because I am only trying to reduce my dataset by 3 variables. However, individuals are coded to their neighbourhood in 30 separate years, so the actual storage gain is 30 geographical identifier variables vs 120, across up to 12m records (individuals).
Yours, with thanks
James Kirkbride
UCL
Comment