Collapse on part of string

jameskb101

Join Date: May 2014

Posts: 18
#1

Collapse on part of string

06 May 2014, 08:01

Dear all

I am working on a national dataset of up to 12m observations (individuals), which will need to frequently be aggregated to a smaller number of four types of nested higher level units (i.e. let us say neighbourhoods, local authorities, counties and regions). The data will need to be collapsed using the higher level units in the "by()" specification of the -collapse- command.

Given the size of the databases (3gb+) data storage efficiency is important.

In the dataset the smallest level of geographical unit (a neighbourhood) is defined by a unique 8 digit code as thus RRCCLLNN, where

NN is the two digit neighbourhood code from 01 to 99
LL is the two digit local authority code from 01 to 99
CC is the two digit county code from 01 to 99
RR is the two digit region code from 01 to 99

There are three other variables in the dataset used as geographical identifiers at other levels: local authorities are defined by a separate variable RRCCLL, likewise for counties (RRCC) and regions (RR). This creates four geographical identifiers which contain no extra information than one of them (the neighbourhood one, which contains all higher level codes).

I will need to aggregate individual level data using the -collapse- command to each of these four levels. For storage efficiency it would be preferable to only have a single geographical identifier than all four.

I think this will only be possible if it is possible to -collapse- data by part of the string. I would like some help from the expert community on whether this is possible.

I have tried the following command, which did not work, but hopefully illustrates what I would like to achieve. Assuming the variable "geographical_unit" is of the structure RRCCLLNN:

collapse (sum) cases, by(regexm(geographical_unit, "^[0-9][0-9][0-9][0-9][0-9][0-9]"))

which in my mind (but not in Stata) should return the sum of "cases" according to the first 6 digits of the "geographical_unit" variable i.e. according to RRCCLL i.e. at Local Authority level.

Stata returned an r101 error "factor variables and time-seies operators not allowed in option by()

I also tried

collapse (sum) cases, by(substr(geographical_unit,1,6))

with comparable failure.

Any help would be very much appreciated and apologies if this topic has been previously covered, I did inspect the old mailing lists first.

You might consider this a trivial storage issue as it appears from my simplified description, because I am only trying to reduce my dataset by 3 variables. However, individuals are coded to their neighbourhood in 30 separate years, so the actual storage gain is 30 geographical identifier variables vs 120, across up to 12m records (individuals).

Yours, with thanks

James Kirkbride
UCL

Last edited by jameskb101; 06 May 2014, 08:14.
Tags: None
Nick Cox

Join Date: Mar 2014

Posts: 35698
#2

06 May 2014, 08:09

collapse requires a varlist as argument to by() and will accept nothing else. I don't have time right now to read the rest of your problem carefully, but others will almost certainly add the details.
Comment
jameskb101

Join Date: May 2014

Posts: 18
#3

06 May 2014, 08:17

Thanks for your quick reply Nick, at least I know that what I wish to do isn't possible in they way I was thinking. I will see if others have any alternative suggestions.
Comment
Andreas Horr

Join Date: May 2014

Posts: 1
#4

06 May 2014, 08:25

As Nick suggested, you just need a variable instead of an expression:

gen newvar = substr(geographical_unit,1,6)
collapse (sum) cases, by(newvar)
Comment
Joe Canner

Join Date: Mar 2014

Posts: 580
#5

06 May 2014, 08:26

James,

Given that collapse requires variable names, I would suggest splitting up your geographical_unit variable into it's four components and then dropping the original variable. You can then use various combinations of those four components to achieve your purposes:

Code:

gen NN=substr(geographical_unit,1,2) gen LL=substr(geographical_unit,3,2) gen CC=substr(geographical_unit,5,2) gen RR=substr(geographical_unit,7,2) drop geographical_unit .... collapse (sum) cases, by(NN LL CC)

Regards,
Joe
Comment

Announcement

Collapse on part of string

Comment

Comment

Comment

Comment