Grouping observations and reshaping data

Govinda Bernier

Join Date: Jun 2018

Posts: 3
#1

Grouping observations and reshaping data

05 Jun 2018, 11:08

Hi,
I'm trying to generate a new variable named "group_id" that would group connected corporations together. In Canada, these connected corporations each have to file a schedule 23 in their T2 corporate tax return. The code below shows an example of how the data is presented in the database containing all completed schedule 23 as well as the values I would like my new variable "group_id" to take (in this example the business numbers "BN" are one digit, but in reality the business number is a nine-digit number that gives businesses its own unique identifier).

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input byte(BN related_BN group_id) 1 2 1 1 3 1 2 1 1 2 3 1 3 1 1 3 2 1 4 7 2 5 6 3 6 5 3 7 4 2 end

The first column is the business number of the corporation filing the schedule 23. The second column is the business number of any connected corporation (thus if the corporation is connected to more than one other corporation, we will have multiple observations for BN==1), and the third column is the one I would want to generate. As you can see in my example, the corporation with BN == 1 is connected to corporations 2 and 3, corp. 4 is connected to corp. 7, and corp. 5 and 6 are also connected.

Q: How can I get Stata to know that when it reaches the first observation of BN==2, it should assign the value of 1 to "group_id"? Basically, what code do I use to generate variable "group_id"?

Once the variable "group_id" is properly generated, I would simply use the following code to drop the variable "related_BN" and simply keep one observation for each BN:

Code:

drop related_BN duplicates drop BN, force

This should give me something like the data below, which is what I want in the end:

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input byte(BN group_id) 1 1 2 1 3 1 4 2 5 3 6 3 7 2 end

Thanks for any advice.
Tags: None

Clyde Schechter

Join Date: Apr 2014
Posts: 29948

05 Jun 2018, 11:54

I believe this will work:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input byte(BN related_BN)
1 2
1 3
2 1
2 3
3 1
3 2
4 7
5 6
6 5
7 4
end

gen long obs_no = _n
gen group_id = .
replace group_id = 1 in 1
local current_obs = 1
local current_group = 1

while `current_obs' < _N {
    if !missing(group_id[`current_obs']) {
        local use_this_group = group_id[`current_obs']
    }
    else {
        local ++current_group
        local use_this_group = `current_group'
    }
    replace group_id = `use_this_group' if inlist(BN[`current_obs'], BN, related_BN) ///
        & missing(group_id)
    local ++current_obs
}

sort group_id BN
list, noobs clean

Note: This approach uses looping over observations and is, at best, unStata-ish. The problem here is that of computing a transitive closure. The above algorithm requires O(N²) computations, and will be painfully slow in a very large data set. More efficient transitive closure algorithms are known. But the other algorithms I know do not lend themselves to easy coding with Stata's data structure, so I have not attempted to implement one of them here.

If somebody else knows a better way to do this, please do chime in.

Comment

Robert Picard

Join Date: Mar 2014
Posts: 1536

05 Jun 2018, 12:36

There an appropriately named group_id program available from SSC that can handle this:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input byte(BN related_BN)
1 2
1 3
2 1
2 3
3 1
3 2
4 7
5 6
6 5
7 4
end

* convert id pairs to a long layout
gen long obs_no = _n
expand 2
bysort obs_no: gen id = cond(_n == 1, BN, related_BN)

* group identifiers if they are paired by obs_no
group_id id, matchby(obs_no)

* return to the original observations and order
bysort obs_no: keep if _n == 1
sort id BN obs_no

* if needed, renumber grouped identifiers sequentially
egen id_seq = group(id)
list, sepby(id)

and the results:

Code:

. list, sepby(id)

     +--------------------------------------+
     | BN   relate~N   obs_no   id   id_seq |
     |--------------------------------------|
  1. |  1          2        1    1        1 |
  2. |  1          3        2    1        1 |
  3. |  2          1        3    1        1 |
  4. |  2          3        4    1        1 |
  5. |  3          1        5    1        1 |
  6. |  3          2        6    1        1 |
     |--------------------------------------|
  7. |  4          7        7    4        2 |
  8. |  7          4       10    4        2 |
     |--------------------------------------|
  9. |  5          6        8    5        3 |
 10. |  6          5        9    5        3 |
     +--------------------------------------+

.

Comment

Govinda Bernier

Join Date: Jun 2018

Posts: 3
#4

05 Jun 2018, 12:43

Dear Clyde,

Thanks a lot for your answer. The data set I'm using is fairly large (between 200K and 2M observations depending on the subset I'll use) so I hope it will not be too painfully slow. I don't mind letting it run through the night.
Comment
Govinda Bernier

Join Date: Jun 2018

Posts: 3
#5

05 Jun 2018, 12:55

Dear Robert,
Thanks a lot for your reply. The group_id program worked well on my computer, I hope it will run smoothly on Statistics Canada servers where I access the microdata. Thanks again to both of you.
Comment

Announcement