Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Grouping observations and reshaping data

    Hi,
    I'm trying to generate a new variable named "group_id" that would group connected corporations together. In Canada, these connected corporations each have to file a schedule 23 in their T2 corporate tax return. The code below shows an example of how the data is presented in the database containing all completed schedule 23 as well as the values I would like my new variable "group_id" to take (in this example the business numbers "BN" are one digit, but in reality the business number is a nine-digit number that gives businesses its own unique identifier).

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input byte(BN related_BN group_id)
    1 2 1
    1 3 1
    2 1 1
    2 3 1
    3 1 1
    3 2 1
    4 7 2
    5 6 3
    6 5 3
    7 4 2
    end
    The first column is the business number of the corporation filing the schedule 23. The second column is the business number of any connected corporation (thus if the corporation is connected to more than one other corporation, we will have multiple observations for BN==1), and the third column is the one I would want to generate. As you can see in my example, the corporation with BN == 1 is connected to corporations 2 and 3, corp. 4 is connected to corp. 7, and corp. 5 and 6 are also connected.

    Q: How can I get Stata to know that when it reaches the first observation of BN==2, it should assign the value of 1 to "group_id"? Basically, what code do I use to generate variable "group_id"?

    Once the variable "group_id" is properly generated, I would simply use the following code to drop the variable "related_BN" and simply keep one observation for each BN:

    Code:
    drop related_BN
    duplicates drop BN, force
    This should give me something like the data below, which is what I want in the end:

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input byte(BN group_id)
    1 1
    2 1
    3 1
    4 2
    5 3
    6 3
    7 2
    end
    Thanks for any advice.

  • #2
    I believe this will work:

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input byte(BN related_BN)
    1 2
    1 3
    2 1
    2 3
    3 1
    3 2
    4 7
    5 6
    6 5
    7 4
    end
    
    gen long obs_no = _n
    gen group_id = .
    replace group_id = 1 in 1
    local current_obs = 1
    local current_group = 1
    
    while `current_obs' < _N {
        if !missing(group_id[`current_obs']) {
            local use_this_group = group_id[`current_obs']
        }
        else {
            local ++current_group
            local use_this_group = `current_group'
        }
        replace group_id = `use_this_group' if inlist(BN[`current_obs'], BN, related_BN) ///
            & missing(group_id)
        local ++current_obs
    }
    
    sort group_id BN
    list, noobs clean
    Note: This approach uses looping over observations and is, at best, unStata-ish. The problem here is that of computing a transitive closure. The above algorithm requires O(N2) computations, and will be painfully slow in a very large data set. More efficient transitive closure algorithms are known. But the other algorithms I know do not lend themselves to easy coding with Stata's data structure, so I have not attempted to implement one of them here.

    If somebody else knows a better way to do this, please do chime in.

    Comment


    • #3
      There an appropriately named group_id program available from SSC that can handle this:

      Code:
      * Example generated by -dataex-. To install: ssc install dataex
      clear
      input byte(BN related_BN)
      1 2
      1 3
      2 1
      2 3
      3 1
      3 2
      4 7
      5 6
      6 5
      7 4
      end
      
      * convert id pairs to a long layout
      gen long obs_no = _n
      expand 2
      bysort obs_no: gen id = cond(_n == 1, BN, related_BN)
      
      * group identifiers if they are paired by obs_no
      group_id id, matchby(obs_no)
      
      * return to the original observations and order
      bysort obs_no: keep if _n == 1
      sort id BN obs_no
      
      * if needed, renumber grouped identifiers sequentially
      egen id_seq = group(id)
      list, sepby(id)
      and the results:
      Code:
      . list, sepby(id)
      
           +--------------------------------------+
           | BN   relate~N   obs_no   id   id_seq |
           |--------------------------------------|
        1. |  1          2        1    1        1 |
        2. |  1          3        2    1        1 |
        3. |  2          1        3    1        1 |
        4. |  2          3        4    1        1 |
        5. |  3          1        5    1        1 |
        6. |  3          2        6    1        1 |
           |--------------------------------------|
        7. |  4          7        7    4        2 |
        8. |  7          4       10    4        2 |
           |--------------------------------------|
        9. |  5          6        8    5        3 |
       10. |  6          5        9    5        3 |
           +--------------------------------------+
      
      .

      Comment


      • #4
        Dear Clyde,

        Thanks a lot for your answer. The data set I'm using is fairly large (between 200K and 2M observations depending on the subset I'll use) so I hope it will not be too painfully slow. I don't mind letting it run through the night.

        Comment


        • #5
          Dear Robert,
          Thanks a lot for your reply. The group_id program worked well on my computer, I hope it will run smoothly on Statistics Canada servers where I access the microdata. Thanks again to both of you.

          Comment

          Working...
          X