Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Counting unique instances of a shared identifier without double counting


    My data is uniquely identified at the incident-student level, and I want to create a variable called "degree" that counts the number of students that one student has been in incidents with. For example, student X with studentid 325342 has been in incidents with 6 other students. I'm stuck figuring out how to not double count students that X has been in multiple incidents with (for example 325342 and 457875 were in 2 incidents together, but I only want 457875 counted once for student 325342). What I've unsuccessfully attempted:

    Merge in the same data set, then reshape long (so there will be one studentid with many columns of studentid2). This only merged in one matched studentid per student even if there were multiple students per one incident... so I'm not sure what went wrong there
    Code:
      rename studentid studentid1  
     m:m merge incident using samedataset.dta
    Here is a sample of the data I'm using:
    Code:
     clear
    input incident studentid
    1 325342
    1 457875
    1 367875
    2 325342
    2 342634
    2 765486
    3 354676
    3 346745
    3 547767
    3 454748
    4 578659
    5 325342
    5 457875
    5 854755
    5 354657
    6 885675
    end

  • #2
    one (tedious) way to do this,
    Code:
    use data, clear
    levelsof studentid, local(levels)
    
    foreach l of local levels {
    use data if studentid == `l', clear    
    rename studentid studentid1
    merge 1:m incident using data, keep(3) nogen
    bys studentid studentid1: keep if _n==1
    collapse (count) degree = incident , by(studentid1)
    tempfile stud`l'
    save `stud`l''
    }
    
    clear
    foreach l of local levels {
    append using `stud`l''    
    }
    
    replace degree = degree - 1 if degree > 1
    rename studentid1 studentid
    merge 1:m studentid using data, keep(3) nogen

    Comment


    • #3
      (I was about to cross with the preceding, but the code below shows a common way to work with pairs --joinby or cross-- and so perhaps is useful.)

      One possible special case here would be a student who 1) is involved in an incident with no other students, such as 578659 or 885675; and 2) but *is* involved in at least one other incident involving another student, which is not true of either of those people. I'd presume such persons are possible in your data, and so to your example data, I added an entry for incident 5 for studentid 885675. I presume also that you do have at least some students with degree = 0, so I left 578659 in incident 4 in the data.

      I have not completely checked the following code, but I think it should set you going in the right direction. And, regarding m:m merges, "just say no" <grin>. Whenever you think you want that, you almost certainly want to use the built-in -joinby- or -cross- commands. You *might* be able to accomplish something like what I do below with -reshape- (?), but I believe -joinby- is generally the best way to go here. Finally, the following could likely be done more nicely with frames, but I don't have that capacity.

      Take a look at the following and see if it handles your data correctly:

      Code:
      clear
      input incident studentid
      1 325342
      1 457875
      1 367875
      2 325342
      2 342634
      2 765486
      3 354676
      3 346745
      3 547767
      3 454748
      4 578659
      5 325342
      5 457875
      5 854755
      5 354657
      5 885675
      6 885675
      end
      // Students involved only in incidents with *no* other students
      // appear to be a special case, so I'd treat them separately.
      bysort incident: gen degree = 0 if _N == 1
      preserve
      rename studentid studentid1
      keep if degree == 0
      tempfile zeros
      save `zeros'
      restore
      drop if degree == 0  // we'll get these people back later
      //
      //  Put the original file into "edge" format by joining it with a copy of itself.
      tempfile temp
      rename studentid studentid2
      save `temp'
      rename studentid2 studentid1
      joinby incident using `temp'
      //
      // Drop self pairs and duplicates
      drop if studentid1 == studentid2
      duplicates drop studentid1 studentid2, force
      //
      bysort studentid1: replace degree = _N  // partners across all incidents
      by studentid1: keep if _n ==1  // just one record per student is needed
      keep studentid1 degree
      append using `zeros'
      //
      list studentid1 degree

      Comment


      • #4
        Perhaps you will find that this code starts you in a useful direction.
        Code:
        rename studentid studentid1
        tempfile s1
        save `s1'
        rename studentid1 studentid2
        joinby incident using `s1'
        drop if studentid1==studentid2
        order studentid1 studentid2
        collapse (count) times=incident, by(studentid1 studentid2)
        list if studentid1<=346745, sepby(studentid1) abbreviate(20)
        Code:
        . list if studentid1<=346745, sepby(studentid1) abbreviate(20)
        
             +---------------------------------+
             | studentid1   studentid2   times |
             |---------------------------------|
          1. |     325342       342634       1 |
          2. |     325342       354657       1 |
          3. |     325342       367875       1 |
          4. |     325342       457875       2 |
          5. |     325342       765486       1 |
          6. |     325342       854755       1 |
             |---------------------------------|
          7. |     342634       325342       1 |
          8. |     342634       765486       1 |
             |---------------------------------|
          9. |     346745       354676       1 |
         10. |     346745       454748       1 |
         11. |     346745       547767       1 |
             +---------------------------------+

        Comment


        • #5
          Some modifications of William's code in #4 would directly give out the wanted count.
          Code:
          tempfile Original
          save `Original'
          
          ren studentid id2
          joinby using `Original'
          
          bys studentid id2: keep if _n==1 & studentid != id2
          collapse (count) Wanted = id2, by(studentid)
          The modifications are indeed minor since essentially, the code just follows the well-established and straight logic suggested by William, whereas the key commands are -joinby- and -collapse-.

          Note the line in red, which is the one that addresses the concern of double counting the incidents. Check it out.

          Comment


          • #6
            Mike Lacy shows greater insight than mine by retaining students who are in incidents alone.

            Romalpa Akzo points out a difference in our approaches to the problem.

            Rather than aim for the expressed goal of counting the number of other students each student had been in an incident with, I thought it interesting to produce the intermediate dataset that could be used as a starting point for both that and other analysis - for example, to find cliques of students who are frequently in incidents together. I left the final step as an exercise for the reader. :-)

            Comment


            • #7
              William reminds the point of retaining "no-incident" students, which has been well pointed out by Mike. A minor modification for the code in 5 could also do that.
              Code:
              tempfile Original
              save `Original'
              
              ren studentid id2
              joinby using `Original'
              
              bys studentid id2: keep if _n==1
              collapse (count) Wanted = id2, by(studentid)
              
              replace Wanted = Wanted - 1

              Comment


              • #8
                I'll throw in one more thought, which perhaps Ashley Schwanebeck or someone else who is network-savvy can use: This problem can be construed as a network problem, as suggested by the terminology "degree" for the number of relations to others each student has. On that count, there might be a shorter and perhaps conceptually better way to treat this data using the -nwcommands- package (- net describe nwcommands-ado, from(http://www.nwcommands.org) -). To my limited understanding, the data here are given as what social network folk would call a two-mode network (students and incidents), which I believe one can manipulate with -nwcommands- to yield all sorts of information about the student-student relationships. Not only would it calculate the degree variable--I presume--but it could also offer other insights about actor-actor relationships. About the only use I've made of -nwcommands- is to suggest and illustrate some uses of it to posters on this list, so I can't be of much further help here.
                Last edited by Mike Lacy; 03 Feb 2022, 22:28.

                Comment

                Working...
                X