Counting unique instances of a shared identifier without double counting

Ashley Schwanebeck

Join Date: Mar 2019

Posts: 7
#1

Counting unique instances of a shared identifier without double counting

03 Feb 2022, 12:00

My data is uniquely identified at the incident-student level, and I want to create a variable called "degree" that counts the number of students that one student has been in incidents with. For example, student X with studentid 325342 has been in incidents with 6 other students. I'm stuck figuring out how to not double count students that X has been in multiple incidents with (for example 325342 and 457875 were in 2 incidents together, but I only want 457875 counted once for student 325342). What I've unsuccessfully attempted:

Merge in the same data set, then reshape long (so there will be one studentid with many columns of studentid2). This only merged in one matched studentid per student even if there were multiple students per one incident... so I'm not sure what went wrong there

Code:

rename studentid studentid1 m:m merge incident using samedataset.dta

Here is a sample of the data I'm using:

Code:

clear input incident studentid 1 325342 1 457875 1 367875 2 325342 2 342634 2 765486 3 354676 3 346745 3 547767 3 454748 4 578659 5 325342 5 457875 5 854755 5 354657 6 885675 end
Tags: identifier, long, merge, unique

Øyvind Snilsberg

Join Date: Oct 2021
Posts: 591

03 Feb 2022, 13:55

one (tedious) way to do this,

Code:

use data, clear
levelsof studentid, local(levels)

foreach l of local levels {
use data if studentid == `l', clear    
rename studentid studentid1
merge 1:m incident using data, keep(3) nogen
bys studentid studentid1: keep if _n==1
collapse (count) degree = incident , by(studentid1)
tempfile stud`l'
save `stud`l''
}

clear
foreach l of local levels {
append using `stud`l''    
}

replace degree = degree - 1 if degree > 1
rename studentid1 studentid
merge 1:m studentid using data, keep(3) nogen

Comment

Mike Lacy

Join Date: Apr 2014

Posts: 2416
#3

03 Feb 2022, 13:59

(I was about to cross with the preceding, but the code below shows a common way to work with pairs --joinby or cross-- and so perhaps is useful.)

One possible special case here would be a student who 1) is involved in an incident with no other students, such as 578659 or 885675; and 2) but *is* involved in at least one other incident involving another student, which is not true of either of those people. I'd presume such persons are possible in your data, and so to your example data, I added an entry for incident 5 for studentid 885675. I presume also that you do have at least some students with degree = 0, so I left 578659 in incident 4 in the data.

I have not completely checked the following code, but I think it should set you going in the right direction. And, regarding m:m merges, "just say no" <grin>. Whenever you think you want that, you almost certainly want to use the built-in -joinby- or -cross- commands. You *might* be able to accomplish something like what I do below with -reshape- (?), but I believe -joinby- is generally the best way to go here. Finally, the following could likely be done more nicely with frames, but I don't have that capacity.

Take a look at the following and see if it handles your data correctly:

Code:

clear input incident studentid 1 325342 1 457875 1 367875 2 325342 2 342634 2 765486 3 354676 3 346745 3 547767 3 454748 4 578659 5 325342 5 457875 5 854755 5 354657 5 885675 6 885675 end // Students involved only in incidents with *no* other students // appear to be a special case, so I'd treat them separately. bysort incident: gen degree = 0 if _N == 1 preserve rename studentid studentid1 keep if degree == 0 tempfile zeros save `zeros' restore drop if degree == 0 // we'll get these people back later // // Put the original file into "edge" format by joining it with a copy of itself. tempfile temp rename studentid studentid2 save `temp' rename studentid2 studentid1 joinby incident using `temp' // // Drop self pairs and duplicates drop if studentid1 == studentid2 duplicates drop studentid1 studentid2, force // bysort studentid1: replace degree = _N // partners across all incidents by studentid1: keep if _n ==1 // just one record per student is needed keep studentid1 degree append using `zeros' // list studentid1 degree
2 likes
Comment

William Lisowski

Join Date: Dec 2014
Posts: 10150

03 Feb 2022, 14:01

Perhaps you will find that this code starts you in a useful direction.

Code:

rename studentid studentid1
tempfile s1
save `s1'
rename studentid1 studentid2
joinby incident using `s1'
drop if studentid1==studentid2
order studentid1 studentid2
collapse (count) times=incident, by(studentid1 studentid2)
list if studentid1<=346745, sepby(studentid1) abbreviate(20)

Code:

. list if studentid1<=346745, sepby(studentid1) abbreviate(20)

     +---------------------------------+
     | studentid1   studentid2   times |
     |---------------------------------|
  1. |     325342       342634       1 |
  2. |     325342       354657       1 |
  3. |     325342       367875       1 |
  4. |     325342       457875       2 |
  5. |     325342       765486       1 |
  6. |     325342       854755       1 |
     |---------------------------------|
  7. |     342634       325342       1 |
  8. |     342634       765486       1 |
     |---------------------------------|
  9. |     346745       354676       1 |
 10. |     346745       454748       1 |
 11. |     346745       547767       1 |
     +---------------------------------+

Comment

Romalpa Akzo

Join Date: Oct 2017

Posts: 369
#5

03 Feb 2022, 16:02

Some modifications of William's code in #4 would directly give out the wanted count.

Code:

tempfile Original save `Original' ren studentid id2 joinby using `Original' bys studentid id2: keep if _n==1 & studentid != id2 collapse (count) Wanted = id2, by(studentid)

The modifications are indeed minor since essentially, the code just follows the well-established and straight logic suggested by William, whereas the key commands are -joinby- and -collapse-.

Note the line in red, which is the one that addresses the concern of double counting the incidents. Check it out.
2 likes
Comment
William Lisowski

Join Date: Dec 2014

Posts: 10150
#6

03 Feb 2022, 16:35

Mike Lacy shows greater insight than mine by retaining students who are in incidents alone.

Romalpa Akzo points out a difference in our approaches to the problem.

Rather than aim for the expressed goal of counting the number of other students each student had been in an incident with, I thought it interesting to produce the intermediate dataset that could be used as a starting point for both that and other analysis - for example, to find cliques of students who are frequently in incidents together. I left the final step as an exercise for the reader. :-)
1 like
Comment
Romalpa Akzo

Join Date: Oct 2017

Posts: 369
#7

03 Feb 2022, 17:08

William reminds the point of retaining "no-incident" students, which has been well pointed out by Mike. A minor modification for the code in 5 could also do that.

Code:

tempfile Original save `Original' ren studentid id2 joinby using `Original' bys studentid id2: keep if _n==1 collapse (count) Wanted = id2, by(studentid) replace Wanted = Wanted - 1
2 likes
Comment
Mike Lacy

Join Date: Apr 2014

Posts: 2416
#8

03 Feb 2022, 20:58

I'll throw in one more thought, which perhaps Ashley Schwanebeck or someone else who is network-savvy can use: This problem can be construed as a network problem, as suggested by the terminology "degree" for the number of relations to others each student has. On that count, there might be a shorter and perhaps conceptually better way to treat this data using the -nwcommands- package (- net describe nwcommands-ado, from(http://www.nwcommands.org) -). To my limited understanding, the data here are given as what social network folk would call a two-mode network (students and incidents), which I believe one can manipulate with -nwcommands- to yield all sorts of information about the student-student relationships. Not only would it calculate the degree variable--I presume--but it could also offer other insights about actor-actor relationships. About the only use I've made of -nwcommands- is to suggest and illustrate some uses of it to posters on this list, so I can't be of much further help here.

Last edited by Mike Lacy; 03 Feb 2022, 21:28.
1 like
Comment

Announcement

Counting unique instances of a shared identifier without double counting

Comment

Comment

Comment

Comment

Comment

Comment

Comment