identifying indirect relationships

Paula de Souza Leao Spinola

Join Date: Jun 2015

Posts: 384
#1

identifying indirect relationships

18 Jan 2019, 06:49

I have a dataset with physicians id (variable named phy_id) working in different hospitals (variable named hosp_id) in different years/months. I say that physicians are colleagues when they work together in the same hospital at the same time. Physicians may work in different hospitals at the same time and/or over time.

I would like to identify, for each physician at a given hospital in a given time, all physicians that have worked with their current colleagues in different hospitals in the past (but not with them).

Code:

clear input str1 hosp_id period str1 phy_id "i" 1 "A" "i" 1 "B" "i" 1 "C" "i" 1 "D" "i" 2 "A" "i" 2 "B" "i" 2 "C" "i" 2 "D" "j" 1 "B" "j" 1 "C" "j" 1 "E" "j" 2 "B" "j" 2 "C" "j" 2 "E" "k" 1 "B" "k" 1 "F" "k" 2 "G" end label var phy_id "physician id" label var hosp_id "hospital id"

Consider the sample above. I have three hospitals (i, j, k), 2 periods (1, 2) and 7 physicians (A, B, C, D, E, F, G). In hospital i at periods 1 and 2 physicians A, B, C, D are colleagues. In hospital j at periods 1 and 2 physicians B, C and E are colleagues. In hospital k at period 1 physicians B and F are colleagues and at period 2 physician G works alone. Hence, I would like to identify for physician A at time 2 in hospital i, physicians E (once they worked with physicians B and C at time 1 in hospital j) and F (once he worked with physicians B at time 1 in hospital k).

Any ideas on how I can code this?

Many thanks
Paula
Tags: None

1 like

Mike Lacy

Join Date: Apr 2014
Posts: 2404

18 Jan 2019, 11:13

OK, I think I understand you and have something that might be a solution. I'm going to use ego/alter terminology because that helps me think.
An "ego" is any one of your physicians. An "alter" is anyone who works with a physician in the same hospital. With that language, I think you want: "For each ego, get a list of all the *past* alters of ego's *current* alters."

Here's what I have, which I have not empirically checked, but which I hope is in the right direction. Someone more clever than me will likely have a better way to think about and do this.
I've just done the problem for the *current* egos. And, note that your data set makes it easier by having only two periods.

Code:

clear
input str1 hosp_id period str1 phy_id
"i" 1 "A"
"i" 1 "B"
"i" 1 "C"
"i" 1 "D"
"i" 2 "A"
"i" 2 "B"
"i" 2 "C"
"i" 2 "D"
"j" 1 "B"
"j" 1 "C"
"j" 1 "E"
"j" 2 "B"
"j" 2 "C"
"j" 2 "E"
"k" 1 "B"
"k" 1 "F"
"k" 2 "G"
end
label var phy_id "physician id"
label var hosp_id "hospital id"
rename phy_id ego_id      // ego/alter helps me think
rename hosp_id ego_hosp
rename period ego_period
tempfile ego
save `ego'
// Make all possible pairs of egos and alters
rename ego* alter*
tempfile temp
save `temp'
//
use `ego', clear
cross using `temp' // cross is the key
drop if (ego_id >= alter_id) // no self pairs or reverse order pairs
tempfile allpairs
save `allpairs'
//
// File of ego/alter pairs in same *current* hospital
keep if (ego_period ==2) & (alter_period ==2) & (ego_hosp == alter_hosp) // current
keep ego_id alter_id
gen byte period = 2
order ego_id alter_id period
tempfile current
save `current'  
list
//
// File of ego/alter pairs in same past hospital
use `allpairs'
keep if (ego_period ==1) & (alter_period ==2) & (ego_hosp == alter_hosp) // past
keep ego_id alter_id
gen byte period = 1
keep ego_id alter_id period
// But these are alters paired with alter's alters.
rename alter_id alter_alter_id
rename ego_id alter_id // egos
// We want a list of all alter_alters for each alter, so must go wide
bysort alter_id: gen byte seq = _n
reshape wide alter_alter_id, i(alter_id) j(seq)
keep alter* // only relevant variables
tempfile past
save `past'  
list
// Now for each current ego, merge the list of the past alters of each of ego's alters
use `current', clear
merge m:1 alter_id using `past'

Again, I would not be surprised if -nwcommands- had a five line solution to this.

Comment

Mike Lacy

Join Date: Apr 2014
Posts: 2404

18 Jan 2019, 11:58

Whoops, I found one mistake. The *past* pairs should both be *period == 1.

Code:

// File of ego/alter pairs in same past hospital
use `allpairs'
keep if (ego_period ==1) & (alter_period ==1) & (ego_hosp == alter_hosp) // past
                          ^^^^^^^^^^^^^^^^^^^^^

Comment

Paula de Souza Leao Spinola

Join Date: Jun 2015
Posts: 384

01 Feb 2019, 08:48

Many thanks for your kind response, Mike! And sorry for the delay in getting back to you - I was working on this code.

I used the reasoning you suggested above and wrote the code below which gives me exactly what I want: for each pairwise combination of physicians at a given time in a given hospital I identify the physicians with whom the alter physician has worked in the past in other hospitals (excluding those who have ever worked with the ego physician).

Code:

clear
input str1 hosp_id period str1 phy_id
"i" 1 "A"
"i" 1 "B"
"i" 1 "C"
"i" 1 "D"
"i" 2 "A"
"i" 2 "B"
"i" 2 "C"
"i" 2 "D"
"j" 1 "B"
"j" 1 "C"
"j" 1 "E"
"j" 2 "B"
"j" 2 "C"
"j" 2 "E"
"k" 1 "B"
"k" 1 "F"
"k" 2 "G" /*isolate nodes do not show up*/
end

* File with egos (sender)
rename phy_id ego_id
rename hosp_id ego_hosp
rename period ego_period
save egos, replace

* File with alters (receiver)
rename ego* alter*
save alters, replace

* File with all possible pairs of ego/alter working in same hospital at same time
use egos, clear
cross using alters.dta // cross forms every pairwise combination between 2 datasets
drop if (ego_id == alter_id) // no self pairs
keep if ego_period == alter_period & ego_hosp == alter_hosp // same hospital at same time
rename (ego_period ego_hosp) (period hosp)
drop alter_period alter_hosp
save colleagues.dta, replace

* Reshaping colleagues at any period - hospital
use colleagues.dta, clear
keep ego_id alter_id
duplicates drop
bysort ego_id: gen byte seq = _n
reshape wide alter_id, i(ego_id) j(seq)
save colleagues_reshape.dta, replace

* Adding colleagues of colleagues in different hospitals sometime in the past
use colleagues.dta, clear
rename (hosp period ego_id alter_id) (a_hosp a_period a_ego_id a_alter_id)
cross using colleagues.dta
order hosp period ego_id alter_id
keep if alter_id==a_ego_id //only colleagues of the physician's colleague
drop a_ego_id
drop if ego_id == a_alter_id //droping when the colleague of the physician's colleague is the physician himself
drop if hosp == a_hosp //only colleagues of colleagues in different hospitals
drop if period <= a_period //only past colleagues of colleagues
sort hosp period ego_id alter_id a_hosp a_period a_alter_id

// droping colleagues of the physician's colleagues who have never worked with him
merge m:1 ego_id using colleagues_reshape, keep(master match) nogenerate 
drop if a_alter_id == alter_id1 | a_alter_id == alter_id2 | a_alter_id == alter_id3 | a_alter_id == alter_id4 | a_alter_id == alter_id5
drop alter_id1 alter_id2 alter_id3 alter_id4 alter_id5

My issue is that when I try to apply this code to my real data (804,209 obs) for which I have 120 different periods, 4,330 different hospitals and 78,610 different physicians, Stata does not allow me to proceed (see below). I do drop a lot of the observations generated by -cross- as you can see in the code above (e.g. pairwise combination of physicians in different hospital) and hence there would be no need to generate such combination. However the command -cross- does not allow me to specify which combinations to avoid. Would you have any idea of how I could change this code to make it feasible for my dataset?

Code:

. * File with all possible pairs of ego/alter working in same hospital at same time
. use egos, clear

. cross using alters.dta // cross forms every pairwise combination between 2 datasets
sum of expand values exceed 2,147,483,620
    The dataset may not contain more than 2,147,483,620 observations.
r(459);

end of do-file

r(459);

Many thanks!

Comment

Mike Lacy

Join Date: Apr 2014

Posts: 2404
#5

01 Feb 2019, 09:14

This is an interesting problem, particularly at the scale that you are working. My crude approach can't work here. I have a couple of thoughts:
1) Can you do this in pieces? That is, can you start with the "egos" as just the physicians in period T, finding all their past alters. Then, do it for physicians in period T-1, etc.?
If there are, say, 78000/120 = 650 physicians in period T, and you cross them with the approx. 77000 physicians from previous periods, that would make the file from -cross- have size of only about _N = 50e6, which might work?

2) Ask Stata tech support for help. This is a problem that would really stretch Stata, and therefore should interest them as a way to demonstrate its capacities.

3) Go to https://nwcommands.wordpress.com/, and try to get help. Again, your problem is difficult enough, I think, to get very thoughtful people interested. There may be some completely different algorithmic approach known to people who do social network analysis.

Please keep following up with this on the list here.
Comment
Paula de Souza Leao Spinola

Join Date: Jun 2015

Posts: 384
#6

01 Feb 2019, 10:17

Hi Mike, thanks again!! Sure, I will keep you informed.

Regarding your points:

1) 78,610 is the number of unique ids of physicians (I get it from running - codebook-). The same physicians show up over time. Instead, I tried to run the code for physicians working in one city. Still, Stata did not allow me to run the second - cross - command.

2) Will do that now!

3) I tried to contact Thomas Grund but haven't heard back from him. Also, the website and the related google group have not been updated since 2016. Would you know why? Would you have any contact I could reach out to?

Many thanks
Paula
Comment
Mike Lacy

Join Date: Apr 2014

Posts: 2404
#7

01 Feb 2019, 12:10

Hmmm. When I last sent a question about -nwcommands- a few years ago, Grund responded in a few days. I see there is a forthcoming StataPress book on this
https://www.stata.com/gsearch.php?q=...td&restrict=SP
for what it's worth.

One thought: What about doing -joinby- to form all pairs at the same hospital at the same period, and then trimming down from there?

And: A good full size example would help here. I'll offer the following as a simulated full-sized data set, perhaps with some bad assumptions. I can't follow up on this today, but perhaps having an example will stimulate some other people to jump in.

Code:

set seed 4754 clear local Nphysicians = 78000 local Nperiod = 120 local Nhospital = 4000 local NHosp_per_doc = 4 // assumption local Nperiod_per_doc = 8 set obs `Nphysicians' gen long ego_id = _n gen int np = ceil(runiform() * `Nperiod_per_doc') gen byte nh = ceil(runiform() * `NHosp_per_doc') expand np * nh gen int ego_period = ceil(runiform() * `Nperiod') gen int ego_hosp = ceil(runiform() * `Nhospital')
Comment

Robert Picard

Join Date: Mar 2014
Posts: 1536

01 Feb 2019, 21:02

I'm not familiar with nwcommands or network analysis so I don't know if you could get to a solution that way.

I'm not seeing any shortcuts that would reduce the scale of what needs to be done but I think you can leverage the fact that each observation has a solution that is relatively easy to get to. If it is programmed correctly, you can then rinse and repeat for all remaining observations using runby (from SSC).

For example, let's assume that the data in memory contains only the test case observation mentioned in #1:

Code:

     +---------------------------+
     | hosp_id   period   phy_id |
     |---------------------------|
  1. |       i        2        A |
     +---------------------------+

and let's assume that the example dataset in #1 is saved as "statalist_test.dta". The following program will:

make a list of current colleagues (same hospital and period);
fetch all observations for these colleagues from the full dataset;
reduce to observations from past periods at different hospitals;
form all pairwise combinations with the full dataset within the same period and hospital to get all colleagues of current colleagues;
remove observations from current colleagues.

I'm assuming that you only want one instance of a past colleague of a current colleague so I pick one case (rather arbitrarily). Copy the following program and save it in a file called "do_one_case.do" in Stata's current directory.

Code:

program do_one_case

    local h = hosp_id[1]
    local t = period[1]
    local p = phy_id[1]

    use if hosp_id == "`h'" & period == `t' & phy_id != "`p'" using "statalist_test.dta", clear
    if _N == 0 exit
    keep phy_id
    isid phy_id, sort
    dis _n "---------- colleagues at same hospital in same period -------------"
    list
    tempfile hold
    save "`hold'"

    merge 1:m phy_id using "statalist_test.dta", keep(match) nogen
    dis _n "----------  past history for these colleagues at different hospitals -------------"
    drop if period >= `t'
    drop if hosp_id == "`h'"

    if _N == 0 exit
    isid phy_id hosp_id period, sort
    gen long case = _n
    list

    rename phy_id curcoll
    joinby period hosp_id using "statalist_test.dta"
    if _N == 0 exit
    dis _n "---------- colleagues at matching time and hospital -------------"
    isid case phy_id, sort
    list , sepby(case)

    // remove obs from current colleagues
    merge m:1 phy_id using "`hold'", keep(master) nogen
    dis _n "---------- colleagues at matching time and hospital minus current colleagues -------------"
    if _N == 0 exit
    isid case phy_id, sort
    list , sepby(case)

    // adjust var names and reduce to one obs per past colleague
    rename phy_id pastcoll
    rename hosp_id pasthosp
    rename period pastp
    isid pastcoll curcoll pasthosp pastp, sort
    by pastcoll: keep if _n == 1

    // identify current case for each past colleague
    gen hosp_id = "`h'"
    gen period  = `t'
    gen phy_id  = "`p'"
    order hosp_id period phy_id

end

Copy the following in a do-file, also in the current directory:

Code:

version 14
clear all
input str1 hosp_id period str1 phy_id
"i" 1 "A"
"i" 1 "B"
"i" 1 "C"
"i" 1 "D"
"i" 2 "A"
"i" 2 "B"
"i" 2 "C"
"i" 2 "D"
"j" 1 "B"
"j" 1 "C"
"j" 1 "E"
"j" 2 "B"
"j" 2 "C"
"j" 2 "E"
"k" 1 "B"
"k" 1 "F"
"k" 2 "G"
end

* make sure there are no duplicates!
isid hosp_id period phy_id, sort
save "statalist_test.dta", replace

* load the program
run "do_one_case.do"

keep if phy_id == "A" & period == 2 & hosp_id == "i"
list

* run the program
do_one_case

* show results
order hosp_id period phy_id
list

Here's the output if you run the above do-file:

Code:

. version 14

. clear all

. input str1 hosp_id period str1 phy_id

       hosp_id     period     phy_id
  1. "i" 1 "A"
  2. "i" 1 "B"
  3. "i" 1 "C"
  4. "i" 1 "D"
  5. "i" 2 "A"
  6. "i" 2 "B"
  7. "i" 2 "C"
  8. "i" 2 "D"
  9. "j" 1 "B"
 10. "j" 1 "C"
 11. "j" 1 "E"
 12. "j" 2 "B"
 13. "j" 2 "C"
 14. "j" 2 "E"
 15. "k" 1 "B"
 16. "k" 1 "F"
 17. "k" 2 "G"
 18. end

.
. * make sure there are no duplicates!
. isid hosp_id period phy_id, sort
(data now sorted by hosp_id period phy_id)

. save "statalist_test.dta", replace
file statalist_test.dta saved

.
. * load the program
. run "do_one_case.do"

.
. keep if phy_id == "A" & period == 2 & hosp_id == "i"
(16 observations deleted)

. list

     +---------------------------+
     | hosp_id   period   phy_id |
     |---------------------------|
  1. |       i        2        A |
     +---------------------------+

.
. * run the program
. do_one_case
(data now sorted by phy_id)

---------- colleagues at same hospital in same period -------------

     +--------+
     | phy_id |
     |--------|
  1. |      B |
  2. |      C |
  3. |      D |
     +--------+
file /var/folders/cp/z8cssshn6935x9p181c71_7m0000gn/T//S_13346.000001 saved

    Result                           # of obs.
    -----------------------------------------
    not matched                             0
    matched                                11  
    -----------------------------------------

----------  past history for these colleagues at different hospitals -------------
(5 observations deleted)
(3 observations deleted)
(data now sorted by phy_id hosp_id period)

     +----------------------------------+
     | phy_id   hosp_id   period   case |
     |----------------------------------|
  1. |      B         j        1      1 |
  2. |      B         k        1      2 |
  3. |      C         j        1      3 |
     +----------------------------------+

---------- colleagues at matching time and hospital -------------
(data now sorted by case phy_id)

     +--------------------------------------------+
     | curcoll   hosp_id   period   case   phy_id |
     |--------------------------------------------|
  1. |       B         j        1      1        B |
  2. |       B         j        1      1        C |
  3. |       B         j        1      1        E |
     |--------------------------------------------|
  4. |       B         k        1      2        B |
  5. |       B         k        1      2        F |
     |--------------------------------------------|
  6. |       C         j        1      3        B |
  7. |       C         j        1      3        C |
  8. |       C         j        1      3        E |
     +--------------------------------------------+

    Result                           # of obs.
    -----------------------------------------
    not matched                             3
        from master                         3  
        from using                          0  

    matched                                 0  
    -----------------------------------------

---------- colleagues at matching time and hospital minus current colleagues -------------
(data now sorted by case phy_id)

     +--------------------------------------------+
     | curcoll   hosp_id   period   case   phy_id |
     |--------------------------------------------|
  1. |       B         j        1      1        E |
     |--------------------------------------------|
  2. |       B         k        1      2        F |
     |--------------------------------------------|
  3. |       C         j        1      3        E |
     +--------------------------------------------+
(data now sorted by pastcoll curcoll pasthosp pastp)
(1 observation deleted)

.
. * show results
. order hosp_id period phy_id

. list

     +--------------------------------------------------------------------------+
     | hosp_id   period   phy_id   curcoll   pasthosp   pastp   case   pastcoll |
     |--------------------------------------------------------------------------|
  1. |       i        2        A         B          j       1      1          E |
  2. |       i        2        A         B          k       1      2          F |
     +--------------------------------------------------------------------------+

.

The do_one_case program puts out a lot of output to help track what is going on.

Once you are sure that the program does what you want, you can use runby to call the do_one_case program for each observation in the data. With runby, what's left in memory after your program terminates is considered results and accumulates. If there's nothing in memory, then runby moves on to the next group. So a complete solution for the data example in #1 would be:

Code:

clear all
use "statalist_test.dta"

* confirm that obs are uniquely identified using ...
isid hosp_id period phy_id, sort

run "do_one_case.do"

runby do_one_case, by(period hosp_id phy_id) status

list

and the results:

Code:

. runby do_one_case, by(period hosp_id phy_id) status

  elapsed ----------- by-groups ----------    ------- observations ------       time
     time      count     errors    no-data        processed         saved  remaining
------------------------------------------------------------------------------------
 00:00:00         17          0         10               17            19   00:00:00

--------------------------------------
Number of by-groups    =            17
by-groups with errors  =             0
by-groups with no data =            10
Observations processed =            17
Observations saved     =            19
--------------------------------------

.
. list

     +--------------------------------------------------------------------------+
     | hosp_id   period   phy_id   curcoll   pasthosp   pastp   case   pastcoll |
     |--------------------------------------------------------------------------|
  1. |       i        2        A         B          j       1      1          E |
  2. |       i        2        A         B          k       1      2          F |
  3. |       i        2        B         C          j       1      1          B |
  4. |       i        2        B         C          j       1      1          E |
  5. |       i        2        C         B          j       1      1          C |
     |--------------------------------------------------------------------------|
  6. |       i        2        C         B          j       1      1          E |
  7. |       i        2        C         B          k       1      2          F |
  8. |       i        2        D         B          j       1      1          E |
  9. |       i        2        D         B          k       1      2          F |
 10. |       j        2        B         C          i       1      1          A |
     |--------------------------------------------------------------------------|
 11. |       j        2        B         C          i       1      1          B |
 12. |       j        2        B         C          i       1      1          D |
 13. |       j        2        C         B          i       1      1          A |
 14. |       j        2        C         B          i       1      1          C |
 15. |       j        2        C         B          i       1      1          D |
     |--------------------------------------------------------------------------|
 16. |       j        2        C         B          k       1      2          F |
 17. |       j        2        E         B          i       1      1          A |
 18. |       j        2        E         B          i       1      1          D |
 19. |       j        2        E         B          k       1      2          F |
     +--------------------------------------------------------------------------+

.

Note that all the output generated by the do_one_case program is suppressed by default.

I'm not sure how well this will scale on the full problem but I would try to first sort the data by period and try it only on a few initial periods. Here's a small scale test on random data:

Code:

version 14
clear all
set seed 31231

set obs 100
gen phy_id = "dr" + string(_n, "%04.0f")

expand runiform(1,4)
bysort phy_id: gen hosp_id = "hosp" + string(runiformint(1,10), "%03.0f")

bysort phy_id hosp_id: keep if _n == 1
gen start = runiformint(1,10)
expand runiformint(1,10)
bysort phy_id hosp_id: gen long period = start + _n
drop start

isid period hosp_id phy_id, sort

save "statalist_test.dta", replace

* load the program
run "do_one_case.do"

runby do_one_case, by(period hosp_id phy_id) status

and the results on my computer:

Code:

. runby do_one_case, by(period hosp_id phy_id) status

  elapsed ----------- by-groups ----------    ------- observations ------       time
     time      count     errors    no-data        processed         saved  remaining
------------------------------------------------------------------------------------
 00:00:01        163          0         53              163         1,073   00:00:06
 00:00:02        290          0         54              290         4,190   00:00:06
 00:00:03        404          0         54              404         8,443   00:00:06
 00:00:04        506          0         54              506        13,261   00:00:06
(now reporting every 5 seconds)
 00:00:09        925          0         54              925        41,092   00:00:03
 00:00:12      1,212          0         72            1,212        57,183   00:00:00

--------------------------------------
Number of by-groups    =         1,212
by-groups with errors  =             0
by-groups with no data =            72
Observations processed =         1,212
Observations saved     =        57,183
--------------------------------------

.

Since this requires a recursive window (from the earliest period up to the current period), you may want to create subsets of the initial data, each covering data up to the current period. You could then run the same setup as above using only data up to that period. Also, if you are only looking for past colleagues in the same city, break-up the data by city and use the approach on the smaller city datasets.

Last edited by Robert Picard; 01 Feb 2019, 21:09.

Comment

Mike Lacy

Join Date: Apr 2014

Posts: 2404
#9

02 Feb 2019, 09:58

I just thought of a different question about the definition of the problem: What level of "indirectness" do you want? Clearly, if A worked with B, and B worked with C, you want B and C linked to A.
But what if A worked with B worked with C worked with D? Do you want A linked with D? (and so forth .... A ... Z, ....)
1 like
Comment

Robert Picard

Join Date: Mar 2014
Posts: 1536

#10

02 Feb 2019, 10:11

l was losing steam last night but now rested, I'll show how to adapt the solution so that you can run this in parts. Assume that the following is a (small) analog for the full data:

Code:

version 14
clear all
set seed 31231

set obs 100
gen phy_id = "dr" + string(_n, "%04.0f")

expand runiform(1,4)
bysort phy_id: gen hosp_id = "hosp" + string(runiformint(1,10), "%03.0f")

bysort phy_id hosp_id: keep if _n == 1
gen start = runiformint(1,10)
expand runiformint(1,10)
bysort phy_id hosp_id: gen long period = start + _n
drop start

isid period hosp_id phy_id, sort

save "statalist_fulldata.dta", replace

Then the solution to process the whole thing in one pass amounts to (recycling what's in my previous post):

Code:

clear all

run "do_one_case.do"
use "statalist_fulldata.dta"

runby do_one_case, by(period hosp_id phy_id) status

isid hosp_id period phy_id pastcoll, sort
save "results_all.dta", replace

To do the same in parts, using only the relevant data would look like:

Code:

clear all

run "do_one_case.do"
use "statalist_fulldata.dta"

sum period
local firstyr = r(min) + 1
local lastyr = r(max)

clear
save "results_p`firstyr'_p`lastyr'.dta", replace emptyok

forvalues y = `firstyr'/`lastyr' {
    dis "doing " `y'
    * save data from first year up to current year
    use if period <= `y' using "statalist_fulldata.dta"
    save "statalist_test.dta", replace
    
    keep if period == `y'
    runby do_one_case, by(period hosp_id phy_id) status
    append using "results_p`firstyr'_p`lastyr'.dta"
    save "results_p`firstyr'_p`lastyr'.dta", replace
    
}

isid hosp_id period phy_id pastcoll, sort
cf _all using "results_all.dta", all

The last line confirms that the results match those produced in one pass.

I suspect that either way, the process will be quite slow for later periods as the window from the earliest to the current period becomes large. I suggest that you try the above but manually specify the range and target just one or a few periods, something like:

Code:

local firstyr = 18
local lastyr = 19

and that would create a file called "results_p18_p19.dta". You can always append results from separate runs later. Better yet, if you can restrict how far back you are willing to look for past colleagues, say 10 periods prior, change

Code:

use if period <= `y' using "statalist_fulldata.dta"

for something like

Code:

local back = `y' - 10
use if inrange(period, `back', `y') using "statalist_fulldata.dta"

Comment

Romalpa Akzo

Join Date: Oct 2017

Posts: 369
#11

03 Feb 2019, 17:59

The puzzle is quite interesting. And the solutions of Robert and Mike are very inspirational. And I jump in now, since for me, a clarification from Paula for what she needs might be necessary.

Specifically, Paula’s target, as stated in #1, is

to identify, for each physician at a given hospital in a given time, all physicians that have worked with their current colleagues in different hospitals in the past.

Actually, this description, with its “strict” meaning, could be considered as fairly clear to establish a solvable puzzle. But in considering the general concept of “indirect relation”, I am still feeling a small chance that it might be not exactly what is expected. My suspicion lies in the 2 bold words in the above quotation (“current” and “different”). And below example is given to illustrate my point.

Code:

+------------------------------------------+ | period | phy_id | hosp_id | ind1 | ind2 | |--------+--------+---------+------+-------| | 1 | A | UU | | | | 1 | B | VV | | | | 1 | C | VV | | | | 1 | D | XX | | | | 1 | E | XX | | | |--------+--------+---------+------+-------| | 2 | A | YY | C | C | | 2 | B | YY | | | | 2 | D | ZZ | | | | 2 | F | ZZ | | | |--------+--------+---------+------+-------| | 3 | A | ZZ | E | E-F-C | | 3 | D | ZZ | | | +------------------------------------------+

In this example, in period 3, physician A has only (direct) “current” colleague D, who has had direct connection with E (hospital XX, period 1) and F (hospital ZZ, period 2). Previously, in period 2, A is a colleague with B (firm YY), who has direct relation with C (firm VV) in period 1. Then in period 3, for A, it seems clearly that E is listed as indirect relation. But C, who has had direct relation with A through B (period 2), has been “forgotten” since B is no longer "current" colleague of A in period 3. Meanwhile, F, although being direct colleague with D, but not in "different" hospital (same hospital ZZ), is also unlisted. (Please compare ind1 and ind2 variable).

The necessary clarification is: ind1 or ind2, which one is exactly the output that Paula expects? In other word, whether “current” here means “within the given period only” (as its strict meaning) or might means “up to the given period” (or “so far”). Similarly, whether "different" explicitly point to “must be not the same (current) firm”, or just means "not having the direct relation with" (similar to what is concerned by Mike in #9).

Then Paula, please kindly clarify the situation. Upon that, an appropriate code might be provided to solve the puzzle effectively.
3 likes
Comment
Mike Lacy

Join Date: Apr 2014

Posts: 2404
#12

03 Feb 2019, 20:44

Completely in accord with Romalpa here the continued need for clarification. I have a suggestion that might help: I've come to realize (?) that "indirect relationship" is another way of saying "ordered triplet." So, maybe it would help if you (Paula) could give us a definition of the problems in terms of the properties of the members of an ordered triplet, i.e., "I want to find all triplets (Person1, Person2, Person3) such that P1 is from such and such a defined set of persons, P2 has such and such a relation to P1, and P3 has ... relation to P.

I've been thinking about this problem, and there may be a relatively easy and fast approach involving -joinby- and -merge-, but I became less sure of the problem definition.
1 like
Comment
Mike Lacy

Join Date: Apr 2014

Posts: 2404
#13

04 Feb 2019, 07:44

Sorry, should read:"... and P3 has ... relation to P2."
Comment

Paula de Souza Leao Spinola

Join Date: Jun 2015
Posts: 384

#14

04 Feb 2019, 07:46

Hi all, sorry for not being very clear. What I want is variable ind1 in Romalpa's example. By current I mean "within the given period only" and by different I mean not the same current hospital. The way I think of it in the example above is: for physician A (ego physician) working together with physician D (alter physician) at time 3 in hospital ZZ, I would like to know all physicians who have worked with physician D in the past (at times 1 or 2) in different hospitals (UU, VV, XX or YY) who have never worked together in the same hospital with physician A (at any period, including periods after period 3 in case they existed). Therefore, when we see physician A as ego and D as alter, the alter-alter physician would be only physician E. Additionally, I would like to see this same pairwise combination between A and D where D is ego and A is alter. In this case the alter-alter physician would be those who worked with A in the past in different hospitals and have never worked with D. In the example above, we would have physician B as the alter-alter physician.

To answer Mike in #9, I would like a first order level of indirectness. If A worked with B who worked with C who worked with D. I would like A only linked with C.

As Mike suggested, I replaced -cross- by -joinby- and the code got a lot more efficient! The code below gives me exactly the output I am looking for: for which pair of physicians working at the same time in the same hospital (ego and alter physicians) it informs me the alter-alter physicians (those who have never worked with the ego physician and have worked with the alter physician sometime in the past in different hospitals) and in which period and hospital the latter worked with the alter physician. Running the code with Romalpa's example:

Code:

clear
input period str1 phy_id str2 hosp
1 "A" "UU"
1 "B" "VV"
1 "C" "VV"
1 "D" "XX"
1 "E" "XX"
2 "A" "YY"
2 "B" "YY"
2 "D" "ZZ"
2 "F" "ZZ"
3 "A" "ZZ"
3 "D" "ZZ"
end

* File with egos (sender)
rename phy_id ego_id
save egos, replace

* File with alters (receiver)
rename ego_id alter_id
save alters, replace

* File with pairwise combinations of ego/alter working in same hospital at same time
use egos, clear
joinby hosp period using alters.dta
drop if ego_id == alter_id // no self pairs
save colleagues.dta, replace

* Adding colleagues of colleagues in different hospitals sometime in the past
use colleagues.dta, clear
rename (hosp period alter_id ego_id) (a_hosp a_period a_alter_id alter_id)
joinby alter_id using colleagues.dta
drop if ego_id == a_alter_id //droping when the colleague of the physician's colleague is the physician himself
drop if hosp == a_hosp //only colleagues of colleagues in different hospitals
drop if period <= a_period //only past colleagues of colleagues
order hosp period ego_id alter_id
save colleagues_iv_temp.dta, replace

* Keeping only colleagues of the physician's colleagues who have never worked with him
//reshaping colleagues at any period - hospital
use colleagues.dta, clear
keep ego_id alter_id
duplicates drop
bysort ego_id: gen byte seq = _n
reshape wide alter_id, i(ego_id) j(seq)
save colleagues_reshape.dta, replace

//droping colleagues of the physician's colleagues who have worked with him
forvalues i = 1/3 { /*how to find out automatically the number of variables generated in the last step?*/
use colleagues_iv_temp, clear
merge m:1 ego_id using colleagues_reshape, keep(master match) keepusing(alter_id`i') nogenerate
drop if a_alter_id == alter_id`i'
drop alter_id`i'
save colleagues_iv_temp.dta, replace
}
sort period hosp ego_id

The output is

Code:

. list

     +------------------------------------------------------------------+
     | hosp   period   ego_id   alter_id   a_period   a_hosp   a_alte~d |
     |------------------------------------------------------------------|
  1. |   YY        2        A          B          1       VV          C |
  2. |   ZZ        2        F          D          1       XX          E |
  3. |   ZZ        3        A          D          1       XX          E |
  4. |   ZZ        3        D          A          2       YY          B |
     +------------------------------------------------------------------+

However, I am still having problems when running the code above in my real data. Specifically, Stata does not allow me to run the second -jointby-. Please let me know in case you have any suggestions. I will have a careful look on the code proposed by Robert.

Code:

. * File with egos (sender)
. rename phy_id ego_id

. save egos, replace
file egos.dta saved

.
. * File with alters (receiver)
. rename ego_id alter_id

. save alters, replace
file alters.dta saved

.
. * File with pairwise combinations of ego/alter working in same hospital at same time
. use egos, clear

. joinby hosp period using alters.dta

. drop if ego_id == alter_id // no self pairs
(805,594 observations deleted)

. save colleagues.dta, replace
file colleagues.dta saved

.
. * Adding colleagues of colleagues in different hospitals sometime in the past
. use colleagues.dta, clear

. rename (hosp period alter_id ego_id) (a_hosp a_period a_alter_id alter_id)

. joinby alter_id using colleagues.dta //still not running for the whole dataset
sum of expand values exceed 2,147,483,620
    The dataset may not contain more than 2,147,483,620 observations.
r(459);

end of do-file

r(459);

.

Last edited by Paula de Souza Leao Spinola; 04 Feb 2019, 07:55.

Comment

Mike Lacy

Join Date: Apr 2014
Posts: 2404

#15

04 Feb 2019, 11:15

Paula, I think you can get past this one last issue by doing the last -joinby- on subsets ("batches") of the colleagues file: You can take a batch of the colleagues (ego/alter pairs), -joinby- to get the past matches, do the next batch, etc.? If the logic is ok, you just need to make the batches small enough to not exceed the -expand- limits.

Code:

* Adding colleagues of colleagues in different hospitals sometime in the past
use colleagues.dta, clear
rename (hosp period alter_id ego_id) (a_hosp a_period a_alter_id alter_id)
// New stuff starts
// 10 batches; maybe more (smaller subsets) would be needed.
gen byte batch = 1 + mod(_N,10)
preserve
forval i = 1/10 {
  keep if (batch == `i')
  joinby alter_id using colleagues.dta
  save batch`i'
  restore
}
clear
save whole.dta, emptyok
forval i = 1/10 {
  append using batch`i'
}

Last edited by Mike Lacy; 04 Feb 2019, 11:20.

Announcement