Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • identifying indirect relationships

    I have a dataset with physicians id (variable named phy_id) working in different hospitals (variable named hosp_id) in different years/months. I say that physicians are colleagues when they work together in the same hospital at the same time. Physicians may work in different hospitals at the same time and/or over time.

    I would like to identify, for each physician at a given hospital in a given time, all physicians that have worked with their current colleagues in different hospitals in the past (but not with them).

    Code:
    clear
    input str1 hosp_id period str1 phy_id
    "i" 1 "A"
    "i" 1 "B"
    "i" 1 "C"
    "i" 1 "D"
    "i" 2 "A"
    "i" 2 "B"
    "i" 2 "C"
    "i" 2 "D"
    "j" 1 "B"
    "j" 1 "C"
    "j" 1 "E"
    "j" 2 "B"
    "j" 2 "C"
    "j" 2 "E"
    "k" 1 "B"
    "k" 1 "F"
    "k" 2 "G"
    end
    
    label var phy_id "physician id"
    label var hosp_id "hospital id"
    Consider the sample above. I have three hospitals (i, j, k), 2 periods (1, 2) and 7 physicians (A, B, C, D, E, F, G). In hospital i at periods 1 and 2 physicians A, B, C, D are colleagues. In hospital j at periods 1 and 2 physicians B, C and E are colleagues. In hospital k at period 1 physicians B and F are colleagues and at period 2 physician G works alone. Hence, I would like to identify for physician A at time 2 in hospital i, physicians E (once they worked with physicians B and C at time 1 in hospital j) and F (once he worked with physicians B at time 1 in hospital k).

    Any ideas on how I can code this?

    Many thanks
    Paula

  • #2
    OK, I think I understand you and have something that might be a solution. I'm going to use ego/alter terminology because that helps me think.
    An "ego" is any one of your physicians. An "alter" is anyone who works with a physician in the same hospital. With that language, I think you want: "For each ego, get a list of all the *past* alters of ego's *current* alters."

    Here's what I have, which I have not empirically checked, but which I hope is in the right direction. Someone more clever than me will likely have a better way to think about and do this.
    I've just done the problem for the *current* egos. And, note that your data set makes it easier by having only two periods.
    Code:
    clear
    input str1 hosp_id period str1 phy_id
    "i" 1 "A"
    "i" 1 "B"
    "i" 1 "C"
    "i" 1 "D"
    "i" 2 "A"
    "i" 2 "B"
    "i" 2 "C"
    "i" 2 "D"
    "j" 1 "B"
    "j" 1 "C"
    "j" 1 "E"
    "j" 2 "B"
    "j" 2 "C"
    "j" 2 "E"
    "k" 1 "B"
    "k" 1 "F"
    "k" 2 "G"
    end
    label var phy_id "physician id"
    label var hosp_id "hospital id"
    rename phy_id ego_id      // ego/alter helps me think
    rename hosp_id ego_hosp
    rename period ego_period
    tempfile ego
    save `ego'
    // Make all possible pairs of egos and alters
    rename ego* alter*
    tempfile temp
    save `temp'
    //
    use `ego', clear
    cross using `temp' // cross is the key
    drop if (ego_id >= alter_id) // no self pairs or reverse order pairs
    tempfile allpairs
    save `allpairs'
    //
    // File of ego/alter pairs in same *current* hospital
    keep if (ego_period ==2) & (alter_period ==2) & (ego_hosp == alter_hosp) // current
    keep ego_id alter_id
    gen byte period = 2
    order ego_id alter_id period
    tempfile current
    save `current'  
    list
    //
    // File of ego/alter pairs in same past hospital
    use `allpairs'
    keep if (ego_period ==1) & (alter_period ==2) & (ego_hosp == alter_hosp) // past
    keep ego_id alter_id
    gen byte period = 1
    keep ego_id alter_id period
    // But these are alters paired with alter's alters.
    rename alter_id alter_alter_id
    rename ego_id alter_id // egos
    // We want a list of all alter_alters for each alter, so must go wide
    bysort alter_id: gen byte seq = _n
    reshape wide alter_alter_id, i(alter_id) j(seq)
    keep alter* // only relevant variables
    tempfile past
    save `past'  
    list
    // Now for each current ego, merge the list of the past alters of each of ego's alters
    use `current', clear
    merge m:1 alter_id using `past'

    Again, I would not be surprised if -nwcommands- had a five line solution to this.





    Comment


    • #3
      Whoops, I found one mistake. The *past* pairs should both be *period == 1.
      Code:
      // File of ego/alter pairs in same past hospital
      use `allpairs'
      keep if (ego_period ==1) & (alter_period ==1) & (ego_hosp == alter_hosp) // past
                                ^^^^^^^^^^^^^^^^^^^^^

      Comment


      • #4
        Many thanks for your kind response, Mike! And sorry for the delay in getting back to you - I was working on this code.

        I used the reasoning you suggested above and wrote the code below which gives me exactly what I want: for each pairwise combination of physicians at a given time in a given hospital I identify the physicians with whom the alter physician has worked in the past in other hospitals (excluding those who have ever worked with the ego physician).

        Code:
        clear
        input str1 hosp_id period str1 phy_id
        "i" 1 "A"
        "i" 1 "B"
        "i" 1 "C"
        "i" 1 "D"
        "i" 2 "A"
        "i" 2 "B"
        "i" 2 "C"
        "i" 2 "D"
        "j" 1 "B"
        "j" 1 "C"
        "j" 1 "E"
        "j" 2 "B"
        "j" 2 "C"
        "j" 2 "E"
        "k" 1 "B"
        "k" 1 "F"
        "k" 2 "G" /*isolate nodes do not show up*/
        end
        
        * File with egos (sender)
        rename phy_id ego_id
        rename hosp_id ego_hosp
        rename period ego_period
        save egos, replace
        
        * File with alters (receiver)
        rename ego* alter*
        save alters, replace
        
        * File with all possible pairs of ego/alter working in same hospital at same time
        use egos, clear
        cross using alters.dta // cross forms every pairwise combination between 2 datasets
        drop if (ego_id == alter_id) // no self pairs
        keep if ego_period == alter_period & ego_hosp == alter_hosp // same hospital at same time
        rename (ego_period ego_hosp) (period hosp)
        drop alter_period alter_hosp
        save colleagues.dta, replace
        
        * Reshaping colleagues at any period - hospital
        use colleagues.dta, clear
        keep ego_id alter_id
        duplicates drop
        bysort ego_id: gen byte seq = _n
        reshape wide alter_id, i(ego_id) j(seq)
        save colleagues_reshape.dta, replace
        
        * Adding colleagues of colleagues in different hospitals sometime in the past
        use colleagues.dta, clear
        rename (hosp period ego_id alter_id) (a_hosp a_period a_ego_id a_alter_id)
        cross using colleagues.dta
        order hosp period ego_id alter_id
        keep if alter_id==a_ego_id //only colleagues of the physician's colleague
        drop a_ego_id
        drop if ego_id == a_alter_id //droping when the colleague of the physician's colleague is the physician himself
        drop if hosp == a_hosp //only colleagues of colleagues in different hospitals
        drop if period <= a_period //only past colleagues of colleagues
        sort hosp period ego_id alter_id a_hosp a_period a_alter_id
        
        // droping colleagues of the physician's colleagues who have never worked with him
        merge m:1 ego_id using colleagues_reshape, keep(master match) nogenerate 
        drop if a_alter_id == alter_id1 | a_alter_id == alter_id2 | a_alter_id == alter_id3 | a_alter_id == alter_id4 | a_alter_id == alter_id5
        drop alter_id1 alter_id2 alter_id3 alter_id4 alter_id5
        My issue is that when I try to apply this code to my real data (804,209 obs) for which I have 120 different periods, 4,330 different hospitals and 78,610 different physicians, Stata does not allow me to proceed (see below). I do drop a lot of the observations generated by -cross- as you can see in the code above (e.g. pairwise combination of physicians in different hospital) and hence there would be no need to generate such combination. However the command -cross- does not allow me to specify which combinations to avoid. Would you have any idea of how I could change this code to make it feasible for my dataset?

        Code:
        . * File with all possible pairs of ego/alter working in same hospital at same time
        . use egos, clear
        
        . cross using alters.dta // cross forms every pairwise combination between 2 datasets
        sum of expand values exceed 2,147,483,620
            The dataset may not contain more than 2,147,483,620 observations.
        r(459);
        
        end of do-file
        
        r(459);
        Many thanks!

        Comment


        • #5
          This is an interesting problem, particularly at the scale that you are working. My crude approach can't work here. I have a couple of thoughts:
          1) Can you do this in pieces? That is, can you start with the "egos" as just the physicians in period T, finding all their past alters. Then, do it for physicians in period T-1, etc.?
          If there are, say, 78000/120 = 650 physicians in period T, and you cross them with the approx. 77000 physicians from previous periods, that would make the file from -cross- have size of only about _N = 50e6, which might work?

          2) Ask Stata tech support for help. This is a problem that would really stretch Stata, and therefore should interest them as a way to demonstrate its capacities.

          3) Go to https://nwcommands.wordpress.com/, and try to get help. Again, your problem is difficult enough, I think, to get very thoughtful people interested. There may be some completely different algorithmic approach known to people who do social network analysis.

          Please keep following up with this on the list here.

          Comment


          • #6
            Hi Mike, thanks again!! Sure, I will keep you informed.

            Regarding your points:

            1) 78,610 is the number of unique ids of physicians (I get it from running - codebook-). The same physicians show up over time. Instead, I tried to run the code for physicians working in one city. Still, Stata did not allow me to run the second - cross - command.

            2) Will do that now!

            3) I tried to contact Thomas Grund but haven't heard back from him. Also, the website and the related google group have not been updated since 2016. Would you know why? Would you have any contact I could reach out to?

            Many thanks
            Paula

            Comment


            • #7
              Hmmm. When I last sent a question about -nwcommands- a few years ago, Grund responded in a few days. I see there is a forthcoming StataPress book on this
              https://www.stata.com/gsearch.php?q=...td&restrict=SP
              for what it's worth.

              One thought: What about doing -joinby- to form all pairs at the same hospital at the same period, and then trimming down from there?


              And: A good full size example would help here. I'll offer the following as a simulated full-sized data set, perhaps with some bad assumptions. I can't follow up on this today, but perhaps having an example will stimulate some other people to jump in.

              Code:
              set seed 4754
              clear
              local Nphysicians = 78000
              local Nperiod = 120
              local Nhospital = 4000
              local NHosp_per_doc = 4 // assumption
              local Nperiod_per_doc = 8
              set obs `Nphysicians'
              gen long ego_id = _n
              gen int np = ceil(runiform() * `Nperiod_per_doc')
              gen byte nh = ceil(runiform() * `NHosp_per_doc')
              expand np * nh
              gen int ego_period = ceil(runiform() * `Nperiod')
              gen int ego_hosp = ceil(runiform() * `Nhospital')

              Comment


              • #8
                I'm not familiar with nwcommands or network analysis so I don't know if you could get to a solution that way.

                I'm not seeing any shortcuts that would reduce the scale of what needs to be done but I think you can leverage the fact that each observation has a solution that is relatively easy to get to. If it is programmed correctly, you can then rinse and repeat for all remaining observations using runby (from SSC).

                For example, let's assume that the data in memory contains only the test case observation mentioned in #1:
                Code:
                     +---------------------------+
                     | hosp_id   period   phy_id |
                     |---------------------------|
                  1. |       i        2        A |
                     +---------------------------+
                and let's assume that the example dataset in #1 is saved as "statalist_test.dta". The following program will:
                • make a list of current colleagues (same hospital and period);
                • fetch all observations for these colleagues from the full dataset;
                • reduce to observations from past periods at different hospitals;
                • form all pairwise combinations with the full dataset within the same period and hospital to get all colleagues of current colleagues;
                • remove observations from current colleagues.
                I'm assuming that you only want one instance of a past colleague of a current colleague so I pick one case (rather arbitrarily). Copy the following program and save it in a file called "do_one_case.do" in Stata's current directory.
                Code:
                program do_one_case
                
                    local h = hosp_id[1]
                    local t = period[1]
                    local p = phy_id[1]
                
                    use if hosp_id == "`h'" & period == `t' & phy_id != "`p'" using "statalist_test.dta", clear
                    if _N == 0 exit
                    keep phy_id
                    isid phy_id, sort
                    dis _n "---------- colleagues at same hospital in same period -------------"
                    list
                    tempfile hold
                    save "`hold'"
                
                    merge 1:m phy_id using "statalist_test.dta", keep(match) nogen
                    dis _n "----------  past history for these colleagues at different hospitals -------------"
                    drop if period >= `t'
                    drop if hosp_id == "`h'"
                
                    if _N == 0 exit
                    isid phy_id hosp_id period, sort
                    gen long case = _n
                    list
                
                    rename phy_id curcoll
                    joinby period hosp_id using "statalist_test.dta"
                    if _N == 0 exit
                    dis _n "---------- colleagues at matching time and hospital -------------"
                    isid case phy_id, sort
                    list , sepby(case)
                
                    // remove obs from current colleagues
                    merge m:1 phy_id using "`hold'", keep(master) nogen
                    dis _n "---------- colleagues at matching time and hospital minus current colleagues -------------"
                    if _N == 0 exit
                    isid case phy_id, sort
                    list , sepby(case)
                
                    // adjust var names and reduce to one obs per past colleague
                    rename phy_id pastcoll
                    rename hosp_id pasthosp
                    rename period pastp
                    isid pastcoll curcoll pasthosp pastp, sort
                    by pastcoll: keep if _n == 1
                
                    // identify current case for each past colleague
                    gen hosp_id = "`h'"
                    gen period  = `t'
                    gen phy_id  = "`p'"
                    order hosp_id period phy_id
                
                end
                Copy the following in a do-file, also in the current directory:
                Code:
                version 14
                clear all
                input str1 hosp_id period str1 phy_id
                "i" 1 "A"
                "i" 1 "B"
                "i" 1 "C"
                "i" 1 "D"
                "i" 2 "A"
                "i" 2 "B"
                "i" 2 "C"
                "i" 2 "D"
                "j" 1 "B"
                "j" 1 "C"
                "j" 1 "E"
                "j" 2 "B"
                "j" 2 "C"
                "j" 2 "E"
                "k" 1 "B"
                "k" 1 "F"
                "k" 2 "G"
                end
                
                * make sure there are no duplicates!
                isid hosp_id period phy_id, sort
                save "statalist_test.dta", replace
                
                * load the program
                run "do_one_case.do"
                
                keep if phy_id == "A" & period == 2 & hosp_id == "i"
                list
                
                * run the program
                do_one_case
                
                * show results
                order hosp_id period phy_id
                list
                Here's the output if you run the above do-file:
                Code:
                . version 14
                
                . clear all
                
                . input str1 hosp_id period str1 phy_id
                
                       hosp_id     period     phy_id
                  1. "i" 1 "A"
                  2. "i" 1 "B"
                  3. "i" 1 "C"
                  4. "i" 1 "D"
                  5. "i" 2 "A"
                  6. "i" 2 "B"
                  7. "i" 2 "C"
                  8. "i" 2 "D"
                  9. "j" 1 "B"
                 10. "j" 1 "C"
                 11. "j" 1 "E"
                 12. "j" 2 "B"
                 13. "j" 2 "C"
                 14. "j" 2 "E"
                 15. "k" 1 "B"
                 16. "k" 1 "F"
                 17. "k" 2 "G"
                 18. end
                
                .
                . * make sure there are no duplicates!
                . isid hosp_id period phy_id, sort
                (data now sorted by hosp_id period phy_id)
                
                . save "statalist_test.dta", replace
                file statalist_test.dta saved
                
                .
                . * load the program
                . run "do_one_case.do"
                
                .
                . keep if phy_id == "A" & period == 2 & hosp_id == "i"
                (16 observations deleted)
                
                . list
                
                     +---------------------------+
                     | hosp_id   period   phy_id |
                     |---------------------------|
                  1. |       i        2        A |
                     +---------------------------+
                
                .
                . * run the program
                . do_one_case
                (data now sorted by phy_id)
                
                ---------- colleagues at same hospital in same period -------------
                
                     +--------+
                     | phy_id |
                     |--------|
                  1. |      B |
                  2. |      C |
                  3. |      D |
                     +--------+
                file /var/folders/cp/z8cssshn6935x9p181c71_7m0000gn/T//S_13346.000001 saved
                
                    Result                           # of obs.
                    -----------------------------------------
                    not matched                             0
                    matched                                11  
                    -----------------------------------------
                
                ----------  past history for these colleagues at different hospitals -------------
                (5 observations deleted)
                (3 observations deleted)
                (data now sorted by phy_id hosp_id period)
                
                     +----------------------------------+
                     | phy_id   hosp_id   period   case |
                     |----------------------------------|
                  1. |      B         j        1      1 |
                  2. |      B         k        1      2 |
                  3. |      C         j        1      3 |
                     +----------------------------------+
                
                ---------- colleagues at matching time and hospital -------------
                (data now sorted by case phy_id)
                
                     +--------------------------------------------+
                     | curcoll   hosp_id   period   case   phy_id |
                     |--------------------------------------------|
                  1. |       B         j        1      1        B |
                  2. |       B         j        1      1        C |
                  3. |       B         j        1      1        E |
                     |--------------------------------------------|
                  4. |       B         k        1      2        B |
                  5. |       B         k        1      2        F |
                     |--------------------------------------------|
                  6. |       C         j        1      3        B |
                  7. |       C         j        1      3        C |
                  8. |       C         j        1      3        E |
                     +--------------------------------------------+
                
                    Result                           # of obs.
                    -----------------------------------------
                    not matched                             3
                        from master                         3  
                        from using                          0  
                
                    matched                                 0  
                    -----------------------------------------
                
                ---------- colleagues at matching time and hospital minus current colleagues -------------
                (data now sorted by case phy_id)
                
                     +--------------------------------------------+
                     | curcoll   hosp_id   period   case   phy_id |
                     |--------------------------------------------|
                  1. |       B         j        1      1        E |
                     |--------------------------------------------|
                  2. |       B         k        1      2        F |
                     |--------------------------------------------|
                  3. |       C         j        1      3        E |
                     +--------------------------------------------+
                (data now sorted by pastcoll curcoll pasthosp pastp)
                (1 observation deleted)
                
                .
                . * show results
                . order hosp_id period phy_id
                
                . list
                
                     +--------------------------------------------------------------------------+
                     | hosp_id   period   phy_id   curcoll   pasthosp   pastp   case   pastcoll |
                     |--------------------------------------------------------------------------|
                  1. |       i        2        A         B          j       1      1          E |
                  2. |       i        2        A         B          k       1      2          F |
                     +--------------------------------------------------------------------------+
                
                .
                The do_one_case program puts out a lot of output to help track what is going on.

                Once you are sure that the program does what you want, you can use runby to call the do_one_case program for each observation in the data. With runby, what's left in memory after your program terminates is considered results and accumulates. If there's nothing in memory, then runby moves on to the next group. So a complete solution for the data example in #1 would be:
                Code:
                clear all
                use "statalist_test.dta"
                
                * confirm that obs are uniquely identified using ...
                isid hosp_id period phy_id, sort
                
                run "do_one_case.do"
                
                runby do_one_case, by(period hosp_id phy_id) status
                
                list
                and the results:
                Code:
                . runby do_one_case, by(period hosp_id phy_id) status
                
                  elapsed ----------- by-groups ----------    ------- observations ------       time
                     time      count     errors    no-data        processed         saved  remaining
                ------------------------------------------------------------------------------------
                 00:00:00         17          0         10               17            19   00:00:00
                
                --------------------------------------
                Number of by-groups    =            17
                by-groups with errors  =             0
                by-groups with no data =            10
                Observations processed =            17
                Observations saved     =            19
                --------------------------------------
                
                .
                . list
                
                     +--------------------------------------------------------------------------+
                     | hosp_id   period   phy_id   curcoll   pasthosp   pastp   case   pastcoll |
                     |--------------------------------------------------------------------------|
                  1. |       i        2        A         B          j       1      1          E |
                  2. |       i        2        A         B          k       1      2          F |
                  3. |       i        2        B         C          j       1      1          B |
                  4. |       i        2        B         C          j       1      1          E |
                  5. |       i        2        C         B          j       1      1          C |
                     |--------------------------------------------------------------------------|
                  6. |       i        2        C         B          j       1      1          E |
                  7. |       i        2        C         B          k       1      2          F |
                  8. |       i        2        D         B          j       1      1          E |
                  9. |       i        2        D         B          k       1      2          F |
                 10. |       j        2        B         C          i       1      1          A |
                     |--------------------------------------------------------------------------|
                 11. |       j        2        B         C          i       1      1          B |
                 12. |       j        2        B         C          i       1      1          D |
                 13. |       j        2        C         B          i       1      1          A |
                 14. |       j        2        C         B          i       1      1          C |
                 15. |       j        2        C         B          i       1      1          D |
                     |--------------------------------------------------------------------------|
                 16. |       j        2        C         B          k       1      2          F |
                 17. |       j        2        E         B          i       1      1          A |
                 18. |       j        2        E         B          i       1      1          D |
                 19. |       j        2        E         B          k       1      2          F |
                     +--------------------------------------------------------------------------+
                
                .
                Note that all the output generated by the do_one_case program is suppressed by default.

                I'm not sure how well this will scale on the full problem but I would try to first sort the data by period and try it only on a few initial periods. Here's a small scale test on random data:
                Code:
                version 14
                clear all
                set seed 31231
                
                set obs 100
                gen phy_id = "dr" + string(_n, "%04.0f")
                
                expand runiform(1,4)
                bysort phy_id: gen hosp_id = "hosp" + string(runiformint(1,10), "%03.0f")
                
                bysort phy_id hosp_id: keep if _n == 1
                gen start = runiformint(1,10)
                expand runiformint(1,10)
                bysort phy_id hosp_id: gen long period = start + _n
                drop start
                
                isid period hosp_id phy_id, sort
                
                save "statalist_test.dta", replace
                
                * load the program
                run "do_one_case.do"
                
                runby do_one_case, by(period hosp_id phy_id) status
                and the results on my computer:
                Code:
                . runby do_one_case, by(period hosp_id phy_id) status
                
                  elapsed ----------- by-groups ----------    ------- observations ------       time
                     time      count     errors    no-data        processed         saved  remaining
                ------------------------------------------------------------------------------------
                 00:00:01        163          0         53              163         1,073   00:00:06
                 00:00:02        290          0         54              290         4,190   00:00:06
                 00:00:03        404          0         54              404         8,443   00:00:06
                 00:00:04        506          0         54              506        13,261   00:00:06
                (now reporting every 5 seconds)
                 00:00:09        925          0         54              925        41,092   00:00:03
                 00:00:12      1,212          0         72            1,212        57,183   00:00:00
                
                --------------------------------------
                Number of by-groups    =         1,212
                by-groups with errors  =             0
                by-groups with no data =            72
                Observations processed =         1,212
                Observations saved     =        57,183
                --------------------------------------
                
                .
                Since this requires a recursive window (from the earliest period up to the current period), you may want to create subsets of the initial data, each covering data up to the current period. You could then run the same setup as above using only data up to that period. Also, if you are only looking for past colleagues in the same city, break-up the data by city and use the approach on the smaller city datasets.
                Last edited by Robert Picard; 01 Feb 2019, 21:09.

                Comment


                • #9
                  I just thought of a different question about the definition of the problem: What level of "indirectness" do you want? Clearly, if A worked with B, and B worked with C, you want B and C linked to A.
                  But what if A worked with B worked with C worked with D? Do you want A linked with D? (and so forth .... A ... Z, ....)

                  Comment


                  • #10
                    l was losing steam last night but now rested, I'll show how to adapt the solution so that you can run this in parts. Assume that the following is a (small) analog for the full data:
                    Code:
                    version 14
                    clear all
                    set seed 31231
                    
                    set obs 100
                    gen phy_id = "dr" + string(_n, "%04.0f")
                    
                    expand runiform(1,4)
                    bysort phy_id: gen hosp_id = "hosp" + string(runiformint(1,10), "%03.0f")
                    
                    bysort phy_id hosp_id: keep if _n == 1
                    gen start = runiformint(1,10)
                    expand runiformint(1,10)
                    bysort phy_id hosp_id: gen long period = start + _n
                    drop start
                    
                    isid period hosp_id phy_id, sort
                    
                    save "statalist_fulldata.dta", replace
                    Then the solution to process the whole thing in one pass amounts to (recycling what's in my previous post):
                    Code:
                    clear all
                    
                    run "do_one_case.do"
                    use "statalist_fulldata.dta"
                    
                    runby do_one_case, by(period hosp_id phy_id) status
                    
                    isid hosp_id period phy_id pastcoll, sort
                    save "results_all.dta", replace
                    To do the same in parts, using only the relevant data would look like:
                    Code:
                    clear all
                    
                    run "do_one_case.do"
                    use "statalist_fulldata.dta"
                    
                    sum period
                    local firstyr = r(min) + 1
                    local lastyr = r(max)
                    
                    clear
                    save "results_p`firstyr'_p`lastyr'.dta", replace emptyok
                    
                    forvalues y = `firstyr'/`lastyr' {
                        dis "doing " `y'
                        * save data from first year up to current year
                        use if period <= `y' using "statalist_fulldata.dta"
                        save "statalist_test.dta", replace
                        
                        keep if period == `y'
                        runby do_one_case, by(period hosp_id phy_id) status
                        append using "results_p`firstyr'_p`lastyr'.dta"
                        save "results_p`firstyr'_p`lastyr'.dta", replace
                        
                    }
                    
                    isid hosp_id period phy_id pastcoll, sort
                    cf _all using "results_all.dta", all
                    The last line confirms that the results match those produced in one pass.

                    I suspect that either way, the process will be quite slow for later periods as the window from the earliest to the current period becomes large. I suggest that you try the above but manually specify the range and target just one or a few periods, something like:
                    Code:
                    local firstyr = 18
                    local lastyr = 19
                    and that would create a file called "results_p18_p19.dta". You can always append results from separate runs later. Better yet, if you can restrict how far back you are willing to look for past colleagues, say 10 periods prior, change
                    Code:
                    use if period <= `y' using "statalist_fulldata.dta"
                    for something like
                    Code:
                    local back = `y' - 10
                    use if inrange(period, `back', `y') using "statalist_fulldata.dta"

                    Comment


                    • #11
                      The puzzle is quite interesting. And the solutions of Robert and Mike are very inspirational. And I jump in now, since for me, a clarification from Paula for what she needs might be necessary.

                      Specifically, Paula’s target, as stated in #1, is
                      to identify, for each physician at a given hospital in a given time, all physicians that have worked with their current colleagues in different hospitals in the past.
                      Actually, this description, with its “strict” meaning, could be considered as fairly clear to establish a solvable puzzle. But in considering the general concept of “indirect relation”, I am still feeling a small chance that it might be not exactly what is expected. My suspicion lies in the 2 bold words in the above quotation (“current” and “different”). And below example is given to illustrate my point.
                      Code:
                        +------------------------------------------+
                        | period | phy_id | hosp_id | ind1 |  ind2 |
                        |--------+--------+---------+------+-------|
                        |      1 |      A |      UU |      |       |
                        |      1 |      B |      VV |      |       |
                        |      1 |      C |      VV |      |       |
                        |      1 |      D |      XX |      |       |
                        |      1 |      E |      XX |      |       |
                        |--------+--------+---------+------+-------|
                        |      2 |      A |      YY |    C |     C |
                        |      2 |      B |      YY |      |       |
                        |      2 |      D |      ZZ |      |       |
                        |      2 |      F |      ZZ |      |       |
                        |--------+--------+---------+------+-------|
                        |      3 |      A |      ZZ |    E | E-F-C |
                        |      3 |      D |      ZZ |      |       |
                        +------------------------------------------+
                      In this example, in period 3, physician A has only (direct) “current” colleague D, who has had direct connection with E (hospital XX, period 1) and F (hospital ZZ, period 2). Previously, in period 2, A is a colleague with B (firm YY), who has direct relation with C (firm VV) in period 1. Then in period 3, for A, it seems clearly that E is listed as indirect relation. But C, who has had direct relation with A through B (period 2), has been “forgotten” since B is no longer "current" colleague of A in period 3. Meanwhile, F, although being direct colleague with D, but not in "different" hospital (same hospital ZZ), is also unlisted. (Please compare ind1 and ind2 variable).

                      The necessary clarification is: ind1 or ind2, which one is exactly the output that Paula expects? In other word, whether “current” here means “within the given period only” (as its strict meaning) or might means “up to the given period” (or “so far”). Similarly, whether "different" explicitly point to “must be not the same (current) firm”, or just means "not having the direct relation with" (similar to what is concerned by Mike in #9).

                      Then Paula, please kindly clarify the situation. Upon that, an appropriate code might be provided to solve the puzzle effectively.

                      Comment


                      • #12
                        Completely in accord with Romalpa here the continued need for clarification. I have a suggestion that might help: I've come to realize (?) that "indirect relationship" is another way of saying "ordered triplet." So, maybe it would help if you (Paula) could give us a definition of the problems in terms of the properties of the members of an ordered triplet, i.e., "I want to find all triplets (Person1, Person2, Person3) such that P1 is from such and such a defined set of persons, P2 has such and such a relation to P1, and P3 has ... relation to P.

                        I've been thinking about this problem, and there may be a relatively easy and fast approach involving -joinby- and -merge-, but I became less sure of the problem definition.

                        Comment


                        • #13
                          Sorry, should read:"... and P3 has ... relation to P2."

                          Comment


                          • #14
                            Hi all, sorry for not being very clear. What I want is variable ind1 in Romalpa's example. By current I mean "within the given period only" and by different I mean not the same current hospital. The way I think of it in the example above is: for physician A (ego physician) working together with physician D (alter physician) at time 3 in hospital ZZ, I would like to know all physicians who have worked with physician D in the past (at times 1 or 2) in different hospitals (UU, VV, XX or YY) who have never worked together in the same hospital with physician A (at any period, including periods after period 3 in case they existed). Therefore, when we see physician A as ego and D as alter, the alter-alter physician would be only physician E. Additionally, I would like to see this same pairwise combination between A and D where D is ego and A is alter. In this case the alter-alter physician would be those who worked with A in the past in different hospitals and have never worked with D. In the example above, we would have physician B as the alter-alter physician.

                            To answer Mike in #9, I would like a first order level of indirectness. If A worked with B who worked with C who worked with D. I would like A only linked with C.

                            As Mike suggested, I replaced -cross- by -joinby- and the code got a lot more efficient! The code below gives me exactly the output I am looking for: for which pair of physicians working at the same time in the same hospital (ego and alter physicians) it informs me the alter-alter physicians (those who have never worked with the ego physician and have worked with the alter physician sometime in the past in different hospitals) and in which period and hospital the latter worked with the alter physician. Running the code with Romalpa's example:

                            Code:
                            clear
                            input period str1 phy_id str2 hosp
                            1 "A" "UU"
                            1 "B" "VV"
                            1 "C" "VV"
                            1 "D" "XX"
                            1 "E" "XX"
                            2 "A" "YY"
                            2 "B" "YY"
                            2 "D" "ZZ"
                            2 "F" "ZZ"
                            3 "A" "ZZ"
                            3 "D" "ZZ"
                            end
                            
                            * File with egos (sender)
                            rename phy_id ego_id
                            save egos, replace
                            
                            * File with alters (receiver)
                            rename ego_id alter_id
                            save alters, replace
                            
                            * File with pairwise combinations of ego/alter working in same hospital at same time
                            use egos, clear
                            joinby hosp period using alters.dta
                            drop if ego_id == alter_id // no self pairs
                            save colleagues.dta, replace
                            
                            * Adding colleagues of colleagues in different hospitals sometime in the past
                            use colleagues.dta, clear
                            rename (hosp period alter_id ego_id) (a_hosp a_period a_alter_id alter_id)
                            joinby alter_id using colleagues.dta
                            drop if ego_id == a_alter_id //droping when the colleague of the physician's colleague is the physician himself
                            drop if hosp == a_hosp //only colleagues of colleagues in different hospitals
                            drop if period <= a_period //only past colleagues of colleagues
                            order hosp period ego_id alter_id
                            save colleagues_iv_temp.dta, replace
                            
                            * Keeping only colleagues of the physician's colleagues who have never worked with him
                            //reshaping colleagues at any period - hospital
                            use colleagues.dta, clear
                            keep ego_id alter_id
                            duplicates drop
                            bysort ego_id: gen byte seq = _n
                            reshape wide alter_id, i(ego_id) j(seq)
                            save colleagues_reshape.dta, replace
                            
                            //droping colleagues of the physician's colleagues who have worked with him
                            forvalues i = 1/3 { /*how to find out automatically the number of variables generated in the last step?*/
                            use colleagues_iv_temp, clear
                            merge m:1 ego_id using colleagues_reshape, keep(master match) keepusing(alter_id`i') nogenerate
                            drop if a_alter_id == alter_id`i'
                            drop alter_id`i'
                            save colleagues_iv_temp.dta, replace
                            }
                            sort period hosp ego_id
                            The output is
                            Code:
                            . list
                            
                                 +------------------------------------------------------------------+
                                 | hosp   period   ego_id   alter_id   a_period   a_hosp   a_alte~d |
                                 |------------------------------------------------------------------|
                              1. |   YY        2        A          B          1       VV          C |
                              2. |   ZZ        2        F          D          1       XX          E |
                              3. |   ZZ        3        A          D          1       XX          E |
                              4. |   ZZ        3        D          A          2       YY          B |
                                 +------------------------------------------------------------------+
                            However, I am still having problems when running the code above in my real data. Specifically, Stata does not allow me to run the second -jointby-. Please let me know in case you have any suggestions. I will have a careful look on the code proposed by Robert.

                            Code:
                            . * File with egos (sender)
                            . rename phy_id ego_id
                            
                            . save egos, replace
                            file egos.dta saved
                            
                            .
                            . * File with alters (receiver)
                            . rename ego_id alter_id
                            
                            . save alters, replace
                            file alters.dta saved
                            
                            .
                            . * File with pairwise combinations of ego/alter working in same hospital at same time
                            . use egos, clear
                            
                            . joinby hosp period using alters.dta
                            
                            . drop if ego_id == alter_id // no self pairs
                            (805,594 observations deleted)
                            
                            . save colleagues.dta, replace
                            file colleagues.dta saved
                            
                            .
                            . * Adding colleagues of colleagues in different hospitals sometime in the past
                            . use colleagues.dta, clear
                            
                            . rename (hosp period alter_id ego_id) (a_hosp a_period a_alter_id alter_id)
                            
                            . joinby alter_id using colleagues.dta //still not running for the whole dataset
                            sum of expand values exceed 2,147,483,620
                                The dataset may not contain more than 2,147,483,620 observations.
                            r(459);
                            
                            end of do-file
                            
                            r(459);
                            
                            .
                            Last edited by Paula de Souza Leao Spinola; 04 Feb 2019, 07:55.

                            Comment


                            • #15
                              Paula, I think you can get past this one last issue by doing the last -joinby- on subsets ("batches") of the colleagues file: You can take a batch of the colleagues (ego/alter pairs), -joinby- to get the past matches, do the next batch, etc.? If the logic is ok, you just need to make the batches small enough to not exceed the -expand- limits.

                              Code:
                              * Adding colleagues of colleagues in different hospitals sometime in the past
                              use colleagues.dta, clear
                              rename (hosp period alter_id ego_id) (a_hosp a_period a_alter_id alter_id)
                              // New stuff starts
                              // 10 batches; maybe more (smaller subsets) would be needed.
                              gen byte batch = 1 + mod(_N,10)
                              preserve
                              forval i = 1/10 {
                                keep if (batch == `i')
                                joinby alter_id using colleagues.dta
                                save batch`i'
                                restore
                              }
                              clear
                              save whole.dta, emptyok
                              forval i = 1/10 {
                                append using batch`i'
                              }
                              Last edited by Mike Lacy; 04 Feb 2019, 11:20.

                              Comment

                              Working...
                              X