  • Data preparation question for networks

    Dear Statalist users,

    Firstly, I am a new user of Statalist, so I apologize in advance if I am not using dataex properly to describe my data.

    I am trying to prepare my data for analysis and am running into some problems. I have respondent-level data in long format on up to 5 members of each respondent's network. My issue is with the variables that record whether the network members know each other: these are called know_*, where the suffix identifies the pair (e.g. know_1_2 records whether person 1 knows person 2). A separate set of variables, n*, holds the initials of each pair (e.g. n1_2 is "EM BK"). The goal is to create an edge list so that I can do social network analysis (information on the nodes is in a separate file in long format, by householdid, for each of the names mentioned in the network). I would like the final edge list dataset to look like this:
    householdid   name pair   know
    10101         EM BK       Yes
    10101         EM PK       Yes
    10101         BK PK       Yes
    A tranche of the current data is copied below (the text in red corresponds to the data in red above)
    householdid names_ names_repeat_count n1_2 n1_3 n1_4 n2_3 n2_4 n3_4 count person know_1_1 know_1_2 know_1_3 know_1_4 know_1_5 know_2_1 know_2_2 know_2_3
    10101 BK 3 EM BK EM PK BK PK 1 2 Yes Yes Yes
    10101 EM 3 EM BK EM PK BK PK 1 1 Yes Yes Yes
    10101 PK 3 EM BK EM PK BK PK 1 3 Yes Yes Yes
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str5 householdid str2 names_ str5(n1_2 n1_3 n2_3) byte(know_1_2 know_1_3 know_2_3)
    "10101" "BK" "EM BK" "EM PK" "BK PK" 1 1 1
    "10101" "EM" "EM BK" "EM PK" "BK PK" 1 1 1
    "10101" "PK" "EM BK" "EM PK" "BK PK" 1 1 1
    end
    label values know_1_2 know_1_2
    label def know_1_2 1 "Yes", modify
    label values know_1_3 know_1_3
    label def know_1_3 1 "Yes", modify
    label values know_2_3 know_2_3
    label def know_2_3 1 "Yes", modify

    I would appreciate any advice on how to go about doing this. I've read some of the help files for reshaping but haven't come across a solution, or maybe this is just a simple thing that I don't quite know how to do yet?

    Thank you!
    Last edited by Uzaib Yasin; 04 Mar 2020, 09:20.

  • #2
    Your situation is likely easily solvable, but your presentation is quite hard to follow, for typographical and other reasons. Ask a colleague not familiar with your data set to look at what you have posted and offer suggestions about a clearer description. Re-read the Statalist FAQ. Then, do this:


    1) Use -dataex- to show an example of your current data. Include the markers with it, as -dataex- instructs you to do onscreen. Right after that example, describe it so as to make clear what each observation is and what each variable means.
    2) Do the same for the data you *want* to have.

    I understand you have tried to do 1) and 2), but it's not working, at least not for me. Among other things, I don't know what a "tranche" means in this or any other context, and the material following that sentence did not show well on the screen, as I presume you can tell now.

    • #3
      Thanks, Mike. Yes, sorry about that. I attempt below to re-frame my question in the hopes it is clear now.

      My data currently look like this (sample below of 10 obs)

      Code:
      * Example generated by -dataex-. To install: ssc install dataex
      clear
      input int householdid str2 names_ byte size str5(n1_2 n1_3 n1_4 n2_3 n2_4 n3_4) byte(know_1_2 know_1_3 know_1_4 know_2_3 know_2_4 know_3_4)
      10101 "EM" 3 "EM BK" "EM PK" ""      "BK PK" ""      ""      1 1 . 1 . .
      10101 "BK" 3 "EM BK" "EM PK" ""      "BK PK" ""      ""      1 1 . 1 . .
      10101 "PK" 3 "EM BK" "EM PK" ""      "BK PK" ""      ""      1 1 . 1 . .
      10101 ""   3 "EM BK" "EM PK" ""      "BK PK" ""      ""      1 1 . 1 . .
      10101 ""   3 "EM BK" "EM PK" ""      "BK PK" ""      ""      1 1 . 1 . .
      10102 "MA" 4 "MA MM" "MA JK" "MA BN" "MM JK" "MM BN" "JK BN" 1 1 1 1 1 1
      10102 "MM" 4 "MA MM" "MA JK" "MA BN" "MM JK" "MM BN" "JK BN" 1 1 1 1 1 1
      10102 "JK" 4 "MA MM" "MA JK" "MA BN" "MM JK" "MM BN" "JK BN" 1 1 1 1 1 1
      10102 "BN" 4 "MA MM" "MA JK" "MA BN" "MM JK" "MM BN" "JK BN" 1 1 1 1 1 1
      10102 ""   4 "MA MM" "MA JK" "MA BN" "MM JK" "MM BN" "JK BN" 1 1 1 1 1 1
      end
      label values know_1_2 know_1_2
      label def know_1_2 1 "Yes", modify
      label values know_1_3 know_1_3
      label def know_1_3 1 "Yes", modify
      label values know_1_4 know_1_4
      label def know_1_4 1 "Yes", modify
      label values know_2_3 know_2_3
      label def know_2_3 1 "Yes", modify
      label values know_2_4 know_2_4
      label def know_2_4 1 "Yes", modify
      label values know_3_4 know_3_4
      label def know_3_4 1 "Yes", modify
      label var householdid "What is the household ID (enter again)" 
      label var names_ "Please tell me the initials of one person outside the household you like to meet" 
      label var size "Size of network" 
      label var know_1_2 "Does 1 know 2?" 
      label var know_1_3 "Does 1 know 3?" 
      label var know_1_4 "Does 1 know 4?" 
      label var know_2_3 "Does 2 know 3?" 
      label var know_2_4 "Does 2 know 4?" 
      label var know_3_4 "Does 3 know 4?"

      The data are in long format, where each row shows the household ID ("householdid"), and then the initials of a person in the network of that household ID.

      So for example,
      • householdid 10101 has 3 people described in the network, listed on 3 separate rows: EM, BK, PK. Similarly, householdid 10102 has 4 people in the network.
      • The subsequent columns show different pairings of these network members (e.g. n1_2 is EM BK, n1_3 is EM PK and so forth).
      • The variables starting with know_ denote whether the members of each pair know each other (e.g. if person 1 knows person 2, know_1_2 == 1).
      The goal here is to create an edge list so that I can do social network analysis (information on the nodes is in a separate file in long format, by householdid on each of the names mentioned in the network).

      This is the data I want to have (with additional rows), so that each row shows a unique combination of householdid and name pair, and whether the members of that pair know each other:
      householdid   name pair   know
      10101         EM BK       Yes
      10101         EM PK       Yes
      10101         BK PK       Yes
      Would appreciate any thoughts, and I hope this presentation is a bit more clear!
      Last edited by Uzaib Yasin; 04 Mar 2020, 12:06.

      • #4
        I feel your pain here--your data are quite a mess as regards constructing them as a social network.

        Because every line for a household has n* fields that duplicate, your data set seems very redundant. Why would I need anything other than one observation for each household in your original data? It would be much easier, I think, to start out with, in wide format, a list of named persons for each household (p1, p2, p3 ...) and a list of know* variables for that list of p* names. To do that, one would either need to rely on the order of names on observations within household being correct (dangerous), or the order of pairs listed in the n* fields being ok. I presume the latter is more trustworthy. Am I correct?

        Your problem is easier if the number of persons listed by a household is less than 10, i.e., there is nothing possible beyond n8_9. Is that correct?
        Last edited by Mike Lacy; 04 Mar 2020, 17:49.

        • #5
          Aha, I think it's much simpler. I have something that works for your example data, which, as I have noted, has the likely misleading feature that every possible pair of persons named by a household knows each other. That issue aside, try this:
          Code:
          bysort householdid: keep if _n ==1 //  multiple household observations are redundant
          drop size names_ // irrelevant variables 
          reshape long n know_, i(householdid) j(pair) string  // original data was *not* long
          drop if missing(know_)   // edge does not exist
          // Network programs will want nodes of edges stored in different variables
          gen str person1 = word(n, 1)
          gen str person2 = word(n, 2)
          drop pair n  // not needed
          rename know_ tie  // more standard term
          If I recall correctly, the user-written Stata module -nwcommands- (-search nwcommands-) can take an edgelist in this form, make a network out of it, and do most common network analyses.
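
          The example data contain only ties coded 1 ("Yes") or missing, so the behaviour for pairs who do not know each other is untested. A minimal sketch, assuming such pairs are coded 0 (a hypothetical coding not shown in the posted data; the household 20001 and its initials are invented for illustration): after the reshape those pairs survive as rows with tie == 0, and they can be kept or dropped depending on what the network software expects.

          ```stata
          * Hypothetical household in which AA and CC do not know each other (coded 0)
          clear
          input int householdid str5(n1_2 n1_3 n2_3) byte(know_1_2 know_1_3 know_2_3)
          20001 "AA BB" "AA CC" "BB CC" 1 0 1
          end

          reshape long n know_, i(householdid) j(pair) string
          drop if missing(know_)          // removes only pairs that were never asked
          gen str person1 = word(n, 1)
          gen str person2 = word(n, 2)
          drop pair n
          rename know_ tie

          list, noobs                     // the AA-CC row is still present, with tie == 0
          keep if tie == 1                // keep only actual edges
          ```

          Whether to keep the tie == 0 rows is a design choice: some edge-list readers treat every listed row as an edge, in which case dropping them first is the safer option.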

          • #6
            Thanks, Mike!

            To clarify, each household can only name up to (and including) 5 persons. Many choose to give only 2-3 on average, however. The dataset was initially in wide format with each row describing the list of named persons, and a list of know variables for that list of names, as you indicated. I created this long format dataset with the know variables thinking this would be easier. I used the code you had provided, and it worked if I made slight modifications to the know variable in the reshape command

            Code:
            bysort householdid: keep if _n ==1 //  multiple household observations are redundant
            drop size names_ // irrelevant variables
            reshape long n know, i(householdid) j(pair) string  // data was not long
            drop if missing(know) & n==""   // drop if edge does not exist
            // Network programs want nodes of edges stored in different variables
            gen str person1 = word(n, 1)
            gen str person2 = word(n, 2)
            The issue I now have is that the "Yes"/"No" response in the know variable does not line up with the pairs. As you'll see from the example below, the first three rows have the pair and the person1 and person2 variables as you created them, but the value of know is missing there and is instead filled in only on the subsequent three rows.
            Code:
            * Example generated by -dataex-. To install: ssc install dataex
            clear
            input int householdid str4 pair byte know str5 n str2(person1 person2)
            10101 "1_2"  . "EM BK" "EM" "BK"
            10101 "1_3"  . "EM PK" "EM" "PK"
            10101 "2_3"  . "BK PK" "BK" "PK"
            10101 "_1_2" 1 ""      ""   "" 
            10101 "_1_3" 1 ""      ""   "" 
            10101 "_2_3" 1 ""      ""   "" 
            10102 "1_2"  . "MA MM" "MA" "MM"
            10102 "1_3"  . "MA JK" "MA" "JK"
            10102 "1_4"  . "MA BN" "MA" "BN"
            10102 "2_3"  . "MM JK" "MM" "JK"
            end
            label values know know_5_5
            label def know_5_5 1 "Yes", modify
            I am trying to fill in the missing values using the code below, but I know this is wrong, as it incorrectly replaces the missing values with data from the wrong row. Do you have any insights?

            Code:
            replace know = know[_n-1] if know>=.
            Thank you!

            • #7
              I'm sorry, I don't understand your explanation of the problem. And deleting the "_" from the "know_" stub in the reshape command is wrong, I think, as "know_" is the stub of the know_* variables. When I run my code on your example data, I get a data set as follows:


              Code:
                   +------------------------------------+
                   | househ~d   tie   person1   person2 |
                   |------------------------------------|
                1. |    10101     1        EM        BK |
                2. |    10101     1        EM        PK |
                3. |    10101     1        BK        PK |
                4. |    10102     1        MA        MM |
                5. |    10102     1        MA        JK |
                   |------------------------------------|
                6. |    10102     1        MA        BN |
                7. |    10102     1        MM        JK |
                8. |    10102     1        MM        BN |
                9. |    10102     1        JK        BN |
                   +------------------------------------+

              Perhaps you have a different example data set from which you are producing the problem?
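
              The pair values "_1_2", "_1_3" in the listing in #6 are consistent with what the stub change alone would produce. A minimal sketch on a cut-down, one-household version of the example data (only two pairs are kept here for brevity): with the stub know, the suffix that reshape extracts from know_1_2 is "_1_2", which never matches the "1_2" extracted from n1_2, so each pair is split across two rows.

              ```stata
              clear
              input int householdid str5(n1_2 n1_3) byte(know_1_2 know_1_3)
              10101 "EM BK" "EM PK" 1 1
              end

              * Stub "know": suffixes for know_* start with "_" and do not match those of n*
              preserve
              reshape long n know, i(householdid) j(pair) string
              list, noobs        // four rows: n and know are never filled on the same row
              restore

              * Stub "know_": both sets of variables share the suffixes "1_2" and "1_3"
              reshape long n know_, i(householdid) j(pair) string
              list, noobs        // two rows, one per pair, with n and know_ together
              ```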

              • #8
                Thanks for walking me through this, Mike. I was using the larger dataset and did not delete another "know_" variable before reshaping it. So your code worked just fine. I will use the nwcommands package as you suggested. Thanks so much again for your advice!
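
                A leftover variable of this kind can be caught before reshaping. A minimal sketch, assuming the stray variable is a know_ variable with no matching n variable (know_5_5 is used here as a hypothetical example; the loop and its message are illustrative, not part of the code above):

                ```stata
                clear
                input int householdid str5(n1_2) byte(know_1_2 know_5_5)
                10101 "EM BK" 1 .
                end

                * For each know_ variable, check that an n variable with the same suffix exists
                foreach v of varlist know_* {
                    local suffix = substr("`v'", 6, .)      // text after "know_"
                    capture confirm variable n`suffix', exact
                    if _rc {
                        display as text "`v' has no matching n`suffix'; dropping it"
                        drop `v'
                    }
                }

                reshape long n know_, i(householdid) j(pair) string
                ```

                The exact option stops confirm from accepting an abbreviated match, so a suffix such as "1" would not be silently matched to n1_2.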

