Data preparation question for networks

Uzaib Yasin

Join Date: Feb 2019
Posts: 4

04 Mar 2020, 09:15

Dear Statalist users,

Firstly, I am a new user of Statalist, so I apologize in advance if I am not using dataex properly to describe my data.

I am trying to prepare my data for analysis and running into some problems. I have respondent-level data in long format on up to 5 members in their network. I am now running into a separate issue with the variables that identify whether the network members know each other (variable is called know_* where the * indicates whether person 1 knows person 2, such as know_1_2 etc.). A separate variable n* indicates the initials of the pair e.g. n1_2 is EM BK. The goal here is to create an edge list so that I can do social network analysis (information on the nodes is in a separate file in long format, by householdid on each of the names mentioned in the network). I had wanted the final edge list dataset to look like the below:

householdid	name pair	know
10101	EM BK	Yes
10101	EM PK	Yes
10101	BK PK	Yes

A tranche of the current data is copied below (the text in red corresponds to the data in red above)

Variables (column headers): householdid, names_, names_repeat_count, n1_2, n1_3, n1_4, n2_3, n2_4, n3_4, count, person, know_1_1, know_1_2, know_1_3, know_1_4, know_1_5, know_2_1, know_2_2, know_2_3

Non-empty cells in each of the three rows shown:

10101	EM BK	EM PK	BK PK	Yes
10101	EM BK	EM PK	BK PK	Yes
10101	EM BK	EM PK	BK PK	Yes

* Example generated by -dataex-. To install: ssc install dataex
clear
input str5 householdid str2 names_ str5(n1_2 n1_3 n2_3) byte(know_1_2 know_1_3 know_2_3)
"10101" "BK" "EM BK" "EM PK" "BK PK" 1 1 1
"10101" "EM" "EM BK" "EM PK" "BK PK" 1 1 1
"10101" "PK" "EM BK" "EM PK" "BK PK" 1 1 1
end
label values know_1_2 know_1_2
label def know_1_2 1 "Yes", modify
label values know_1_3 know_1_3
label def know_1_3 1 "Yes", modify
label values know_2_3 know_2_3
label def know_2_3 1 "Yes", modify

I would appreciate any advice on how to go about doing this. I've read some of the help files for reshaping but haven't come across a solution, or maybe this is just a simple thing that I don't quite know how to do yet?

Thank you!

Last edited by Uzaib Yasin; 04 Mar 2020, 09:20.


Mike Lacy

Join Date: Apr 2014

Posts: 2404
#2

04 Mar 2020, 10:02

Your situation is likely easily solvable, but your presentation is quite hard to follow, for typographical and other reasons. Ask a colleague not familiar with your data set to look at what you have posted and offer suggestions about a clearer description. Re-read the StataList FAQ. Then, do this:

1) Use -dataex- to show an example of your current data. Include the markers with it, as -dataex- instructs you to do onscreen. Right after that example, describe it so as to make clear what each observation is and what each variable means.
2) Do the same for the data you *want* to have.

I understand you have tried to do 1) and 2), but it's not working, at least not for me. Among other things, I don't know what a "tranche" means in this or any other context, and the material following that sentence did not show well on the screen, as I presume you can tell now.

Uzaib Yasin

Join Date: Feb 2019
Posts: 4

04 Mar 2020, 11:58

Thanks, Mike. Yes, sorry about that. I attempt below to re-frame my question in the hopes it is clear now.

My data currently look like this (sample below of 10 obs)

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input int householdid str2 names_ byte size str5(n1_2 n1_3 n1_4 n2_3 n2_4 n3_4) byte(know_1_2 know_1_3 know_1_4 know_2_3 know_2_4 know_3_4)
10101 "EM" 3 "EM BK" "EM PK" ""      "BK PK" ""      ""      1 1 . 1 . .
10101 "BK" 3 "EM BK" "EM PK" ""      "BK PK" ""      ""      1 1 . 1 . .
10101 "PK" 3 "EM BK" "EM PK" ""      "BK PK" ""      ""      1 1 . 1 . .
10101 ""   3 "EM BK" "EM PK" ""      "BK PK" ""      ""      1 1 . 1 . .
10101 ""   3 "EM BK" "EM PK" ""      "BK PK" ""      ""      1 1 . 1 . .
10102 "MA" 4 "MA MM" "MA JK" "MA BN" "MM JK" "MM BN" "JK BN" 1 1 1 1 1 1
10102 "MM" 4 "MA MM" "MA JK" "MA BN" "MM JK" "MM BN" "JK BN" 1 1 1 1 1 1
10102 "JK" 4 "MA MM" "MA JK" "MA BN" "MM JK" "MM BN" "JK BN" 1 1 1 1 1 1
10102 "BN" 4 "MA MM" "MA JK" "MA BN" "MM JK" "MM BN" "JK BN" 1 1 1 1 1 1
10102 ""   4 "MA MM" "MA JK" "MA BN" "MM JK" "MM BN" "JK BN" 1 1 1 1 1 1
end
label values know_1_2 know_1_2
label def know_1_2 1 "Yes", modify
label values know_1_3 know_1_3
label def know_1_3 1 "Yes", modify
label values know_1_4 know_1_4
label def know_1_4 1 "Yes", modify
label values know_2_3 know_2_3
label def know_2_3 1 "Yes", modify
label values know_2_4 know_2_4
label def know_2_4 1 "Yes", modify
label values know_3_4 know_3_4
label def know_3_4 1 "Yes", modify
label var householdid "What is the household ID (enter again)" 
label var names_ "Please tell me the initials of one person outside the household you like to meet" 
label var size "Size of network" 
label var know_1_2 "Does 1 know 2?" 
label var know_1_3 "Does 1 know 3?" 
label var know_1_4 "Does 1 know 4?" 
label var know_2_3 "Does 2 know 3?" 
label var know_2_4 "Does 2 know 4?" 
label var know_3_4 "Does 3 know 4?"

The data are in long format, where each row shows the household ID ("householdid"), and then the initials of a person in the network of that household ID.

So for example,

householdid 10101 has 3 people described in the network, listed on 3 separate rows: EM, BK, PK. Similarly, householdid 10102 has 4 people in the network.
The subsequent columns show different pairings of these network members (e.g. n1_2 is EM BK, n1_3 is EM PK and so forth).
The variable starting with know* denotes whether the pairs know each other (e.g. if person 1 knows person 2, know_1_2 ==1).

The goal here is to create an edge list so that I can do social network analysis (information on the nodes is in a separate file in long format, by householdid on each of the names mentioned in the network).

This is the data I want to have (with additional rows), so that each row shows the unique householdid and name pairs, and whether the pairs know each other

householdid	name pair	know
10101	EM BK	Yes
10101	EM PK	Yes
10101	BK PK	Yes

Would appreciate any thoughts, and I hope this presentation is a bit more clear!

Last edited by Uzaib Yasin; 04 Mar 2020, 12:06.

Mike Lacy

Join Date: Apr 2014

Posts: 2404
#4

04 Mar 2020, 17:16

I feel your pain here--your data are quite a mess as regards constructing them as a social network.

Because every line for a household has n* fields that duplicate, your data set seems very redundant. Why would I need anything other than one observation for each household in your original data? It would be much easier, I think, to start out with, in wide format, a list of named persons for each household (p1, p2, p3 ...) and a list of know* variables for that list of p* names. To do that, one would either need to rely on the order of names on observations within household being correct (dangerous), or the order of pairs listed in the n* fields being ok. I presume the latter is more trustworthy. Am I correct?

Your problem is easier if the number of persons listed by a household is less than 10, i.e., there is nothing possible beyond n8_9. Is that correct?

Last edited by Mike Lacy; 04 Mar 2020, 17:49.
Mike Lacy

Join Date: Apr 2014

Posts: 2404
#5

04 Mar 2020, 18:16

Aha, I think it's much simpler. I have something that works for your example data, which as I have noted, has the likely misleading feature that every possible pair of persons named by a household knows each other. That issue aside, try this:

Code:

bysort householdid: keep if _n ==1 // multiple household observations are redundant
drop size names_ // irrelevant variables
reshape long n know_, i(householdid) j(pair) string // original data was *not* long
drop if missing(know_) // edge does not exist
// Network programs will want nodes of edges stored in different variables
gen str person1 = word(n, 1)
gen str person2 = word(n, 2)
drop pair n // not needed
rename know_ tie // more standard term

If I recall correctly, the user-written Stata module -nwcommands- (-search nwcommands-) can take an edgelist in this form, make a network out of it, and do most common network analyses.
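The original question mentions that node information sits in a separate long-format file keyed by householdid and name. A minimal sketch of attaching such attributes to the first node of each edge follows. The node file contents and the attribute `age` are hypothetical, invented here only for illustration, and the edge list is typed in by hand in the form the code above produces.

```stata
* Hypothetical node file: one row per household and named person.
* The attribute "age" is an assumption for illustration only.
clear
input int householdid str2 names_ byte age
10101 "EM" 34
10101 "BK" 41
10101 "PK" 29
end
rename names_ person1      // match the key name used in the edge list
rename age age1            // attribute of the first node of each edge
tempfile nodes
save `nodes'

* Edge list in the form produced by the reshape code above
clear
input int householdid byte tie str2(person1 person2)
10101 1 "EM" "BK"
10101 1 "EM" "PK"
10101 1 "BK" "PK"
end

* Many edges can share one person1, so this is an m:1 merge
merge m:1 householdid person1 using `nodes', keep(master match) nogenerate
list, noobs
```

Repeating the merge with the key renamed to person2 would attach the second node's attributes in the same way.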
Uzaib Yasin

Join Date: Feb 2019

Posts: 4
#6

05 Mar 2020, 11:14

Thanks, Mike!

To clarify, each household can only name up to (and including) 5 persons. Many choose to give only 2-3 on average, however. The dataset was initially in wide format with each row describing the list of named persons, and a list of know variables for that list of names, as you indicated. I created this long format dataset with the know variables thinking this would be easier. I used the code you had provided, and it ran once I made a slight modification to the know variable in the reshape command:

Code:

bysort householdid: keep if _n ==1 // multiple household observations are redundant
drop size names_ // irrelevant variables
reshape long n know, i(householdid) j(pair) string // data was not long
drop if missing(know) & n=="" // drop if edge does not exist
// Network programs want nodes of edges stored in different variables
gen str person1 = word(n, 1)
gen str person2 = word(n, 2)

The issue I now have is that the pairs and their "Yes"/"No" responses for the know variable end up on different rows. As you'll see from the example below, the first three rows have the pair and the person1 and person2 variables you created, but know is missing there; it is filled in only on the subsequent three rows, where the pair is blank.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input int householdid str4 pair byte know str5 n str2(person1 person2)
10101 "1_2" . "EM BK" "EM" "BK"
10101 "1_3" . "EM PK" "EM" "PK"
10101 "2_3" . "BK PK" "BK" "PK"
10101 "_1_2" 1 "" "" ""
10101 "_1_3" 1 "" "" ""
10101 "_2_3" 1 "" "" ""
10102 "1_2" . "MA MM" "MA" "MM"
10102 "1_3" . "MA JK" "MA" "JK"
10102 "1_4" . "MA BN" "MA" "BN"
10102 "2_3" . "MM JK" "MM" "JK"
end
label values know know_5_5
label def know_5_5 1 "Yes", modify

I am trying to fill in the missing values using the code below, but I know this is wrong, as it incorrectly replaces the missing values with data from the wrong row. Do you have any insights?

Code:

replace know = know[_n-1] if know>=.

Thank you!

Mike Lacy

Join Date: Apr 2014
Posts: 2404

05 Mar 2020, 19:39

I'm sorry, I don't understand your explanation of the problem. And, deleting the "_" from "know_" in the reshape command is wrong, I think, as "know_" is the stub of the know_* variables. When I run my code on your example data, I get a data set as follows:

Code:

     +------------------------------------+
     | househ~d   tie   person1   person2 |
     |------------------------------------|
  1. |    10101     1        EM        BK |
  2. |    10101     1        EM        PK |
  3. |    10101     1        BK        PK |
  4. |    10102     1        MA        MM |
  5. |    10102     1        MA        JK |
     |------------------------------------|
  6. |    10102     1        MA        BN |
  7. |    10102     1        MM        JK |
  8. |    10102     1        MM        BN |
  9. |    10102     1        JK        BN |
     +------------------------------------+

Perhaps you have a different example data set from which you are producing the problem?
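One way to see why the stub matters is to run both versions of the reshape on a single household. This is a minimal sketch on a one-row cut of the example data, not a diagnosis of the larger data set. With the stub "know", the suffix that reshape stores in pair is "_1_2", which does not match the "1_2" suffix of n1_2, so each pair is split across two rows.

```stata
clear
input int householdid str5(n1_2 n1_3 n2_3) byte(know_1_2 know_1_3 know_2_3)
10101 "EM BK" "EM PK" "BK PK" 1 1 1
end

preserve
* Stub without the underscore: suffixes are "1_2" for n but "_1_2" for know
reshape long n know, i(householdid) j(pair) string
list householdid pair n know, noobs
assert _N == 6          // each pair appears twice, half-filled each time
restore

* Stub with the underscore: both variable groups share the suffix "1_2"
reshape long n know_, i(householdid) j(pair) string
list householdid pair n know_, noobs
assert _N == 3          // one complete row per pair
```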

Uzaib Yasin

Join Date: Feb 2019

Posts: 4
#8

06 Mar 2020, 07:54

Thanks for walking me through this, Mike. I was using the larger dataset and did not delete another "know_" variable before reshaping it. So your code worked just fine. I will use the nwcommands package as you suggested. Thanks so much again for your advice!
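For anyone who hits the same problem with a larger file, a possible guard is to keep only the variables that fit the expected naming pattern before reshaping. This is a sketch under assumptions: `know_extra` is a hypothetical stand-in for the stray variable, whose actual name is not given in the thread.

```stata
* know_extra is a hypothetical stray variable, used only for illustration
clear
input int householdid str5(n1_2 n1_3 n2_3) byte(know_1_2 know_1_3 know_2_3 know_extra)
10101 "EM BK" "EM PK" "BK PK" 1 1 1 1
end

* In a varlist, ? matches exactly one character, so this keeps only
* variables shaped like n#_# and know_#_# (plus the household id)
keep householdid n?_? know_?_?

reshape long n know_, i(householdid) j(pair) string
assert inlist(pair, "1_2", "1_3", "2_3")   // no unexpected suffixes remain
list, noobs
```

The single-character wildcard works here because at most 5 persons are named per household, so every index is one digit.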