creating dummy variables for unique pairwise combinations

Sander Ramboer

Join Date: Feb 2015

Posts: 2
#1

07 Mar 2015, 12:05

Dear all,

I am looking for a way to create dummies from a twoway table of variables that contain the same values, so something similar to the “tabulate variable, gen(newvariable)” command but with a varlist instead of just one variable.

The setting is as follows:
In one column (variable cityA) I have a complete list of cities and in another (variable cityB) I have the neighbouring cities for each city in the first column. Because of this setup, each combination of two neighbouring cities is observed twice (as each city in cityB appears in cityA and vice versa). Additionally, a third and a fourth variable, regionA and regionB, indicate the regions in which cityA and cityB, respectively, are located. The two cities could be located in the same region or in different regions (in which case both are located at the border of their respective regions).

Now what I would like to do is create an indicator variable for each unique combination of regions in the dataset, with the aim of both dividing up my dataset geographically and being able to distinguish between intra- and interregional neighbours. I thought of using “egen combo= group(regionA regionB)” followed by “tabulate combo, gen(neighbour)”, but “egen, group” is not suitable for this because it assigns a different value to each ordered combination, so the same pair of regions gets two values. For example, the combination XYZ-ABC is considered different from ABC-XYZ, whereas I need it to be considered the same, just like “tabulate regionA regionB” would. Is there a way around this issue? If deleting one of the two possible combinations between two cities is my only option, how would I go about that?

Thanks!

Best,
Sander

Clyde Schechter

Join Date: Apr 2014

Posts: 29948
#2

07 Mar 2015, 12:30

There may be something that does this in a single command that I'm not aware of. But something like this should do it from first principles:

Code:

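// collect the distinct region codes on each side
levelsof regionA, local(rA)
levelsof regionB, local(rB)
foreach a of local rA {
    foreach b of local rB {
        // handle each unordered pair once
        if "`a'" >= "`b'" {
            gen byte indicator_`a'_`b' = (regionA == "`a'" & regionB == "`b'")
            // flag the same pair observed in the reverse order
            replace indicator_`a'_`b' = 1 if regionB == "`a'" & regionA == "`b'"
        }
    }
}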

Now, you may run into some problems with this. First, I'm assuming that regionA and regionB are both string variables. If they aren't, life is actually simpler--all you have to do is replace

Code:

gen byte indicator_`a'_`b' = (regionA == "`a'" & regionB == "`b'")
replace indicator_`a'_`b' = 1 if regionB == "`a'" & regionA == "`b'"

with

Code:

gen byte indicator_`a'_`b' = (regionA == `a' & regionB == `b')
replace indicator_`a'_`b' = 1 if regionB == `a' & regionA == `b'

But if they are string variables, it is also possible that they contain characters that cannot appear in a variable name (e.g. blanks, special characters). In that case, you have some choices. You can -encode- regionA and regionB to make them numeric, and then calculate the indicator variables from the -encode-d variables. The main drawback is that the resulting indicator variable names will contain numeric codes rather than the region names themselves, so they have no direct mnemonic value. You could then write some additional code to label those variables mnemonically if you need that convenience. Or you can edit the values of those variables to eliminate the problem. Of course, this might create conflicts with other work you have already done with the variables using their original values. Another approach is to do it this way:

Code:

// strtoname() turns the string into a legal Stata variable name
local newvarname = strtoname(`"indicator_`a'_`b'"')
gen byte `newvarname' = (regionA == "`a'" & regionB == "`b'")
replace `newvarname' = 1 if regionB == "`a'" & regionA == "`b'"

That will probably get you around the problem, although if some values of `a' or `b' are very long, it may lead to obscure variable names.
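The -encode- route mentioned above, with mnemonic variable labels added, might look roughly like the following. This is only a sketch under assumptions: regionA and regionB are taken to be string variables, and the names rA_num, rB_num and regionlbl are hypothetical. Both -encode- calls share one value label so that a given region receives the same numeric code on either side.

```stata
* Sketch only: assumes regionA and regionB are string variables.
* rA_num, rB_num and regionlbl are hypothetical names.
encode regionA, generate(rA_num) label(regionlbl)
encode regionB, generate(rB_num) label(regionlbl)   // reuses and extends regionlbl

levelsof rA_num, local(rA)
levelsof rB_num, local(rB)
foreach a of local rA {
    foreach b of local rB {
        if `a' >= `b' {
            * 1 if the pair occurs in either order, 0 otherwise
            gen byte indicator_`a'_`b' = (rA_num == `a' & rB_num == `b') | (rA_num == `b' & rB_num == `a')
            * look up the original region names for the variable label
            local la : label regionlbl `a'
            local lb : label regionlbl `b'
            label variable indicator_`a'_`b' "`la' - `lb'"
        }
    }
}
```

The variable names stay short and numeric, while -describe- shows which two regions each indicator refers to.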

Last edited by Clyde Schechter; 07 Mar 2015, 12:36.

Sander Ramboer

Join Date: Feb 2015
Posts: 2

07 Mar 2015, 15:52

Thank you very much, that did it!

Luckily, the regions have their own 2-digit codes, so there is no need to encode them. Below is the code I used, with some typos corrected:

Code:

levelsof regionA, local(rA)
levelsof regionB, local(rB)

foreach a of local rA {
    foreach b of local rB {
        if `a' >= `b' {
            gen byte indicator_`a'_`b' = (regionA == `a' & regionB == `b')
            replace indicator_`a'_`b' = 1 if regionB == `a' & regionA == `b'
        }
    }
}
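
A quick sanity check on the result could look like this. It is a sketch, not part of the code I was given: it assumes numeric, nonmissing region codes and that every pair really does appear in both directions, and the names nflags and intra are hypothetical.

```stata
* Sketch only: nflags and intra are hypothetical names.
* Count how many indicators are switched on in each observation.
egen byte nflags = rowtotal(indicator_*)

* Under the assumptions above, every observation belongs to exactly one region pair.
assert nflags == 1
drop nflags

* Separate flag for neighbours within the same region.
gen byte intra = (regionA == regionB)
```

If the -assert- fails, some observations fall into no pair or into more than one, which would point to missing region codes or to pairs that are observed in one direction only.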

Last edited by Sander Ramboer; 07 Mar 2015, 15:56.

Mohammad Ali

Join Date: Sep 2015

Posts: 1
#4

23 Sep 2015, 21:40

First of all, thanks a lot, Clyde, for sharing the code. My problem is that when I try this code, I get the message "too many variables specified". To give you some background on my data set: it covers 34 countries over a number of years, and each country forms a unique dyad with every other country; these dyads serve as panels.
Is there a way around this problem of too many variables?

Bram Hogendoorn

Join Date: Jun 2017

Posts: 31
#5

15 Apr 2019, 04:22

I have a similar problem. My data comprise information on each partner in a couple. Each partner has their own identifier (id) and names the other partner through that partner's identifier (pid). I would like to create a variable that identifies the couples and assigns its value to both partners. This works for me.

Code:

gen coupleid = cond(id>pid,id,pid)

This assigns to both partners in a couple the larger of their two individual identifiers (id or pid). This can be useful if, for example, you would like to adjust your standard errors for the dependency within couples (e.g. by adding the option cluster(coupleid) to your estimator).

Last edited by Bram Hogendoorn; 15 Apr 2019, 04:33. Reason: clarification

Nick Cox

Join Date: Mar 2014

Posts: 35405
#6

15 Apr 2019, 04:39

#5 But a couple isn't uniquely tagged by whichever identifier is lower. More at https://journals.sagepub.com/doi/pdf...867X0800800414
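
One common way to tag a pair regardless of order is to combine the lower and the higher identifier, not just one of them. The following is a minimal sketch, assuming id and pid are numeric and nonmissing; the names lo and hi are hypothetical, and the linked article may set this up differently.

```stata
* Sketch only: assumes numeric, nonmissing id and pid; lo and hi are hypothetical names.
* Use double so that large identifiers are stored exactly.
gen double lo = min(id, pid)
gen double hi = max(id, pid)

* One group code per unordered pair: (id, pid) and (pid, id) get the same value.
egen long coupleid = group(lo hi)
drop lo hi
```

Note that min() and max() ignore missing arguments in Stata, so observations with a missing pid would need separate handling before this is applied.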

Bram Hogendoorn

Join Date: Jun 2017

Posts: 31
#7

26 Apr 2019, 11:30

Thank you, Nick. I was struggling with that problem and was not able to find your post again; it is very helpful.