Find Unique String in Many Variables

Trevor Andrew

Join Date: Apr 2016

Posts: 94
#1

Find Unique String in Many Variables

17 Jan 2025, 22:08

I have 20 string variables (VAR_1-VAR_20) that show different codes. I need to create an indicator equals to one if one of these variables have a specific string (e.g., A10) and no other strings. I could only write a very tedious code that does that manually for each combination. Something like:

Code:

gen dummy=0 replace dummy=1 if VAR_1=="A10" & VAR_2=="" & VAR_3=="" & ... & VAR_20=="" replace dummy=1 if VAR_1=="" & VAR_2=="A10" & ... & VAR_20=="" . . . replace dummy=1 if VAR_1=="" & VAR_2=="" & ... & & VAR_20=="A10"

I have no idea how to code a nice loop. Can someone help please?
Tags: None
Chen Samulsion

Join Date: Jan 2018

Posts: 714
#2

17 Jan 2025, 23:53

Your description is not very clear for me, I don't know whether the following code will solve your problem. And you should install egenmore from SSC.

Code:

ssc install egenmore egen dummy=rany(VAR_1 - VAR_20), cond(@=="A10")
Comment
Trevor Andrew

Join Date: Apr 2016

Posts: 94
#3

18 Jan 2025, 11:06

Chen Samulsion your dummy will equal to one if one of the VAR_1-VAR_20 is equal to A10, no matter what the other variables equal to, right? I need that the dummy equals to 1 if one of the variables equals to A10 and the other 19 are empty.
Comment
Trevor Andrew

Join Date: Apr 2016

Posts: 94
#4

18 Jan 2025, 11:11

Here is a data example with 5 variables:
VAR_1 VAR_2 VAR_3 VAR_4 VAR_5 DUMMY

A10 B10 0

A10 1

B10 C10 0

A10 B10 C10 0

A10 D10 0
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 29683
#5

18 Jan 2025, 11:26

This is easy to do in long layout and very difficult with wide data.

Code:

* Example generated by -dataex-. For more info, type help dataex clear input str4(var_1 var_2 var_3) str1 var_4 str4 var_5 "A10" "B10" "" "" "" "A10" "" "" "" "" "" "B10" "C10" "" "" "A10" "B10" "C10" "" "" "A10" "" "" "" "D10" end gen `c(obs_t)' obs_no = _n reshape long var_, i(obs_no) by obs_no (_j), sort: egen a10_count = total(var_ == "A10" ) by obs_no (_j): egen mcount = total(missing(var_)) by obs_no (_j): gen byte wanted = (a10_count == 1) & (mcount == _N - 1) reshape wide

In the future, when showing data examples, please use the -dataex- command to do so, as I have here. If you are running version 18, 17, 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.
Comment

Bjarte Aagnes

Join Date: Apr 2014
Posts: 781

18 Jan 2025, 11:36

Code:

qui des v1-v20, varlist
gen byte one_A10 = `=subinstr("`r(varlist)'", char(32), "+",.)' == "A10"

Comment

Nick Cox

Join Date: Mar 2014
Posts: 35059

18 Jan 2025, 14:21

There are in fact other solutions not too bizarre.

Note first that we are looking for precisely one instance of A10 and all other instances being missing.

For that it is necessary and sufficient that the concatenation of variables yields "A10" exactly and the count of missings across observations is the number of variables MINUS 1, We can press two

Code:

egen

functions into action there.

A more general technique is just to loop over variables counting what we seek.

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str4(var_1 var_2 var_3) str1 var_4 str4 var_5
"A10" "B10" ""    "" ""  
"A10" ""    ""    "" ""  
""    "B10" "C10" "" ""  
"A10" "B10" "C10" "" ""  
"A10" ""    ""    "" "D10"
end

* approach 1
egen all = concat(var_?)

egen nmissing = rowmiss(var_?)

gen wanted = all == "A10" & nmissing == 4

* approach 2
gen countA10 = 0
gen countmiss = 0

foreach v of var var_? {
    replace countA10 = countA10 + (`v' == "A10")
    replace countmiss = countmiss + missing(`v')
}

gen WANTED = countA10 == 1 & countmiss == 4

list


     +------------------------------------------------------------------------------------------------------+
     | var_1   var_2   var_3   var_4   var_5         all   nmissing   wanted   countA10   countm~s   WANTED |
     |------------------------------------------------------------------------------------------------------|
  1. |   A10     B10                              A10B10          3        0          1          3        0 |
  2. |   A10                                         A10          4        1          1          4        1 |
  3. |           B10     C10                      B10C10          3        0          0          3        0 |
  4. |   A10     B10     C10                   A10B10C10          2        0          1          2        0 |
  5. |   A10                             D10      A10D10          3        0          1          3        0 |
     +------------------------------------------------------------------------------------------------------+

See also https://journals.sagepub.com/doi/pdf...867X0900900107

Comment

Trevor Andrew

Join Date: Apr 2016

Posts: 94
#8

22 Jan 2025, 08:46

Nick Cox thanks that's a great approach. Do you actually need the non-missing condition? If the concatenated string only equals to A10 then only one variable equals to A10 and the others are empty.
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35059
#9

22 Jan 2025, 09:07

You're right. It's sufficient to concatenate to a single A10 and other problems could be identified by looking at the concatenation.

Sometimes, substantive missings are coded as spaces, if only by accident, or to strings like "NA" or ".".

I first wrote concat() back in the day (Statalist 1999; then Stata Technical Bulletin 50) but it is entirely possible that I was just standing on the shoulders of some Statalist poster. notably Bill Gould, in the early years. All the archives for the early years have disappeared any way.
1 like
Comment
Trevor Andrew

Join Date: Apr 2016

Posts: 94
#10

22 Jan 2025, 09:09

Good point on NA or . or maybe even other codes for missing which complicates this problem.
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35059
#11

22 Jan 2025, 09:15

To spell it out, a loop over variables could be complicated by e.g.

trimming leading and trailing spaces

counting NAs or -999 or whatever else.

A true story (at least it's in Douglas Altman's text) is that a police authority in Britain was in the habit of entering 99 as age in reports on road accidents when the age was not known. Naturally it is not always easy or even possible to elicit ages from participants, depending on the accident. When averaged in with other ages, the results pointed to an alarming increase in older people being affected by road accidents in that authority, until the coding was uncovered.
2 likes
Comment

Announcement