Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Find Unique String in Many Variables

    I have 20 string variables (VAR_1-VAR_20) that show different codes. I need to create an indicator equals to one if one of these variables have a specific string (e.g., A10) and no other strings. I could only write a very tedious code that does that manually for each combination. Something like:

    Code:
    gen dummy=0
    replace dummy=1 if VAR_1=="A10" & VAR_2=="" & VAR_3=="" & ... & VAR_20==""
    replace dummy=1 if VAR_1==""  & VAR_2=="A10"  & ... & VAR_20==""
    .
    .
    .
    replace dummy=1 if VAR_1==""  & VAR_2=="" & ... &  & VAR_20=="A10"
    I have no idea how to code a nice loop. Can someone help please?

  • #2
    Your description is not very clear for me, I don't know whether the following code will solve your problem. And you should install egenmore from SSC.
    Code:
    ssc install egenmore
    egen dummy=rany(VAR_1 - VAR_20), cond(@=="A10")

    Comment


    • #3
      Chen Samulsion your dummy will equal to one if one of the VAR_1-VAR_20 is equal to A10, no matter what the other variables equal to, right? I need that the dummy equals to 1 if one of the variables equals to A10 and the other 19 are empty.

      Comment


      • #4
        Here is a data example with 5 variables:
        VAR_1 VAR_2 VAR_3 VAR_4 VAR_5 DUMMY
        A10 B10 0
        A10 1
        B10 C10 0
        A10 B10 C10 0
        A10 D10 0

        Comment


        • #5
          This is easy to do in long layout and very difficult with wide data.

          Code:
          * Example generated by -dataex-. For more info, type help dataex
          clear
          input str4(var_1 var_2 var_3) str1 var_4 str4 var_5
          "A10" "B10" ""    "" ""   
          "A10" ""    ""    "" ""   
          ""    "B10" "C10" "" ""   
          "A10" "B10" "C10" "" ""   
          "A10" ""    ""    "" "D10"
          end
          
          gen `c(obs_t)' obs_no = _n
          reshape long var_, i(obs_no)
          by obs_no (_j), sort: egen a10_count = total(var_ == "A10" )
          by obs_no (_j): egen mcount = total(missing(var_))
          by obs_no (_j): gen byte wanted = (a10_count == 1) & (mcount == _N - 1)
          reshape wide
          In the future, when showing data examples, please use the -dataex- command to do so, as I have here. If you are running version 18, 17, 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

          Comment


          • #6
            Code:
            qui des v1-v20, varlist
            gen byte one_A10 = `=subinstr("`r(varlist)'", char(32), "+",.)' == "A10"

            Comment


            • #7
              There are in fact other solutions not too bizarre.

              Note first that we are looking for precisely one instance of A10 and all other instances being missing.

              For that it is necessary and sufficient that the concatenation of variables yields "A10" exactly and the count of missings across observations is the number of variables MINUS 1, We can press two
              Code:
              egen
              functions into action there.

              A more general technique is just to loop over variables counting what we seek.

              Code:
              * Example generated by -dataex-. For more info, type help dataex
              clear
              input str4(var_1 var_2 var_3) str1 var_4 str4 var_5
              "A10" "B10" ""    "" ""  
              "A10" ""    ""    "" ""  
              ""    "B10" "C10" "" ""  
              "A10" "B10" "C10" "" ""  
              "A10" ""    ""    "" "D10"
              end
              
              * approach 1
              egen all = concat(var_?)
              
              egen nmissing = rowmiss(var_?)
              
              gen wanted = all == "A10" & nmissing == 4
              
              * approach 2
              gen countA10 = 0
              gen countmiss = 0
              
              foreach v of var var_? {
                  replace countA10 = countA10 + (`v' == "A10")
                  replace countmiss = countmiss + missing(`v')
              }
              
              gen WANTED = countA10 == 1 & countmiss == 4
              
              list
              
              
                   +------------------------------------------------------------------------------------------------------+
                   | var_1   var_2   var_3   var_4   var_5         all   nmissing   wanted   countA10   countm~s   WANTED |
                   |------------------------------------------------------------------------------------------------------|
                1. |   A10     B10                              A10B10          3        0          1          3        0 |
                2. |   A10                                         A10          4        1          1          4        1 |
                3. |           B10     C10                      B10C10          3        0          0          3        0 |
                4. |   A10     B10     C10                   A10B10C10          2        0          1          2        0 |
                5. |   A10                             D10      A10D10          3        0          1          3        0 |
                   +------------------------------------------------------------------------------------------------------+
              See also https://journals.sagepub.com/doi/pdf...867X0900900107

              Comment


              • #8
                Nick Cox thanks that's a great approach. Do you actually need the non-missing condition? If the concatenated string only equals to A10 then only one variable equals to A10 and the others are empty.

                Comment


                • #9
                  You're right. It's sufficient to concatenate to a single A10 and other problems could be identified by looking at the concatenation.

                  Sometimes, substantive missings are coded as spaces, if only by accident, or to strings like "NA" or ".".

                  I first wrote concat() back in the day (Statalist 1999; then Stata Technical Bulletin 50) but it is entirely possible that I was just standing on the shoulders of some Statalist poster. notably Bill Gould, in the early years. All the archives for the early years have disappeared any way.

                  Comment


                  • #10
                    Good point on NA or . or maybe even other codes for missing which complicates this problem.

                    Comment


                    • #11
                      To spell it out, a loop over variables could be complicated by e.g.

                      trimming leading and trailing spaces

                      counting NAs or -999 or whatever else.

                      A true story (at least it's in Douglas Altman's text) is that a police authority in Britain was in the habit of entering 99 as age in reports on road accidents when the age was not known. Naturally it is not always easy or even possible to elicit ages from participants, depending on the accident. When averaged in with other ages, the results pointed to an alarming increase in older people being affected by road accidents in that authority, until the coding was uncovered.

                      Comment

                      Working...
                      X