
  • A "better" way on how to create a binary variable based on criteria across multiple variables?

    Dear all,
    I have the following code:
    Code:
    clear
    set obs 1
    g A1="THIS430"
    g A2="THIS440"
    g A3="THIS450"
    g B="THIS99"
    g C="THIS230"
    g D="THIS530"
    
    gen dummy = strpos(A1, "THIS9") | strpos(A2, "THIS9") | strpos(A3, "THIS9")  | strpos(B, "THIS9")
    I have the variables A1, A2, A3, B, C, and D. I want to create a new variable called dummy that is 1 if any of the variables has values starting with "THIS9".
    What I have done above is one way to achieve that, but I would appreciate it if you could share a better way that checks a varlist at once instead of one variable at a time.
    Thank you in advance.

  • #2
    Well, I can't think of any way to write a single command that incorporates A1 A2 A3 B as a varlist. But you can certainly simplify the code with a loop:

    Code:
    gen byte dummy = 0
    foreach v of varlist A1 A2 A3 B {
    replace dummy = 1 if strpos(`v', "THIS9") == 1
    }
    Note: The code you wrote does not "create a new variable called dummy that is 1 if any of the variables has values starting with 'THIS9'". It creates a new variable that is 1 if any of those variables contains THIS9 as a substring (in any position, not just the first).

    The code I show here selects only those observations where the value of the variable begins with THIS9.
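
To make the distinction concrete, here is a minimal sketch. The variable names s, contains, and begins are made up for illustration and are not part of the original data. strpos() returns the position of the first match, or 0 if there is none, so a nonzero result means "contains" while a result of exactly 1 means "begins with".

```stata
clear
set obs 2
* hypothetical test strings: one begins with THIS9, one only contains it
generate str10 s = "THIS99" in 1
replace s = "XTHIS99" in 2

* nonzero position: THIS9 occurs somewhere in s
generate byte contains = strpos(s, "THIS9") > 0
* position exactly 1: s begins with THIS9
generate byte begins = strpos(s, "THIS9") == 1
list
```

Both observations are flagged by contains, but only the first is flagged by begins.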



    • #3
      Consider also

      Code:
      gen dummy = inlist(1, strpos(A1, "THIS9"), strpos(A2, "THIS9"), strpos(A3, "THIS9"), strpos(B, "THIS9"))



      • #4
        @Clyde Schechter
        Thank you very much for quick help.
        The loop is a great improvement. And thanks for pointing out the discrepancies between my description and my code.
        I can't remember exactly, but I thought I came across another thread in which Nick Cox had an excellent suggestion that incorporated the criterias in A1, A2, and so on.
        Nick Cox, sorry in advance if what I recall is wrong.



        • #5
          I was thinking along the same lines as Clyde, but included a BREAK, which could be useful if the actual variable list to be checked is much longer, or if the file is really large.

          Code:
          clear
          set obs 1
          g A1="THIS430"
          g A2="THIS440"
          g A3="THIS450"
          g B="THIS99"
          g C="THIS230"
          g D="THIS530"
          
          gen dummy = strpos(A1, "THIS9") | strpos(A2, "THIS9") | strpos(A3, "THIS9")  | strpos(B, "THIS9")
          
          generate byte dummy2 = 0
          foreach v of varlist A1-D {
           if dummy2 == 1 {
           continue, break
           }
           replace dummy2 = (strpos(`v', "THIS9") == 1)
          }
          list
          Output:
          Code:
               +---------------------------------------------------------------------------+
               |      A1        A2        A3        B         C         D   dummy   dummy2 |
               |---------------------------------------------------------------------------|
            1. | THIS430   THIS440   THIS450   THIS99   THIS230   THIS530       1        1 |
               +---------------------------------------------------------------------------+

          --
          Bruce Weaver
          Version: Stata/MP 18.5 (Windows)



          • #6
            Update:
            Thank you, Nick Cox, for the one-line code. That is great.
            The reason I wrote "better" is that I will have to check not just one criteria, i.e., "THIS9", but also other criterias such as "THIS530" for each variable. I then need to create a new variable called dummy that is equal to 1 if at least one variable in my varlist fulfills that condition (i.e. "THIS9" and "THIS530").



            • #7
              Thanks @Bruce Weaver. I shall try to put all your excellent advice to use now and get back.



              • #8
                Comments on various levels, pedantic or otherwise.

                1. Greek and English agree: one criterion, two or more criteria. "criterias" is wrong.

                2. On Abdul #4: I am happy to believe that I said something relevant and useful somewhere earlier, but I can't recall where either.

                3. On Bruce #5: that won't usually work as you intend. Regardless of what may happen in other software,

                Code:
                if dummy2 == 1 {    
                    continue, break
                }
                can only mean to Stata

                Code:
                if dummy2[1] == 1 {
                    continue, break
                }
                as is documented in https://www.stata.com/support/faqs/p...-if-qualifier/ (although the title of that FAQ is the answer, not the question most people ask!).
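
A small sketch of that difference, using a made-up variable flag (not part of the original data): the if command looks only at the first observation, while the if qualifier is evaluated for every observation.

```stata
clear
set obs 3
* hypothetical indicator: 1 in the first observation, 0 elsewhere
generate byte flag = (_n == 1)

* if command: flag is read as flag[1], so this branch runs
* even though two of the three observations hold 0
if flag == 1 {
    display "if command evaluated flag[1] = " flag[1]
}

* if qualifier: evaluated observation by observation
count if flag == 1
```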

                The point about breaking is presumably that all observations have been flagged with 1 so that you know you'll never change your mind. For that you'd need to (say)

                Code:
                count if dummy2 == 0
                and break if the result is 0 or

                Code:
                summarize dummy2, meanonly
                and break if the minimum is 1.
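
Put together, the loop with a break might look like the sketch below. It assumes the begins-with test from #2 and adds a second observation (an invented value, "THIS91") so that the check is not trivially about one row. The replace uses an if qualifier so that observations already flagged are not reset by later variables.

```stata
clear
set obs 2
generate str10 A1 = "THIS430"
generate str10 A2 = "THIS440"
generate str10 A3 = "THIS450"
generate str10 B  = "THIS99"
* invented second observation, for illustration only
replace A1 = "THIS91" in 2

generate byte dummy2 = 0
foreach v of varlist A1 A2 A3 B {
    * flag observations whose value begins with THIS9
    replace dummy2 = 1 if strpos(`v', "THIS9") == 1
    * stop early once every observation has been flagged
    summarize dummy2, meanonly
    if r(min) == 1 {
        continue, break
    }
}
list
```

The count variant would test r(N) == 0 after count if dummy2 == 0 instead. Whether the early exit saves time depends on the data, since summarize itself has to pass over the observations on each iteration.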



                • #9
                  Update2: Both the code by @Bruce Weaver in #5 and the one by @Nick Cox in #8 (with
                  Code:
                  summarize dummy2, meanonly
                  and breaking if the minimum is 1) lead to the same result. And they both seem to give me what I wanted. Thanks again for your help.
                  Last edited by Abdul Adam; 15 Nov 2017, 10:53. Reason: grammar correction



                  • #10
                    Several posts criss-crossing here. I just want to emphasise one outcome. For the problem as posed here, you don't need a loop. Even with multiple criteria you can always do things like this:


                    Code:
                    gen dummy = inlist(1, strpos(A1, "THIS9"), strpos(A2, "THIS9"), strpos(A3, "THIS9"), strpos(B, "THIS9"))
                    replace dummy = dummy & inlist(1, strpos(A1, "THIS530"), strpos(A2, "THIS530"), strpos(A3, "THIS530"), strpos(B, "THIS530"))
                    and (at some potential risk of loss of clarity) you could combine those statements into one.
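
If the list of variables or prefixes grows, the same logic can also be written with nested loops. This is only a sketch under assumptions: the local macro prefixes and the variables has1, has2, both, and either are hypothetical names, and checking C and D as well as A1 to B is my assumption, not something stated in the thread. It builds one indicator per prefix and leaves the choice between "and" and "or" to the final step.

```stata
clear
set obs 1
g A1="THIS430"
g A2="THIS440"
g A3="THIS450"
g B="THIS99"
g C="THIS230"
g D="THIS530"

* hypothetical list of prefixes to look for
local prefixes THIS9 THIS530
local i = 0
foreach p of local prefixes {
    local ++i
    * has`i' is 1 if any variable begins with the i-th prefix
    generate byte has`i' = 0
    foreach v of varlist A1 A2 A3 B C D {
        replace has`i' = 1 if strpos(`v', "`p'") == 1
    }
}

* combine as needed: every prefix found, or at least one found
generate byte both = has1 & has2
generate byte either = has1 | has2
list has1 has2 both either
```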



                    • #11
                      Nick, thanks for your 3rd point in #8. I was indeed expecting the code to check if dummy2==1 on every observation. And of course, it worked for the data sample that was given, because it had only one observation!

                      Cheers,
                      Bruce



                      • #12
                        Bruce: Indeed. As you know well, some other statistical software does work like that!
