  • Determine if element is in local list

    Dear Statalist,

    I am trying to determine whether the elements of a variable are in a local list. I have two datasets. The first will be stored as a local list. Then I want to check whether the elements of the second dataset are in that list.

    Below is my approach:
    Code:
    /* read in the first data set */
    clear
    input str22 Name
    "A_1"         
    "A_2"     
    "A_3"
    "B_1"
    "B_2"
    "B_3"
    end
    /* save as local list */
    levelsof Name, local(varlist)
    di "`varlist'"
    /* read in the second data set */
    clear
    input str22 NameCheck
    "A_1"         
    "A_3"     
    "A_6"
    "A_4"
    "B_2"
    "B_4"
    end
    /* check whether element is in local list */
    gen byte var2_inlist = 0
       foreach m of local `varlist' {
       replace var2_inlist = 1 if !var2_inlist & (NameCheck == `m')
    }
    This produces the error message "_"A_1 invalid name". I don't understand what the problem is here.

    Thanks!

  • #2
    Your bug is a result of erroneous use of quotes -- instead of
    Code:
    di "`varlist'"
    You should have:

    Code:
    di `varlist'
    And instead of:
    Code:
    foreach m of local `varlist'
    You should have

    Code:
    foreach m of local varlist
    And instead of

    Code:
    (NameCheck == `m')
    You should have

    Code:
    (NameCheck == "`m'")
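    Putting the three corrections together, a minimal sketch of the repaired loop could look like the following. The shortened input lists are made up for illustration, and the sketch assumes that foreach strips the compound quotes that levelsof puts around each string level, so that `m' holds the bare value.

    ```stata
    * first data set: store its levels in a local
    clear
    input str22 Name
    "A_1"
    "A_2"
    "B_1"
    end
    levelsof Name, local(varlist)

    * second data set: the values to look up
    clear
    input str22 NameCheck
    "A_1"
    "A_6"
    "B_1"
    end

    * flag observations whose value equals some element of the list
    gen byte var2_inlist = 0
    foreach m of local varlist {
        replace var2_inlist = 1 if NameCheck == "`m'"
    }
    list, sep(0)
    ```

    Because the comparison is an exact string equality, this version is not affected by one element being a substring of another.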
    More broadly, here is a better solution which uses regex:
    Code:
    /* read in the first data set */
    clear
    input str22 Name
    "A_1"        
    "A_2"    
    "A_3"
    "B_1"
    "B_2"
    "B_3"
    end
    /* save as local list */
    levelsof Name, local(varlist) clean sep("|")
    /* read in the second data set */
    clear
    input str22 NameCheck
    "A_1"        
    "A_3"    
    "A_6"
    "A_4"
    "B_2"
    "B_4"
    end
    /* check whether element is in local list */
    gen Check = ustrregexm(NameCheck,"`varlist'")
    Last edited by Ali Atia; 06 Sep 2022, 16:25.


    • #3
      The solution offered in #2 is good. If you don't want to get involved with Unicode regular expression functions, it can also be done with the following as the final command:
      Code:
      gen Check = strpos(`"`varlist'"', NameCheck) > 0


      • #4
        Actually, both solutions above work for your example, but they depend on no element being a substring of another. If that does not hold in your actual data, both can produce false positives.
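
        To make the failure mode concrete, here is a small sketch with literal strings (the values are made up for illustration). The two approaches fail in opposite directions: strpos() looks for NameCheck inside the list string, while ustrregexm() looks for a list element inside NameCheck.

        ```stata
        * strpos(): "A_1" is not an element of the list, but it is a substring of "A_12"
        display strpos("A_12|A_2", "A_1") > 0
        * ustrregexm(): "A_12" is not an element, but it contains the element "A_1"
        display ustrregexm("A_12", "A_1|A_2")
        ```

        Both lines display 1, that is, a false positive in each case.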

        Here is a small modification to the example (just the first element of the first list), and small edits to the code to fix the problem:

        Code:
        clear
        input str22 Name
        "A_12"        
        "A_2"    
        "A_3"
        "B_1"
        "B_2"
        "B_3"
        end
        
        levelsof Name, local(varlist) clean sep("|")
        local varlist = "`varlist'|"
        
        clear
        input str22 NameCheck
        "A_1"        
        "A_3"    
        "A_6"
        "A_4"
        "B_2"
        "B_4"
        end
        
        gen Check1 = ustrregexm(NameCheck,"`varlist'\|")
        gen Check2 = strpos(`"`varlist'"', NameCheck+"|") > 0
        assert Check1 == Check2
        (You can use either Check1 or Check2; I created both only to show that they are equivalent.)

        An alternative method might be to use -merge- instead:

        Code:
        clear
        input str22 Name
        "A_12"        
        "A_2"    
        "A_3"
        "B_1"
        "B_2"
        "B_3"
        end
        
        tempfile list1
        save `list1'
        
        clear
        input str22 NameCheck
        "A_1"        
        "A_3"    
        "A_6"
        "A_4"
        "B_2"
        "B_4"
        end
        
        duplicates drop NameCheck, force
        rename NameCheck Name
        
        merge 1:m Name using `list1', keep(1 3)
        gen byte Check = (_merge == 3)
        Last edited by Hemanshu Kumar; 06 Sep 2022, 20:59.


        • #5
          Here is an amended regex solution which should be robust to the issue highlighted in #4.

          Code:
          /* read in the first data set */
          clear
          input str22 Name
          "A_1"        
          "A_2"    
          "A_3"
          "B_1"
          "B_2"
          "B_3"
          end
          /* save as local list */
          levelsof Name, local(varlist) clean sep("$|^")
          /* read in the second data set */
          clear
          input str22 NameCheck
          "A_1"        
          "A_3"    
          "A_6"
          "A_4"
          "B_2"
          "B_4"
          end
          /* check whether element is in local list */
          gen Check = ustrregexm(NameCheck,"^`varlist'$")
          Note: sent from mobile device -- may contain typos.


          • #6
            Here is yet another variation that attempts to avoid false positives.

            Code:
            /* read in the first data set */
            clear
            input str22 Name
            "A_1"         
            "A_2"     
            "A_3"
            "B_1"
            "B_2"
            "B_3"
            end
            /* save as local list */
            levelsof Name, local(varlist) clean
            di "`varlist'" 
            
            /* read in the second data set */
            clear
            input str22 NameCheck
            "A_1"   
            "A_12"  
            "A_3"     
            "A_6"
            "A_4"
            "B_2"
            "B_4"
            end
            
            /* check whether element is in local list */
            gen byte var2_inlist = 0
              
            quietly foreach m of local varlist {
               replace var2_inlist = 1 if " `m' " == (" " + NameCheck + " ") 
            }
            
            list, sep(0)
            
                 +---------------------+
                 | NameCh~k   var2_i~t |
                 |---------------------|
              1. |      A_1          1 |
              2. |     A_12          0 |
              3. |      A_3          1 |
              4. |      A_6          0 |
              5. |      A_4          0 |
              6. |      B_2          1 |
              7. |      B_4          0 |
                 +---------------------+
            There is more on this kind of trickery -- picked up here on Statalist -- in a Tip in press for Stata Journal 22(4); that fact is no use for anyone wanting a solution in 2022, but this thread may be found by searches thereafter.


            • #7
              Thank you all for your valuable suggestions.


              • #8
                Originally posted by Ali Atia
                Here is an amended regex solution which should be robust to the issue highlighted in #4.

                Code:
                /* read in the first data set */
                clear
                input str22 Name
                "A_1"
                "A_2"
                "A_3"
                "B_1"
                "B_2"
                "B_3"
                end
                /* save as local list */
                levelsof Name, local(varlist) clean sep("$|^")
                /* read in the second data set */
                clear
                input str22 NameCheck
                "A_1"
                "A_3"
                "A_6"
                "A_4"
                "B_2"
                "B_4"
                end
                /* check whether element is in local list */
                gen Check = ustrregexm(NameCheck,"^`varlist'$")
                Note: sent from mobile device -- may contain typos.
                May I know how the ^ and $ in "^`varlist'$" work?


                • #9
                  In a regular expression, these characters have special meanings:
                  • ^ denotes the start of a string
                  • $ denotes the end of a string
                  • | denotes the logical OR
                  So in the -levelsof- command, the local varlist that is produced contains A_1$|^A_2$|^A_3$|^B_1$|^B_2$|^B_3
                  Then in the -gen- command in the last line, the string becomes ^A_1$|^A_2$|^A_3$|^B_1$|^B_2$|^B_3$

                  So the interpretation of the ustrregexm function is: check if the string in NameCheck is of the form:
                  • <beginning of string>A_1<end of string> OR
                  • <beginning of string>A_2<end of string> OR
                  • and so on...
                  This means that if NameCheck, for example, contains just "A_", the function returns 0, but if it contains "A_1", it returns 1.
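
                  As a quick check of this interpretation, the anchored pattern can be tried on literal strings. This is only a sketch, with the pattern shortened to two alternatives for illustration:

                  ```stata
                  * exact match of one alternative: displays 1
                  display ustrregexm("A_1",  "^A_1$|^A_2$")
                  * incomplete string: displays 0
                  display ustrregexm("A_",   "^A_1$|^A_2$")
                  * string that merely contains an element: displays 0
                  display ustrregexm("A_12", "^A_1$|^A_2$")
                  ```

                  Since | has the lowest precedence, each alternative carries its own ^ and $, so only whole-string matches count.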
                  Last edited by Hemanshu Kumar; 09 Sep 2022, 03:04.
