  • Keeping strings within quotes

    Dear Stata users,

    I have a string variable as the following:

    ABC
    DEF "XYZ"
    GJK "LMN"
    PQR
    ...

    I need to keep only the text inside the double quotes where a value contains quotes, and the whole value where it does not, that is:

    ABC
    XYZ
    LMN
    PQR
    ...


    Thanks!

  • #2
    I used moss from SSC:


    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str9 abc
    "ABC"        
    `"DEF "XYZ""'
    `"GJK "LMN""'
    "PQR"        
    end
    
    moss abc , match("\"(.*)\"") regex
    gen wanted = cond(_match1 != "", _match1, abc)
    drop _* 
    l abc wanted
    
         +--------------------+
         |       abc   wanted |
         |--------------------|
      1. |       ABC      ABC |
      2. | DEF "XYZ"      XYZ |
      3. | GJK "LMN"      LMN |
      4. |       PQR      PQR |
         +--------------------+


    • #3
      Thank you Nick, but it says "_match1 not found"


      • #4
        Anyone?


        • #5
          My code works with your example. Sorry, but I can't guess what else is going wrong without a reproducible example.


          • #6
            Very weird. It does not work for me with your example
            [Screenshot: stata_example.PNG]


            • #7
              Ok so I spotted the error.


              It should be:
              Code:
              match("\(.*)\") regex
              Thank you for the input!
              Last edited by Andrea Cinque; 03 Mar 2020, 06:56.


              • #8
                Running the code in post #2 fails for me as well. Post #2 says
                Code:
                match("\"(.*)\"")
                Removing the (optional, per help moss) quotation marks surrounding the regular expression argument to the match() option, as shown in the picture in post #7, solves the problem.
                Code:
                match(\"(.*)\")
                Replacing the optional surrounding quotation marks with Stata's compound double quotes, because the regular expression itself contains double quotes, solves the problem as well.
                Code:
                match(`"\"(.*)\""')
                Let me add that post #6 would have been improved by copying the several lines of commands and output from Stata's Results window and pasting them into a CODE block, as you did with the code in post #7. Screenshots are not as helpful as people think they are. Copyable text is helpful.


                • #9
                  I am puzzled now. I could have sworn that #2 was the code I used, but I am on a different computer, so can't check for that reason alone. Other way round, #7 isn't right either. But this works


                  Code:
                  moss abc , match(\"(.*)\") regex


                  • #10
                    #7 makes the same correction that you made in #9. Alternatively, using compound double quotes around the regex pattern also works.

                    Code:
                    moss abc , match(`"\"(.*)\""') regex


                    • #11
                      For amusement, here's a solution to the original problem using the built-in -split-
                      Code:
                      split abc, gen(s) parse(`"""')  // double quote is the parse character
                      reshape long s, i(abc) j(seq)
                      replace s = strtrim(s)
                      drop if (s== "")
                      drop abc seq
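
                      A variant of the same idea, offered as a sketch rather than a tested replacement: when only the quoted piece is wanted for values that contain quotes, the second piece produced by -split- can be used directly, without reshaping. It assumes the pieces are named s1, s2, ... (from the gen(s) stub), that at least one value contains quotes so that s2 exists, and that each value holds at most one quoted substring.

                      ```stata
                      clear
                      input str9 abc
                      "ABC"
                      `"DEF "XYZ""'
                      `"GJK "LMN""'
                      "PQR"
                      end

                      * Split on the double-quote character; the text before the first
                      * quote lands in s1 and the text between the quotes, if any, in s2
                      split abc, gen(s) parse(`"""')

                      * Use the quoted piece when there is one, otherwise the whole value
                      gen wanted = cond(s2 != "", strtrim(s2), strtrim(abc))

                      * Remove the helper pieces (assumes no other variables named s + one character)
                      drop s?
                      list abc wanted
                      ```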


                      • #12
                        Nick is correct, and I misstated things in post #8, confusing the picture in post #6 with the code block in post #7.

                        The code in post #7 produces
                        Code:
                        . moss abc , match("\(.*)\") regex
                        regex option: no subexpression in match(\(.*)\)
                        r(198);
                        because the backslash in front of the left parenthesis removes its special meaning as the start of a regex subexpression.
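
                        The effect of the backslash can also be seen with the built-in regexm() and regexs() functions. The following is only a sketch on the thread's test data; the variable names literal and firstcaps are made up for the illustration.

                        ```stata
                        clear
                        input str9 abc
                        "ABC"
                        `"DEF "XYZ""'
                        `"GJK "LMN""'
                        "PQR"
                        end

                        * Escaped parentheses are literal characters: this pattern looks for
                        * text enclosed in actual ( ) and defines no subexpression
                        gen byte literal = regexm(abc, "\(.*\)")

                        * Unescaped parentheses delimit a subexpression, which regexs(1) returns
                        gen firstcaps = regexs(1) if regexm(abc, "([A-Z]+)")

                        list abc literal firstcaps
                        ```

                        None of the test values contains a parenthesis, so literal should be 0 throughout, while firstcaps should hold the first run of capital letters in each value.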

                        The code in the picture in post #6 is the same as the code in post #2, and produces
                        Code:
                        . moss abc , match("\"(.*)\"") regex
                        
                        . gen wanted = cond(_match1 != "", _match1, abc)
                        _match1 not found
                        r(111);
                        I believe the problem lies in using double quotes to surround a string that itself contains double quotes.
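
                        As a minimal sketch of the quoting issue on its own, independent of moss: compound double quotes let a string contain ordinary double quotes. The macro name pattern is hypothetical.

                        ```stata
                        * Compound double quotes open with `" and close with "', so the string
                        * inside them may itself contain ordinary double quotes
                        display `"DEF "XYZ""'

                        * The same quoting lets a pattern that contains quotes be stored in a
                        * local macro and passed on intact
                        local pattern `""(.*)""'
                        display `"`pattern'"'
                        ```

                        The second display should show the pattern with its inner double quotes preserved.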

                        Removing the (optional, per help moss) double quotes surrounding the argument to the match() option in the code from post #2 solves the problem.
                        Code:
                        . moss abc , match(\"(.*)\") regex
                        
                        . gen wanted = cond(_match1 != "", _match1, abc)
                        
                        . drop _* 
                        
                        . l abc wanted
                        
                             +--------------------+
                             |       abc   wanted |
                             |--------------------|
                          1. |       ABC      ABC |
                          2. | DEF "XYZ"      XYZ |
                          3. | GJK "LMN"      LMN |
                          4. |       PQR      PQR |
                             +--------------------+
                        Alternatively, replacing the optional surrounding double quotes in the code from post #2 with Stata's compound double quotes solves the problem as well.
                        Code:
                        . moss abc , match(`"\"(.*)\""') regex
                        
                        . gen wanted = cond(_match1 != "", _match1, abc)
                        
                        . drop _* 
                        
                        . l abc wanted
                        
                             +--------------------+
                             |       abc   wanted |
                             |--------------------|
                          1. |       ABC      ABC |
                          2. | DEF "XYZ"      XYZ |
                          3. | GJK "LMN"      LMN |
                          4. |       PQR      PQR |
                             +--------------------+


                        • #13
                          I don't mind split solutions....


                          • #14
                            Double quotes are special characters in regular expressions, which need to be escaped. Therefore, a direct solution using Stata's regexm is possible.

                            Code:
                            clear
                            input str9 abc
                            "ABC"      
                            `"DEF "XYZ""'
                            `"GJK "LMN""'
                            "PQR"      
                            end
                            
                            gen wanted= regexs(1) if regexm(abc,`"["\]([a-zA-Z]+)["\]"')
                            replace wanted= abc if missing(wanted)
                            Result:

                            Code:
                            . l
                            
                                 +--------------------+
                                 |       abc   wanted |
                                 |--------------------|
                              1. |       ABC      ABC |
                              2. | DEF "XYZ"      XYZ |
                              3. | GJK "LMN"      LMN |
                              4. |       PQR      PQR |
                                 +--------------------+


                            • #15
                              With the test data we've been using
                              Code:
                              gen wanted= regexs(1) if regexm(abc,`""([a-zA-Z]+)""')
                              replace wanted= abc if missing(wanted)
                              also results in
                              Code:
                              . list
                              
                                   +--------------------+
                                   |       abc   wanted |
                                   |--------------------|
                                1. |       ABC      ABC |
                                2. | DEF "XYZ"      XYZ |
                                3. | GJK "LMN"      LMN |
                                4. |       PQR      PQR |
                                   +--------------------+
                              suggesting that in fact double quotes have no syntactic significance within regular expressions generally. What they do is confuse Stata's parser into misreading them as part of the Stata syntax rather than part of the regular expression.

                              Also, I now realize that by using compound double quotes the second moss solution from post #12 can have the regular expression simplified similarly.
                              Code:
                              . moss abc , match(`""(.*)""') regex
                              
                              . gen wanted = cond(_match1 != "", _match1, abc)
                              
                              . drop _* 
                              
                              . l abc wanted
                              
                                   +--------------------+
                                   |       abc   wanted |
                                   |--------------------|
                                1. |       ABC      ABC |
                                2. | DEF "XYZ"      XYZ |
                                3. | GJK "LMN"      LMN |
                                4. |       PQR      PQR |
                                   +--------------------+
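
                              If the aim is to keep literal double quotes out of the command line altogether, one further option (a sketch, not something tried earlier in this thread) is to build the pattern from char(34), which returns the double-quote character. It assumes the same test data and at most one quoted substring per value.

                              ```stata
                              clear
                              input str9 abc
                              "ABC"
                              `"DEF "XYZ""'
                              `"GJK "LMN""'
                              "PQR"
                              end

                              * char(34) is the double-quote character, so the pattern is assembled
                              * without typing a literal quote inside a quoted string
                              gen wanted = regexs(1) if regexm(abc, char(34) + "(.*)" + char(34))

                              * Values without quotes are kept unchanged
                              replace wanted = abc if missing(wanted)
                              list abc wanted
                              ```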
