Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Shift values across columns (variables) to clean semi-regular parsing error

    I'm using Stata 17 MP on a hosted application (Windows environment), though I'm running test code on the example below in Stata 15 (Windows 10). I've consulted the manuals and other posts but I'm still stuck on something that seems fairly basic.

    I'm trying to clean an error that occurred when a text string variable containing commas (var1 = "Fred, Sarah, & Abdul Inc") was spread to 1-3 other variables in a parsing error (var1="Fred"; var2= "Sarah"; var3= "& Abdul Inc"), pushing that row out to three extra columns.

    My actual data set contains 1.5 million records and 28 variables. Here is a small example of the problem:

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input float visit str20(specialty location address condition v4)
    20 "Internal Medicine" "Main Clinic" "245 Oak St"      "Diabetes"         ""            
    21 "Pulmonology"       "Hospital B"  "8211 Peabody St" "COPD"             ""            
    22 "Dermatology"       "Cosmetic"    "Clinic 2"        "3588 King St"     "Dermabrasion"
    23 "Family Medicine"   "Clinic 2"    "3588 King St"    "Sinus Congestion" ""            
    24 "Dermatology"       "Cosmetic"    "Clinic 2"        "3588 King St"     "Scar Removal"
    end
    I tried two slightly different approaches, both involving tokenizing the variable list.

    APPROACH 1
    Code:
    local offset_vars address condition v4
    tokenize `offset_vars'
    forval i=1/2 {
        replace word `i' = "`i' + 1" if location == "Cosmetic"
        }
    replace specialty = "Dermatology_Cosmetic" if specialty == "Dermatology"
    The error received was "1 invalid name" and "if not found". OK, I get the first one. I then tried all manner of single parentheses to trigger an evaluation of (i + 1) without success.

    APPROACH 2
    Code:
    local offset_vars address condition v4
    set trace on
    forval i=1/2 {
        local var1 `: word `i' of `offset_vars''
        di var1
        local var2 `: word ``i' + 1' of `offset_vars''
        di var2
        replace `var1' = `var2' if location == "Cosmetic"
        }
    The error received was "var1 not found". I then tried slightly different syntax of ` and ' without success.

    Any suggestions? Kind regards and thank you in advance for your help.

  • #2
    I'm super-confused by this post. I don't see any relationship between the example data, other than having variables name address, condition, and v4 and the problems outlined in #1 and the rest of the post.

    What I can do is show you where your use of local macros is going wrong. Perhaps that will be enough to help you solve the problem you mention at the top.

    In Approach 2, after macro substitution takes place
    Code:
    replace word `i' = "`i' + 1" if location == "Cosmetic"
    
    // BECOMES
    
    replace word 1 = "1 + 1" if location == "Cosmetic"
    That final command, which Stata attempts to execute, fails because there is no variable named word. If there had been a variable named word, it still would have failed because of the stray 1 that falls between it and the equals sign. It's not clear to me what you actually want this command to look like. If you want the thing after -replace- to be -word1-, then the command with the macros needs to eliminate the space that's in the midst of -word `i'- and make it -word`i'-. Of course, at least in the example data you show, there is no variable named word1 either, so that, too will break. But perhaps your real data has such a variable. If so, it had better be a string variable to accommodate the value "1 + 1", which is not 2, nor "2". Perhaps you had some other intent in mind, but if so, it isn't clear to me.

    Turning to approach 1, to get the first display command to work, you need to bind var1 in macro quotes. So if you change it to:
    Code:
    di `var1'
    Stata will show you the value of address (or, on the second iteration, condition) in the first observation of your data. That's probably not your intention. More likely, you want Stata to actually show you "address" or "condition", that is, the names of the variables rather than their values. For that it has to be
    Code:
    di "`var1'"
    Now your -local var2- command will fail for a similar reason: it needs to be -di "`var2'"- if you want it to show the name of one of the variables in your offset_vars list. However, even with that correction, the code still breaks, because, in fact local macro var2 is empty. That's because its definition in the -local var2...- command is also incorrect. I assume your intent here is that you want local macro var2 to contain the name of the variable in offset list that immediately follows the variable named in local macro var1. Your instinct to increment i is correct, but the syntax is wrong. Here's a fully functional version of approach 1:

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input float visit str20(specialty location address condition v4)
    20 "Internal Medicine" "Main Clinic" "245 Oak St"      "Diabetes"         ""            
    21 "Pulmonology"       "Hospital B"  "8211 Peabody St" "COPD"             ""            
    22 "Dermatology"       "Cosmetic"    "Clinic 2"        "3588 King St"     "Dermabrasion"
    23 "Family Medicine"   "Clinic 2"    "3588 King St"    "Sinus Congestion" ""            
    24 "Dermatology"       "Cosmetic"    "Clinic 2"        "3588 King St"     "Scar Removal"
    end
    
    local offset_vars address condition v4
    forval i=1/2 {
        local var1 `: word `i' of `offset_vars''
        di "`var1'"
        local var2 `: word `=`i' + 1' of `offset_vars''
        di "`var2'"
        replace `var1' = `var2' if location == "Cosmetic"
        }
    I hope this helps.



    Comment


    • #3
      Thank you Clyde, for your help. I apologize if my post was unclear.

      Your code resolved the core issue. I made just one or two additional changes to your code to complete my solution.

      For the benefit of those reading, below, in words without the code, is what I'm trying to do. The first row of observations were correctly parsed. The second row of observations has the problem. I hope this presentation is more clear.

      ORIGINAL DATA
      specialty = "Internal Medicine"; location = "Main Clinic"; address = "245 Oak Street"; condition = "Diabetes"; v4 = ""
      specialty = "Dermatology; location = "Cosmetic"; address = "Clinic 2"; condition = "3588 King St"; v4 = "Dermabrasion"

      DESIRED RESULT
      specialty = "Internal Medicine"; location = "Main Clinic"; address = "245 Oak Street"; condition = "Diabetes"
      specialty = "Dermatology_Cosmetic; location = "Clinic 2"; address = "3588 King St"; condition = "Dermabrasion"

      In my actual data, fortunately this type of parsing errors occurs in only a handful of ways among the 1.5 million records and 28 variables. Together, in the original data, the various parsing errors led to the values being incorrectly offset by 1-3 columns (depending on the particular pattern).

      Here is Clyde's code, with a couple of minor additions to complete my solution.

      Code:
      replace specialty = "Dermatology_Cosmetic" if location =="Cosmetic"
      local offset_vars location address condition v4
      forval i=1/3 {
          local var1 `: word `i' of `offset_vars''
          di "`var1'"
          local var2 `: word `=`i' + 1' of `offset_vars''
          di "`var2'"
          replace `var1' = `var2' if specialty == "Dermatology_Cosmetic"
          }
      drop v4
      I am not all that comfortable using extended macros. I located Stata's explanation in a section under the manual for macros. Reading - or rereading -- through that section has also been helpful. All the same, a reference source with more explanation and examples would be even better, if you happen to know of one.

      Thank you again for your assistance!

      Comment

      Working...
      X