  • Looping over all string variables in dataset

    I have a dataset with a large number of variables (500+) and I am just realizing now that for many (possibly all?) string variables, what look like normal missings actually have a 'hidden' space (" "). I know the basic solution is quite simple:
    Code:
    replace FL2A_w2 = subinstr(FL2A_w2, " ", "", .)
    But I'm wondering how best to handle this when I need to do it for such a large number of variables. I tried this:

    Code:
    foreach var of varlist _all {
    replace `var' = subinstr(`var', " ", "", .)
    }
    But with that, I get a type mismatch error, I assume due to the fact that some of my variables in my dataset are not strings.

    Any ideas on the best way to handle this? Thank you much!

  • #2
    Someone will point to ds. I will note that ds internally loops over all variables to find the string variables. Looping over those string variables, again, to change the values seems inefficient given that the basic logic behind this specific application of ds can be incorporated into the loop with two additional lines of code:

    Code:
    foreach var of varlist _all {
        capture confirm string variable `var'
        if (_rc == 7) continue
        replace `var' = subinstr(`var', " ", "", .)
    }



    • #3
      Thanks daniel klein. Just to make sure I understand, what is the if (_rc == 7) bit doing, exactly?



      • #4
        Given my late response, you might have figured out the answer to your question already. For others who read this, I will explain, in some detail, what the two additional lines of code do:

        Code:
        confirm string variable `var'
        does nothing if `var' is a string variable; exits with an error and return code 7 if `var' is a numeric variable; and exits with an error and another return code (e.g., 111 or 198) if there is another problem.

        Prefixing the code with

        Code:
        capture
        prevents Stata from exiting if an error occurs, and instead puts the respective error code into c(rc), which is also accessible as _rc. Note that capture sets c(rc) to 0 if no error occurs.

        The continue command in the second line is used inside loops; it tells Stata to skip the rest of the code for the current iteration and resume execution at the top of the loop. The if (_rc == 7) statement tells Stata to execute continue if and only if the return code from the capture command equals 7, which indicates a numeric variable to be skipped.
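
        To see these return codes in action, here is a minimal sketch. It assumes the auto dataset shipped with Stata (chosen only for illustration, not the data from the original question), and nosuchvar is a hypothetical name that does not exist in that dataset.

```stata
sysuse auto, clear

* make is a string variable: confirm succeeds, so _rc is 0
capture confirm string variable make
display _rc

* price is numeric: confirm fails with return code 7
capture confirm string variable price
display _rc

* nosuchvar (hypothetical name) does not exist: a different nonzero code
capture confirm string variable nosuchvar
display _rc
```

        Only the second case is the one the loop is meant to skip silently; the third is the kind of problem that testing for 7 specifically leaves visible.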


        There are many other ways to achieve (almost) the same thing. Most people would tend to write the second line as

        Code:
        if _rc continue
        making use of Stata's logic that 0 evaluates to false and non-zero evaluates to true. They might also point out that you do not even need two additional lines of code:

        Code:
        foreach var of varlist _all {
            capture replace `var' = subinstr(`var', " ", "", .)
        }
        might be sufficient.

        My issue with both solutions is that they tend to mask unexpected problems. Both solutions will appear to work perfectly well even if there is a (syntax) error that has nothing to do with the variables being string or numeric.
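
        A minimal sketch of that masking effect, again assuming the auto dataset purely for illustration. The function name subinstrr is a deliberate typo, not a real Stata function.

```stata
sysuse auto, clear

* Deliberate typo: subinstrr() does not exist, so every replace fails
foreach var of varlist _all {
    capture replace `var' = subinstrr(`var', " ", "", .)
}

* The loop finishes without any error message, yet no value was changed.
* Only inspecting _rc reveals that the last replace failed.
display _rc
list make in 1/3
```

        Because capture swallows every error, the typo looks exactly like a dataset that happens to contain only numeric variables.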

        Here is another approach that is arguably slightly easier to read, if you are willing to assume that the meaning of return code 0 is better known than the meaning of return code 7.

        Code:
        foreach var of varlist _all {
            capture confirm string variable `var'
            if (_rc == 0) replace `var' = subinstr(`var', " ", "", .)
        }
        And, finally, you might not want to invoke confirm but access the storage type directly as in

        Code:
        foreach var of varlist _all {
            if substr("`: type `var''", 1, 3) == "str" ///
                replace `var' = subinstr(`var', " ", "", .)
        }
        which is fine, except that I find the code hard to read.
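
        For readers unfamiliar with the extended macro function used there, a small sketch of what it returns, assuming the auto dataset for illustration. The local names t_make and t_price are my own.

```stata
sysuse auto, clear

* ": type" returns the storage type of a variable as text
local t_make  : type make
local t_price : type price

* make is stored as a string type (str18 in this dataset)
display "`t_make'"

* price is stored as a numeric type (int in this dataset)
display "`t_price'"

* Both fixed-length string types and strL begin with "str",
* so the three-character comparison covers every string type
display substr("strL", 1, 3) == "str"
```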

        By the way, it is probably totally fine to not overthink very simple problems if you are not getting fun out of this.
        Last edited by daniel klein; 16 Dec 2021, 01:56. Reason: formatting command names in typewriter font



        • #5
          I don't disagree with anything daniel klein said but I am here to verify the prophecy

          Someone will point to ds.
          and someone in this case is notionally the command author!

          Code:
          ds, has(type string)  
          
          foreach var in `r(varlist)' {    
              replace `var' = subinstr(`var', " ", "", .)
          }
          Whenever I do this, I am thinking about my time as programmer-user, not thinking about Stata's machine time or efficiency. Stata is going to be much faster at looping over a lot of variable names and noting which are string than I would be in writing code that traps numeric variables.

          Whenever is the key word. Whenever I am writing a program and not one-off do-files, I try to remember efficiency too.

          All that said, I rewrote and greatly extended ds as findname from the Stata Journal, so that gets a mention too.
