Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Using substr to drop a variable conditionally

    Hello all,

    I am trying to drop a variable conditionally on it taking certain values in some of its observations. In the spirit of the prohibited:

    drop varname if substr(varname, a, b)=="some string value" (for a string variable).

    Searching the forum I found a way to do this via the following piece of code:

    foreach c of varlist _all {
    if substr(`c',1,3)=="Nov" | substr(`c',1,3)=="Oct" | substr(`c',1,3)=="Dec" | substr(`c',1,3)=="Aug" | substr(`c',1,3)=="Jun" | substr(`c',1,3)=="Jul" | substr(`c',1,3)=="Feb" | substr(`c',1,3)=="Mar" | substr(`c',1,3)=="Jan" | substr(`c',1,3)=="Sep" | substr(`c',1,3)=="Apr" | substr(`c',1,3)=="May" {
    drop `c'
    }
    }

    note1: i used all months to extinguish all the possible string values the variable in question can take. (if I had put only 1 or 2 it didn't work, no error, just didn't drop the variable in question)
    note2: also in the above variable there were spaces (as in one below), but these didn't create a problem when not specified.
    note 3: i am doing this for all possible variables as I need to loop this procedure over different files and the name of the variable changes but the string values it can take don't.

    The above code worked just fine. However when i tried, in the same spirit, to do the same for another variable (below), which takes 3 string values: " * " , "-" , " " . ( Its type is str78)

    Code:

    foreach c of varlist _all {
    if substr(`c',1,1)=="*" | substr(`c',1,1)=="-" {
    drop `c'
    }
    }

    I am returned the error 109: type mismatch
    In an expression, you attempted to combine a string and numeric
    subexpression in a logically impossible way. For instance, you
    attempted to subtract a string from a number or you attempted
    to take the substring of a number.

    Thus, I believe the reason stata would execute the first piece of code but not the second, may be that stata recognizes the charachter " * " as numeric rather than a string.
    Due to this, I tried without using substr:

    foreach c of varlist _all {
    if `c'=="*" | `c'=="-" {
    drop `c'
    }
    }

    And this runs without errors, but does not yield the desired result, ie does not drop the variable in question. (I believe because this works over observations rather than variables)

    Any assistance is greatly appreciated.

    Last edited by Adriano Mariani; 14 Apr 2020, 16:11.

  • #2
    There is considerable scope for confusion here.

    A test like

    Code:
    if substr(`c',1,3)=="Nov"
    can only ever test what is the first observation. So it is equivalent to

    Code:
    if substr(`c'[1],1,3)=="Nov"
    Sometimes that may by accident do what you want. Everything hinges on what is true of the first observation.

    If -- contrary to Stata good practice -- you are working spreadsheet style such that your entire first observation is actually metadata, then -- temporarily -- what you are doing could make sense. But as you don't show any kind of data example we can't be sure.

    https://www.stata.com/support/faqs/p...-if-qualifier/ says more.

    A test like

    Code:
    if substr(`c',1,1)=="*"
    will fail if and only if the variable so named is numeric. The failure has nothing to with whether the character within "" is in any sense numeric. The test would also fail with characters that are not numeric. You can't extract a substring of any length directly from a numeric variable.

    Backing up: Your stated problem is

    I am trying to drop a variable conditionally on it taking certain values in some of its observations
    Does "some" mean any? all? or something else? You can always do this -- so long as the variable concerned is string. --


    Code:
    count if inlist(substr(`c', 1, 3), "Jan", "Feb", "Mar", "Apr", "May", "Jun") | inlist(substr(`c;, 1, 3), "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")
    and then drop conditionally on the result of the count.

    Comment


    • #3
      Dear Nick,
      Thanks for the help.

      Both the variables are strings, this is why I have treated them as the same trying to extract the string value using substr in both cases. That is also why I was puzzled by why it worked for the first piece of code but not the second.
      The first string variable of interest can take only names of months, whilst the second string variable of interest take either: " * " or " - " or missing: " ".

      Quote:
      "
      Does "some" mean any? all? or something else? You can always do this -- so long as the variable concerned is string. --
      "

      Essentially, my aim is : if the string variable in question takes on either of the mentioned values at least once, drop it. (Thus responding to your question: 'any'. )
      (note in the first case I used all months as the month will change across forecasts and by including them all I am ensuring it will occur at least once for every dataset)

      Looking at the problem again, I noticed that indeed the method I used only works if you specify the first observation amongst the criteria, but as you mentioned this works by accident. The reason I can't do the same for the other string variable is that first observation is missing (but maybe won't be in other datasets), thus I need a more tractable method. For the first string, the first value will always be one of the months (so including them all ensures it drops the relevant variable), however this is not the case for the second string variable:

      In the spirit of your code above, I tried:

      foreach c of varlist _all {
      if inlist(substr(`c', 1, 1), "*", "-", " ") {
      drop `c'
      }
      }

      And this runs, ie returns no errors, but does not yield the desired result. Any suggestions?
      Last edited by Adriano Mariani; 15 Apr 2020, 03:47.

      Comment


      • #4
        Your last chunk of code makes the same mistake all over again! The if command and the if qualifier behave quite differently. Did you read the FAQ cited in #2?

        If all your variables are string -- why is that? -- then this code drops a variable if its value in any observation matches the criterion given.

        Code:
        foreach c of varlist _all {
            count if inlist(substr(`c', 1, 1), "*", "-", " ")
              
            if r(N) > 0  drop `c'
        }
        Last edited by Nick Cox; 15 Apr 2020, 04:06.

        Comment

        Working...
        X