  • Compact alternative to multiple if statements

    I often find myself writing a series of similar if statements, which in some other languages could be replaced by a switch-case statement. Here's an example of a switch-case statement in Python: https://jaxenter.com/implement-switc...on-138315.html

    Is there anything like this in Stata? The closest thing I've found is a nested cond statement formatted onto separate lines, as described in Kantor & Cox (2005): https://journals.sagepub.com/doi/pdf...867X0500500310

    Here's a series of if statements from some code I'm maintaining. It seems prolix, but maybe it's the best Stata can do?
    Code:
    generate AgeMin = 0
    replace AgeMin = 20 if AgeRange==1
    replace AgeMin = 35 if AgeRange==2
    replace AgeMin = 45 if AgeRange==3
    replace AgeMin = 55 if AgeRange==4
    replace AgeMin = 65 if AgeRange==5
    replace AgeMin = 75 if AgeRange==6
    
    generate AgeMax = 0
    replace AgeMax = 34 if AgeRange==1
    replace AgeMax = 44 if AgeRange==2
    replace AgeMax = 54 if AgeRange==3
    replace AgeMax = 64 if AgeRange==4
    replace AgeMax = 74 if AgeRange==5
    replace AgeMax = 76 if AgeRange==6
    Last edited by paulvonhippel; 19 Dec 2019, 11:27.

  • #2
    Code:
    gen AgeMin = 35 + 10*(AgeRange - 2)
    replace AgeMin = 20 if AgeMin == 25
    
    gen AgeMax = 34 + 10*(AgeRange - 1)
    replace AgeMax = 76 if AgeMax == 84



    • #3
      Deleted.



      • #4
        I think Paul's point is worth taking farther, to the general case, in which context I think there's an interesting community contribution to be made. To wit: Some of us are capable of easily writing a complex nested cond() statement, but I'm not one of them, so having a user-friendly way to do that would be good.

        Code:
        gen y = cond(x==1, 3, cond(x==2, 4, cond() .......  // hard to do right
        Could some form of "case" or "switch" control structure be implemented through an -egen- function? I'm thinking of a program that would, behind the scenes, construct a nested cond() statement for the user. I could imagine a pseudo syntax something like this:
        Code:
        egen y = case(expression) valuelist(numlist) resultlist(expression list)
        It would also be nice to extend the results option to take program calls instead, which might or might not assign anything to y.
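        A minimal sketch of what such a helper might look like, written as an ordinary program rather than an -egen- function. The name casegen and its options values(), results(), and default() are hypothetical, invented here for illustration only. The sketch handles numeric values and results, and builds the nested cond() from the last case inward so the user never counts parentheses.
        Code:
        * casegen: hypothetical helper, not an existing Stata or SSC command
        capture program drop casegen
        program define casegen
            version 14
            syntax newvarname =/exp , Values(numlist) Results(numlist) [Default(string)]

            * the two lists must line up one to one
            local nv : word count `values'
            local nr : word count `results'
            if `nv' != `nr' {
                display as error "values() and results() must have the same number of elements"
                exit 198
            }

            * result when no case matches; system missing unless default() is given
            if "`default'" == "" local default .

            * wrap the expression built so far in one more cond(), last case first
            local expr `default'
            forvalues i = `nv'(-1)1 {
                local v : word `i' of `values'
                local r : word `i' of `results'
                local expr cond((`exp') == `v', `r', `expr')
            }

            generate `typlist' `varlist' = `expr'
        end

        * usage, assuming a numeric variable AgeRange exists
        casegen AgeMin = AgeRange, values(1/6) results(20 35 45 55 65 75)
        Extending results() to accept expressions or program calls, as suggested above, would need a different list parser and is not attempted in this sketch.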



        • #5
          Mike Lacy That's an interesting idea. Why don't you repost it on the wish list for Stata 17?



          • #6
            Clyde Schechter I actually had the thought that one of us mere mortals might be able to implement this in < 100 lines or so.



            • #7
              From what I understand, the main reason for working with switch-case constructs is that they execute faster than if-else statements. If execution speed is the issue, then egen is probably not the best solution because egen is known to be pretty slow. Whether constructing nested cond() statements that are then executed has, in general, any speed gains over if-else, I cannot tell for sure but for (very) large datasets, that might be true.

              However, the concerns and solutions that have been stated here appear to be more about being able to type less (and maybe more readable) code. I would argue that the closest thing in Stata that resembles switch-case (at least syntax wise) for the problem in #1 is recode. It should be obvious that the initial

              Code:
              generate AgeMin = 0
              replace AgeMin = 20 if AgeRange==1
              replace AgeMin = 35 if AgeRange==2
              [...]
              replace AgeMin = 75 if AgeRange==6
              could be replaced by

              Code:
              recode AgeRange ///
                  (1 = 20)    ///
                  (2 = 35)    ///
                  [...]
                  (6 = 75) , generate(AgeMin)
              which is arguably both a little less to type and a little easier to read. You could even omit the parentheses.

              I would argue that nested cond() calls could be written in a similarly readable way, except perhaps for the closing parentheses

              Code:
              generate AgeMin =           ///
                  cond(AgeRange == 1, 20, ///
                  cond(AgeRange == 2, 35, ///
                  [...]
                  cond(AgeRange == 6, 75, .z))))))
              We might also gain a bit of speed with this, but we are back to typing a lot. Still, the code is a bit more readable than if-else statements. It is certainly easier to read than #2, which, in turn, is clever, minimal typing, and probably fast.

              Nick Cox (2012) pointed to another solution that is clever, fast, and probably close to switch-case in terms of the underlying principle. It involves medium typing

              Code:
              matrix LookUp = (20, 35, 45, 55, 65, 75)
              generate AgeMin = LookUp[1, AgeRange]
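              The same look-up idea could cover both bounds from #1 with a two-row matrix. This is a minimal sketch, assuming AgeRange holds only the integers 1 through 6 and that AgeMin and AgeMax do not exist yet; the matrix name Bounds is made up for illustration.
              Code:
              * row 1 holds the minimum ages, row 2 the maximum ages, one column per AgeRange value
              matrix Bounds = (20, 35, 45, 55, 65, 75 \ 34, 44, 54, 64, 74, 76)
              generate AgeMin = Bounds[1, AgeRange]
              generate AgeMax = Bounds[2, AgeRange]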
              Personally, I would probably go with recode here because I feel it has the best readability/typing ratio for the problem at hand.

              Best
              Daniel


              Cox, N. 2012. Speaking Stata: Matrices as look-up tables. The Stata Journal, 12(4), pp. 748--758.
              Last edited by daniel klein; 20 Dec 2019, 03:25.



              • #8
                This is a very interesting thread.

                No-one has yet said the dread word "intuitive", as in "Such and such syntax seems intuitive to me, so why can't Stata support it?" so I can't make my stock remark that it is hard for me to intuit your intuition easily, and so "intuitive" just seems a roundabout way for you to say "familiar" or "appealing". Except that I just did.

                Seriously, paulvonhippel asks a great question.

                1. Syntax like that he flags is present in many programming languages and if it existed in Stata it would, I guess, be used a lot, principally by those who have used it previously in other languages -- or (and this is key too) now use it in other languages alongside Stata. That's especially important to those who program a lot, but not mainly or exclusively in Stata, which is more than fine.

                2. I can't buy the example linked in #1 as an example of a compact alternative. If there is a sales pitch, it could start with being general and supporting rules that might be messy or complicated and being fairly easy to read and write. But a long series of cases with repeated break statements does not seem particularly appealing.

                3. The mention of the 2005 paper by David Kantor and myself allows the small personal history to be mentioned, pertinent because the psychology is, I suspect, really quite general. David wrote the first version and then I got involved. Previously I hadn't used cond() much and was fairly leery of it. But getting involved in an expository paper pushed me over a threshold so that I got more comfortable with it and started using it much more. So, what seems awkward and unfamiliar is something you will avoid, sure, but something worthwhile you work at will become easier, as every parent and teacher tends to emphasise. (recode conversely is great for those fluent in it, but I have to look up the syntax every time and it's faster for me to write longer code using other commands. I don't like the way it uses equal signs either, but no-one else need care. If I recall correctly it owes a lot to syntax in other environments, so that is part of my first point again.)

                4. Clyde Schechter showed in #2 that in this example you don't need to handle multiple rules and I guess the deleted post by Bjarte Aagnes said the same thing. Here is the place to give an alternative to Clyde's code, which was

                Code:
                gen AgeMin = 35 + 10*(AgeRange - 2)
                replace AgeMin = 20 if AgeMin == 25
                
                gen AgeMax = 34 + 10*(AgeRange - 1)
                replace AgeMax = 76 if AgeMax == 84
                The first could be

                Code:
                gen AgeMin = cond(AgeRange == 1, 20, 35 + 10*(AgeRange - 2))
                and the second could be

                Code:
                gen AgeMax = cond(AgeRange == 6, 76, 34 + 10*(AgeRange - 1))
                or even

                Code:
                gen AgeMax = cond(AgeRange < 6, 34 + 10*(AgeRange - 1), 76)
                I am not trying to sell any of these as better, beyond underlining that if compactness is the criterion, single statements score highest. But clarity is naturally important too.

                5. I have a love-hate relationship with egen. The main issue is shown by occasional requests I see elsewhere for the equivalent in R of egen. It is good that people who have used Stata found egen useful - even if they now regard themselves as "transitioning" to R. But there is no earthly reason to implement egen in anything else. There is no underlying core concept; it is just a convenient ragbag of stuff not yet implemented in other ways. For that and other reasons well explained by daniel klein I don't favour an egen solution here, providing yet another new syntax. But if someone works up a decent function that other people like, I will happily offer a home in egenmore (SSC).

                6. Here is yet another way to implement this kind of switching.

                Code:
                clear 
                set obs 1 
                gen AgeRange = 42 
                
                generate AgeMin = 0
                local Mins 20 35 45 55 65 75 
                forval Range = 1/6 { 
                    gettoken This Mins: Mins
                    replace AgeMin = `This' if AgeRange == `Range'
                }
                This way of looping over lists in parallel has been possible for a long time but various nice examples by Robert Picard underlined to me how helpful it could be.

                The first three lines are utterly optional. They are just a way of setting up a minimal sandbox so that what follows is legal code. Anyone wanting to play seriously would need a more elaborate sandbox. Note that such code isn't bulletproof and doesn't include a check that lists line up one to one.
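                A sketch of one way to add the missing check, assuming both lists are held in local macros, AgeRange exists, and AgeMin has not been created yet. The macro name Ranges is introduced here for illustration. The assert stops the run before any replace if the two lists differ in length.
                Code:
                local Ranges 1 2 3 4 5 6
                local Mins   20 35 45 55 65 75

                * stop with an error unless the lists have the same number of elements
                assert `: word count `Ranges'' == `: word count `Mins''

                generate AgeMin = 0
                foreach Range of local Ranges {
                    * peel the next minimum off the front of Mins
                    gettoken This Mins : Mins
                    replace AgeMin = `This' if AgeRange == `Range'
                }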



                • #9
                  I much like daniel klein's suggested layout for a nested cond()
                  Code:
                  generate AgeMin =           ///
                      cond(AgeRange == 1, 20, ///
                        [...]
                      cond(AgeRange == 6, 75, .z))))))
                  It's as clear as a common if-else structure and provides a nice way to keep track of the parentheses. I'd still have to count ")"s at the end.

                  A bit on the history of "switch"-type control structures: To my knowledge, they first appeared around 1975 as the "case" structure, in Pascal, a language designed for clarity rather than speed. So, that's the ground on which I'd say that an easy and, yes, concise way to write statements with a nested cond() structure would be nice, not because the speed would be a big deal. I'm also an advocate of -recode- (which, if I recall some experiments I tried, is slower than a series of -replace- statements).

                  All this being said: I always liked the "case" structure in Pascal, but truth be told, I rarely used it.

                  Now, as to what constitutes an easy to learn structure: I'd appeal by analogy to what some linguist friends of mine have told me, namely that an objective criterion for the difficulty of a sound or structure in a human language is the extent to which children or 2nd language learners have difficulty with it. I'd nominate nested uses of cond() for such a status <grin>



                  • #10
                    cond() is just a step away from nested if else statements. For that, precisely the same tip on multiple-line layout applies, and https://www.stata-journal.com/sjpdf....iclenum=pr0016 already provides discussion and examples of that and much else.



                    • #11
                      Some timings, on my laptop, running StataMP4, 100 reps of samples with 10^6 obs:
                      Code:
                      1:   11.76 cond( with ///
                      2:   11.74 cond( with ;
                      3:   12.03 ( A == 1 ) * 20  +  ( A == 2 ) * 35  ...
                      4:  109.75 qui recode ...
                      5:   13.11 LookUp[1, AgeRange]
                      6:    5.86 cond(AgeRange == 1, 20, 35 + 10*(AgeRange - 2))
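                      The script behind these numbers is not shown. A sketch of how one such timing might be set up with -timer- follows; the seed, the simulated AgeRange, and the loop structure are assumptions, not the original benchmark. It uses runiformint(), which needs Stata 14 or later.
                      Code:
                      clear
                      set seed 12345
                      set obs 1000000
                      * simulated categories 1 to 6, standing in for real data
                      generate byte AgeRange = runiformint(1, 6)

                      timer clear
                      forvalues rep = 1/100 {
                          timer on 6
                          generate byte AgeMin = cond(AgeRange == 1, 20, 35 + 10*(AgeRange - 2))
                          timer off 6
                          * drop outside the timed region so only generate is measured
                          drop AgeMin
                      }
                      * total seconds accumulated in timer 6 over the 100 reps
                      timer list 6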
                      Last edited by sladmin; 20 Dec 2019, 13:29. Reason: copy/paste error update



                      • #12
                        recode is, as should be expected, very slow. That has a lot to do with being implemented in ado (rather than C) and also with the generality of the command, e.g., changing multiple variables at once, recognizing keywords, creating appropriate value labels, etc.

                        I still feel the code is easier to understand than the fastest cond() solution. It would also be interesting to see how the original if-else structure does.

                        Best
                        Daniel



                        • #13
                          Code:
                                    II.     I.  
                          ------------------------------------------------------------------------------
                             1:     10.92   11.76 cond( with ///
                             2:     10.75   11.74 cond( with ;
                             3:     11.40   12.03 ( AgeRange == 1 ) * 20  +  ( AgeRange == 2 ) * 35  ...
                             4:    105.94  109.75 qui recode ...
                             5:     12.81   13.11 LookUp[1, AgeRange]
                             6:      5.38    5.86 cond(AgeRange == 1, 20, 35 + 10*(AgeRange - 2))
                            99:     40.10   ----  generate replace ... 
                          Code:
                          timer on 99
                          generate byte AgeMin = 0
                          replace AgeMin = 20 if AgeRange==1
                          replace AgeMin = 35 if AgeRange==2
                          replace AgeMin = 45 if AgeRange==3
                          replace AgeMin = 55 if AgeRange==4
                          replace AgeMin = 65 if AgeRange==5
                          replace AgeMin = 75 if AgeRange==6
                          timer off 99
                          Last edited by Bjarte Aagnes; 20 Dec 2019, 11:22.



                          • #14
                            In this particular example a conditional expression is much faster, as Bjarte Aagnes has shown, but one could argue that -recode- is easier to read, particularly using the layout shown by daniel klein in #7.

                            In the more general case where a series of calls to -cond()- is required, macros can provide syntactic sugar:

                            Code:
                            local case cond(AgeRange ==
                            gen AgeMin = ///
                                `case' 1, 20, ///
                                `case' 2, 35, ///
                                `case' 3, 45, ///
                                `case' 4, 55, ///
                                `case' 5, 65, ///
                                `case' 6, 75, .z))))))
                            As Mike Lacy notes, you still need to count parentheses at the end, but it is just one per case.



                            • #15
                              It seems from #11 & #13 that the difference in efficiency between approaches amounts to something like 50 milliseconds versus 1.1 seconds in a dataset of a million observations.

                              Based on that, it would seem that the take-home message from the thread is to favor clarity and familiarity (comfort), unless you have just enormous datasets—linear scale-up projection for a dataset of a hundred-million observations would still be less than two minutes for any approach illustrated.

                              I guess that if you have billions of observations or have to repeatedly do this operation in real time all day long on hundreds of millions of observations, then maybe go for an RDBMS and use the vendor's extension of SQL.

