  • More efficient way of writing code for flags?

    I am creating a flag to identify a subset of the sample that has non-missing data for several variables

    This is my code:

    Code:
    gen flag=1 if sex!=. & age!=. & marstat!=. & ownership
    This code runs well but if I have more variables, is there a more efficient way to write the code so that I don't have to keep writing "var!=. & var2!=. & var3!=." ?
    I'm doing this to keep my analytic sample consistent when comparing models.
    Thanks in advance!


  • #2
    This code runs well
    I beg to differ. It creates a variable that is coded 1/. But that performs poorly in Stata. Better is to code the variable as 1/0. Also, it won't work with any string variables in the mix.

    I also don't know what to make of the -& ownership- at the end of that command line. As written, flag is set to 1 if sex, age, and marstat are all non-missing and ownership is anything (including missing value) other than 0. Did you mean -& ownership != .- ? That's what I assume in the code below.

    Code:
    egen flag = mcount(sex age marstat ownership)
    replace flag = !flag



    • #3
      Clyde Schechter - where did you get the egen function "mcount"? I don't have it and can't seem to find it



      • #4
        Sorry, that's wrong. The function should be -rowmiss()-. Brain glitch! Need more caffeine, and probably should have gone to bed earlier last night. (The reason for this particular brain glitch is that when I use the -egen, rowmiss()- function I usually name the variable mcount, so that association in my brain got in the way.)

        Code:
        egen flag = rowmiss(sex age marstat ownership)
        replace flag = !flag



        • #5
          I know this is a bit pedantic, but in programming the word efficient has a technical meaning: it means that the solution minimizes the computational requirements either in time (by minimizing the total number of operations required to perform the calculation), or in space (by minimizing the amount of RAM needed), or in both time and space. OP is looking for an elegant way to write the code. This is a laudable goal. Arguably elegance and readability are more important than efficiency in a high-level language like Stata's Ado, but the distinction is still important.



          • #6
            Clyde Schechter : you're right, the ownership should have also been "ownership !=."

            I wanted to understand your code better.
            Code:
            egen flag = rowmiss(sex age marstat ownership)
            replace flag = !flag
            Does this code mean the following?
            line 1 = This creates a variable named flag, and that variable would include any missing observations in sex, age, marstat and ownership
            line 2 = This replaces the flag variable so that it only includes those that are nonmissing in the variable flag. Is that correct?

            so if I create a model that would include nonmissing observations for the said varlist, would I do this?
            Code:
            reg ownership age sex marstat if flag==!
            With my original intention of creating flag=1, I usually do
            Code:
            reg ownership age sex marstat if flag==1
            Last edited by gi peters; 15 Jul 2022, 13:06.



            • #7
              Daniel Schaefer Thanks for explaining that. Yes, a more elegant way of writing the code is definitely what I am aiming for here.



              • #8
                OP is looking for an elegant way to write the code.
                While Daniel Schaefer is using the word "efficient" in the technically correct way, I don't think "elegance" is the issue here. Rather, I think that O.P. is using the word "efficient" with his/her own time spent writing the code in mind. For many Stata programs that involve relatively straightforward calculations and are run only a small number of times, the time taken to write the code can dwarf the total execution time.

                When I first started programming computers, back in 1962, computers were much slower than today's, and computer time was also very expensive. If I recall correctly, back then, an hour of time on an IBM 7090 cost about $900, and a programmer's weekly salary was probably around $100. In budgeting a project, you could almost ignore the cost of the programmer's labor. But today, you can own a computer that is both larger and faster than that for less than $900, and programmers do not come cheap.



                • #9
                  Re #6: the first line of the code in #6 calculates the value of flag as the number of variables among sex, age, marstat, and ownership that have a missing value.

                  The second line converts flag to its boolean negation. Remember that in Stata, with logical operators, any value other than 0 is true, and 0 is false. So if there were no missing values among the variables, flag would originally be 0, and then the second line switches it to 1. If there were some missing values among the variables, then flag is originally some number other than 0, and the second line switches it to 0.
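                  As a quick check of the negation rule just described, this minimal sketch displays the result of -!- applied to zero, to a nonzero number, and to a missing value. The three input values are chosen only for illustration.

                  ```stata
                  display !0    // 1: zero is false, so its negation is true
                  display !3    // 0: any nonzero value is true, so its negation is false
                  display !.    // 0: missing is also nonzero, hence true
                  ```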

                  To run some command using a sample that has no missing values for those four variables:

                  Code:
                  command if flag
                  does it. Remember, flag is a logical (boolean) variable here. And it is true if and only if there are no missing values in sex, age, marstat, or ownership. You can also write this as -if flag == 1-, but there is no need for the extra characters. -if flag- accomplishes the same thing more efficiently, both in terms of your own coding labor, and also in terms of program execution time.

                  I should add that this is the reason that it is bad practice in Stata to code a true/false variable as 1/missing. In Stata, anything other than zero, including a missing value, is true! So when you use 1/missing coding you cannot make natural use of Stata's logical operators, whereas with 1/0 coding you can.
                  Last edited by Clyde Schechter; 15 Jul 2022, 13:32.
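                  A minimal sketch of the difference between the two codings, on a made-up five-observation dataset. The data values and the names flag_miss, nmiss, and flag01 are assumptions for illustration, and the 1/. flag uses the corrected -ownership != .- condition.

                  ```stata
                  clear
                  input sex age marstat ownership
                  1 34 1 1
                  0 .  2 0
                  1 51 . 1
                  0 28 1 .
                  1 45 2 0
                  end

                  * 1/. coding: flag is 1 for complete rows and missing otherwise
                  gen byte flag_miss = 1 if sex != . & age != . & marstat != . & ownership != .

                  * 1/0 coding: count the missing values per row, then negate
                  egen nmiss = rowmiss(sex age marstat ownership)
                  gen byte flag01 = !nmiss

                  count if flag_miss       // all 5 observations: missing counts as true
                  count if flag_miss == 1  // the 2 complete observations
                  count if flag01          // the 2 complete observations
                  ```

                  With the 1/. coding the explicit -== 1- is required to get the intended subset, whereas the 1/0 flag can be used directly after -if-.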



                  • #10
                    Clyde Schechter Absolutely, and as any good programmer will tell you, programmer time is worth more than processor time. At least, that has been the case over the mere decade and a half that I've been writing code. I think some are tempted to trivialize the importance of elegance. The word certainly evokes the sense that it is just about esthetics, and I certainly agree that esthetics aren't terribly important. It is not important that your code is pretty as long as it works. When I say elegant, I mean code that is easy to write quickly with only a few lines and (just as important) is easy to read and understand later. Not only do I find idiomatic code (as in your solution above) esthetically pleasing, I find it an efficient time saver during the writing process. As an additional benefit, when I share my code with others, or when I come back and read it months and years later, it is easy to read. This is, incidentally, one of my favorite things about ado. The language is very elegant.

                    Of course, "elegant" doesn't have a technical meaning as far as I know, so I suppose the semantics are open to debate.
                    Last edited by Daniel Schaefer; 15 Jul 2022, 13:41.



                    • #11
                      The function -missing()-, which accepts multiple arguments, would also do the job here.

                      Code:
                      gen flag = 1 if !missing(sex, age, marstat, ownership)
                      or if a dummy variable is desired

                      Code:
                      gen flag01= !missing(sex, age, marstat, ownership)
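                      A small sketch of how -missing()- treats a mix of numeric and string arguments, on made-up data. The string variable region and the flag name ok are hypothetical and not part of the original question.

                      ```stata
                      clear
                      input sex age str10 region
                      1 34 "north"
                      0 .a "south"
                      1 51 ""
                      end

                      * missing() is true if any argument is missing: a numeric system
                      * missing (.), an extended missing (.a to .z), or an empty string
                      gen byte ok = !missing(sex, age, region)
                      list
                      ```

                      Here ok is 1 only for the first observation: the second has an extended missing value in age and the third has an empty string in region.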



                      • #12
                        Re #10: Couldn't agree more. It was the elegance of Stata's programming language that made me "fall in love at first sight" when I first came across it in 1994.



                        • #13
                          Note that -mark- and -markout- exist for this kind of problem. They are especially useful if you want to update the flag variable to include additional variables in a later step. However, -markout- is dangerous: if you forget to specify your flag variable immediately after -markout-, the first variable in the list is treated as the flag, and you might accidentally change its values:
                          Code:
                          mark flag
                          markout flag ownership        // step 1
                          markout flag sex age marstat  // step 2
                          By the way, in #6 the code
                          Code:
                          reg ownership age sex marstat if flag==!
                          is invalid syntax.
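                          To connect -mark- and -markout- with the original goal of a consistent analytic sample across models, here is a hedged sketch using the auto dataset shipped with Stata. The marker name touse and the choice of variables are assumptions for illustration only.

                          ```stata
                          sysuse auto, clear

                          * start with a marker equal to 1 for every observation
                          mark touse

                          * set the marker to 0 wherever any listed variable is missing
                          markout touse price mpg weight rep78

                          * both models are now estimated on the same observations
                          regress price mpg if touse
                          regress price mpg weight rep78 if touse
                          ```

                          Without the -if touse- qualifier, the first model would also use the observations that have a missing rep78, so the two samples would differ.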



                          • #14
                            Once upon a time -- maybe it was last summer, maybe two summers ago, or maybe three summers ago -- I had summer fun writing -egen- functions. The best way to learn how to write -egen- functions is to read the -egen- functions ancient Stata programmers have written before you, and so I did.

                            In #2 Clyde wrote "It creates a variable that is coded 1/. But that performs poorly in Stata. Better is to code the variable as 1/0."

                            What Clyde probably means is this:
                            If you code your flag as 1/., then -if flag- is true for every observation, including those where flag is missing, and that will stab you in the back because it is not what we expect. This is because Stata treats everything different from 0 as true, and missing (.) is different from 0, hence true.
                            If you code your flag as 1/0, then -if flag- is well behaved and gives the results we expect.

                            On the other hand, I noticed that in ancient -egen- functions many Stata programmers (or maybe the same programmer, if he wrote many of the -egen- functions) preferred to define flags as 1/. variables. The advantage of doing this, which I discovered in the process of writing my own -egen- functions, lies in how Stata sorts.
                            If you have a flag variable coded 1/0, then -sort flag- will put the values that you are not interested in at the beginning of your groups, and the values you are interested in towards the end.
                            If you have a flag variable coded 1/., then -sort flag- will push the values you are not interested in to the end of your groups, because missing values sort last.

                            So in the end I opted for 1/. flags in my -egen- functions, as in this context the second consideration outweighed the first.
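                            The sorting behavior described above can be sketched on a made-up four-observation dataset. The names id, flag01, and flagmiss are assumptions for illustration.

                            ```stata
                            clear
                            input id flag01
                            1 0
                            2 1
                            3 0
                            4 1
                            end

                            * the same flag in 1/. coding
                            gen byte flagmiss = 1 if flag01 == 1

                            sort flag01 id
                            list    // flagged observations (id 2 and 4) come last

                            sort flagmiss id
                            list    // flagged observations (id 2 and 4) come first
                            ```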

                            Related to #13 by Dirk, I did not see -mark- and -markout- used much in the ancient -egen- functions. I myself never understood what advantage -mark- gives me when I can do the same with basic statements. -mark- is neither shorter nor easier; in fact I find it a pain because I need to read the manual every time I use it.


