  • Combination of if-qualifiers in logit model / regression

    Dear Statalist,

    I am trying to combine several if qualifiers in a regression, but I receive error messages. I would be very thankful if anyone could help me improve my model.

    Unfortunately, I cannot share the data I am working on but I have created something similar using the exemplary auto data file. I apologize if the logit regression here does not make sense, I am trying to understand the main method.

    Using the auto datafile

    Code:
    clear
    sysuse auto
    gen exp = 1 if price > 5000, before(mpg)
    replace exp = 0 if price < 5000
    gen heavy = 1 if weight > 3000, before(length)
    replace heavy = 0 if weight < 3000
    gen gear_ratio_large = 1 if gear_ratio > 3, before(foreign)
    replace gear_ratio_large = 0 if gear_ratio < 3
    gen efficient = 1 if mpg <= 17, before(rep78)
    replace efficient = 0 if mpg > 17
    Code:
     logit exp if make == "AMC Pacer" i.foreign if make == "Buick Regal" i.heavy if make == "Chev. Monza" i.efficient i.gear_ratio_large if == 1
    In essence, I would only like to consider the cases where the make is AMC Pacer, only those i. foreign rows that have "Buick Regal" as their make, i.heavy if the make is Chev Monza, all rows of i.efficient and only those cases where the gear ratio is large.

    When running said logit model, Stata returns an invalid 'i.foreign' error

    Initially, I thought of creating new variables, such as exp & AMC Pacer etc., but that led to missing values within each new variable, which I would like to avoid.

    I am thankful for any advice on my model and how to proceed.

  • #2
    There is a lot going wrong here. First, you can only have one -if- qualifier in any command. If you have several conditions to apply, you must string them together with ands and ors and nots within that single -if- qualifier. In doing that, you must also remember that Stata understands boolean expressions; it does not understand English. Actually, even trying to read your -if-s as English, I find them confusing! I don't know what you mean by "i. foreign rows." There are no observations ("rows") that contain i.foreign. All of the observations contain some value for the variable foreign, which will be either 0 or 1. Even more confusing to me is "all rows of i.efficient." I'm going to guess that you mean all rows for which efficient == 1. Am I on the right track here?

    Anyway, putting together all of what you have written, as best I can guess their meanings, I suggest the following code:
    Code:
    gen byte include = ((make == "AMC Pacer") | (foreign & make == "Buick Regal") ///
        | (heavy & make == "Chev. Monza") | efficient) & gear_ratio_large == 1
        
    logit exp if include
    That said, your conditions are very restrictive. Only two observations in the auto.dta satisfy them. And, as it turns out, for both of those, exp == 1, so no logistic regression is possible. (There has to be variation in the outcome variable within the estimation sample in order to do a logistic regression.)
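Before fitting anything, it can help to check the estimation sample directly. The following is a minimal sketch, not part of the original post: it rebuilds the indicators from #1 (without the before() option, for brevity) and then counts and lists the observations that the single combined condition selects. If the outcome takes only one value among them, logit has nothing to estimate.

```stata
sysuse auto, clear

* Indicators as defined in #1 (before() omitted for brevity)
gen byte exp = 1 if price > 5000
replace exp = 0 if price < 5000
gen byte heavy = 1 if weight > 3000
replace heavy = 0 if weight < 3000
gen byte gear_ratio_large = 1 if gear_ratio > 3
replace gear_ratio_large = 0 if gear_ratio < 3
gen byte efficient = 1 if mpg <= 17
replace efficient = 0 if mpg > 17

* One boolean expression instead of several -if- qualifiers
gen byte include = ((make == "AMC Pacer") | (foreign & make == "Buick Regal") ///
    | (heavy & make == "Chev. Monza") | efficient) & gear_ratio_large == 1

* How many observations are selected, and does the outcome vary among them?
count if include
tabulate exp if include
list make price exp if include
```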

    I may well have misunderstood what you are trying to do. But even if I have, perhaps my suggested code will show you how to go about using boolean expressions to convey to Stata the meaning you actually intended. If not, please provide a clearer explanation when posting back.



    • #3
      your if statement is a mess.

      help if

      you need to be using & and | (which is or) to set it up. only 1 "if" is required.

      might look something like this:
      Code:
      logit exp if (make == "AMC Pacer") | (foreign==1 & make == "Buick Regal") | (heavy==1 & make=="Chev. Monza") | (efficient==1) | (gear_ratio_large==1)
      the if statement makes no sense generally (a Buick is not a foreign car), but I think you're just using this as an example.

      with dummies, you can usually ignore the ==1 part; Stata treats any nonzero value as true (note that this includes missing values).
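A small illustration of when dropping ==1 is safe, offered as a sketch rather than as part of the original post. The variable flag is hypothetical: it copies foreign and then blanks out three values to mimic a dummy with missing data. Stata evaluates any nonzero value as true, and missing is nonzero.

```stata
sysuse auto, clear

* With a clean 0/1 dummy both forms agree
count if foreign          // 22
count if foreign == 1     // 22

* Hypothetical dummy with three unknown values (rows 1-3 are domestic cars)
gen byte flag = foreign
replace flag = . in 1/3

count if flag             // 25: the 22 ones plus the 3 missing values
count if flag == 1        // 22
```

So the shorthand is fine for a dummy that is never missing; otherwise writing == 1 is the safer habit.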

      You can clean up the front end too:

      Code:
      gen exp = price > 5000, before(mpg)
      gen heavy = weight > 3000, before(length)
      gen gear_ratio_large = gear_ratio > 3, before(foreign)
      gen efficient = mpg <= 17, before(rep78)



      • #4
        Good advice by George Ford (in #3) that works in this special case, but in general you should be careful when using ">": Stata treats missing values as larger than any valid number, so they satisfy the condition.

        Here is a demonstration with three variants, of which only the last two are safe (unless you explicitly want to code missing values as 1):
        Code:
        sysuse auto
        
        tab1 rep78, mi            // note: 5 values of rep78 are missing
        
        gen rep78g3m = rep78 > 3
        tab2 rep78 rep78g3m, mi   // note: rep78g3m is valid (and 1) if rep78 is missing
        
        gen rep78g3_1  = cond(rep78 >= .,.,rep78 > 3) // note: rep78g3_1 is missing  if rep78 is missing
        tab2 rep78 rep78g3_1, mi  //
        
        recode rep78 (min/3=0) (3/max=1), gen(rep78g3_2)  // alternative to rep78g3_1
        tab2 rep78 rep78g3_2, first mi
        Result:
        Code:
        . tab1 rep78, mi            // note: 5 values of rep78 are missing
        
        -> tabulation of rep78  
        
             Repair |
        record 1978 |      Freq.     Percent        Cum.
        ------------+-----------------------------------
                  1 |          2        2.70        2.70
                  2 |          8       10.81       13.51
                  3 |         30       40.54       54.05
                  4 |         18       24.32       78.38
                  5 |         11       14.86       93.24
                  . |          5        6.76      100.00
        ------------+-----------------------------------
              Total |         74      100.00
        
        .
        . gen rep78g3m = rep78 > 3
        
        . tab2 rep78 rep78g3m, mi   // note: rep78g3m is valid (and 1) if rep78 is missing
        
        -> tabulation of rep78 by rep78g3m  
        
            Repair |
            record |       rep78g3m
              1978 |         0          1 |     Total
        -----------+----------------------+----------
                 1 |         2          0 |         2
                 2 |         8          0 |         8
                 3 |        30          0 |        30
                 4 |         0         18 |        18
                 5 |         0         11 |        11
                 . |         0          5 |         5
        -----------+----------------------+----------
             Total |        40         34 |        74
        
        .
        . gen rep78g3_1  = cond(rep78 >= .,.,rep78 > 3) // note: rep78g3_1 is missing if rep78 is missing
        (5 missing values generated)
        
        . tab2 rep78 rep78g3_1, mi  //
        
        -> tabulation of rep78 by rep78g3_1  
        
            Repair |
            record |            rep78g3_1
              1978 |         0          1          . |     Total
        -----------+---------------------------------+----------
                 1 |         2          0          0 |         2
                 2 |         8          0          0 |         8
                 3 |        30          0          0 |        30
                 4 |         0         18          0 |        18
                 5 |         0         11          0 |        11
                 . |         0          0          5 |         5
        -----------+---------------------------------+----------
             Total |        40         29          5 |        74
        
        .
        . recode rep78 (min/3=0) (3/max=1), gen(rep78g3_2)  // alternative to rep78g3_1
        (69 differences between rep78 and rep78g3_2)
        
        . tab2 rep78 rep78g3_2, first mi
        
        -> tabulation of rep78 by rep78g3_2  
        
            Repair |  RECODE of rep78 (Repair record
            record |              1978)
              1978 |         0          1          . |     Total
        -----------+---------------------------------+----------
                 1 |         2          0          0 |         2
                 2 |         8          0          0 |         8
                 3 |        30          0          0 |        30
                 4 |         0         18          0 |        18
                 5 |         0         11          0 |        11
                 . |         0          0          5 |         5
        -----------+---------------------------------+----------
             Total |        40         29          5 |        74
        See: https://www.stata.com/support/faqs/d...rue-and-false/, and Kantor & Cox (2005): Depending on conditions.
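Another common way to get the same result is to attach an if qualifier to generate, so that observations with a missing source value are never assigned. This is a sketch of that idiom; the variable names rep78g3_3 and rep78g3_4 are only illustrative.

```stata
sysuse auto, clear

* 0/1 where rep78 is known, missing where rep78 is missing
gen byte rep78g3_3 = rep78 > 3 if !missing(rep78)
tabulate rep78 rep78g3_3, missing

* Equivalent form: every numeric missing value is >= .
gen byte rep78g3_4 = (rep78 > 3) if rep78 < .
assert rep78g3_3 == rep78g3_4
```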



        • #5
          Good advice, Dirk.



          • #6
            Dear Clyde,

            Thank you very much for your reply and kind support.

            With i. foreign rows that have "Buick Regal" as their make, I mean only row 8 (in my data set, there are more observations of this kind) – I think you implemented that with (foreign & make == "Buick Regal").

            With all rows of i.efficient, I mean the whole column, not only if efficient is either 1 or 0.

            May I kindly ask why you have “or” qualifiers in the include statement instead of “and” qualifiers?

            Thank you very much in advance!



            • #7
              Thank you, George, for the code. Thank you very much, Dirk, for the note regarding missing values, which I must avoid in my regressions.



              • #8
                Users often write & for and when they really need | for or.

                This misunderstanding was discussed in detail in https://journals.sagepub.com/doi/pdf...6867X231162009



                • #9
                  Dear Nick, Thank you very much for the insightful article!



                  • #10
                    Sorry, but I am completely confused about what you want. You are using terms in ways that not only do not conform to standard Stata usage, but whose intended meaning I cannot discern.

                    With i. foreign rows that have "Buick Regal" as their make, I mean only row 8 (in my data set, there are more observations of this kind) – I think you implemented that with (foreign & make == "Buick Regal").
                    Yes, that is how I implemented it, as a guess about what you meant. But what does "i.foreign rows" mean? All of the observations ("rows") in the data have some value of foreign. As it happens, the only one that has make == "Buick Regal" is row 8, and in that one, the value of foreign is 0, so this condition, as I have (mis)interpreted it, describes no observations.

                    With all rows of i.efficient, I mean the whole column, not only if efficient is either 1 or 0.
                    The overall context here is that you are trying to write some condition that selects a subset of the data for inclusion in the regression. But if you mean "the whole column", then that imposes no condition at all. So what is the point of even mentioning it?

                    May I kindly ask why you have “or” qualifiers in the include statement instead of “and” qualifiers?
                    To Nick's response to this, I will only add, for the benefit of others following the thread who may not choose to read the article Nick referenced, that while we often in English think about "include an observation if some condition and if some other condition," the representation of this in symbolic logic is "include an observation if (some condition or other condition)." Stata's syntax follows that of symbolic logic, not spoken English. There is good reason for this: the meaning of "and" in logical terms differs depending on the context in which it occurs, and it is much harder to write parsers for context-dependent language than for context-independent language.

                    That said, the article Nick cited in #8 is really excellent and covers nearly all of the situations where people get incorrect Stata results by writing code as if it were spoken English. It is well worth reading in full.
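The and/or point is easy to check on auto.dta. A minimal sketch, not taken from the posts above: each make occurs exactly once in that file, so no single observation can be both cars at once. Joining the two conditions with & therefore selects nothing, while | selects both.

```stata
sysuse auto, clear

* English: "include the Pacer and the Regal". Logic: Pacer OR Regal.
count if make == "AMC Pacer" & make == "Buick Regal"   // 0
count if make == "AMC Pacer" | make == "Buick Regal"   // 2

* inlist() is a compact way to write a chain of "or" conditions
count if inlist(make, "AMC Pacer", "Buick Regal")      // 2
```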



                    • #11
                      Thank you very much for your patience and support, Clyde!

                      Perhaps I need to use a better example. I’m really sorry that I can’t use dataex on my real data or provide more detailed information about it.

                      Let’s imagine my experiment consists of participants’ choosing among different flavors of ice cream. In every round, there is a choice between chocolate, strawberry, and vanilla.
                      I measure how much of every scoop (if at all) is chosen. The maximum a participant can choose is 3 scoops. Allocations of zero, zero, 3 scoops are possible.

                      Independent variables are 1) treatment or control (e.g., high or low surrounding temperature): the participant gets assigned once at the beginning and stays in that group, 2) the price per scoop of all three different flavors, 3) the scoops chosen (e.g., if I measure how many strawberry scoops were chosen as a dependent variable, then I have chocolate and vanilla scoops as my independent variables → i.e., I am seeking to understand whether (and how much) chocolate and vanilla ice cream have any influence on choosing strawberry ice cream).

                      Now to the problem of the many if qualifiers: Every flavor has its own row per participant. Perhaps, to clarify: participant 1 has three rows (each with one flavor). So, it goes strawberry, vanilla, chocolate. Starting at the 4th row, the second participant also has three rows: strawberry, vanilla, chocolate. And so on with my roughly 255 participants. The allocations are listed in one column below each other.

                      I have created a quick dataex as an example, hopefully, this will clear things up.

                      Code:
                      * Example generated by -dataex-. For more info, type help dataex
                      clear
                      input int ID byte T_C str10 Ice_cream float Allocation str3 price
                      79 1 "Strawberry" 1 "1.1"
                      79 1 "Vanilla"    1 "1.2"
                      79 1 "Choc"       1 "1.1"
                      80 0 "Strawberry" 2 "1.5"
                      80 0 "Vanilla"    1 "1.3"
                      80 0 "Choc"       0 "1.4"
                      81 0 "Strawberry" 3 "1.3"
                      81 0 "Vanilla"    0 "1.2"
                      81 0 "Choc"       0 "1.0"
                      82 1 "Strawberry" 1 "1.2"
                      82 1 "Vanilla"    1 "1.3"
                      82 1 "Choc"       1 "1.1"
                      83 1 "Strawberry" 2 "2.1"
                      83 1 "Vanilla"    1 "1.1"
                      83 1 "Choc"       0 "1.0"
                      84 1 "Strawberry" 1 "1.8"
                      84 1 "Vanilla"    1 "1.4"
                      84 1 "Choc"       1 "1.3"
                      85 0 "Strawberry" 2 "1.6"
                      85 0 "Vanilla"    1 "1.2"
                      85 0 "Choc"       0 "1.1"
                      86 0 "Strawberry" 2 "1.9"
                      86 0 "Vanilla"    1 "1.3"
                      86 0 "Choc"       0 "1.1"
                      87 1 "Strawberry" 0 "1.5"
                      87 1 "Vanilla"    0 "1.3"
                      87 1 "Choc"       3 "1.1"
                      88 1 "Strawberry" 1 "1.9"
                      88 1 "Vanilla"    1 "1.5"
                      88 1 "Choc"       1 "1.8"
                      89 0 "Strawberry" 2 "1.3"
                      89 0 "Vanilla"    1 "1.2"
                      89 0 "Choc"       0 "1.1"
                      end

                      Back to the regression: I would like to understand the impact of vanilla and chocolate prices (not strawberry prices) if my dependent variable is strawberry ice cream. Therefore, I would use

                      Code:
                       logit strawberry c.price (if ice_cream vanilla OR choc) i.Allocation i.T_C
                      Likewise, I would like to use other regressions, such as the impact of choc allocation and strawberry price on vanilla allocation.

                      I would be most grateful for any advice!


                      Regarding your question: So what is the point of even mentioning it?
                      I mention the variable because I would like it as an independent variable in my regression.



                      • #12
                        OK, now I think I understand what you are trying to do. This is a surprising situation: I spend much of my time here on Statalist trying to persuade people to use long data layouts because there isn't much that can be done well with wide data. But in this case, your difficulties arise because your data are long when they need to be wide. Once we get the layout corrected, you will see that no if conditions are needed at all.

                        We also need to clear up one misunderstanding. Under the terms you have described, the total of the allocations to Vanilla, Choc, and Strawberry must always be three. So there is nothing to be learned by using the allocations to Vanilla and Choc as independent variables when the dependent variable is allocation to Strawberry, because the outcome of that regression is exactly determined by the study design: AllocationStrawberry = 3 - AllocationVanilla - AllocationChoc. The coefficients of everything else will be 0.

                        So I think what you need to do is this:
                        Code:
                        destring price, replace
                        reshape wide Allocation price, i(ID) j(Ice_cream) string
                        regress AllocationStrawberry i.T_C priceChoc priceVanilla
                        I have a few additional comments. Unless there is some deterministic relationship among the prices of the three flavors (which does not appear to be the case in the example data) I don't see why priceStrawberry shouldn't be included among the independent variables. Surely the price of Strawberry itself has some effect on the choice of allocation to Strawberry, no? Also, you might want to consider whether the effect of the treatment itself is dependent upon the prices of the three flavors. For example, it may be that in hot ambient temperature people are less price-sensitive in their allocation decisions than in cold. If that is the case, you would want to have interaction terms between treatment and the prices.

                        Added: It also isn't clear that a linear regression is the best model here. The outcome variable, AllocationStrawberry might be better treated as an ordinal variable or a 4-level categorical variable, analyzed with -ologit- or -mlogit-, respectively. Or treating the outcome as a binomial outcome with 3 trials seems promising.
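The points made in this post can be tried on the example data from #11. The following is a sketch under stated assumptions, not a recommended final model: it reshapes to wide, verifies that the three allocations always sum to 3, and then fits illustrative versions of the suggested models (own price added, treatment-by-price interactions, and a binomial model with 3 trials via -glm-). With only 11 participants the estimates themselves mean nothing; the code only shows the syntax.

```stata
* Example data copied from the -dataex- listing in #11
clear
input int ID byte T_C str10 Ice_cream float Allocation str3 price
79 1 "Strawberry" 1 "1.1"
79 1 "Vanilla"    1 "1.2"
79 1 "Choc"       1 "1.1"
80 0 "Strawberry" 2 "1.5"
80 0 "Vanilla"    1 "1.3"
80 0 "Choc"       0 "1.4"
81 0 "Strawberry" 3 "1.3"
81 0 "Vanilla"    0 "1.2"
81 0 "Choc"       0 "1.0"
82 1 "Strawberry" 1 "1.2"
82 1 "Vanilla"    1 "1.3"
82 1 "Choc"       1 "1.1"
83 1 "Strawberry" 2 "2.1"
83 1 "Vanilla"    1 "1.1"
83 1 "Choc"       0 "1.0"
84 1 "Strawberry" 1 "1.8"
84 1 "Vanilla"    1 "1.4"
84 1 "Choc"       1 "1.3"
85 0 "Strawberry" 2 "1.6"
85 0 "Vanilla"    1 "1.2"
85 0 "Choc"       0 "1.1"
86 0 "Strawberry" 2 "1.9"
86 0 "Vanilla"    1 "1.3"
86 0 "Choc"       0 "1.1"
87 1 "Strawberry" 0 "1.5"
87 1 "Vanilla"    0 "1.3"
87 1 "Choc"       3 "1.1"
88 1 "Strawberry" 1 "1.9"
88 1 "Vanilla"    1 "1.5"
88 1 "Choc"       1 "1.8"
89 0 "Strawberry" 2 "1.3"
89 0 "Vanilla"    1 "1.2"
89 0 "Choc"       0 "1.1"
end

destring price, replace
reshape wide Allocation price, i(ID) j(Ice_cream) string

* The design fixes the total at 3 scoops per participant
assert AllocationStrawberry + AllocationVanilla + AllocationChoc == 3

* Linear model with the own price included
regress AllocationStrawberry i.T_C priceChoc priceVanilla priceStrawberry

* Letting price sensitivity differ by treatment
regress AllocationStrawberry i.T_C##c.priceChoc i.T_C##c.priceVanilla ///
    i.T_C##c.priceStrawberry

* Outcome as a binomial count out of 3 trials
glm AllocationStrawberry i.T_C priceChoc priceVanilla priceStrawberry, ///
    family(binomial 3) link(logit)
```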
                        Last edited by Clyde Schechter; 17 Aug 2023, 09:13.



                        • #13
                          Dear Clyde,

                          Thank you very much for your comments; they have yielded some very insightful discussions regarding my actual data over the past few days.

                          May I kindly ask you another related question: Would your recommendation to reshape to the wide format change when I have multiple decisions by the same individual?

                          In another project, I am working on experimental data (where participants are given a portfolio of cash and have to invest it in different asset classes). There I have 4 treatments. So four different decisions. Previously, I had them in the long format (each treatment had its own row) and clustered on the individual (vce(cluster ID)). Would a regression in the wide format work there, too, assuming I am interested in similar "if conditions"?

                          Or is the wide format only useful when there is one row per participant (like here with the ice cream allocation example)?

                          Thank you very much for your help and support.



                          • #14
                            In the scenario with four different decisions, you would most likely want a long layout, with one observation per treatment per participant. In addition to using -vce(cluster ID)-, you might want to use a panel-data analysis that includes fixed or random effects at the ID level.

                            In your original scenario described in #11, you really had only one observation per participant, and to properly analyze the data you needed to have that information in the same observation with the prices of the alternatives. Getting all of the prices into the same observation is what necessitated going to the wide layout.

                            More generally, for analyses in which some outcome(s) is(are) modeled in terms of explanatory variables, all of the explanatory variables need to be in the same observation as the outcome with which they are associated. In a situation where the participant has repeated measurements, each of those occasions requires its own observation, including all of the explanatory variables as they were at the time of the repeated measurements. If the explanatory variables do not change from one occasion to the next, they must nevertheless be repeated in each observation. This is the long layout that is standard for analyzing repeated measurement data in Stata.
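The long layout for repeated decisions can be sketched as follows. Everything here is made up for illustration: the data are simulated, and the names share (portion invested in one asset class) and treatment are assumptions, not variables from the actual project. Each participant contributes four observations, one per treatment, and the models account for the repeated measurements at the ID level.

```stata
clear
set seed 12345

* 20 hypothetical participants, each with a person-specific effect u
set obs 20
gen int ID = _n
gen double u = rnormal()

* Long layout: one observation per treatment per participant
expand 4
bysort ID: gen byte treatment = _n
gen double share = 0.3 + 0.05*treatment + 0.1*u + rnormal(0, 0.1)

* Pooled regression with standard errors clustered on the participant
regress share i.treatment, vce(cluster ID)

* Panel-data alternatives with random or fixed effects at the ID level
xtset ID treatment
xtreg share i.treatment, re vce(cluster ID)
xtreg share i.treatment, fe vce(cluster ID)
```

Explanatory variables that do not change across the four decisions would simply be repeated in each of a participant's four observations.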



                            • #15
                              Dear Clyde,

                              Thank you very much for your support and guidance!

                              I appreciate the clarification between the long and the wide layout; it simplifies when to use what format!
