  • Generate mean of a variable for each level of another variable

    Hi there,
    I want to create several variables that store the mean of other "mother" variables (trunk and displacement) for different values of an index variable (rep78). Then I want to estimate the difference between the means, also for the different levels of the index variable. I used the following code:
    Code:
    sysuse auto, clear
    drop if missing(rep78)
    levelsof rep78,local(levels)
    foreach l of local levels { 
    summarize trunk displacement if rep78 == `l'
    egen disp_`l'= mean(displacement) if rep78 == `l'
    egen trunk_`l'= mean(trunk) if rep78 == `l'
    gen dif_`l'= disp_`l' - trunk_`l'
    di dif_`l'
    }


    As you can see, the output displays the summaries of trunk and displacement for the different values of rep78. But it only displays the difference in means for rep78 == 3. Why?

  • #2
    I don't actually get what you want to do, so I can't tell you how to fix it, but I can explain why Stata is doing what it's doing.

    The dif_* are all variables: you created them with the -gen- command by computing disp_1 - trunk_1, disp_2 - trunk_2, etc. Now if you look at how disp_* and trunk_* were created, each disp_`l' and trunk_`l' only gets a non-missing value in those observations for which rep78 == `l'. That's what the -if- clause in your -egen- statements does. So, for example, disp_1 has missing values in any observation with rep78 != 1. Now, since dif_`l' is the difference between disp_`l' and trunk_`l', it, too, takes on missing values except when rep78 == `l'. So, for example, dif_1 will have only missing values for observations with rep78 != 1.

    Now let's look at your -display- command. -display- is not designed to display the entire list of values a variable takes on in the data set. It is designed to display single numbers or strings. For better or for worse, Stata adopts the convention that if you tell it to -display- a variable, it will show you the value of that variable in the first observation in the data set. As it happens, the first observation in the data set at this point has rep78 == 3. You can verify that yourself in the browser, or by running -display rep78[1]-, or, for that matter, running -display rep78-. Since rep78 == 3 in that observation, the only dif_* variable that will be non-missing there is dif_3. So that is why you are getting what you are getting.
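
    A minimal sketch of this behaviour, assuming the auto data prepared as in the original post, contrasts -display- (one value, taken from observation 1) with -list- (one row per observation):

    ```stata
    sysuse auto, clear
    drop if missing(rep78)

    * A bare variable name in -display- is read as its value in observation 1,
    * so these two commands print the same number
    display rep78
    display rep78[1]

    * -list- shows the variable observation by observation
    list make rep78 in 1/5
    ```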

    As I don't know what it actually is that you want, I can't advise what you should do. But I think there are two aspects of your code that are confusing and it may be that you need to fix one or both of these:

    1. If you wanted to generate non-missing values of disp_*, trunk_*, and dif_* in all observations, not just those with -rep78 == `l'-, then the correct code for the egen statements would be:

    Code:
    egen disp_`l' = mean(cond(rep78 == `l', displacement, .))
    egen trunk_`l' = mean(cond(rep78 == `l', trunk, .))
    gen dif_`l' = disp_`l' - trunk_`l'
    2. If you want to display all of the values of each dif_*, not just the one in the first observation, the command you need is -list-, not -display-.

    To be honest, neither of these things makes much sense to me anyway, but they might be close to what you want in some way.

    Here's something that makes more sense to me and might be related to what you want:

    Code:
    sysuse auto, clear
    drop if missing(rep78)
    by rep78, sort: egen mean_disp = mean(displacement)
    by rep78: egen mean_trunk = mean(trunk) 
    gen mean_dif = mean_disp - mean_trunk
    by rep78: list rep78 mean_dif if _n == 1
    Hope this helps you figure it out.



    • #3
      When you use a variable name with a command like display, which expects a single value, Stata uses the value from the first observation of the dataset.

      However, the only dif_`l' value that will be nonmissing in your first observation is the one corresponding to the value of rep78 in your first observation.

      Try replacing the display command in your next-to-final line of code with
      Code:
      egen ddif_`l' = min(dif_`l')
      display dif_`l' " " ddif_`l'
      where egen will place in every observation of ddif_`l' the minimum (that is, the common non-missing value) of dif_`l'.

      Added in edit: crossed with Clyde's more elegant solution. In the spirit of trying to give you something that might be closer to what you need, as Clyde did, consider the following.
      Code:
      sysuse auto, clear
      drop if missing(rep78)
      collapse (mean) trunk displacement, by(rep78)
      generate diff = displacement - trunk
      format trunk displacement diff %9.3f
      list, abbreviate(12)
      Code:
      . list, abbreviate(12)
      
           +-----------------------------------------+
           | rep78    trunk   displacement      diff |
           |-----------------------------------------|
        1. |     1    8.500        191.000   182.500 |
        2. |     2   14.625        242.250   227.625 |
        3. |     3   15.267        230.033   214.767 |
        4. |     4   13.500        178.833   165.333 |
        5. |     5   11.455        111.091    99.636 |
           +-----------------------------------------+
      Last edited by William Lisowski; 17 Sep 2018, 20:27.



      • #4
        Thank you so much, Clyde and William, for your comprehensive explanations. They give me a better understanding of how to work in Stata. -- I'm sorry, Clyde, for my ambiguous question; what William posted in the edit is what I wanted to get. -- Gracias, Valdemar.



        • #5
          Here are two more small tricks in addition to those from Clyde Schechter and William Lisowski.

          First, the difference between means is just the mean difference and can be calculated directly because egen, mean() will work on expressions, which need not be single variable names.

          Second, tabdisp makes a tabulation of distinct values really easy.


          Code:
          . sysuse auto, clear
          (1978 Automobile Data)
          
          . egen wanted = mean(displacement - trunk), by(rep78) 
          
          . tabdisp rep78, c(wanted) format(%4.1f) 
          
          ----------------------
          Repair    |
          Record    |
          1978      |     wanted
          ----------+-----------
                  1 |      182.5
                  2 |      227.6
                  3 |      214.8
                  4 |      165.3
                  5 |       99.6
                  . |      176.2
          ----------------------



          • #6
            Valdemar, note that the results from the code provided in #5 by Nick Cox will be different from Clyde's and William's if you have missing values on either of the two differencing variables (displacement & trunk). [I've been bitten by a similar issue in my own work in the past.]
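
            A minimal sketch of how the two approaches can diverge. The shipped auto data have no missing values in trunk or displacement, so one trunk value is blanked out here purely for illustration; the variable names dif_of_means and mean_of_dif are made up for this example.

            ```stata
            sysuse auto, clear
            drop if missing(rep78)

            * Hypothetical missing value, introduced only for illustration
            * (observation 1 has rep78 == 3 at this point)
            replace trunk = . in 1

            * Approach A: difference of two separately computed group means.
            * mean_disp uses every car in the group; mean_trunk skips the missing one.
            by rep78, sort: egen mean_disp = mean(displacement)
            by rep78: egen mean_trunk = mean(trunk)
            generate dif_of_means = mean_disp - mean_trunk

            * Approach B: group mean of the observation-level difference.
            * A car with either value missing drops out of the mean entirely.
            egen mean_of_dif = mean(displacement - trunk), by(rep78)

            * The two columns should agree except in the group with the missing value
            tabdisp rep78, c(dif_of_means mean_of_dif) format(%9.3f)
            ```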
            Stata/MP 14.1 (64-bit x86-64)
            Revision 19 May 2016
            Win 8.1



            • #7
              Carole J. Wilson raises an excellent point. Indeed the difference between variables will be non-missing if and only if both values are, so the egen code scores on that point. That leads to a more detailed suggestion:


              Code:
              sysuse auto, clear
              egen difference = mean(displacement - trunk), by(rep78) 
              egen mean1 = mean(cond(difference < ., displacement, .)), by(rep78) 
              egen mean2 = mean(cond(difference < ., trunk, .)), by(rep78) 
              
              tabdisp rep78, c(mean1 mean2 difference) format(%3.2f) 
              
              ----------------------------------------------
              Repair    |
              Record    |
              1978      |      mean1       mean2  difference
              ----------+-----------------------------------
                      1 |     191.00        8.50      182.50
                      2 |     242.25       14.63      227.63
                      3 |     230.03       15.27      214.77
                      4 |     178.83       13.50      165.33
                      5 |     111.09       11.45       99.64
                      . |     187.60       11.40      176.20
              ----------------------------------------------



              • #8
                Thank you, Carole J. Wilson & Nick. I see the nuance and will be aware of it.



                • #9
                  Originally posted by Nick Cox
                  Carole J. Wilson raises an excellent point. Indeed the difference between variables will be non-missing if and only if both values are, so the egen code scores on that point. That leads to a more detailed suggestion:


                  Code:
                  sysuse auto, clear
                  egen difference = mean(displacement - trunk), by(rep78)
                  egen mean1 = mean(cond(difference < ., displacement, .)), by(rep78)
                  egen mean2 = mean(cond(difference < ., trunk, .)), by(rep78)
                  
                  tabdisp rep78, c(mean1 mean2 difference) format(%3.2f)
                  
                  ----------------------------------------------
                  Repair    |
                  Record    |
                  1978      |      mean1       mean2  difference
                  ----------+-----------------------------------
                          1 |     191.00        8.50      182.50
                          2 |     242.25       14.63      227.63
                          3 |     230.03       15.27      214.77
                          4 |     178.83       13.50      165.33
                          5 |     111.09       11.45       99.64
                          . |     187.60       11.40      176.20
                  ----------------------------------------------
                  Thank you for this nice solution!
                  Could you please explain the meaning of
                  Code:
                  cond
                  and the dots and commas in
                  Code:
                  ., displacement, .
                  Also, is it possible to add 95% CIs to this calculation?



                  • #10
                    -cond()- is a Stata function. It takes three arguments. -cond(expression1, expression2, expression3)- first tests whether expression1 evaluates to true (which means anything other than 0) or false (that is, 0). If true, it returns expression2; if false it returns expression3. The . characters are simply Stata's missing value. The commas are standard commas separating arguments.
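
                    A few small illustrations of -cond()-, offered as a sketch rather than anything from the posts above. The variable name disp_if_3 is made up for the example. The last -display- also shows why a test like difference < . works as a "not missing" check: Stata sorts missing values above every number, so x < . is true whenever x is a nonmissing number.

                    ```stata
                    * First argument true (nonzero): the second argument is returned
                    display cond(3 > 2, "yes", "no")

                    * First argument false (zero): the third argument is returned
                    display cond(0, 1, 2)

                    * Any nonmissing number is less than missing (.)
                    display cond(5 < ., "nonmissing", "missing")

                    * Applied to a variable: keep displacement where rep78 == 3, missing elsewhere
                    sysuse auto, clear
                    generate disp_if_3 = cond(rep78 == 3, displacement, .)
                    summarize disp_if_3
                    ```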

                    As for adding confidence intervals to this, the simplest way is to use a different approach entirely.

                    Code:
                    levelsof rep78, local(values)
                    gen mean1 = .
                    gen lb1 = .
                    gen ub1 = .
                    foreach v of local values {
                        ci means displacement if !missing(difference) & rep78 == `v'
                        replace mean1 = r(mean) if rep78 == `v'
                        replace lb1 = r(lb) if rep78 == `v'
                        replace ub1 = r(ub) if rep78 == `v'
                    }
                    Note: I only did this for displacement, but I think the additional code needed to handle trunk is obvious.
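
                    One possible version of the trunk counterpart, as a sketch only. The names mean2, lb2, and ub2 are chosen here to mirror the displacement code. The -if- condition is deliberately swapped for an observation-level check, !missing(displacement, trunk), because an -egen- group mean such as difference is filled in for every observation of its group; with the auto data as shipped, either condition selects the same cars. The -ci means- syntax assumes Stata 14.1 or later.

                    ```stata
                    sysuse auto, clear
                    levelsof rep78, local(values)

                    * Placeholders for the mean of trunk and its 95% confidence limits
                    gen mean2 = .
                    gen lb2 = .
                    gen ub2 = .

                    foreach v of local values {
                        * Use only cars with both variables observed (assumption, see above)
                        ci means trunk if !missing(displacement, trunk) & rep78 == `v'
                        replace mean2 = r(mean) if rep78 == `v'
                        replace lb2 = r(lb) if rep78 == `v'
                        replace ub2 = r(ub) if rep78 == `v'
                    }

                    * One row per level of rep78
                    tabdisp rep78, c(mean2 lb2 ub2) format(%3.2f)
                    ```

                    Because -levelsof- skips missing values by default, cars with missing rep78 keep missing values in mean2, lb2, and ub2.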

