  • Generate Cluster Average Variable

    Hi
    I am trying to construct a cluster-level variable: the 'cluster average of women's working status'. I am working with DHS data, where clusters correspond to geographical units such as villages.

    Women's working status is a binary variable (jobb), coded 0: not working and 1: working.
    The cluster ID variable (v001) is stored as a numeric variable.
    I need to generate the variable cluster average of women's working status.

    Code:
    egen cluster_avg = mean(jobb), by(v001)
    * or, equivalently:
    by v001, sort: egen clustaverage = mean(jobb)
    I am not sure whether this command is correct, because I do not fully understand what this variable represents.
    1) Can the cluster average variable take up decimal values while the base variable 'women's working status' is in binary form?
    2) How do I deal with missing values while constructing the cluster average?
    3) What will be the total count (frequency) of the cluster average variable we generate - should that be the same as that of the base variable 'women's working status'?
    4) How do we construct the cluster average in a way that excludes the woman being considered to avoid correlation? (leaving-one-out technique)

    Any advice will be helpful. Thanking you.

    Code:
    dataex v001 clustaverage jobb in 60/100

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input long v001 float(clustaverage jobb)
    102 . .
    102 . .
    102 . .
    102 . .
    102 . .
    102 . .
    103 .06666667 0
    103 .06666667 .
    103 .06666667 0
    103 .06666667 0
    103 .06666667 .
    103 .06666667 .
    103 .06666667 .
    103 .06666667 1
    103 .06666667 0
    103 .06666667 0
    103 .06666667 .
    103 .06666667 .
    103 .06666667 0
    103 .06666667 .
    103 .06666667 0
    103 .06666667 0
    103 .06666667 0
    103 .06666667 0
    103 .06666667 0
    103 .06666667 .
    103 .06666667 .
    103 .06666667 .
    103 .06666667 .
    103 .06666667 0
    103 .06666667 .
    103 .06666667 0
    103 .06666667 .
    103 .06666667 .
    103 .06666667 .
    103 .06666667 0
    103 .06666667 .
    104 . .
    104 . .
    104 . .
    104 . .
    end
    Last edited by steny rapheal; 26 Jul 2024, 02:02.

  • #2
    1) Can the cluster average variable take up decimal values while the base variable 'women's working status' is in binary form?
    Not only can it, but it almost always will. You are averaging a bunch of 0's and 1's. Unless they are all zero in a cluster, or all one in a cluster, the average in the cluster will be somewhere between 0 and 1 (not inclusive) and will therefore be fractional.
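    As a quick illustration of this point, here is a minimal sketch on made-up data (the five values below are hypothetical, not taken from the DHS file). It only shows that the mean of a 0/1 variable is the proportion of 1s in the cluster.

    ```stata
    * Toy data (hypothetical): one cluster of five women, two of them working
    clear
    input long v001 float jobb
    1 0
    1 1
    1 0
    1 1
    1 0
    end
    egen cluster_avg = mean(jobb), by(v001)
    * The mean of a 0/1 variable is the share of 1s: here 2/5 = 0.4
    list v001 jobb cluster_avg
    ```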

    2) How do I deal with missing values while constructing the cluster average?
    This question can be understood in two different ways. One of them has nothing specifically to do with the cluster average: what to do about missing data in general. That is a lengthy topic. I'll assume you are not asking that and will not pursue it here. The other way is how do (should) observations with a missing value be handled when calculating an average. The answer is that they should be excluded from both the numerator and denominator. The -egen, mean()- function handles it this way, so there is nothing to worry about here.
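    If you want to convince yourself of this, a small check along the following lines may help. The data are invented for illustration: four observations in one cluster, one of them with jobb missing.

    ```stata
    * Toy data (hypothetical): four observations, one with jobb missing
    clear
    input long v001 float jobb
    1 0
    1 .
    1 1
    1 0
    end
    egen cluster_avg = mean(jobb), by(v001)
    * The missing value is dropped from numerator and denominator,
    * so the result is 1/3 rather than 1/4
    list v001 jobb cluster_avg
    ```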

    3) What will be the total count (frequency) of the cluster average variable we generate - should that be the same as that of the base variable 'women's working status'?
    I'm not sure what you mean by the total count of the cluster average variable. If you mean how many observations in the cluster will have a value for it, the answer is, using the code you used, all of them, even the ones where the jobb variable was missing. The exception is a cluster in which jobb is missing for every observation: in that case the average is missing for every observation in the cluster.
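    One way to see the difference in counts is sketched below on invented data: cluster 1 has two non-missing values of jobb out of three observations, and cluster 2 has none.

    ```stata
    * Toy data (hypothetical): cluster 2 has jobb missing throughout
    clear
    input long v001 float jobb
    1 0
    1 .
    1 1
    2 .
    2 .
    end
    egen cluster_avg = mean(jobb), by(v001)
    * Only 2 observations have jobb itself
    count if !missing(jobb)
    * All 3 observations in cluster 1 carry the average; cluster 2 stays missing
    count if !missing(cluster_avg)
    ```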

    4) How do we construct the cluster average in a way that excludes the woman being considered to avoid correlation? (leaving-one-out technique)
    Code:
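    * Cluster total and count of non-missing jobb
    by v001, sort: egen numerator = total(jobb)
    by v001: egen denominator = count(jobb)
    * If jobb is missing, use the full cluster mean; otherwise leave this woman out
    gen loo_mean = cond(missing(jobb), numerator/denominator, (numerator-jobb)/(denominator-1))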
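    One edge case worth knowing about: if a cluster has only one non-missing value of jobb, the leave-one-out denominator for that woman is zero, and Stata returns missing for the division. The sketch below runs the same three commands on invented data to show this; the toy values are assumptions for illustration only.

    ```stata
    * Toy data (hypothetical): cluster 2 has a single non-missing jobb
    clear
    input long v001 float jobb
    1 0
    1 1
    1 .
    2 1
    2 .
    end
    by v001, sort: egen numerator = total(jobb)
    by v001: egen denominator = count(jobb)
    gen loo_mean = cond(missing(jobb), numerator/denominator, (numerator-jobb)/(denominator-1))
    * Cluster 1: jobb==0 gets (1-0)/1 = 1, jobb==1 gets (1-1)/1 = 0, jobb==. gets 1/2
    * Cluster 2: jobb==1 gets (1-1)/0, which is missing; jobb==. gets 1/1 = 1
    list, sepby(v001)
    ```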


    • #3
      Thank you, Clyde, for the response.

      If you mean how many observations in the cluster will have a value for it, the answer is, using the code you used, all of them, even the ones where the jobb variable was missing
      About this: should I modify the code so that, when jobb is missing, the average is also missing?

      Code:
      egen jobb_total = total(jobb), by(v001)
      egen jobb_n = count(jobb), by(v001)
      gen jobb_mean_adj = (jobb_total - jobb)/(jobb_n - 1)
      I believe this code implements the leave-one-out technique and returns a missing value for the average whenever jobb is missing. Should I use this version instead?

      Code:
      * Example generated by -dataex-. For more info, type help dataex
      clear
      input long v001 float(jobb_mean_adj loo_mean jobb)
      102          .          . .
      102          .          . .
      102          .          . .
      102          .          . .
      102          .          . .
      102          .          . .
      103 .071428575 .071428575 0
      103          .  .06666667 .
      103 .071428575 .071428575 0
      103 .071428575 .071428575 0
      103          .  .06666667 .
      103          .  .06666667 .
      103          .  .06666667 .
      103          0          0 1
      103 .071428575 .071428575 0
      103 .071428575 .071428575 0
      103          .  .06666667 .
      103          .  .06666667 .
      103 .071428575 .071428575 0
      103          .  .06666667 .
      103 .071428575 .071428575 0
      103 .071428575 .071428575 0
      103 .071428575 .071428575 0
      103 .071428575 .071428575 0
      103 .071428575 .071428575 0
      103          .  .06666667 .
      103          .  .06666667 .
      103          .  .06666667 .
      103          .  .06666667 .
      103 .071428575 .071428575 0
      103          .  .06666667 .
      103 .071428575 .071428575 0
      103          .  .06666667 .
      103          .  .06666667 .
      103          .  .06666667 .
      103 .071428575 .071428575 0
      103          .  .06666667 .
      104          .          . .
      104          .          . .
      104          .          . .
      104          .          . .
      end
      Last edited by steny rapheal; 27 Jul 2024, 23:02.


      • #4
        It's perfectly fine to do it either way. The question is what you will be doing with this result afterward. If you will be doing observation-level calculations but want to exclude the cases where jobb is missing, then #3 is best: it will do that for you automatically. If, on the other hand, you will be doing cluster-level calculations with it, then it will be more convenient to have it in every observation because you can just tag an arbitrary observation within each cluster and use that. Like this:
        Code:
        egen tag = tag(v001)
        * some_calculation is a placeholder for whatever cluster-level command you need
        some_calculation if tag
        So either way is fine, and which is more convenient depends on what you will be doing next.
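        To make the tagging approach concrete, here is a minimal sketch on invented data, with -summarize- standing in for the cluster-level calculation (that choice of command is my assumption, purely for illustration).

        ```stata
        * Toy data (hypothetical): two clusters of unequal size
        clear
        input long v001 float jobb
        1 0
        1 1
        2 1
        2 1
        2 .
        end
        egen cluster_avg = mean(jobb), by(v001)
        egen tag = tag(v001)
        * One tagged observation per cluster, so each cluster counts exactly once
        summarize cluster_avg if tag
        ```

        Without the -if tag- qualifier, larger clusters would contribute more observations and so weigh more heavily in the summary.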


        • #5
          Okay, thank you so much for the response.
