  • Averaging categorical variables

    Dear All,
    Just a quick enquiry about obtaining averages of categorical variables (e.g. gender, region, etc.) for each cohort, these being some of the variables to be included in the list of controls. I rounded each average to the nearest whole number, so that any value >= 0.5 becomes '1' and any value < 0.5 becomes '0'.


    Thanks,

    Dapel

  • #2
    Don't do this.

    If you have an indicator variable, say 1 for female and 0 for male, the mean makes perfect sense as the proportion female. Rounding just degrades your result to "most or all female" or "most or all male" and throws away information.

    Otherwise the means of arbitrary categorical variables may be somewhere between difficult to interpret and utterly meaningless. Averaging region codes would be an example of the latter. Rounding cannot do anything to help this insuperable difficulty.
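    A small illustration of both points, as a Python sketch on made-up numbers (the cohort names, values and region codes are hypothetical, not taken from the poster's data). It compares the mean of a 0/1 indicator with its rounded version, and then averages arbitrary region codes:

```python
from statistics import mean

# Hypothetical indicator data: 1 = female, 0 = male, for two made-up cohorts.
female = {
    "cohort_a": [1, 1, 1, 0, 0],
    "cohort_b": [1, 1, 1, 1, 1],
}

# The mean of a 0/1 indicator is the proportion coded 1.
proportion_female = {name: mean(values) for name, values in female.items()}

# Rounding at 0.5, as proposed in the first post.
rounded = {name: 1 if p >= 0.5 else 0 for name, p in proportion_female.items()}

# Hypothetical region codes (1 and 4 are arbitrary labels, not quantities).
region_codes = [1, 1, 4, 4]
mean_region = mean(region_codes)  # 2.5, which is not a region at all

print(proportion_female)
print(rounded)
print(mean_region)
```

    Here the two cohorts have different proportions female (0.6 and 1), yet both round to 1, so the rounded variable can no longer tell them apart. The mean region code is a number that corresponds to no region.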

    What any of this has to do with some later modelling I cannot fathom, but it's mostly a very bad idea even if intended as basic descriptive statistics. This is, or should be, something understood after an elementary statistics course.
    Last edited by Nick Cox; 18 Feb 2015, 05:56.



    • #3
      That is very clear to me now. Thank you, Sir. What do you suggest? I need to include these categorical variables in the list of controls. Should I just use them as proportions? If so, how would one interpret such results?



      • #4
        Categorical variables used as predictors in some model typically are entered as one or more indicator variables: read up on factor variable notation.

        This is your first question reversed, and as before I don't understand what that has to do with using means.



        • #5
          Are you going to create aggregated data or something? If, say, your unit of analysis was countries, then your variables might include % female, % in poverty, % Catholic, and things like that.

          If your unit of analysis is the individual, then just use factor variables, as Nick suggests. Or maybe consider some sort of multi-level analysis if, say, you think people are affected by the number of people around them who live in poverty or something like that.

          I can't think of any circumstances where you would do the sort of rounding you initially suggested.

          If you were clearer about your research problem and the kind of data you have it might be possible to offer more specific advice.
          -------------------------------------------
          Richard Williams, Notre Dame Dept of Sociology
          StataNow Version: 19.5 MP (2 processor)

          EMAIL: [email protected]
          WWW: https://www3.nd.edu/~rwilliam



          • #6
            Thank you all very much! To be clearer, I want to estimate a dynamic model, $y_{i,t} = \alpha y_{i,t-1} + x_{i,t}\beta + \gamma_i + \varepsilon_{i,t}$, $i = 1, \dots, N$; $t = 1, 2, \dots, T$, but I do not have an actual panel dataset. However, I have six cross-sections (T = 6). To counter this shortcoming, I decided to construct a pseudo (synthetic) panel in which households are grouped into N cohorts (i = 1, ..., N) and cohort averages of household characteristics, e.g. age, income, gender, region, etc., are used instead. These cohorts are then tracked through time. The cohorts are constructed on the basis of age: for instance, those born in 1980-85, 1986-91, etc. In each of these cohorts, we have households from different regions, of different gender and income. Taking the average of income is fine, but that is not the case with gender and region. That is the situation I am seeking to sort out.
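            The cohort construction described here can be sketched as follows. This is only a minimal Python illustration under stated assumptions: the household records, the survey years and the names `cohort_of` and `pseudo_panel` are all hypothetical, and only the two birth-year bands mentioned in the post are used. Each household is assigned to a birth cohort, and each cohort-by-survey-year cell is then summarised by mean income and by the proportion female (the mean of a 0/1 indicator, left unrounded):

```python
from collections import defaultdict
from statistics import mean

# Hypothetical household records pooled from two made-up cross-sections.
households = [
    {"survey_year": 2004, "birth_year": 1982, "income": 100.0, "female": 1},
    {"survey_year": 2004, "birth_year": 1984, "income": 200.0, "female": 0},
    {"survey_year": 2004, "birth_year": 1988, "income": 50.0, "female": 1},
    {"survey_year": 2004, "birth_year": 1990, "income": 70.0, "female": 1},
    {"survey_year": 2010, "birth_year": 1981, "income": 150.0, "female": 0},
    {"survey_year": 2010, "birth_year": 1985, "income": 250.0, "female": 0},
    {"survey_year": 2010, "birth_year": 1983, "income": 200.0, "female": 1},
    {"survey_year": 2010, "birth_year": 1987, "income": 90.0, "female": 0},
    {"survey_year": 2010, "birth_year": 1991, "income": 110.0, "female": 1},
]


def cohort_of(birth_year):
    """Return the birth-cohort label, or None if outside the bands used here."""
    if 1980 <= birth_year <= 1985:
        return "1980-85"
    if 1986 <= birth_year <= 1991:
        return "1986-91"
    return None


# Group households into (cohort, survey year) cells.
cells = defaultdict(list)
for household in households:
    cohort = cohort_of(household["birth_year"])
    if cohort is not None:
        cells[(cohort, household["survey_year"])].append(household)

# One row per cell: cohort means take the place of individual observations.
pseudo_panel = {
    key: {
        "mean_income": mean([row["income"] for row in rows]),
        "share_female": mean([row["female"] for row in rows]),
        "n": len(rows),
    }
    for key, rows in sorted(cells.items())
}

for key, row in pseudo_panel.items():
    print(key, row)
```

            The gender control in each cell is then a share between 0 and 1 rather than a rounded 0 or 1. Keeping the cell size `n` alongside the means is a design choice of this sketch, useful because small cells give noisy cohort averages.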
            Last edited by Zuhumnan Dapel; 18 Feb 2015, 08:14.



            • #7
              Any more help? Thanks



              • #8
                I am still not totally sure what you want to do with these things once you have them. If you just want descriptive statistics, you could use commands such as mean and proportion, possibly combined with the over() option. If you want to add variables (e.g. the proportion female in a region) you could use commands like collapse. You might need to create dummy variables out of some of your categorical variables, e.g. region1, region2, etc. Once you have a 0/1 dummy, the mean tells you the proportion that were coded 1.
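                The dummy-variable step can be illustrated with a short Python sketch. It is not Stata syntax, and the region labels and the names `dummies` and `shares` are made up for the example. One 0/1 dummy is built per region, and the mean of each dummy gives the share of households in that region:

```python
from statistics import mean

# Hypothetical region labels for five households in one cohort.
regions = ["north", "south", "south", "east", "south"]

# One 0/1 dummy per distinct region (the analogue of region1, region2, ...).
levels = sorted(set(regions))
dummies = {
    level: [1 if region == level else 0 for region in regions]
    for level in levels
}

# The mean of each dummy is the proportion of households in that region.
shares = {level: mean(values) for level, values in dummies.items()}

print(shares)
```

                The shares sum to 1 within a cohort, so if they are used as regressors one of them would normally be left out as the reference category.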
                -------------------------------------------
                Richard Williams, Notre Dame Dept of Sociology
                StataNow Version: 19.5 MP (2 processor)

                EMAIL: [email protected]
                WWW: https://www3.nd.edu/~rwilliam



                • #9
                  Thanks, dear Richard.
