
  • Categorising height using Stata

    I have data on the height of each individual, and I wish to categorise these individuals as tall, average or short. My data are from the UK, and I cannot find any papers on what is classified as tall, average and short for males in the UK. I imagine that I can categorise them myself by finding the height distribution of my sample and then taking the values two standard deviations below and above the mean as my short/tall boundaries. Any idea how I would go about doing this using Stata?
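For reference, the sample-based rule described in this question could be sketched roughly as follows. This is only an illustration on simulated data: the variable name height_cm, the simulated mean and SD, and the category labels are assumptions, not anything taken from the poster's data set.

```stata
* Simulated stand-in data (assumption); replace with your own height variable
clear
set seed 12345
set obs 500
generate double height_cm = rnormal(176, 7)

* -summarize- leaves the sample mean and SD in r(mean) and r(sd)
summarize height_cm
scalar lo_cut = r(mean) - 2*r(sd)
scalar hi_cut = r(mean) + 2*r(sd)

* 1 = short, 2 = average, 3 = tall; missing heights stay missing
generate byte height_cat = 2 if !missing(height_cm)
replace height_cat = 1 if height_cm < lo_cut
replace height_cat = 3 if height_cm > hi_cut & !missing(height_cm)

label define height_lbl 1 "Short" 2 "Average" 3 "Tall"
label values height_cat height_lbl
tabulate height_cat
```

If heights are close to normally distributed, a two-SD rule puts only about 2.5% of the sample in each tail, so the "short" and "tall" groups will be small.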

  • #2
    Best to keep these queries together: http://www.statalist.org/forums/foru...ing-panel-data

    I am based in the UK and have no notion that there is any such classification, beyond "tall" being taller than myself, etc. A slightly deeper comment is that this is just a recipe to degrade your data, with nothing else said.



    • #3
      First, I could not agree more with Nick that classifying your data instead of using the heights will only impair your ability to use the data.

      But if you are intent on this destructive path, I will give you some rope to hang yourself with:

      You can download the -zanthro()- egen function(s) from the Stata Journal. One of the standards available there is the UK 1990 Growth Charts. You can then calculate a z-score for length/height for age based on those norms. (The heights are in cm, and age goes up to 23 years. As virtually everyone reaches adult height by age 23, you can use age 23 as a proxy for all ages above that--unless you have elderly people, where shrinkage is an issue.) You can then, if you must, classify based on a high z-score from that national standard. At least with national norms like that, your 2SD cutoff will be less indefensible than one based on your own data set (e.g., it will correspond to cutpoints that other people can relate to, verify, etc.).
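A rough sketch of that recipe is below. It assumes -zanthro- is already installed (type -search zanthro- and follow the links). The call is written from memory of the help file, so please check -help zanthro- for the exact chart codes, the permitted age range, and the default handling of extreme z-scores before relying on it. The variable names (height_cm, age_yrs, sex) and the 1/2 sex coding are hypothetical, and the data are simulated.

```stata
* Simulated stand-in data (assumption); replace with your own variables
clear
set seed 12345
set obs 500
generate byte sex = 1                          // hypothetical coding: 1 = male, 2 = female
generate double age_yrs = floor(16 + 45*runiform())
generate double height_cm = rnormal(176, 7)

* Use age 23 as a proxy for all older ages, as suggested above.
* min() ignores missing arguments, hence the explicit if qualifier.
generate double age_ref = min(age_yrs, 23) if !missing(age_yrs)

* Height-for-age z-score against the UK 1990 reference
* (syntax as I recall it; verify with -help zanthro-)
egen double z_height = zanthro(height_cm, ha, UK), xvar(age_ref) ///
    gender(sex) gencode(male=1, female=2)

* Classify at +/- 2 SD of the national reference
generate byte height_cat = 2 if !missing(z_height)
replace height_cat = 1 if z_height < -2
replace height_cat = 3 if z_height > 2 & !missing(z_height)

label define height_lbl 1 "Short" 2 "Average" 3 "Tall"
label values height_cat height_lbl
tabulate height_cat
```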

      That said, in my 30-year career as an epidemiologist, I have never encountered a single situation where turning height into a categorical variable made sense, or did anything other than muddy the waters. And for a general statistical perspective on this bad idea, see Royston P, Altman DG, Sauerbrei W. Dichotomizing continuous predictors in multiple regression: a bad idea. Statistics in Medicine 2006; 25: 127-141.

      Oh, and did I mention that I think making a categorical classification out of height is not a good idea?

      OK, I'll stop beating up on you now.



      • #4
        I think I made a slip in my explanation, which has caused this misunderstanding. I want to categorise height and BMI for summary statistics by group (attractive/unattractive) on my other variables (log hourly wage/health dummies). When it comes to running pooled OLS on my data, height and BMI will be used as continuous variables.



        • #5
          And in the spirit of using height and BMI directly in your regressions, I would suggest that you use them directly in the presentation of your summary statistics. Divide each into, say, 5 ranges (and in passing, I hope you're doing these analyses separately by gender, or for a single gender only) and present the summary statistics that way, within each of 5 ranges of height and within each of 5 ranges of BMI. Avoid attempting to define "beauty" to measure its effects, as I suggested in your other post on these topics that Nick linked to in post #2 above.
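One way the five-range summary could look is sketched below. The post does not say how the ranges should be chosen; using quintile groups from -xtile- is my assumption, and fixed cutpoints would serve equally well. The variable names and the simulated data are hypothetical.

```stata
* Simulated stand-in data (assumption); replace with your own variables
clear
set seed 2024
set obs 1000
generate double height_cm = rnormal(176, 7)
generate double bmi = rnormal(26, 4)
generate double lnwage = 2 + 0.005*(height_cm - 176) + rnormal(0, 0.4)

* Five equal-frequency groups for each measure
xtile height_q = height_cm, nquantiles(5)
xtile bmi_q = bmi, nquantiles(5)

* Show the range each group covers, then summarize the outcome within groups
tabstat height_cm, by(height_q) statistics(min max)
tabstat lnwage, by(height_q) statistics(n mean sd)

tabstat bmi, by(bmi_q) statistics(min max)
tabstat lnwage, by(bmi_q) statistics(n mean sd)
```

With both genders in the data, the same commands would be run separately for each gender, in line with the advice above.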

          And in that thread, I earlier recommended, as I continue to recommend, that you find a statistician locally to guide you on your work.



          • #6
            Well, I'm relieved that it's just for descriptive purposes. That's not so terrible. (Though I'm not sure how you will translate those into attractive and unattractive categories.)

            That said, although it's not so terrible to do this, there are alternatives that might be better for descriptive statistics. For example, keep the anthropometrics as continuous variables and present tables of correlation coefficients between them and the other variables. Or perhaps a matrix of scatter plots. (See -graph matrix-). The latter would not only be interesting as a way of presenting the descriptive data, they may well be useful to you in identifying non-linearities in the relationships among these variables that you can then use to better specify your regression model.
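A minimal sketch of those two alternatives, again on simulated data with hypothetical variable names (height_cm, bmi, lnwage):

```stata
* Simulated stand-in data (assumption); replace with your own variables
clear
set seed 2024
set obs 1000
generate double height_cm = rnormal(176, 7)
generate double bmi = rnormal(26, 4)
generate double lnwage = 2 + 0.005*(height_cm - 176) + rnormal(0, 0.4)

* Pairwise correlations, with the number of observations and p-values
pwcorr height_cm bmi lnwage, obs sig

* Scatter plot matrix; -half- draws only the lower triangle
graph matrix height_cm bmi lnwage, half msymbol(p)
```

Curvature in any panel of the scatter plot matrix is a hint that the regression may need something other than a linear term for that variable.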

            If your work were targeting an audience with limited quantitative skills, then simple classifications that are familiar to all would perhaps be the clearest way. But for a dissertation, I think that the kinds of thing I describe in the preceding paragraph would be more appropriate.

