  • How to standardize variables to make max value 1 and min value -1?

    Hello,

    I am having trouble standardizing some panel data from VDEM. The data doesn't seem to change much before and after I standardize. I thought that if I specified a mean of 0 and a standard deviation of 1, the max of the new variable would be 1 and the min would be -1, but that doesn't happen. The data is still kind of messy. For example, in one of my variables, the max before standardization is 3.882 and after it's 2.925161. The min goes from -3.625 to -2.212299, and the mean goes from -.3923272 to 2.65e-09. I'm assuming 2.65e-09 is close enough to 0. That doesn't seem very standard to me.
    The code I am using is:
    In general:
    Code:
    egen newvar = std(oldvar), mean(0) sd(1)
    Code:
    foreach variable in varlist {
        egen `variable'std = std(`variable'), mean(0) sd(1)
        sum `variable'std
    }
    I was worried that my loop was wrong, so I also tried the variables individually to check whether the numbers change. I didn't include a data example because I didn't think it would be helpful here, but let me know if it would be. Is there a way to standardize variables so that the max is 1 and the min is -1?

    Thank you very much!

  • #2
    Your assumption is wrong in general. A variable with mean 0 and SD 1 won't typically have extremes -1 and 1; indeed that is utterly exceptional. It is true for equal frequencies of -1 and 1 and an SD calculation using sample size, not sample size - 1.
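
    As a quick illustration of that point (a minimal sketch, not from the original post): with just the two values -1 and 1 in equal numbers, the mean is 0, and the SD is exactly 1 only if you divide by n; summarize, which divides by n - 1, reports about 1.414.

    Code:
    clear
    set obs 2
    generate x = cond(_n == 1, -1, 1)   // equal frequencies of -1 and 1
    summarize x                         // mean 0, SD = sqrt(2) with divisor n - 1
    display sqrt((1 + 1)/2)             // SD with divisor n is exactly 1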

    egen, std() doesn't offer that kind of standardization. There may be code to do that somewhere, but here I just loop over some variables and use minimum and maximum directly.


    Code:
    sysuse auto, clear

    foreach v in price mpg weight {
        su `v', meanonly
        gen double `v'_minmax = 2 * (`v' - r(min)) / (r(max) - r(min)) - 1
    }

    su *minmax

        Variable |        Obs        Mean    Std. dev.       Min        Max
    -------------+---------------------------------------------------------
    price_minmax |         74   -.5443113    .4676173         -1          1
      mpg_minmax |         74   -.3588071    .3990002         -1          1
    weight_min~x |         74   -.1821692    .5046711         -1          1
    Note that with this standardization the mean doesn't have to be 0. That would be true of symmetric distributions, and some others, but not in general.

    Standardization was, a while back, a helpful thing to do to minimise numerical problems, but I don't find that it helps much otherwise. If your values were extremely large or extremely small, you would notice more difference on standardization.


    The double here makes little or no substantive difference, but it's the answer to your secondary query of why means aren't exactly zero. You are using float by default to hold your new variables, while double would give you more bits.
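
    To see the float versus double point in action (a small sketch using the auto data, not from the original post):

    Code:
    sysuse auto, clear
    egen zfloat = std(price)            // egen stores float by default
    egen double zdouble = std(price)    // request double storage explicitly
    quietly summarize zfloat
    display "float mean:  " %20.0g r(mean)
    quietly summarize zdouble
    display "double mean: " %20.0g r(mean)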

    A data example never does any harm. You're right in this case that the problems are easy to spot.



    • #3
      Hi Nick Cox! Always love getting a response from you. Thank you for answering all of my questions. I have some follow-ups if that's all right.

      Re: standardization, it turns out my original code egen, std() was fine - I misinterpreted a request from my boss. But I appreciate your instruction on how to do that anyway - I really wanted to know, even if just for myself.

      To respond to both your comments on this post and https://www.statalist.org/forums/for...tandardization, I want to explain why I'm interested in standardizing and hear your feedback. I should have tried explaining earlier, like months ago.

      The reason I'm standardizing is that I have different scales for different variables: an age variable observed on a scale of years, a percentage-scale variable (life expectancy), and survey variables observed on a scale of 0-4 that have been (sort of) standardized by another source. Since I need to aggregate all of these different kinds of variables into one composite variable, I thought I should standardize them. Maybe not?

      When I say "sort of standardized", VDEM describes the data like so: "The scale of a measurement model variable is similar to a normal (ā€œZā€) score (i.e. typically between -5 and 5, with 0 approximately representing the mean for all country-years in the sample) though it does not necessarily follow a normal distribution." which sounds like standardization to me, but I'm not sure exactly what code they use to get this and the actual observed means seem far enough away from 0 to cause concern at least.

      To demonstrate this I took a dataex of 1 line of these generated mean variables:
      Code:
      foreach v in varlist {
          egen `v'mean = mean(`v')
      }
      Code:
      * Example generated by -dataex-. For more info, type help dataex
      clear
      input float(v2caviolmean v2exresconmean v2exthftpsmean v2peasbeconmean v2peasbepolmean v2clprptymmean v2clpolclmean v2cacampsmean v2peapseconmean v2peedueqmean v2pehealthmean)
      -.3440833 .19982505 .2567743 -.018001989 -.17261003 .24471708 -.07129194 -.13108872 -.42729065 -.3303107 -.27642477
      end
      Is that cause for concern? It seems like I should standardize them again, while simultaneously standardizing my age and life expectancy variables. Or I could access the same variables with their original observations on the original 0-4 scale and standardize them myself. I'm not sure what the best practice is here.

      Final question: I am reversing the scale on some of these, as you so clearly spelled out how to do here: https://www.statalist.org/forums/for...le-in-variable (really enjoyed going through the list of SSC commands provided and trying things out; it's nice to have options). Should I reverse first and standardize after? Does the order of operations matter/make a difference? The range of numbers changes depending on my order, but the range size does not. From my own experience trying both orders, it seems to make more sense to me to reverse first and standardize second.


      Gratefully,
      Sam



      • #4
        The reason I'm standardizing is that I have different scales for different variables: an age variable observed on a scale of years, a percentage-scale variable (life expectancy), and survey variables observed on a scale of 0-4 that have been (sort of) standardized by another source. Since I need to aggregate all of these different kinds of variables into one composite variable, I thought I should standardize them. Maybe not?
        Depends. If you wanted to use something like PCA, it does it all for you, which can be helpful and which can be a bad idea. That's empty but it's the only short summary I can think of that is accurate. Composite variables I am usually sceptical about, but it sounds as if you are under orders.
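
        For what it's worth, a minimal sketch of the PCA route (using the auto data as a stand-in, so the variable names are illustrative only): pca works from the correlation matrix by default, so each variable is in effect standardized before being combined, and predict gives the component scores.

        Code:
        sysuse auto, clear
        pca price mpg weight      // correlation-based by default, so differing scales are handled
        predict pc1, score        // first principal component as a composite score
        summarize pc1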

        When I say "sort of standardized", VDEM describes the data like so: "The scale of a measurement model variable is similar to a normal (ā€œZā€) score (i.e. typically between -5 and 5, with 0 approximately representing the mean for all country-years in the sample) though it does not necessarily follow a normal distribution." which sounds like standardization to me, but I'm not sure exactly what code they use to get this and the actual observed means seem far enough away from 0 to cause concern at least.
        Sorry, but I've never heard of VDEM and have no comment to make.

        To demonstrate this I took a dataex of 1 line of these generated mean variables:
        foreach v in varlist { egen `v'mean = mean(`v') }
        To show a bunch of means, just use (e.g.) summarize or tabstat.
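
        For example (a sketch only; the variable names are guessed from the *mean variables in your dataex and may not match your data):

        Code:
        tabstat v2caviol v2exrescon v2exthftps, statistics(mean n) columns(statistics)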

        Final question: I am reversing the scale on some of these, as you so clearly spelled out how to do here: https://www.statalist.org/forums/for...le-in-variable (really enjoyed going through the list of SSC commands provided and trying things out; it's nice to have options). Should I reverse first and standardize after? Does the order of operations matter/make a difference? The range of numbers changes depending on my order, but the range size does not. From my own experience trying both orders, it seems to make more sense to me to reverse first and standardize second.
        This is school algebra, although as no one is expecting proofs here, it's easiest just to think from examples. The order of reversing and standardizing is immaterial, so long as both are linear operations. To check, try it out yourself. Or study the results of

        Code:
        sysuse auto, clear
        egen rep78_2 = std(rep78)
        replace rep78_2 = -rep78_2
        gen rep78_3 = 6 - rep78
        egen rep78_4 = std(rep78_3)
        Otherwise put, the effect of reversing is to flip the sign of a standardized value. When you do that does not matter.
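
        As an added check (not part of the original post), rep78_2 and rep78_4 should agree to within float rounding, which confirms that the order is immaterial:

        Code:
        assert reldif(rep78_2, rep78_4) < 1e-6 if !missing(rep78)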



        • #5
          Nick Cox Thanks for answering all my questions!

