Standardizing and normalizing composite variable

Andreas Grytten

Join Date: May 2023

Posts: 6
#1

19 May 2023, 03:56

I'm writing a thesis about prosociality and social status. The data I'm using consist of responses from the USA, Poland, Germany and Sweden. Before I ran my regression analysis I standardized the variable for objective status (consisting of variables for job prestige, education and income). However, the objective status variable should be standardized at country level, not across all countries together. I standardize each indicator at country level with this code:

bysort country: egen z2utdanning = std(education)
bysort country: egen z2jobbprestisje = std(jobbprestisje)
bysort country: egen z2inntektsdesil = std(incomedecile)

Then I use rowmean to combine the variables into a composite:
egen row_objektivstatus = rowmean(z2inntektsdesil z2jobbprestisje z2utdanning)

Is it necessary to standardize the composite variable (row_objektivstatus) by country again, or is it already standardized because each indicator was standardized before I constructed the composite variable?

I also normalize the variable:
bysort country: su row_objektivstatus
bysort country: ge z1objektivstatus = (row_objektivstatus-r(min))/(r(max)-r(min))
bysort country: su z1objektivstatus

The minimum and maximum of objective status in the USA are exactly 0 and 1, while for the other countries the values are close to 0 and 1, but with decimals. Any suggestions why the USA got exact numbers without decimals? I'm just asking because I'm not sure I did the normalization and standardization correctly. When I normalize the objective status variable per country, the composite no longer has a standard deviation of 1, but it seems that having 0 and 1 as minimum and maximum through normalizing is preferable to having a standard deviation of 1 for all countries. It is also a bit strange that for every country the mean is about 0.57 and the standard deviation is about 0.23 (very small differences between countries). Would it not make sense to have bigger differences in those values?

Thanks in advance!
Nick Cox

Join Date: Mar 2014

Posts: 35429
#2

19 May 2023, 04:50

I have various unrelated comments. I can't follow all of what you are asking and focus only on what I think I understand.

1. On terminology, watch out. I think "standardize" has just one common meaning, namely (value - mean) / SD, but I've seen the word "normalize" used in various different ways, so it is essential to explain your meaning, as you did: here, (value - min) / (max - min).

2. When you go for a row mean of job prestige, education and income you have mashed together things (as a lay person here) I can think about and produced a composite that I can't think about. Part of the art and the point of regression is to have predictors that can be thought about, including what their relation is to each other.

3. I can't see that yet more scaling will help interpretation of a regression. Regression allows predictors to have utterly different units of measurement and we look at t and SE results to keep track in a way that ignores any such difference.

4. This code is quite wrong for a subtle reason that has bitten many users.

Code:

bysort country: su row_objektivstatus
bysort country: ge z1objektivstatus = (row_objektivstatus-r(min))/(r(max)-r(min))
bysort country: su z1objektivstatus

After

Code:

bysort country: su row_objektivstatus

just one r(max) and one r(min) are accessible, for the last country summarized. Your code makes a presumption that a separate r(min) and r(max) are accessible for each distinct country, but not so, as you can see by running

Code:

return li

after the first command. Odds are that USA was the last country looked at, which explains the anomaly.

There may be an egen function somewhere to provide min-max scaling but it is quicker to write fresh code. Here is a pattern to follow with your longer variable names.

Code:

bysort g : egen min = min(x)
by g : egen max = max(x)
gen wanted = (x - min) / (max - min)

If you had no missing values that would be even easier at

Code:

bysort g (x) : gen wanted = (x - x[1]) / (x[_N] - x[1])

5. If you standardize (mean 0, SD 1) within groups, then the overall variable also has mean 0, but it's not guaranteed to have SD 1. Someone may have a neat explanation why, but that appears to be the case.
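Points 4 and 5 can be checked on a dataset that ships with Stata. The following is only a sketch: it uses the auto data, with foreign standing in for country and price for the status composite, and all new variable names (min_price, max_price, price01, m_price, s_price, zprice) are made up for illustration. The within-group standardization is written out with mean() and sd() rather than std(), on the assumption that this is the form least likely to depend on the Stata version.

```stata
sysuse auto, clear

* Point 4: min-max scaling within groups, following the pattern above
bysort foreign: egen min_price = min(price)
by foreign: egen max_price = max(price)
gen price01 = (price - min_price) / (max_price - min_price)
* each group should now run from exactly 0 to exactly 1
bysort foreign: summarize price01

* Point 5: standardize within groups, then look at the pooled SD
by foreign: egen m_price = mean(price)
by foreign: egen s_price = sd(price)
gen zprice = (price - m_price) / s_price
* SD is 1 within each group ...
bysort foreign: summarize zprice
* ... but slightly below 1 when the groups are pooled
summarize zprice
```

One possible explanation for point 5: after within-group standardization every group mean is 0, so there is no between-group variation, and the within-group sums of squares add up to N - G (N observations, G groups). The pooled variance is then (N - G) / (N - 1), which is below 1 whenever there is more than one group and approaches 1 as N grows relative to G.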
Maarten Buis

Join Date: Mar 2014

Posts: 3426
#3

19 May 2023, 05:12

Originally posted by Andreas Grytten

Is it necessary to standardize the composite variable(row_objektivstatus) by country again, or is it already standardized because each indicator was standardized before I constructed the composite variable?

Standardized, in the way you seem to use that word, means that a variable has a mean of 0 and a standard deviation of 1. The purpose is to fix the unit. There is no natural unit of status (you can't measure it in kg or $), so standardization can be used to provide such a unit (in this case the unit is a standard deviation). So the new variable should have a standard deviation of 1. That is not going to be the case, as these three variables are likely correlated (the variance of a sum of random variables is the sum of the variances plus two times the covariances). So if you want your variable to be standardized, you need to "restandardize" it.
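To make the variance argument concrete: if the three indicators each have variance 1 and pairwise correlations r12, r13 and r23, their row mean has variance (3 + 2(r12 + r13 + r23)) / 9. That is 1/3 when the indicators are uncorrelated and reaches 1 only when all three are perfectly correlated, so the composite's SD will generally be below 1. This ignores rows where rowmean() averaged fewer than three non-missing indicators. A minimal sketch of restandardizing by country, using the variable names from #1 plus hypothetical names (mean_status, sd_status, zobjektivstatus), might look like this:

```stata
* How correlated are the standardized indicators?
correlate z2inntektsdesil z2jobbprestisje z2utdanning

* The composite's SD per country is expected to be below 1
bysort country: summarize row_objektivstatus

* Restandardize the composite within each country
bysort country: egen mean_status = mean(row_objektivstatus)
by country: egen sd_status = sd(row_objektivstatus)
gen zobjektivstatus = (row_objektivstatus - mean_status) / sd_status

* Check: mean 0 and SD 1 within each country
bysort country: summarize zobjektivstatus
```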

The question is whether standardizing is the best way to give your variable a unit. One could argue that status is not something with an absolute value, but that its value is determined by how much you have compared to all others. In that case you can make a good case for percentile scores or plotting positions: what proportion of respondents have less than I do? See: https://www.stata.com/support/faqs/s...ons/index.html on how to compute these.
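As a rough sketch of that idea (not necessarily the exact recipe in the FAQ), plotting positions within each country can be computed from ranks. The variable names n_status, rank_status and pp_status are hypothetical, and the formula (rank - 0.5) / n is just one common choice among several.

```stata
* Number of non-missing values and rank within each country
bysort country: egen n_status = count(row_objektivstatus)
by country: egen rank_status = rank(row_objektivstatus)

* Plotting position: roughly the share of respondents in the
* same country with a lower value; ties receive average ranks
gen pp_status = (rank_status - 0.5) / n_status

bysort country: summarize pp_status
```

The same lines could be applied to each indicator separately instead of the composite; the result always lies strictly between 0 and 1.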

Normalizing seems like an extraordinarily bad idea in your case. Now the unit becomes the observed range of a variable, and that range is far too volatile, especially with a variable like income. So, I could explain why your code is wrong, but that would just be a waste of time, as you should not be doing that anyhow.

---------------------------------
Maarten L. Buis
University of Konstanz
Department of history and sociology
box 40
78457 Konstanz
Germany
http://www.maartenbuis.nl
---------------------------------
Maarten Buis

Join Date: Mar 2014

Posts: 3426
#4

19 May 2023, 06:03

Originally posted by Nick Cox

2. When you go for a row mean of job prestige, education and income you have mashed together things (as a lay person here) I can think about and produced a composite that I can't think about. Part of the art and the point of regression is to have predictors that can be thought about, including what their relation is to each other.

This is a very good comment, and I would not in general consider adding those variables up to be a good idea (I work a lot with concepts like socioeconomic status). I suspect that Andreas wants to capture the extent to which someone holds a privileged position in society, and such privilege can come from holding a prestigious job, having high education, and/or having a lot of money (notice that wealth would thus be a better indicator of privilege than income, but good luck getting reliable and meaningful measurements of wealth in your survey). In that sense it may seem that combining the three would be a good idea.

Several problems occur when you want to combine them in one variable. The first is the issue of what unit that new measure of privilege has, which I discussed in #3. The second is that right now you have assumed that all three variables have the same weight. Something like factor analysis is not applicable to this case, as it is the resources (prestige, education, and income) that determine the latent variable (privilege), while factor analysis identifies the weights by assuming that the latent variable determines the observed variables. You may get somewhere with a MIMIC model, but I doubt that that is worth the effort.

The main problem I see with the idea that there is one variable, privilege, that is some weighted sum of resources (prestige, education, money) is that the weight of each component of privilege depends on the situation in which you want to "cash in" on that privilege: sometimes you can just buy what you want, other times it requires a certain level of education, and sometimes prestige is the most efficient way of getting what you want.

Given all these difficulties most scholars in my field just keep prestige, education, and income as separate variables.

Nick Cox

Join Date: Mar 2014

Posts: 35429
#5

19 May 2023, 06:08

Maarten Buis knows well that I am not a social scientist, still less a sociologist, but here is a disclaimer for anyone who doesn't know that.