Using invnorm to inverse rank a variable

Mel de Lange

Join Date: Feb 2022

Posts: 21
#1


15 Feb 2022, 07:35

Hi there,

I am trying to inverse rank a variable (Fractionaccel_over425mg) in Stata using the code below. The first line (rank) works fine (I have over 90,000 observations) but then when I use the invnorm function it suddenly says I have no observations. I suspect this is something to do with the fact that there are a lot of people with the same value (e.g. there are >18,000 with a value of 0)?

egen rankFractionaccel_over425mg = rank(Fractionaccel_over425mg)
count if Fractionaccel_over425mg!=.
gen invrankFractionaccel_over425mg = invnorm(rankFractionaccel_over425mg - 0.5)/r(N)

I have also tried rank() with the unique option, as below, but this still didn't work: it then said I only had 1 observation.

egen rankFractionaccel_over425mg = rank(Fractionaccel_over425mg), unique
count if Fractionaccel_over425mg!=.
gen invrankFractionaccel_over425mg = invnorm(rankFractionaccel_over425mg - 0.5)/r(N)

I'd be grateful for any advice on how I can get this to work.

Many thanks,

Mel
Maarten Buis

Join Date: Mar 2014

Posts: 3426
#2

15 Feb 2022, 08:21

It depends on what you mean by "inverse the rank of a variable":

My first guess was that you want to invert a rank, which you can do with egen invrankvar = rank(-original_variable) (notice the minus sign). That way it ranks from largest to smallest instead of smallest to largest, i.e. it inverts the rank.

However, that does not explain why you think you need to use invnorm(). There are those who use that to force a distribution to be normal. That is almost always a very very very bad idea. It falsifies your data by forcing a distribution on your data that you did not observe. Moreover, when models assume a normal distribution, it is the errors, not the dependent variable, that are assumed to be normally distributed. Finally, it is not going to work if you have a lot of ties. Even if your stance on this transformation is more relaxed than mine, you would still never ever ever ever ever ever ever ever even think about using the unique option here. Those ties are your data, and randomly separating them is just evil.

---------------------------------
Maarten L. Buis
University of Konstanz
Department of history and sociology
box 40
78457 Konstanz
Germany
http://www.maartenbuis.nl
---------------------------------
William Lisowski

Join Date: Dec 2014

Posts: 10150
#3

15 Feb 2022, 08:33

The invnorm() function requires that its argument be >0 and <1. Your ranks run from 1 up to the number of observations, so rank - 0.5 satisfies that only for an observation with a rank of exactly 1. In your first example, if two or more observations are tied for the lowest value, they share the average of their ranks (1.5 for two tied observations), so no observation has rank 1 and every result is missing. In your second example, exactly 1 observation will have rank 1.
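The domain restriction is easy to check at the Stata prompt. A minimal sketch, using invnormal() (the documented name of the same function) and arbitrary illustrative inputs:

```stata
* invnormal(p) is defined only for p strictly between 0 and 1
display invnormal(0.5)      // median of the standard normal
display invnormal(0.975)    // upper 2.5% point
display invnormal(1.5)      // outside (0,1): returns missing (.)
display invnormal(17.5)     // e.g. rank 18 minus 0.5: also missing
```

Any rank - 0.5 larger than 1 therefore yields a missing value, whatever it is divided by afterwards.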

Did you perhaps mean, instead of

Code:

gen invrankFractionaccel_over425mg = invnorm( rankFractionaccel_over425mg - 0.5 ) / r(N)

the following

Code:

gen invrankFractionaccel_over425mg = invnorm( (rankFractionaccel_over425mg - 0.5)/r(N) )

so that you would be dividing the rank by the number of observations before calculating the inverse normal?
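A self-contained sketch of the rank-then-inverse-normal calculation with the division inside the function call. It assumes Stata's bundled auto dataset and its mpg variable as stand-ins for the real data; the names rank_mpg and int_mpg are illustrative only:

```stata
sysuse auto, clear

* Default ranks: tied values share the average of their ranks
egen double rank_mpg = rank(mpg)

* Number of nonmissing observations, saved before anything else resets r()
quietly count if !missing(mpg)
local n = r(N)

* (rank - 0.5)/n lies strictly between 0 and 1, as invnormal() requires
generate double int_mpg = invnormal((rank_mpg - 0.5) / `n')

summarize int_mpg
```

Saving r(N) in a local macro is a precaution rather than a necessity: it keeps the calculation correct even if another r-class command is later run between count and generate.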

Added in edit: My post crossed with #2, which expresses many concerns I too have. I tried not to think about what you were doing, just about your mistaken usage of invnorm() given the values of rankFractionaccel_over425mg.

Last edited by William Lisowski; 15 Feb 2022, 08:41.
Nick Cox

Join Date: Mar 2014

Posts: 35432
#4

15 Feb 2022, 09:29

William Lisowski has explained the immediate problem: mathematical punctuation in the wrong place.

I agree with others that the use of invnormal() (old name invnorm() still works, but is undocumented) is puzzling here -- unless the goal is to produce normal quantile plots, already available through qnorm (or qplot from SJ).

I also agree strongly that unique ranks are generally wrong here. The same value should be assigned the same rank, at least for most purposes. (One exception is for cumulative distribution plots, as explained in the document next mentioned.)
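The difference between the two kinds of rank can be seen on a toy dataset. This is only a sketch: the five values and the variable names x, r_default and r_unique are made up for illustration.

```stata
clear
input x
0
0
0
1
2
end

* Default: the three tied zeros share the average rank (1 + 2 + 3)/3 = 2
egen r_default = rank(x)

* unique: the same three zeros receive ranks 1, 2 and 3 in an arbitrary order
egen r_unique = rank(x), unique

list x r_default r_unique
```

With unique, identical values end up with different ranks, and hence different transformed values, for no reason that is present in the data.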

https://www.stata.com/support/faqs/s...ing-positions/ is an FAQ in this territory.

That said, the value of ranking is naturally limited by any large number of ties.

Also, in economics and sociology how well you are doing -- given competition and so forth -- depends partly on how far others are doing better or worse. In medicine, that isn't so true, at least in terms of how well or ill you are.

Last edited by Nick Cox; 15 Feb 2022, 09:47.
Mel de Lange

Join Date: Feb 2022

Posts: 21
#5

16 Feb 2022, 01:42

Thank you all so much. Yes, I think it was a question of missing some brackets. In answer to why I'm using invnorm: I'm doing some analysis of a genetic score derived from a previously published paper, and to do that I have to do exactly what they've done to create this raw inversed variable which the genetic score is associated with... I totally understand your reservations about its use!

Thanks again,

Mel