Dear all,
I have a dataset with 200,000 obs and the following variables:
score (continuous: 0-100), pred (binary: 0/1).
I want to create a binary variable, pred2, defined as follows: a) if score is high, pred2=1; b) if score is middle, pred2=pred; c) if score is low, pred2=0.
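To make the rule concrete, here is a minimal sketch in Python (not the software I am working in, just for illustration). The function name apply_rule and the arguments low_cut and high_cut are made up for this example; the data and the cut-offs 30.0 and 85.0 come from the 10-observation example in this post.

```python
def apply_rule(score, pred, low_cut, high_cut):
    """Return pred2 for one observation, given two (hypothetical) cut-offs."""
    if score > high_cut:      # high group: always 1
        return 1
    if score <= low_cut:      # low group: always 0
        return 0
    return pred               # middle group: keep the original prediction


# The 10-observation example from this post.
scores = [0.2, 1.4, 2.1, 14.5, 30.5, 37.5, 50.2, 75.7, 84.2, 99.7]
preds = [1, 0, 1, 0, 1, 0, 1, 1, 0, 1]

pred2 = [apply_rule(s, p, low_cut=30.0, high_cut=85.0)
         for s, p in zip(scores, preds)]
print(sum(pred2))  # number of observations with pred2 == 1
```

The hard part is not applying the rule but choosing the two cut-offs so that the count of pred2=1 hits the target.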
I want to define high/middle/low as follows. The threshold for a high score should be chosen such that the total number of observations with pred2=1 is 7,000. By total I mean across all three groups (high, middle and low), i.e., 7,000 of the 200,000 observations should have pred2=1.
I want 50% of all observations (i.e. 100,000) to be in the high and low groups combined, i.e. the threshold for a low score should simply be set so that the low group contains 100,000 observations minus the number of observations in the high group.
However, I am having a lot of trouble with this, since it seems like a dynamic problem: changing the threshold for the high group also changes the threshold for the low group. This in turn affects the distribution of 1s and 0s in the middle group, since that group takes the value of pred.
Of course, I can find a more or less manual way of adjusting the thresholds back and forth, but there must exist a more elegant way.
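One direction that might avoid the manual back-and-forth, sketched in Python under some assumptions: sort the observations by score, then try every possible size of the high group. Since high and low together must hold a fixed number of observations, the size of the low group follows directly, and the number of 1s in the middle group can be read off a running sum of pred. The total number of pred2=1 can only stay the same or grow as the high group grows (an observation moved from middle to high goes from pred to 1, and one moved from low to middle goes from 0 to pred), so a single pass is enough. The function name find_high_count and its arguments are hypothetical, the sketch assumes all scores are distinct, and it returns the closest achievable total because an exact match may not exist.

```python
from itertools import accumulate


def find_high_count(scores, preds, n_extreme, target):
    """Return (n_high, total_ones) with total_ones closest to target.

    n_extreme is the combined size of the high and low groups.
    Assumes distinct scores, so any group size maps to a clean cut-off.
    """
    # Order observations from highest to lowest score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    p = [preds[i] for i in order]
    prefix = [0] + list(accumulate(p))  # prefix[k] = sum of the first k preds
    n = len(p)

    best = None
    for n_high in range(n_extreme + 1):
        n_low = n_extreme - n_high
        # Middle group = sorted positions n_high .. n - n_low - 1.
        middle_ones = prefix[n - n_low] - prefix[n_high]
        total = n_high + middle_ones
        if best is None or abs(total - target) < abs(best[1] - target):
            best = (n_high, total)
    return best


# The 10-observation example from this post: 5 obs in high/low, 4 ones wanted.
scores = [0.2, 1.4, 2.1, 14.5, 30.5, 37.5, 50.2, 75.7, 84.2, 99.7]
preds = [1, 0, 1, 0, 1, 0, 1, 1, 0, 1]
print(find_high_count(scores, preds, n_extreme=5, target=4))  # (1, 4)
```

With n_high known, the high cut-off would lie between the n_high-th and (n_high+1)-th largest scores, and the low cut-off analogously from the bottom. With tied scores a cut-off cannot split a tie, so the group sizes would need adjusting.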
I have created a small example here with 10 obs, of which 4 should have pred2=1.
The wanted variable shows the grouping (1 = low, 2 = middle, 3 = high). pred2 is 0 for low scores, equal to pred for middle scores, and 1 for high scores.
score | pred | wanted | pred2 |
 0.2  | 1    | 1      | 0     |
 1.4  | 0    | 1      | 0     |
 2.1  | 1    | 1      | 0     |
14.5  | 0    | 1      | 0     |
30.5  | 1    | 2      | 1     |
37.5  | 0    | 2      | 0     |
50.2  | 1    | 2      | 1     |
75.7  | 1    | 2      | 1     |
84.2  | 0    | 2      | 0     |
99.7  | 1    | 3      | 1     |
So here, the thresholds would be as follows:
0.0-30.0 = low (4 obs)
>30.0-85.0 = middle
>85.0 = high (1 obs)
And so, the low and high groups together constitute 50% of the observations, and the total number of observations with pred2=1 is 4, as desired.
Thanks for your time.