
  • Setting "dynamic" threshold

    Dear all,

    I have a dataset with 200,000 obs and the following variables:

    score (continuous: 0-100), pred (binary: 0/1).

    I want to create a binary variable, pred2, that behaves as follows: a) if score is high, pred2=1; b) if score is middle, pred2=pred; c) if score is low, pred2=0.

    I want to define high/middle/low as follows: the threshold for a high score should be such that the total number of observations with pred2=1 is 7,000. By total I mean across the high, middle, and low groups together, i.e., 7,000 of the 200,000 observations should have pred2=1.

    I want 50% of all observations (i.e. 100,000) to be in the high and low groups combined, i.e. the number of observations in the low group should simply be 100,000 minus the number of observations in the high group.

    However, I have a lot of trouble with this, since it seems like a dynamic problem: changing the threshold for the high group affects the threshold for the low group. This in turn affects the distribution of 1s and 0s in the middle group, since that group takes the value of pred.

    Of course, I can find a more or less manual way to keep adjusting the thresholds back and forth, but there must exist a more elegant way.



    I have created a small example here with 10 obs, of which 4 should have pred2=1.

    The variable wanted shows the grouping (1 = low, 2 = middle, 3 = high), and pred2 takes the value 0 for low scores, pred for middle scores, and 1 for high scores.
    score   pred   wanted   pred2
      0.2      1        1       0
      1.4      0        1       0
      2.1      1        1       0
     14.5      0        1       0
     30.5      1        2       1
     37.5      0        2       0
     50.2      1        2       1
     75.7      1        2       1
     84.2      0        2       0
     99.7      1        3       1


    So here, the thresholds would be as follows:
    0.0-30.0 = low (4 obs)
    >30.0-85.0 = middle
    >85.0 = high (1 obs)


    So the low and high groups together make up 50% of the observations, and the total number of pred2=1 is 4, as desired.


    Thanks for your time.
    Last edited by Sara Hansen; 01 Mar 2023, 05:52.

  • #2
    I would spend time creating a reproducible example with a variable "wanted" illustrating what is required.

    i.e., in total 7,000 observations out of the total 200,000 observations should have pred2=1.
    Of course, scale this down, e.g., 7 observations out of 20. In your actual application, you can adapt suggestions to your larger values.



    • #3
      Originally posted by Andrew Musau
      I would spend time creating a reproducible example with a variable "wanted" illustrating what is required.
      Of course, scale this down, e.g., 7 observations out of 20. In your actual application, you can adapt suggestions to your larger values.

      Thank you, I have done this now.



      • #4
        I take it that if at least half of the observations are classified as high, then none is classified as low, since you do not say what to do in that case. But perhaps you choose a threshold such that this never happens. Here's how I'd approach it.


        Code:
        * Example generated by -dataex-. For more info, type help dataex
        clear
        input float score byte pred
          .2 1
         1.4 0
         2.1 1
        14.5 0
        30.5 1
        37.5 0
        50.2 1
        75.7 1
        84.2 0
        99.7 1
        end
        
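        assert !missing(score)
        sort score
        * score cutoff for the high group; this is the only value you change
        local threshold 85.0
        qui count if score>`threshold'
        * size of the low group: half the sample minus the high group, rounded, never negative
        local low= max(0, int(`=_N/2'- r(N)+0.5))
        * 1 = low, 2 = middle, 3 = high
        gen classification= cond(score>`threshold', 3,cond(_n<=`low', 1, 2))
        * high -> 1, low -> 0, middle -> pred
        gen wanted= cond(classification==3, 1, cond(classification==1, 0, pred))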
        So you only change the threshold value. One issue though with this approach is that tied scores may be classified as both low and middle due to this criterion:

        I want 50% of all observations (i.e. 100,000) to be in group high/low
        Modifications are possible if you specify extra rules.
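
To see whether the tie issue matters in a given dataset, one option is to flag score values whose tied observations ended up in different groups. The sketch below uses made-up data (not the example from this thread) with a tie at score 3 that straddles the low/middle boundary. The variable name tie_split is illustrative.

```stata
* Hypothetical data: two observations share score 3
clear
input float score byte pred
1 1
2 0
3 1
3 0
4 1
5 0
6 1
7 1
end

sort score
local threshold 6.5
qui count if score > `threshold'
local low = max(0, int(_N/2 - r(N) + 0.5))
gen classification = cond(score > `threshold', 3, cond(_n <= `low', 1, 2))

* Within each score value, compare the smallest and largest group code;
* they differ only when tied observations were split across groups
bysort score (classification): gen byte tie_split = classification[1] != classification[_N]
count if tie_split
list score pred classification if tie_split
```

If the count is zero, the rank-based cut did not separate any tied scores. If it is positive, an extra rule is needed for the flagged observations.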

        Result for the example data with threshold 85.0:

        Code:
        . l, sepby(classification)
        
             +----------------------------------+
             | score   pred   classi~n   wanted |
             |----------------------------------|
          1. |    .2      1          1        0 |
          2. |   1.4      0          1        0 |
          3. |   2.1      1          1        0 |
          4. |  14.5      0          1        0 |
             |----------------------------------|
          5. |  30.5      1          2        1 |
          6. |  37.5      0          2        0 |
          7. |  50.2      1          2        1 |
          8. |  75.7      1          2        1 |
          9. |  84.2      0          2        0 |
             |----------------------------------|
         10. |  99.7      1          3        1 |
             +----------------------------------+
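
The threshold above is still set by hand. One way to pick it automatically, sketched here under assumptions, is to count the 1s that each possible size of the high group would produce and take the smallest size that reaches the target. Moving one more observation into the high group also moves one observation from low to middle, so the number of 1s never decreases as the high group grows. The names target, half, cum, h, n_ones, nhigh, group, and pred2 are illustrative; groups are assigned by rank rather than by a score cutoff, and tied scores are still split arbitrarily.

```stata
clear
input float score byte pred
  .2 1
 1.4 0
 2.1 1
14.5 0
30.5 1
37.5 0
50.2 1
75.7 1
84.2 0
99.7 1
end

* desired number of observations with pred2==1 (7000 in the real data)
local target 4

assert !missing(score)
sort score
* combined size of the high and low groups
local half = round(_N/2)

* running count of pred==1 in score order
gen long cum = sum(pred)
* candidate sizes of the high group: 0, 1, ..., half
gen long h = _n - 1 if _n <= `half' + 1
* 1s for each candidate: h from the high group plus pred==1 in the
* middle rows (half-h+1 through _N-h); the low group contributes none
gen long n_ones = h + cum[_N - h] - cond(`half' - h > 0, cum[`half' - h], 0) if !missing(h)

* smallest high group that reaches the target
summarize h if n_ones >= `target' & !missing(n_ones), meanonly
local nhigh = r(min)
assert !missing(`nhigh')

gen byte group = cond(_n > _N - `nhigh', 3, cond(_n <= `half' - `nhigh', 1, 2))
gen byte pred2 = cond(group == 3, 1, cond(group == 1, 0, pred))
count if pred2 == 1
display "high obs: `nhigh'; pred2==1: " r(N)
```

For the 10 observations this selects one high observation, which matches the manual threshold of 85.0. Each extra high observation can add 0, 1, or 2 to the count, so an exact hit on the target is not guaranteed; the sketch takes the first size that reaches or exceeds it.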
        Last edited by Andrew Musau; 01 Mar 2023, 14:13.
