
  • How to obtain the bin, and the corresponding density, that each observation in a plotted histogram belongs to?

    Hello everyone,


    Suppose I plot a histogram:
    Code:
    clear
    set obs 10
    
    g z = _n
    replace z = 5 if _n > 5
    
    hist z
    Given the plotted histogram, I would like to generate two new variables:
    1. bin, giving the bin which a given observation belongs to.
    2. density, giving the density of the bin which the observation belongs to.
    Generated manually for the plotted histogram, these two variables would be:
    Code:
    g correct_bin = 1 if inrange(_n, 1, 2)
    replace correct_bin = 2 if _n == 3
    replace correct_bin = 3 if _n >= 4
    
    g correct_density = 0.15 if inrange(_n, 1, 2)
    replace correct_density = 0.075 if _n == 3
    replace correct_density = 0.525 if _n >= 4
    I have tried using the command twoway__histogram_gen to create a solution. However, while my solution works in the above case, it does not seem to work when, for example:
    • Bins aren't adjacent to each other, for example bin 1 = [1,2) and bin 2 = [5,6)
    • The sample size grows and the values of the z variable are continuous, in which case numerical issues quickly arise
    I suspect a combination of twoway__histogram_gen and egen cut could be used to generate a correct solution. Below is my attempt, which works for my toy example. I first outline the ideas and then provide the code:
    1. Use twoway__histogram_gen to find the midpoint of each bin.
    2. Adjust the midpoints to be the start points of the bins.
    3. Create new variables x_v, each constant at the start point of bin v.
    4. Check which interval [x_v, x_{v+1}) each observation belongs to.[a]
    5. Find the corresponding density of that bin.
    Here is the code with each step labelled as in the description above:
    Code:
    * 1, finding midpoints
    twoway__histogram_gen z, gen(y x)
    
    * 2, adjusting midpoints to start points
    local adjust = (x[2] - x[1]) / 2
    replace x = x - `adjust'
    
    * 3, generating variables constant to startpoints
    count if x != .
    local N = r(N)
    forvalues v = 1/`=`N'+1' {
        g x_`v' = x[`v']
    }
    
    * 4, finding bin of each observation
    g new_bin = .
    forvalues v = 1/`N' {
        replace new_bin = `v' if x_`v' <= z & z < x_`=`v'+1'
    }
    
    * 5, finding density of the bin
    g new_density = .
    forvalues v = 1/`N' {
        replace new_density = y[`v'] if new_bin == `v'
    }
    Checking that this gives the correct solution:
    Code:
    assert correct_bin == new_bin
    assert correct_density == new_density
    Finally, note that by browsing the data we see rounding already becoming a slight problem since we have x_1 == .99999994 instead of x_1 == 1.
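    One way to sidestep the midpoint adjustment, offered as a sketch under assumptions rather than a definitive fix, is to compute the bin index arithmetically in double precision. The sketch assumes the histogram uses 3 equal-width bins starting at the minimum of z, which matches the manual values above for this toy data. The scalar names h_start, h_width and h_total and the variables bin_sketch and density_sketch are made up for illustration.

    ```stata
    clear
    set obs 10
    generate double z = _n
    replace z = 5 if _n > 5

    * Assumption: 3 equal-width bins starting at min(z), as in the toy histogram
    local nbins = 3
    quietly summarize z
    scalar h_start = r(min)
    scalar h_width = (r(max) - r(min)) / `nbins'
    scalar h_total = r(N)

    * Closed-open intervals; the maximum is folded into the last bin
    generate bin_sketch = min(floor((z - h_start) / h_width) + 1, `nbins')

    * Density = (count in bin) / (total count * bin width)
    bysort bin_sketch: generate double density_sketch = _N / (h_total * h_width)

    * Compare against the manually derived values, with a tolerance
    assert abs(density_sketch - 0.15)  < 1e-8 if bin_sketch == 1
    assert abs(density_sketch - 0.075) < 1e-8 if bin_sketch == 2
    assert abs(density_sketch - 0.525) < 1e-8 if bin_sketch == 3
    ```

    Because the bin index comes straight from z, no start points need to be stored in float variables. This does not cover histograms drawn with a different number of bins, start or width; those would have to be supplied in place of the assumed values.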


    Thanks in advance for help!
    Simon


    [a] I'm not sure that Stata's histogram bins are indeed of the form [x_v, x_{v+1}) (i.e., closed-open), but my investigations indicate this is true (with the final bin having endpoint infinity).

  • #2
    I found a good enough answer here: just estimate the density directly using kdensity instead.
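
    A minimal sketch of what this could look like, assuming the goal is a density value at each observation rather than the height of a histogram bar. The at() option evaluates the estimate at the observed values of z; the variable name z_kd is made up for illustration, and the default kernel and bandwidth are used.

    ```stata
    clear
    set obs 10
    generate double z = _n
    replace z = 5 if _n > 5

    * Kernel density estimate evaluated at each observation's own z value
    kdensity z, nograph generate(z_kd) at(z)

    list z z_kd
    ```

    These values are a smoothed estimate, so they will generally differ from the histogram densities, and they do not assign observations to bins.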
