Obtain which bin and corresponding density each observation in plotted histogram belongs to?

Simon Almerstrom Przybyl

Join Date: Jan 2022

Posts: 3
#1

27 Jan 2022, 16:43

Hello everyone,

Suppose I plot a histogram:

Code:

clear
set obs 10
g z = _n
replace z = 5 if _n > 5
hist z

Given the plotted histogram, I would like to generate two new variables:
- bin, giving the bin to which a given observation belongs.
- density, giving the density of the bin to which the observation belongs.

Generating these two variables manually for the plotted histogram gives:

Code:

g correct_bin = 1 if inrange(_n, 1, 2)
replace correct_bin = 2 if _n == 3
replace correct_bin = 3 if _n >= 4

g correct_density = 0.15 if inrange(_n, 1, 2)
replace correct_density = 0.075 if _n == 3
replace correct_density = 0.525 if _n >= 4

I have tried using the command twoway__histogram_gen to create a solution. However, while my solution works in the above case, it does not seem to work in cases such as these:
- The non-empty bins are not adjacent, for example bin 1 = [1,2) and bin 2 = [5,6).
- The sample size grows and the values of z are continuous, in which case numerical issues quickly arise.

I suspect a combination of twoway__histogram_gen and egen cut could be used to generate a correct solution. Below is my attempt, which works for my toy example. I first outline the ideas and then provide the code:
1. Use twoway__histogram_gen to find the midpoint of each bin.
2. Adjust the midpoints to be the start points of the bins.
3. Create new variables x_v, each constant and equal to the start point of bin v.
4. Check which interval [x_v, x_{v+1}) each observation belongs to.^a
5. Find the corresponding density of that bin.

Here is the code with each step labelled as in the description above:

Code:

* 1, finding midpoints
twoway__histogram_gen z, gen(y x)

* 2, adjusting midpoints to start points
local adjust = (x[2] - x[1]) / 2
replace x = x - `adjust'

* 3, generating variables constant to startpoints
count if x != .
local N = r(N)
forvalues v = 1/`=`N'+1' {
    g x_`v' = x[`v']
}

* 4, finding bin of each observation
g new_bin = .
forvalues v = 1/`N' {
    replace new_bin = `v' if x_`v' <= z & z < x_`=`v'+1'
}

* 5, finding density of the bin
g new_density = .
forvalues v = 1/`N' {
    replace new_density = y[`v'] if new_bin == `v'
}

Checking that this gives the correct solution:

Code:

assert correct_bin == new_bin
assert correct_density == new_density

Finally, note that browsing the data shows rounding already becoming a slight problem: we have x_1 == .99999994 instead of x_1 == 1.
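One possible way to reduce the rounding, given here only as a minimal sketch under stated assumptions, is to compute the bin index from the bin origin and width instead of from float start points. It assumes that twoway__histogram_gen leaves r(start), r(width) and r(bin) behind, as documented for histogram, and that the default continuous binning is used. The names h_dens, h_mid, calc_bin, calc_density, h_start, h_width and h_bins are made up for illustration.

```stata
clear
set obs 10
generate double z = _n
replace z = 5 if _n > 5

* Bar heights go to h_dens, bin midpoints to h_mid (illustrative names)
twoway__histogram_gen z, gen(h_dens h_mid)

* ASSUMPTION: these r() results are available after the call
scalar h_start = r(start)
scalar h_width = r(width)
scalar h_bins  = r(bin)

* Bin index from origin and width; the maximum falls on the upper edge
* of the last bin, so cap the index at the number of bins
generate long calc_bin = floor((z - scalar(h_start)) / scalar(h_width)) + 1
replace calc_bin = scalar(h_bins) if calc_bin > scalar(h_bins) & !missing(z)

* Bar heights are stored in the first h_bins observations, so look them up by row
generate double calc_density = h_dens[calc_bin]
```

Because no loop over start points is needed, empty bins between occupied ones should not matter here. Values of z lying exactly on an interior bin edge could still be classified differently from the plotted histogram because of floating-point division, so this is worth checking against a manual tabulation.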

Thanks in advance for help!
Simon

^a I'm not sure that Stata's histogram bins are indeed of the form [x_v, x_{v+1}) (i.e., closed-open), but my investigations indicate this is true (with the final bin having endpoint infinity).
Tags: data, graphics, histogram
Simon Almerstrom Przybyl

Join Date: Jan 2022

Posts: 3
#2

28 Jan 2022, 06:01

I found a good enough answer here: just estimate the density directly with kdensity instead.
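As a rough sketch of what that might look like on the toy data from #1 (the variable names z_at and kd are made up for illustration, and the default kernel and bandwidth are assumed to be acceptable):

```stata
clear
set obs 10
generate double z = _n
replace z = 5 if _n > 5

* Evaluate the kernel density estimate at each observation's own value of z.
* z_at receives the evaluation points (a copy of z), kd the density estimates.
kdensity z, at(z) generate(z_at kd) nograph
```

Note that kd is a smoothed density estimate at each point, not the height of a histogram bar, so it will not reproduce the values 0.15, 0.075 and 0.525 from the histogram.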