Mapping two distributions

Stephen Okiya

Join Date: Feb 2025

Posts: 280
#1

11 Feb 2022, 10:12

Hi Stata Users,
I apologize in advance if this is not the appropriate place for this question. I am asking here because this forum has many people with strong theoretical and practical backgrounds who may be able to offer guidance.
I have a distribution of an asset index and would like to map it to a simulated distribution. The figures below show the two distributions.

Below is reproducible code.

Code:

*** Load data
use asset_index.dta, clear
kdensity asset_index, title("Actual Distribution") name("graph_1", replace)

*** Simulate log-normal distribution
local meani = 71.53
local gini = 0.4325251
set seed 158961
clear
set obs 10000
* For a lognormal, Gini = 2*Phi(sigma/sqrt(2)) - 1, so sigma needs the
* inverse normal CDF, invnormal(), rather than 1/normal(1)
gen sigma = sqrt(2)*invnormal((`gini' + 1)/2)
gen mu = log(`meani') - (sigma^2)/2
gen lognormal_inc = exp(rnormal(mu, sigma))
kdensity lognormal_inc, title("Simulated Distribution") name("graph_2", replace)

*** Combine and export
graph combine graph_1 graph_2
graph export "${gsdOutput}/graphs/graphs.jpg", replace
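
For reference, here is a sketch of the standard lognormal relations that link a target mean and Gini coefficient to the log-scale parameters. The symbols Y, G, mu and sigma are notation introduced here for illustration.

```latex
\begin{aligned}
G &= 2\,\Phi\!\left(\frac{\sigma}{\sqrt{2}}\right) - 1
  &&\Longrightarrow\quad
  \sigma = \sqrt{2}\,\Phi^{-1}\!\left(\frac{G + 1}{2}\right), \\
E[Y] &= \exp\!\left(\mu + \frac{\sigma^{2}}{2}\right)
  &&\Longrightarrow\quad
  \mu = \ln E[Y] - \frac{\sigma^{2}}{2}.
\end{aligned}
```

Here \Phi is the standard normal distribution function. In Stata, \Phi is normal() and \Phi^{-1} is invnormal().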

Attached is the dataset

I am wondering what could be the best way to do this.

Thanks in advance!

Attached Files

asset_index.dta (72.2 KB, 1 view)
Nick Cox

Join Date: Mar 2014

Posts: 35211
#2

11 Feb 2022, 11:49

I am a big fan of kernel density estimation when it works well, but not otherwise. Whenever a distribution is highly skewed, that can be a problem unless the procedure is adaptive in some sense (e.g. if smoothing is first on log scale). Whenever a distribution is heterogeneous, sometimes kernel density estimation smooths out noise, and sometimes it just hides structure that should be worried about. When there is a sharp bound at 0, or anywhere else, kernel density estimation smears some probability mass beyond the bound unless you do something special.
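
As a minimal sketch of what smoothing first on the log scale could look like, the code below estimates the density of log(y) and back-transforms it with the change-of-variables factor 1/y. The toy data and all variable names are hypothetical. It assumes strictly positive values, so exact zeros would need separate handling.

```stata
* Toy data standing in for a skewed, strictly positive variable (hypothetical)
clear
set seed 2022
set obs 2000
generate double y = exp(rnormal(3, 1))

* Smooth on the log scale, where the distribution is far less skewed
generate double ly = log(y)
kdensity ly, generate(x_log d_log) n(200) nograph

* Back-transform the grid; if L = log(Y), then f_Y(y) = f_L(log y) / y
generate double x_orig = exp(x_log)
generate double d_orig = d_log / x_orig

* Every grid point is positive, so no mass is smeared below zero
line d_orig x_orig, sort xtitle("y") ytitle("Density") ///
    name(kd_backtransformed, replace)
```

Because the grid is exp() of the log-scale grid, the back-transformed estimate cannot place density at or below 0, which is the boundary problem described above.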

In this case a lognormal is somehow a reference distribution. Are there any other candidates?

Although I usually won't download .dta files (please see FAQ Advice #12) I did peek in this case. The data show a spike of exact zeros, which rules out a plain log transformation followed by a normal quantile plot.

I used log(y + 1) to cope with zeros in y. This may seem adhockery, but I've grown to like this function more, as it behaves like y when y is zero or small and positive and like log y for y >> 0. In many languages, including Stata, this is implemented as log1p().

Below are normal quantile plots for the original scale and for log1p(y) = log(y + 1). Evidently the data are closer to lognormal than normal, but much else has to be said.
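
A sketch of how such plots can be produced is shown here. It runs on toy data with a spike of zeros; the data and the names asset_index and log1p_asset are hypothetical stand-ins for the attached file.

```stata
* Toy data (hypothetical): 10% exact zeros, otherwise lognormal
clear
set seed 2022
set obs 1000
generate double asset_index = cond(runiform() < 0.1, 0, exp(rnormal(3, 1)))

* log1p(y) = log(y + 1) is defined at zero, unlike log(y)
generate double log1p_asset = log1p(asset_index)

* Normal quantile plots on the original and the log1p scale
qnorm asset_index, name(q_raw, replace) title("Original scale")
qnorm log1p_asset, name(q_log1p, replace) title("log1p scale")
graph combine q_raw q_log1p
```

On the log1p scale the zeros stay at zero, so the spike shows up as a flat run at the lower end of the plot instead of being dropped as missing.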
Comment
Stephen Okiya

Join Date: Feb 2025

Posts: 280
#3

11 Feb 2022, 12:26

Hi Nick Cox

Thanks so much for taking the time to look at my data. I also appreciate you referring me to FAQ Advice #12. I have learnt something new about the count option in dataex.

The literature suggests that an asset index has the same shape as a simulated distribution of income. The values in the dataset I shared are PCA scores that I shifted by subtracting the minimum, so that the lowest score becomes 0 and all other values are positive. This may be a subjective approach. The code I used for the transformation is shown below.

Code:

sum asset_index
replace asset_index = asset_index - `r(min)'

The dataex output seems truncated due to the large number of observations. Please pardon me for attaching the dataset a second time.
Attached Files

asset_index_v2.dta (72.2 KB, 1 view)
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35211
#4

11 Feb 2022, 20:34

Naturally I didn't know about the origin of these data. It is still true that you have a spike of values all equal to the minimum, which is qualitatively at odds with the idea of a lognormal. If you go back even further to the variables that went into the PCA, I guess wildly at some ordinal scale answers and so, notably, a bunch of people all giving the lowest possible answers on a bundle of questions.
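
A quick way to gauge the size of such a spike is to count the observations that sit exactly at the minimum. This is a sketch on toy data; the data and the choice of 150 tied observations are hypothetical.

```stata
* Toy data (hypothetical): 150 observations tied at the minimum of 0
clear
set seed 2022
set obs 1000
generate double asset_index = cond(_n <= 150, 0, exp(rnormal(3, 1)))

* Count observations exactly equal to the minimum
summarize asset_index, meanonly
count if asset_index == r(min)
display "Share at minimum: " %5.3f r(N) / _N
```

A lognormal would put essentially no observations at any single value, so a sizeable share here is a direct measure of the mismatch.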
Comment
Stephen Okiya

Join Date: Feb 2025

Posts: 280
#5

12 Feb 2022, 03:37

You are right: some households may own similar assets and hence have similar PCA scores.
Any ideas on the mapping?
Thanks in advance!
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35211
#6

12 Feb 2022, 04:57

Sorry, I don't have further ideas. As so often it is hard to know where to place the blame -- on the data being coarse-grained because that is the only practical choice, or on the hypothesis being only a rough approximation, which is not surprising. Or both.
Comment
Stephen Okiya

Join Date: Feb 2025

Posts: 280
#7

12 Feb 2022, 08:37

Thanks so much, Nick Cox. Your insights were really helpful and helped me learn a couple of things.
Comment
