  • Mapping two distributions

    Hi Stata Users,
    I apologize in advance if this is not the appropriate place for this question. I am asking here because this forum has many people with strong theoretical and practical backgrounds who may be able to offer guidance.
    I have a distribution of an asset index and would like to map it onto a simulated distribution. The figures below show the two distributions.

    [Figure: graphs.jpg, kernel density plots of the actual and simulated distributions]

    Below is reproducible code.

    Code:
        
    *** Load data     
        use asset_index.dta, clear
        kdensity asset_index, title("Actual Distribution") name("graph_1", replace)
        
        
    *** Simulate log-normal distribution
        local meani = 71.53 
        local gini = 0.4325251
        
        set seed 158961
        
        clear
        set obs 10000
        
        * For a lognormal, Gini = 2*normal(sigma/sqrt(2)) - 1, so sigma needs the inverse normal, invnormal()
        gen sigma = sqrt(2)*invnormal((`gini' + 1)/2)
        gen mu = log(`meani') - (sigma^2)/2
        gen lognormal_inc = exp(rnormal(mu, sigma))
            
        kdensity lognormal_inc, title("Simulated Distribution") name("graph_2", replace)
        
        graph combine graph_1 graph_2
        graph export "${gsdOutput}/graphs/graphs.jpg", replace
    Attached is the dataset

    I am wondering what could be the best way to do this.

    Thanks in advance!
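
    For reference, a lognormal distribution satisfies Gini = 2*Phi(sigma/sqrt(2)) - 1, where Phi is the standard normal distribution function, so sigma can be recovered with invnormal(). The sketch below is a minimal numerical check that assumes the same mean and Gini values as the code above; the scalar names are illustrative only.

```stata
* Minimal check of the lognormal Gini-sigma relation (scalar names are illustrative)
clear
set seed 158961
set obs 10000

scalar meani = 71.53
scalar gini  = 0.4325251

* Invert Gini = 2*normal(sigma/sqrt(2)) - 1 for sigma
scalar sigma = sqrt(2)*invnormal((gini + 1)/2)
scalar mu    = ln(meani) - sigma^2/2

* Round trip: the implied Gini should reproduce the input value
scalar implied_gini = 2*normal(sigma/sqrt(2)) - 1
display "sigma = " sigma
display "implied Gini = " implied_gini

* Simulate and compare the sample mean with the target mean
generate double lognormal_inc = exp(rnormal(mu, sigma))
summarize lognormal_inc
```

    The sample mean reported by summarize should be close to the target mean, although it will not match exactly in a finite sample.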

  • #2
    I am a big fan of kernel density estimation when it works well, but not otherwise. Whenever a distribution is highly skewed, that can be a problem unless the procedure is adaptive in some sense (e.g. if smoothing is first on log scale). Whenever a distribution is heterogeneous, sometimes kernel density estimation smooths out noise, and sometimes it just hides structure that should be worried about. When there is a sharp bound at 0, or anywhere else, kernel density estimation smears some probability mass beyond the bound unless you do something special.
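
    One way to smooth on the log scale first can be sketched as follows. This is a minimal illustration on simulated, strictly positive data, not on the dataset in this thread, and all variable names are assumptions. The density of y is recovered from the density of log y by dividing by y.

```stata
* Sketch: kernel density estimated on the log scale and back-transformed
* (simulated, strictly positive data; all variable names are illustrative)
clear
set seed 2022
set obs 1000
generate double y = exp(rnormal(3, 0.8))
generate double logy = ln(y)

* Estimate the density of log(y) at 200 grid points, without drawing a graph
kdensity logy, generate(grid_log dens_log) n(200) nograph

* Change of variables: f_Y(y) = f_logY(log y) / y
generate double grid_y = exp(grid_log)
generate double dens_y = dens_log / grid_y

* The back-transformed grid is strictly positive, so no mass is placed below zero
line dens_y grid_y, xtitle("y") ytitle("Density")
```

    This only works as written when every value is strictly positive; exact zeros would need separate treatment.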

    In this case a lognormal is somehow a reference distribution. Are there any other candidates?

    Although I usually won't download .dta files (please see FAQ Advice #12) I did peek in this case. The data show a spike of exact zeros, which rules out a plain log transformation followed by a normal quantile plot.

    I used log(y + 1) to cope with the zeros in y. This may seem ad hoc, but I've grown to like this function more: it behaves like y when y is zero or small and positive, and like log y for y >> 0. In many languages, including Stata, it is implemented as log1p().
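
    The behaviour described here can be seen in a small table. The sketch below uses made-up values of y and writes the transformation as ln(1 + y) so that it does not depend on log1p() being available in the installed version; variable names are illustrative.

```stata
* Sketch: how log(1 + y) behaves across the range of y
* (illustrative values; ln(1 + y) is used in place of log1p(y))
clear
input double y
0
0.01
0.1
1
10
100
1000
end

generate double log1p_y = ln(1 + y)
generate double log_y   = ln(y)      // missing for y == 0

list, noobs
```

    For y = 0 the transformed value is exactly 0 while ln(y) is missing; for small y the transformed value is close to y, and for large y it is close to ln(y).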

    Below are normal quantile plots for the original scale and for log1p(y) = log(y + 1). Evidently the data are closer to lognormal than normal, but much else has to be said.

    [Figure: okiya2.png, normal quantile plots on the original scale and for log(y + 1)]
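
    Plots of this kind can be produced with qnorm. The sketch below is not the code used for the figure; it runs on simulated data with a spike of zeros, and the variable and graph names are assumptions.

```stata
* Sketch: normal quantile plots on the original and log(y + 1) scales
* (simulated data with a spike of zeros; names are illustrative)
clear
set seed 2022
set obs 1000
generate double y = cond(runiform() < 0.1, 0, exp(rnormal(3, 0.8)))
generate double log1p_y = ln(1 + y)

qnorm y, title("Original scale") name(q_raw, replace)
qnorm log1p_y, title("log(y + 1)") name(q_log, replace)
graph combine q_raw q_log
```

    A spike of zeros shows up as a flat run of points at the lower left of each plot.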



    • #3
      Hi Nick Cox

      Thanks so much for taking the time to look at my data. I also appreciate you referring me to FAQ Advice #12. I have learnt something new about the count option of dataex.
      The literature suggests that an asset index has the same shape as a simulated distribution of income. The values in the dataset I shared are PCA scores shifted so that the minimum is 0 and all other values are positive. This may be a subjective approach. The code I used for the transformation is shown below.

      Code:
      sum asset_index
      replace asset_index = asset_index - `r(min)'
      The dataex output seems truncated because of the large number of observations. Please pardon me for attaching the dataset a second time.


      • #4
        Naturally I didn't know about the origin of these data. It is still true that you have a spike of values all equal to the minimum, which is qualitatively at odds with the idea of a lognormal. If you go back even further to the variables that went into the PCA, I guess wildly at some ordinal scale answers and so, notably, a bunch of people all giving the lowest possible answers on a bundle of questions.
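
        The size of such a spike can be quantified directly. The sketch below uses simulated scores as a stand-in for the PCA-based index (an assumption, as are the variable names) and applies the same shift by the minimum described in #3.

```stata
* Sketch: how large is the spike at the minimum?
* (simulated scores stand in for the PCA-based index; names are illustrative)
clear
set seed 2022
set obs 1000
generate double score = cond(runiform() < 0.15, -2, rnormal())
replace score = -2 if score < -2

* Shift so that the minimum is zero, as in #3
summarize score, meanonly
replace score = score - r(min)

* Count and share of observations sitting exactly at the minimum
summarize score, meanonly
count if score == r(min)
display "share at minimum = " %5.3f r(N)/_N
```

        A continuous distribution such as the lognormal would put essentially no observations at any single value, so a sizeable share here is a direct measure of the mismatch.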



        • #5
          You are right: some households may own similar assets and hence have the same PCA score.
          Any ideas on the mapping?
          Thanks in advance!



          • #6
            Sorry, I don't have further ideas. As so often it is hard to know where to place the blame -- on the data being coarse-grained because that is the only practical choice, or on the hypothesis being only a rough approximation, which is not surprising. Or both.



            • #7
              Thanks so much, Nick Cox. Your insights were really helpful and I learned a couple of things.
