  • Attempting to create a histogram with logarithmic axes

    Hello,

    I am attempting to create a frequency histogram of a variable (called M) with logarithmic x- and y-axes in Stata 16. The number of observations is 2.1 million.

    My original data are highly unequally distributed with 99% having a value of 0:
    [Image: ForForum1.png]

    Command for this was:
    Code:
    histogram M, frequency ylabel(0.000002 20 200 2000, angle(horizontal) grid glpattern(solid) gextend) xtitle(M)
    Logarithmising the x-axis is apparently not a problem: I used gen lM = log(M) to create my desired variable on a log scale.
    Afterwards, my distribution looks like this:
    [Image: ForForum1logx.png]

    Command for this:
    Code:
    histogram lM, frequency ylabel(0.000002 20 200 2000, angle(horizontal) grid glpattern(solid) gextend) xtitle(lM)
    The y-axis also changes, but that is most probably just because all the values where M=0 drop out after logarithmising them.
    Now I would like to also logarithmise the y-axis. I tried to simply use the yscale(log) option:

    Code:
    . histogram lM, frequency yscale(log) ylabel(0.000002 20 200 2000, angle(horizontal) grid glpattern(solid) gextend)
    (bin=43, start=1.0986123, width=.27069641)
    [Image: ForForum1ylog.png]

    However, the result is not what I expected. As you can see, the y-axis is behaving quite strangely: all the relevant ticks (20, 200, 2000) are basically on the same line at the top. The y-axis value of 0.000002, which I include for illustration, is just slightly below the others. What I wanted to create should look more like this graph (Source: Yasseri, T., Sumi, R., Rung, A., Kornai, A., & Kertész, J. (2012). Dynamics of conflicts in Wikipedia. PloS one, 7(6), e38869):
    [Image: ForForum graph yasseri.png]

    Here the relevant ticks (10, 100, 1000, 10000) are distributed evenly over the y-axis.
    Is it due to my data (and understanding of them) or did I just not manage to enter the right command to produce the desired outcome?

    PS: I tried to make the graphs smaller in this post, but for some reason it didn't work, sorry for that.
    PPS: For some reason the images are all being posted again at the end of my post. How do I avoid that without removing them entirely?
    Last edited by Pachakutik Yupangui; 01 Apr 2020, 21:54.

  • #2
    This seems misconceived to me.

    Values of zero are inconsistent with a logarithmic scale. Taking logarithms maps your zeros to missing and omits them from any display thereafter. Providing you realise that and that is what you want, then so be it.
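
    A quick way to see how many observations are affected, sketched here on the assumption that the variable is named M as in #1 (lM_check is a hypothetical name, chosen so as not to clash with the existing lM):

    Code:
    * how many zeros are there?
    count if M == 0
    * log(0) is returned as missing
    gen lM_check = log(M)
    count if missing(lM_check)
    If M has no missing or negative values of its own, the two counts should agree.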

    For all that you can find examples in published literature, bars on a logarithmic scale (so that there is an arbitrary base to the bars) are to me graphically obnoxious.

    The people who wrote the histogram commands agree with that opinion of mine. If you ask for log scale, that isn't banned but it makes no sense to Stata as Stata always tries to start the bars at zero. Stata's result that puzzled you is best interpreted as a signal implying that your request is inconsistent with how histograms work.

    The key principle to a histogram is that bar area has a direct interpretation as probability or frequency. Warping the frequency axis violates that principle.

    That said, there is a defence of this practice for counts: using the possible minimum of 1 with logarithm zero as base is not quite arbitrary.

    https://journals.sagepub.com/doi/abs...867X1801800116 gives a full discussion of this territory, or at least the fullest discussion I know.

    The recipe I suggest for your situation -- provided that you recognise that zeros can't be handled -- is

    1. Log the response.

    2. Use equal width bins on that scale to get frequencies and then log the frequencies.

    3. A scatter plot of log frequency against midpoint of bins.

    4. Label axes in terms of raw scale, not logarithms.

    Much of this is covered directly or indirectly in the paper cited.



    • #3
      Thank you for your quick reply.

      I was aware that logging my variable omits the zeros, as between figures 1 and 2 above. However, what I hadn't thought about was how Stata would display the graph with a logged y-axis when the frequency is 0 for some value M = i.

      For my purposes in the graph it doesn't make a big difference whether the values of M are 0 or 1, so I can just set all the values that are 0 to 1. That way I will no longer have any zeros in my data. However, the graph that results from using the same commands as above doesn't look much different.

      I am trying to follow your advice, as it seems to be what I was trying to accomplish.

      1. Log the response.
      If I understand correctly, this is what I did with gen lM = log(M).

      2. Use equal width bins on that scale to get frequencies ...
      This would be what I did with the command "histogram lM, frequency ...".

      ... and then log the frequencies.
      Here I don't know how to do this technically. I used "histogram lM, frequency yscale(log)", but apparently that is not what you meant. Is there a way to access the frequencies produced by the graph?

      3. A scatter plot of log frequency against midpoint of bins.
      I also don't know how to find the midpoints generated by the graph in Stata.

      Much of this is covered directly or indirectly in the paper cited.
      I knew the paper, but what I had found in it was mostly about the x-axis and the labelling of the y-axis. I didn't see how I could handle the problem of correctly warping the y-axis in the first place.

      PS: I noticed that there is a mistake in the second graph above. The x-axis should have the title "lM" (logged M).
      Last edited by Pachakutik Yupangui; 02 Apr 2020, 03:40.



      • #4
        No; you can't just fudge 0s to 1s because then the frequency of 1 will be wrong.

        The midpoint of a bin is the mean of its limits.

        You seem to be hung up on what histogram can and can't do, but once you have bins, you just count values.
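
        If the bins that histogram itself would use are still wanted, one possibility is twoway__histogram_gen, which leaves the bar heights and bin centres behind as variables. This is only a sketch: lM is the variable from #1, and freq and mid are hypothetical names for the new variables.

        Code:
        * store bar heights (as frequencies) and bin centres
        twoway__histogram_gen lM, frequency generate(freq mid)
        * empty bins cannot be shown on a logarithmic scale
        scatter freq mid if freq > 0, ysc(log) yla(, ang(h))
        The generated values occupy only the first few observations, one per bin.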

        As a simple example I use a dataset referred to in the paper cited in #2. (Access to the paper is not required; the dataset can be downloaded as one of the files for gr0072.)


        Code:
        use country_populations, clear
        gen log_pop = log10(pop)
        su
        gen log_pop_bin = floor(log_pop)
        bysort log_pop_bin : gen Frequency = _N
        gen log_pop_mid = log_pop_bin + 0.5
        
        forval j = 1/10 {
            label def x `j' "10{sup:`j'}", modify
        }
        label val log_pop_mid x
        
        scatter Freq log_pop_mid , ysc(log) xla(1/10, valuelabel) yla(1 2 5 10 20 50, ang(h)) xtitle(Country population)
        
        twoway bar Freq log_pop_mid , barw(1) base(1) ysc(log) xla(1/10, valuelabel) yla(1 2 5 10 20 50, ang(h)) xtitle(Country population)
        [Image: alternative_to_histogram.png]

        In my case the bins are of width 1 on log 10 scale: yours will be finer. I also give above the code for a "histogram" without approving or recommending it.


        EDIT In your case M goes up to about 400000 and a bin width of 0.1 on log 10 scale should give about 50 or 60 bins.

        Code:
        gen bin_base = 0.1 * floor(10 * log10(M))
        Your bin mid point is 0.05 higher and any bars should be of width 0.1 too.
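
        A sketch of how those pieces might fit together for your data, under stated assumptions: M is your variable, zeros and missing values are left out, bin_lower, bin_mid, Frequency and tag are illustrative names, and the axis labels are guesses to be adjusted to the range of your data.

        Code:
        * lower limit of each bin, width 0.1 on log 10 scale (same quantity as bin_base)
        gen bin_lower = 0.1 * floor(10 * log10(M)) if M > 0 & !missing(M)
        * midpoint is 0.05 higher
        gen bin_mid = bin_lower + 0.05
        * frequency is the number of observations in each bin
        bysort bin_lower : gen Frequency = _N if !missing(bin_lower)
        * plot each bin once only
        egen tag = tag(bin_lower)
        scatter Frequency bin_mid if tag, ysc(log) xla(0 "1" 1 "10" 2 "100" 3 "1000" 4 "10000" 5 "100000") yla(1 10 100 1000 10000, ang(h)) xtitle(M)
        The x-axis labels show M on its raw scale, in line with step 4 in #2.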
        Last edited by Nick Cox; 02 Apr 2020, 04:28.



        • #5
          Thank you for providing the code. I tried to apply it to the example data in #2, but even though the paper says that a data file is posted in the website directory of the issue (https://doi.org/10.1177/1536867X1801800116), I couldn't find it anywhere there. (I also tried to use the data from the original Wikipedia source directly, but I had some trouble destringing them correctly, so I moved on.)

          I then adapted the code for my data, and it seems to produce exactly what I was trying to accomplish. So thanks for your help.

          Edit: I guess I should have searched in Stata directly. There I can find the data files referred to in the paper.
          Last edited by Pachakutik Yupangui; 06 Apr 2020, 15:52.



          • #6
            Code:
            search gr0072, entry
            gives a link to follow. The paper was published in 2018, so your Stata must be fairly up-to-date.

            Alternatively

            Code:
            net sj 18-1 gr0072
            should get you started.
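
            Once the package description is shown, the data files can typically be copied to the current working directory with net get. A sketch, assuming the datasets are distributed as ancillary files of gr0072:

            Code:
            * point Stata at the Stata Journal package
            net sj 18-1 gr0072
            * copy ancillary files (datasets, do-files) to the working directory
            net get gr0072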
