  • Labels at scatter plot

    Dear All,

    I am always fascinated seeing how much useful information can be derived from even the simplest 2-way scatter plot, such as this Pisa chart:

    [Image: image_18080.png]

    Yet many Stata scatter plots with marker labels suffer from the same handicap, as in this one:
    [Image: auto-scatter-make.gif]

    I wonder if there is any procedure written for Stata's scatter plots that optimizes the placement of labels and adds connecting lines (as for "Norway" or "Belgium" on the first graph).

    If yes, please point me in the right direction.

    Thank you, Sergiy Radyakin

  • #2
    Ulrich Kohler wrote an egen function that is in egenmore (SSC). Here is the full documentation.

    mlabvpos(yvar xvar) [ , log polynomial(#) matrix(5x5 matrix) ] automatically generates a variable giving clock positions of marker labels given names of variables yvar and xvar
    defining the axes of a scatter plot. Thus the command generates a variable to be used in the scatter option mlabvpos().

    The general idea is to pull marker labels away from the data region. So, marker labels in the lower left of the region are at clock positions 7 or 8, and those in the upper right
    are at clock positions 1 or 2, etc. More precisely, taking the following rectangle as the data region, marker labels are placed as follows:

    +--------------+
    |11 12 12 12  1|
    |10 11 12  1  2|
    | 9  9 12  3  3|
    | 8  7  6  5  4|
    | 7  6  6  6  5|
    +--------------+

    Note that there is no attempt to prevent marker labels from overplotting, which is likely in any dataset with many observations. In such situations you might be better off simply
    randomizing clock positions with say ceil(uniform() * 12).
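
    A minimal sketch of that randomization in current Stata, assuming the auto dataset shipped with Stata and using runiform(), the newer name for uniform(). The variable name rclock and the seed are arbitrary choices for illustration only.

```stata
* Random clock positions 1..12 for marker labels (illustrative sketch)
sysuse auto, clear
set seed 12345                              // arbitrary seed, for reproducibility
generate byte rclock = ceil(12 * runiform())
scatter mpg weight, mlabel(make) mlabvposition(rclock)
```

    Setting a different seed gives a different arrangement, so it may be worth trying a few and keeping the least cluttered graph.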

    If yvar and xvar are highly correlated, then the clock positions are generated as follows (the general idea is the same, however):

    +--------------+
    |      12  1  3|
    |   12 12  3  4|
    |11 11 12  5  5|
    |10  9  6  6   |
    | 9  7  6      |
    +--------------+

    To calculate the positions, the x axis is first categorized into 5 equal intervals around the mean of xvar. Afterwards the residuals from regression of yvar on xvar are
    categorized into 5 equal intervals. Both categorized variables are then used to calculate the positions according to the first table above. The rule can be changed with the
    option matrix().
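
    The calculation just described can be sketched by hand, roughly as follows. This is not the code inside egenmore: it assumes equal-width bins over the observed range of each variable, which may differ from how mlabvpos() forms its intervals, and the names res, xcat, ycat, P and clock_sketch are made up for illustration.

```stata
* Hand-rolled approximation of the rule: bin x, bin the residuals, look up a 5 x 5 table
sysuse auto, clear
quietly regress mpg weight
predict double res, residuals

* Assumption: 5 equal-width bins over the observed range, numbered 1 (lowest) to 5 (highest)
summarize weight, meanonly
generate byte xcat = min(5, 1 + floor(5 * (weight - r(min)) / (r(max) - r(min))))
summarize res, meanonly
generate byte ycat = min(5, 1 + floor(5 * (res - r(min)) / (r(max) - r(min))))

* First table from the documentation; row 1 is the top of the data region
matrix P = (11,12,12,12,1 \ 10,11,12,1,2 \ 9,9,12,3,3 \ 8,7,6,5,4 \ 7,6,6,6,5)

generate byte clock_sketch = .
forvalues i = 1/5 {
    forvalues j = 1/5 {
        * highest residual bin (ycat == 5) maps to matrix row 1
        quietly replace clock_sketch = P[6 - `i', `j'] if ycat == `i' & xcat == `j'
    }
}
scatter mpg weight, mlabel(make) mlabvposition(clock_sketch)
```

    In practice the egen function is the easier route; the sketch is only meant to make the two-step categorization concrete.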

    log indicates that residuals from regression are to be calculated using the logarithms of xvar. This might be useful if the scatter shows a strong curvilinear relationship.

    polynomial(#) indicates that residuals are to be calculated from a regression of yvar on a polynomial of xvar. For example, use poly(2) if the scatter shows a U-shaped
    relationship.

    matrix(#) is used to change the general rule for the plot positions. The positions are specified by a 5 x 5 matrix, in which cell [1,1] gives the clock position of marker labels
    in the upper left part of the data region, and so forth. (Stata 8.2 required.)

    . egen clock = mlabvpos(mpg weight)
    . scatter mpg weight, mlab(make) mlabvpos(clock)
    . egen clock2 = mlabvpos(mpg weight), matrix(11 1 12 11 1 \ 10 2 12 10 2 \ 9 3 12 9 3 \ 8 4 6 8 4 \ 7 5 6 7 5)
    . sc mpg weight, mlab(make) mlabvpos(clock2)

    I add a few general tips, with some risk of stating the obvious.

    * Labels for every observation can be too much information and defeat the object of adding detail informatively but gently. The reader may see a mess and give up in despair or distaste. Sometimes you should try labelling everything first and then decide on marker labels for a smaller number of cases that are especially interesting or important.

    * Long labels cause most problems. Labels that are two or three characters long -- or even one character long -- are easiest and may be well-known in your field. Just about every user of data for the states of the United States -- perhaps even for DC, Puerto Rico, ... too -- has no problems with WY OH MI (or would benefit from learning that scheme). See https://www.stata-journal.com/sjpdf....iclenum=gr0023 for an example of one-character labels, although for classes rather than individual observations.

    * When you have short marker labels, you don't also need the marker itself for most purposes. That may be true for longer labels as well. Suppress the marker and use mlabpos(0) to centre (center) the marker label where the marker would have appeared.

    * The approach may naturally be combined with others, e.g. use of by() to subdivide into panels, or the ideas at https://www.statalist.org/forums/for...ailable-on-ssc
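
    Two of the tips above can be sketched on datasets shipped with Stata. The cutoffs used to pick "interesting" cars, and the names lab and pcturban, are arbitrary assumptions for illustration, not recommendations.

```stata
* Tip 1: label only a subset of observations (cutoffs are arbitrary)
sysuse auto, clear
generate lab = make if mpg >= 30 | weight >= 4500
scatter mpg weight, mlabel(lab)

* Tip 2: short labels in place of markers, centred on the data point
sysuse census, clear
generate pcturban = 100 * popurban / pop
scatter medage pcturban, msymbol(none) mlabel(state2) mlabposition(0)
```

    In the first graph, observations with an empty lab still show their marker but carry no text, so the full point cloud stays visible.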

    Last edited by Nick Cox; 08 May 2020, 03:17.
