Scatter Plot and normalizing variables (logs)

Abdullah Ijaz

Join Date: Sep 2017

Posts: 97
#1


18 Jun 2020, 07:15

Hi,

I am working on a panel data analysis (fixed effects with robust standard errors). The data have a large N (120) and a small T (10). My dependent variable is a ratio, and my independent variable is the annual number of incidents.

1) If I leave the variables in their original form, I get a scatter plot like this:

2) If I take the log of my dependent variable (rate) and leave the independent variable unchanged, I get a scatter plot like this:

3) If I take natural logs of both the dependent and the independent variable, I get a scatter plot like this. Because my independent variable is zero when no incident took place, I used log(incidents + 1).
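For readers who want to reproduce the three forms, here is a minimal sketch on simulated data. The variable names rate and incidents and the way the data are generated are assumptions for illustration only; they are not the poster's data.

```stata
* Simulated stand-in data; rate and incidents are hypothetical names
clear
set seed 12345
set obs 1200

* Skewed count that includes some zeros
gen incidents = floor(exp(rnormal(3, 2)))

* Strictly positive ratio, loosely related to the count
gen rate = exp(0.5 + 0.1 * ln(incidents + 1) + rnormal(0, 0.5))

* Form (2): log the dependent variable only
gen ln_rate = ln(rate)

* Form (3): log both; ln(incidents + 1) keeps the zero counts
gen ln_inc1 = ln(incidents + 1)

* One scatter plot per form
scatter rate incidents, name(form1, replace)
scatter ln_rate incidents, name(form2, replace)
scatter ln_rate ln_inc1, name(form3, replace)
```

Note that ln(incidents + 1) maps a count of zero to exactly zero, so those observations stay in the plot.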

Now my question is: is it okay to use the data as in (2), where the dependent variable is logged and the independent variable is not? Or do I have to normalize as in (3)? What points should I keep in mind when using form (2) or form (3)? In the literature I have seen both practices: some papers take logs and some do not.

Your help in this regard would be appreciated.
Marcos Almeida

Join Date: Apr 2014

Posts: 4047
#2

18 Jun 2020, 07:33

In short, keeping an eye on the literature is a good strategy.

Basically, the log transformation changed the scales.

That said, it seems there are many situations with zero incidents in the example.

The count variable has huge positive skewness.

Maybe a categorization of the values would be helpful.

Best regards,

Marcos
Abdullah Ijaz

Join Date: Sep 2017

Posts: 97
#3

18 Jun 2020, 07:43

Marcos: I totally agree that keeping an eye on the literature is a good strategy. But it can also be confusing, because the literature uses the variable in all three of the forms I described, which is why I asked.

You are very right about the zero values. What do you mean by categorization of the values?
Marcos Almeida

Join Date: Apr 2014

Posts: 4047
#4

19 Jun 2020, 15:52

By categorizing I meant creating categories, such as zero, 1-200, 201-500, etc. The categories should make sense in terms of both rationale and the literature.
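A minimal sketch of such a categorization, using irecode() on simulated data. The variable name incidents and the open-ended top category ("over 500") are assumptions, since the "etc." above does not say where the categories stop.

```stata
clear
set seed 12345
set obs 1000

* Hypothetical count: about 10% zeros, otherwise 1 to 1000
gen incidents = cond(runiform() < 0.1, 0, runiformint(1, 1000))

* irecode(x, 0, 200, 500) returns
*   0 if x <= 0, 1 if 0 < x <= 200, 2 if 200 < x <= 500, 3 if x > 500
gen byte inc_cat = irecode(incidents, 0, 200, 500)

label define inc_cat 0 "zero" 1 "1-200" 2 "201-500" 3 "over 500"
label values inc_cat inc_cat

tabulate inc_cat
```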

Best regards,

Marcos
Abdullah Ijaz

Join Date: Sep 2017

Posts: 97
#5

03 Jul 2020, 04:37

Dear Marcos,

I am really thankful for your valuable comment. What is the command for categorizing this variable? Let's say the maximum value is 1000 and the minimum is 1, and I want intervals of 1-200, 201-400, etc.

Also, in your opinion, what is the best way to look graphically at the relationship between an interval variable (independent) and a continuous variable (dependent): a boxplot, a histogram, or a scatter plot?

Your help would be highly appreciated.

Best Regards,

Last edited by Abdullah Ijaz; 03 Jul 2020, 05:03.
Marcos Almeida

Join Date: Apr 2014

Posts: 4047
#6

03 Jul 2020, 18:06

For the first question, please take a look at -egen- with the -cut()- function.

For the second question, perhaps boxplots would provide a nice view.
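A minimal sketch of both suggestions on simulated data. The variable names incidents and rate are assumptions, and the breakpoints follow the 1-200, 201-400 intervals asked about in #5.

```stata
clear
set seed 12345
set obs 1000

* Hypothetical data: counts from 1 to 1000 and a ratio outcome
gen incidents = runiformint(1, 1000)
gen rate = runiform()

* at() lists the lower limit of each bin plus one final upper limit;
* each bin is closed on the left and open on the right, e.g. [1, 201)
egen inc_bin = cut(incidents), at(1(200)1001)

tabulate inc_bin

* One box of the outcome per bin
graph box rate, over(inc_bin)
```

Each bin is coded by its lower limit, so the box plot groups appear as 1, 201, 401, 601 and 801. Values outside the range given in at() would be set to missing.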

Best regards,

Marcos

Nick Cox

Join Date: Mar 2014
Posts: 35436
#7

04 Jul 2020, 02:52

https://www.stata-journal.com/articl...article=dm0095 gives an overview of binning in Stata.

There is no data example here (contrary to our request at https://www.statalist.org/forums/help#stata) but consider this example which you can run.

Code:

. clear

. set obs 1000
number of observations (_N) was 0, now 1,000

. set seed 2803

.
. gen whatever = runiformint(1, 1000)

.
. gen bin1 = 200 * floor(whatever/200)

.
. gen bin2 = 200 * ceil(whatever/200)

.
. tab bin1 bin2

           |                          bin2
      bin1 |       200        400        600        800       1000 |     Total
-----------+-------------------------------------------------------+----------
         0 |       194          0          0          0          0 |       194
       200 |         2        194          0          0          0 |       196
       400 |         0          1        189          0          0 |       190
       600 |         0          0          0        225          0 |       225
       800 |         0          0          0          0        195 |       195
-----------+-------------------------------------------------------+----------
     Total |       196        195        189        225        195 |     1,000

.
. list if mod(whatever, 200) == 0

      +------------------------+
      | whatever   bin1   bin2 |
      |------------------------|
 166. |      400    400    400 |
 284. |      200    200    200 |
 923. |      200    200    200 |
      +------------------------+

For bins of equal width, I much prefer using floor() or ceil() over egen, cut():

1. The results of the binning have clear meaning as lower (upper) limits of each bin, allowing easy interpretation in tables and graphs.

2. What happens at bin boundaries is an easy consequence of the function definitions -- transparent outside Stata as well as to anyone who at most needs to take a few seconds to learn what floor and ceiling functions do.

3. In contrast you must read the documentation for egen, cut() to work out what it does with boundary cases and you must spell out all the boundaries you need.

All that said, binning looks arbitrary here. Note that roots or cube roots are alternatives to log(count + 1).
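As a rough sketch of those alternatives, the following compares the three transformations on a simulated skewed count. The variable name incidents and the data are assumptions for illustration.

```stata
clear
set seed 12345
set obs 1000

* Hypothetical skewed count with some zeros
gen incidents = floor(exp(rnormal(3, 2)))

gen ln_inc1 = ln(incidents + 1)
gen root_inc = sqrt(incidents)
gen cuberoot_inc = incidents^(1/3)

* All three are defined at zero and pull in the long right tail
summarize incidents ln_inc1 root_inc cuberoot_inc, detail
```

Unlike log(count + 1), the root and cube root need no added constant to handle zeros.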


Nick Cox

Join Date: Mar 2014
Posts: 35436
#8

04 Jul 2020, 04:16

A different point: two-letter country codes will much reduce the mess on your scatter plot. https://www.iban.com/country-codes

Here is some technique with two-letter codes for US states. Note the further tiny trick of omitting the marker symbol and putting the marker label where the marker symbol would have been.

I used niceloglabels from the Stata Journal, which you must install before you can use it. You can of course just specify axis labels directly.

Code:

 search niceloglabels, sj

Search of official help files, FAQs, Examples, and Stata Journals

SJ-18-1 gr0072  . . . . . . . Speaking Stata: Logarithmic binning and labeling
        (help niceloglabels)  . . . . . . . . . . . . . . . . . . .  N. J. Cox
        Q1/18   SJ 18(1):262--286
        introduces the niceloglabels command for helping (even automating)
        label choice

Code:

sysuse census, clear
set scheme s1color 
niceloglabels marriage, style(125) local(xla)
niceloglabels divorce, style(125) local(yla)
scatter divorce marriage, ysc(log) xsc(log) mla(state2) ms(none) mlabsize(medsmall) mlabpos(0) yla(`yla', ang(h)) xla(`xla')

[Image: betterlabel.png]
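For anyone without niceloglabels installed, a sketch of the same graph with axis labels specified directly might look like this. The label values are hand-picked assumptions, not the output of niceloglabels, and the /// line continuations assume the code is run from a do-file.

```stata
sysuse census, clear

* Label values chosen by hand; any positive values suited to the data would do
scatter divorce marriage, ysc(log) xsc(log) mla(state2) ms(none) mlabpos(0) ///
    yla(2000 5000 10000 20000 50000 100000, ang(h)) ///
    xla(5000 10000 20000 50000 100000 200000)
```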


Abdullah Ijaz

Join Date: Sep 2017

Posts: 97
#9

13 Jul 2020, 08:24

Nick Cox Many thanks for all your valuable feedback. Means a lot! It really helped me with my research.

Just a query:

I have the number of incidents (per annum) as my independent variable. The maximum value is 3900, and I made ranges of width 700: 0-700, 701-1400, 1401-2100, and so on. What if the upper two ranges have only 2 or 3 observations each (few cases had that many incidents)? In that case, would it be more appropriate to make just two major categories, i.e. < 2,000 or > 2,000?

Looking forward to your reply!
Nick Cox

Join Date: Mar 2014

Posts: 35436
#10

13 Jul 2020, 08:31

#9 Sorry, but I see no point in your binning here and even less point in binning with inconsistent bin widths. Number of incidents is skewed in distribution, but binning will increase overlap on a graph, which looks quite the wrong way to go.
Abdullah Ijaz

Join Date: Sep 2017

Posts: 97
#11

13 Jul 2020, 08:37

Nick Cox As always, your feedback is spot on. Much appreciated.