
  • Scatter Plot and normalizing variables (logs)

    Hi,

    I am working on a panel data analysis (fixed effects with robust standard errors). The data have a large N (120) and a small T (10). My dependent variable is a ratio and my independent variable is the number of incidents per year.

    1) If I leave the variables in their original form, I get a scatterplot like this:
    [Image: 3333.png]



    2) If I take the log of my dependent variable (rate) and leave the independent variable unchanged, I get a scatterplot like this:





    [Image: 111111.png]

    3) If I take natural logs of both the dependent and the independent variable, I get a scatterplot like the one below. Because my independent variable is zero whenever no incident took place, I used log(incidents + 1).


    [Image: 2222.png]



    Now my question is: is it okay to use the data as in point (2), where the dependent variable is logged and the independent variable is not? Or do I have to transform both, as shown in figure 3? What points should I keep in mind when using form (2) or form (3) of the data? In the literature I have seen both practices: some papers take logs and some do not.
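
    A minimal sketch of how forms (2) and (3) could be generated and plotted, assuming the variables are called rate and incidents (placeholder names, not taken from the actual dataset):
    Code:
    * rate and incidents are placeholder variable names
    * form (2): log of the dependent variable only
    gen ln_rate = ln(rate)
    scatter ln_rate incidents
    * form (3): logs of both, adding 1 so that zero counts stay defined
    gen ln_inc1 = ln(incidents + 1)
    scatter ln_rate ln_inc1
    Note that ln(rate) is missing wherever rate is zero or negative, so those observations drop out of both plots.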

    Your help in this regard would be appreciated.



  • #2
    In short, keeping an eye on the literature is a good strategy.

    Basically, the log transformation changes the scales.

    That said, there seem to be many observations with zero incidents in the example.

    The count variable shows huge positive skewness.

    Maybe a categorization of the values would be helpful.

    Best regards,

    Marcos



    • #3
      Marcos: I totally agree that keeping an eye on the literature is a good strategy. But it can be confusing, because the literature uses the variable in all three forms I described, which is why I asked.


      You are very right about the zero values. What do you mean by categorization of the values?



      • #4
        By categorizing I meant creating categories, such as zero, 1-200, 201-500, etc. The categories should make sense in terms of the rationale as well as the literature.
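
        A hedged sketch of one way to build such categories with recode, assuming a count variable named incidents (a placeholder name). The open-ended top category is an assumption for illustration, since the cutoffs after 500 are not spelled out above:
        Code:
        * incidents is a placeholder name; the top category (501 and above) is assumed
        recode incidents (0 = 0 "zero") (1/200 = 1 "1-200") ///
            (201/500 = 2 "201-500") (501/max = 3 "501+"), generate(inc_cat)
        tabulate inc_cat
        The tabulate call shows how many observations fall in each category, which helps to judge whether the cutoffs are sensible.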
        Best regards,

        Marcos



        • #5
          Dear Marcos,

          Thank you for your valuable comment; I really appreciate it. What is the command for categorizing this variable? Let's say the maximum value is 1000 and the minimum is 1, and I want intervals of 1-200, 201-400, etc.

          Also, in your opinion, what is the best way to look at the relationship between an interval variable (independent variable) and a continuous variable (dependent variable) in graphical form, i.e. boxplot, histogram or scatterplot?


          Your help would be highly appreciated.

          Best Regards,
          Last edited by Abdullah Ijaz; 03 Jul 2020, 05:03.



          • #6
            For the first question, please take a look at egen with the cut() function.

            For the second question, perhaps boxplots would provide a nice view.
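
            Both suggestions can be sketched together, assuming a count variable incidents ranging from 1 to 1000 and a dependent variable rate (both placeholder names):
            Code:
            * incidents and rate are placeholder variable names
            * at() defines bins closed on the left: 1-200, 201-400, ..., 801-1000
            egen inc_bin = cut(incidents), at(1(200)1001)
            tabulate inc_bin
            graph box rate, over(inc_bin)
            Values outside the at() limits, such as zero incidents here, are set to missing by cut(), so the limits need to cover the full range of the data.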
            Best regards,

            Marcos



            • #7
              https://www.stata-journal.com/articl...article=dm0095 gives an overview of binning in Stata.

              There is no data example here (contrary to our request at https://www.statalist.org/forums/help#stata) but consider this example which you can run.
              Code:
              . clear
              
              . set obs 1000
              number of observations (_N) was 0, now 1,000
              
              . set seed 2803
              
              .
              . gen whatever = runiformint(1, 1000)
              
              .
              . gen bin1 = 200 * floor(whatever/200)
              
              .
              . gen bin2 = 200 * ceil(whatever/200)
              
              .
              . tab bin1 bin2
              
                         |                          bin2
                    bin1 |       200        400        600        800       1000 |     Total
              -----------+-------------------------------------------------------+----------
                       0 |       194          0          0          0          0 |       194
                     200 |         2        194          0          0          0 |       196
                     400 |         0          1        189          0          0 |       190
                     600 |         0          0          0        225          0 |       225
                     800 |         0          0          0          0        195 |       195
              -----------+-------------------------------------------------------+----------
                   Total |       196        195        189        225        195 |     1,000
              
              .
              . list if mod(whatever, 200) == 0
              
                    +------------------------+
                    | whatever   bin1   bin2 |
                    |------------------------|
               166. |      400    400    400 |
               284. |      200    200    200 |
               923. |      200    200    200 |
                    +------------------------+
              For bins of equal width, I much prefer using floor() or ceil() over egen, cut():

              1. The results of the binning have clear meaning as lower (upper) limits of each bin, allowing easy interpretation in tables and graphs.

              2. What happens at bin boundaries is an easy consequence of the function definitions -- transparent outside Stata as well as to anyone who at most needs to take a few seconds to learn what floor and ceiling functions do.

              3. In contrast you must read the documentation for egen, cut() to work out what it does with boundary cases and you must spell out all the boundaries you need.

              All that said, binning looks arbitrary here. Note that square roots or cube roots are alternatives to log(count + 1).
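
              As a sketch of those alternatives, assuming a non-negative count variable incidents and a dependent variable rate (placeholder names):
              Code:
              * incidents and rate are placeholder variable names
              gen sqrt_inc = sqrt(incidents)
              gen curt_inc = incidents^(1/3)
              scatter rate sqrt_inc
              scatter rate curt_inc
              Both transformations are defined at zero, so no constant needs to be added to the count first.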



              • #8
                A different point: two-letter country codes will much reduce the mess on your scatter plot. https://www.iban.com/country-codes

                Here is some technique with two-letter codes for US states. Note the further tiny trick of omitting the marker symbol and putting the marker label where the marker symbol would have been.

                I used niceloglabels from the Stata Journal, which you must install before you can use it. You can of course just specify axis labels directly.

                Code:
                 search niceloglabels, sj
                
                Search of official help files, FAQs, Examples, and Stata Journals
                
                SJ-18-1 gr0072  . . . . . . . Speaking Stata: Logarithmic binning and labeling
                        (help niceloglabels)  . . . . . . . . . . . . . . . . . . .  N. J. Cox
                        Q1/18   SJ 18(1):262--286
                        introduces the niceloglabels command for helping (even automating)
                        label choice

                Code:
                sysuse census, clear
                set scheme s1color 
                niceloglabels marriage, style(125) local(xla)
                niceloglabels divorce, style(125) local(yla)
                scatter divorce marriage, ysc(log) xsc(log) mla(state2) ms(none) mlabsize(medsmall) mlabpos(0) yla(`yla', ang(h)) xla(`xla')
                [Image: betterlabel.png]



                • #9
                  Nick Cox: Many thanks for all your valuable feedback. It means a lot! It really helped me with my research.

                  Just a query:

                  I have the number of incidents (per annum) as my independent variable. The maximum value is 3900 and I made ranges with a width of 700, say 0-700, 701-1400, 1401-2100 and so on. What if the upper two ranges have only 2 or 3 observations each? In that case, would it be more appropriate to make just two major categories, i.e. < 2,000 or > 2,000?

                  Looking forward to your reply!



                  • #10
                    #9 Sorry, but I see no point in your binning here and even less point in binning with inconsistent bin widths. Number of incidents is skewed in distribution, but binning will increase overlap on a graph, which looks quite the wrong way to go.



                    • #11
                      Nick Cox: As always, your feedback is spot on. Much appreciated.
