  • Correlation categorical and continuous variable

    Hi everyone and Happy New Year,

    I would like to show in a plot that a categorical variable (specifically, a dummy) and a continuous variable are correlated. How can I do that?
    My (bad) results are below. I plotted the fitted values against the continuous variable, but apparently I have done something wrong:

    This is the dataex:

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input float recalled_dummy_byfirm double tot_sales float log_tot_sales
    0 1113.4760742044518  7.015242
    0  24101614.36935137  16.99779
    0  4394125.125922655  15.29578
    0 41891691.779681504 17.550598
    0  496.9142813666729  6.208417
    0 3457450.7099350286 15.056042
    0 29781.004478986808 10.301626
    0  6688352.636122874 15.715878
    0 1023685.0082260867  13.83892
    0  502248.4229268609  13.12685
    0  82.45955933317781  4.412308
    0   140003.883568737 11.849425
    0 2837956.0285489867 14.858595
    0  44207581.22703247 17.604406
    0  797037.7454917706 13.588657
    0  416103.5217390398  12.93869
    0  741191521.7011408  20.42377
    0  5467029.886278819 15.514246
    0  49498.80240358039 10.809704
    0 30164.097628740066 10.314407
    0 222581.32152515725 12.313047
    0 1146955.0700913626 13.952621
    0 1969868.8437640197 14.493478
    0 11118918.220323103  16.22416
    0  539751.9470688725 13.198865
    0  985.1672983974645  6.892811
    0     6178.060923303   8.72876
    0  151222645.6819696 18.834263
    0  545703.7791906247  13.20983
    0  7665862341.894491  22.76004
    0   971148.621263874 13.786235
    0  794234.0318757556 13.585134
    0 256499.57484354204 12.454883
    0  7529410.327107976 15.834328
    0 460.17590708242204  6.131609
    0  40910.03527002819  10.61913
    0   4512.88991982963  8.414693
    0  87.31371441729897 4.4695077
    0 13428.599629095314  9.505142
    0   2783880.57175926 14.839356
    0  3790393.179207244  15.14798
    0  2178201.945228299  14.59401
    0  844643606.7758082 20.554426
    0  1286902325.956602 20.975504
    0   340820.572072454 12.739112
    0  54732.44839913502 10.910212
    0 28035.075917002723 10.241212
    0  7481.720159301102  8.920218
    0 30234854.987732306 17.224506
    0  9227902.617169207 16.037743
    0 3710302.3268464007 15.126624
    0  930868.2835078022 13.743873
    0  324743.0710504343  12.69079
    1  50318338803.73652 24.641636
    0  604238.9901478614 13.311725
    0  1643707.394608449 14.312465
    0  46524.24402256254  10.74773
    1  77463455.56607509 18.165318
    0 2097765.5679272357 14.556383
    0 195508.99961041173 12.183362
    0 1148613.9001555594 13.954066
    0 3551802.0606248565 15.082966
    0  6518268.982895616  15.69012
    0 1652832.2164844885    14.318
    0 20466194.921815265 16.834286
    0 428572.74838139024 12.968216
    0 2242090.4846752696  14.62292
    0 1909445.8545604625 14.462323
    0 1402644.0973251425  14.15387
    0  93522.12291372954 11.445953
    0 3281.3363325683695  8.096006
    0  6181070.950078235 15.637002
    0  359771908.7074979  19.70098
    0 11.425565077664707 2.4358535
    0 455103.65316153417  13.02828
    0   10475152.7600019 16.164516
    0  985040.8455827299 13.800438
    0   521978.202494994 13.165381
    0 20690490.057945695 16.845184
    0 115109.82337942778 11.653642
    0  373719.3565319314  12.83126
    0 1235361.5267128244 14.026875
    0  55.89785659801679  4.023526
    0  35365224.64688697 17.381239
    0  556286.9400060605  13.22904
    0  903350.4092426567 13.713866
    0 493878.69475410716 13.110045
    0  8964990516.836872 22.916594
    0  648.3229754151265  6.474389
    0 1562185.6878791922 14.261597
    0 15581242.227551658 16.561579
    0 238804.66708781102   12.3834
    0  75094.23704981519   11.2265
    0  795053.0252077907 13.586164
    0  37314.44272539405 10.527136
    0 19846.087420229367  9.895762
    0  363282530.2077879  19.71069
    0  36664.90451282127 10.509575
    1 1011690898.0537733  20.73489
    0  3567485.759946689 15.087372
    end
    This is what I tried to do:
    Code:
    reg tot_sales recalled
    predict fitted_values
    twoway(scatter fitted_values tot_sales)(line fitted tot_sales)
    Correlation.pdf

    Thank you very much!

  • #2
    Federico:
    you may want to try:

    Code:
    twoway(scatter fitted_values tot_sales) (lfit fitted_values tot_sales)
    That said, to stress the correlation of the variables you're interested in, I would go:

    Code:
    ktau tot_sales fitted_values, stats(taua taub)
    Kind regards,
    Carlo
    (Stata 19.0)

    • #3
      Many thanks Prof. Lazzaro,

      So would a simple scatter plot show the correlation? I would also like to ask whether the fact that the dummy is time-varying might affect the correlation coefficients or the plot.

      Many thanks again!

      • #4
        Carlo gave very good advice.

        Just on a slightly different note: if you have a binary variable and wish to compare it with a continuous variable, you would usually perform other kinds of tests rather than a correlation, for example Student's t test or the Mann-Whitney test.

        For the graphs, the main aim IMHO should be to present the difference in the mean, the median, or the distribution of the continuous variable BY the binary variable.

        On account of this, graphics such as boxplots or histograms would do fine.

        Hopefully that helps.
        Best regards,

        Marcos
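
        The tests and graph mentioned above could be sketched as follows. This is only an illustration using the variable names from the dataex example in #1, and it assumes the comparison is made on the log scale. Note that the posted sample contains only three firms with recalled_dummy_byfirm equal to 1, so results from the sample alone would be fragile.

        ```stata
        * Student t test of mean log sales by recall status
        ttest log_tot_sales, by(recalled_dummy_byfirm)

        * Mann-Whitney (Wilcoxon rank-sum) test on the same two groups
        ranksum log_tot_sales, by(recalled_dummy_byfirm)

        * Side-by-side box plots of log sales by recall status
        graph box log_tot_sales, over(recalled_dummy_byfirm)
        ```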

        • #5
          When the predictor is just a (0, 1) indicator, then there are just two distinct fitted values for y = a + bx, namely the estimates for a when x = 0 and for a + b when x = 1. So, a plot of the kind shown in the pdf (NB: the FAQ Advice recommends showing .png attachments) is fairly useless.
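
          The point about two distinct fitted values can be checked directly. The following is a minimal sketch using the dataex variables from #1; the name fitted_log is hypothetical (chosen to avoid clashing with the fitted_values variable created earlier), and the log scale is an assumption.

          ```stata
          * Regress log sales on the indicator: _cons estimates a, the slope estimates b
          regress log_tot_sales recalled_dummy_byfirm

          * fitted_log is a hypothetical new variable holding the linear prediction
          predict double fitted_log, xb

          * Only two distinct fitted values appear: a (dummy = 0) and a + b (dummy = 1)
          tabulate fitted_log

          * They coincide with the group means of log sales
          tabstat log_tot_sales, by(recalled_dummy_byfirm) statistics(mean n)
          ```

          A scatter plot of such fitted values against the outcome therefore shows just two horizontal bands of points, which is why it says little about the relationship.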

          You might as well plot the distributions side by side. I fired up stripplot (SSC) to show quantile plots. The dataex sample is truncated at the default 100 observations' worth, but the sample you show suggests that a logarithmic scale is a good idea. Here the reference lines shown are for the mean logarithms, or equivalently the geometric means.

          I don't see any objection to a correlation here. If a regression makes sense, then correlation does too.
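
          One way to see the link between the two: with a single predictor, the squared correlation equals the R-squared of the regression. A minimal sketch with the dataex variables from #1 (the log scale is an assumption):

          ```stata
          * Correlation between log sales and the (0, 1) indicator
          correlate log_tot_sales recalled_dummy_byfirm
          display "squared correlation = " r(rho)^2

          * The same quantity is reported as R-squared by the simple regression
          regress log_tot_sales recalled_dummy_byfirm
          display "R-squared           = " e(r2)
          ```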

          Code:
          stripplot log , over(recalled) cumul cumprob refline vertical centre yla(, ang(h))
          [Attached image: federico.png (quantile plots of log total sales by recall status, with reference lines at the mean logarithms)]

          • #6
            Thank you everyone for the very interesting and useful responses.
            Prof. Cox, just a clarification: the strip plot shows the cumulative distribution of log_tot_sales for the two cases recalled_dummy=0 and recalled_dummy=1, right?
            Moreover, I agree with Prof. Almeida that other statistics are also required. What I would like to do next is to relate the VARIATION of tot_sales to recalls. Is that possible/useful in your opinion?

            Many thanks,

            Federico

            • #7
              Yes and no. That's the quantile function for each predictor value with those options. The cumulative distribution function is, if you like, plotted on the x axis. The help does explain (emphasis added):

              cumulate specifies that data points are to be plotted with respect to an implicit cumulative frequency scale. By default displays resemble cumulative frequency plots; otherwise with vertical displays resemble quantile plots. Note that with cumulate specifying connect(L) [sic] to join points within groups may be helpful.

              Given cumulate the further option cumprob specifies use of an implicit cumulative probability scale rather than cumulative frequency. The precise definition is to plot using (rank - 0.5) / #values. Plotting each set of values within the same vertical or horizontal extent permits easy superimposition over box plots.

              Otherwise I don't understand what you don't understand. A regression of (log of) total sales on recall explains, if anything does, the relationship. Why the doubt?
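
              To make the cumprob definition concrete, the plotting positions (rank - 0.5) / #values can be computed by hand. This is only a sketch: touse and pp are hypothetical variable names, ties simply receive consecutive ranks here (which may differ from how stripplot treats them), and the /// line continuations assume the code is run from a do-file.

              ```stata
              * Flag observations with non-missing log sales
              generate byte touse = !missing(log_tot_sales)

              * Within each recall group, rank by log sales and compute (rank - 0.5) / #values
              bysort touse recalled_dummy_byfirm (log_tot_sales): ///
                  generate double pp = (_n - 0.5) / _N if touse

              * Quantile plot: values on the y axis, cumulative probability on the x axis
              twoway connected log_tot_sales pp, by(recalled_dummy_byfirm) sort ///
                  ytitle("log total sales") xtitle("cumulative probability")
              ```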

              • #8
                OK, now I see. The confusion was indeed about cumul and cumprob, together with my limited knowledge of the command.
                Thank you very much again.
