  • Significance of Cramer's V

    Hi, I have a question about the significance of Cramer's V.

    I calculated Cramer's V by using

    Code:
    tab v1 v2, V
    where v1 and v2 are nominal variables.

    The result is below:

    Code:
                                   v2
           v1  |        1          2          3  |     Total
    -----------+---------------------------------+----------
             1 |       366        284        151 |       801
             2 |       335        715        225 |     1,275
             3 |       131        216        279 |       626
    -----------+---------------------------------+----------
         Total |       832      1,215        655 |     2,702
    
                   Cramér's V =   0.2323
    Here, I want to check the significance of Cramer's V.
    I found that "the p-value for the significance of V is the same one that is calculated using the Pearson's chi-squared test"
    (source: https://en.wikipedia.org/wiki/Cram%C3%A9r's_V).

    So, I ran the command below

    Code:
    tab v1 v2, V chi2
    and the result is:


    Code:
                                   v2
           v1  |        1          2          3  |     Total
    -----------+---------------------------------+----------
             1 |       366        284        151 |       801
             2 |       335        715        225 |     1,275
             3 |       131        216        279 |       626
    -----------+---------------------------------+----------
         Total |       832      1,215        655 |     2,702
    
              Pearson chi2(4) = 291.5302   Pr = 0.000
                   Cramér's V =   0.2323
    My question is whether Pr = 0.000 in the result can be used as the significance of Cramer's V.

    Thank you for taking the time to read this question.
    Last edited by Minchul Park; 05 Nov 2020, 22:19.

  • #2
    I think the short answer is Yes. Cramér's V (often denoted with the Greek letter lower-case nu, which does not correspond to V at all, but looks like a little v nevertheless) is a measure of association, which is a scaling of chi-square, but the associated test remains the chi-square test.

    Note the accent, often omitted: https://en.wikipedia.org/wiki/Harald_Cram%C3%A9r

    We don't have your data, but we can get the frequencies from your output. Note the use of return list to get more detail on the P-value, and of tabchi (from tab_chi on SSC) to get at residuals. Here the clear pattern is that you have far more than expected along the diagonal v1 = v2 and far less off it. I suspect that the relationship here may be at once overwhelmingly significant and substantively quite unsurprising. Otherwise put, the value of a test depends on how seriously you take the null hypothesis of no association.

    Code:
     clear
    
    . tabi 366 284 151 \ 335 715 225 \ 131 216 279 , chi2 V 
    
               |               col
           row |         1          2          3 |     Total
    -----------+---------------------------------+----------
             1 |       366        284        151 |       801 
             2 |       335        715        225 |     1,275 
             3 |       131        216        279 |       626 
    -----------+---------------------------------+----------
         Total |       832      1,215        655 |     2,702 
    
              Pearson chi2(4) = 291.5302   Pr = 0.000
                   Cramér's V =   0.2323
    
    .
    . return li
    
    scalars:
               r(CramersV) =  .2322651753645333
                      r(p) =  7.27186874811e-62
                   r(chi2) =  291.5301915571825
                      r(c) =  3
                      r(r) =  3
                      r(N) =  2702
    
    
    
    . rename (row col pop) (v1 v2 freq)
    
    .  ssc install tab_chi 
    
    . tabchi v1 v2 [fw=freq] , pearson
    
              observed frequency
              expected frequency
              Pearson residual
    
    -------------------------------------
              |            v2            
           v1 |       1        2        3
    ----------+--------------------------
            1 |     366      284      151
              | 246.644  360.183  194.173
              |   7.600   -4.014   -3.098
              | 
            2 |     335      715      225
              | 392.598  573.325  309.077
              |  -2.907    5.917   -4.782
              | 
            3 |     131      216      279
              | 192.758  281.491  151.751
              |  -4.448   -3.903   10.330
    -------------------------------------
    
              Pearson chi2(4) = 291.5302   Pr = 0.000
     likelihood-ratio chi2(4) = 268.8145   Pr = 0.000
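
    In case it helps to see the "scaling" in numbers: for an r x c table, Cramér's V = sqrt(chi2 / (N*(min(r, c) - 1))), so V carries no test of its own and its p-value is simply the chi-square p-value. A minimal sketch of that arithmetic, re-using the saved results from tabi (the display lines are only an illustration):

    Code:
    quietly tabi 366 284 151 \ 335 715 225 \ 131 216 279 , chi2 V
    * V is just a rescaling of the chi-squared statistic
    display sqrt(r(chi2) / (r(N) * (min(r(r), r(c)) - 1)))
    * and its p-value is the chi-squared p-value itself
    display chi2tail((r(r) - 1) * (r(c) - 1), r(chi2))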


    • #3
      Another option is to bootstrap the result and generate a confidence interval. Since I take from your original post that there is a large number of observations in each cell, I suppose this should yield valid results for you. The implementation is simple; see the generic example below:

      Code:
      sysuse nlsw88, clear
      tab industry union, V
      return list
      
      bootstrap r(CramersV), reps(999) seed(123) nodots: tab industry union, V
      estat bootstrap, bc
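
      If you only have the cell counts from #1 rather than the underlying data, one rough sketch (it assumes those counts are all you have, and re-uses the v1/v2 names from above) is to expand the counts back into individual observations and bootstrap over those:

      Code:
      clear
      tabi 366 284 151 \ 335 715 225 \ 131 216 279 , replace
      rename (row col pop) (v1 v2 freq)
      * recreate one observation per case, then bootstrap over cases
      expand freq
      bootstrap r(CramersV), reps(999) seed(123) nodots: tab v1 v2, V
      estat bootstrap, bc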
      Best wishes

      (Stata 16.1 MP)


      • #4
        Nick Cox Thank you for your detailed answer! The explanation is really clear and I understood it completely. Thank you.
        Felix Bittmann Thank you, Felix! I tried to mimic your example and got the result I expected. Thank you!


        • #5
          Thanks for the thanks.

          Usually I prefer a measure to a test -- especially when the measure is a correlation and the test is the often pointless one of testing whether it is zero.

          But here I don't think the measure adds anything much to the test.

          Cramér was a first-rate statistician who did very original work and wrote some splendid books. It's a bit weird that his name is most often mentioned for this small thing that he introduced. The climate of the time included exaggerated interest in measures that were as close to correlations as the logic of each procedure allowed. We have moved away from that since his time, not that he would have disagreed, I think.

          What is often neglected is to look at the form of the association once you've established that it is worth attention. tabchi and its sibling tabchii from tab_chi (SSC) allow you to go a bit further here than does tabulate.
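
          For completeness, tab_chi also includes tabchii, an immediate cousin of tabchi that takes the cell counts directly, much as tabi does. A minimal sketch, assuming tabchii accepts the same pearson option as tabchi:

          Code:
          * install the package if needed, then feed in the cell counts directly
          ssc install tab_chi
          tabchii 366 284 151 \ 335 715 225 \ 131 216 279 , pearson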


          • #6
            I agree with Nick Cox that finding a deviation from independence is just the beginning of the story, not the end. Cramér's V measures the extent to which your observed table deviates from independence. Describing which cells are over-represented or under-represented compared to independence can tell us much more than a single number; the latter only tells us that some or all of the cells differ from independence. Nick did that by comparing the counts that would have occurred if the table were independent with the actual counts. That is fine, but it does mean that each cell of the table now contains multiple entries that need to be compared. So let me offer an alternative.

            Independence means that there is no association between the row and column variable in the table. A naive first try at creating an independent table is to divide the total number of observations by the number of cells and give each cell the same "expected" count. However, this would imply that each value of the row and column variables is equally likely. So this way we are assuming that the row and column variables are uniformly distributed, which does not seem like a reasonable requirement for independence: we can easily imagine two variables being unrelated when the variables in isolation don't follow a uniform distribution. So for the expected counts we keep the row and column totals as observed and change the association such that there is no association. However, we are actually interested in the association and not in the row and column totals. So this is a bit backwards, and it is why we end up with multiple entries in each cell. Would it not be much easier to keep what we are interested in, the association, as observed, and instead change the row and column totals? This is what happens when we standardize a table (ssc desc stdtable).

            Code:
            . clear
            
            . qui tabi 366 284 151 \ 335 715 225 \ 131 216 279
            
            . rename (row col pop) (v1 v2 freq)
            
            .
            . // standardized version of the table
            . stdtable v1 v2 [fw=freq]
            
            --------------------------------------
                      |             v2            
                   v1 |     1      2      3  Total
            ----------+---------------------------
                    1 |  48.8   27.7   23.5    100
                    2 |  29.9   46.6   23.5    100
                    3 |  21.3   25.7     53    100
                      |
                Total |   100    100    100    300
            --------------------------------------
            We controlled for the margins (row and column totals) by fixing them all to be 100. This way the cells can be interpreted as both row and column percentages. We can clearly see the main diagonal being over-represented, a pattern that Nick also found, but now with only one value per cell. In addition, we can also see that the off-diagonal cells all have similar values. This suggests a model for this table with two types of persons: the first type are the stayers, who, if they score 1 on variable v1, will also score 1 on v2; the second type are the movers, who fill the table randomly just like the independence model. We can actually estimate that model and see how well it fits:

            Code:
            . gen diag = (v1 == v2)
            
            . poisson freq i.v1 i.v2 i.diag, irr nolog
            
            Poisson regression                              Number of obs     =          9
                                                            LR chi2(5)        =     670.80
                                                            Prob > chi2       =     0.0000
            Log likelihood = -43.017478                     Pseudo R2         =     0.8863
            
            ------------------------------------------------------------------------------
                    freq |        IRR   Std. Err.      z    P>|z|     [95% Conf. Interval]
            -------------+----------------------------------------------------------------
                      v1 |
                      2  |   1.486303   .0691796     8.51   0.000     1.356714    1.628269
                      3  |   .8136213   .0443099    -3.79   0.000     .7312498    .9052715
                         |
                      v2 |
                      2  |   1.324862   .0618374     6.03   0.000     1.209042    1.451778
                      3  |   .8193708   .0436964    -3.74   0.000     .7380514    .9096501
                         |
                  1.diag |   1.878629   .0741104    15.98   0.000     1.738849    2.029646
                   _cons |    199.112   9.040679   116.59   0.000     182.1581    217.6437
            ------------------------------------------------------------------------------
            Note: _cons estimates baseline incidence rate.
            The effects of v1 and v2 capture the margins, and the effect of diag tells us that there are about 1.9 stayers for every mover. Let us look at how well this model fits. A common measure of fit for this type of model (a log-linear model) is the index of dissimilarity, which measures the proportion of observations that would need to be shifted to perfectly fit the table.

            Code:
            . predict mu
            (option n assumed; predicted number of events)
            
            . sum freq , meanonly
            
            . local n = r(sum)
            
            . gen double d = abs(freq/`n'-mu/`n')
            
            . sum d, meanonly
            
            . di "index of dissimilarity = " r(sum)/2
            index of dissimilarity = .03340679
            So we only need to shift about 3% of the predictions to get a perfect fit; in other words, we got 97% right. This is pretty good for such a simple model.
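
            To see how much of that fit comes from the diag term, you could compute the same index for the plain independence model; a sketch, assuming the table data from the tabi call above are still in memory:

            Code:
            * refit without the diag term, i.e. the plain independence model
            quietly poisson freq i.v1 i.v2
            predict mu0
            summarize freq, meanonly
            local n = r(sum)
            generate double d0 = abs(freq/`n' - mu0/`n')
            summarize d0, meanonly
            display "index of dissimilarity (independence) = " r(sum)/2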

            For more on this type of modeling, see this presentation I gave at the 2015 UK Stata Users' Group meeting: http://www.maartenbuis.nl/presentations/london15b.pdf . For more on standardizing tables, you can look at this presentation I gave at the 2019 UK Stata Users' Group meeting: http://www.maartenbuis.nl/presentations/smclpres.html . That talk was actually on how to make a presentation within Stata, but the worked example I show is on standardizing a table.
            Last edited by Maarten Buis; 09 Nov 2020, 06:25.
            ---------------------------------
            Maarten L. Buis
            University of Konstanz
            Department of history and sociology
            box 40
            78457 Konstanz
            Germany
            http://www.maartenbuis.nl
            ---------------------------------
