  • Confidence interval for Chi-square test

    Dear All,
    Can someone please help me find a confidence interval for a chi-square test or Fisher's exact test?
    I tried the -csi- command, but it only works for 2x2 tables.
    My data contain around 200 subjects and two categorical variables that form a 3x3 table.
    Please help me find the CI for the 3x3 table.


    Thanks a lot in advance

  • #2
    Divya:
    -bootstrap- might be an option:
    Code:
    . use "C:\Program Files\Stata16\ado\base\a\auto.dta"
    (1978 Automobile Data)
    
    . tabulate rep78 foreign, chi2
    
        Repair |
        Record |       Car type
          1978 |  Domestic    Foreign |     Total
    -----------+----------------------+----------
             1 |         2          0 |         2
             2 |         8          0 |         8
             3 |        27          3 |        30
             4 |         9          9 |        18
             5 |         2          9 |        11
    -----------+----------------------+----------
         Total |        48         21 |        69
    
              Pearson chi2(4) =  27.2640   Pr = 0.000
    
    . ereturn list
    
    . return list
    
    scalars:
                      r(N) =  69
                      r(r) =  5
                      r(c) =  2
                   r(chi2) =  27.26396103896104
                      r(p) =  .0000175796084266
    
    . bootstrap r(chi2), reps(200) : tabulate rep78 foreign, chi2
    (running tabulate on estimation sample)
    
    Warning:  Because tabulate is not an estimation command or does not set e(sample), bootstrap has no way to determine which
              observations are used in calculating the statistics and so assumes that all observations are used.  This means that no
              observations will be excluded from the resampling because of missing values or other reasons.
    
              If the assumption is not true, press Break, save the data, and drop the observations that are to be excluded.  Be sure that
              the dataset in memory contains only the relevant data.
    
    Bootstrap replications (200)
    ----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5
    ..................................................    50
    ..................................................   100
    ..................................................   150
    ..................................................   200
    
    Bootstrap results                               Number of obs     =         74
                                                    Replications      =        200
    
          command:  tabulate rep78 foreign, chi2
            _bs_1:  r(chi2)
    
    ------------------------------------------------------------------------------
                 |   Observed   Bootstrap                         Normal-based
                 |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
           _bs_1 |   27.26396   7.504984     3.63   0.000     12.55446    41.97346
    ------------------------------------------------------------------------------
    
    .
    Kind regards,
    Carlo
    (StataNow 18.5)



    • #3
      Dear Carlo,

      Thank you so much for your quick reply.
      This is exactly what I needed, and it works on my data too.
      Thanks a lot once again.

      - Divya



      • #4
        Not following here, as wanting confidence intervals for a significance test is to me a puzzling question.

        That aside, you can get Stata to bootstrap a chi-square statistic, but if that makes sense, then the confidence interval should not be allowed to default to normal-based.



        • #5
          Divya:
          following Nick's hint, you can get all the available -bootstrap- CIs by typing:
          Code:
          estat bootstrap, all
          after running -bootstrap-
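
          For readers who want to see the whole sequence in one place, here is a minimal sketch. It assumes the shipped auto.dta as a stand-in for your own data; the seed, the number of replications, and the name chi2 given to the statistic are arbitrary choices, not recommendations. Missing values are dropped first because, as the warning in #2 says, -bootstrap- cannot tell which observations -tabulate- uses.

          ```stata
          * Sketch only: auto.dta stands in for your data; seed and reps are arbitrary
          sysuse auto, clear
          * bootstrap cannot see which observations tabulate drops, so remove missings first
          drop if missing(rep78, foreign)
          bootstrap chi2=r(chi2), reps(1000) seed(12345) nodots: tabulate rep78 foreign, chi2
          * percentile and bias-corrected intervals instead of the normal-based default
          estat bootstrap, percentile bc
          ```

          Note that the BCa interval is reported by -estat bootstrap- only if the bca option was specified when -bootstrap- was run.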
          Kind regards,
          Carlo
          (StataNow 18.5)



          • #6
            Hi all,
            that's an interesting topic.
            I was wondering whether it makes sense to say that if the CI of the test statistic (or of the corresponding Cramer's V) includes zero, then no difference in the frequency distributions can be confirmed. Is that right?

            kind regards!



            • #7
              An old joke has a visitor ask a local for directions to X and the local replies "If I were going to X, I wouldn't start from here".

              If the original problem is whether two distributions are the same, it's hard for me to imagine a case where chi-square testing is ever ideal for comparing distributions, for all that it survives in textbooks and courses.

              I would say never for continuous distributions and almost never for discrete distributions.

              I am currently reading my way through a 2021 book (5th edition) that recommends chi-square testing for checking normality, and for checking other distributions too. I would argue that there was never a time when this was state of the art -- as normal probability plots were invented in the 19th century before the chi-square test (1900).

              So, I suggest backing up and telling us more about your situation.

              In any case if you want a test, use a test. If you want a confidence interval for a measure of the difference between distributions, start by thinking what you want the measure to capture.



              • #8
                Thrisa:
                the characteristics of the theoretical probability distribution should be considered first:
                - the -chi2- distribution is defined on the interval from 0 to +infinity (hence, having 0 as the lower limit of the 95% CI is virtually impossible);
                - degrees of freedom play a role in driving the p-value of the -chi2- distribution, too.
                In the previous example, considering 4 degrees of freedom and the lower limit of the 95% CI, the resulting p-value is:
                Code:
                . di (1-chi2(4,12.55446))
                .01367098
                whereas replacing the lower with the upper limit of the 95% CI gives:
                Code:
                . di (1-chi2(4,41.97346))
                1.689e-08
                PS: my reply crossed in cyberspace with Nick's towering one, which shows the difference between knowing statistics (he) and simply applying it (me).
                Last edited by Carlo Lazzaro; 06 Feb 2021, 04:48.
                Kind regards,
                Carlo
                (StataNow 18.5)



                • #9
                  Hi,
                  thanks, Nick Cox and Carlo Lazzaro, for your responses!

                  Basically, my intent is this: I have two categorical variables (each coded 0 and 1). To check whether the observed cell counts in the 2x2 table differ significantly from the expected counts, I performed a chi-squared test, which gave me the Pearson chi2 statistic, the p-value, and phi/Cramer's V as an effect size.

                  As my dataset is rather small, I thought, like Divya, who started this thread, that it might be reasonable to bootstrap my sample and thereby repeat the chi-squared test (if this doesn't make sense at all, please let me know). Here is my code for the bootstrapping:


                  bootstrap c=r(chi2) v=r(CramersV) , reps(10000): tab year_cat valteam, chi V row

                  what I get is the following:

                  [Attached image: Unbenannt.PNG, not reproduced here]


                  Yet I am not sure what insight this result provides.
                  The CI for the chi-squared statistic contains both negative and positive values, and so does the CI for Cramer's V. As Carlo already mentioned, however, neither of these quantities can actually be smaller than zero, so what does the CI mean in this case?

                  best regards!


                  • #10
                    Thrisa:
                    as the -chi2- distribution differs from the Gaussian, the normal-based 95% CI bounds can be seriously misleading when the observed statistics are so small.
                    That said, I think the substantive message is that the lack of evidence of a difference between observed and expected counts for the two categorical variables is confirmed.
                    Kind regards,
                    Carlo
                    (StataNow 18.5)



                    • #11
                      Carlo Lazzaro, okay, thanks a lot!
                      But does your explanation apply only in this case?

                      Suppose that, for different variables, there is a difference between expected and observed counts. Would I be able to see it in the CIs? That is, for phi/Cramer's V, would the lower bound of the CI need to lie above zero (or the CI at least exclude zero), because a CI that includes zero would indicate that there might be no effect at all? Which parameters did you look at, and what were your considerations in reaching the conclusion stated in #10?

                      regards
                      Thrisa



                      • #12
                        As said, don't use the default confidence intervals from bootstrap. Use some flavour of percentile-based confidence intervals. But it looks as if at least one of your variables is ordered, and the chi-square test neither knows about that order nor does anything to respect it.
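
                        Both points can be illustrated with a short sketch, under loud assumptions: the data in this thread are not available, so the shipped auto.dta stands in for them, rep78 is treated as the ordered variable, and Kendall's tau-b is just one possible measure that respects order, not necessarily the right one for your problem. The seed and the number of replications are arbitrary.

                        ```stata
                        * Sketch only: auto.dta is a stand-in; rep78 is treated as ordered
                        sysuse auto, clear
                        drop if missing(rep78, foreign)
                        * Kendall's tau-b uses the ordering of the categories, unlike Pearson chi2
                        tabulate rep78 foreign, taub
                        * bootstrap the measure and report a percentile interval, not the normal-based default
                        bootstrap taub=r(taub), reps(1000) seed(12345) nodots: tabulate rep78 foreign, taub
                        estat bootstrap, percentile
                        ```

                        Because tau-b can legitimately be negative or positive, an interval that covers zero has a direct interpretation here, which is not the case for the chi-square statistic itself.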



                        • #13
                          Thrisa:
                          my bootstrapping experience has taught me that:
                          - if the original p-value does not reach statistical significance, bootstrap results confirm that outcome;
                          - if the original p-value reaches statistical significance, bootstrap results may not confirm that outcome.
                          That said, the following tentative answer considers a parametric bootstrap of the -chi2- statistic:
                          Code:
                          ///in the first file, let's do a trivial inferential drill///
                          . webuse citytemp2
                          (City Temperature Data)
                          
                          . help table
                          
                          . tabulate region agecat, chi2
                          
                              Census |              agecat
                              Region |     19-29      30-34        35+ |     Total
                          -----------+---------------------------------+----------
                                  NE |        46         83         37 |       166
                             N Cntrl |       162         92         30 |       284
                               South |       139         68         43 |       250
                                West |       160         73         23 |       256
                          -----------+---------------------------------+----------
                               Total |       507        316        133 |       956
                          
                                    Pearson chi2(6) =  61.2877   Pr = 0.000
                          
                          . ///in another file, let's try a parametric bootstrap session///
                          . set obs 956
                          number of observations (_N) was 0, now 956
                          
                          . g wanted=invchi2(6,runiform())
                          
                          . bootstrap r(mean), reps(1000) size(956) saving(C:\Users\user\Desktop\carlo_chi2.dta, every(1) double replace) bca ties nodots seed(123
                          > 45) : sum wanted
                          
                          warning: Because summarize is not an estimation command or does not set e(sample), bootstrap has no way to determine which
                                   observations are used in calculating the statistics and so assumes that all observations are used. This means that no
                                   observations will be excluded from the resampling because of missing values or other reasons.
                          
                                   If the assumption is not true, press Break, save the data, and drop the observations that are to be excluded. Be sure that
                                   the dataset in memory contains only the relevant data.
                          
                          Bootstrap results                               Number of obs     =        956
                                                                          Replications      =      1,000
                          
                                command:  summarize wanted
                                  _bs_1:  r(mean)
                          
                          ------------------------------------------------------------------------------
                                       |   Observed   Bootstrap                         Normal-based
                                       |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
                          -------------+----------------------------------------------------------------
                                 _bs_1 |     5.8498   .1101804    53.09   0.000      5.63385    6.065749
                          ------------------------------------------------------------------------------
                          
                          . estat bootstrap
                          
                          Bootstrap results                               Number of obs     =        956
                                                                          Replications      =       1000
                          
                                command:  summarize wanted
                                  _bs_1:  r(mean)
                          
                          ------------------------------------------------------------------------------
                                       |    Observed               Bootstrap
                                       |       Coef.       Bias    Std. Err.  [95% Conf. Interval]
                          -------------+----------------------------------------------------------------
                                 _bs_1 |   5.8497995   .0034486   .11018038    5.624229   6.069633  (BC)
                          ------------------------------------------------------------------------------
                          (BC)   bias-corrected confidence interval (adjusted for ties)
                          
                          .
                          . ///finally, let's compare the bootstrap vs. the original p-value///
                          
                          . use "C:\Users\user\Desktop\carlo_chi2.dta"
                          (bootstrap: summarize)
                          
                          . count if _bs_1>=61.2877 ///the original chi2///
                            0
                          
                          . di 0/1000
                          0                        ///the bootstrap p-value confirms the original one///
                          
                          .
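
                          A different way to obtain a simulation-based p-value for the same table is a permutation test, sketched below. This is a swapped-in technique, not the parametric bootstrap shown above: -permute- shuffles one variable across observations and recomputes the chi2 statistic each time. The choice of agecat as the permuted variable, the seed, and the number of replications are assumptions made for illustration only.

                          ```stata
                          * Sketch only: a permutation test, not the parametric bootstrap above
                          webuse citytemp2, clear
                          * shuffle agecat across observations to break any association with region
                          permute agecat chi2=r(chi2), reps(1000) seed(12345) right nodots: tabulate region agecat, chi2
                          * asymptotic upper-tail p-value of the observed statistic, for comparison
                          display chi2tail(6, 61.2877)
                          ```

                          The right option requests the upper-tail p-value, which is the relevant one because only large values of chi2 count as evidence against independence.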
                          Last edited by Carlo Lazzaro; 07 Feb 2021, 05:35.
                          Kind regards,
                          Carlo
                          (StataNow 18.5)
