  • Am I using the Chi-square test correctly?

    I generally don't make use of chi-squared tests, but I thought I would use one as a first check to see whether there is a relationship between the two variables in my cross-tab.

    Thus I do the following:

    Code:
    tab attrited goodhealth_baseline if gender == 0, column row nokey chi2 lrchi2 V exact gamma taub
    Where "attrited" is a variable equal to one if respondents had left the sample after wave 1 and zero otherwise, while goodhealth_baseline is a variable equal to 1 if respondent reported good health in the baseline, and zero if respondents reported bad health in the baseline.

    Above, I would like to use a chi-square (χ²) test for a relationship between the two variables.

    The null hypothesis (Ho) is that there is no relationship.

    To reject this I need Pr < 0.05 (that is, significance at the 5% level).

    The output from this test is as below:

    Code:
    
               |     Binary Health
      attrited |       Bad       Good |     Total
    -----------+----------------------+----------
             0 |       423      1,371 |     1,794 
               |     23.58      76.42 |    100.00 
               |     44.34      60.77 |     55.89 
    -----------+----------------------+----------
             1 |       531        885 |     1,416 
               |     37.50      62.50 |    100.00 
               |     55.66      39.23 |     44.11 
    -----------+----------------------+----------
         Total |       954      2,256 |     3,210 
               |     29.72      70.28 |    100.00 
               |    100.00     100.00 |    100.00 
    
              Pearson chi2(1) =  73.4293   Pr = 0.000
     likelihood-ratio chi2(1) =  73.1591   Pr = 0.000
                   Cramér's V =  -0.1512
                        gamma =  -0.3208  ASE = 0.035
              Kendall's tau-b =  -0.1512  ASE = 0.018
               Fisher's exact =                 0.000
       1-sided Fisher's exact =                 0.000
    
    .
    Above, the chi2 is significant because the p-value (Pr) of the Pearson chi2 is < 0.05.

    Thus I can reject the null hypothesis (Ho) that there is no relationship between self-rated health and having left the sample.

    Therefore I can conclude that there is some relationship between mothers' self-rated health and having left the sample.

    I was happy to take this approach until I thought about the concept of Chi-square a little further, particularly the variables I include.

    I can't tell if my data are suitable for this type of test. My understanding is that a chi-square test is used for nominal data, but because I am comparing good health vs. bad health, my data feel more ordinal than nominal. Should I therefore be using gamma and taub instead (and what would be the best approach to these anyway?), or is my original approach OK?
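
    A minimal sketch for checking the test by hand, assuming only the four cell counts printed in the table above. tabi is the immediate form of tabulate, so it needs no dataset in memory. The display line applies the standard shortcut formula for a 2x2 table, N(ad - bc)^2 divided by the product of the row and column totals, and should return the same Pearson chi2(1) of about 73.43:

    ```stata
    * Rebuild the 2x2 table from the printed cell counts
    * (rows: attrited 0/1; columns: Bad/Good health)
    tabi 423 1371 \ 531 885, chi2 lrchi2 V taub exact

    * Pearson chi2 by hand for a 2x2 table:
    * N*(a*d - b*c)^2 / (row total 1 * row total 2 * column total 1 * column total 2)
    display 3210*(423*885 - 1371*531)^2 / (1794*1416*954*2256)
    ```

    With only two categories per variable, the nominal/ordinal distinction makes little practical difference, which is consistent with Cramér's V and Kendall's tau-b both being reported as -0.1512 in the output above.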

    Any help greatly appreciated,

    Very best,

    John


  • #2
    John:
    what about considering a different take, like -ologit-, to investigate your data?
    Kind regards,
    Carlo
    (Stata 19.0)


    • #3
      Dear Carlo,

      Thank you for your response,

      Above, I use chi-square (χ²) tests for relationships between variables as a first step in investigating whether attrition in my sample is related to mothers' health outcomes or conditions. When the chi2 is significant, I investigate the relationship between mothers' health and having left the sample as follows:

      Code:
      logit attrited goodhealth_baseline i.own_education_y0 i.maritalstatus_y0 i.medical_card_y0 i.employment_y0 i.ord_age_y0 if gender==0, cluster ( address_current_county_2002 )
      Where attrited is a binary variable equal to one if the mother left the sample after wave 1 and zero if she remained in the two subsequent waves, and goodhealth_baseline is a binary variable equal to one if the mother reported good health in the first wave of the study and zero if she reported bad health. The controls exactly match those in the primary panel logit regressions run in the paper, measured here at baseline; they are described in the regression output below. Finally, I cluster on the mother's baseline address.

      Results are as follows:


      Code:
      note: 1.own_education_y0 != 0 predicts success perfectly
            1.own_education_y0 dropped and 1 obs not used
      
      note: 5.employment_y0 != 0 predicts success perfectly
            5.employment_y0 dropped and 2 obs not used
      
      note: 6.own_education_y0 omitted because of collinearity
      Iteration 0:   log pseudolikelihood = -645.62321  
      Iteration 1:   log pseudolikelihood = -596.20894  
      Iteration 2:   log pseudolikelihood =  -595.9762  
      Iteration 3:   log pseudolikelihood = -595.97614  
      
      Logistic regression                             Number of obs     =        950
                                                      Wald chi2(19)     =    1162.89
                                                      Prob > chi2       =     0.0000
      Log pseudolikelihood = -595.97614               Pseudo R2         =     0.0769
      
                                                                (Std. Err. adjusted for 30 clusters in address_current_county_2002)
      -----------------------------------------------------------------------------------------------------------------------------
                                                                  |               Robust
                                                        leftsampp |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
      ------------------------------------------------------------+----------------------------------------------------------------
                                           goodhealth_baseline |   -.370314   .1289517    -2.87   0.004    -.6230546   -.1175734
                                                                  |
                                                 own_education_y0 |
                                                    No schooling  |          0  (empty)
                                        Primary school education  |   1.669598   .8966431     1.86   0.063    -.0877899    3.426987
                                           Some secondary school  |   .3225232    .228789     1.41   0.159    -.1258951    .7709415
                                    Complete secondary education  |   .2945347   .1777575     1.66   0.098    -.0538636    .6429331
          Some third level education at college, university, RTC  |   .4422957   .2384249     1.86   0.064    -.0250086    .9095999
      Complete third level education at college, university, RTC  |          0  (omitted)
                                                                  |
                                                 maritalstatus_y0 |
                                                      Cohabiting  |   .3158153   .3075987     1.03   0.305    -.2870671    .9186977
                                                        Divorced  |   1.112166    .694325     1.60   0.109    -.2486864    2.473018
                                                         Widowed  |   1.403864   1.202754     1.17   0.243    -.9534902    3.761219
                                            Single/Never married  |   .3154115    .249044     1.27   0.205    -.1727056    .8035287
                                                                  |
                                                  medical_card_y0 |
                                                             Yes  |   .0470571   .1727811     0.27   0.785    -.2915876    .3857018
                                                                  |
                                                    employment_y0 |
                                                      Unemployed  |   -.138236   .4231469    -0.33   0.744    -.9675886    .6911166
        Unable to work owing to permanent sickness or disability  |   .2499563   .5619914     0.44   0.656    -.8515266    1.351439
                                               At school/student  |  -1.056431   .3381044    -3.12   0.002    -1.719103   -.3937586
                                 Seeking work for the first time  |          0  (empty)
                                                        Employed  |  -.3402886    .125736    -2.71   0.007    -.5867267   -.0938505
                                                   Self Employed  |  -.4277785   .4484061    -0.95   0.340    -1.306638    .4510813
                                                                  |
                                                       ord_age_y0 |
                                                           20-23  |  -.2871578   .2363127    -1.22   0.224    -.7503222    .1760065
                                                           24-27  |  -.5570169    .366666    -1.52   0.129    -1.275669    .1616352
                                                           28-32  |  -1.039217   .3932561    -2.64   0.008    -1.809985   -.2684494
                                                            33 +  |  -1.284368   .3721328    -3.45   0.001    -2.013735   -.5550012
                                                                  |
                                                            _cons |   .7023712    .434125     1.62   0.106    -.1484982    1.553241
      -----------------------------------------------------------------------------------------------------------------------------
      
      . 
      . margins if gender==0, dydx(goodhealth_baseline) post
      
      Average marginal effects                        Number of obs     =        950
      Model VCE    : Robust
      
      Expression   : Pr(leftsampp), predict()
      dy/dx w.r.t. : binary_health_y0
      
      ----------------------------------------------------------------------------------
                       |            Delta-method
                       |      dy/dx   Std. Err.      z    P>|z|     [95% Conf. Interval]
      -----------------+----------------------------------------------------------------
      goodhealth_baseline|   -.080887   .0279242    -2.90   0.004    -.1356174   -.0261566
      ----------------------------------------------------------------------------------
      
      . 
      . estimates store logitmod
      
      . 
      . estimates table logitmod, star stats(N r2 r2_a)
      
      ------------------------------
          Variable |   logitmod     
      -------------+----------------
      goodhealth_baseline| -.08088701**   
      -------------+----------------
                 N |        950     
                r2 |                
              r2_a |                
      ------------------------------
      legend: * p<0.05; ** p<0.01; *** p<0.001
      As the coefficient on good health is negative and significant, I conclude that mothers who self-report good health are significantly less likely to leave the sample after wave 1; equivalently, mothers who self-report bad health are significantly more likely to leave. As a result, I note in my conclusion that the results of my primary analysis, whose core outcome is the health effect of employment change, may be understated due to attrition.
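
      A small illustration of the effect size, assuming only the coefficient and confidence limits printed in the logit output above. Exponentiating them gives the odds ratio and its interval:

      ```stata
      * Odds ratio for good baseline health, from the printed logit coefficient
      display exp(-.370314)

      * 95% confidence limits on the odds-ratio scale
      display exp(-.6230546) "  " exp(-.1175734)
      ```

      The first line returns roughly 0.69, i.e. the odds of leaving the sample are about 31% lower for mothers reporting good baseline health, holding the controls fixed. Re-running the model with the or option would report the same quantity directly.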

      Thus my questions are two-fold. First and foremost, I can't tell if my data are the correct type for the initial chi-square test: my understanding is that a chi-square test is used for nominal data, but because I am comparing good health vs. bad health, my data feel more ordinal than nominal. Is a chi-square test acceptable here as a first-line test, or does this type of data require a different approach? If chi-square is acceptable, I am happy to stick with it. Secondly, does my approach to attrition make good logical sense?

      Kindest regards,

      Jonathan


      • #4
        John:
        thanks for providing further details.
        It seems that you're dealing with a panel dataset: hence I would consider using -xtlogit-.
        On top of that, while I do support clustering, as your observations are not independent, I fail to see the reason for clustering your -logit- standard errors on -address_current_county_2002- instead of on -panelid-.
        As far as dealing with attrition is concerned, https://www.amazon.com/Applied-Econo.../dp/0415676827 devotes chapter 10 to non-response and attrition bias. In brief, the authors propose the use of inverse probability weights to deal with that issue. Stata code to do the trick is provided as well.
        As usual, the core issue rests on the ignorability of the missingness.
        As your data are ordinal, I would still prefer -ologit- vs. -chi2-
        Kind regards,
        Carlo
        (Stata 19.0)


        • #5
          John:
          you may be also interested in the following working paper: https://www.york.ac.uk/media/economi...c/wp/05_05.pdf
          Kind regards,
          Carlo
          (Stata 19.0)


          • #6
            Dear Carlo,

            Thank you for your feedback and for the resources you have supplied. I will endeavour to employ these in my analysis.

            Kindest regards,

            John


            • #7
              Dear Carlo Lazzaro

              Just a brief follow-up question.

              You mentioned that, as my data are ordinal, you would prefer -ologit- to -chi2-. Could a case still be made in support of using -chi2- in this situation, or would it be entirely incorrect to do so?

              Kindest regards,

              John


              • #8
                John:
                I would not say that chi2 is totally incorrect, but it is probably not fully satisfactory when compared to the statistical procedures proposed for investigating attrition in panel data (as per the references quoted in my previous reply). Much also depends on the target audience of your research (colleagues, reviewers of a technical journal, discussants).
                Kind regards,
                Carlo
                (Stata 19.0)


                • #9
                  Dear Carlo,

                  I must thank you for providing me with such excellent resources on the use of inverse probability weights to deal with the issue of attrition. In particular, Jones, A. M., Rice, N., d'Uva, T. B., & Balia, S. (2007). Applied Health Economics (Routledge Advanced Texts in Economics and Finance) has been a very informative and easy-to-follow guide.

                  Nonetheless I am having a minor issue that I hope you could provide some guidance on.

                  In my analysis I am interested in the relationship between health outcomes and local area unemployment for a group of mothers in panel data across 3 waves.

                  To compute the IPW estimator I estimate (probit) equations for response (r_it = 1) versus non-response (r_it = 0) at each wave, t = 1,…,T, conditional on a set of characteristics (z_i1) that are measured for all individuals at the first wave.

                  z_i1 includes the initial values of all of the regressors in the health equation.

                  It also includes the initial values of the y variable (self-rated health) and of the other indicators of morbidity.

                  Code:
                  The below binary variable determines whether the respondent was in the sample at each wave:
                  
                  has_questionnaire_y0 = did the respondent have a y0 questionnaire
                  has_questionnaire_y5 = did the respondent have a y5 questionnaire
                  has_questionnaire_y10 = did the respondent have a y10 questionnaire
                  
                  Variables are then reshaped from wide to long for panel data analysis:
                  
                  reshape long has_questionnaire_y binary_health_y medical_card_y binary_employment_y age_y psum_unemployed_total_cont_y, i(id) j(year)
                  Each wave is described by wavenum as below:


                  Code:
                  ********************************************
                  
                  *         wavenum
                  
                  ********************************************
                  
                  tab year 
                  recode year  (0=1) (5=2) (10=3) (else=.), gen(wavenum) label(wavenum)
                  label variable wavenum  "wavenum"
                  tab year wavenum
                  
                  
                  *I create a wavenum variable to match what they have in the book, recoding my time period from a wave every five years starting in year 0, to a wave 1 wave, a wave 2 wave, and a wave 3 wave.
                  The following code is used to create variables that contain the initial values of the regressors at wave 1:

                  Code:
                  sort id wavenum
                  foreach X of varlist binary_health_y ///
                  psum_unemployed_total_cont_y ///
                  medical_card_y   ///
                  binary_employment_y  ///
                  age_y {
                  by id: gen `X't1 = `X'[1]
                  }
                  
                  
                  * These are included in a global variable list:
                  
                  
                  global z1 "binary_health_yt1 psum_unemployed_total_cont_yt1 medical_card_yt1 binary_employment_yt1 age_yt1"

                  These variables are used in a sequence of probit models for response versus non-response: so the dependent variable is has_questionnaire_y, which indicates whether an observation is in the estimation sample at each wave as described above.



                  Code:
                  
                  forvalues j = 1(1)3 {
                  quietly probit has_questionnaire_y $z1 if (wavenum == `j') 
                  predict p`j' , p
                  generate ipw `j' = 1/p`j' 
                  
                  }
                  
                  generate imr = 0
                  forvalues k = 1(1)3 {
                  replace imr = imr`k' if wavenum == `k' 
                  }
                  
                  generate ipw = 1
                  forvalues k = 1(1)3{
                  replace ipw = ipw`k' if wavenum == `k'
                  
                  }
                  
                  
                  }
                  I attempt to estimate the probits at each wave of the panel, from wave 1 to wave 3, using the full sample of individuals who are observed at wave 1.

                  The whole purpose is to create the new variable ipw: the inverse of the fitted probability of responding. Because they do so in the book, I also create the inverse Mills ratios (imr), predominantly to learn more about this.
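
                  A hedged sketch of how the ipw variable might then be used, assuming a pooled probit for the health equation with the regressors named in the reshape command above. The model specification is an assumption made for illustration and is not taken from the textbook; only the responding observations enter the weighted estimation:

                  ```stata
                  * Hypothetical second stage: pooled probit of health on area unemployment,
                  * weighted by the inverse response probabilities and clustered on the mother
                  probit binary_health_y psum_unemployed_total_cont_y medical_card_y ///
                      binary_employment_y age_y if has_questionnaire_y == 1 ///
                      [pweight = ipw], vce(cluster id)
                  ```

                  Comparing these estimates with the unweighted ones gives an informal sense of how much attrition matters for the health equation.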

                  However on executing the code I am faced with the following error:

                  Code:
                  
                  . forvalues j = 1(1)3 {
                    2. quietly probit has_questionnaire_y $z1 if (wavenum == `j') 
                    3. predict p`j' , p
                    4. generate ipw `j' = 1/p`j' 
                    5. 
                  . }
                  r(2000);
                  
                  end of do-file
                  
                  r(2000);
                  
                  .
                  Not one new variable is created.

                  According to the helpfiles this suggests that
                  Search of official help files, FAQs, Examples, SJs, and STBs

                  [P] error . . . . . . . . . . . . . . . . . . . . . . . . Return code 2000
                  no observations;
                  You have requested some statistical calculation and there are
                  no observations on which to perform it. Perhaps you specified
                  if or in and inadvertently filtered all the data.

                  (end of search)
                  But I am at a loss as to why this is. I can only suspect that my forvalues approach is incorrect, as this is not a route I often take. Is there anything I have done here that jumps out as immediately incorrect? I tried to stick as closely as possible to the guide provided by the textbook, i.e. pages 283 to 287 of Applied Health Economics (Routledge Advanced Texts in Economics and Finance), more or less verbatim, so I really can't see where I may have gone wrong.

                  Any advice received is greatly appreciated,

                  Very best,

                  John
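
                  A hedged diagnostic sketch for the r(2000) above. Besides an empty estimation sample, probit also returns r(2000) when the outcome does not vary, and quietly hides the accompanying message. Two things therefore seem worth checking: whether has_questionnaire_y takes both 0 and 1 within each wave, and whether the baseline covariates in $z1 are non-missing. The sketch assumes that every mother responded at wave 1 (so the wave 1 weight stays at 1), and nmiss_z1 and xb are names introduced purely for illustration. The revised loop also removes the space between ipw and the loop macro in the generate line, creates the imr variables that the later replace refers to, and drops the unmatched closing brace:

                  ```stata
                  * (a) Does the response indicator vary within each wave?
                  tabulate wavenum has_questionnaire_y, missing

                  * (b) How many observations per wave have complete baseline covariates?
                  egen nmiss_z1 = rowmiss($z1)
                  tabulate wavenum if nmiss_z1 == 0 & !missing(has_questionnaire_y)

                  * Revised loop, assuming wave 1 has no non-response and is skipped
                  forvalues j = 2/3 {
                      * quietly is dropped so that any probit message is printed
                      probit has_questionnaire_y $z1 if wavenum == `j'
                      predict double p`j' if e(sample), pr
                      predict double xb`j' if e(sample), xb
                      * inverse Mills ratio from the linear index
                      generate double imr`j' = normalden(xb`j')/normal(xb`j')
                      * inverse of the fitted response probability
                      generate double ipw`j' = 1/p`j'
                  }

                  * Collect the wave-specific values into single variables
                  generate double imr = 0
                  generate double ipw = 1
                  forvalues k = 2/3 {
                      replace imr = imr`k' if wavenum == `k'
                      replace ipw = ipw`k' if wavenum == `k'
                  }
                  ```

                  If the first tabulate shows only the value 1 in some wave, the probit for that wave has nothing to estimate, which would be one explanation for the error.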
