Am I using the Chi-square test correctly?

John Adler

Join Date: Apr 2017

Posts: 173
#1

22 Jun 2018, 04:47

I generally don't make use of chi-squared tests, but I thought I would use it as a post-estimation command to see if there is a relationship between two variables in my cross tab.

Thus I do the following:

Code:

tab attrited goodhealth_baseline if gender == 0, column row nokey chi2 lrchi2 V exact gamma taub

Where "attrited" is a variable equal to one if respondents had left the sample after wave 1 and zero otherwise, while goodhealth_baseline is a variable equal to 1 if respondent reported good health in the baseline, and zero if respondents reported bad health in the baseline.

Above, I would like to use a χ² (chi-square) test for a relationship between the two variables.

The null hypothesis (H0) is that there is no relationship.

To reject it I need Pr < 0.05 (that is, significance at the 5% level).

The output from this test is as below:

Code:

           |     Binary Health
  attrited |       Bad       Good |     Total
-----------+----------------------+----------
         0 |       423      1,371 |     1,794
           |     23.58      76.42 |    100.00
           |     44.34      60.77 |     55.89
-----------+----------------------+----------
         1 |       531        885 |     1,416
           |     37.50      62.50 |    100.00
           |     55.66      39.23 |     44.11
-----------+----------------------+----------
     Total |       954      2,256 |     3,210
           |     29.72      70.28 |    100.00
           |    100.00     100.00 |    100.00

          Pearson chi2(1) =  73.4293   Pr = 0.000
 likelihood-ratio chi2(1) =  73.1591   Pr = 0.000
               Cramér's V =  -0.1512
                    gamma =  -0.3208  ASE = 0.035
          Kendall's tau-b =  -0.1512  ASE = 0.018
           Fisher's exact =                 0.000
   1-sided Fisher's exact =                 0.000

Above, the test is significant because the p-value of the Pearson chi2 (Pr = 0.000) is < 0.05.

Thus I can reject the null hypothesis (H0) that there is no relationship between self-rated health and having left the sample.

Therefore I can conclude that there is some relationship between mothers' self-rated health and having left the sample.
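For anyone who wants to check the test without access to the data, the following is a minimal sketch that re-enters the four cell counts from the output above using -tabi-, the immediate form of -tabulate-. It assumes the rows are attrited = 0, 1 and the columns are Bad, Good, as in the table; the options are the same ones used in the original command.

```stata
* Reproduce the 2x2 tests from the reported cell counts only
* rows: attrited = 0, 1   columns: Bad, Good (baseline health)
tabi 423 1371 \ 531 885, chi2 lrchi2 V exact gamma taub
```

Because -tabi- works from the counts alone, it does not touch the data in memory unless the replace option is added.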

I was happy to take this approach until I thought about the concept of Chi-square a little further, particularly the variables I include.

I can't tell if my data are suitable for this type of test. My understanding is that a chi-square test is used for nominal data, but because I am comparing good health vs. bad health, my data feel more ordinal than nominal. Should I therefore be using gamma and taub instead (and what would be the best approach to these anyway?), or is my original approach OK?
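One point that bears on the nominal versus ordinal question: with only two categories per variable, ordering adds no information, and for a 2x2 table Kendall's tau-b has the same value as the phi coefficient, with the Pearson chi2 equal to N times phi squared. The sketch below checks that arithmetic against the counts in the table above. The scalar names (n11, n12, n21, n22, ntot, phi) are made up for illustration and are not from the original post.

```stata
* Cell counts from the 2x2 table: row = attrited (0, 1), column = Bad, Good
scalar n11 = 423
scalar n12 = 1371
scalar n21 = 531
scalar n22 = 885
scalar ntot = n11 + n12 + n21 + n22

* phi coefficient: (n11*n22 - n12*n21) / sqrt(product of the four margins)
scalar phi = (n11*n22 - n12*n21) / ///
    sqrt((n11 + n12)*(n21 + n22)*(n11 + n21)*(n12 + n22))

* phi can be compared with Kendall's tau-b, and ntot*phi^2 with Pearson chi2(1)
display "phi        = " %8.4f phi
display "N * phi^2  = " %8.4f ntot*phi^2
```

If the two displayed values agree with the tau-b and Pearson chi2 in the posted output, the chi-square test and the ordinal measure carry the same information for this table, apart from the sign.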

Any help greatly appreciated,

Very best,

John
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17673
#2

23 Jun 2018, 02:08

John:
what about considering a different take, like -ologit-, to investigate your data?

Kind regards,
Carlo
(Stata 19.0)

John Adler

Join Date: Apr 2017
Posts: 173

25 Jun 2018, 03:52

Dear Carlo,

Thank you for your response,

Above, I use χ² (chi-square) tests for relationships between variables as a first step in investigating whether attrition in my sample is related to the mothers' health outcomes or conditions. When the chi2 is significant, I investigate the relationship between mothers' health and having left the sample as follows:

Code:

logit attrited goodhealth_baseline i.own_education_y0 i.maritalstatus_y0 i.medical_card_y0 i.employment_y0 i.ord_age_y0 if gender==0, cluster ( address_current_county_2002 )

Here attrited is a binary variable equal to one if the mother left the sample after wave 1 and zero if she remained in the two subsequent waves. goodhealth_baseline is a binary variable equal to one if the mother had good self-reported health in the first wave of the data and zero if she had bad self-reported health in that wave. The controls exactly match those in the primary panel logit regressions run in the paper; they are included here at their baseline values and are described in the regression output below. Finally, I cluster on the mother's baseline address.

Results are as follows:

Code:

note: 1.own_education_y0 != 0 predicts success perfectly
      1.own_education_y0 dropped and 1 obs not used

note: 5.employment_y0 != 0 predicts success perfectly
      5.employment_y0 dropped and 2 obs not used

note: 6.own_education_y0 omitted because of collinearity
Iteration 0:   log pseudolikelihood = -645.62321  
Iteration 1:   log pseudolikelihood = -596.20894  
Iteration 2:   log pseudolikelihood =  -595.9762  
Iteration 3:   log pseudolikelihood = -595.97614  

Logistic regression                             Number of obs     =        950
                                                Wald chi2(19)     =    1162.89
                                                Prob > chi2       =     0.0000
Log pseudolikelihood = -595.97614               Pseudo R2         =     0.0769

                                                          (Std. Err. adjusted for 30 clusters in address_current_county_2002)
-----------------------------------------------------------------------------------------------------------------------------
                                                            |               Robust
                                                  leftsampp |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
------------------------------------------------------------+----------------------------------------------------------------
                                     goodhealth_baseline |   -.370314   .1289517    -2.87   0.004    -.6230546   -.1175734
                                                            |
                                           own_education_y0 |
                                              No schooling  |          0  (empty)
                                  Primary school education  |   1.669598   .8966431     1.86   0.063    -.0877899    3.426987
                                     Some secondary school  |   .3225232    .228789     1.41   0.159    -.1258951    .7709415
                              Complete secondary education  |   .2945347   .1777575     1.66   0.098    -.0538636    .6429331
    Some third level education at college, university, RTC  |   .4422957   .2384249     1.86   0.064    -.0250086    .9095999
Complete third level education at college, university, RTC  |          0  (omitted)
                                                            |
                                           maritalstatus_y0 |
                                                Cohabiting  |   .3158153   .3075987     1.03   0.305    -.2870671    .9186977
                                                  Divorced  |   1.112166    .694325     1.60   0.109    -.2486864    2.473018
                                                   Widowed  |   1.403864   1.202754     1.17   0.243    -.9534902    3.761219
                                      Single/Never married  |   .3154115    .249044     1.27   0.205    -.1727056    .8035287
                                                            |
                                            medical_card_y0 |
                                                       Yes  |   .0470571   .1727811     0.27   0.785    -.2915876    .3857018
                                                            |
                                              employment_y0 |
                                                Unemployed  |   -.138236   .4231469    -0.33   0.744    -.9675886    .6911166
  Unable to work owing to permanent sickness or disability  |   .2499563   .5619914     0.44   0.656    -.8515266    1.351439
                                         At school/student  |  -1.056431   .3381044    -3.12   0.002    -1.719103   -.3937586
                           Seeking work for the first time  |          0  (empty)
                                                  Employed  |  -.3402886    .125736    -2.71   0.007    -.5867267   -.0938505
                                             Self Employed  |  -.4277785   .4484061    -0.95   0.340    -1.306638    .4510813
                                                            |
                                                 ord_age_y0 |
                                                     20-23  |  -.2871578   .2363127    -1.22   0.224    -.7503222    .1760065
                                                     24-27  |  -.5570169    .366666    -1.52   0.129    -1.275669    .1616352
                                                     28-32  |  -1.039217   .3932561    -2.64   0.008    -1.809985   -.2684494
                                                      33 +  |  -1.284368   .3721328    -3.45   0.001    -2.013735   -.5550012
                                                            |
                                                      _cons |   .7023712    .434125     1.62   0.106    -.1484982    1.553241
-----------------------------------------------------------------------------------------------------------------------------

. 
. margins if gender==0, dydx(goodhealth_baseline) post

Average marginal effects                        Number of obs     =        950
Model VCE    : Robust

Expression   : Pr(leftsampp), predict()
dy/dx w.r.t. : binary_health_y0

----------------------------------------------------------------------------------
                 |            Delta-method
                 |      dy/dx   Std. Err.      z    P>|z|     [95% Conf. Interval]
-----------------+----------------------------------------------------------------
goodhealth_baseline|   -.080887   .0279242    -2.90   0.004    -.1356174   -.0261566
----------------------------------------------------------------------------------

. 
. estimates store logitmod

. 
. estimates table logitmod, star stats(N r2 r2_a)

------------------------------
    Variable |   logitmod     
-------------+----------------
goodhealth_baseline| -.08088701**   
-------------+----------------
           N |        950     
          r2 |                
        r2_a |                
------------------------------
legend: * p<0.05; ** p<0.01; *** p<0.001

As the coefficient on good health is negative and significant, I conclude that mothers who self-report good health are significantly less likely to leave the sample after wave 1; equivalently, mothers who self-report bad health are significantly more likely to leave the sample after wave 1. As a result, I note in my conclusion that my results, where the health effects of employment change are the core outcome of my primary analysis, may be understated due to attrition.

Thus my questions are twofold. First and foremost, I can't tell if my data are the correct type for the initial χ² (chi-square) tests for relationships between variables. My understanding is that a chi-square test is used for nominal data, but because I am comparing good health vs. bad health, my data feel more ordinal than nominal. Is a χ² test acceptable here as a first-line test, or does this type of data require a different approach? If χ² is acceptable, I am happy to stick with it. Secondly, does my approach to attrition make good logical sense?

Kindest regards,

Jonathan

Carlo Lazzaro

Join Date: Apr 2014

Posts: 17673
#4

25 Jun 2018, 11:51

John:
thanks for providing further details.
It seems that you're dealing with a panel dataset: hence I would consider using -xtlogit-.
On top of that, while I do support clustering, as your observations are not independent, I fail to see the reason for clustering your -logit- standard errors on -address_current_county_2002- instead of on -panelid-.
As far as dealing with attrition is concerned, https://www.amazon.com/Applied-Econo.../dp/0415676827 devotes chapter 10 to non-response and attrition bias. In brief, the authors propose the use of inverse probability weights to deal with that issue. Stata code to do the trick is provided as well.
As usual, the core issue rests on the ignorability of the missingness.
As your data are ordinal, I would still prefer -ologit- to -chi2-.

Kind regards,
Carlo
(Stata 19.0)
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17673
#5

26 Jun 2018, 01:05

John:
you may also be interested in the following working paper: https://www.york.ac.uk/media/economi...c/wp/05_05.pdf

Kind regards,
Carlo
(Stata 19.0)
John Adler

Join Date: Apr 2017

Posts: 173
#6

26 Jun 2018, 11:00

Dear Carlo,

Thank you for your feedback and for the resources you have supplied. I will endeavour to employ these in my analysis.

Kindest regards,

John
John Adler

Join Date: Apr 2017

Posts: 173
#7

26 Jun 2018, 12:17

Dear Carlo Lazzaro

Just a brief follow up question,

You mentioned that, as my data are ordinal, you would prefer -ologit- to -chi2-. I would be interested to know whether a case could still be made for using -chi2- in this situation, or whether it would be entirely incorrect to do so.

Kindest regards,

John
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17673
#8

26 Jun 2018, 15:37

John:
I would not say that chi2 is totally incorrect, but it is probably not fully satisfactory when compared with the statistical procedures proposed for investigating attrition in panel data (as per the references quoted in my previous reply). Much also depends on the target audience of your research (colleagues, reviewers of a technical journal, discussants).

Kind regards,
Carlo
(Stata 19.0)
John Adler

Join Date: Apr 2017

Posts: 173
#9

02 Jul 2018, 13:38

Dear Carlo,

I must thank you for providing me with such excellent resources on the use of inverse probability weights to deal with the issue of attrition. In particular, Jones, A. M., Rice, N., d'Uva, T. B., & Balia, S. (2007), Applied Health Economics (Routledge Advanced Texts in Economics and Finance), has been a very informative and easy-to-follow guide.

Nonetheless I am having a minor issue that I hope you could provide some guidance on.

In my analysis I am interested in the relationship between health outcomes and local area unemployment for a group of mothers in panel data across 3 waves.

To compute the IPW estimator I estimate (probit) equations for response (r_it = 1) versus non-response (r_it = 0) at each wave, t = 1,…,T, conditional on a set of characteristics (z_i1) that are measured for all individuals at the first wave.

z_i1 includes the initial values of all of the regressors in the health equation.

Also it includes initial values of the y variable (self-rated health) and of the other indicators of morbidity.

Code:

* The binary variables below record whether the respondent was in the sample at each wave:
*   has_questionnaire_y0  = did the respondent have a y0 questionnaire
*   has_questionnaire_y5  = did the respondent have a y5 questionnaire
*   has_questionnaire_y10 = did the respondent have a y10 questionnaire

* Variables are then reshaped from wide to long for panel data analysis:
reshape long has_questionnaire_y binary_health_y medical_card_y binary_employment_y age_y psum_unemployed_total_cont_y, i(id) j(year)

Each wave is described by wavenum as below:

Code:

********************************************
* wavenum
********************************************
tab year
recode year (0=1) (5=2) (10=3) (else=.), gen(wavenum) label(wavenum)
label variable wavenum "wavenum"
tab year wavenum
* I create a wavenum variable to match what they have in the book, recoding my
* time period from a wave every five years starting in year 0 to wave 1,
* wave 2, and wave 3.

The following code is used to create variables that contain the initial values of the regressors at wave 1:

Code:

sort id wavenum
foreach X of varlist binary_health_y ///
    psum_unemployed_total_cont_y ///
    medical_card_y ///
    binary_employment_y ///
    age_y {
    by id: gen `X't1 = `X'[1]
}

* These are included in a global variable list:
global z1 "binary_health_yt1 psum_unemployed_total_cont_yt1 medical_card_yt1 binary_employment_yt1 age_yt1"

These variables are used in a sequence of probit models for response versus non-response: so the dependent variable is has_questionnaire_y, which indicates whether an observation is in the estimation sample at each wave as described above.

Code:

forvalues j = 1(1)3 {
    quietly probit has_questionnaire_y $z1 if (wavenum == `j')
    predict p`j' , p
    generate ipw `j' = 1/p`j'
}
generate imr = 0
forvalues k = 1(1)3 {
    replace imr = imr`k' if wavenum == `k'
}
generate ipw = 1
forvalues k = 1(1)3{
    replace ipw = ipw`k' if wavenum == `k'
}
}

I attempt to estimate the probits at each wave of the panel, from wave 1 to wave 3, using the full sample of individuals who are observed at wave 1.

The whole purpose is to create the new variable ipw: the inverse of the fitted probability of responding. Because they do so in the book, I also create the inverse Mills ratios (imr), predominantly to learn more about this.

However on executing the code I am faced with the following error:

Code:

. forvalues j = 1(1)3 {
  2. quietly probit has_questionnaire_y $z1 if (wavenum == `j')
  3. predict p`j' , p
  4. generate ipw `j' = 1/p`j'
  5. . }
r(2000);

end of do-file

r(2000);

.

Not one new variable is created.

According to the help files, this return code means:

Search of official help files, FAQs, Examples, SJs, and STBs

[P] error . . . . . . . . . . . . . . . . . . . . . . . . Return code 2000
no observations;
You have requested some statistical calculation and there are
no observations on which to perform it. Perhaps you specified
if or in and inadvertently filtered all the data.

(end of search)

But I am at a loss as to why this is. I can only suspect that my forvalues approach is incorrect, as this is not a route I often take. Is there anything that I have done here that jumps out to you as immediately incorrect? I tried to stick as closely as possible to the guide provided by the textbook, i.e. pages 283 to 287 of Applied Health Economics (Routledge Advanced Texts in Economics and Finance), more or less verbatim, so I really can't see where I may have gone wrong.
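One possible reading of the r(2000) above, offered as a sketch rather than a diagnosis: -probit- exits with r(2000) when the outcome does not vary in the estimation sample, and -quietly- hides the accompanying message. If has_questionnaire_y equals 1 for every mother when wavenum == 1 (an assumption about the data, which is not shown in the thread), the first pass of the loop would stop at -probit- and no variables would be created. The code below tabulates response by wave to check this and, if that is the case, fits the response probits for waves 2 and 3 only, setting the wave 1 weight to 1. It also writes ipw`j' without the space and leaves out imr, because no imr`k' variables are generated in the loop as posted.

```stata
* Check whether response varies within each wave
* (assumes the long-format data and global z1 defined earlier)
tabulate wavenum has_questionnaire_y, missing

* Fit the response probits only for waves where the outcome can vary
* (waves 2 and 3 here is an assumption; adjust after reading the table)
forvalues j = 2/3 {
    probit has_questionnaire_y $z1 if wavenum == `j'
    predict double p`j' if e(sample), pr
    generate double ipw`j' = 1/p`j'
}

* Combine into one weight: 1 at wave 1, inverse fitted probability after
generate double ipw = 1 if wavenum == 1
forvalues k = 2/3 {
    replace ipw = ipw`k' if wavenum == `k'
}
```

Running -probit- without -quietly- on the first attempt makes any "outcome does not vary" message visible, which would confirm or rule out this explanation.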

Any advice received is greatly appreciated,

Very best,

John