Identifying significant difference between two categorical variables (need help regarding the statistical test)

Rahul Raoniar

Join Date: Sep 2021

Posts: 19
#1

25 Sep 2021, 21:08

Hi everyone. While reading an article, I came across the following lines, in which the author used a hypothesis test to check whether a significant difference exists between categorical variables. The lines are as follows:

"We found a significant gender difference in the crossing behaviours of the pedestrians arriving to the crosswalk during a red-light phase, such that male pedestrians crossed on a red light more frequently than female pedestrians, (χ²(1, N= 563) = 17.17, p< 0.001). Age failed to yield significant differences in crossing behaviours, (χ²(2, N = 1392) = 1.02, p = NS), and was subsequently removed from further analyses."

Initially, based on the reported statistics, I thought the author had performed a chi-squared test of independence, but I am unsure whether a test of independence can be used to detect differences between groups, since it is normally used to identify association.

Can anyone tell which test the author used?

Note: gender (male/female), crossing behaviour (crossed on red or waited for green) and age (0-20, 20-40, 40-60 and 60+) are all categorical variables.

Thanks in advance.

Last edited by Rahul Raoniar; 25 Sep 2021, 21:11.
Tags: categorical, chi-squared values, hypothesis, test
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#2

25 Sep 2021, 21:16

To say that red-light crossing behavior is independent of X is the same thing as saying that it is not associated with X.

Now, if the author doesn't tell you how he or she got the chi-square statistics, well, that is just bad writing, and it's another black mark on peer review that he or she was able to publish that. The results could have come from analysis of a 2x2 table, but they might have come from a logistic regression that included other covariates--which would give the finding a rather different meaning. There's no way to tell from the quote you provide.

Finally, there is something very strange about these results. Why is the N for the age test 1,392, but for the gender test it is only 563? In real world data, it is more common for age to be missing than sex, and a discrepancy in the sample sizes this large would be unusual for any two variables in any direction. There may be some perfectly innocent explanation for this that might be apparent with a full disclosure of the sampling and measurement designs. But it's hard to think of one that could produce that kind of difference in N.
Rahul Raoniar

#3

25 Sep 2021, 21:38

Thanks, Clyde Schechter, for the reply. To add more context, I have included the table.

N = 1392 is the total sample size

Of the 1392 valid observations, 563 (40.4%) pedestrians arrived at the crosswalk while lights were red.

The aim of the study is to model pedestrians' signal-violation behaviour (whether a pedestrian waits for the green light or crosses during the red light/do-not-walk phase) using logistic regression (Table 2). Before fitting the logistic regression, the author checked the significance of the selected variables, especially gender and age, using the statistical test quoted above. For example, age was excluded (it does not appear in Table 2) because it was not significant in the hypothesis test (χ²(2, N = 1392) = 1.02, p = NS).

The first table reports the frequencies.

Thanks in advance.

Last edited by Rahul Raoniar; 25 Sep 2021, 22:28.
Rahul Raoniar

#4

25 Sep 2021, 23:09

If anyone wants to take a look, this is the article.

Rosenbloom, T., 2009. Crossing at a red light: Behaviour of individuals and groups. Transportation Research Part F: Traffic Psychology and Behaviour 12, 389–394. doi:10.1016/j.trf.2009.05.002.

Carlo Lazzaro

Join Date: Apr 2014
Posts: 17707

26 Sep 2021, 06:44

Rahul:
while I share Clyde's concerns about both the methodology and the review process of this paper, the first statistic you were looking for is the result of:

Code:

. tabi 185 48 \ 302 28, chi2

           |          col
       row |         1          2 |     Total
-----------+----------------------+----------
         1 |       185         48 |       233
         2 |       302         28 |       330
-----------+----------------------+----------
     Total |       487         76 |       563

          Pearson chi2(1) =  17.1694   Pr = 0.000

.

whereas the second one (obtained by inappropriately replacing the missing value with zero, since -tabi- otherwise throws a syntax error) gives, as expected, a different (although still non-significant) result:

Code:

. tabi 780 462 72\37 22 4\12 3 0, chi2

           |               col
       row |         1          2          3 |     Total
-----------+---------------------------------+----------
         1 |       780        462         72 |     1,314
         2 |        37         22          4 |        63
         3 |        12          3          0 |        15
-----------+---------------------------------+----------
     Total |       829        487         76 |     1,392

          Pearson chi2(4) =   2.9538   Pr = 0.566

.
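
As a cross-check outside Stata, both Pearson statistics can be recomputed by hand from the cell counts. What follows is a minimal Python sketch (standard library only), not the paper's method; the function names `pearson_chi2` and `two_proportion_z` are made up for illustration. It also speaks to the original question: on a 2x2 table, the test of independence is the same test as a comparison of two group proportions, because the squared pooled two-proportion z statistic equals the Pearson chi-square.

```python
from math import sqrt


def pearson_chi2(table):
    """Return (chi2, df) for an r x c table of counts, no continuity correction."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df


def two_proportion_z(x1, n1, x2, n2):
    """Pooled z statistic for H0: the two group proportions are equal."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    return (p1 - p2) / sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))


# Counts copied from the two -tabi- commands above.
table_2x2 = [[185, 48], [302, 28]]
table_3x3 = [[780, 462, 72], [37, 22, 4], [12, 3, 0]]

chi2_a, df_a = pearson_chi2(table_2x2)
chi2_b, df_b = pearson_chi2(table_3x3)

# Share of column 2 in row 1 (48 of 233) versus row 2 (28 of 330).
z = two_proportion_z(48, 233, 28, 330)

print(f"2x2: chi2({df_a}) = {chi2_a:.4f}, z^2 = {z * z:.4f}")
print(f"3x3: chi2({df_b}) = {chi2_b:.4f}")
```

The printed statistics should agree with the Stata output above (17.1694 on 1 df and 2.9538 on 4 df), and the z² value should coincide with the 2x2 chi-square.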

Kind regards,
Carlo
(Stata 19.0)


Clyde Schechter


26 Sep 2021, 11:24

Well, I cannot replicate the author's findings for the association between age group and red-light crossing:

Code:

. clear*

. set obs 3
Number of observations (_N) was 0, now 3.

. label define age_group 1 "20-40" 2 "40-60" 3 "60+"

. gen age_group:age_group = _n

. expand 2
(3 observations created)

. label define light 1 "Green" 2 "Red"

. by age_group, sort: gen light:light = _n

.
.
. gen crossings = 462 in 1
(5 missing values generated)

. replace crossings = 72 in 2
(1 real change made)

. replace crossings = 22 in 3
(1 real change made)

. replace crossings = 4 in 4
(1 real change made)

. replace crossings = 3 in 5
(1 real change made)

.
. tab age_group light [fweight = crossings], chi2

           |         light
 age_group |     Green        Red |     Total
-----------+----------------------+----------
     20-40 |       462         72 |       534
     40-60 |        22          4 |        26
       60+ |         3          0 |         3
-----------+----------------------+----------
     Total |       487         76 |       563

          Pearson chi2(2) =   0.5474   Pr = 0.761

.
. gen byte red_crossing = 2.light

.
. logistic red_crossing i.age_group [fweight = crossings]
note: 3.age_group != 0 predicts failure perfectly;
      3.age_group omitted and 1 obs not used.


Logistic regression                                     Number of obs =    560
                                                        LR chi2(1)    =   0.07
                                                        Prob > chi2   = 0.7858
Log likelihood = -222.34284                             Pseudo R2     = 0.0002

------------------------------------------------------------------------------
red_crossing | Odds ratio   Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
   age_group |
      40-60  |   1.166667   .6511505     0.28   0.782     .3907208     3.48359
        60+  |          1  (empty)
             |
       _cons |   .1558442   .0197458   -14.67   0.000     .1215743    .1997742
------------------------------------------------------------------------------
Note: _cons estimates baseline odds.

.
.
end of do-file

As you can see, the chi-square statistic from a cross-tabulation yields a different value from that reported (and also is based on a different number of degrees of freedom). The 2 df would be compatible with a logistic regression having the 3 age groups as the explanatory variable and red-light crossing as the outcome, but, as you can see, because there are no red-light crossings in the 60+ age group, there is actually only one df for age in the model when you implement it. So I do not know where the author got that from. Perhaps there is an explanation somewhere in the article itself. If not, it just confirms that the paper was poorly written and got sloppy editorial review.
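
The logistic output above can be checked by hand from the cell counts, which also makes the lost degree of freedom visible. This is a minimal Python sketch (standard library only; the variable names are made up for illustration). With a single categorical predictor, the fitted odds in each group are simply red/green, so the odds ratio for 40-60 versus 20-40 and its Wald standard error follow directly. The 60+ group has odds of exactly zero and therefore no finite log odds.

```python
from math import log, sqrt

# (green, red) counts by age group, copied from the cross-tabulation above.
counts = {"20-40": (462, 72), "40-60": (22, 4), "60+": (3, 0)}

odds = {group: red / green for group, (green, red) in counts.items()}
baseline_odds = odds["20-40"]                # corresponds to _cons
odds_ratio = odds["40-60"] / baseline_odds   # corresponds to the 40-60 row

# Wald standard error of the log odds ratio: square root of the summed
# reciprocals of the four cells involved in the comparison.
se_log_or = sqrt(sum(1 / c for c in counts["20-40"] + counts["40-60"]))
se_or = odds_ratio * se_log_or               # delta-method SE on the odds-ratio scale
z = log(odds_ratio) / se_log_or

# A group with zero red-light crossings has odds 0, so its log odds is not
# finite; those observations cannot contribute to the fit.
dropped = [group for group, (green, red) in counts.items() if red == 0]
n_used = sum(green + red for group, (green, red) in counts.items()
             if group not in dropped)

print(f"baseline odds = {baseline_odds:.7f}")
print(f"odds ratio 40-60 vs 20-40 = {odds_ratio:.6f} (SE {se_or:.7f}, z = {z:.2f})")
print(f"dropped groups: {dropped}, observations used: {n_used}")
```

These hand calculations should line up with the odds ratio, standard error, constant and number of observations in the Stata output, with only the 20-40 versus 40-60 contrast left to estimate.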

As a side issue, I will point out that although it is quite common to see variables omitted from the multivariable model when they do not show a statistically significant bivariate association with the outcome, this practice is just dead wrong. I'm not even talking about statistical significance as an issue here. Even for people who believe in statistical significance, this is a bad practice. The reason is that the variable in question may still be associated with other predictors that are being included in the model, and, in that case, leaving it out results in omitted variable bias. In the particular situation here, this even seems likely to have occurred. Elderly people often choose their walking routes to avoid certain types of dangerous intersections, such as ones where there is a lot of traffic. So there may well be confounding between age and the traffic volume variable that needs to be corrected by inclusion of age in the model.
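
The omitted-variable point can be illustrated with a deliberately artificial example. The sketch below (Python, standard library only) uses invented numbers and a linear outcome rather than the paper's data or a logistic model, purely to keep the arithmetic transparent. In it, `z` has exactly zero bivariate covariance with `y`, so a bivariate screen would discard it, yet dropping it changes the coefficient on `x`. All names and values are hypothetical.

```python
# Hypothetical data: x and z each take the values -1/+1 and are positively related.
# Each tuple is (x, z, number of observations).
cells = [(1, 1, 3), (1, -1, 1), (-1, 1, 1), (-1, -1, 3)]

rows = []
for x, z, freq in cells:
    y = 1.0 * x - 0.5 * z   # assumed true model: 1 on x, -0.5 on z, no noise
    rows.extend([(x, z, y)] * freq)

n = len(rows)


def mean(values):
    return sum(values) / n


def cov(a, b):
    ma, mb = mean(a), mean(b)
    return sum((u - ma) * (v - mb) for u, v in zip(a, b)) / n


xs, zs, ys = (list(col) for col in zip(*rows))

# Bivariate screen: z looks unrelated to y.
cov_zy = cov(zs, ys)

# Regression of y on x alone (z omitted).
slope_x_only = cov(xs, ys) / cov(xs, xs)

# Regression of y on x and z: solve the 2x2 normal equations by Cramer's rule.
sxx, szz, sxz = cov(xs, xs), cov(zs, zs), cov(xs, zs)
sxy, szy = cov(xs, ys), cov(zs, ys)
det = sxx * szz - sxz ** 2
slope_x_adj = (sxy * szz - szy * sxz) / det
slope_z_adj = (szy * sxx - sxy * sxz) / det

print(f"cov(z, y) = {cov_zy:.3f}")
print(f"coefficient on x, z omitted:  {slope_x_only:.3f}")
print(f"coefficient on x, z included: {slope_x_adj:.3f} (z: {slope_z_adj:.3f})")
```

The direction and size of the distortion here are artifacts of the invented numbers. In a logistic model the arithmetic differs, but the mechanism is the same: a dropped variable that is correlated with a retained predictor.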

I wouldn't put a lot of credence in the results of this paper unless in other parts of the paper there are good, convincing explanations for these issues.


Rahul Raoniar

#7

28 Sep 2021, 11:13

Thank you, Carlo Lazzaro and Clyde Schechter, for the wonderful explanations. I learned a lot from these comments and appreciate both of your contributions to the Stata forum 👏👏. Thank you!

Last edited by Rahul Raoniar; 28 Sep 2021, 11:16.