Comparing frequency tables

Brian Poi

Join Date: Feb 2021
Posts: 22

02 Oct 2022, 12:33

I need some pointers as to what type of test I should be looking at to compare two frequency tables. For a concrete example, consider this:

Code:

. clear

. set seed 1

. set obs 10000
Number of observations (_N) was 0, now 10,000.

. gen str stream = "B"

. replace stream = "A" in 1/500
(500 real changes made)

. gen y = runiformint(1, 10)

. bys stream: tab y

---------------------------------------------------------------------------------
-> stream = A

          y |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |         44        8.80        8.80
          2 |         44        8.80       17.60
          3 |         43        8.60       26.20
          4 |         55       11.00       37.20
          5 |         35        7.00       44.20
          6 |         51       10.20       54.40
          7 |         45        9.00       63.40
          8 |         66       13.20       76.60
          9 |         65       13.00       89.60
         10 |         52       10.40      100.00
------------+-----------------------------------
      Total |        500      100.00

---------------------------------------------------------------------------------
-> stream = B

          y |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |        915        9.63        9.63
          2 |        933        9.82       19.45
          3 |        928        9.77       29.22
          4 |        929        9.78       39.00
          5 |        989       10.41       49.41
          6 |        919        9.67       59.08
          7 |        940        9.89       68.98
          8 |      1,037       10.92       79.89
          9 |        943        9.93       89.82
         10 |        967       10.18      100.00
------------+-----------------------------------
      Total |      9,500      100.00

My variable y is categorical with k unordered levels. I need some way to either test formally that the distribution of y in stream A is the same as the distribution of y in stream B or else have some measure of the similarity or difference between the two distributions. It seems that having k more than two complicates things, since then stream A could differ from stream B in multiple "directions" (for lack of a better term). I do not have covariates that I can use as relevant controls; I just have the variable y. I want to know if what I see in stream A is similar to what I see in stream B. I will of course use some type of bar chart to show the similarity or difference visually, but a formal hypothesis test would be nice.

In my application stream B will always be much larger than stream A, and I am willing to assume stream B is fixed and has no sampling variation itself.

I imagine this is a solved problem, but I have spent a lot of time searching DuckDuckGo for things like "comparing frequency tables" with little success.

I just need to know what methodology I should read up on to proceed with this.

Thanks!

Leonardo Guizzetti

Join Date: Jul 2016

Posts: 2402
#2

02 Oct 2022, 13:39

Pearson's chi2 test of homogeneity (or independence) would be appropriate here and is pretty entry level as far as statistical tests go. The null hypothesis of the test is that the two distributions are the same, and its rejection would provide evidence that they are not.

Code:

tab y stream, chi2
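A side note on the assumption in #1 that stream B is fixed: the two-sample test above treats both streams as samples. One possible alternative, offered only as a sketch, is a one-sample goodness-of-fit test that compares the stream A counts with the counts expected under the stream B proportions. The code below assumes the variables y and stream from the example in #1; the names n, totA, totB, expA, contrib, and chi2stat are illustrative choices, not anything from the original posts.

```stata
* Sketch only: goodness-of-fit test of stream A against stream B's proportions,
* treating stream B as fixed (an assumption taken from post #1).
preserve

* One row per (stream, y) cell with its count in n; keep empty cells as zeros
contract stream y, freq(n) zero

* One row per level of y, with the counts side by side in nA and nB
reshape wide n, i(y) j(stream) string

* Stream totals
egen double totA = total(nA)
egen double totB = total(nB)

* Expected stream A count in each category under stream B's proportions
generate double expA = totA * nB / totB

* Pearson statistic: sum over categories of (observed - expected)^2 / expected
generate double contrib = (nA - expA)^2 / expA
egen double chi2stat = total(contrib)

* k categories give k - 1 degrees of freedom
local df = _N - 1
local stat = chi2stat[1]
display "chi2(`df') = " %8.3f `stat' "   p = " %6.4f chi2tail(`df', `stat')

restore
```

With k categories, the statistic is compared with a chi-squared distribution on k - 1 degrees of freedom. For a descriptive measure of how far apart the two streams are, tab y stream, chi2 V also reports Cramér's V.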
Brian Poi

Join Date: Feb 2021

Posts: 22
#3

02 Oct 2022, 13:48

Originally posted by Leonardo Guizzetti:

Pearson's chi2 test of homogeneity (or independence) would be appropriate here and is pretty entry level as far as statistical tests go. The null hypothesis of the test is that the two distributions are the same, and its rejection would provide evidence that they are not.

Thank you, Leonardo. I knew there had to be a simple test for this; I just completely forgot about this one.