I need some pointers as to what type of test I should be looking at to compare two frequency tables. For a concrete example, consider this:
My variable y is categorical with k unordered levels. I need some way to either test formally that the distribution of y in stream A is the same as the distribution of y in stream B or else have some measure of the similarity or difference between the two distributions. It seems that having k more than two complicates things, since then stream A could differ from stream B in multiple "directions" (for lack of a better term). I do not have covariates that I can use as relevant controls; I just have the variable y. I want to know if what I see in stream A is similar to what I see in stream B. I will of course use some type of bar chart to show the similarity or difference visually, but a formal hypothesis test would be nice.
In my application stream B will always be much larger than stream A, and I am willing to assume stream B is fixed and has no sampling variation itself.
I imagine this is a solved problem, but I have spent much time with DuckDuckGo searching for things like "comparing frequency tables" with little success.
I just need to know what methodology I should read up on to proceed with this.
Thanks!
Code:
. clear

. set seed 1

. set obs 10000
Number of observations (_N) was 0, now 10,000.

. gen str stream = "B"

. replace stream = "A" in 1/500
(500 real changes made)

. gen y = runiformint(1, 10)

. bys stream: tab y

---------------------------------------------------------------------------------
-> stream = A

          y |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |         44        8.80        8.80
          2 |         44        8.80       17.60
          3 |         43        8.60       26.20
          4 |         55       11.00       37.20
          5 |         35        7.00       44.20
          6 |         51       10.20       54.40
          7 |         45        9.00       63.40
          8 |         66       13.20       76.60
          9 |         65       13.00       89.60
         10 |         52       10.40      100.00
------------+-----------------------------------
      Total |        500      100.00

---------------------------------------------------------------------------------
-> stream = B

          y |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |        915        9.63        9.63
          2 |        933        9.82       19.45
          3 |        928        9.77       29.22
          4 |        929        9.78       39.00
          5 |        989       10.41       49.41
          6 |        919        9.67       59.08
          7 |        940        9.89       68.98
          8 |      1,037       10.92       79.89
          9 |        943        9.93       89.82
         10 |        967       10.18      100.00
------------+-----------------------------------
      Total |      9,500      100.00
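For what it is worth, one candidate I could imagine for the formal test, if stream B's proportions really are treated as fixed and known, is a Pearson chi-squared goodness-of-fit test of stream A's counts against those proportions. Below is a minimal sketch of what I mean, not a vetted method. It assumes the example data above are still in memory and that every level of y appears in both streams (otherwise the reshape leaves missing counts). The names nA, nB, expected, contrib, and chi2 are mine, purely for illustration.

```stata
* Sketch: goodness-of-fit of stream A against stream B's proportions,
* treating B as fixed (an assumption, not an established recommendation).
preserve

* One row per stream-by-y cell, with the cell count stored in _freq
contract stream y

* One row per level of y, with counts in _freqA and _freqB
reshape wide _freq, i(y) j(stream) string

* Stream totals, repeated on every row
egen double nA = total(_freqA)
egen double nB = total(_freqB)

* Counts expected in A if A followed B's proportions
generate double expected = nA * _freqB / nB

* Pearson contribution of each level, then the overall statistic
generate double contrib = (_freqA - expected)^2 / expected
egen double chi2 = total(contrib)

* k levels give k - 1 degrees of freedom
local df = _N - 1
display "Pearson chi2(`df') = " %8.3f chi2[1] "   p = " %6.4f chi2tail(`df', chi2[1])

restore
```

If I did not want to treat stream B as fixed, I suppose `tabulate y stream, chi2` would be the analogous test with both streams regarded as samples.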