carrying out thousands of chi sq tests

tim bradshaw

Join Date: Feb 2016

Posts: 192
#1

carrying out thousands of chi sq tests

20 Jul 2017, 08:36

Hi there

I want to carry out a very large number (>50,000) of chi-sq tests, based on 2x2 tables, to compare cases and controls for >50,000 binary (yes/no) exposures. This is for the purpose of creating a manhattan plot of p-values.

Ordinarily I would just create lots of binary variables for the exposures and repeat the following command:

Code:

tab casestatus exposure1, chi tab casestatus exposure2, chi tab casestatus exposure3, chi [etc...]

But clearly this is inefficient for so many exposures! Also, my number of exposures exceeds maxvar for Stata 13/IC, so this approach would not be possible anyway.

Any alternative suggestions would be greatly appreciated!
Tags: None
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#2

20 Jul 2017, 09:03

Well, if your exposures are really called exposure1 exposure2 exposure3, etc. you could do it in a loop. But also, if you want to make a plot out of the p-values, you will need to retain them in a dataset. So you need to use a -postfile-, too.

Code:

capture postutil clear tempfile pvalues postfile handle str32 exposure float pvalue using `pvalues' foreach v of varlist exposure* { quietly tab casestatus `v', chi2 post handle ("`v'") (`r(p)') } postclose handle use `pvalues', clear // CODE TO MAKE THE MANHATTAN PLOT GOES HERE
Comment
tim bradshaw

Join Date: Feb 2016

Posts: 192
#3

20 Jul 2017, 09:56

Thanks Clyde.
And presumably the only way of doing the above with such a vast number of exposure variables is to create multiple datasets and load them in sequentially?
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#4

20 Jul 2017, 10:03

Well, it depends. If you are running Stata 15/MP, you can have up to 120,000 variables in a single data set. If you are on SE or IC you will not be able to include 50,000 in a data set and will have to create multiples.

If you are working in genomics and will be doing things like this regularly, it may be worth your while to upgrade your Stata 15 license to MP from IC or SE if that isn't what you have. The marginal cost of that upgrade is not all that much, if I recall. Contact the sales office at StataCorp for information. If you are on Stata 14, then even with MP you cannot exceed 32,767 variables in one data set.

Note that if you are going to have to do this task using several different data sets processed serially, it is probably best not to use a temporary file to hold the p-values. Rather, I would create a regular data file for the purpose, and use the -append- option in the -postfile- command so that in the end all 50,000 p-values end up in one data set for graphing.
Comment
tim bradshaw

Join Date: Feb 2016

Posts: 192
#5

21 Jul 2017, 06:55

Thanks Clyde.
I didn't even know about 'postfile', which has now changed my life.
Comment
tim bradshaw

Join Date: Feb 2016

Posts: 192
#6

25 Jul 2017, 11:01

Clyde's solution in #2 works perfectly for when data are 'long'.

A consequence of reshaping data for matched pairs analysis is that exposure variables occur twice (once for the case, once for the control).

So instead of looping the following:

Code:

tab casestatus exposure1, chi tab casestatus exposure2, chi

into:

Code:

foreach v of varlist exposure* { quietly tab casestatus `v', chi2 }

...one has to loop the following:

Code:

tab exposure5_0 exposure5_1 tab exposure4_0 exposure4_1 tab exposure3_0 exposure3_1 tab exposure2_0 exposure2_1 tab exposure1_0 exposure1_1

Any advice on how to do this would be greatly appreciated.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#7

25 Jul 2017, 11:12

So with wide data, it's just a tad harder here:

Code:

ds *_0, local(exposures) local exposures: subinstr local "_0" "", all foreach e of local exposures { tab `e'_0 `e'_1 }

Note: Assumes that all exposure variable names appear in a _0 and _1 version and that there are no variable names ending in _0 other than the exposures. If not, the statements above the loop must be replaced with code, appropriate to the actual data set, that generates a list of the exposure variable names stripped of their _0 suffixes.
Comment

tim bradshaw

Join Date: Feb 2016
Posts: 192

26 Jul 2017, 04:43

Many thanks again, Clyde.

I tested the code on the following web-based data and received the error "option local() not allowed". Apologies if I'm overlooking something obvious. Your advice is much appreciated.

Code:

clear
webuse lowbirth.dta
keep pairid low smoke ptd
reshape wide smoke ptd, i(pairid) j(low)
rename smoke0 smoke_0
rename smoke1 smoke_1
rename ptd0 ptd_0
rename ptd1 ptd_1

ds *_0, local(exposures)
local exposures: subinstr local "0" "", all
foreach e of local exposures {
    tab `e'0 `e'1
}

Comment

Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#9

26 Jul 2017, 08:31

Sorry, my error. I was confusing the syntax of -ds- with that of -levelsof-, both of which are frequently used to generate a list of things to loop over. The -ds- command needs to be replaced by the following two commands:

Code:

ds *_0 local exposures `r(varlist)'
Comment
Bruce Weaver

Join Date: May 2014

Posts: 1133
#10

26 Jul 2017, 09:50

Sorry Tim, I got a bit lost. Did the problem in #1 have matched pairs of cases & controls or not? If not, I was thinking that reshaping the data from WIDE to LONG, and then using statsby to gather the p-values would be fairly straightforward approach. E.g.,

Code:

* Generate a small example of a wide file clear input id casestatus exposure1 exposure2 exposure3 1 1 1 0 1 1 2 1 1 0 1 1 3 1 1 1 1 1 4 1 0 0 0 1 5 0 0 1 1 0 6 0 0 0 0 0 7 0 1 1 0 0 8 0 0 0 1 0 9 0 0 1 1 0 end * Reshape data from WIDE to LONG reshape long exposure, i(id) j(expno) * list // Uncomment to see listing of the LONG file * Use -statsby- to gather the p-values statsby pvalue=r(p), by(expno) clear: tab casestatus exposure, chi2 list

The statsby approach ought to work for matched pairs too, but with mcc in place of tab (I would guess).

HTH.

--
Bruce Weaver
Email: [email protected]
Version: Stata/MP 19.5 (Windows)
Comment

tim bradshaw

Join Date: Feb 2016
Posts: 192

#11

27 Jul 2017, 03:52

Thanks for all help with this.

The following is my solution for carrying out thousands of chi sq tests on matched paired data. (Clyde's solution in #2 above does the job for unmatched data.)

Code:

clear
webuse lowbirth.dta
keep pairid low smoke ptd
reshape wide smoke ptd, i(pairid) j(low)

capture postutil clear
tempfile results
postfile handle str32 exposure float caseno_controlno float caseno_controlyes float caseyes_controlno float caseyes_controlyes float mcc_or float lowerci float upperci float chi float pvalue using `results'

quietly ds *0
local exposures `r(varlist)'
foreach var of any `r(varlist)' {
local newvar = substr("`var'", 1, length("`var'")-1)
quietly tab `newvar'0 `newvar'1, matcell(x)
quietly mcc `newvar'1 `newvar'0
post handle ("`newvar'") (x[1,1]) (x[1,2]) (x[2,1]) (x[2,2]) (`r(or)') (`r(lb_or)') (`r(ub_or)') (`r(chi2)') (`r(p_exact)')
}

postclose handle
use `results', clear

Announcement

carrying out thousands of chi sq tests

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment