Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • carrying out thousands of chi sq tests

    Hi there

    I want to carry out a very large number (>50,000) of chi-sq tests, based on 2x2 tables, to compare cases and controls for >50,000 binary (yes/no) exposures. This is for the purpose of creating a manhattan plot of p-values.

    Ordinarily I would just create lots of binary variables for the exposures and repeat the following command:

    Code:
    tab casestatus exposure1, chi
    tab casestatus exposure2, chi
    tab casestatus exposure3, chi
    [etc...]
    But clearly this is inefficient for so many exposures! Also, my number of exposures exceeds maxvar for Stata 13/IC, so this approach would not be possible anyway.

    Any alternative suggestions would be greatly appreciated!

  • #2
    Well, if your exposures are really called exposure1 exposure2 exposure3, etc. you could do it in a loop. But also, if you want to make a plot out of the p-values, you will need to retain them in a dataset. So you need to use a -postfile-, too.

    Code:
    capture postutil clear
    tempfile pvalues
    postfile handle str32 exposure float pvalue using `pvalues'
    
    foreach v of varlist exposure* {
        quietly tab casestatus `v', chi2
        post handle ("`v'") (`r(p)')
    }
    postclose handle
    
    use `pvalues', clear
    // CODE TO MAKE THE MANHATTAN PLOT GOES HERE

    Comment


    • #3
      Thanks Clyde.
      And presumably the only way of doing the above with such a vast number of exposure variables is to create multiple datasets and load them in sequentially?

      Comment


      • #4
        Well, it depends. If you are running Stata 15/MP, you can have up to 120,000 variables in a single data set. If you are on SE or IC you will not be able to include 50,000 in a data set and will have to create multiples.

        If you are working in genomics and will be doing things like this regularly, it may be worth your while to upgrade your Stata 15 license to MP from IC or SE if that isn't what you have. The marginal cost of that upgrade is not all that much, if I recall. Contact the sales office at StataCorp for information. If you are on Stata 14, then even with MP you cannot exceed 32,767 variables in one data set.

        Note that if you are going to have to do this task using several different data sets processed serially, it is probably best not to use a temporary file to hold the p-values. Rather, I would create a regular data file for the purpose, and use the -append- option in the -postfile- command so that in the end all 50,000 p-values end up in one data set for graphing.

        Comment


        • #5
          Thanks Clyde.
          I didn't even know about 'postfile', which has now changed my life.

          Comment


          • #6
            Clyde's solution in #2 works perfectly for when data are 'long'.

            A consequence of reshaping data for matched pairs analysis is that exposure variables occur twice (once for the case, once for the control).

            So instead of looping the following:

            Code:
            tab casestatus exposure1, chi
            tab casestatus exposure2, chi
            into:

            Code:
            foreach v of varlist exposure* {
                quietly tab casestatus `v', chi2
            }
            ...one has to loop the following:
            Code:
            tab exposure5_0 exposure5_1
            tab exposure4_0 exposure4_1
            tab exposure3_0 exposure3_1
            tab exposure2_0 exposure2_1
            tab exposure1_0 exposure1_1
            Any advice on how to do this would be greatly appreciated.

            Comment


            • #7
              So with wide data, it's just a tad harder here:

              Code:
              ds *_0, local(exposures)
              local exposures: subinstr local "_0" "", all
              foreach e of local exposures {
                  tab `e'_0 `e'_1
              }
              Note: Assumes that all exposure variable names appear in a _0 and _1 version and that there are no variable names ending in _0 other than the exposures. If not, the statements above the loop must be replaced with code, appropriate to the actual data set, that generates a list of the exposure variable names stripped of their _0 suffixes.

              Comment


              • #8
                Many thanks again, Clyde.

                I tested the code on the following web-based data and received the error "option local() not allowed". Apologies if I'm overlooking something obvious. Your advice is much appreciated.

                Code:
                clear
                webuse lowbirth.dta
                keep pairid low smoke ptd
                reshape wide smoke ptd, i(pairid) j(low)
                rename smoke0 smoke_0
                rename smoke1 smoke_1
                rename ptd0 ptd_0
                rename ptd1 ptd_1
                
                ds *_0, local(exposures)
                local exposures: subinstr local "0" "", all
                foreach e of local exposures {
                    tab `e'0 `e'1
                }

                Comment


                • #9
                  Sorry, my error. I was confusing the syntax of -ds- with that of -levelsof-, both of which are frequently used to generate a list of things to loop over. The -ds- command needs to be replaced by the following two commands:
                  Code:
                  ds *_0
                  local exposures `r(varlist)'

                  Comment


                  • #10
                    Sorry Tim, I got a bit lost. Did the problem in #1 have matched pairs of cases & controls or not? If not, I was thinking that reshaping the data from WIDE to LONG, and then using statsby to gather the p-values would be fairly straightforward approach. E.g.,

                    Code:
                    * Generate a small example of a wide file
                    clear
                    input id casestatus exposure1 exposure2 exposure3
                    1 1 1 0 1 1
                    2 1 1 0 1 1
                    3 1 1 1 1 1
                    4 1 0 0 0 1
                    5 0 0 1 1 0
                    6 0 0 0 0 0
                    7 0 1 1 0 0
                    8 0 0 0 1 0
                    9 0 0 1 1 0
                    end
                    * Reshape data from WIDE to LONG
                    reshape long exposure, i(id) j(expno)
                    * list // Uncomment to see listing of the LONG file
                    * Use -statsby- to gather the p-values
                    statsby pvalue=r(p), by(expno) clear:  tab casestatus exposure, chi2
                    list
                    The statsby approach ought to work for matched pairs too, but with mcc in place of tab (I would guess).

                    HTH.
                    --
                    Bruce Weaver
                    Email: [email protected]
                    Version: Stata/MP 18.5 (Windows)

                    Comment


                    • #11
                      Thanks for all help with this.

                      The following is my solution for carrying out thousands of chi sq tests on matched paired data. (Clyde's solution in #2 above does the job for unmatched data.)

                      Code:
                      clear
                      webuse lowbirth.dta
                      keep pairid low smoke ptd
                      reshape wide smoke ptd, i(pairid) j(low)
                      
                      capture postutil clear
                      tempfile results
                      postfile handle str32 exposure float caseno_controlno float caseno_controlyes float caseyes_controlno float caseyes_controlyes float mcc_or float lowerci float upperci float chi float pvalue using `results'
                      
                      quietly ds *0
                      local exposures `r(varlist)'
                      foreach var of any `r(varlist)' {
                      local newvar = substr("`var'", 1, length("`var'")-1)
                      quietly tab `newvar'0 `newvar'1, matcell(x)
                      quietly mcc `newvar'1 `newvar'0
                      post handle ("`newvar'") (x[1,1]) (x[1,2]) (x[2,1]) (x[2,2]) (`r(or)') (`r(lb_or)') (`r(ub_or)') (`r(chi2)') (`r(p_exact)')
                      }
                      
                      postclose handle
                      use `results', clear

                      Comment

                      Working...
                      X