Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Easier and faster way to get casewise statistics using tabstat?

    In my data set, I compare analytes analyzed in a test analyzer (X) to those analyzed in a reference analyzer (Y). In my table of descriptive statistics, I get the following for my raw data:

    Code:
     tabstat T3_X T3_Y MoK_X MoK_Y LyS_X LyS_Y, stats(n mean median min max) col(stat)
    
        variable   |       N         mean          sd           p50       min       max
    ---------------+------------------------------------------------------------------------------
       T3_X        |       205     3.892098    1.421584        4.04       .19      7.56
       T3_Y        |       203     3.877951    1.412951        4.02       .22      7.43
       MoK_X       |       183     8.855191    .9536298         8.8       6.6      11.5
       MoK_Y       |       184     8.855978    .9630706         8.8       6.5      11.6
       LyS_X       |       201     1.699502    2.658693         1.4        .1      36.9
       LyS_Y       |       202     1.910693    3.848795         1.5       .11      53.7
    -------------------------------------------------------------------------------------------------

    What I need is that the same number of non-missing observations appears for the same analyte that is analyzed in the test (X) and the reference (Y) analyzer. I thought that adding “casewise” to the previous syntax would generate statistics of interest, but I got the following result (check the N):

    Code:
     tabstat T3_X T3_Y MoK_X MoK_Y LyS_X LyS_Y, casewise stats(n mean median min max) col(stat)
    
        variable   |       N         mean          sd         p50        min       max
    ---------------+------------------------------------------------------------------------------
       T3_X        |       182     4.174835     1.14546      4.17       1.75       7.56
       T3_Y        |       182     4.159176     1.14005      4.14       1.88       7.43
       MoK_X       |       182     8.853846    .9560864       8.8        6.6       11.5
       MoK_Y       |       182     8.843956    .9524274       8.8        6.5       11.6
       LyS_X       |       182     1.782967    2.768706      1.45         .2       36.9
       LyS_Y       |       182     1.958846    3.996711       1.5         .2       53.7
    -------------------------------------------------------------------------------------------------

    To get what I really want, I must write a syntax for each variables of interest:

    Code:
    tabstat T3_X T3_Y, casewise stats(n mean median min max) col(stat)
    tabstat MoK_X MoK_Y, casewise stats(n mean median min max) col(stat)
    tabstat LyS_X LyS_Y, casewise stats(n mean median min max) col(stat)
    This is, however, not practical when I have about 50 analytes in my data set. What kind of syntax or code/loop can I have to get statistics for analytes with the same number of non-missing observations (as seen in the table below)?

    Code:
     tabstat ???????
    
        variable   |       N         mean          sd           p50       min       max
    ---------------+------------------------------------------------------------------------------
       T3_X        |       203     3.891238    1.221584        4.01       .21      7.36
       T3_Y        |       203     3.875551    1.362951        4.03       .22      7.33
       MoK_X       |       183     8.877191    .8136298         8.7       6.5      11.3
       MoK_Y       |       183     8.858778    .8530706         8.7       6.2      11.5
       LyS_X       |       201     1.699482    2.635693         1.3        .3      36.8
       LyS_Y       |       201     1.910363    3.718795         1.2       .31      53.7
    -------------------------------------------------------------------------------------------------

  • #2
    My understanding is that, for each characteristic T3, MoK, LyS, you want its summary statistics calculated on only those analytes for which the *X and *Y measurments were nonmissing, and you'd like to minimize typing effort. I'd use a loop:
    Code:
    foreach stub in T3_ MoK_ LyS_  {
       tabstat `stub'X `stub'Y , casewise stats(n mean median min max) col(stat)
    }
    Another kind of comment: With a "repeated measures" or "test/retest" situation such as you have, the "long format" organization of your data would be easier to work with for most purposes in Stata, and would offer other and perhaps easier ways to do the kind of thing(s) you want to do here. I'd recommend you take a look at the help for -reshape- and -by- and see if anything looks interesting.

    Comment


    • #3
      Thank you Mike! It worked! Is there a way to have all the results in one table? I'm a new user in Stata and appreciate every single tip

      Comment


      • #4
        I don't know, as I'm not typically needing something like this. I would recommend you look at -help tabstat-, and then look at the various commands' documentation listed under the "Also See" tab in the upper right corner of the help window.

        Comment


        • #5
          Thanks :-)

          Comment


          • #6
            Here is one way to approach it, loosely modelled on an answer in your other thread https://www.statalist.org/forums/for...e-correlations

            Code:
            * sandbox dataset
            clear
            set obs 12
            set seed 2803
            foreach s in AL BS CD {
                gen `s'_X = rnormal()
                gen `s'_Y = rnormal()
                replace `s'_Y = . if runiform() < 0.4 
            }
            
            summarize 
            
            * you start here
            unab stubs : *_X
            local stubs : subinstr local stubs "_X" "", all
            local nstubs : word count `stubs'
            local I = 2 * `nstubs' 
            
            matrix results = J(`I', 6, .)
            
            local i = 1
            gen touse = 0 
            local rownames 
            quietly foreach s of local stubs {
                replace touse = !missing(`s'_X, `s'_Y) 
                foreach v in `s'_X `s'_Y { 
                    su `v' if touse, detail 
                    mat results[`i', 1] = r(N)
                    mat results[`i', 2] = r(mean) 
                    mat results[`i', 3] = r(sd) 
                    mat results[`i', 4] = r(p50) 
                    mat results[`i', 5] = r(min) 
                    mat results[`i', 6] = r(max) 
                    local names `names' `v' 
                    local ++i
                }
            } 
            
            mat rownames results = `names'
            mat colnames results = n mean SD median min max 
            mat li results, format("%6.0g")  
            
            * results I got follow here
            
            results[6,6]
                       n    mean      SD  median     min     max
            AL_X       7   .5526    1.37   .6609  -1.834   2.512
            AL_Y       7  -.3375   1.065  -.2889  -2.093   1.087
            BS_X       7  -.4039   .9407  -.0875   -2.19   .7215
            BS_Y       7  -.2414   .7284  -.1332  -1.481    .717
            CD_X       5  -.2584   1.223  -.7595  -1.358   1.774
            CD_Y       5   .3211   .7484   .0885  -.6701   1.116
            You can put the matrix into variables using
            Code:
            svmat
            and then make a new dataset out of that.

            Another approach is to use
            Code:
            postfile
            to create a separate dataset.

            Yet another approach is to see that a different data structure would make life easier: This code follows the previous block, but if you start with your data, you don't need the drop touse You may well have an identifier variable already.

            Code:
            drop touse
            gen id = _n
            reshape long @_X @_Y, i(id) j(which) string
            keep if !missing(_X, _Y)
            
            . tabstat _X _Y, by(which) s(n mean sd p50 min max) c(s)
            
            Summary for variables: _X _Y
                 by categories of: which 
            
             which |         N      mean        sd       p50       min       max
            -------+------------------------------------------------------------
                AL |         7  .5525807  1.369795  .6609457 -1.833616  2.512294
                   |         7 -.3375053  1.064845 -.2888748 -2.092991  1.086969
            -------+------------------------------------------------------------
                BS |         7 -.4039159  .9407355 -.0875361 -2.189937  .7215156
                   |         7 -.2414407  .7283927 -.1331683 -1.481481  .7169558
            -------+------------------------------------------------------------
                CD |         5 -.2583826  1.223037 -.7595289 -1.357807  1.774275
                   |         5  .3211134   .748372  .0884645  -.670102  1.115987
            -------+------------------------------------------------------------
             Total |        19 -.0132241  1.205569 -.0875361 -2.189937  2.512294
                   |        19 -.1287924  .8702738 -.1331683 -2.092991  1.115987
            --------------------------------------------------------------------
            This last is what I would recommend most. Pepper the commands with
            Code:
            edit
            and
            Code:
            list
            to see what happens at each stage.

            Comment


            • #7
              Thank you Nick. The first code worked perfectly for the whole dataset. As I have replicate runs of each sample, I added "keep if Run==1" after the "foreach v in `s'_X `s'_Y { " in the loop. I get the right results but without the variable names in the table. I get the row numbers (r1, r2 etc.) instead. Should I modify something in the code to get the names of the variables in the table?

              The other code did not work as I get the error message r(198) type mismatch between variables. I need to spend time on the reshape command.

              Comment


              • #8
                As in your concurrent thread please study FAQ Advice #12 on how to report problems.

                This is the first time you have mentioned a variable Run. That's OK but we can't be expected to understand fully how it modifies the analysis without a full explanation and example.

                Similarly your type mismatch problem can't be understood without knowing more. Note that #6 is self-contained code that works. If your version of the same didn't work it's something to do with your data or your version of the code and we can't tell which.

                Comment


                • #9
                  Thanks for your advice. In our study, we analyzed blood samples (with unique Sampleid, then recoded to sampleno) in duplicates (Run 1 and 2) in a reference analyzer (X) and a test analyzer (Y). T3, MoK, LyS etc. are analytes measured in the blood samples by the analyzers. We need to compare the results of the analytes between the analyzers. I hereby attach a sample of my data:

                  Code:
                  Sampleid  Sampleno    Run     Date       T3_X     T3_Y   MoK_X  MoK_Y   LyS_X   LyS_Y    etc.
                  -------------------------------------------------------------------------------------------
                                                            
                  321           1       1       23/1       89       90.4   31.2   32.1    12      12.3    
                  321           1       2       23/1       92.8     91.5   31.9   31.3    13.1    12.5    
                  345           2       1       23/1       86.4     83.1   30.4   30.2    12.3    12.5    
                  345           2       2       23/1       86.7     84.9   31     30.5    12.2    12.7    
                  600           3       1       25/1       84.7     85.4   31.1   31.5    12.1    12.9    
                  600           3       2       25/1       85.8     86.1   31.6   31.1    12.8    12.7    
                  etc.
                  When I run the code above, I get the output that I want (the first 4 variables in the obtained table are string variables that will be removed manually later):

                  Code:
                  . unab stubs : *_X
                  
                  local stubs : subinstr local stubs "_X" "", all
                   local nstubs : word count `stubs'
                   local I = 2 * `nstubs'
                  
                   matrix results = J(`I', 6, .)
                   local i = 1
                  gen touse = 0
                   local rownames
                  
                  quietly foreach s of local stubs {
                       replace touse = !missing(`s'_X, `s'_Y)
                       foreach v in `s'_X `s'_Y {
                      su `v' if touse, detail
                           mat results[`i', 1] = r(N)
                           mat results[`i', 2] = r(mean)
                          mat results[`i', 3] = r(sd)
                          mat results[`i', 4] = r(p50)
                         mat results[`i', 5] = r(min)
                          mat results[`i', 6] = r(max)
                           local names `names' `v'
                          local ++i
                       }
                  }
                   mat rownames results = `names'
                  
                  mat colnames results = n mean SD median min max
                  mat li results, format("%6.0g") 
                   //** This is what I get **// 
                  results[46,6]
                                    n    mean      SD  median     min     max
                   APNA_X       0       .       .       .       .       .
                   APNA_Y       0       .       .       .       .       .
                  Flags_X       0       .       .       .       .       .
                  Flags_Y       0       .       .       .       .       .
                    T3_X      391   3.892   1.408    4.04     .19    7.56
                    T3_Y      391   3.879   1.399    4.03      .2    7.46
                    MoK_X     361   85.28   7.814    85.8    55.4   105.4
                    MoK_Y     361   86.05   7.886    86.4    55.6   107.1
                    LyS_X     361   35.22   9.231      35    18.7    63.1
                    LyS_Y     361   35.41   9.273    35.4    18.5    62.9
                  I wanted to do the same using the first replicate of the samples (only Run 1) so I started with dropping Run 2. I get the following with error:

                  Code:
                  . keep if Run==1
                  unab stubs : *_X
                  local stubs : subinstr local stubs "_X" "", all
                  local nstubs : word count `stubs'
                  local I = 2 * `nstubs'
                  matrix results = J(`I', 6, .)
                  local i = 1
                  gen touse = 0
                  local rownames
                  
                  quietly foreach s of local stubs {
                  replace touse = !missing(`s'_X, `s'_Y)
                  foreach v in `s'_X `s'_Y {
                  
                  //I even tried to put "keep if Run==1" here  and got the same results below.
                  
                  su `v' if touse, detail
                  mat results[`i', 1] = r(N)
                  mat results[`i', 2] = r(mean)
                  mat results[`i', 3] = r(sd)
                  mat results[`i', 4] = r(p50)
                  mat results[`i', 5] = r(min)
                  mat results[`i', 6] = r(max)
                  local names `names' `v'
                  local ++i
                     }
                   }
                  
                  //** This is what I get **//
                  
                  mat rownames results = `names'
                  conformability error
                  r(503);
                  
                  .
                  . mat colnames results = n mean SD median min max
                  
                  .
                  . mat li results, format("%6.0g")
                  
                  results[46,6]
                            n    mean      SD  median     min     max
                   r1       0       .       .       .       .       .
                   r2       0       .       .       .       .       .
                   r3       0       .       .       .       .       .
                   r4       0       .       .       .       .       .     
                   r5     204   3.898   1.422    4.04     .19    7.56
                   r6     204   3.884   1.414   4.025     .22    7.43
                   r7     188    85.4   7.994    85.8    55.4   105.3
                   r8     188   86.13   8.073   86.65    55.6   107.1
                   r9     188   35.29   9.407      35    18.7    63.1
                  r10     188   35.45   9.479   35.45    18.8    62.9
                  What should I do to get the varnames in the table above for results using the first run?

                  Comment


                  • #10
                    Clearly APNA* and Flag* are useless to you, so ignore those. Otherwise

                    Code:
                    local rownames 
                    would better as

                    Code:
                    local names
                    to blank out the local macro.

                    Comment


                    • #11
                      Thank you Nick. I've got what I want now.

                      Comment

                      Working...
                      X