Easier and faster way to get casewise statistics using tabstat?

Amanda Ode

Join Date: Mar 2019
Posts: 62

Easier and faster way to get casewise statistics using tabstat?

21 Mar 2019, 08:40

In my data set, I compare analytes analyzed in a test analyzer (X) to those analyzed in a reference analyzer (Y). In my table of descriptive statistics, I get the following for my raw data:

Code:

 tabstat T3_X T3_Y MoK_X MoK_Y LyS_X LyS_Y, stats(n mean median min max) col(stat)

    variable   |       N         mean          sd           p50       min       max
---------------+------------------------------------------------------------------------------
   T3_X        |       205     3.892098    1.421584        4.04       .19      7.56
   T3_Y        |       203     3.877951    1.412951        4.02       .22      7.43
   MoK_X       |       183     8.855191    .9536298         8.8       6.6      11.5
   MoK_Y       |       184     8.855978    .9630706         8.8       6.5      11.6
   LyS_X       |       201     1.699502    2.658693         1.4        .1      36.9
   LyS_Y       |       202     1.910693    3.848795         1.5       .11      53.7
-------------------------------------------------------------------------------------------------

What I need is that the same number of non-missing observations appears for the same analyte that is analyzed in the test (X) and the reference (Y) analyzer. I thought that adding “casewise” to the previous syntax would generate statistics of interest, but I got the following result (check the N):

Code:

 tabstat T3_X T3_Y MoK_X MoK_Y LyS_X LyS_Y, casewise stats(n mean median min max) col(stat)

    variable   |       N         mean          sd         p50        min       max
---------------+------------------------------------------------------------------------------
   T3_X        |       182     4.174835     1.14546      4.17       1.75       7.56
   T3_Y        |       182     4.159176     1.14005      4.14       1.88       7.43
   MoK_X       |       182     8.853846    .9560864       8.8        6.6       11.5
   MoK_Y       |       182     8.843956    .9524274       8.8        6.5       11.6
   LyS_X       |       182     1.782967    2.768706      1.45         .2       36.9
   LyS_Y       |       182     1.958846    3.996711       1.5         .2       53.7
-------------------------------------------------------------------------------------------------

To get what I really want, I must write a syntax for each variables of interest:

Code:

tabstat T3_X T3_Y, casewise stats(n mean median min max) col(stat)
tabstat MoK_X MoK_Y, casewise stats(n mean median min max) col(stat)
tabstat LyS_X LyS_Y, casewise stats(n mean median min max) col(stat)

This is, however, not practical when I have about 50 analytes in my data set. What kind of syntax or code/loop can I have to get statistics for analytes with the same number of non-missing observations (as seen in the table below)?

Code:

 tabstat ???????

    variable   |       N         mean          sd           p50       min       max
---------------+------------------------------------------------------------------------------
   T3_X        |       203     3.891238    1.221584        4.01       .21      7.36
   T3_Y        |       203     3.875551    1.362951        4.03       .22      7.33
   MoK_X       |       183     8.877191    .8136298         8.7       6.5      11.3
   MoK_Y       |       183     8.858778    .8530706         8.7       6.2      11.5
   LyS_X       |       201     1.699482    2.635693         1.3        .3      36.8
   LyS_Y       |       201     1.910363    3.718795         1.2       .31      53.7
-------------------------------------------------------------------------------------------------

Tags: None

Mike Lacy

Join Date: Apr 2014

Posts: 2416
#2

21 Mar 2019, 09:11

My understanding is that, for each characteristic T3, MoK, LyS, you want its summary statistics calculated on only those analytes for which the *X and *Y measurments were nonmissing, and you'd like to minimize typing effort. I'd use a loop:

Code:

foreach stub in T3_ MoK_ LyS_ { tabstat `stub'X `stub'Y , casewise stats(n mean median min max) col(stat) }

Another kind of comment: With a "repeated measures" or "test/retest" situation such as you have, the "long format" organization of your data would be easier to work with for most purposes in Stata, and would offer other and perhaps easier ways to do the kind of thing(s) you want to do here. I'd recommend you take a look at the help for -reshape- and -by- and see if anything looks interesting.
1 like
Comment
Amanda Ode

Join Date: Mar 2019

Posts: 62
#3

21 Mar 2019, 10:35

Thank you Mike! It worked! Is there a way to have all the results in one table? I'm a new user in Stata and appreciate every single tip
Comment
Mike Lacy

Join Date: Apr 2014

Posts: 2416
#4

21 Mar 2019, 11:01

I don't know, as I'm not typically needing something like this. I would recommend you look at -help tabstat-, and then look at the various commands' documentation listed under the "Also See" tab in the upper right corner of the help window.
Comment
Amanda Ode

Join Date: Mar 2019

Posts: 62
#5

21 Mar 2019, 11:04

Thanks :-)
Comment

Nick Cox

Join Date: Mar 2014
Posts: 35697

21 Mar 2019, 19:46

Here is one way to approach it, loosely modelled on an answer in your other thread https://www.statalist.org/forums/for...e-correlations

Code:

* sandbox dataset
clear
set obs 12
set seed 2803
foreach s in AL BS CD {
    gen `s'_X = rnormal()
    gen `s'_Y = rnormal()
    replace `s'_Y = . if runiform() < 0.4 
}

summarize 

* you start here
unab stubs : *_X
local stubs : subinstr local stubs "_X" "", all
local nstubs : word count `stubs'
local I = 2 * `nstubs' 

matrix results = J(`I', 6, .)

local i = 1
gen touse = 0 
local rownames 
quietly foreach s of local stubs {
    replace touse = !missing(`s'_X, `s'_Y) 
    foreach v in `s'_X `s'_Y { 
        su `v' if touse, detail 
        mat results[`i', 1] = r(N)
        mat results[`i', 2] = r(mean) 
        mat results[`i', 3] = r(sd) 
        mat results[`i', 4] = r(p50) 
        mat results[`i', 5] = r(min) 
        mat results[`i', 6] = r(max) 
        local names `names' `v' 
        local ++i
    }
} 

mat rownames results = `names'
mat colnames results = n mean SD median min max 
mat li results, format("%6.0g")  

* results I got follow here

results[6,6]
           n    mean      SD  median     min     max
AL_X       7   .5526    1.37   .6609  -1.834   2.512
AL_Y       7  -.3375   1.065  -.2889  -2.093   1.087
BS_X       7  -.4039   .9407  -.0875   -2.19   .7215
BS_Y       7  -.2414   .7284  -.1332  -1.481    .717
CD_X       5  -.2584   1.223  -.7595  -1.358   1.774
CD_Y       5   .3211   .7484   .0885  -.6701   1.116

You can put the matrix into variables using

Code:

svmat

and then make a new dataset out of that.

Another approach is to use

Code:

postfile

to create a separate dataset.

Yet another approach is to see that a different data structure would make life easier: This code follows the previous block, but if you start with your data, you don't need the drop touse You may well have an identifier variable already.

Code:

drop touse
gen id = _n
reshape long @_X @_Y, i(id) j(which) string
keep if !missing(_X, _Y)

. tabstat _X _Y, by(which) s(n mean sd p50 min max) c(s)

Summary for variables: _X _Y
     by categories of: which 

 which |         N      mean        sd       p50       min       max
-------+------------------------------------------------------------
    AL |         7  .5525807  1.369795  .6609457 -1.833616  2.512294
       |         7 -.3375053  1.064845 -.2888748 -2.092991  1.086969
-------+------------------------------------------------------------
    BS |         7 -.4039159  .9407355 -.0875361 -2.189937  .7215156
       |         7 -.2414407  .7283927 -.1331683 -1.481481  .7169558
-------+------------------------------------------------------------
    CD |         5 -.2583826  1.223037 -.7595289 -1.357807  1.774275
       |         5  .3211134   .748372  .0884645  -.670102  1.115987
-------+------------------------------------------------------------
 Total |        19 -.0132241  1.205569 -.0875361 -2.189937  2.512294
       |        19 -.1287924  .8702738 -.1331683 -2.092991  1.115987
--------------------------------------------------------------------

This last is what I would recommend most. Pepper the commands with

Code:

edit

and

Code:

list

to see what happens at each stage.

Comment

Amanda Ode

Join Date: Mar 2019

Posts: 62
#7

25 Mar 2019, 02:00

Thank you Nick. The first code worked perfectly for the whole dataset. As I have replicate runs of each sample, I added "keep if Run==1" after the "foreach v in `s'_X `s'_Y { " in the loop. I get the right results but without the variable names in the table. I get the row numbers (r1, r2 etc.) instead. Should I modify something in the code to get the names of the variables in the table?

The other code did not work as I get the error message r(198) type mismatch between variables. I need to spend time on the reshape command.
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35697
#8

25 Mar 2019, 03:35

As in your concurrent thread please study FAQ Advice #12 on how to report problems.

This is the first time you have mentioned a variable Run. That's OK but we can't be expected to understand fully how it modifies the analysis without a full explanation and example.

Similarly your type mismatch problem can't be understood without knowing more. Note that #6 is self-contained code that works. If your version of the same didn't work it's something to do with your data or your version of the code and we can't tell which.
Comment

Amanda Ode

Join Date: Mar 2019
Posts: 62

25 Mar 2019, 06:20

Thanks for your advice. In our study, we analyzed blood samples (with unique Sampleid, then recoded to sampleno) in duplicates (Run 1 and 2) in a reference analyzer (X) and a test analyzer (Y). T3, MoK, LyS etc. are analytes measured in the blood samples by the analyzers. We need to compare the results of the analytes between the analyzers. I hereby attach a sample of my data:

Code:

Sampleid  Sampleno    Run     Date       T3_X     T3_Y   MoK_X  MoK_Y   LyS_X   LyS_Y    etc.
-------------------------------------------------------------------------------------------
                                          
321           1       1       23/1       89       90.4   31.2   32.1    12      12.3    
321           1       2       23/1       92.8     91.5   31.9   31.3    13.1    12.5    
345           2       1       23/1       86.4     83.1   30.4   30.2    12.3    12.5    
345           2       2       23/1       86.7     84.9   31     30.5    12.2    12.7    
600           3       1       25/1       84.7     85.4   31.1   31.5    12.1    12.9    
600           3       2       25/1       85.8     86.1   31.6   31.1    12.8    12.7    
etc.

When I run the code above, I get the output that I want (the first 4 variables in the obtained table are string variables that will be removed manually later):

Code:

. unab stubs : *_X

local stubs : subinstr local stubs "_X" "", all
 local nstubs : word count `stubs'
 local I = 2 * `nstubs'

 matrix results = J(`I', 6, .)
 local i = 1
gen touse = 0
 local rownames

quietly foreach s of local stubs {
     replace touse = !missing(`s'_X, `s'_Y)
     foreach v in `s'_X `s'_Y {
    su `v' if touse, detail
         mat results[`i', 1] = r(N)
         mat results[`i', 2] = r(mean)
        mat results[`i', 3] = r(sd)
        mat results[`i', 4] = r(p50)
       mat results[`i', 5] = r(min)
        mat results[`i', 6] = r(max)
         local names `names' `v'
        local ++i
     }
}
 mat rownames results = `names'

mat colnames results = n mean SD median min max
mat li results, format("%6.0g") 
 //** This is what I get **// 
results[46,6]
                  n    mean      SD  median     min     max
 APNA_X       0       .       .       .       .       .
 APNA_Y       0       .       .       .       .       .
Flags_X       0       .       .       .       .       .
Flags_Y       0       .       .       .       .       .
  T3_X      391   3.892   1.408    4.04     .19    7.56
  T3_Y      391   3.879   1.399    4.03      .2    7.46
  MoK_X     361   85.28   7.814    85.8    55.4   105.4
  MoK_Y     361   86.05   7.886    86.4    55.6   107.1
  LyS_X     361   35.22   9.231      35    18.7    63.1
  LyS_Y     361   35.41   9.273    35.4    18.5    62.9

I wanted to do the same using the first replicate of the samples (only Run 1) so I started with dropping Run 2. I get the following with error:

Code:

. keep if Run==1
unab stubs : *_X
local stubs : subinstr local stubs "_X" "", all
local nstubs : word count `stubs'
local I = 2 * `nstubs'
matrix results = J(`I', 6, .)
local i = 1
gen touse = 0
local rownames

quietly foreach s of local stubs {
replace touse = !missing(`s'_X, `s'_Y)
foreach v in `s'_X `s'_Y {

//I even tried to put "keep if Run==1" here  and got the same results below.

su `v' if touse, detail
mat results[`i', 1] = r(N)
mat results[`i', 2] = r(mean)
mat results[`i', 3] = r(sd)
mat results[`i', 4] = r(p50)
mat results[`i', 5] = r(min)
mat results[`i', 6] = r(max)
local names `names' `v'
local ++i
   }
 }

//** This is what I get **//

mat rownames results = `names'
conformability error
r(503);

.
. mat colnames results = n mean SD median min max

.
. mat li results, format("%6.0g")

results[46,6]
          n    mean      SD  median     min     max
 r1       0       .       .       .       .       .
 r2       0       .       .       .       .       .
 r3       0       .       .       .       .       .
 r4       0       .       .       .       .       .     
 r5     204   3.898   1.422    4.04     .19    7.56
 r6     204   3.884   1.414   4.025     .22    7.43
 r7     188    85.4   7.994    85.8    55.4   105.3
 r8     188   86.13   8.073   86.65    55.6   107.1
 r9     188   35.29   9.407      35    18.7    63.1
r10     188   35.45   9.479   35.45    18.8    62.9

What should I do to get the varnames in the table above for results using the first run?

Comment

Nick Cox

Join Date: Mar 2014

Posts: 35697
#10

25 Mar 2019, 06:26

Clearly APNA* and Flag* are useless to you, so ignore those. Otherwise

Code:

local rownames

would better as

Code:

local names

to blank out the local macro.
Comment
Amanda Ode

Join Date: Mar 2019

Posts: 62
#11

25 Mar 2019, 06:42

Thank you Nick. I've got what I want now.
Comment

Announcement

Easier and faster way to get casewise statistics using tabstat?

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment