log transformation or boxcox

kusum shekhawat

Join Date: Jun 2022

Posts: 19
#1


09 Jun 2022, 04:32

Hi,
My problem is more statistical and conceptual than about coding.
I'm dealing with cost data, and my variable total_cost is not normally distributed.
I have a table to fill in titled "Determinants of total cost".
I thought of log-transforming total_cost and then using linear regression, but then the B coefficient is interpreted as a unit change in log(total_cost), or approximately a percent change in total_cost (if I'm correct).
Is there any way to obtain a B coefficient that is interpreted as a unit change in total_cost?

independent variables are: Age, gender, wealth_tertile, sample_type

Someone also suggested I should use "Box-Cox regression" for this, as it is an economic model.
Which one should I use, and how do I interpret boxcox output in Stata?
Nick Cox

Join Date: Mar 2014

Posts: 35810
#2

09 Jun 2022, 05:32

It's not a requirement of any regression model that any variable -- response or outcome on one hand, predictors on the other -- has a normal distribution. That's at most an ideal condition for errors.

Total cost I'd expect to be positive and so a functional form y = exp(Xb) might be a better choice than y = Xb. That is loosely equivalent to, but not identical to, log transforming the outcome.

Box-Cox despite its splendid name is partly a period piece. It's relevant if you consider that some power transformation -- e.g. a root or reciprocal -- is a serious candidate for transformation, but that seems unlikely for your case. Choosing Box-Cox has precisely zero to do with whether you are working in economics.

I'd start with https://blog.stata.com/2011/08/22/us...tell-a-friend/ as pointing in a good direction.

You can't interpret coefficients of y= exp(Xb) or ln y = Xb as if they were coefficients of y = Xb. If working on logarithmic scale is a better idea, you lose nothing by not being able to use an interpretation that doesn't match the patterns in the data.
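
A minimal sketch of the distinction between ln y = Xb and y = exp(Xb), assuming Stata's bundled auto data with price standing in for a cost variable. The dataset and the predictors mpg and weight are illustrative choices, not the poster's data.

```stata
* Illustration only: price plays the role of a cost variable
sysuse auto, clear

* (1) OLS on the log scale models E[ln(price) | x]
generate double ln_price = ln(price)
regress ln_price mpg weight

* (2) GLM with a log link models ln(E[price | x]), i.e. E[price | x] = exp(xb)
glm price mpg weight, family(gamma) link(log)

* (3) Poisson regression with robust standard errors uses the same mean function
poisson price mpg weight, vce(robust)
```

The slopes from the three fits are often similar but need not be identical, because (1) targets the mean of ln(price) while (2) and (3) target the log of the mean of price.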

EDIT: can in the last para should have been can't. Now fixed.

Last edited by Nick Cox; 09 Jun 2022, 05:59.

Carlo Lazzaro

Join Date: Apr 2014
Posts: 17749
#3

09 Jun 2022, 10:49

Kusum:
welcome to this forum.
It is well known that total cost tends to follow a -gamma- distribution (positively skewed, long right tail).
Therefore, why not consider -glm- instead of -regress-?

Code:

use "C:\Program Files\Stata17\ado\base\a\auto.dta"
. glm price mpg i.rep78 trunk, link(log) family(gamma)

Iteration 0:   log likelihood = -669.06467 
Iteration 1:   log likelihood = -668.99899 
Iteration 2:   log likelihood = -668.99894 

Generalized linear models                         Number of obs   =         69
Optimization     : ML                             Residual df     =         62
                                                  Scale parameter =   .1624101
Deviance         =  7.760201964                   (1/df) Deviance =   .1251645
Pearson          =   10.0694236                   (1/df) Pearson  =   .1624101

Variance function: V(u) = u^2                     [Gamma]
Link function    : g(u) = ln(u)                   [Log]

                                                  AIC             =   19.59417
Log likelihood   = -668.9989441                   BIC             =  -254.7544

------------------------------------------------------------------------------
             |                 OIM
       price | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
         mpg |  -.0393265   .0107076    -3.67   0.000     -.060313   -.0183399
             |
       rep78 |
          2  |   .0934405   .3287072     0.28   0.776    -.5508138    .7376949
          3  |   .1786876   .3078631     0.58   0.562    -.4247131    .7820883
          4  |    .272562   .3093257     0.88   0.378    -.3337053    .8788293
          5  |   .4441951   .3279094     1.35   0.176    -.1984955    1.086886
             |
       trunk |   .0098709   .0143812     0.69   0.492    -.0183157    .0380574
       _cons |   9.164973   .4187511    21.89   0.000     8.344236     9.98571
------------------------------------------------------------------------------

.

There's a wonderful chapter on GLM and cost distribution in

https://www.stata.com/bookstore/health-econometrics-using-stata.
Admittedly, I judged the (text)book by its cover, but its contents were enlightening.

Kind regards,
Carlo
(Stata 19.0)


kusum shekhawat

Join Date: Jun 2022

Posts: 19
#4

09 Jun 2022, 15:20

I'm not sure how to show you my output here.
Total cost is the cost incurred by ARI (pneumonia) patients while getting treatment (medicines + consultations + lab tests, etc.).
But 15% of the total_cost values are 0. Can we fix this by adding 0.1 to total_cost and then using glm, since the gamma distribution is widely used for modelling continuous, non-negative, positively skewed data?
I'm asking about the zeros because when I used boxcox without adding 0.1, it showed the error "total_cost contains observations that are not strictly positive".

And how do we interpret the B coefficients in the glm model: as a percent change or as a unit change?

Last edited by kusum shekhawat; 09 Jun 2022, 15:32.
Nick Cox

Join Date: Mar 2014

Posts: 35810
#5

09 Jun 2022, 15:38

Generalized linear models with logarithmic link take the mean function to be positive which is not undermined if some values are zero. Hence there is no need to change the zeros. This is on a par with Poisson regression which isn’t invalidated by some zero counts.
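
A small simulated sketch of this point. Everything in it is hypothetical (variable names, sample size, seed, coefficients, and the 15% share of zeros, which only mimics the figure mentioned in #4). It assumes that -glm- accepts zero outcomes with a log link, as stated in this post, and fits both a gamma and a Poisson family without altering the zeros.

```stata
* Hypothetical data: positive costs with conditional mean exp(1 + 0.5*x)
clear
set seed 12345
set obs 1000
generate double x = rnormal()
generate double cost = rgamma(2, exp(1 + 0.5*x)/2)

* Turn roughly 15% of the observations into exact zeros
replace cost = 0 if runiform() < 0.15
count if cost == 0

* The zeros are left as they are; no constant is added
glm cost x, family(gamma) link(log)
glm cost x, family(poisson) link(log) vce(robust)
```

Both fits should keep all 1,000 observations, zeros included, which is the contrast with ln(cost) or -boxcox-.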
kusum shekhawat

Join Date: Jun 2022

Posts: 19
#6

09 Jun 2022, 15:51

Thanks for the quick reply, Nick Cox.
I have attached my code and output as a log file (as I don't know how to attach it any other way).
How do I interpret the coefficients here?
Attached Files

Untitled.smcl (6.1 KB, 1 view)
kusum shekhawat

Join Date: Jun 2022

Posts: 19
#7

09 Jun 2022, 17:28

Thank you for your suggestion, Carlo Lazzaro.
I have gone through a few articles and research papers to get better clarity on this model.
It seems to be the best-fitting model for my case.
I'm just confused about the coefficients: how do we interpret them?

Nick Cox

Join Date: Mar 2014
Posts: 35810
#8

09 Jun 2022, 17:52

As said, the general recipe is y = exp(Xb)

Code:

. glm total_cost i.age_grp i.sex i.comorb_cat ib2.health_insurance i.wealth_tertile i.facility1 i.level1 i.treatme
> nt1 i.flu1 ib2.sample_type_final ib2.Site, link(log) family(gamma)

Iteration 0:   log likelihood = -28890.329  
Iteration 1:   log likelihood = -28779.876  
Iteration 2:   log likelihood = -28779.611  
Iteration 3:   log likelihood = -28779.611  

Generalized linear models                         No. of obs      =      3,729
Optimization     : ML                             Residual df     =      3,709
                                                  Scale parameter =   1.401875
Deviance         =  2556.298419                   (1/df) Deviance =    .689215
Pearson          =  5199.554156                   (1/df) Pearson  =   1.401875

Variance function: V(u) = u^2                     [Gamma]
Link function    : g(u) = ln(u)                   [Log]

                                                  AIC             =   15.44629
Log likelihood   =  -28779.6114                   BIC             =  -27946.13

-----------------------------------------------------------------------------------
                  |                 OIM
       total_cost |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
------------------+----------------------------------------------------------------
          age_grp |
           65-69  |  -.0051037   .0478344    -0.11   0.915    -.0988575      .08865
    70 and above  |   .1574872   .0492621     3.20   0.001     .0609353    .2540391
                  |
              sex |
               M  |   .0019449   .0406398     0.05   0.962    -.0777077    .0815975
                  |
       comorb_cat |
             One  |   .1308091   .0517629     2.53   0.012     .0293558    .2322625
   More than one  |   .2958274   .0519545     5.69   0.000     .1939984    .3976564
                  |
 health_insurance |
             Yes  |  -.0706271   .0669648    -1.05   0.292    -.2018757    .0606215
                  |
   wealth_tertile |
               2  |  -.0311285   .0533209    -0.58   0.559    -.1356356    .0733787
               3  |  -.0203692   .0606956    -0.34   0.737    -.1393304    .0985919
                  |
        facility1 |
         Private  |   .3497437   .0684377     5.11   0.000     .2156084    .4838791
                  |
           level1 |
         Primary  |  -.1035889   .0882996    -1.17   0.241    -.2766529    .0694751
       Secondary  |   .1037402   .1487365     0.70   0.486    -.1877781    .3952584
        Tertiary  |   .3719397   .1881134     1.98   0.048     .0032441    .7406352
                  |
       treatment1 |
      Ambulatory  |   .6336718   .0849481     7.46   0.000     .4671766    .8001669
   Emergency/IPD  |     1.1153   .2170644     5.14   0.000     .6898612    1.540738
                  |
             flu1 |
         flu/RSV  |   .2168698   .0811741     2.67   0.008     .0577715    .3759681
                  |
sample_type_final |
            ALRI  |   .5802376   .0451783    12.84   0.000     .4916898    .6687855
                  |
             Site |
         Chennai  |   .4556864   .0962094     4.74   0.000     .2671194    .6442534
         Kolkata  |   .0481658   .1010846     0.48   0.634    -.1499563     .246288
            Pune  |   .2563101   .0979448     2.62   0.009     .0643419    .4482784
                  |
            _cons |   5.892474   .0721475    81.67   0.000     5.751068    6.033881
-----------------------------------------------------------------------------------
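
One hedged way to read coefficients like those above: under y = exp(Xb), exp(b) is a ratio of mean costs, and average marginal effects from -margins- give changes in the units of the outcome. The sketch below first does the arithmetic on two coefficients copied from the table, then shows the -eform- option and -margins- on Stata's bundled auto data, which is an illustrative stand-in for the cost data.

```stata
* Coefficient for Private facility: ratio of mean costs relative to the
* base facility category, roughly 1.42 (about 42% higher mean cost)
display exp(.3497437)

* exp(_cons): fitted mean cost with every factor at its base level, roughly 362
display exp(5.892474)

* Same ideas on the auto data (illustration only)
sysuse auto, clear

* eform reports exp(b) instead of b
glm price mpg i.rep78 trunk, family(gamma) link(log) eform

* Average marginal effects, in units of price rather than ln(price)
margins, dydx(mpg trunk)

* Predicted mean price at each level of rep78
margins rep78
```

For the cost model itself, the same -margins- calls after the -glm- command would report effects in units of total_cost, which is the interpretation asked for in #1.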


Carlo Lazzaro

Join Date: Apr 2014

Posts: 17749
#9

10 Jun 2022, 00:19

Kusum:

But 15% of total cost includes "0" values as well, can we fix it by adding 0.1 to the total cost and then use glm?

It would be interesting to delve into the zero issue, as this value can have different explanations:
1) your sample includes patients who are not allowed to receive health care services as they are not covered by health insurance;
2) your sample includes patients living in such deprived jurisdictions that no health care service is available there;
3) your sample includes healthy individuals who did not receive any health care service because they did not need it;
4) your sample includes pneumonia patients so severely ill that they passed away due to acute respiratory distress syndrome before receiving any health care service upon hospital admission.

As Nick previously commented, the zero issue has no bearing on the gamma distribution (whereas it would affect the -ln- transformation).

Kind regards,
Carlo
(Stata 19.0)
Nick Cox

Join Date: Mar 2014

Posts: 35810
#10

10 Jun 2022, 01:07

To extend @Carlo Lazzaro's point, there is a substantive issue of whether the zeros belong in the analysis, which is a matter of research goals. The issue is pervasive and sometimes simple: if you want to look at controls of smoking or coffee drinking, do non-smokers or people who never drink coffee belong in your target population?
kusum shekhawat

Join Date: Jun 2022

Posts: 19
#11

10 Jun 2022, 02:56

Thanks for the response, Carlo Lazzaro and Nick Cox.
Your solutions helped me a lot in understanding the model better.
As for the zeros, they are people who did not seek any medical care; basically, they treated themselves.
Yes, these "self-medicated" people also belong to the target population.

Last edited by kusum shekhawat; 10 Jun 2022, 03:11.
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17749
#12

10 Jun 2022, 03:21

Kusum:
therefore there's a strong case for clarifying the perspective adopted in your health economic (HE) evaluation (see Drummond MF, Sculpher MJ, Claxton K, Stoddart GL, Torrance GW. Methods for the economic evaluation of health care programmes. 4th ed. Oxford (UK): Oxford University Press; 2015: 24-25), especially if some health care resources (self-medication) are funded totally out-of-pocket by patients and/or their families.

Last edited by Carlo Lazzaro; 10 Jun 2022, 03:23.

Kind regards,
Carlo
(Stata 19.0)
kusum shekhawat

Join Date: Jun 2022

Posts: 19
#13

11 Jun 2022, 01:52

Hi,
Is there any test or stata command to confirm that total_cost follows the gamma distribution?
Nick Cox

Join Date: Mar 2014

Posts: 35810
#14

11 Jun 2022, 02:07

gammafit from SSC is a possibility.

But it is not strongly relevant for your project.

If you know you have a spike of zeros, the fit should not be great and you know that in advance.

Also, and more crucially -- just as in #2 -- a gamma family being a good idea for a generalized linear model does not depend on the response or outcome having a marginal gamma distribution. That is not an assumption (read: ideal condition). The issue is about conditional distributions, and (in my experience) it is the link function that does most of the good work if a generalized linear model is a good idea at all.

Once a link has been decided, I usually try one or two different families to see how much difference it makes. Not much, usually. When it does make a difference, a closer look at the data suggests that I have a different problem and need a different or at least more complicated model anyway.
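
A rough sketch of that kind of check, assuming Stata's bundled auto data as a stand-in and three families chosen only for illustration. The stored-estimate names are arbitrary.

```stata
* Same log link, different families (illustration on the auto data)
sysuse auto, clear

glm price mpg trunk, link(log) family(gamma)
estimates store m_gamma

glm price mpg trunk, link(log) family(poisson) vce(robust)
estimates store m_poisson

glm price mpg trunk, link(log) family(gaussian)
estimates store m_gauss

* Coefficients and standard errors side by side
estimates table m_gamma m_poisson m_gauss, b(%9.4f) se(%9.4f)
```

If the coefficient columns tell much the same story, the choice of family is doing little work; large discrepancies would be a prompt to look at the data again.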
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17749
#15

11 Jun 2022, 02:30

Kusum:
the comparison of different families for cost analysis is proposed in https://www.stata.com/bookstore/health-econometrics-using-stata , Chapter 5.

Kind regards,
Carlo
(Stata 19.0)
