  • log transformation or boxcox

    Hi,
    My problem is more a statistical, conceptual one than a coding one.
    I'm dealing with cost data, and my variable total_cost is not normally distributed.
    I have a table to fill in, titled "Determinants of total cost".
    I thought of log-transforming total_cost and then using linear regression, but then the B coefficient is interpreted as a percent change in total_cost, or as a unit change in log(total_cost) (if I'm correct).
    Is there any way to get a B coefficient that can be interpreted as the unit change in total_cost?

    The independent variables are: age, gender, wealth_tertile, sample_type.

    Someone also suggested that I should use "Box-Cox regression" for this, as it is an economic model.
    Which one should I use, and how do I interpret boxcox output in Stata?

  • #2
    It's not a requirement of any regression model that any variable -- response or outcome on one hand, predictors on the other -- has a normal distribution. That's at most an ideal condition for errors.

    Total cost I'd expect to be positive and so a functional form y = exp(Xb) might be a better choice than y = Xb. That is loosely equivalent to, but not identical to, log transforming the outcome.

    Box-Cox despite its splendid name is partly a period piece. It's relevant if you consider that some power transformation -- e.g. a root or reciprocal -- is a serious candidate for transformation, but that seems unlikely for your case. Choosing Box-Cox has precisely zero to do with whether you are working in economics.

    I'd start with https://blog.stata.com/2011/08/22/us...tell-a-friend/ as pointing in a good direction.

    You can't interpret coefficients of y = exp(Xb) or ln y = Xb as if they were coefficients of y = Xb. If working on logarithmic scale is a better idea, you lose nothing by not being able to use an interpretation that doesn't match the patterns in the data.
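
    To make the two functional forms concrete, here is a minimal sketch using the auto data shipped with Stata as a stand-in for the cost data (price and mpg are illustrative choices, not the variables in this thread). Fitting y = exp(Xb) through poisson with robust standard errors is an assumption about one reasonable way to do it, not the only one.
    Code:
    sysuse auto, clear
    * y = exp(Xb) fitted directly on the original scale of the outcome
    poisson price mpg, vce(robust)
    * ln y = Xb fitted by least squares on the log scale
    generate ln_price = ln(price)
    regress ln_price mpg
    Both sets of coefficients are on the log scale and are often similar, but the first models ln E(y|X) and the second models E(ln y|X), which is why the two approaches are not identical.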

    EDIT: can in the last para should have been can't. Now fixed.
    Last edited by Nick Cox; 09 Jun 2022, 05:59.

    • #3
      Kusum:
      welcome to this forum.
      It is well known that total cost follows a -gamma- distribution (positively skewed, long right tail).
      Therefore, why not consider -glm- instead of -regress-?
      Code:
      use "C:\Program Files\Stata17\ado\base\a\auto.dta"
      . glm price mpg i.rep78 trunk, link(log) family(gamma)
      
      Iteration 0:   log likelihood = -669.06467 
      Iteration 1:   log likelihood = -668.99899 
      Iteration 2:   log likelihood = -668.99894 
      
      Generalized linear models                         Number of obs   =         69
      Optimization     : ML                             Residual df     =         62
                                                        Scale parameter =   .1624101
      Deviance         =  7.760201964                   (1/df) Deviance =   .1251645
      Pearson          =   10.0694236                   (1/df) Pearson  =   .1624101
      
      Variance function: V(u) = u^2                     [Gamma]
      Link function    : g(u) = ln(u)                   [Log]
      
                                                        AIC             =   19.59417
      Log likelihood   = -668.9989441                   BIC             =  -254.7544
      
      ------------------------------------------------------------------------------
                   |                 OIM
             price | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
      -------------+----------------------------------------------------------------
               mpg |  -.0393265   .0107076    -3.67   0.000     -.060313   -.0183399
                   |
             rep78 |
                2  |   .0934405   .3287072     0.28   0.776    -.5508138    .7376949
                3  |   .1786876   .3078631     0.58   0.562    -.4247131    .7820883
                4  |    .272562   .3093257     0.88   0.378    -.3337053    .8788293
                5  |   .4441951   .3279094     1.35   0.176    -.1984955    1.086886
                   |
             trunk |   .0098709   .0143812     0.69   0.492    -.0183157    .0380574
             _cons |   9.164973   .4187511    21.89   0.000     8.344236     9.98571
      ------------------------------------------------------------------------------
      
      .
      There's a wonderful chapter on GLM and cost distribution in
      https://www.stata.com/bookstore/health-econometrics-using-stata.
      Admittedly, I judged the (text)book by its cover, but its contents were enlightening.
      Kind regards,
      Carlo
      (Stata 19.0)

      • #4
        I'm not sure how to show you my output here.
        Total cost is the cost incurred by an ARI (pneumonia) patient while getting treatment (medicines + consultations + lab tests, etc.).
        But 15% of the total cost values are "0" as well; can we fix it by adding 0.1 to the total cost and then using glm?
        I ask because the gamma distribution is widely used in modeling continuous, non-negative and positively skewed data.
        I'm asking about "0" because when I used boxcox without adding 0.1, it showed the error "total_cost contains observations that are not strictly positive".

        And how do we interpret these B coefficients in the glm model: as percent change or unit change?
        Last edited by kusum shekhawat; 09 Jun 2022, 15:32.

        • #5
          Generalized linear models with logarithmic link take the mean function to be positive which is not undermined if some values are zero. Hence there is no need to change the zeros. This is on a par with Poisson regression which isn’t invalidated by some zero counts.
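
          A minimal sketch of that advice, assuming the cost data are already in memory and using variable names that appear elsewhere in this thread (total_cost, age_grp, sex, wealth_tertile); the shortened predictor list is only for illustration.
          Code:
          * how many observations have zero cost?
          count if total_cost == 0
          * fit on the untransformed outcome; no constant is added to the zeros
          glm total_cost i.age_grp i.sex i.wealth_tertile, family(gamma) link(log)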

          • #6
            Thanks for the quick reply, Nick Cox.
            I have attached my code and output as a log file (as I don't know how to attach it any other way).
            How do I interpret the coefficients here?

            • #7
              Thank you for your suggestion, Carlo Lazzaro.
              I have gone through a few articles and research papers to get better clarity on this model.
              It seems like this is the best-fitting model for my case.
              I'm just confused about the coefficients: how do we interpret them?

              • #8
                As said, the general recipe is y = exp(Xb)


                Code:
                . glm total_cost i.age_grp i.sex i.comorb_cat ib2.health_insurance i.wealth_tertile i.facility1 i.level1 i.treatme
                > nt1 i.flu1 ib2.sample_type_final ib2.Site, link(log) family(gamma)
                
                Iteration 0:   log likelihood = -28890.329  
                Iteration 1:   log likelihood = -28779.876  
                Iteration 2:   log likelihood = -28779.611  
                Iteration 3:   log likelihood = -28779.611  
                
                Generalized linear models                         No. of obs      =      3,729
                Optimization     : ML                             Residual df     =      3,709
                                                                  Scale parameter =   1.401875
                Deviance         =  2556.298419                   (1/df) Deviance =    .689215
                Pearson          =  5199.554156                   (1/df) Pearson  =   1.401875
                
                Variance function: V(u) = u^2                     [Gamma]
                Link function    : g(u) = ln(u)                   [Log]
                
                                                                  AIC             =   15.44629
                Log likelihood   =  -28779.6114                   BIC             =  -27946.13
                
                -----------------------------------------------------------------------------------
                                  |                 OIM
                       total_cost |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
                ------------------+----------------------------------------------------------------
                          age_grp |
                           65-69  |  -.0051037   .0478344    -0.11   0.915    -.0988575      .08865
                    70 and above  |   .1574872   .0492621     3.20   0.001     .0609353    .2540391
                                  |
                              sex |
                               M  |   .0019449   .0406398     0.05   0.962    -.0777077    .0815975
                                  |
                       comorb_cat |
                             One  |   .1308091   .0517629     2.53   0.012     .0293558    .2322625
                   More than one  |   .2958274   .0519545     5.69   0.000     .1939984    .3976564
                                  |
                 health_insurance |
                             Yes  |  -.0706271   .0669648    -1.05   0.292    -.2018757    .0606215
                                  |
                   wealth_tertile |
                               2  |  -.0311285   .0533209    -0.58   0.559    -.1356356    .0733787
                               3  |  -.0203692   .0606956    -0.34   0.737    -.1393304    .0985919
                                  |
                        facility1 |
                         Private  |   .3497437   .0684377     5.11   0.000     .2156084    .4838791
                                  |
                           level1 |
                         Primary  |  -.1035889   .0882996    -1.17   0.241    -.2766529    .0694751
                       Secondary  |   .1037402   .1487365     0.70   0.486    -.1877781    .3952584
                        Tertiary  |   .3719397   .1881134     1.98   0.048     .0032441    .7406352
                                  |
                       treatment1 |
                      Ambulatory  |   .6336718   .0849481     7.46   0.000     .4671766    .8001669
                   Emergency/IPD  |     1.1153   .2170644     5.14   0.000     .6898612    1.540738
                                  |
                             flu1 |
                         flu/RSV  |   .2168698   .0811741     2.67   0.008     .0577715    .3759681
                                  |
                sample_type_final |
                            ALRI  |   .5802376   .0451783    12.84   0.000     .4916898    .6687855
                                  |
                             Site |
                         Chennai  |   .4556864   .0962094     4.74   0.000     .2671194    .6442534
                         Kolkata  |   .0481658   .1010846     0.48   0.634    -.1499563     .246288
                            Pune  |   .2563101   .0979448     2.62   0.009     .0643419    .4482784
                                  |
                            _cons |   5.892474   .0721475    81.67   0.000     5.751068    6.033881
                -----------------------------------------------------------------------------------
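
                One way to read such output, sketched under assumptions: exponentiating a coefficient gives a ratio of mean costs, and margins gives changes in the units of the outcome. The auto data stand in for the cost data here, and the choice of mpg for margins is purely illustrative.
                Code:
                * ratio of means implied by the "70 and above" coefficient in the output above
                display exp(.1574872)
                * the same model form on a stand-in dataset, reporting exp(b) instead of b
                sysuse auto, clear
                glm price mpg i.rep78 trunk, family(gamma) link(log) eform
                * average change in predicted mean price, in price units, per unit of mpg
                margins, dydx(mpg)
                On this reading, the coefficient of .1575 for "70 and above" corresponds to a mean total_cost about 17% higher than in the omitted base age group, other predictors held fixed (exp(.1575) is about 1.17). The margins results are the closest analogue to a "unit change in total_cost".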

                • #9
                  Kusum:
                  But 15% of the total cost values are "0" as well; can we fix it by adding 0.1 to the total cost and then using glm?
                  It would be interesting to delve into the zero issue, as this value can have different explanations:
                  1) your sample includes patients who are not allowed to receive health care services as they are not covered by health insurance;
                  2) your sample includes patients living in such deprived jurisdictions that no health care service is available there;
                  3) your sample includes healthy individuals who did not receive any health care service because they did not need it;
                  4) your sample includes pneumonia patients so severely ill that they passed away due to acute respiratory distress syndrome before receiving any health care service upon hospital admission.

                  As Nick previously commented, the zero issue has no bearing on the gamma distribution (whereas it would affect the -ln- transformation).
                  Kind regards,
                  Carlo
                  (Stata 19.0)

                  • #10
                    To extend @Carlo Lazzaro's point, there is a substantive issue of whether the zeros belong in the analysis, which is a matter of research goals. The issue is pervasive and sometimes simple: if you want to look at controls of smoking or coffee drinking, do non-smokers or people who never drink coffee belong in your target population?

                    • #11
                      Thanks for the responses, Carlo Lazzaro and Nick Cox.
                      Your solutions helped me a lot in understanding the model better.
                      As for the "0" values, they are the people who did not seek any medical care; basically, they treated themselves.
                      Yes, these "self-medicated" people also belong to the target population.
                      Last edited by kusum shekhawat; 10 Jun 2022, 03:11.

                      • #12
                        Kusum:
                        therefore there's a strong case for clarifying the perspective adopted in your HE evaluation (see Drummond MF, Sculpher MJ, Claxton K, Stoddart GL, Torrance GW. Methods for the economic evaluation of health care programmes. 4th ed. Oxford (UK): Oxford University Press; 2015: 24-25), especially if some health care resources (self-medication) are funded totally out-of-pocket by patients and/or their families
                        Last edited by Carlo Lazzaro; 10 Jun 2022, 03:23.
                        Kind regards,
                        Carlo
                        (Stata 19.0)

                        • #13
                          Hi,
                          Is there any test or Stata command to confirm that total_cost follows a gamma distribution?

                          • #14
                            gammafit from SSC is a possibility.

                            But it is not strongly relevant for your project.

                            If you know you have a spike of zeros, the fit should not be great and you know that in advance.
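
                            For completeness, a minimal sketch of trying it, assuming the cost data are in memory. Restricting to positive values is an assumption made here because a gamma density is defined for positive values only.
                            Code:
                            * install the community-contributed command from SSC (needed once)
                            ssc install gammafit
                            * fit a two-parameter gamma distribution to the positive costs
                            gammafit total_cost if total_cost > 0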

                            Also, and more crucially -- just as in #2 -- a gamma family being a good idea for a generalized linear model does not depend on the response or outcome having a marginal gamma distribution. That is not an assumption (read: ideal condition). The issue is about conditional distributions, and (in my experience) it is the link function that does most of the good work if a generalized linear model is a good idea at all.

                            Once a link has been decided, I usually try one or two different families to see how much difference it makes. Not much, usually. When it does make a difference, a closer look at the data suggests that I have a different problem and need a different or at least more complicated model anyway.
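
                            A minimal sketch of such a comparison, using the auto data as a stand-in for the cost data. The particular families and the stored names (m_gamma, m_pois, m_gauss) are illustrative choices, not a recommendation.
                            Code:
                            sysuse auto, clear
                            * same log link, different families
                            glm price mpg i.rep78 trunk, family(gamma) link(log)
                            estimates store m_gamma
                            glm price mpg i.rep78 trunk, family(poisson) link(log) vce(robust)
                            estimates store m_pois
                            glm price mpg i.rep78 trunk, family(gaussian) link(log)
                            estimates store m_gauss
                            * coefficients and standard errors side by side
                            estimates table m_gamma m_pois m_gauss, b(%9.4f) se(%9.4f)
                            If the coefficients barely move across the columns, the choice of family matters little for that model.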

                            • #15
                              Kusum:
                              the comparison of different families for cost analysis is proposed in https://www.stata.com/bookstore/health-econometrics-using-stata , Chapter 5.
                              Kind regards,
                              Carlo
                              (Stata 19.0)
