  • Log-linear model for Unbalanced panel data [glm]

    Dear all,
    Hope you are all doing well,

    I am trying to investigate the impact of distance (distance_ij) between origin i and destination j on flow_ij by considering the attributes of origin_i and destination_j.

    I have 4 origins and 13 provinces observed over 13 years.
    However, the panel is unbalanced: for one of the origins, the first 8 years are missing (only the last 5 years are available), so the total number of observations is 572 (3 x 13 x 13 + 1 x 13 x 5 = 572).

    - Dependent variable:
    flow_ij
    - Independent variables (in natural logarithm):
    1. lnorigin_i
    2. lndestination_j
    3. lndistance_ij

    The probability of flow is described by a discrete probability distribution.

    See data sample below (Listed 16 out of 572 observations):
    Code:
    input str3 ORG str11 DEST int year float(flow_ij lnorigin_i lndestination_j lndistance_ij)
    "DAM" "bah" 2006      0 6.848005  2.549445  7.170888
    "DAM" "bah" 2007      0 6.991177  2.484907  7.170888
    "DAM" "bah" 2008      0 7.128496   2.61007  7.170888
    "DAM" "bah" 2009      0 7.112328 2.5700645  7.170888
    "DAM" "bah" 2010      0 7.195187  2.648536  7.170888
    "DAM" "bah" 2011      0 7.307873  2.721295  7.170888
    "DAM" "bah" 2012      0 7.195187 2.8053784  7.170888
    "DAM" "bah" 2013      0 7.130899  2.789118  7.170888
    "DAM" "bah" 2014      0 7.466228  2.439444  7.170888
    "DAM" "bah" 2015      0 7.577634  2.484907  7.170888
    "DAM" "bah" 2016   3800 7.484369 2.5700645  7.170888
    "DAM" "bah" 2017      0 7.366445  2.462514  7.170888
    "DAM" "bah" 2018   1200 7.340187   2.55787  7.170888
    "DAM" "jof" 2006   5400 6.848005   2.61007  7.120444
    "DAM" "jof" 2007   6500 6.991177  2.721295  7.120444
    "DAM" "jof" 2008   5400 7.128496  2.703596  7.120444
    end
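
    As a hedged sketch, the panel structure described above could be declared and inspected as follows. It assumes the sample variables ORG, DEST and year; the names ID and DEST_j are illustrative assumptions (both appear later in the thread, but their construction is not shown).
    Code:
    * Build a numeric identifier for each origin-destination pair (assumed name: ID)
    egen ID = group(ORG DEST)
    * Factor-variable notation (i.) needs a numeric variable, so encode the string DEST
    encode DEST, generate(DEST_j)
    * Declare the panel; xtset accepts unbalanced panels
    xtset ID year
    * Show the participation pattern, which reveals pairs observed for fewer years
    xtdescribe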


    I would like to estimate a log-linear model (Poisson regression) for panel data, using individual fixed effects.

    I applied the following glm command:

    Code:
    glm flow_ij lnorigin_i lndistance_ij i.DEST_j, family(poisson) link(log) irls
    In the command above, the destination fixed effect is included (i.DEST_j).

    The results I got are as follows:
    Code:
    . glm flow_ij lnorigin_i lndistance_ij i.DEST_j, family(poisson) link(log) irls
    
    Iteration 1:   deviance =  2.70e+07
    Iteration 2:   deviance =  1.94e+07
    Iteration 3:   deviance =  1.86e+07
    Iteration 4:   deviance =  1.86e+07
    Iteration 5:   deviance =  1.86e+07
    Iteration 6:   deviance =  1.86e+07
    Iteration 7:   deviance =  1.86e+07
    
    Generalized linear models                         Number of obs   =        572
    Optimization     : MQL Fisher scoring             Residual df     =        557
                       (IRLS EIM)                     Scale parameter =          1
    Deviance         =  18551677.22                   (1/df) Deviance =   33306.42
    Pearson          =  18688077.98                   (1/df) Pearson  =   33551.31
    
    Variance function: V(u) = u                       [Poisson]
    Link function    : g(u) = ln(u)                   [Log]
    
                                                      BIC             =   1.85e+07
    
    -------------------------------------------------------------------------------
                  |                 EIM
          flow_ij |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    --------------+----------------------------------------------------------------
       lnorigin_i |   .7029036   .0002109  3333.29   0.000     .7024903    .7033169
    lndistance_ij |   -.577117   .0001526 -3781.61   0.000    -.5774162   -.5768179
                  |
           DEST_j |
          A       |   .9244987   .0025078   368.65   0.000     .9195836    .9294139
          B       |   2.726537   .0020362  1339.02   0.000     2.722546    2.730528
          C       |   2.711206   .0020558  1318.79   0.000     2.707176    2.715235
          D       |   2.523979    .002069  1219.90   0.000     2.519923    2.528034
          E       |   2.400839   .0020661  1162.00   0.000      2.39679    2.404889
          F       |   1.637443   .0022097   741.04   0.000     1.633112    1.641774
          G       |   1.845645   .0021684   851.16   0.000     1.841395    1.849895
          H       |   2.113996   .0020751  1018.77   0.000     2.109929    2.118063
          I       |   .5495162   .0027211   201.94   0.000     .5441829    .5548495
          J       |   .7086403   .0025717   275.55   0.000     .7035999    .7136807
          K       |   4.139993   .0019932  2077.03   0.000     4.136086      4.1439
          L       |   1.706293   .0022313   764.71   0.000      1.70192    1.710666
                  |
            _cons |   7.160802   .0027506  2603.35   0.000     7.155411    7.166193
    -------------------------------------------------------------------------------
    My questions are:
    1. Is glm suitable for unbalanced panel data?
    2. Is the fixed effect specified correctly?
    3. Do I need to include the time fixed effect (i.year)? Why?
    4. Would clustering on origins benefit the model?

    I am using xtreg in Stata 15.1.

    Thank you for your time and effort, and my apologies for the long post,
    Hussain Sulaimani

  • #2
    *Edit: I used glm in Stata 15.1. (NOT xtreg)



    • #3
      I don't fully understand what you are trying to do but since you clearly have panel data, why not use -meglm- instead of -glm-?



      • #4
        Hi Mr. Goldstein,
        Thank you for replying,
        Let me put it in another way.

        - The dependent variable is cargo flow from 4 origins (seaports) to 13 destinations (regions) for the years 2006-2018.
        - However, the panel is unbalanced, since the data for one of the seaports are available only for the years 2014-2018.
        - The independent variables are the distance (between port i and region j), a seaport attribute (total cargo volume), and a region attribute (economic indicator).
        - Obviously, the distance of each port-region pair is fixed over the full period (time invariant). The attributes of the seaports and regions are time variant.

        I am trying to use log-linear (Poisson) regression to estimate the impact of distance on cargo flow while accounting for the port and region attributes.

        Hussain,



        • #5
          Stata routines for panel data do NOT require balance; you can use -mepoisson- or -meglm-.



          • #6
            Hussain:
            like Rich, I find your emphasis on panel unbalancedness a bit of a stretch, as Stata can run -xt- commands on both balanced and unbalanced panels without any problem.
            That said, your -glm- code actually treats observations as if they were independent: hence, you should -cluster- them at the -panelid- level.
            I would also investigate the role of -i.time- by adding this categorical predictor to the right-hand side of your regression equation.
            Usually Poisson regression with clustered standard errors is robust to overdispersion: others on this forum would probably have a different take, but I would give it a shot with the negative binomial family, too.
            Finally, I am wondering whether some form of -exposure- (say, to post distance) is usually reported for this kind of regression in your research field.
            Kind regards,
            Carlo
            (Stata 19.0)
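
            The suggestions above can be sketched roughly as follows. This is only an illustration under assumptions: ID is taken to be a numeric identifier of the origin-destination pair (the panel ID), year plays the role of -i.time-, and the variable list simply mirrors the -glm- call from the first post.
            Code:
            * Poisson GLM with standard errors clustered on the origin-destination pair
            glm flow_ij lnorigin_i lndistance_ij i.DEST_j, family(poisson) link(log) vce(cluster ID)
            * Same model with year indicators added to the right-hand side
            glm flow_ij lnorigin_i lndistance_ij i.DEST_j i.year, family(poisson) link(log) vce(cluster ID)
            * Negative binomial alternative, for comparison
            nbreg flow_ij lnorigin_i lndistance_ij i.DEST_j i.year, vce(cluster ID)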



            • #7
              Dear Hussain Sulaimani,

              Adding to the useful advice already provided, I suggest you use the user-written command ppmlhdfe, which is designed to work with this kind of problem and is very fast.

              Best wishes,

              Joao
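
              A minimal sketch of how ppmlhdfe might be installed and called for this model. It assumes ID identifies the origin-destination pair and that DEST_j and year are numeric; absorbing the fixed effects through absorb() is one possible specification, not necessarily the one intended above.
              Code:
              * One-time installation from SSC (ppmlhdfe depends on ftools and reghdfe)
              ssc install ftools
              ssc install reghdfe
              ssc install ppmlhdfe
              * Absorb destination and year fixed effects instead of listing them as dummies
              ppmlhdfe flow_ij lnorigin_i lndistance_ij, absorb(DEST_j year) vce(cluster ID)

              With absorb(), the fixed-effect coefficients are not displayed; listing i.DEST_j and i.year among the regressors reports them explicitly.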



              • #8
                Based on the mepoisson command you mentioned and the following syntax from the Stata manual
                Code:
                mepoisson y x || lev2:
                I wrote this code as:
                Code:
                mepoisson flow_ij lndestination_j lndistance_ij || lnorigin_i
                where lnorigin_i is the fixed effect, but Stata replies: only one fixed-effects equation allowed

                What's wrong with it? I think I am specifying the fixed effect in the wrong way, because I inserted just one fixed effect.
                I think the command is for multilevel mixed-effects models. If specified correctly, would it produce the same results as glm?

                Please note that I am not considering the use of random effects.
                The Poisson regression model I am considering is formulated as follows:
                [Image: Screen Shot 2021-04-02 at 10.52.23 AM.png]

                Thank you,
                Hussain
                --------
                Source of code formula: STATA MULTILEVEL MIXED-EFFECTS REFERENCE MANUAL RELEASE 14.



                • #9
                  Update: I added ( : ) after the fixed effect variable and got the following error

                  Code:
                  . mepoisson flow_ij lndestination_j lndistance_ij || lnorigin_i:
                  
                  Fitting fixed-effects model:
                  
                  Iteration 0:   log likelihood =  -39256612  
                  Iteration 1:   log likelihood =  -24095683  
                  Iteration 2:   log likelihood =  -24027594  
                  Iteration 3:   log likelihood =  -24027543  
                  Iteration 4:   log likelihood =  -24027543  
                  
                  Refining starting values:
                  
                  Grid node 0:   log likelihood =          .
                  Grid node 1:   log likelihood =          .
                  Grid node 2:   log likelihood =          .
                  Grid node 3:   log likelihood =          .
                  (note: Grid search failed to find values that will yield a log likelihood value.)
                  
                  Fitting full model:
                  
                  initial values not feasible
                  r(1400);
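
                  For context, in the mepoisson syntax the name between || and the colon is a grouping variable that defines a random intercept, not a fixed effect, so a continuous covariate such as lnorigin_i is not a natural choice there. A hedged sketch of the two alternatives, assuming ID is a numeric identifier of the origin-destination pair (convergence of the mixed model on these data is not guaranteed):
                  Code:
                  * Random intercept for each origin-destination pair (ID is an assumed identifier)
                  mepoisson flow_ij lnorigin_i lndestination_j lndistance_ij || ID:
                  * Fixed effects, by contrast, enter as indicator variables in the main equation
                  poisson flow_ij lnorigin_i lndistance_ij i.DEST_j, vce(cluster ID)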



                  • #10
                    Carlo Lazzaro Joao Santos Silva
                    Thank you for your valuable feedback.
                    Considering your suggestions,

                    Dr. Lazzaro,
                    I clustered at panel ID level (no. of clusters 52) in the two modeling attempts. Also, in the second attempt, I included i.year (time FE).

                    Dr. Santos Silva,
                    Following your advice, I used the command: ppmlhdfe.

                    I ran two modeling attempts, and got the following results:
                    A. by including only the individual FE (DEST_j), I got the following outcomes:
                    Code:
                    . ppmlhdfe flow_ij lnorigin_i lndistance_ij i.DEST_j, vce(cluster ID)
                    Iteration 1:   deviance = 2.6973e+07  eps = .         iters = 1    tol = 1.0e-04  min(eta) =  -4.17  P   
                    Iteration 2:   deviance = 1.9435e+07  eps = 3.88e-01  iters = 1    tol = 1.0e-04  min(eta) =  -6.02      
                    Iteration 3:   deviance = 1.8580e+07  eps = 4.60e-02  iters = 1    tol = 1.0e-04  min(eta) =  -7.12      
                    Iteration 4:   deviance = 1.8552e+07  eps = 1.49e-03  iters = 1    tol = 1.0e-04  min(eta) =  -7.40      
                    Iteration 5:   deviance = 1.8552e+07  eps = 5.81e-06  iters = 1    tol = 1.0e-04  min(eta) =  -7.42      
                    Iteration 6:   deviance = 1.8552e+07  eps = 3.48e-10  iters = 1    tol = 1.0e-05  min(eta) =  -7.42   S O
                    ------------------------------------------------------------------------------------------------------------
                    (legend: p: exact partial-out   s: exact solver   h: step-halving   o: epsilon below tolerance)
                    Converged in 6 iterations and 6 HDFE sub-iterations (tol = 1.0e-08)
                    
                    PPML regression                                   No. of obs      =        572
                                                                      Residual df     =         51
                    Statistics robust to heteroskedasticity           Wald chi2(14)   =    1159.71
                    Deviance             =  18551677.22               Prob > chi2     =     0.0000
                    Log pseudolikelihood = -9278324.868               Pseudo R2       =     0.8640
                    
                    Number of clusters (ID)     =         52
                                                         (Std. Err. adjusted for 52 clusters in ID)
                    -------------------------------------------------------------------------------
                                  |               Robust
                          flow_ij |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
                    --------------+----------------------------------------------------------------
                       lnorigin_i |   .7029036   .1385215     5.07   0.000     .4314064    .9744007
                    lndistance_ij |   -.577117   .0994041    -5.81   0.000    -.7719455   -.3822886
                                  |
                          DEST_jj |
                             bah  |  -2.523979   .5004049    -5.04   0.000    -3.504754   -1.543203
                             epr  |  -.1231392   .5953582    -0.21   0.836     -1.29002    1.043741
                             hai  |   -.886536   .4413532    -2.01   0.045    -1.751572   -.0214996
                             jaz  |  -.6783333   .5557509    -1.22   0.222    -1.767585    .4109184
                             jof  |   -1.59948   .4568848    -3.50   0.000    -2.494958   -.7040021
                             mad  |    .202558   .4809852     0.42   0.674    -.7401556    1.145272
                             mkk  |  -.4099828   .6258462    -0.66   0.512    -1.636619    .8166532
                             naj  |  -1.815338   .6095428    -2.98   0.003     -3.01002   -.6206564
                            nbrd  |  -1.974462   .8221818    -2.40   0.016    -3.585909   -.3630158
                             qas  |    .187227   .5078222     0.37   0.712    -.8080862     1.18254
                             riy  |   1.616014   .4935388     3.27   0.001     .6486961    2.583333
                             tab  |  -.8176859   .5336744    -1.53   0.125    -1.863669    .2282968
                                  |
                            _cons |   9.684781   1.194159     8.11   0.000     7.344271    12.02529
                    -------------------------------------------------------------------------------

                    B. Here is the second outcome, which includes both the individual (DEST_j) and time (year) fixed effects.
                    [ppmlhdfe i.DEST_j i.year, vce(cluster ID)]

                    Code:
                    . ppmlhdfe flow_ij lnorigin_i lndistance_ij i.DEST_j i.year, vce(cluster ID)
                    Iteration 1:   deviance = 2.6264e+07  eps = .         iters = 1    tol = 1.0e-04  min(eta) =  -4.17  P   
                    Iteration 2:   deviance = 1.8693e+07  eps = 4.05e-01  iters = 1    tol = 1.0e-04  min(eta) =  -6.08      
                    Iteration 3:   deviance = 1.7827e+07  eps = 4.86e-02  iters = 1    tol = 1.0e-04  min(eta) =  -7.21      
                    Iteration 4:   deviance = 1.7800e+07  eps = 1.56e-03  iters = 1    tol = 1.0e-04  min(eta) =  -7.49      
                    Iteration 5:   deviance = 1.7800e+07  eps = 6.08e-06  iters = 1    tol = 1.0e-04  min(eta) =  -7.51      
                    Iteration 6:   deviance = 1.7800e+07  eps = 3.63e-10  iters = 1    tol = 1.0e-05  min(eta) =  -7.51   S O
                    ------------------------------------------------------------------------------------------------------------
                    (legend: p: exact partial-out   s: exact solver   h: step-halving   o: epsilon below tolerance)
                    Converged in 6 iterations and 6 HDFE sub-iterations (tol = 1.0e-08)
                    
                    PPML regression                                   No. of obs      =        572
                                                                      Residual df     =         51
                    Statistics robust to heteroskedasticity           Wald chi2(26)   =    7661.03
                    Deviance             =  17799617.32               Prob > chi2     =     0.0000
                    Log pseudolikelihood = -8902294.915               Pseudo R2       =     0.8695
                    
                    Number of clusters (ID)     =         52
                                                         (Std. Err. adjusted for 52 clusters in ID)
                    -------------------------------------------------------------------------------
                                  |               Robust
                          flow_ij |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
                    --------------+----------------------------------------------------------------
                       lnorigin_i |    .687167   .1426196     4.82   0.000     .4076377    .9666963
                    lndistance_ij |   -.579165   .0982796    -5.89   0.000    -.7717896   -.3865405
                                  |
                          DEST_jj |
                             bah  |  -2.524047   .5116029    -4.93   0.000    -3.526771   -1.521324
                             epr  |  -.1382823    .601594    -0.23   0.818    -1.317385     1.04082
                             hai  |  -.8887676   .4464587    -1.99   0.047    -1.763811   -.0137246
                             jaz  |  -.6781181   .5672767    -1.20   0.232     -1.78996    .4337238
                             jof  |  -1.601437   .4619895    -3.47   0.001     -2.50692   -.6959546
                             mad  |   .2010813   .4914005     0.41   0.682    -.7620461    1.164209
                             mkk  |  -.4084137   .6271428    -0.65   0.515    -1.637591    .8207637
                             naj  |  -1.815843   .6212883    -2.92   0.003    -3.033546   -.5981402
                            nbrd  |  -1.977209   .8189277    -2.41   0.016    -3.582278   -.3721405
                             qas  |   .1841842   .5110668     0.36   0.719    -.8174883    1.185857
                             riy  |   1.611245   .4979908     3.24   0.001     .6352012    2.587289
                             tab  |  -.8183101   .5445168    -1.50   0.133    -1.885543    .2489231
                                  |
                             year |
                            2007  |   .0258422   .0238659     1.08   0.279    -.0209342    .0726186
                            2008  |  -.1541256    .062916    -2.45   0.014    -.2774387   -.0308124
                            2009  |   .1625109   .0772471     2.10   0.035     .0111093    .3139124
                            2010  |    .158896   .0930755     1.71   0.088    -.0235286    .3413206
                            2011  |   .2243428   .0811565     2.76   0.006      .065279    .3834067
                            2012  |   .1901082    .095927     1.98   0.048     .0020947    .3781216
                            2013  |   .2516618   .0922519     2.73   0.006     .0708514    .4324722
                            2014  |   .2521025   .0958217     2.63   0.009     .0642954    .4399096
                            2015  |   .3549189    .116716     3.04   0.002     .1261597     .583678
                            2016  |   .2325958    .111932     2.08   0.038     .0132131    .4519785
                            2017  |   .1700456   .1256714     1.35   0.176    -.0762658    .4163569
                            2018  |   .0516265   .1312694     0.39   0.694    -.2056567    .3089098
                                  |
                            _cons |   9.651497   1.201433     8.03   0.000     7.296731    12.00626
                    -------------------------------------------------------------------------------
                    My questions:
                    1. The distance variable is time invariant (it does not change over time), and it is my main explanatory variable. Is it statistically sound to measure the impact of a time-invariant variable on the response variable in panel data?
                    2. Does clustering by ID make the standard errors robust to overdispersion?
                    3. After including the time FE in attempt 2, the z-value of lnorigin_i (IV) decreased while the z-value of lndistance_ij (IV) increased in absolute value. Does this mean that the latter IV is not affected by time-varying factors?
                    4. If I include fixed effects, can I run the model with no intercept (constant)?
                    5. Could you suggest a book or article that helps in interpreting the results and in measuring model fit, please?


                    Thank you very much and sorry for the lengthy reply, as I am trying to make the conducted steps clearer.
                    Huss



                    • #11
                      Dear Hussain Sulaimani,

                      1 - That is not a problem
                      2 - Clustering by ID makes the standard errors robust to overdispersion (actually, overdispersion is only defined for count data, so that is not a problem here)
                      3 - I do not know what these variables are, so I will not comment
                      4 - It should not make a difference
                      5 - Try this https://papers.ssrn.com/sol3/papers....act_id=3421148

                      Best wishes,

                      Joao



                      • #12
                        Thank you for replying Prof. Joao Santos Silva

                        The working paper you sent me is really helpful.

                        Since my problem concerns freight flows, and you are an expert in this type of flow problem, I would like your recommendation on an issue I am facing.

                        The total number of observations is 572; 162 of them have zero flow.

                        I am including a dummy variable RAIL_ij that indicates the availability of rail services between port_i and province_j. Only one port-province pair has rail services, and this pair is repeated over 13 years. When I add this dummy variable, it gets a very high coefficient and drastically reduces the coefficient of another independent variable. I am worried that this might be caused by the large number of zero-flow observations. In this case, could including it cause problems for the other port-province pairs that have nonzero flow?

                        Thank you,
                        Hussain
                        Last edited by Hussain Sulaimani; 08 Apr 2021, 06:07.



                        • #13
                          Sorry, I would have to know much more about the problem to be able to comment; please discuss this with your team.

                          Best wishes,

                          Joao
