Log-linear model for Unbalanced panel data [glm]

Hussain Sulaimani

Join Date: Apr 2021
Posts: 14

01 Apr 2021, 22:23

Dear all,
Hope you are all doing well,

I am trying to investigate the impact of distance (distance_ij) between origin i and destination j on flow_ij by considering the attributes of origin_i and destination_j.

I have 4 origins and 13 destination provinces over 13 years. A balanced panel would have 4 x 13 x 13 = 676 observations.
However, the data are unbalanced: one of the origins has no observations for the first 8 years (only the last 5 years are available), so 13 x 8 = 104 observations are missing and the total number of observations is 572.

- Dependent variable:
flow_ij
- Independent variables (in natural logarithm):
1. lnorigin_i
2. lndestination_j
3. lndistance_ij

The probability of flow is described by a discrete probability distribution.

See data sample below (Listed 16 out of 572 observations):

Code:

input str3 ORG str11 DEST int year float(flow_ij lnorigin_i lndestination_j lndistance_ij)
"DAM" "bah" 2006      0 6.848005  2.549445  7.170888
"DAM" "bah" 2007      0 6.991177  2.484907  7.170888
"DAM" "bah" 2008      0 7.128496   2.61007  7.170888
"DAM" "bah" 2009      0 7.112328 2.5700645  7.170888
"DAM" "bah" 2010      0 7.195187  2.648536  7.170888
"DAM" "bah" 2011      0 7.307873  2.721295  7.170888
"DAM" "bah" 2012      0 7.195187 2.8053784  7.170888
"DAM" "bah" 2013      0 7.130899  2.789118  7.170888
"DAM" "bah" 2014      0 7.466228  2.439444  7.170888
"DAM" "bah" 2015      0 7.577634  2.484907  7.170888
"DAM" "bah" 2016   3800 7.484369 2.5700645  7.170888
"DAM" "bah" 2017      0 7.366445  2.462514  7.170888
"DAM" "bah" 2018   1200 7.340187   2.55787  7.170888
"DAM" "jof" 2006   5400 6.848005   2.61007  7.120444
"DAM" "jof" 2007   6500 6.991177  2.721295  7.120444
"DAM" "jof" 2008   5400 7.128496  2.703596  7.120444

I would like to estimate a log-linear model (Poisson regression) for panel data, using individual fixed effects.

I applied the following glm command:

Code:

glm flow_ij lnorigin_i lndistance_ij i.DEST_j, family(poisson) link(log) irls

In the code above, the destination fixed effect is included (i.DEST_j).

The results I got are as follows:

Code:

. glm flow_ij lnorigin_i lndistance_ij i.DEST_j, family(poisson) link(log) irls

Iteration 1:   deviance =  2.70e+07
Iteration 2:   deviance =  1.94e+07
Iteration 3:   deviance =  1.86e+07
Iteration 4:   deviance =  1.86e+07
Iteration 5:   deviance =  1.86e+07
Iteration 6:   deviance =  1.86e+07
Iteration 7:   deviance =  1.86e+07

Generalized linear models                         Number of obs   =        572
Optimization     : MQL Fisher scoring             Residual df     =        557
                   (IRLS EIM)                     Scale parameter =          1
Deviance         =  18551677.22                   (1/df) Deviance =   33306.42
Pearson          =  18688077.98                   (1/df) Pearson  =   33551.31

Variance function: V(u) = u                       [Poisson]
Link function    : g(u) = ln(u)                   [Log]

                                                  BIC             =   1.85e+07

-------------------------------------------------------------------------------
              |                 EIM
      flow_ij |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
--------------+----------------------------------------------------------------
   lnorigin_i |   .7029036   .0002109  3333.29   0.000     .7024903    .7033169
lndistance_ij |   -.577117   .0001526 -3781.61   0.000    -.5774162   -.5768179
              |
       DEST_j |
      A       |   .9244987   .0025078   368.65   0.000     .9195836    .9294139
      B       |   2.726537   .0020362  1339.02   0.000     2.722546    2.730528
      C       |   2.711206   .0020558  1318.79   0.000     2.707176    2.715235
      D       |   2.523979    .002069  1219.90   0.000     2.519923    2.528034
      E       |   2.400839   .0020661  1162.00   0.000      2.39679    2.404889
      F       |   1.637443   .0022097   741.04   0.000     1.633112    1.641774
      G       |   1.845645   .0021684   851.16   0.000     1.841395    1.849895
      H       |   2.113996   .0020751  1018.77   0.000     2.109929    2.118063
      I       |   .5495162   .0027211   201.94   0.000     .5441829    .5548495
      J       |   .7086403   .0025717   275.55   0.000     .7035999    .7136807
      K       |   4.139993   .0019932  2077.03   0.000     4.136086      4.1439
      L       |   1.706293   .0022313   764.71   0.000      1.70192    1.710666
              |
        _cons |   7.160802   .0027506  2603.35   0.000     7.155411    7.166193
-------------------------------------------------------------------------------

My questions are:
1. Is glm suitable for unbalanced panel data?
2. Is the fixed effect specified correctly?
3. Do I need to include the fixed effect of time (i.year)? Why?
4. Would clustering on origins benefit the model?

I am using xtreg in Stata 15.1.

Thank you for your time and effort, and my apologies for the long post,
Hussain Sulaimani


Hussain Sulaimani

Join Date: Apr 2021

Posts: 14
#2

02 Apr 2021, 06:53

*Edit: I used glm in Stata 15.1. (NOT xtreg)
Rich Goldstein

Join Date: Mar 2014

Posts: 4439
#3

02 Apr 2021, 07:38

I don't fully understand what you are trying to do but since you clearly have panel data, why not use -meglm- instead of -glm-?
Hussain Sulaimani

Join Date: Apr 2021

Posts: 14
#4

02 Apr 2021, 07:57

Hi Mr. Goldstein,
Thank you for replying,
Let me put it in another way.

- The dependent variable is cargo flow from 4 origins (seaports) to 13 destinations (regions) for the years 2006-2018.
- However, the panel is unbalanced, since data for one of the seaports are available only for the years 2014-2018.
- The independent variables are the distance (between port i and region j), a seaport attribute (total cargo volume), and a region attribute (an economic indicator).
- Obviously, the distance of each port-region pair is fixed over the full period (time-invariant). The attributes of the seaports and regions are time-variant.

I am trying to use log-linear (Poisson) regression to estimate the impact of distance on cargo flow while accounting for the port and region attributes.

Hussain,
Rich Goldstein

Join Date: Mar 2014

Posts: 4439
#5

02 Apr 2021, 08:11

Stata routines for panel data do NOT require balance; you can use -mepoisson- or -meglm-
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17676
#6

02 Apr 2021, 08:49

Hussain:
Like Rich, I find your emphasis on panel unbalancedness a bit of a stretch, as Stata can run -xt- commands on both balanced and unbalanced panels without any problem.
That said, your -glm- code actually treats observations as if they were independent: hence, you should -cluster- them at -panelid- level.
I would also investigate the role of -i.time- by adding this categorical predictor to the right-hand side of your regression equation.
Usually Poisson regression with clustered standard errors is robust to overdispersion: others on this forum would probably have a different take, but I would give it a shot with the negative binomial family, too.
Eventually, I'm wondering whether some form of -exposure- (say, to post distance) is usually reported for this kind of regression in your research field.

Kind regards,
Carlo
(Stata 19.0)
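
A minimal sketch of the suggestions above, not run on the poster's data. It assumes the variable names used earlier in the thread (flow_ij, lnorigin_i, lndistance_ij, DEST_j, year) and a numeric port-province identifier called ID, which is an assumed name at this point in the thread:

```stata
* Poisson with destination and year dummies;
* standard errors clustered on the port-province pair (ID is an assumed name)
glm flow_ij lnorigin_i lndistance_ij i.DEST_j i.year, ///
    family(poisson) link(log) vce(cluster ID)

* Negative binomial counterpart, for comparison
nbreg flow_ij lnorigin_i lndistance_ij i.DEST_j i.year, vce(cluster ID)
```

Clustering changes only the standard errors; the Poisson point estimates stay the same as without clustering.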
Joao Santos Silva

Join Date: Apr 2014

Posts: 3000
#7

02 Apr 2021, 09:04

Dear Hussain Sulaimani,

Adding to the useful advice already provided, I suggest you use the user-written command ppmlhdfe, which is designed to work with this kind of problem and is very fast.

Best wishes,

Joao
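
One possible way to call ppmlhdfe here, as a sketch and not a tested specification: the destination and year effects are passed through absorb() instead of being listed as i. dummies. Variable names follow the commands earlier in the thread; ID is assumed to be a numeric identifier of the port-province pair.

```stata
* One-time installation from SSC (ppmlhdfe depends on ftools and reghdfe)
ssc install ftools
ssc install reghdfe
ssc install ppmlhdfe

* Poisson pseudo-maximum likelihood with absorbed destination and year effects;
* ID is an assumed port-province identifier used for clustering
ppmlhdfe flow_ij lnorigin_i lndistance_ij, absorb(DEST_j year) vce(cluster ID)
```

The absorbed effects are not shown in the coefficient table; the slope estimates should match those obtained by entering i.DEST_j and i.year explicitly.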
Hussain Sulaimani

Join Date: Apr 2021

Posts: 14
#8

02 Apr 2021, 09:07

Based on the mepoisson command you mentioned and the following syntax from the Stata manual

Code:

mepoisson y x || lev2:

I wrote this code as:

Code:

mepoisson flow_ij lndestination_j lndistance_ij || lnorigin_i

Here lnorigin_i is meant to be the fixed effect, but Stata replies: only one fixed-effects equation allowed

What's wrong with it? I think I am specifying the fixed effect in the wrong way, because I inserted just one fixed effect.
I think the command is for multilevel mixed-effects models. If specified correctly, would it produce the same results as glm?

Please note that I am not considering the use of random effects.
The Poisson regression model I am considering is formulated as follows:

[equation not preserved]

Thank you,
Hussain
--------
Source of code formula: STATA MULTILEVEL MIXED-EFFECTS REFERENCE MANUAL RELEASE 14.
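
For reference, mepoisson expects a grouping variable followed by a colon after ||, and what it adds at that level is a random intercept, not a fixed effect. A minimal sketch, assuming ID is a numeric identifier of the port-province pair (a name not defined in this post); convergence on these data is not guaranteed:

```stata
* Random-intercept Poisson: covariates before ||, grouping variable plus colon after it
* (ID is an assumed port-province identifier)
mepoisson flow_ij lnorigin_i lndestination_j lndistance_ij || ID:

* Fixed-effects alternative: enter the group dummies as ordinary covariates
poisson flow_ij lnorigin_i lndistance_ij i.DEST_j, vce(cluster ID)
```

The second command fits the same mean specification as glm with family(poisson) and link(log), so its coefficients should agree with the glm results.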

Hussain Sulaimani

Join Date: Apr 2021
Posts: 14
#9

02 Apr 2021, 09:12

Update: I added ( : ) after the fixed effect variable and got the following error

Code:

. mepoisson flow_ij lndestination_j lndistance_ij || lnorigin_i:

Fitting fixed-effects model:

Iteration 0:   log likelihood =  -39256612  
Iteration 1:   log likelihood =  -24095683  
Iteration 2:   log likelihood =  -24027594  
Iteration 3:   log likelihood =  -24027543  
Iteration 4:   log likelihood =  -24027543  

Refining starting values:

Grid node 0:   log likelihood =          .
Grid node 1:   log likelihood =          .
Grid node 2:   log likelihood =          .
Grid node 3:   log likelihood =          .
(note: Grid search failed to find values that will yield a log likelihood value.)

Fitting full model:

initial values not feasible
r(1400);


Hussain Sulaimani

Join Date: Apr 2021
Posts: 14

#10

05 Apr 2021, 10:37

Carlo Lazzaro Joao Santos Silva
Thank you for your valuable feedback.
Considering your suggestions,

Dr. Lazzaro,
I clustered at panel ID level (no. of clusters 52) in the two modeling attempts. Also, in the second attempt, I included i.year (time FE).

Dr. Santos Silva,
Following your advice, I used the command: ppmlhdfe.

I ran two modeling attempts, and got the following results:
A. Including only the individual FE (DEST_j), I got the following results:

Code:

. ppmlhdfe flow_ij lnorigin_i lndistance_ij i.DEST_j, vce(cluster ID)
Iteration 1:   deviance = 2.6973e+07  eps = .         iters = 1    tol = 1.0e-04  min(eta) =  -4.17  P   
Iteration 2:   deviance = 1.9435e+07  eps = 3.88e-01  iters = 1    tol = 1.0e-04  min(eta) =  -6.02      
Iteration 3:   deviance = 1.8580e+07  eps = 4.60e-02  iters = 1    tol = 1.0e-04  min(eta) =  -7.12      
Iteration 4:   deviance = 1.8552e+07  eps = 1.49e-03  iters = 1    tol = 1.0e-04  min(eta) =  -7.40      
Iteration 5:   deviance = 1.8552e+07  eps = 5.81e-06  iters = 1    tol = 1.0e-04  min(eta) =  -7.42      
Iteration 6:   deviance = 1.8552e+07  eps = 3.48e-10  iters = 1    tol = 1.0e-05  min(eta) =  -7.42   S O
------------------------------------------------------------------------------------------------------------
(legend: p: exact partial-out   s: exact solver   h: step-halving   o: epsilon below tolerance)
Converged in 6 iterations and 6 HDFE sub-iterations (tol = 1.0e-08)

PPML regression                                   No. of obs      =        572
                                                  Residual df     =         51
Statistics robust to heteroskedasticity           Wald chi2(14)   =    1159.71
Deviance             =  18551677.22               Prob > chi2     =     0.0000
Log pseudolikelihood = -9278324.868               Pseudo R2       =     0.8640

Number of clusters (ID)     =         52
                                     (Std. Err. adjusted for 52 clusters in ID)
-------------------------------------------------------------------------------
              |               Robust
      flow_ij |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
--------------+----------------------------------------------------------------
   lnorigin_i |   .7029036   .1385215     5.07   0.000     .4314064    .9744007
lndistance_ij |   -.577117   .0994041    -5.81   0.000    -.7719455   -.3822886
              |
      DEST_jj |
         bah  |  -2.523979   .5004049    -5.04   0.000    -3.504754   -1.543203
         epr  |  -.1231392   .5953582    -0.21   0.836     -1.29002    1.043741
         hai  |   -.886536   .4413532    -2.01   0.045    -1.751572   -.0214996
         jaz  |  -.6783333   .5557509    -1.22   0.222    -1.767585    .4109184
         jof  |   -1.59948   .4568848    -3.50   0.000    -2.494958   -.7040021
         mad  |    .202558   .4809852     0.42   0.674    -.7401556    1.145272
         mkk  |  -.4099828   .6258462    -0.66   0.512    -1.636619    .8166532
         naj  |  -1.815338   .6095428    -2.98   0.003     -3.01002   -.6206564
        nbrd  |  -1.974462   .8221818    -2.40   0.016    -3.585909   -.3630158
         qas  |    .187227   .5078222     0.37   0.712    -.8080862     1.18254
         riy  |   1.616014   .4935388     3.27   0.001     .6486961    2.583333
         tab  |  -.8176859   .5336744    -1.53   0.125    -1.863669    .2282968
              |
        _cons |   9.684781   1.194159     8.11   0.000     7.344271    12.02529
-------------------------------------------------------------------------------

B. Here is the second set of results, including both the individual FE (DEST_j) and the time FE (year).
[ppmlhdfe i.DEST_j i.year, vce(cluster ID)]

Code:

. ppmlhdfe flow_ij lnorigin_i lndistance_ij i.DEST_j i.year, vce(cluster ID)
Iteration 1:   deviance = 2.6264e+07  eps = .         iters = 1    tol = 1.0e-04  min(eta) =  -4.17  P   
Iteration 2:   deviance = 1.8693e+07  eps = 4.05e-01  iters = 1    tol = 1.0e-04  min(eta) =  -6.08      
Iteration 3:   deviance = 1.7827e+07  eps = 4.86e-02  iters = 1    tol = 1.0e-04  min(eta) =  -7.21      
Iteration 4:   deviance = 1.7800e+07  eps = 1.56e-03  iters = 1    tol = 1.0e-04  min(eta) =  -7.49      
Iteration 5:   deviance = 1.7800e+07  eps = 6.08e-06  iters = 1    tol = 1.0e-04  min(eta) =  -7.51      
Iteration 6:   deviance = 1.7800e+07  eps = 3.63e-10  iters = 1    tol = 1.0e-05  min(eta) =  -7.51   S O
------------------------------------------------------------------------------------------------------------
(legend: p: exact partial-out   s: exact solver   h: step-halving   o: epsilon below tolerance)
Converged in 6 iterations and 6 HDFE sub-iterations (tol = 1.0e-08)

PPML regression                                   No. of obs      =        572
                                                  Residual df     =         51
Statistics robust to heteroskedasticity           Wald chi2(26)   =    7661.03
Deviance             =  17799617.32               Prob > chi2     =     0.0000
Log pseudolikelihood = -8902294.915               Pseudo R2       =     0.8695

Number of clusters (ID)     =         52
                                     (Std. Err. adjusted for 52 clusters in ID)
-------------------------------------------------------------------------------
              |               Robust
      flow_ij |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
--------------+----------------------------------------------------------------
   lnorigin_i |    .687167   .1426196     4.82   0.000     .4076377    .9666963
lndistance_ij |   -.579165   .0982796    -5.89   0.000    -.7717896   -.3865405
              |
      DEST_jj |
         bah  |  -2.524047   .5116029    -4.93   0.000    -3.526771   -1.521324
         epr  |  -.1382823    .601594    -0.23   0.818    -1.317385     1.04082
         hai  |  -.8887676   .4464587    -1.99   0.047    -1.763811   -.0137246
         jaz  |  -.6781181   .5672767    -1.20   0.232     -1.78996    .4337238
         jof  |  -1.601437   .4619895    -3.47   0.001     -2.50692   -.6959546
         mad  |   .2010813   .4914005     0.41   0.682    -.7620461    1.164209
         mkk  |  -.4084137   .6271428    -0.65   0.515    -1.637591    .8207637
         naj  |  -1.815843   .6212883    -2.92   0.003    -3.033546   -.5981402
        nbrd  |  -1.977209   .8189277    -2.41   0.016    -3.582278   -.3721405
         qas  |   .1841842   .5110668     0.36   0.719    -.8174883    1.185857
         riy  |   1.611245   .4979908     3.24   0.001     .6352012    2.587289
         tab  |  -.8183101   .5445168    -1.50   0.133    -1.885543    .2489231
              |
         year |
        2007  |   .0258422   .0238659     1.08   0.279    -.0209342    .0726186
        2008  |  -.1541256    .062916    -2.45   0.014    -.2774387   -.0308124
        2009  |   .1625109   .0772471     2.10   0.035     .0111093    .3139124
        2010  |    .158896   .0930755     1.71   0.088    -.0235286    .3413206
        2011  |   .2243428   .0811565     2.76   0.006      .065279    .3834067
        2012  |   .1901082    .095927     1.98   0.048     .0020947    .3781216
        2013  |   .2516618   .0922519     2.73   0.006     .0708514    .4324722
        2014  |   .2521025   .0958217     2.63   0.009     .0642954    .4399096
        2015  |   .3549189    .116716     3.04   0.002     .1261597     .583678
        2016  |   .2325958    .111932     2.08   0.038     .0132131    .4519785
        2017  |   .1700456   .1256714     1.35   0.176    -.0762658    .4163569
        2018  |   .0516265   .1312694     0.39   0.694    -.2056567    .3089098
              |
        _cons |   9.651497   1.201433     8.03   0.000     7.296731    12.00626
-------------------------------------------------------------------------------

My inquiries:
1. The distance variable is time-invariant (it does not change over time), and it is my main explanatory variable. Is it statistically sound to measure the impact of a time-invariant variable on a response variable in panel data?
2. Does clustering by ID take the place of making the standard errors robust to overdispersion?
3. After including the time FE in attempt B, the z-value of lnorigin_i fell in magnitude (5.07 to 4.82), while that of lndistance_ij rose in magnitude (-5.81 to -5.89). Does this mean that the latter is not affected by time-varying factors?
4. If I include fixed effects, can I run the model without an intercept (constant)?
5. Could you suggest a book or article that helps with interpreting the results and measuring model fit?

Thank you very much and sorry for the lengthy reply, as I am trying to make the conducted steps clearer.
Huss


Joao Santos Silva

Join Date: Apr 2014

Posts: 3000
#11

06 Apr 2021, 02:48

Dear Hussain Sulaimani,

1 - That is not a problem
2 - Clustering by ID makes the standard errors robust to overdispersion (actually, overdispersion is only defined for count data, so that is not a problem here)
3 - I do not know what these variables are, so I cannot comment
4 - It should not make a difference
5 - Try this https://papers.ssrn.com/sol3/papers....act_id=3421148

Best wishes,

Joao
Hussain Sulaimani

Join Date: Apr 2021

Posts: 14
#12

08 Apr 2021, 06:04

Thank you for replying Prof. Joao Santos Silva

The working paper you sent me is really helpful.

Since my problem concerns freight flow, and you are an expert in this type of flow problem, I would appreciate your recommendation on an issue I am facing.

The total number of observations is 572; 162 of them have zero flow.

I am including a dummy variable RAIL_ij that indicates the availability of rail services between port_i and province_j. Only one port-province pair has rail services, and this pair is repeated over 13 years. When I add this dummy variable, it gets a very high coefficient, and the coefficient of another independent variable drops drastically. I am worried that this might be caused by the large number of zero-flow observations. In this case, could including it cause problems for the other port-province pairs that have nonzero flow?

Thank you,
Hussain

Last edited by Hussain Sulaimani; 08 Apr 2021, 06:07.
Joao Santos Silva

Join Date: Apr 2014

Posts: 3000
#13

08 Apr 2021, 06:21

Sorry, I would have to know much more about the problem to be able to comment; please discuss this with your team.

Best wishes,

Joao
