Evaluating data structure for unbalanced panel

Marc Pelow

Join Date: Jul 2021
Posts: 85

14 Jan 2024, 05:20

Hello,

I hope you don't mind another semi-Stata related question. I'm not very experienced in working with panel data and would greatly appreciate your advice.

In my current research project, the data structure is as follows:

I compiled a list of publicly listed US firms that were part of a major stock index from 2010 to 2018, which is my sample period. Once I had identified this set of firms, I obtained sentiment data for their quarterly earnings conference calls. In essence, these data capture aspects such as the tone (positive, neutral, or negative) of the CEOs' speech during these calls. Ideally, I would have 36 firm-quarter observations for each firm (9 years * 4 quarters). I am planning to run fixed effects regressions using the sentiment measures as my dependent variables and some CEO characteristics as my independent variables.

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input int Conf_Call_Date double Firm_ID int(CEO_Pres_Speech_NoWords CC_Quarter) float(CEO_Pres_Speech_NoPosWords CEO_Pres_Speech_NoNegWords)
18392 4295899290  397 200  5  3
18469 4295899290  411 201  7  2
18570 4295899290 1172 202 19 29
19116 4295899290 1002 208 24  6
19207 4295899290  897 209 18  2
19298 4295899290 1036 210 12  8
19409 4295899290  728 211 14  8
19480 4295899290  987 212 12  8
19571 4295899290  901 213 11  4
19662 4295899290  926 214 11  8
19772 4295899290  620 215  8  6
19844 4295899290  655 216  6  4
19935 4295899290 1084 217 21  8
20026 4295899290  769 218 12  5
20137 4295899290  803 219 16  8
20208 4295899290  927 220  9  9
20299 4295899290  982 221 30 13
20390 4295899290  947 222  8 13
20502 4295899290  868 223 15 12
20579 4295899290 1007 224 16 14
20670 4295899290  747 225 10 10
20761 4295899290  988 226 15 13
20867 4295899290 1054 227 16  6
20943 4295899290 1154 228 14 11
21034 4295899290 1348 229 33 24
end
format %tdDD/NN/CCYY Conf_Call_Date
format %tq CC_Quarter
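
Purely as an illustration of how the word counts above could be turned into a sentiment variable, a net tone measure might be built as follows. The definition and the name Net_Tone are assumptions made for this sketch, not something fixed by the data description.

```stata
* Sketch only: Net_Tone is a hypothetical variable, defined here as
* (positive words - negative words) / total words in the CEO's presentation.
* A call with zero words yields a missing value, since Stata returns
* missing for division by zero.
generate double Net_Tone = ///
    (CEO_Pres_Speech_NoPosWords - CEO_Pres_Speech_NoNegWords) / CEO_Pres_Speech_NoWords

* Quick look at the distribution of the new measure
summarize Net_Tone, detail
```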

Missing data for some of the firms leads to the following frequency distribution of firm-quarters.

Code:

 No_Quarter |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |          3        0.04        0.04
          2 |          4        0.05        0.09
          3 |          9        0.11        0.20
          4 |         20        0.25        0.45
          5 |          5        0.06        0.52
          8 |         32        0.40        0.92
          9 |          9        0.11        1.03
         11 |         22        0.28        1.31
         12 |         48        0.60        1.91
         13 |         13        0.16        2.07
         14 |         28        0.35        2.43
         15 |         30        0.38        2.80
         16 |         16        0.20        3.00
         17 |         17        0.21        3.22
         18 |         72        0.90        4.12
         19 |         38        0.48        4.60
         20 |         80        1.01        5.60
         21 |         63        0.79        6.40
         22 |         88        1.11        7.50
         23 |        230        2.89       10.39
         24 |        120        1.51       11.90
         25 |        225        2.83       14.73
         26 |        260        3.27       17.99
         27 |        459        5.77       23.76
         28 |        448        5.63       29.39
         29 |        551        6.92       36.32
         30 |      1,110       13.95       50.26
         31 |      1,581       19.87       70.13
         32 |        576        7.24       77.37
         33 |         66        0.83       78.20
         34 |        272        3.42       81.62
         35 |        455        5.72       87.33
         36 |      1,008       12.67      100.00
------------+-----------------------------------
      Total |      7,958      100.00

xtset Firm_ID CC_Quarter
xtdescribe

 Firm_ID:  4.296e+09, 4.296e+09, ..., 8.590e+09              n =        292
CC_Quarter:  2010q1, 2010q2, ..., 2018q4                     T =         36
           Delta(CC_Quarter) = 1 quarter
           Span(CC_Quarter)  = 36 periods
           (Firm_ID*CC_Quarter uniquely identifies each observation)

Distribution of T_i:   min      5%     25%       50%       75%     95%     max
                         1       8      26        30        31      36      36

     Freq.  Percent    Cum. |  Pattern
 ---------------------------+--------------------------------------
       28      9.59    9.59 |  111111111111111111111111111111111111
        8      2.74   12.33 |  11111111111.....11111111111111111111
        7      2.40   14.73 |  ....11111111111111111111111111111111
        6      2.05   16.78 |  111.....1111111111111111111111111111
        6      2.05   18.84 |  111111111111111111......111111111111
        5      1.71   20.55 |  111111111111111.....1111111111111111
        5      1.71   22.26 |  11111111111111111111111111......1111
        4      1.37   23.63 |  1111111111111111111.........11111111
        4      1.37   25.00 |  11111111111111111111.....11111111111
      219     75.00  100.00 | (other patterns)
 ---------------------------+--------------------------------------
      292    100.00         |  XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

Clearly, I am working with an unbalanced panel, and I would highly appreciate it if you could point out any concerns about the underlying data structure that might prevent me from running a panel data regression with firm and quarter fixed effects. As far as I know, whether the dataset is balanced or unbalanced does not influence the estimation of the coefficients, but I am not sure about other parts of the model estimation.

Thank you!

Carlo Lazzaro

Join Date: Apr 2014

Posts: 17613
#2

14 Jan 2024, 05:40

Marc:
1) as far as I can see from your data excerpt/example, you do not have an unbalanced panel dataset, but one with gaps (that is, the number of available quarters varies across years);
2) your -Firm_ID- refers to a single firm, so no panel data can be run;
3) I do not see relevant issues with your dataset;
4) as you correctly pointed out, Stata is not concerned about balanced or unbalanced panel datasets.
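
As a minimal sketch of point 4), the fixed effects regression described in the question could be set up along these lines on the full (unbalanced) dataset. Tone and CEO_Tenure are hypothetical variable names used only for illustration; they do not appear in the data example.

```stata
* Sketch only: Tone (dependent variable) and CEO_Tenure (regressor)
* are hypothetical names, not variables from the posted excerpt.
xtset Firm_ID CC_Quarter

* Firm fixed effects via -fe-, quarter fixed effects via i.CC_Quarter,
* standard errors clustered at the firm level.
xtreg Tone CEO_Tenure i.CC_Quarter, fe vce(cluster Firm_ID)

* Joint test of the quarter indicators
testparm i.CC_Quarter
```

-xtreg, fe- uses whatever observations each firm contributes, so no balancing step is needed beforehand.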

Kind regards,
Carlo
(StataNow 18.5)
Marc Pelow

Join Date: Jul 2021

Posts: 85
#3

14 Jan 2024, 13:30

Hello Carlo,

Thank you for your prompt response. You're right in noting that the necessary data for some firms, in fact, for the majority, is missing for at least one quarter across the 9-year period. Doesn't this make the panel inherently unbalanced? I assumed that if at least one member of a panel has missing data for one time period, the panel is unbalanced.

Regarding point 2), could you provide further clarification? I'm having difficulty grasping the notion of "no panel data can be run." In my dataex, only one firm is present in the sample, while my actual dataset comprises around 290, each identified by a unique Firm_ID.

I appreciate your insights!

Best regards,
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17613
#4

15 Jan 2024, 01:37

Marc:
1) sticking with your data example (that includes one firm only), -xtset- message is:

Code:

. xtset Firm_ID CC_Quarter

Panel variable: Firm_ID (strongly balanced)
 Time variable: CC_Quarter, 2010q1 to 2017q2, but with a gap
         Delta: 1 quarter

I would skip the meaning of "strongly balanced" here (as a single -panelid- does not allow any comparison with other panels) and focus on "gap" (that is, the number of available quarters varies across years).
It may well be that, when considering the entire dataset, you detect panel unbalancedness (that is, the panels include neither the same number of observations nor the same time points).
For more details on this topic, see the Technical note in the -xtset- entry of the Stata .pdf manual.
2) my comment about the unfeasibility of a panel data regression referred to your -dataex- excerpt: a single firm is not enough to run a panel data regression.
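
To check both features on the entire dataset rather than on the one-firm excerpt, something along these lines might help. It is only a sketch: T_i, after_gap, has_gap and firm_tag are illustrative variable names.

```stata
* Sketch only: T_i, after_gap, has_gap and firm_tag are illustrative names.
xtset Firm_ID CC_Quarter

* Number of quarters observed for each firm
bysort Firm_ID (CC_Quarter): generate int T_i = _N

* Flag observations that come right after one or more missing quarters
* (missing for the first observation of each firm)
by Firm_ID (CC_Quarter): generate byte after_gap = ///
    (CC_Quarter - CC_Quarter[_n-1] > 1) if _n > 1

* Firm-level indicator: 1 if the firm has at least one gap
egen byte has_gap = max(after_gap), by(Firm_ID)

* One row per firm: number of quarters against presence of gaps
egen byte firm_tag = tag(Firm_ID)
tabulate T_i has_gap if firm_tag, missing

* Show more participation patterns than the default
xtdescribe, patterns(20)
```

Firms with T_i below 36 make the panel unbalanced; firms with has_gap equal to 1 additionally have holes inside their observed span.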

Kind regards,
Carlo
(StataNow 18.5)
Marc Pelow

Join Date: Jul 2021

Posts: 85
#5

15 Jan 2024, 03:09

Got it, thanks Carlo!