Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Panel data with multiple observations per year

    Dear Statalisters,

    I am having a dataset with unbalanced panel data on class action suits, as stated below. The issue here is that some of the companies are having more than one records per year, but those records have different dates (see company_id=C), so they are not per se the same regarding the actual event date. The reason that a company may have multiple records is that it can receive more than one class action lawsuits within a year (which differ in date). The rest of the variables (e.g. Var1) represent a scoring for every class action suit, so it differs per company and time, but I also include other variables such as ASSETS that is the same for every company within the same year.

    So, my main question is on how to treat this dataset.
    Of course, if I am going to declare the dataset as a panel (xtset company_id year) I will get an error. In case that I'll type "xtset company_id " disregarding the time variable, would that solve my problem?
    Further, I do not know if it helps, but my dataset's maximum number of multiple observations within a year for the same company is 5. Maybe creating a new set of variables equal to this number could be the solution?

    Dropping the duplicate observations is not an option as my dataset is already quite limited.

    company_id year event_date Var1 Assets etc.
    A 2010 1/10/2010 55 5600
    A 2012 5/30/2012 40 5550
    A 2013 8/22/2013 38 5650
    B 2011 7/12/2011 99 335
    C 2009 3/11/2009 29 4300
    C 2014 2/08/2014 35 6000
    C 2014 6/24/2014 35 6000


    I would really appreciate your help and thoughts.

  • #2
    It depends on what you want to try to do with this data. For many analyses, all you need to accomplish with -xtset- is identify the panels, and for that -xtset company_id- is enough. Sometimes, however, you really need to specify the time variable; as when you want to do regressions with autoregressive residual structure, or analyses that use lags, leads, or differences. In that case, I might do -xtset company_id event_date-. (But as your event_dates are pretty irregularly spaced, this wouldn't really work well: most lags, leads, and differences would just be missing, and you certainly can't do autoregressive analyses with these dates.) Another possibility is to set up consecutive event numbers within companies:

    Code:
    by company_id (event_date), sort: gen int event_num = _n
    and then you could -xtset company_id event_num-. With this specification you would be able to make effective use of lag, lead, and difference operators.

    And there is an altogether different approach that might work for you. It appears that Var1 and Assets are actually constant within company_id for any given year. So, if this is true for all of your "etc." variables, then depending on where you're going with this, you might want to reduce the data set to a single observation per company_id year pair, with a variable showing how many events occurred, and using the common values of the other covariates:

    Code:
    by company_id year (Var1), sort: assert Var1[1] == Var1[_N]
    by company_id year (Assets), sort: assert Assets[1] == Assets[_N]
    // INSERT SIMILAR -assert- STATEMENTS HERE FOR ALL OF THE etc VARIABLES
    // IF THERE ARE MANY OF  THEM, THEN USE A -foreach- LOOP
    
    collapse (count) n_events = event_date (first) Var1 Assets etc_variables, by(company_id year)

    Comment


    • #3


      First of all I would like to thank you for your valuable feedback. Further, I would like to inform you that I will not use any lags, leads, difference nor an autoregressive residual structure. I suppose that your first suggestion "xtset company_id event_num" might solve the problem of multiple observations per year and I will certainly try it.
      Regarding now your second suggestion, I would like to pinpoint that yes indeed Assets are constant within a company in a given year , but Var1 is not (mistakenly I have typed the same Var1 value for company C in year 2014. Please consider this observation to have a Var1 value of "39").

      In addition except of your suggestion on using "xtset company_id event_num", might using pooled OLS instead of panel data solved my problem?

      I would like to thank you on again for your time and efforts.

      Comment


      • #4
        Kleopatra:
        I don't think that pooled OLS can ease your problem, as it is simply a (often less efficient) way to do panel data analysis.
        Kind regards,
        Carlo
        (Stata 19.0)

        Comment


        • #5
          Dear Statalisters,

          I am facing a similar issue. My database is an unbalanced panel data on spectrum auctions (a sample is stated below). The issue here is that some of the Regulation Authorities auction spectrum only some years and others no, the dates are differents in every country. SO what I have is that in some years I have no observations but for others, and for specific operators I could have more than one spectrum allocation. It is important to me to know the different spectrum bands and prices paid for each auction even though they could have been performed in the same year to the same operator. The rest of the variables (e.g. Population, GDP, ARPS; Revenues) are the same for the company at an specific year. However, for the model estimation I will have to use some lags of some variables, and therefore I dont know if this repeated observations for the same years is going to difficult the estimation.

          COUNTRIES OPERATOR ID YEAR Population Suscriptions per Operator Voice Revenues Market Share by operator Implemented Technology Spectrum band width (MHz) Award Metod: AUC(auction), Beauty Contest (BC), Grant (G), Direct Award (DA), First come first served (FCFS), Hybrid (HYB), Reallocation (REALL), Other License Fee
          Argentina Telecom Personal 1 2006 38.971 8.425 920 0,28
          Argentina Telecom Personal 1 2007 39.360 10.666 990 0,27
          Argentina Telecom Personal 1 2008 39.750 12.564 1.146 0,28
          Argentina Telecom Personal 1 2009 40.130 14.475 1.021 0,30
          Argentina Telecom Personal 1 2010 40.520 16.333 1.489 0,31
          Argentina Telecom Personal 1 2011 40.900 18.193 1.653 0,33
          Argentina Telecom Personal 1 2012 41.733 18.975 1.650 0,33
          Argentina Telecom Personal 1 2013 42.203 20.088 1.054 0,32
          Argentina Telecom Personal 1 2014 42.670 19.585 742 0,31 Neutral: 4G LTE 30 AUCT 355
          Argentina Telecom Personal 1 2014 42.670 19.585 742 0,31 Neutral 5 AUCT 5
          Argentina Telecom Personal 1 2014 42.670 19.585 742 0,31 Neutral 10 AUCT 6
          Argentina Telecom Personal 1 2014 42.670 19.585 742 0,31 Neutral 20 AUCT 247
          Argentina Telecom Personal 1 2014 42.670 19.585 742 0,31 Neutral 8 AUCT 45
          Argentina Telecom Personal 1 2015 43.130 19.656 784 0,31
          Argentina Telecom Personal 1 2016 43.590 19.921 555 0,31
          Argentina Telecom Personal 1 2017 44.045 18.980 493 0,31 Neutral: 4G LTE 40 DA 56
          Argentina Telecom Personal 1 2017 44.045 18.980 493 0,31 Neutral: 4G LTE 30 DA 33
          Argentina Movistar 2 2006 38.971 11.059 1.272 0,36
          Argentina Movistar 2 2007 39.360 13.630 1.247 0,35
          Argentina Movistar 2 2008 39.750 14.829 1.408 0,34
          Argentina Movistar 2 2009 40.130 15.931 1.434 0,33
          Argentina Movistar 2 2010 40.520 16.149 1.472 0,31
          Argentina Movistar 2 2011 40.900 16.767 1.523 0,30
          Argentina Movistar 2 2012 41.733 17.604 1.531 0,30
          Argentina Movistar 2 2013 42.203 19.955 1.331 0,32
          Argentina Movistar 2 2014 42.670 20.022 1.006 0,32 Neutral 20 AUCT $ 209,14
          Argentina Movistar 2 2015 43.130 20.389 1.113 0,32 Neutral 20 AUCT $ 218,67
          Argentina Movistar 2 2016 43.590 20.765 747 0,33
          Argentina Movistar 2 2017 44.045 19.256 665 0,31 Neutral: 4G LTE 30 DA $ 33,00

          Sorry, I wasnt able to copy from Stata, the data didnt llok very organized.

          Then I tried following Clyde suggestion in this forum:
          by Operator (year), sort: gen int event_num = _n But I dont really understand how this works. Then, I tried the following (copied from the stata panel): xtset Operator event_num string variables not allowed in varlist; Operator is a string variable r(109); . xtset year event_num repeated time values within panel r(451); . xtset IdOp event_num repeated time values within panel r(451); It seems that if it is a string variable I can not do this, so thats why I tried woth IdOp(Operator identificator) but is not working because of the years repetition. any comment will be very helpful for me. Thanks a lor, Alexandra

          Comment


          • #6
            Code:
            by Operator (year), sort: gen int event_num = _n
            will not work in the situation you describe. If you have multiple observations for the same Operator in the same year, then this code will just assign event_num to those observations in an arbitrary and irreproducible order. If you then use lags and leads, etc., you will get different results every time you run the analyses. I do not know what the context from which you took that code is. The idea behind it is to order the observations chronologically first and then assign the observations consecutive numbers. That gets followed by an -xtset- command with event_num as the "time" variable. That makes "lagged" values refer to the immediately preceding value, even if that value doesn't happen to be exactly one year (or month, or whatever the time unit) before. It sounds like you need to do something like that, but you need to establish the actual order in which these events occurred. If all you know is that some Operator did a bunch of auctions in the same year and you don't know what order they were done in, then there is no possibility of defining lags.

            As for the problem with string variable, you need a numeric value that corresponds to the operator. You can do that either with the -encode- command (see -help encode-) or with -egen nOperator = group(Operator)-.

            In the future, when showing data examples, please use the -dataex- command to do so. If you are running version 15.1 or a fully updated version 14.2, it is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

            When asking for help with code, always show example data. When showing example data, always use -dataex-.

            Comment


            • #7
              Hello,

              I am also having a problem that is somewhat similar to the other two individuals, but one that the past comments cannot address.

              Here is my dataset:

              Code:
              * Example generated by -dataex-. To install: ssc install dataex
              clear
              input long(gvkey datadate) int sics1
              1001  8765 5812
              1001  9131 5812
              1001  9496 5812
              1003  8400 5021
              1003  8765 5021
              1003  9131 5021
              1003  9527 5021
              1003  9892 5932
              1003 10257 5712
              1003 10623 5712
              1003 10988 5712
              1004  6360 4582
              1004  6360 5088
              1004  6725 4582
              1004  6725 5088
              1004  7090 4582
              1004  7090 5088
              1004  7456 4582
              1004  7456 5084
              1004  7456 5088
              end
              format %d datadate
              ------------------ copy up to and including the previous line ------------------

              Listed 20 out of 466746 observations


              As you can see, it's an unbalanced panel data, but here also, the problem is that some GVKEYs have 2 or more different SIC codes in the same year (database is our time variable). I would like to create a panel dataset that stata can understand through xtset, where gvkey has only one observation per date. I marked in red an instance that this is not the case.

              At the end, I hope my dataset will look like so:
              gvkey datadate sic1_1 sic1_2 sic1_3 sic1_4
              1 2010 123 456 . .
              1 2011 123 456 . .
              1 2012 123 . . .
              2 2010 876 . . .
              2 2011 876 . . .
              2 2012 876 . . .
              2 2013 999 . . .
              2 2014 999 . . .
              2 2015 999 000 . .
              2 2016 444 333 . .
              3 2010 444 333 . .
              3 2011 444 333 . .
              3 2012 444 333 . .
              3 2013 444 333 . .
              Meaning, I simply want to make the sic variable that is now expressed per id variable per year, to be expressed in different columns, when it's relevant and be left as missing when it is not.

              I had a similar issue in the past and I ended up renaming the variables and merging, and re-merging the dataset with itself, to create the additional columns, but I think there must a simpler solution.

              Will appreciate any helpful comments.

              Thank you.

              Comment


              • #8
                I don't understand how this helps you with your -xtset- problem. But there is, indeed, a simple way to get the layout you are asking for:

                Code:
                * Example generated by -dataex-. To install: ssc install dataex
                clear
                input long(gvkey datadate) int sics1
                1001  8765 5812
                1001  9131 5812
                1001  9496 5812
                1003  8400 5021
                1003  8765 5021
                1003  9131 5021
                1003  9527 5021
                1003  9892 5932
                1003 10257 5712
                1003 10623 5712
                1003 10988 5712
                1004  6360 4582
                1004  6360 5088
                1004  6725 4582
                1004  6725 5088
                1004  7090 4582
                1004  7090 5088
                1004  7456 4582
                1004  7456 5084
                1004  7456 5088
                end
                format %d datadate
                
                by gvkey datadate, sort: gen _j = _n
                rename sics1 sics1_
                reshape wide sics1_, i(gvkey datadate) j(_j)
                -reshape- is one of the most useful data management commands in Stata. Do read both -help reshape- and the section of the PDF manuals linked therein. It takes a bit of practice to "get the feel" of how it works. But once you learn it, you will find it endlessly helpful.

                Comment


                • #9
                  Clyde this is great, thank you very much. It's exactly what I need.

                  I will read the reshape command carefully.

                  Thank you.

                  Comment


                  • #10
                    Dear Statalisters,

                    I am working with the experimental data that I obtained from the laboratory.

                    The experiment lasts 21 experimental rounds with 48 individuals. Each individual (denoted by the variable id) in each experimental round (denoted by the variable period) takes 11 decisions, such that I have a balanced panel data with multiple observations per id per period.

                    Then, I am not able to declare the data set in Stata 13.1 as a panel data using xtset id period because of multiple observations per id per period.

                    Can you help me please on that matter ? Before running a regression analysis, what is the way of declaring the structure of my data?

                    Will appreciate your comments.
                    Thank you

                    Comment


                    • #11
                      Vita:
                      welcome to this forum.
                      If you do not have genuine duplicates (ie, errors in data entry) and do not plan to use time-series related commands such as lags and leads, you can -xtset- yout data with -panelid- only.
                      Kind regards,
                      Carlo
                      (Stata 19.0)

                      Comment


                      • #12
                        Hello, Carlo, thank you so much for your reply, will use your suggestion in my analysis. However, I would like to give more details on the data I am working with.

                        Althougth my concern is not to investigate the effect of time on the decisions taken by individuals (dependent variable), I would need to check for its possible influence.

                        Time dimention of my experiment allows me to change some parameters of the game played by participants. Hence, as far as the game is repeated, even with different parameters in each round, the individual might undertand better the game and refine her behavior in terms of decicions and, hence, reveal a kind of convergent/divergent/oscilating behavior in her decisions.
                        Therefore, individual's decisions might change across periods due to the fact of the repetition of the game (period variable ) and also due to the changes in the parameters (parameter variable).

                        Then, on the one hand, I believe I need to control for time in the regression analysis. But, on the other hand, assuming I declare the data as a panel data (that I do not know how to do it yet), will I obtain a biased estimation of the coefficients of the regression model because of including in the same regression a time regressor and parameter regressor, which are both highly correlated?


                        Kind regards,
                        Vita

                        Comment


                        • #13
                          Vita:
                          thanks for clarifications.
                          However, you seem to have multiple dependent variables per patient plus predictors.
                          This design cannot be accomodated by panel data regression models, that consider one regressand only.
                          Maybe you can take a look at -mvreg-.
                          Kind regards,
                          Carlo
                          (Stata 19.0)

                          Comment


                          • #14
                            Dear all,

                            I have a similar question to the above - my dataset is similar to Vita's (multiple individuals for each country for each year) and I do not intend to use lags or leads. How do I use the "panelid" command suggested by Carlo? I have not been able to find the commands that I need to use in order to create the panel without xtset.

                            Thank you in advance,

                            Joan

                            Comment


                            • #15
                              Originally posted by Joan Marti View Post
                              . . . my dataset is . . . multiple individuals for each country for each year . . . How do I use the "panelid" command suggested by Carlo?
                              Code:
                              xtset country_id

                              Comment

                              Working...
                              X