  • # of observations in my panel data set

    Hello,

    I have a panel data set of, according to Stata, 5296 observations. This can be seen either by using the count command or at the bottom right of the data editor.

    My question is the following: How can it be that according to Stata I have 5296 observations, which is equal to the total number of columns, while I have multiple variables?

    As far as I know, an observation is just a data point in a data set, so a value of (in my case) total assets of firm X in year T, total assets of firm X in year T+1, total assets of firm Y in year T, and so on.

    It is not logical to me, to see that I have 5296 columns, each column representing firm X, Year T quarter t, while I have more than 12 variables.

    Does Stata mean something different by "observation" than what I understand by it? I read something about _n and _N; maybe that explains it.

    Thank you in advance for answering my questions.

    Yannick

    To be clearer, this is what my dataset looks like:


    1. Bank X Year Q1 assets equity
    2. Bank X Year Q2 assets equity
    ...
    3. Bank Y Year Q1 assets equity
    ...
    5296. Bank Z Year Q1 assets equity

  • #2
    Since you're familiar with the data editor, the answer is easy to explain. To Stata, each row that you see in the data editor is a single observation, and each column that you see is a single variable. Your example of how your data looks shows 5296 observations and, apparently, 4 variables: the Bank, the year-and-quarter, the assets of the bank in that year-and-quarter, and the equity for the bank in that year-and-quarter.
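The rows-versus-columns distinction can be checked directly in Stata. The following is a minimal sketch on a hypothetical toy panel (the bank names, quarters, and values are invented for illustration and are not the poster's data); it has 4 observations and 4 variables.

```stata
* Hypothetical toy panel: 2 banks x 2 quarters (all values invented)
clear
input str6 bank str6 quarter assets equity
"Bank X" "2015q1" 100 10
"Bank X" "2015q2" 110 11
"Bank Y" "2015q1" 200 25
"Bank Y" "2015q2" 210 26
end

count              // number of observations (rows in the Data Editor)
describe, short    // reports the number of observations and of variables
display c(N)       // observations currently in memory
display c(k)       // variables currently in memory
```

Here `count` and `c(N)` give the number of rows, while `c(k)` gives the number of columns; the two numbers are reported separately and are never multiplied together.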



    • #3
      In Stata terminology, the word observations is used to refer to the number of "rows" in the data set when it is laid out as if it were a spreadsheet. So _n ranges from 1 through the number of observations, and _N is equal to the number of observations.

      What you refer to as a "column" is, in Stata terminology, a variable.

      What you are referring to as an observation or a data point has no particular name in Stata: it is just the value of a particular variable in a particular observation.

      We generally avoid using the terms "row" and "column" because they have spreadsheet-like connotations and it is important to not think of a Stata data set as a spreadsheet--that metaphor can lead to inappropriate or impossible calculations.

      It appears that you have both 5,296 observations (Stata terminology) and 5,296 variables (Stata terminology) in your data set, which is, I imagine, a coincidence.



      • #4
        Thank you both.

        Clyde Schechter I doubt that it is really a coincidence, as I have 18 variables (so columns), while, as I said, I have 5296 observations (rows). If, according to my terminology, an observation is a data point of a variable at a specific point in time, it can never be the case that I have 5296 observations (Stata terminology, so rows) and 18 variables (Stata terminology, so columns), which makes a total of 5296 "observations" (own terminology).

        However, when I generate variables x=_n and y=_N, both are 5296 in the end (which kind of contradicts my story above, but it seems so odd to me though).

        What about the # of observations displayed when you run a regression? Is that the same # of observations as the # of observations displayed in the Stata editor?



        • #5
          YH:
          the number of observations displayed in the regression output excludes those with a missing value in any variable included in the regression (listwise deletion).
          Kind regards,
          Carlo
          (StataNow 18.5)



          • #6
            Carlo:
            Yes, but that is still on the basis of observations as in the # of "rows" in your regression?

            So, for example, suppose I have 10 observations (10 rows) with a y variable and an x variable. If, in the data editor, row number 5 has a value for y but none for the independent variable x, this observation ("row") will not be counted in the # of observations in the regression output, right? So in this example I will have 9 observations in my regression output?

            And maybe a really straightforward question, but when some observations are missing for a particular variable, this variable will still be present in the regression, right? So the whole variable is not deleted when there is a missing value?



            • #7
              YH:
              yes, the variable (column) will still be present for all the observations without missing values.
              Conversely, any observation (row) with at least one missing value in any of the variables (columns) in the model will be ruled out of the regression (the same holds for the background matrix calculations that Stata performs to obtain the regression results).
              Kind regards,
              Carlo
              (StataNow 18.5)
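The 10-row example from #6 can be tried out directly. What follows is a minimal sketch with made-up values (the variable names y and x come from #6; the numbers are invented), assuming a plain regress of y on x.

```stata
* Toy data: 10 observations of two variables (values are made up)
clear
set obs 10
generate y = _n + mod(_n, 3)
generate x = 2*_n
replace x = . in 5          // observation 5 has y but no x

regress y x
display e(N)                // observations actually used in the estimation
count if e(sample)          // same number, via the estimation-sample marker
count if missing(y, x)      // observations with a missing value in y or x
```

The data set still holds 10 observations after the regression; only the estimation sample shrinks to 9, and the variable x itself remains in the model.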



              • #8
                YH: The automatic variable _n will be 1 for your first observation, 2 for the second, and so on, until 5296 for your final observation. The automatic variable _N will be the total number of observations - 5296 - for each observation. Neither _n nor _N has anything to do with the number of variables.

                What you describe as "your terminology" in post #4 is not standard terminology, neither for Stata nor for statistical writing in general, and its use can only lead to confusion. I would write something like "you have 5296 observations of 18 variables" which is unambiguous. It is almost never important to know their product, 95,328, by itself, which is why neither Stata nor statistics has a standard name for it.
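The behaviour of _n and _N described above can be seen in a small sketch. The data set below is hypothetical (5 empty observations created only for illustration), and the variable names obsno and total are placeholders.

```stata
* Hypothetical data set with 5 observations
clear
set obs 5
generate obsno = _n     // 1, 2, 3, 4, 5: the number of the current observation
generate total = _N     // 5 in every observation: the total number of observations
list

* The product of observations and variables mentioned in the post
display 5296 * 18       // 95328
```

Adding more variables to this data set would leave both obsno and total unchanged, since neither _n nor _N depends on the number of variables.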



                • #9
                  William: Ah, all right. Apparently I need to reopen my econometrics book then. I really thought that my terminology was the standard, so that is a mistake then.

                  But just for my own knowledge, and as the last question in this topic: the number of observations is then independent of the number of variables? That also seems kind of odd to me, because in my opinion it makes sense to have more explanatory power when you have more variables (although I know that more variables is not always better, in regression for example).



                  • #10
                    YH:
                    I suspect that you're mixing up listwise with pairwise deletion.
                    Stata uses only the former when it comes to regression models.
                    Kind regards,
                    Carlo
                    (StataNow 18.5)



                    • #11
                      Yes, the number of observations has no relationship to the number of variables, nor does it have any direct relationship to the explanatory power of your model.

                      Increasing the number of observations or the number of variables in your data may increase the explanatory power of a model built with that data, but neither the number of observations nor the number of variables measures explanatory power, either singly or in combination. Statistics like R-squared measure the explanatory power of your model. Without a model, your data has no explanatory power.
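As an illustration of the point that R-squared belongs to a model rather than to the data, here is a minimal sketch using the auto data set that ships with Stata. The choice of price, mpg, and weight is arbitrary and only for demonstration.

```stata
* Same data, two different models: R-squared is a property of the model
sysuse auto, clear

regress price mpg
display e(N)        // observations used by this model
display e(r2)       // R-squared of this model

regress price mpg weight
display e(N)        // observations used by this model
display e(r2)       // R-squared of this model
```

The data set in memory is identical for both regressions; only the model changes, and with it the reported R-squared.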



                      • #12
                        In #3, I said you seemed to have both 5,296 observations and 5,296 variables because in #1 you said:
                        "How can it be that according to Stata I have 5296 observations, which is equal to the total number of columns..."
                        I interpreted that as Stata saying you had 5,296 observations, and you then saying that 5,296 is also the number of columns (which would, in turn, be the number of variables). I guess you meant to say that 5,296 is also the number of rows.



                        • #13
                          Some of the commentary at http://www.stata.com/statalist/archi.../msg01258.html remains pertinent.

                          It seems too obvious to state, but anyway: on Statalist, the terminology used in Stata is the one that we all have access to. That in no way rules out any other terminology, but typically you will need to explain what other terminology you are using; otherwise we are all lost.



                          • #14
                            Due to holidays I did not respond. But thanks for the help, it is more clear to me now!



                            • #15
                              Dear all,

                              I have a related - but simpler - question.

                              My dataset consists of firms reporting data for several years:

                              Code:
                              * Example generated by -dataex-. For more info, type help dataex
                              clear
                              input int(No Year) DV IV
                               1 2015      .         .
                               1 2016      .         .
                               1 2017      .         .
                               1 2018      .         .
                               1 2019 144851        .5
                               4 2014 213553  .3333333
                               4 2015      .         .
                               4 2016 178057  .4285714
                               4 2017 177342      .625
                               4 2018      .         .
                               4 2019 248836  .6666667
                               6 2015      .         .
                               6 2016 165502      .625
                               6 2017 220142      .625
                               6 2018      .         .
                               6 2019 279338        .4
                              12 2017      .         .
                              12 2018      .         .
                              12 2019 400365  .3636364
                              13 2019 365657 .53333336
                              end


                              When I run a regression in Stata, I see the number of observations (the number of firm-year observations) at the top right of the regression output.

                              How can I see the number of firms that my regression has been run on?

                              Thank you in advance for your help,
                              Best regards,
                              Jeanne

                              I am using Stata 16 for Mac.
                              Last edited by Jeanne Roche; 24 Feb 2022, 05:09.
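One possible way to get the number of firms is to tag one observation per firm within the estimation sample and count the tags. This is only a sketch under stated assumptions: the model is taken to be a plain regress of DV on IV (the actual model is not given in the post), the data are the example above, and firmtag is a hypothetical variable name.

```stata
* Example data copied from the post above
clear
input int(No Year) DV IV
 1 2015      .         .
 1 2016      .         .
 1 2017      .         .
 1 2018      .         .
 1 2019 144851        .5
 4 2014 213553  .3333333
 4 2015      .         .
 4 2016 178057  .4285714
 4 2017 177342      .625
 4 2018      .         .
 4 2019 248836  .6666667
 6 2015      .         .
 6 2016 165502      .625
 6 2017 220142      .625
 6 2018      .         .
 6 2019 279338        .4
12 2017      .         .
12 2018      .         .
12 2019 400365  .3636364
13 2019 365657 .53333336
end

regress DV IV                               // assumed model, for illustration only
display e(N)                                // firm-year observations used

* Tag exactly one observation per firm among those used in the regression
egen byte firmtag = tag(No) if e(sample)
count if firmtag                            // number of distinct firms used
```

With the example data this counts 10 firm-year observations and 5 firms. If the data are declared as a panel with xtset, panel estimators such as xtreg also report the number of groups in their output header.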
