  • Dividing panel data into quintiles

    Hello.

    I am working with a panel dataset of 3 years (2012, 2015, 2018). I am trying to see the effects of health shock on household consumption. So, the main regression includes consumption as the dependent variable. However, other regression models that I consider would include different dependent variables such as loans, assets, savings etc. I want to conduct the analysis in the overall sample but also want to do it for different quintiles.

    Concern: How to divide the panel data into quintiles based on the consumption expenditure. The panel is balanced.

    1) I ran this code:
    Code:
    xtset id year
    egen tag=tag(id)
    xtile group=total_consumption if tag, nq(3)
    bysort id (tag) : replace group = group[_N]
    I think the quintiles are then assigned based on 2012 consumption. Is this the right approach?

    2) I am not using xtqreg because I believe it creates quintiles based on the dependent variable. Please correct me if I am wrong. So, in the regression of savings on health, the quintiles would be based on savings, not on consumption (which is what I want).

    3) Would it be wise to estimate a poverty line with the CBN method on the 2012 baseline data, divide households into poor and nonpoor, and then construct the panel for each subsample? Advice appreciated.

  • #2
    Quantile terms are often used ambiguously. You're focusing on quintiles, but that is enough to illustrate.

    Historically, and to the present, quintiles are particular levels of a variable: values (or estimates of values) below which 20, 40, 60 and 80% of values fall, with the complementary percentage above. There is small print about exactly how you estimate those for any sample size. In practice, people will often work with the minimum and maximum too, and you may or may not want to regard those as quintiles as well.

    More recently, quintiles have often been used to refer to intervals, often called classes or bins. So the first or lowest quintile bin is the lowest 20% of the data, and so on.
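
    To make the two senses concrete, here is a minimal sketch, assuming the variable names total_consumption and year from #1; bin2012 is just a name made up for illustration. _pctile leaves the four cut points in r(), while xtile creates the bins.

    ```stata
    * quintiles as values: the 20, 40, 60, 80% points of 2012 consumption
    _pctile total_consumption if year == 2012, nquantiles(5)
    return list        // r(r1), r(r2), r(r3), r(r4)

    * quintiles as bins: a new variable taking values 1 to 5
    xtile bin2012 = total_consumption if year == 2012, nq(5)
    tabulate bin2012
    ```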

    Now your xtile calculation is a binning exercise. As the original author of tag() I know, and many Stata users know too, that it selects the first value in each group, so as you say, with your sort order it selects year 2012 for each panel. But I strongly recommend that your code shows explicitly that you're selecting year 2012, because many readers of your work won't know that. For example, I doubt it would be clear to someone who doesn't use Stata. Fact is, you're misusing tag() there: the point of tag() is that you have many observations identical on some variable but only want to use one, so it shouldn't matter which observation you choose. But here it does matter which you choose.
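
    One quick way to see which year tag() actually picked in your data is sketched below. The variable name picked is made up for illustration; id and year are as in #1.

    ```stata
    * tag one observation per household, then see which years were tagged
    egen picked = tag(id)
    tabulate year if picked
    ```

    If only 2012 appears in the table, the tagged observations are all from the baseline year. That is a check for yourself, though, not a substitute for writing if year == 2012 in the code.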

    Also, you're talking about quintiles, but nq(3) is binning by tertiles (or terciles)! If you really want tertiles, nq(3) is right. Otherwise, for quintiles you need nq(5).

    For an extensive menagerie of quantile terms, see https://journals.sagepub.com/doi/pdf...867X1601600413 -- updated at https://stats.stackexchange.com/ques.../235334#235334



    Code:
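    xtset id year
    * bin on 2012 consumption only; nq(3) gives tertile bins, nq(5) quintile bins
    xtile group=total_consumption if year == 2012, nq(3)
    * copy each household's 2012 bin to its 2015 and 2018 observations
    bysort id (year): replace group = group[1]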
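
    If quintile bins are what you want, the same idea with nq(5), plus a check that each household carries a single bin across all years, might look like the sketch below. It assumes a balanced panel in which 2012 is the first year for every id, as described in #1; quint is a made-up variable name.

    ```stata
    xtset id year
    * five bins based on baseline (2012) consumption
    xtile quint = total_consumption if year == 2012, nq(5)
    * spread the 2012 bin to the later years of the same household
    bysort id (year): replace quint = quint[1]
    * check: the bin must be constant within each household
    bysort id (quint): assert quint[1] == quint[_N]
    tabulate quint year
    ```

    The tabulate at the end should show the same count in each year within each bin, which is a quick sanity check on the balanced panel.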

    Why you need or want to bin this variable at all when it's a predictor is really the first question, so what is the answer to that?
