A study of the tax burden by revenu categories

Aziz Essouaied

Join Date: Apr 2020
Posts: 152

A study of the tax burden by revenu categories

19 Dec 2024, 08:34

Hello Stata people;

I have this interesting data I'am working with, it has more than 600 000 observations in it, this is just an example:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str10 Identifiant double(Revenu_Imposable_Année Impot_Payé CSS) float Impot_Payé_et_CSS
"231******4" 24015.379   3722.48 211.203  3933.683
"205******0" 42206.545  8049.537 384.897  8434.434
"007******1"  36889.06  6598.107 334.361  6932.468
"009******2" 26109.679  4106.506  230.06  4336.566
"006******6" 39681.416  7370.442 359.466  7729.908
"010******3" 14385.029  1880.258 124.535  2004.793
"009******4" 12904.748  1623.988 113.234  1737.222
"143******4" 14070.842   1953.15 126.703  2079.853
"160******3" 11686.995  1457.566 102.383  1559.949
"115******2" 13509.571  1740.774 117.664  1858.438
"009******0" 13506.436  1845.043 121.638  1966.681
"009******0" 29359.533  5910.079 273.161   6183.24
"162******2" 15564.943  2283.732 140.218   2423.95
"228******3" 24029.455  4440.872 220.135  4661.007
"232******8" 25140.046  4533.566 224.718  4758.284
"002******7" 14385.029  1932.484 126.537  2059.021
"236******8" 24805.308  3109.181 173.455  3282.636
"004******8" 21922.531   3914.52 196.863  4111.383
"005******2" 25267.761  4577.579 231.139  4808.718
"007******1" 30787.104  5402.482 278.369  5680.851
"009******9" 35680.101  6712.119 326.701   7038.82
"005******6" 42282.843  7775.964  373.81  8149.774
"005******8" 41347.412  7860.543 376.261  8236.804
"010******0"  43959.38  5926.819 309.657  6236.476
"006******0" 46051.024  9052.386 421.724   9474.11
"006******6" 41367.373  7893.934  379.64  8273.574
"009******7" 23090.514  3558.203  205.73  3763.933
"004******6"  43895.24   8121.98 396.342  8518.322
"127******4" 33677.781  4252.627 242.962  4495.589
"004******7" 30403.981  4989.779 267.871   5257.65
"008******9" 26020.927   4736.07 234.545  4970.615
"008******8" 23343.019  4119.053 211.959  4331.012
"007******7" 39376.535   6611.73 328.691  6940.421
"008******4"   23614.8  3826.891 212.709    4039.6
"004******2" 43191.675  8250.269 392.614  8642.883
"232******2" 22990.661  3628.187 204.146  3832.333
"009******8" 20275.437  3323.298 180.903  3504.201
"233******5" 56839.433 12804.454  533.65 13338.104
"232******3" 22727.017  3835.354 202.674  4038.028
"232******1" 21925.397  3617.998 194.584  3812.582
"213******9" 17463.539  2682.333  157.68  2840.013
"010******3" 17638.807  2658.298 155.039  2813.337
"148******8" 23433.331  4017.634 209.947  4227.581
"223******4" 24772.533   4394.86 222.125  4616.985
"006******8" 42997.801  8281.398 392.906  8674.304
"005******8" 42884.521  8596.703 397.428  8994.131
"233******6" 11704.479  1426.606  105.41  1532.016
"223******9" 23390.023  4093.186 212.557  4305.743
"239******0" 12352.238  1575.335 111.234  1686.569
"009******3" 26754.953  4394.735 240.887  4635.622
"005******2" 28581.334  4687.517 253.519  4941.036
"009******6" 29817.422  5134.272 266.464  5400.736
"005******1" 46083.094  9090.305 420.898  9511.203
"129******5" 33551.781  6048.223 305.119  6353.342
"229******5" 14695.325  2116.135 132.341  2248.476
"009******6" 23837.271  4377.248 218.176  4595.424
"009******5"  16364.11  1653.296 115.645  1768.941
"153******7" 14462.969   1991.49 127.238  2118.728
"228******2" 13629.119  1742.136 117.729  1859.865
"257******8" 13480.118  1839.657 121.391  1961.048
"005******4" 15225.171  2137.324 132.861  2270.185
"210******6" 25082.852  4476.786 225.914    4702.7
"151******0" 14436.699  1262.543  99.767   1362.31
"010******0" 14147.463   1880.67 122.424  2003.094
"129******1" 24736.987  4360.352 220.807  4581.159
"223******3" 21994.715  3639.911  193.53  3833.441
"006******8" 26076.943  4745.761 234.303  4980.064
"004******4" 29916.472   5022.54 269.455  5291.995
"009******7" 37155.938  6965.568 339.859  7305.427
"010******4"   26833.7  4293.662 237.569  4531.231
"006******2" 34362.458  6326.259 309.756  6636.015
"004******9" 41451.511  7818.549 379.498  8198.047
"006******7"  37614.57  6733.238 336.737  7069.975
"010******7" 21163.171  3342.307 186.214  3528.521
"003******3" 28564.741  5158.398 249.792   5408.19
"235******7" 32265.449  5490.512 287.684  5778.196
"005******6"  34556.51  6371.166 311.785  6682.951
"009******9" 30326.373  5182.706 269.171  5451.877
"156******6" 22364.445  3710.858 196.233  3907.091
"210******0" 22327.509  3758.067 197.902  3955.969
"223******2"  18349.26  2917.352 165.332  3082.684
"004******0" 24882.902  4332.289 222.378  4554.667
"231******0" 25433.153  4003.618 225.967  4229.585
"003******9" 41274.971  7725.777  374.41  8100.187
"008******9" 52445.719 11035.108  484.15 11519.258
"010******5" 28943.441   4778.66 255.657  5034.317
"128******6" 40466.357  7743.468 369.666  8113.134
"200******5" 32317.757  5590.943 291.578  5882.521
"202******1"  25738.39  4293.533 233.133  4526.666
"207******2" 28380.104  4778.526 255.473  5033.999
"212******0" 31892.922  5457.956 284.768  5742.724
"006******3"  44198.93  8458.905 402.056  8860.961
"229******6" 34875.191  6331.529 317.464  6648.993
"007******8" 41518.507  7780.463 376.004  8156.467
"006******8" 42724.921  6928.062 346.154  7274.216
"007******2" 40438.143  7585.999 366.281   7952.28
"249******8" 18711.021  2995.258 168.495  3163.753
"215******1" 24772.533  4507.275 226.131  4733.406
"009******2"  31029.76  5302.662 275.765  5578.427
"005******7" 42017.476  8030.985 381.912  8412.897
end

The first variable is the ID of the taxpayer, the second variable is the taxable base (the revenu of the individual that is taxed), the third variable is the tax that the individual paid, the fourth variable ("CSS") is a social contribution that is also paid by the individual, and the fifth variable is the sum of the tax and the social contribution that got paid by each individual (here, I did consider that the social contribution is also a form of tax).

The idea behind this is to draw a table that could show me the distribution of the tax burden by income levels of those people:
-First, the dataset should be sorted by the second variable (taxable revenu)
-Then, I do wanna divide that variable by percentiles and deciles, like, I want to have the richest 10%, the richest 1% the richest 0.1%, the middle class, the poorest 10%, the poorest 1%, the poorest 0.1%, ...
-Then, for each sub-group of the population that has been defined by the revenu level, I want to have the size of that sub-group (the number of individuals in that group)
-Then, I want to have another variable that gives me the average revenu for each sub-group before taxation (the taxable revenu), like, how much revenu does the average individual of each sub-group have
- Then, another variable that gives me the tax paid by the average individual of each revenu category (here, by tax paid, I mean tax+social contribution, so, the work will be done on the last variable "Impot_Payé_et_CSS")
-Then, I want to have another variable that shows the average tax rate that each average individual paid (so, it means we're gonna devide the taxation the average individual paid by the average revenu of each sub-group)
-Then, as a last variable, I would like to have the total of tax revenus for each sub-group (like, how much the government collected from the richest 1%, the richest 10%, the middle class, the poorest 10%, the poorest 1%, ...)

So, I guess I can describe this as a study of the tax burden on each social group, and those groups are being represented by the average individual.

Any help, please? With thanks!

Tags: None

Clyde Schechter

Join Date: Apr 2014

Posts: 29587
#2

19 Dec 2024, 10:05

Code:

assert !missing(Revenu_Imposable_Année) gen bracket = . centile Revenu_Imposable_Année, centile(0.1 1 10 90 99 99.9) local centiles `r(centiles)' forvalues i = 1/`r(n_cent)' { replace bracket = `:word `i' of `r(centiles)'' /// if Revenu_Imposable_Année <= `r(c_`i')' & missing(bracket) } replace bracket = 100 if missing(bracket) gen effective_tax_rate = Impot_Payé_et_CSS/Revenu_Imposable_Année egen id_num = group(Identifiant) collapse (count) id_num /// (mean) Revenu_Imposable_Année Impot_Payé Impot_Payé_et_CSS /// effective_tax_rate (sum) total_income = Revenu_Imposable_Année, /// by(bracket)

The value in the variable bracket represents the percentile that is the upper bound of percentile of taxable income for that group. That is, 0.1 means income between 0 and 0.1 percentile. 90 means income between the 10th and 90th percentiles (including the upper end).

Note: In the example data shown, there are too few observations to actually separate out the very small groups at both the .1 and 99.9 percentiles, so the results you get are incomplete. But this won't be a problem in the real data with 600,000 observations.

One statistical concern. Income distributions tend to be extremely skew right in the upper income nations. As such the mean income, mean tax paid, and mean tax rates may not be good statistics to represent the taxation of the 99th and 99.9th percentiles, and maybe not even of the 90th. You might want to consider the median instead or in addition.
Comment

Aziz Essouaied

Join Date: Apr 2020
Posts: 152

19 Dec 2024, 15:11

Clyde Schechter Thanks for the code, it worked for me the way I hoped

Yet, I do want to share the result of your great suggested code for the discussion, here it is:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input float bracket long id_num double(Revenu_Imposable_Année Impot_Payé) float(Impot_Payé_et_CSS effective_tax_rate) double total_income
  .1  5110                  0  4.077435616438356  4.129705         .                  0
   1  2057  76.09372046669887  8.603921730675738    9.0889 .11363309  156524.7829999996
  10 64483  6018.783379247911   504.087481398212  543.7804  .0795368  388109208.6440431
  20 71636 12578.301269361707 1557.6799134233906  1668.032 .13241956  901059189.7319952
  30 71643 15174.854091690893 2149.2185999888525 2283.0107  .1500061 1087172071.6910107
  40 71645 18679.618705505494  2957.629349445191 3124.3875 .16707763 1338301282.1559412
  50 71642  21176.39073151233 3510.3106058876706  3699.254 .17467006 1517118984.7870064
  60 71643 23148.617442680592 3958.7539126363436 4165.9224  .1799188 1658436399.4459658
  70 71644  24666.01823625069  4302.830463541882  4523.742 .18339533 1767172210.5179443
  80 71643 26345.765829989738  4659.670955669123  4895.864 .18581405 1887489701.3579547
  90 71643  30301.43032768052  5536.423340438025  5809.789 .19158627 2170885372.9660153
  99 64479  40046.09475557975  8177.019279920614  8544.137  .2117652 2582132143.7450266
99.9  6450  68446.26252186073  17444.68605240309 18099.973 .26361576  441478393.2660017
 100   714 113490.48091036422  23168.45102100842  24010.81 .23171183  81032203.37000005
end

In general, that's what I wanted to get as a result. Yet, there are some things I would like to discuss with you

I hope it doesn't bother you

First, in what comes to the symmetry of the distribution and the fact wether to used the average individual or the median individual for each group, I'm working on a lower-to-medium revenu country, so, on principle, the question of the symmetry is not there, but I would love it if you could give me some suggestion on how to study that for my distribution and to see wether to use the average person or the median person for each revenu group. I know about the skewness and the kurtosis figures, I just forgot how to measure them on Stata.

An other concern from the result I just posted: I guess there is a problem between the second bracket and the third one, since the (mean) effective_tax_rate goes down from 11% to 7%, which defies the logic since revenus are sorted and people pay taxes on those revenus, I guess I should put the first and the second brackets on one bracket of the poorest 1% (the poorest 0.1% bracket seems unnecessary I guess,and the second bracket of the poorest 10% are a small group)

I don't know if I could discuss this. Thanks

Comment

Clyde Schechter

Join Date: Apr 2014

Posts: 29587
#4

19 Dec 2024, 19:58

To get the skewness and kurtosis of a variable, run -summ variable_name, detail- and you will see them in the lower right quadrant of the output table Stata gives you. That output table will also contain the mean and median. Myself, I generally make judgments about whether to show means or medians more by how different the mean and median are than by the skewness. If the median is much larger than the mean, then the mean may be misleading if it is intended to represent a "typical" observation.

As for what happened between the second and third brackets, the numbers seem right. In the third bracket, the total income is about 10 times that in the second bracket, but the total tax paid is only bout three times as large, so the tax rate did, indeed, go down. Now, in the US this result would not be terribly surprising: we are well known to have a regressive taxation system when total taxes are taken into account. If, however, this result seems surprising for the country you are studying, I would suggest checking the raw data for a mistake. Maybe somebody' income or taxes are entered incorrectly, say with a misplaced decimal place and wildly throwing off the numbers. Combining the two tax brackets might be a last resort if you have carefully verified the original data and are 100% sure they are right. But that just covers up the anomaly, and, frankly, it would probably better to just report the results as you found them. If I had to make a small bet, I would bet on a data error here.
Comment
Aziz Essouaied

Join Date: Apr 2020

Posts: 152
#5

21 Dec 2024, 04:43

Clyde Schechter Thanks for the explanation. I guess, in what comes to the first point you've explained, I did the -sum variable_name, detail- for the four quantitative variables "Revenu_Imposable_Année", "Impot_Payé", "CSS" and "Impot_Payé_et_CSS", I didn't find that big of a gap between the mean and the median for each one, for the first variable, the difference is about 64.5, for the second variable, it is about -158, for the third variable it is about -0.5, and for the fourth variable, it is about -158.2, so I guess the distribution is kind of symmetrical. Still, I would like to draw the graph of the distribution.

As for the second point you've mentionned, then I guess what you are saying is that the people in the third bracket have more income than those of the second bracket, and logically, people in the third bracket would pay more tax, just not as more as the revenu difference between them and the people in the second bracket... That's what I understood from your explanation, I hope I'm getting that right, I'm guessing that's why the rate went down from 11% to 7% I guess. But I was kind of surprised by that because logically, tax rates should go up with revenus, like, the more revenus people have, the more they get taxed.

Again, thanks for your responses and explanation
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 29587
#6

21 Dec 2024, 09:03

You have understood my response correctly.

But I was kind of surprised by that because logically, tax rates should go up with revenus, like, the more revenus people have, the more they get taxed.

Well, that may be logical, but in many jurisdictions it is untrue. Still, if you think it should be true for the jurisdiction you are studying, I again encourage you to re-check the data for errors. A misplaced decimal point in just one observation of revenue or tax paid could produce this kind of result.
Comment
Aziz Essouaied

Join Date: Apr 2020

Posts: 152
#7

21 Dec 2024, 09:31

Clyde Schechter Thanks for the explanation, I guess what you've said makes sense. Is it possible to draw the grapg of the distribution anyway? I've tried that with the -kdensity variable_name, nroam- command, but I just want to make sure if it is the right command for that.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 29587
#8

21 Dec 2024, 11:14

-kdensity- is a good way to visualize the distribution of a continuous variable.

I am not familiar with the -nroam- option and find no mention of it in the help file. Perhaps you meant the -normal- option, which superimposes a normal distribution on the graph. If so, there is no harm in that, but I also doubt it will prove helpful to you. It would surprise me if the bracket-specific distribution of any of those variables will be anything like normal. But if you think that's a real possibility, and if you are specifically interested in comparing the distribution to a normal distribution, -qnorm- would be a better tool for that task.
1 like
Comment
Aziz Essouaied

Join Date: Apr 2020

Posts: 152
#9

21 Dec 2024, 14:38

Clyde Schechter Yes exactly, I meant the -normal- option, I guess I've mistyped that, sorry!.

I just wanted to see the distribution graphically so I could judge whether it's symmetrical or not, but then again as you said before, if the gap between the mean and the median is not that big (which is the case of my data), then one can decide that the distribution is symmetrical, and therefore use the mean individual to represent the sub-groups. I will try the -qnorm- output and see the result.

Thanks for the help!
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 29587
#10

21 Dec 2024, 15:27

To be clear, -qnorm- is a separate command, not an option to -kdensity-.
Comment
Aziz Essouaied

Join Date: Apr 2020

Posts: 152
#11

22 Dec 2024, 11:48

Clyde Schechter Thanks for the help anyway But, if I may go back to the main code of the cnetiles, is it possible to define the sub-groups like this: percentiles(1 10 20 40 60 80 90 95 99 99.9) and then give them labels like 1 "Bottom Centile" 2 "Bottom Decile" 3 "Bottom Quintile" 4 "Second Quintile" 5 "Middle Quintile" 6 "Fourth Quintile" 7 ">80th to 90th Percentile" 8 "> 90th to 95th Percentile" 9 "> 95th to 99th Percentile" 10 "> 99th Percentile" ? I could be a bit wrong with some of the labes, but I want the names of sub-groups to appear like that, I guess it would be better for understanding the whole table.
Comment

Clyde Schechter

Join Date: Apr 2014
Posts: 29587

#12

22 Dec 2024, 12:03

Yes, this is easily done:

Code:

assert !missing(Revenu_Imposable_Année)
centile Revenu_Imposable_Année, centile(1 10 20 40 60 80 90 95 99 99.9)
label define bracket     1 "Bottom Centile" ///
                        2 "Bottom Decile" ///
                        3 "Bottom Quintile" ///
                        4 "Second Quintile" ///
                        5 "Middle Quintile" ///
                        6 "Fourth Quintile" ///
                        7 ">80th to 90th Percentile" ///
                        8 "> 90th to 95th Percentile" ///
                        9 "> 95th to 99th Percentile" ///
                        10 "> 99th to 99.9th Percentile" ///
                        11 "> 99.9th Percentile"
gen bracket:bracket = .
local centiles `r(centiles)'
// gen bracket = .
forvalues i = 1/`r(n_cent)' {
    replace bracket = `i' ///
        if Revenu_Imposable_Année <= `r(c_`i')' & missing(bracket)
}
replace bracket = `r(n_cent)' + 1 if missing(bracket)

gen effective_tax_rate = Impot_Payé_et_CSS/Revenu_Imposable_Année
egen id_num = group(Identifiant)

collapse (count) id_num ///
    (mean) Revenu_Imposable_Année Impot_Payé Impot_Payé_et_CSS ///
    effective_tax_rate (sum) total_income = Revenu_Imposable_Année, ///
    by(bracket)

Changes to earlier code shown in bold face. Instead of having the bracket variable show the percentile at which it cuts off, which is sometimes not an integer, we just have the bracket variable count up the brackets from 1 on, and apply a standard Stata value label.

Comment

Aziz Essouaied

Join Date: Apr 2020

Posts: 152
#13

Yesterday, 03:32

Clyde Schechter Thanks, that really helped me I guess I should leave the 11th sub-group of the 99.9th Percentile and not show it, the number of the individuals in that sub-broup is so small (about 70 people) that it makes that sub-group kid of unnecessary to show in the study, so I have to stop with the richest 1% after all.
Again, thanks for the help!
Comment

Announcement

A study of the tax burden by revenu categories

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment