Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Generate table to describe distribution of a variable that is below certain percentile by year

    Hello all,


    I would like to generate a table (and export into excel file) that describes wage (and number of employees) distribution by year that are below 5% (percentile) of variable number of employees (and wage, respectively).
    I used sum L, d (for entire percentile) before but now i would like to see them only for distribution below 5p by year.

    Could someone help me with this?

    Below is the example of my dataset:
    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input long ID int(year L) float wage
    1 2011   3  2715.333
    1 2012  10    5961.2
    1 2013  12  3171.083
    1 2014  12  3432.667
    1 2015  19      4242
    1 2016  20    7136.9
    1 2017  38      5947
    1 2018  50   6130.42
    1 2019  58  6073.069
    1 2020  73  6129.589
    1  2021  61  9232.901
    2  2002   3  4998.333
    2 2003   2      1386
    2 2010   3      1824
    2 2011   8  4173.375
    2 2012   3  4966.333
    3 2012   1        99
    3 2013   1       117
    3 2014   1       581
    3 2015   1       119
    3 2016   1       977
    3 2020   1       189
    4 2014   1      3051
    4 2015   1        76
    4 2016   1      7530
    4 2017   1      7710
    4 2018   1       970
    5 2005   8  2571.875
    5 2006  15  4940.667
    5 2007  17  5842.765
    5 2008  21  5827.571
    5 2009  14  6355.571
    5 2010  12 4302.6665
    5 2011  11  6499.818
    5 2012   9  5910.111
    5 2013  10    5597.9
    5 2014  10    6569.5
    5 2015   7  8727.571
    5 2016   9  7385.667
    5 2017  10      7295
    5 2018   9  8915.889
    5 2019   9  9571.556
    5 2020   9  8095.444
    5 2021   9  8948.444
    6 2015   1       105
    6 2016   2    1383.5
    6 2017   3 1142.6666
    6 2018   4     924.5
    6 2019   2      1680
    6 2020   1      2910
    6 2021   1      5605
    7 2001   2     550.5
    7 2002   2       734
    8  2007   2    6558.5
    8 2008   4    4433.5
    8  2009   5    4169.8
    8 2010   9  4644.778
    8 2011  17  4404.294
    8 2012  17  5559.588
    8 2013  23  7158.304
    8 2014  25   9421.92
    8 2015  29  12231.76
    8 2016  26 16147.692
    8 2017  32 17542.313
    8 2018  35 18273.943
    8 2019  36 19041.027
    8 2020  31  20974.87
    8 2021  30 24479.166
    9 2011   5    8114.4
    9 2012  31  9406.839
    9 2013  57  15195.07
    9 2014  78  27598.09
    9 2015 108  31650.85
    9 2016 142 36375.586
    9 2017 243 36786.543
    9 2018 350  35074.22
    9 2019 356  42939.02
    9 2020 417   44790.5
    9 2021 459  67097.58
    11 2011   2      2421
    11 2012   3      5874
    11 2013   4   6029.75
    11 2014   6  6395.167
    11  2015   8      7362
    11 2016   8   11283.5
    11 2017   6 15205.333
    11 2018   6 14488.333
    11 2019   4   13614.5
    11 2020   6 20224.166
    11 2021   7  19604.86
    33 2010   1       241
    33 2011   1       522
    33 2012   2      2022
    44 2000  18 1987.9445
    44 2001  22 1977.1364
    44 2002  25   1920.56
    44 2003  27  2597.963
    44 2004  26  2471.923
    44 2005  34  3146.971
    44 2006  31  4138.871
    end


    ++ for the above data example, I erased some variables, but someone could help me with exporting a table (to excel file) that describes summary statistics of variables of interest (like L, wage here) by percentile & year , please?


    Thank you very much in advance
    Last edited by Anne-Claire Jo; 06 Apr 2025, 11:02.

  • #2
    summ Y , d
    summ X if Y<= r(p5)

    Comment


    • #3
      thanks for your help! how about by year? and exporting the table into excel

      Comment


      • #4
        Code:
        foreach y of global yr {
            summ wage if year == `y', detail
            local p5 = r(p5) 
            
            summ $L if wage <= `p5' & year == `y', detail
            
        
            putdocx paragraph, style(Heading1) text("Summary Statistics for Year `y'")
            
            putdocx table stats_`y' = matrix(r(table)) 
            
            // Add the 5th percentile of wage for the year
            putdocx paragraph, text("5th Percentile of Wage: `p5'")
        }
        
        putdocx save "summary_stats.docx", replace
        It seems these commands are not working for what I intend to do

        Comment


        • #5
          Code:
          levelsof ${yr}, local(years)
          foreach y of local years {
              summ $L if wage <= r(p5) & ${yr} == `y', d
              }
          From the above commands, I have :
          Click image for larger version

Name:	Capture.PNG
Views:	1
Size:	17.4 KB
ID:	1775513

          These output tables but I couldn't export them into excel/word file. Could someone help me with this issue?

          Comment


          • #6
            Anne-Claire Jo do you need all of these statistics, or only some? If so, which ones?

            Comment


            • #7
              Hemanshu Kumar I need quantiles and nb of obs and mean (each year)

              Comment


              • #8
                Which quantiles do you need, and which version of Stata are you on?

                Comment


                • #9
                  Im using stata MP18 and for quantiles, I would preferably need all of them (if possible). If not, maybe 25 50 75 90 95 99..
                  And just one additional question, besides this output table, I would also like to export summary statistics of some variables in the dataset to excel/word file but even outreg2 with keep option is not working (and cannot export the table). May i ask which commands/code should i use ? (Maybe it's also in line with the original question)

                  Comment


                  • #10
                    Also, one other clarification: when you want to subset your data to the bottom 5th percentile, do you want to do that percentile calculation year-by-year, or overall? That is, do you want statistics for the people in the bottom 5th percentile of THAT year, or in the overall 5th percentile (combining data from all years)?

                    Comment


                    • #11
                      I want to have them each year (year by year). Like for 2021, 2020, 2019 etc..

                      Comment


                      • #12
                        Consider this code:

                        Code:
                        collect clear
                        levelsof year, local(years)
                        foreach y of local years {
                            summ L if year == `y', detail
                            local bottom_fifth = r(p5)
                            dtable if year == `y' &  L <= `bottom_fifth', ///
                                sample( , stat(freq)) ///
                                cont(L, stat(mean p1 p5 p10 p25 p50 p75 p90 p95 p99)) ///
                                name(yr_`y')
                            collect addtag Years[`y']
                            local collections `collections' yr_`y'
                            }
                        
                        collect combine comb = `collections'
                        collect label levels result frequency "N", modify
                        foreach n of numlist 1 5 10 25 50 75 90 95 99 {
                            collect remap result = percentile, fortags(result[p`n'])
                            collect label levels percentile p`n' "`n'%", modify
                        }
                        collect label dim percentile "Percentiles", modify
                        collect style header percentile, title(label)
                        collect style header result, title(hide) level(label)
                        collect style cell result[frequency] percentile, nformat(%3.0f)
                        collect style cell result[mean], nformat(%5.2f)
                        collect style cell percentile#cell_type[column-header], border(bottom, color(black))
                        collect layout (Years) (result[frequency mean] percentile[p1 p5 p10 p25 p50 p75 p90 p95 p99])
                        collect title "Distribution of the bottom 5% of L in each year"
                        collect export mystats.xlsx, replace
                        which produces, for your example data:
                        Code:
                        . collect preview
                        
                        Distribution of the bottom 5% of L in each year
                        ----------------------------------------------
                             N  Mean            Percentiles          
                                     ---------------------------------
                                     1% 5% 10% 25% 50% 75% 90% 95% 99%
                        ----------------------------------------------
                        2000 1 18.00 18 18  18  18  18  18  18  18  18
                        2001 1  2.00  2  2   2   2   2   2   2   2   2
                        2002 1  2.00  2  2   2   2   2   2   2   2   2
                        2003 1  2.00  2  2   2   2   2   2   2   2   2
                        2004 1 26.00 26 26  26  26  26  26  26  26  26
                        2005 1  8.00  8  8   8   8   8   8   8   8   8
                        2006 1 15.00 15 15  15  15  15  15  15  15  15
                        2007 1  2.00  2  2   2   2   2   2   2   2   2
                        2008 1  4.00  4  4   4   4   4   4   4   4   4
                        2009 1  5.00  5  5   5   5   5   5   5   5   5
                        2010 1  1.00  1  1   1   1   1   1   1   1   1
                        2011 1  1.00  1  1   1   1   1   1   1   1   1
                        2012 1  1.00  1  1   1   1   1   1   1   1   1
                        2013 1  1.00  1  1   1   1   1   1   1   1   1
                        2014 2  1.00  1  1   1   1   1   1   1   1   1
                        2015 3  1.00  1  1   1   1   1   1   1   1   1
                        2016 2  1.00  1  1   1   1   1   1   1   1   1
                        2017 1  1.00  1  1   1   1   1   1   1   1   1
                        2018 1  1.00  1  1   1   1   1   1   1   1   1
                        2019 1  2.00  2  2   2   2   2   2   2   2   2
                        2020 2  1.00  1  1   1   1   1   1   1   1   1
                        2021 1  1.00  1  1   1   1   1   1   1   1   1
                        ----------------------------------------------
                        (The reason I asked for your Stata version is that the dtable command above is only available from Stata 18 onwards, and the collect suite of commands is only available from Stata 17 onwards)
                        Last edited by Hemanshu Kumar; 08 Apr 2025, 03:49.

                        Comment


                        • #13
                          Separately, I would just like to warn you that your code in #5 is not doing what you expect. r(p5) is a scalar returned by a call to summ <var>, d, and it is transient, i.e. it is replaced when a new summ command (or any other command that returns results) is issued.

                          So when the loop in #5 first encounters the line
                          Code:
                          summ $L if wage <= r(p5) & ${yr} == `y', d
                          it uses the 5th percentile from whatever call to summ you last had before this loop was executed. In subsequent iterations of the loop, it will use the 5th percentile from the previous iteration of the loop, and thus for 2001, it will use the 5th percentile of 2000, and so on.

                          Comment


                          • #14
                            Extending #13, consider this code.


                            .
                            Code:
                             sysuse auto, clear
                            (1978 automobile data)
                            
                            . su mpg if mpg < r(p5)
                            
                                Variable |        Obs        Mean    Std. dev.       Min        Max
                            -------------+---------------------------------------------------------
                                     mpg |         74     21.2973    5.785503         12         41
                            Here there was no previous summarize but the command runs without complaint.

                            it is not an error to refer to r-class results that are not defined.

                            Stata just substitutes missing. In turn that may be perfectly legal in its consequences. Here the qualifier is evaluated as

                            Code:
                            if mpg < .
                            which doesn't bite as the observations summarized are those that would be summarized any way.

                            This is just @Hemanshu Kumar's point all over again. If you refer to something like r(p5) it is the coder's responsibility to ensure that said result has been calculated for whatever subset is relevant.
                            Last edited by Nick Cox; 08 Apr 2025, 04:15.

                            Comment


                            • #15
                              Hemanshu Kumar and Nick Cox Thanks for your help, I really appreciate it a lot.
                              However, what i intend to do is to have output table of L distribution each year when wage is below p5.
                              If i change (from your code):
                              Code:
                              collect clear
                              levelsof year, local(years)
                              foreach y of local years {
                                  summ L if year == `y', detail
                                  local bottom_fifth = r(p5)
                                  dtable if year == `y' &  wage <= `bottom_fifth', ///
                                      sample( , stat(freq)) ///
                                      cont(L, stat(mean p1 p5 p10 p25 p50 p75 p90 p95 p99)) ///
                                      name(yr_`y')
                                  collect addtag Years[`y']
                                  local collections `collections' yr_`y'
                                  }
                              
                              collect combine comb = `collections'
                              collect label levels result frequency "N", modify
                              foreach n of numlist 1 5 10 25 50 75 90 95 99 {
                                  collect remap result = percentile, fortags(result[p`n'])
                                  collect label levels percentile p`n' "`n'%", modify
                              }
                              collect label dim percentile "Percentiles", modify
                              collect style header percentile, title(label)
                              collect style header result, title(hide) level(label)
                              collect style cell result[frequency] percentile, nformat(%3.0f)
                              collect style cell result[mean], nformat(%5.2f)
                              collect style cell percentile#cell_type[column-header], border(bottom, color(black))
                              collect layout (Years) (result[frequency mean] percentile[p1 p5 p10 p25 p50 p75 p90 p95 p99])
                              collect title "Distribution of the bottom 5% of L in each year"
                              collect export mystats.xlsx, replace
                              does it make the one that I want?

                              And I see that sometime there are no observations, so I put something like:
                              Code:
                              if r(N)>0 {
                              summ $L if year == `y', detail
                                  local bottom_fifth = r(p5)
                                  dtable if year == `y' &  wage <= `bottom_fifth', ///
                                      sample( , stat(freq)) ///
                                      cont($L, stat(mean p1 p5 p10 p25 p50 p75 p90 p95 p99)) ///
                                      name(yr_`y')
                                  collect addtag Years[`y']
                                  local collections `collections' yr_`y'
                                  }
                              }
                              But it's not working at all (still says 'no observations').


                              p.s. I also searched for the commands/code to export summary statistics (percentiles as well) of some variables into excel file but even using asdoc or outreg2 (
                              Code:
                              outreg2 using results, word replace sum(log) eqkeep(N mean sd) keep(${vars_to_check})
                              ) didn't work. Do you have any suggestion for this, please?
                              Last edited by Anne-Claire Jo; 08 Apr 2025, 08:40.

                              Comment

                              Working...
                              X