Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • How to correctly utilize the "_n" & "_N" notations/functions to calculate annual growth?

    Greetings Everyone,

    I hope you all are doing well. This is Jay and I am a grad student. I have been using STATA for quite a while now. But for two weeks I have been a little confused about a problem, which has been well discussed here on STATALIST regarding "annual growth rates". All the questions which have been previously answered were very helpful and I have used them in my actual work (Thank you for that).

    Data Overview:
    This is an educational dataset which contains 23 variables and 397,000 observations. Among these multiple variables the key variables are the year, student_grade, and standardized_math_score. The "year" variable is like any other time variable. The range of year [2013-2017]. The student grade level is from 0 - 12. Finally, the standardized test score variable for math is basically z-scores. range from -3 to +3 SD.

    Problem/Issue:

    Currently, my problem is related to growth rates in terms of standardized test scores. I simply want to calculate the growth for each year from 2013 to 2017 by grade level and year. I am confused because this is a hierarchical dataset. This means for each year (2013-2017) there are 0-12 grade levels and for each grade level, there is a standardized math score and this is all is nested within different schools and then students within those schools (This doesn't matter, because I am only interested in the following variables mentioned above). I am not sure that how can I appropriately utilize the _n, _n-1, and _N notations OR maybe using _n, n-1, and _N syntax/notations is not the right approach (I am not sure).

    Current Status:
    I have tried to use the information available from previous posted answers regarding calculating growth rates. This is what I have done so far, using the following codes:


    Code:
    # delimit ;
    bysort std_math_score (year): gen t1= std_math_score - std_math_score[_n-1] / (( std_math_score[_n-1]));
    #delimit cr
    Then to report my results, I used the following code to generate a table.
    Code:
    tabstat t1, statistics( mean ) by(school_year) nototal format(%9.5g)
    This is what I get in terms of annual growth.
    Code:
    school_year | mean
    ------------+----------
    2013 | -.99331
    2014 | -1.0025
    2015 | -1.0028
    2016 | -1.0143
    2017 | -1.0118
    -----------------------

    A glimpse of the dataset for the three key variables.
    Code:
    year student_grade std_math_score
    2013  3                  .1
    2013  6 -1.7000000000000002
    2013 10  2.8000000000000003
    2013  6                  .1
    2013  9 -.30000000000000004
    2013  4                 -.2
    2013 10                 -.5
    2013  5                  .5
    2013  2                -1.5
    2013  9 -.30000000000000004
    2013  9 -1.7000000000000002
    2013  9                  .1
    2013  4                 1.8
    2013  6  1.7000000000000002
    2013  4  -.7000000000000001
    2013  9                   0
    2013  8                 1.8
    2013  8                   0
    2013  6  .30000000000000004
    2013  2   .7000000000000001
    2013  9                   0
    2013  4                 2.2
    2013  4                 -.8
    2013  8                 -.1
    If you think the problem is not stated clearly please let me know, I will try to do my best to make it clear.

    I would greatly appreciate your help and suggestion regarding this matter.

    Regards,
    Jay

    Stata Version - 15.1
    Last edited by Jay Adil; 12 Aug 2018, 03:54.

  • #2
    Executive summary:

    For us to suggest correct code, you need to tell us about the identifier variables within your dataset.

    Meanwhile some confusions can be explained.


    Let's back up and look at an example where growth rates make some sense. In the Grunfeld data we can look at changes in mvalue from year to year for each company:

    Code:
    . webuse grunfeld, clear
    
    . bysort company (year) : gen pc_change = 100 * (mvalue - mvalue[_n-1]) / mvalue[_n-1]
    (10 missing values generated)
    
    . l company year mvalue pc_change if year < 1940 & company < 3, sepby(company)
    
         +-------------------------------------+
         | company   year   mvalue   pc_change |
         |-------------------------------------|
      1. |       1   1935   3078.5           . |
      2. |       1   1936   4661.7    51.42765 |
      3. |       1   1937   5387.1    15.56084 |
      4. |       1   1938   2792.2   -48.16877 |
      5. |       1   1939   4313.2    54.47319 |
         |-------------------------------------|
     21. |       2   1935   1362.4           . |
     22. |       2   1936   1807.1    32.64092 |
     23. |       2   1937   2676.3    48.09917 |
     24. |       2   1938   1801.9   -32.67197 |
     25. |       2   1939   1957.3    8.624231 |
         +-------------------------------------+
    That's a standard calculation:

    1. Compare within groups independently defined. Here we can compare results for each company.

    2. There is a missing value at the start of each panel, because we don't know the previous value.

    3. The factor of 100 is conventional but also optional.

    4. There is a tacit assumption of regularly spaced data. We are assuming that the time differences are constant from observation to observation, which is correct for this dataset.

    Let's compare with what you did. (Semi-colons as command delimiters are perfectly legal but minority practice; I will ignore them here.) Your RHS was

    Code:
    std_math_score - std_math_score[_n-1] / (( std_math_score[_n-1]))
    But that is just

    Code:
    std_math_score - 1
    I imagine that you intended something more like

    Code:
    (std_math_score - std_math_score[_n-1]) / std_math_score[_n-1]
    but when that is fixed your calculation still produces useless results. You didn't actually show any of your results, but the implication of your question is that you are puzzled by your results.

    Let's show that with an analogue from the Grunfeld data.

    Code:
    . bysort mvalue (year) : gen puzzle = 100 * (mvalue - mvalue[_n-1]) / mvalue
    (198 missing values generated)
    
    . l company year mvalue puzzle if puzzle < . | puzzle[_n+1] < .
    
    +----------------------------------+
    | company year mvalue puzzle |
    |----------------------------------|
    33. | 5 1938 156.7 . |
    34. | 7 1946 156.7 0 |
    67. | 6 1941 276.9 . |
    68. | 9 1949 276.9 0 |
    +----------------------------------+
    The sort

    Code:
    bysort mvalue
    defines a grouping according to the different (distinct) values mvalue takes on. Within each group mvalue is constant. For the Grunfeld data, there are 198 distinct values of mvalue. 196 occur just once and the other 2 occur twice each.

    Code:
    . duplicates report mvalue
    
    Duplicates in terms of mvalue
    
    --------------------------------------
       copies | observations       surplus
    ----------+---------------------------
            1 |          196             0
            2 |            4             2
    --------------------------------------
    
    . duplicates list mvalue
    
    Duplicates in terms of mvalue
    
      +------------------------+
      | group:   obs:   mvalue |
      |------------------------|
      |      1     84    156.7 |
      |      1    132    156.7 |
      |      2    107    276.9 |
      |      2    175    276.9 |
      +------------------------+
    For the singleton groups, there is no "previous" value to refer to. For the duplicated values, it's sufficient to look at the first pair to see the fallacy. You are comparing 156.7 (for company 5 in 1938 and company 7 in 1943): the difference is zero and the percent change is zero but the calculation serves no point any way.

    All that said, I still have no clear idea of what you are trying to do. If you're looking at individual students, then your by: prefix must do that job. In a large dataset I would expect mess and I don't think I would assume no gaps.

    The readership of Statalist is international and interdisciplinary. For example, I am English and Welsh and a geographer. So, you shouldn't assume that your readers are based in or around New York and familiar with your kind of education data. it's not even 100% clear what student_grade means. (Even in the US grade appears sometimes to mean marks or scores achieved and sometimes to mean educational level. What I see in your data are different grades all mixed up.

    Code:
    year student_grade std_math_score
    2013  3                  .1
    2013  6 -1.7000000000000002
    2013 10  2.8000000000000003
    So that's a puzzling sort order.

    You haven't got a hope of analysing your dataset properly without using your identifiers for students and schools as well. You are telling us your data are hierarchical but giving us no detail there. (You say "This doesn't matter" but it really does.)

    Yet more:

    I can't see that % or proportional change in standardized score is an improvement on differences.

    Finally, to answer the question: I don't know why you are asking about _N rather than _n. _N has no role here at all. I guess that you're guessing along these lines: I used _n but the results seem dopey, so should I be using _N instead? Old advice as at http://xyproblem.info/ applies here. :

    https://www.statalist.org/forums/help#spelling

    Last edited by Nick Cox; 12 Aug 2018, 05:32.

    Comment


    • #3
      Thank you, Nick, for your time and explanation. I really appreciate your help.

      I am sorry that I couldn't clearly state my question. Actually, I just want to know that what is the correct approach for calculating annual growth.

      Several different posts on statalist and some other forums have suggested the use of _n or _n-1 notations/code when it comes to growth rates. Based on those suggestions and my understanding I tried to use it. But I realized something is wrong with my code and I need to get help or assistance with the correct approach and code. That is why I said in my previous post that I am not sure about utilizing the _n and _n-1 notations. Now, from your explanation, it seems that it has no use in the context of current data.

      Below is the data with identifier variables:

      sch_code = School Code

      sid = Student Identification Number

      grade= Student Class/Grade - Example: Kindergarten, 1st Standard, 2nd Standard...12th Standard/Class.

      std_score_m= Standardized Math Score

      t1= resutls obtained from the following code:
      Code:
      bysort s_grade_level school_year: gen t1= std_score_m - std_score_m[_n-1] / (( std_score_m[_n-1]))
      Code:
      year sch_code sid grade    std_score_m   t1
      2013  122 177580 0                 -.4    .
      2013   21 192453 0                 -.1 -1.1
      2013  424 224362 0  -.7000000000000001 -1.7
      2013 6338 142427 0                 1.8   .8
      2013  227 144998 0                  -2   -3
      2013  232 193638 0                 -.4 -1.4
      2013  227 162968 0                -1.1 -2.1
      2013  230  25503 0                 -.4 -1.4
      2013   56 126670 0   .7000000000000001  -.3
      2013   87  34878 0  .30000000000000004  -.7
      2013  180  34448 0   .7000000000000001  -.3
      2013  373 215270 0   .6000000000000001  -.4
      2013   84 180444 0  -.7000000000000001 -1.7
      2013  391    130 0                 -.1 -1.1
      2013  162 107625 0  1.2000000000000002   .2
      2013  109 107835 0                  .2  -.8
      2013   21 145784 0   .7000000000000001  -.3
      2013  373  24961 0  .30000000000000004  -.7
      2013   52 210028 0                 -.5 -1.5
      2013  252 212491 0                 -.4 -1.4
      2013  252 224271 0                   0   -1
      2013  244 129370 0                  .2    .
      2013 6072  14731 0  -.6000000000000001 -1.6
      2013  244  13605 0                -1.3 -2.3
      2013  336 224054 0  .30000000000000004  -.7
      2013   15  95000 0                  .8  -.2
      2013   15 198166 0                  .2  -.8
      2013   52 223393 0  .30000000000000004  -.7
      2013  209  26606 0                  .9  -.1
      2013   39 132014 0                 -.2 -1.2
      2013    6 179639 0                  .8  -.2
      2013  125  16342 0  1.2000000000000002   .2
      2013  336  97213 0                  .8  -.2
      2013  179  61879 0                  .5  -.5
      2013  210 208695 0                  .2  -.8
      2013   21  63244 0                 1.1   .1
      2013   84  39747 0  .30000000000000004  -.7
      2013  152 159035 0                 -.4 -1.4
      2013  101 116299 0                 -.4 -1.4
      2013  288  37491 0                  .4  -.6
      2013  130  26444 0                 -.1 -1.1
      2013  355   7554 0                 -.4 -1.4
      2013  208  67716 0                  .4  -.6
      2013  122 101405 0   .6000000000000001  -.4
      2013  389 161092 0  -.7000000000000001 -1.7
      2013   38   8005 0                  .5  -.5
      2013   20  97481 0  -.6000000000000001 -1.6
      2013  199 217487 0 -.30000000000000004 -1.3
      2013   21  20251 0                 -.2 -1.2
      2013 6334 203655 0  -.6000000000000001 -1.6
      2013  230 146063 0                   1    0
      2013 6338  78128 0  1.7000000000000002   .7
      2013  424  41421 0  1.2000000000000002   .2
      2013  399  92702 0                 -.4 -1.4
      2013  101  35703 0                -1.3 -2.3
      2013   94 117216 0                  .4  -.6
      2013  336  83145 0                  .1  -.9
      2013  285  88966 0   .7000000000000001  -.3
      2013  264  22277 0 -1.4000000000000001 -2.4
      2013  127  92512 0  1.4000000000000001   .4
      2013  373  78571 0                  .8  -.2
      2013   92  50068 0                   0   -1
      2013  373 223158 0                 -.9    .
      2013  230 135275 0                  .8  -.2
      2013   52 146350 0                 -.2 -1.2
      2013  309 192630 0                 1.6   .6
      2013   64 224110 0                -1.5 -2.5
      2013  127  69782 0                   0   -1
      2013  404 218548 0                  .8    .
      2013  378  98726 0                  .1  -.9
      2013  407  44391 0                 -.5 -1.5
      2013  209  97390 0                 -.1 -1.1
      2013  309   4956 0   .7000000000000001  -.3
      2013  102 218257 0                 -.5 -1.5
      2013  173 149257 0  -.6000000000000001 -1.6
      2013   52 189369 0                 -.5 -1.5
      2013  199 138425 0  1.4000000000000001   .4
      2013  391 190481 0  -.6000000000000001 -1.6
      2013  378  14671 0                  .9  -.1
      2013   15  45332 0   .6000000000000001  -.4
      2013   84  36939 0   .6000000000000001  -.4
      2013   65 134894 0   .7000000000000001  -.3
      2013   96 115553 0  -.6000000000000001 -1.6
      2013   67 217766 0                 -.4 -1.4
      2013  180 173144 0   .7000000000000001  -.3
      2013  169 138203 0 -1.9000000000000001 -2.9
      2013  355  17210 0  2.4000000000000004  1.4
      2013  347  31763 0                   0   -1
      2013  373 115047 0                 -.4    .
      2013  415 118060 0                -1.5 -2.5
      2013  169 180810 0                  .1  -.9
      2013  355 223715 0                 1.6   .6
      2013   54  48684 0                  .4  -.6
      2013  205 150157 0                  .8  -.2
      2013  362 142491 0                  .5  -.5
      2013  101 112576 0                  -1   -2
      2013  103  75545 0                  .8  -.2
      2013  395 135745 0                 -.2 -1.2
      2013  355 183672 0                 2.2  1.2
      2013  179  47457 0 -1.4000000000000001 -2.4
      end
      Once again thank you for your help.

      Regards,
      Jay

      Comment


      • #4
        Thanks for your thanks, but I think you need to talk this through with whoever is directing or advising your work.

        The first problem here was already pointed out in #2. Your recipe is

        Code:
         
         std_score_m - std_score_m[_n-1] / (( std_score_m[_n-1]))
        but your parentheses are misplaced. Consider as a little piece of algebra

        a - b / ((b))

        The parentheses do no harm, but they are irrelevant and that reduces to

        a - 1

        given that divisions precede subtractions.

        Indeed -- look at your results -- the relationship between std_score_m and t1 is just that t1 is smaller by 1, although it's a little obscured by different display formats.

        The formula was already corrected to

        Code:
         
         (std_score_m - std_score_m[_n-1]) / std_score_m[_n-1]
        but with your present sort order you seem to be trying to compare observations with quite different school and student identifiers. That makes no sense to me.

        A final observation is that for standardised scores growth rate is indeterminate for previous value zero and nonsensical as a metric any way. Suppose there was an improvement from 0.0001 to 0.0301. That is an absolute improvement of only 0.03 but a relative change of 300

        Code:
        . di (0.0301 - 0.0001) / 0.0001
        300
        or 30000% while an improvement from 0.3 to 0.33 is an absolute improvement of 0.03 but a relative change of only

        Code:
        . di (0.3300 - 0.3000) / 0.3000
        .1
        or 10%. As suggested in #'2, but I should have pushed the point harder, plain differences are the way to look at changes, but there is no obvious sense in anything but doing the calculation for individuals in schools and then summarizing those.

        Comment


        • #5
          Let me add one comment. From post #1

          I simply want to calculate the growth for each year from 2013 to 2017 by grade level and year.
          The first step in doing that is to calculate the standardized test scores for each year by grade level and year. It is the year-to-year difference then that characterizes growth.

          Again from post #1

          ... this is a hierarchical dataset. This means for each year (2013-2017) there are 0-12 grade levels and for each grade level, there is a standardized math score and this is all is nested within different schools and then students within those schools ...
          This is backwards. You have data on students, who have standardized test scores. The students are nested within grades, which are nested within schools, which are nested within years. You need to aggregate the scores for students within the same year and grade level (you do not seem interested in schools) in some way that makes statistical sense to calculate a standardized test score by grade level and year.

          I will add that I agree with all that Nick has written, especially the futility of looking at growth rates for data that has been standardized.

          Comment


          • #6
            Thanks again Nick & William for helping me out with this problem.

            Comment

            Working...
            X