  • Regression with Rolling Dummy Window

    Hello everyone,

    I am using Stata 18.

    I am trying to recreate the peak identification from Colagrossi et al. (2023): Intimate partner violence and help-seeking: The role of femicide news.

    Quoting and summarizing the description of their procedure from the appendix:
    They are using rolling windows to analyze trends in daily news coverage (GD_d). Each window spans 30 days, with the last 15 days marked as the Post period. A regression is estimated for each window: GD_d = γ0 + γ1 Post_d + ε_d. The coefficient γ1 and its standard error are stored. The window then shifts forward by one day, and this process is repeated to cover the entire time series. In the end there are 15 estimates of γ1 for each day, from its inclusion in multiple windows. The share of positive and statistically significant γ1 coefficients is calculated for each day to track trends. An increase in coverage begins when a day has a positive share of significant coefficients after a day with no significant coefficients. A peak is identified as the first day with a share equal to 1. The Most Covered dummy equals 1 for events occurring between the start of an increase and the peak, and 0 otherwise.
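
    For intuition (my own gloss, not from the paper): within one 30-day window, regressing GD_d on the Post dummy simply compares means, so γ1 equals the average coverage of the last 15 days minus the average of the first 15 days. A quick toy check with made-up data:
    Code:
    * toy illustration (not the thread's data): with a single dummy regressor,
    * the slope equals the post-period mean minus the pre-period mean
    clear
    set obs 30
    set seed 1234
    gen y = rnormal(100, 10) + 50*(_n > 15)
    gen byte post = _n > 15
    regress y post
    summarize y if post, meanonly
    scalar m_post = r(mean)
    summarize y if !post, meanonly
    display "slope = " _b[post] "   post-minus-pre mean = " m_post - r(mean)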

    I have been able to estimate the rolling window regressions and to mark positive and significant coefficients using the following code:
    Code:
    * collapse to one observation per day with the total number of articles
    bysort date: egen dailynews = total(numarticles)
    bysort date: gen nvals = _n == 1
    keep if nvals==1
    keep date dailynews
    gen day_id = _n
    
    local window_size = 30 
    local pre_period = 15   
    
    * prepare coefficients
    gen _b_post=.
    gen _se_post=.
    
    gen sig_pos_count=0
    gen sig_pos_share=.
    
    * Step 1: rolling window regression
    forval start = 1/`=_N-`window_size'+1' { 
        local end = `start' + `window_size' - 1 
        
        tempvar temp_post
        gen `temp_post' = 0
        replace `temp_post' = 1 if day_id >= `start' + `pre_period' & day_id <= `end'
        
        qui reg dailynews `temp_post' if day_id >= `start' & day_id <= `end'
        
        * store the estimate in the row of the window's first day
        replace _b_post = _b[`temp_post'] in `start'
        replace _se_post = _se[`temp_post'] in `start'
        
        local t_stat = _b[`temp_post']/_se[`temp_post']   // computed here but not used below
        
        drop `temp_post'
    }
    
    * Step 2: flag positive and significant coefficients
    gen sig_post = 0
    replace sig_post = 1 if _b_post > 0 & _b_post/_se_post >= 1.96

    However, I am not able to figure out how to build the last part: calculating the share of positive and significant γ1 coefficients for each day. Any help is greatly appreciated.

    Here is an example of my data:
    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input float(date dailynews day_id _b_post _se_post sig_post)
    20089   92   1        5.8  37.38538 0
    20090  370   2 -15.666667  37.38127 0
    20091  263   3 -18.866667 34.157555 0
    20092  178   4        3.4 33.232487 0
    20093  110   5 -19.133333 33.423923 0
    20094  174   6  -25.06667 33.844425 0
    20095  109   7      -13.6 34.275402 0
    20096   94   8  -7.866667  34.63528 0
    20097   12   9      -18.2  34.50024 0
    20098  106  10 -20.666666  33.55426 0
    20099   90  11 -1.7333333  37.41055 0
    20100  177  12   .3333333  37.33023 0
    20101  194  13  -43.06667 36.557022 0
    20102   57  14 -35.533333 36.668564 0
    20103  176  15  -47.06667 35.976597 0
    20104  247  16  -44.13334 36.065506 0
    20105  253  17 -35.533333  35.93398 0
    20106   33  18  -26.46667  35.29412 0
    20107  282  19      -38.2 34.606632 0
    20108  119  20      -18.4  33.25128 0
    20109   29  21 -12.133333   33.5933 0
    20110   40  22 -10.666667 33.126637 0
    20111  171  23 -4.3333335 32.981274 0
    20112   90  24         .8  32.81292 0
    20113   94  25  -3.333333  32.76384 0
    20114   83  26      -38.8 31.911325 0
    20115  451  27      -37.8 31.933344 0
    20116   89  28 -13.933333 22.998274 0
    20117  158  29  -9.866667 23.414824 0
    20118  150  30  -4.933333  23.22727 0
    20119   80  31 -14.733334 23.271347 0
    20120   88  32   9.066667   29.2126 0
    20121  137  33   45.13334  41.88179 0
    20122   48  34  170.26666 121.60722 0
    20123   39  35  247.86667 140.05241 0
    20124   56  36   310.3333 144.65503 1
    20125   57  37   344.8667 143.77008 1
    20126   93  38        381 144.24625 1
    20127  131  39      456.2  148.7119 1
    20128  366  40      495.4 145.09428 1
    20129  107  41      562.4 140.09167 1
    20130   74  42        600 135.30814 1
    20131   97  43   629.2667 128.54155 1
    20132   86  44   768.1334 160.09773 1
    20133  168  45   934.8666 186.61763 1
    20134   42  46  1071.9333 181.66937 1
    20135   59  47  1134.7333 175.22102 1
    20136   65  48  1135.9333  171.5135 1
    20137  111  49  1002.3333 203.75995 1
    20138   53  50   930.5333 215.75943 1
    20139  105  51   861.2667 219.40257 1
    20140  169  52   839.2667 216.28824 1
    20141   92  53   797.4667    216.33 1
    20142  110  54   698.9333  221.1412 1
    20143  106  55   686.7333 215.23935 1
    20144  146  56      695.6  211.9745 1
    20145   55  57      666.6  206.7899 1
    20146  166  58   619.9333  209.0302 1
    20147   88  59   327.8667 232.90767 0
    20148   39  60  26.733334  233.9945 0
    20149  361  61 -222.46666 227.93817 0
    20150  571  62       -387 219.16037 0
    20151 1870  63     -491.2 219.66396 0
    20152 1338  64  -584.4667 212.88467 0
    20153 1004  65  -685.2667  208.2626 0
    20154  672  66  -732.3333 209.82542 0
    20155  823  67       -804  207.6488 0
    20156 1219  68  -853.5333  206.9031 0
    20157  677  69     -881.8 207.64394 0
    20158  851  70  -973.2667 199.02547 0
    20159  749  71 -1140.1333  171.1339 0
    20160  475  72 -1169.1333  164.2357 0
    20161 2318  73 -1158.5333 172.51956 0
    20162 2591  74 -1009.7333  176.7059 0
    20163 1966  75     -909.2  147.6579 0
    20164 1622  76     -774.2 144.99991 0
    20165 1101  77  -711.7333 139.24194 0
    20166 1671  78     -632.8 147.54262 0
    20167 1488  79       -528 140.69063 0
    20168  916  80  -438.6667 132.89784 0
    20169  909  81  -388.6667 134.43077 0
    20170  850  82  -333.7333 135.23512 0
    20171  868  83 -288.46667 134.57275 0
    20172 1061  84     -237.6  132.7688 0
    20173 1729  85     -167.6  124.9307 0
    20174  917  86  -46.53333 75.137375 0
    20175  195  87  -68.46667  54.91557 0
    20176   89  88        -59  55.43928 0
    20177  577  89  -58.13334  55.37307 0
    20178  155  90 -13.266666  48.59359 0
    20179  415  91 -32.933334  49.18386 0
    20180   68  92  -9.333333  46.35717 0
    20181   73  93        -27  45.87768 0
    20182  126  94 -16.933332  45.77517 0
    20183  122  95 -18.733334  46.18542 0
    20184   71  96 -14.066667  46.30976 0
    20185  134  97  -6.333333  46.05893 0
    20186   93  98         -8  46.70716 0
    20187   73  99       -4.6  46.60925 0
    20188  104 100      -14.8  46.79808 0
    end
    format %td date

  • #2
    Well, your code and results contradict your description for the first step. You state that each date will have 15 estimates of gamma1, one for each window it appears in. Apart from the fact that I don't understand why you expect each date to appear in 15 windows,* you have exactly one estimate of gamma1 (which you call _b_post) for each date. So, there is no way to calculate a "share" here.

    Since I'm not familiar with what you're trying to do here, and I don't follow your explanation, I'm not going to propose a fix here. But if you can give a clearer explanation of what you want to do, and show a usable example from your starting data set, I'll give it a try.

    *I would expect most dates to appear in 30 windows given that the window width is 30 days. Those at the beginning or end of the data set will appear in fewer.



    • #3
      Clyde Schechter I will try to clarify how I understood the procedure described by Colagrossi et al. (2023):
      The rolling window is 30 days. Half of it lies "before the event" and half "after". So in the regression GD_d = γ0 + γ1 Post_d + ε_d there are 15 days for which only γ0 applies, and 15 days for which γ1 applies as well. The regression is estimated on those 30 observations, then the window is rolled forward by one day and another 30-day regression is estimated.

      Let me give an example: day d = 60.
      The first time day 60 enters a regression is for the window starting on day 31. Here, days 31-45 are the pre-period with Post=0 and days 46-60 are the post-period with Post=1.
      The last time day 60 enters a regression is for the window starting on day 60. Here, days 60-74 are the pre-period with Post=0 and days 75-89 are the post-period with Post=1.
      Day 60 falls in the post-period (Post=1) for the windows starting on days 31-45. So there are 15 estimates of γ1 in which day 60 enters with Post=1.

      The share I want to calculate for day 60 is the fraction of those 15 γ1 estimates that are positive and significant.
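
      To make the window arithmetic concrete (my own check, not the authors' code): for a generic day d, the windows whose post-period contains d are exactly those starting on days d-29 through d-15, i.e. 15 windows, as long as the series extends far enough on both sides. A quick brute-force check for day 60 in a 100-day series:
      Code:
      * illustrative only: count the windows whose 15-day post-period contains day d
      local d = 60
      local T = 100
      local n_windows = 0
      forvalues s = 1/`=`T'-29' {
          if `d' >= `s'+15 & `d' <= `s'+29 {
              local ++n_windows
          }
      }
      display "day `d' is in the post-period of `n_windows' windows"   // 15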

      It might very well be that I misunderstood what the authors are doing.

      This is the exact description given by Colagrossi et al. (2023), p. 21:
      We define a daily measure of news coverage of gender-based violence – GD_d – that is equal to the share of tagged news over the total number of news in a given day d. Since our objective is to identify femicides that receive most coverage, we devise an ad-hoc procedure to statistically detect temporary increases in media coverage of news related to gender-based violence.

      This procedure is based on pre-post rolling windows comparisons. More specifically, we define d0 as the beginning of the observable GD_d daily time series. We keep observations between d0 and d0 + 30, and define a Post dummy for the 15-days sub-period between d0 + 15 and d0 + 30. We use these observations to estimate the regression GD_d = γ0 + γ1 Post_d + ε_d. After storing the estimated coefficient γ1 and its standard error, we roll forward the time window by one day and re-estimate the same regression. We iterate this procedure until the end of the observable time series. The result of this iterative procedure is a series of estimated coefficients and their corresponding standard errors. Importantly, for each day d there are 15 coefficient estimates, one for each position of day d in the rolling Post windows.

      We then compute, for each day, the fraction of iterations in which the γ1 coefficient is positive and statistically significant. These shares are used to identify the increase in coverage and the subsequent decrease. Intuitively, as the time series starts to trend up, post periods are more likely to be associated with positive and statistically significant γ1 coefficients, and so the share for each day increases with the trend. The opposite happens when the series starts to trend down, with a lower likelihood of positive and significant γ1 coefficients, and decreasing shares. We define the beginning of a period of increasing coverage as the first day with a positive share of significant γ1 coefficients preceded by a day with a share equal to zero. We identify the peak of the temporary increase as the first day with a share equal to 1. Finally, we define the Most Covered dummy variable which is equal to 1 for femicides occurring between the beginning and peak of a period of increasing coverage, and 0 otherwise.



      • #4
        Thank you. Now I get it. The key clarification is that an estimated regression coefficient enters a date's share calculation only when that date falls in the final 15 days of the window. By the way, I assume that at the end of the data set we don't run the regression if there aren't enough remaining observations to fill the window with 30 dates. And for the first 15 dates in the data, no share will be calculated because those dates never appear in the last half of a window.

        So the following should do it for you:
        Code:
        * Example generated by -dataex-. For more info, type help dataex
        clear*
        input float(date dailynews day_id)
        20089   92   1
        20090  370   2
        20091  263   3
        20092  178   4
        20093  110   5
        20094  174   6
        20095  109   7
        20096   94   8
        20097   12   9
        20098  106  10
        20099   90  11
        20100  177  12
        20101  194  13
        20102   57  14
        20103  176  15
        20104  247  16
        20105  253  17
        20106   33  18
        20107  282  19
        20108  119  20
        20109   29  21
        20110   40  22
        20111  171  23
        20112   90  24
        20113   94  25
        20114   83  26
        20115  451  27
        20116   89  28
        20117  158  29
        20118  150  30
        20119   80  31
        20120   88  32
        20121  137  33
        20122   48  34
        20123   39  35
        20124   56  36
        20125   57  37
        20126   93  38
        20127  131  39
        20128  366  40
        20129  107  41
        20130   74  42
        20131   97  43
        20132   86  44
        20133  168  45
        20134   42  46
        20135   59  47
        20136   65  48
        20137  111  49
        20138   53  50
        20139  105  51
        20140  169  52
        20141   92  53
        20142  110  54
        20143  106  55
        20144  146  56
        20145   55  57
        20146  166  58
        20147   88  59
        20148   39  60
        20149  361  61
        20150  571  62
        20151 1870  63
        20152 1338  64
        20153 1004  65
        20154  672  66
        20155  823  67
        20156 1219  68
        20157  677  69
        20158  851  70
        20159  749  71
        20160  475  72
        20161 2318  73
        20162 2591  74
        20163 1966  75
        20164 1622  76
        20165 1101  77
        20166 1671  78
        20167 1488  79
        20168  916  80
        20169  909  81
        20170  850  82
        20171  868  83
        20172 1061  84
        20173 1729  85
        20174  917  86
        20175  195  87
        20176   89  88
        20177  577  89
        20178  155  90
        20179  415  91
        20180   68  92
        20181   73  93
        20182  126  94
        20183  122  95
        20184   71  96
        20185  134  97
        20186   93  98
        20187   73  99
        20188  104 100
        end
        format %td date
        
        isid date, sort
        
        frame create gammas int date float(b se)
        
        capture program drop one_date
        program define one_date
            if _N == 30 {
                gen post = (_n > 15)
                regress dailynews i.post
                forvalues i = 16/30 {
                    frame post gammas (date[`i']) (_b[1.post]) (_se[1.post])
                }
            }
            exit
        end
        
        rangerun one_date, interval(date 0 29)
        
        frame change gammas
        format date %td
        gen byte sig_pos = b > 0 & b/se > invnorm(0.975)
        collapse (mean) share_sig_pos = sig_pos, by(date)
        
        frame change default
        frlink 1:1 date, frame(gammas)
        frget share_sig_pos, from(gammas)

        -rangerun- is written by Robert Picard and is available from SSC. It is the most efficient way to handle rolling-window problems, and it also has many other applications. To use -rangerun-, you must also install -rangestat-, by Robert Picard, Nick Cox, and Roberto Ferrer, also from SSC.
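
        If you also want the last step described in the quoted appendix (beginning of an increase, peak, and the Most Covered period), something along these lines might work. This is only a sketch based on the description in #3, not the authors' code; the names inc_start, peak and most_covered are mine, and I have not tested it against the paper:
        Code:
        * rough sketch, assuming share_sig_pos has been merged into the default frame as above
        sort date
        * beginning of an increase: positive share preceded by a day with a share of zero
        gen byte inc_start = share_sig_pos > 0 & !missing(share_sig_pos) & share_sig_pos[_n-1] == 0
        * peak: first day of a run with a share equal to 1
        gen byte peak = share_sig_pos == 1 & (_n == 1 | share_sig_pos[_n-1] < 1)
        * Most Covered: days from the start of an increase through the following peak
        * (assumes every increase episode eventually reaches a day with share == 1)
        gen byte most_covered = inc_start
        replace most_covered = 1 if _n > 1 & most_covered[_n-1] == 1 & peak[_n-1] == 0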




        • #5
          Thanks Clyde! I think the code does what I wanted. (still not sure if I understood the authors correctly though)
