  • Regression with Rolling Dummy Window

    Hello everyone,

    I am using Stata 18.

    I am trying to recreate the peak identification from Colagrossi et al. (2023): Intimate partner violence and help-seeking: The role of femicide news.

    Quoting and summarizing the description of their procedure from the appendix:
    They are using rolling windows to analyze trends in daily news coverage (GD_d). Each window spans 30 days, with the last 15 days marked as the Post period. A regression is estimated for each window: GD_d = γ0 + γ1 Post_d + ε_d. The coefficient γ1 and its standard error are stored. The window then shifts forward by one day, and this process is repeated to cover the entire time series. In the end there are 15 estimates of γ1 for each day, from its inclusion in multiple windows. The share of positive and statistically significant γ1 coefficients is calculated for each day to track trends. An increase in coverage begins when a day has a positive share of significant coefficients after a day with no significant coefficients. A peak is identified as the first day with a share equal to 1. The Most Covered dummy equals 1 for events occurring between the start of an increase and the peak, and 0 otherwise.
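
    For intuition (my own gloss, not from the paper): within one 30-day window, regressing GD_d on the Post dummy simply compares means, so γ1 equals the average coverage of the last 15 days minus the average of the first 15 days. A quick toy check with made-up data:
    Code:
    * toy illustration (not the thread's data): with a single dummy regressor,
    * the slope equals the post-period mean minus the pre-period mean
    clear
    set obs 30
    set seed 1234
    gen y = rnormal(100, 10) + 50*(_n > 15)
    gen byte post = _n > 15
    regress y post
    summarize y if post, meanonly
    scalar m_post = r(mean)
    summarize y if !post, meanonly
    display "slope = " _b[post] "   post-minus-pre mean = " m_post - r(mean)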

    I have been able to estimate the rolling window regressions and to mark positive and significant coefficients using the following code:
    Code:
    * collapse to one observation per day with the total number of articles
    bysort date: egen dailynews = total(numarticles)
    bysort date: gen nvals = _n == 1
    keep if nvals==1
    keep date dailynews
    gen day_id = _n
    
    local window_size = 30 
    local pre_period = 15   
    
    * prepare coefficients
    gen _b_post=.
    gen _se_post=.
    
    gen sig_pos_count=0
    gen sig_pos_share=.
    
    * Step 1: rolling window regression
    forval start = 1/`=_N-`window_size'+1' { 
        local end = `start' + `window_size' - 1 
        
        tempvar temp_post
        gen `temp_post' = 0
        replace `temp_post' = 1 if day_id >= `start' + `pre_period' & day_id <= `end'
        
        qui reg dailynews `temp_post' if day_id >= `start' & day_id <= `end'
        
        * store the estimate in the row of the window's first day
        replace _b_post = _b[`temp_post'] in `start'
        replace _se_post = _se[`temp_post'] in `start'
        
        local t_stat = _b[`temp_post']/_se[`temp_post']   // computed here but not used below
        
        drop `temp_post'
    }
    
    * Step 2: flag positive and significant coefficients
    gen sig_post = 0
    replace sig_post = 1 if _b_post > 0 & _b_post/_se_post >= 1.96

    However, I am not able to figure out how to build the last part: calculating the share of positive and significant γ1 coefficients for each day. Any help is greatly appreciated.

    Here is an example of my data:
    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input float(date dailynews day_id _b_post _se_post sig_post)
    20089   92   1        5.8  37.38538 0
    20090  370   2 -15.666667  37.38127 0
    20091  263   3 -18.866667 34.157555 0
    20092  178   4        3.4 33.232487 0
    20093  110   5 -19.133333 33.423923 0
    20094  174   6  -25.06667 33.844425 0
    20095  109   7      -13.6 34.275402 0
    20096   94   8  -7.866667  34.63528 0
    20097   12   9      -18.2  34.50024 0
    20098  106  10 -20.666666  33.55426 0
    20099   90  11 -1.7333333  37.41055 0
    20100  177  12   .3333333  37.33023 0
    20101  194  13  -43.06667 36.557022 0
    20102   57  14 -35.533333 36.668564 0
    20103  176  15  -47.06667 35.976597 0
    20104  247  16  -44.13334 36.065506 0
    20105  253  17 -35.533333  35.93398 0
    20106   33  18  -26.46667  35.29412 0
    20107  282  19      -38.2 34.606632 0
    20108  119  20      -18.4  33.25128 0
    20109   29  21 -12.133333   33.5933 0
    20110   40  22 -10.666667 33.126637 0
    20111  171  23 -4.3333335 32.981274 0
    20112   90  24         .8  32.81292 0
    20113   94  25  -3.333333  32.76384 0
    20114   83  26      -38.8 31.911325 0
    20115  451  27      -37.8 31.933344 0
    20116   89  28 -13.933333 22.998274 0
    20117  158  29  -9.866667 23.414824 0
    20118  150  30  -4.933333  23.22727 0
    20119   80  31 -14.733334 23.271347 0
    20120   88  32   9.066667   29.2126 0
    20121  137  33   45.13334  41.88179 0
    20122   48  34  170.26666 121.60722 0
    20123   39  35  247.86667 140.05241 0
    20124   56  36   310.3333 144.65503 1
    20125   57  37   344.8667 143.77008 1
    20126   93  38        381 144.24625 1
    20127  131  39      456.2  148.7119 1
    20128  366  40      495.4 145.09428 1
    20129  107  41      562.4 140.09167 1
    20130   74  42        600 135.30814 1
    20131   97  43   629.2667 128.54155 1
    20132   86  44   768.1334 160.09773 1
    20133  168  45   934.8666 186.61763 1
    20134   42  46  1071.9333 181.66937 1
    20135   59  47  1134.7333 175.22102 1
    20136   65  48  1135.9333  171.5135 1
    20137  111  49  1002.3333 203.75995 1
    20138   53  50   930.5333 215.75943 1
    20139  105  51   861.2667 219.40257 1
    20140  169  52   839.2667 216.28824 1
    20141   92  53   797.4667    216.33 1
    20142  110  54   698.9333  221.1412 1
    20143  106  55   686.7333 215.23935 1
    20144  146  56      695.6  211.9745 1
    20145   55  57      666.6  206.7899 1
    20146  166  58   619.9333  209.0302 1
    20147   88  59   327.8667 232.90767 0
    20148   39  60  26.733334  233.9945 0
    20149  361  61 -222.46666 227.93817 0
    20150  571  62       -387 219.16037 0
    20151 1870  63     -491.2 219.66396 0
    20152 1338  64  -584.4667 212.88467 0
    20153 1004  65  -685.2667  208.2626 0
    20154  672  66  -732.3333 209.82542 0
    20155  823  67       -804  207.6488 0
    20156 1219  68  -853.5333  206.9031 0
    20157  677  69     -881.8 207.64394 0
    20158  851  70  -973.2667 199.02547 0
    20159  749  71 -1140.1333  171.1339 0
    20160  475  72 -1169.1333  164.2357 0
    20161 2318  73 -1158.5333 172.51956 0
    20162 2591  74 -1009.7333  176.7059 0
    20163 1966  75     -909.2  147.6579 0
    20164 1622  76     -774.2 144.99991 0
    20165 1101  77  -711.7333 139.24194 0
    20166 1671  78     -632.8 147.54262 0
    20167 1488  79       -528 140.69063 0
    20168  916  80  -438.6667 132.89784 0
    20169  909  81  -388.6667 134.43077 0
    20170  850  82  -333.7333 135.23512 0
    20171  868  83 -288.46667 134.57275 0
    20172 1061  84     -237.6  132.7688 0
    20173 1729  85     -167.6  124.9307 0
    20174  917  86  -46.53333 75.137375 0
    20175  195  87  -68.46667  54.91557 0
    20176   89  88        -59  55.43928 0
    20177  577  89  -58.13334  55.37307 0
    20178  155  90 -13.266666  48.59359 0
    20179  415  91 -32.933334  49.18386 0
    20180   68  92  -9.333333  46.35717 0
    20181   73  93        -27  45.87768 0
    20182  126  94 -16.933332  45.77517 0
    20183  122  95 -18.733334  46.18542 0
    20184   71  96 -14.066667  46.30976 0
    20185  134  97  -6.333333  46.05893 0
    20186   93  98         -8  46.70716 0
    20187   73  99       -4.6  46.60925 0
    20188  104 100      -14.8  46.79808 0
    end
    format %td date

  • #2
    Well, your code and results contradict your description for the first step. You state that each date will have 15 estimates of gamma1, one for each window it appears in. Apart from the fact that I don't understand why you expect each date to appear in 15 windows,* you have exactly one estimate of gamma1 (which you call _b_post) for each date. So, there is no way to calculate a "share" here.

    Since I'm not familiar with what you're trying to do here, and I don't follow your explanation, I'm not going to propose a fix here. But if you can give a clearer explanation of what you want to do, and show a usable example from your starting data set, I'll give it a try.

    *I would expect most dates to appear in 30 windows given that the window width is 30 days. Those at the beginning or end of the data set will appear in fewer.



    • #3
      Clyde Schechter I will try to clarify how I understood the procedure described by Colagrossi et al. (2023):
      The rolling window is 30 days. Half of it lies "before the event" and half "after". So in the regression GD_d = γ0 + γ1 Post_d + ε_d there are 15 days for which only γ0 applies, and 15 days for which γ1 applies as well. The regression is estimated on those 30 observations, then the window is rolled forward by one day and another 30-day regression is estimated.

      Let me give an example: day d = 60.
      The first time day 60 enters a regression is for the window starting on day 31. Here, days 31-45 are the pre-period with Post=0 and days 46-60 are the post-period with Post=1.
      The last time day 60 enters a regression is for the window starting on day 60. Here, days 60-74 are the pre-period with Post=0 and days 75-89 are the post-period with Post=1.
      Day 60 falls in the post-period (Post=1) for the windows starting on days 31-45. So there are 15 estimates of γ1 in which day 60 enters with Post=1.

      The share I want to calculate for day 60 is the fraction of those 15 γ1 estimates that are positive and significant.
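
      To make the window arithmetic concrete (my own check, not the authors' code): for a generic day d, the windows whose post-period contains d are exactly those starting on days d-29 through d-15, i.e. 15 windows, as long as the series extends far enough on both sides. A quick brute-force check for day 60 in a 100-day series:
      Code:
      * illustrative only: count the windows whose 15-day post-period contains day d
      local d = 60
      local T = 100
      local n_windows = 0
      forvalues s = 1/`=`T'-29' {
          if `d' >= `s'+15 & `d' <= `s'+29 {
              local ++n_windows
          }
      }
      display "day `d' is in the post-period of `n_windows' windows"   // 15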

      It might very well be that I misunderstood what the authors are doing.

      This is the exact description given by Colagrossi et al. (2023), p. 21:
      We define a daily measure of news coverage of gender-based violence – GD_d – that is equal to the share of tagged news over the total number of news in a given day d. Since our objective is to identify femicides that receive most coverage, we devise an ad-hoc procedure to statistically detect temporary increases in media coverage of news related to gender-based violence.

      This procedure is based on pre-post rolling windows comparisons. More specifically, we define d0 as the beginning of the observable GD_d daily time series. We keep observations between d0 and d0 + 30, and define a Post dummy for the 15-days sub-period between d0 + 15 and d0 + 30. We use these observations to estimate the regression GD_d = γ0 + γ1 Post_d + ε_d. After storing the estimated coefficient γ1 and its standard error, we roll forward the time window by one day and re-estimate the same regression. We iterate this procedure until the end of the observable time series. The result of this iterative procedure is a series of estimated coefficients and their corresponding standard errors. Importantly, for each day d there are 15 coefficient estimates, one for each position of day d in the rolling Post windows.

      We then compute, for each day, the fraction of iterations in which the γ1 coefficient is positive and statistically significant. These shares are used to identify the increase in coverage and the subsequent decrease. Intuitively, as the time series starts to trend up, post periods are more likely to be associated with positive and statistically significant γ1 coefficients, and so the share for each day increases with the trend. The opposite happens when the series starts to trend down, with a lower likelihood of positive and significant γ1 coefficients, and decreasing shares. We define the beginning of a period of increasing coverage as the first day with a positive share of significant γ1 coefficients preceded by a day with a share equal to zero. We identify the peak of the temporary increase as the first day with a share equal to 1. Finally, we define the Most Covered dummy variable which is equal to 1 for femicides occurring between the beginning and peak of a period of increasing coverage, and 0 otherwise.



      • #4
        Thank you. Now I get it. The key clarification is that an estimated regression coefficient enters a date's share calculation only when that date falls in the final 15 days of the window. By the way, I assume that at the end of the data set we don't run the regression if there aren't enough remaining observations to fill the window with 30 dates. And for the first 15 dates in the data, no share will be calculated because those dates never appear in the last half of a window.

        So the following should do it for you:
        Code:
        * Example generated by -dataex-. For more info, type help dataex
        clear*
        input float(date dailynews day_id)
        20089   92   1
        20090  370   2
        20091  263   3
        20092  178   4
        20093  110   5
        20094  174   6
        20095  109   7
        20096   94   8
        20097   12   9
        20098  106  10
        20099   90  11
        20100  177  12
        20101  194  13
        20102   57  14
        20103  176  15
        20104  247  16
        20105  253  17
        20106   33  18
        20107  282  19
        20108  119  20
        20109   29  21
        20110   40  22
        20111  171  23
        20112   90  24
        20113   94  25
        20114   83  26
        20115  451  27
        20116   89  28
        20117  158  29
        20118  150  30
        20119   80  31
        20120   88  32
        20121  137  33
        20122   48  34
        20123   39  35
        20124   56  36
        20125   57  37
        20126   93  38
        20127  131  39
        20128  366  40
        20129  107  41
        20130   74  42
        20131   97  43
        20132   86  44
        20133  168  45
        20134   42  46
        20135   59  47
        20136   65  48
        20137  111  49
        20138   53  50
        20139  105  51
        20140  169  52
        20141   92  53
        20142  110  54
        20143  106  55
        20144  146  56
        20145   55  57
        20146  166  58
        20147   88  59
        20148   39  60
        20149  361  61
        20150  571  62
        20151 1870  63
        20152 1338  64
        20153 1004  65
        20154  672  66
        20155  823  67
        20156 1219  68
        20157  677  69
        20158  851  70
        20159  749  71
        20160  475  72
        20161 2318  73
        20162 2591  74
        20163 1966  75
        20164 1622  76
        20165 1101  77
        20166 1671  78
        20167 1488  79
        20168  916  80
        20169  909  81
        20170  850  82
        20171  868  83
        20172 1061  84
        20173 1729  85
        20174  917  86
        20175  195  87
        20176   89  88
        20177  577  89
        20178  155  90
        20179  415  91
        20180   68  92
        20181   73  93
        20182  126  94
        20183  122  95
        20184   71  96
        20185  134  97
        20186   93  98
        20187   73  99
        20188  104 100
        end
        format %td date
        
        isid date, sort
        
        frame create gammas int date float(b se)
        
        capture program drop one_date
        program define one_date
            if _N == 30 {
                gen post = (_n > 15)
                regress dailynews i.post
                forvalues i = 16/30 {
                    frame post gammas (date[`i']) (_b[1.post]) (_se[1.post])
                }
            }
            exit
        end
        
        rangerun one_date, interval(date 0 29)
        
        frame change gammas
        format date %td
        gen byte sig_pos = b > 0 & b/se > invnorm(0.975)
        collapse (mean) share_sig_pos = sig_pos, by(date)
        
        frame change default
        frlink 1:1 date, frame(gammas)
        frget share_sig_pos, from(gammas)

        -rangerun- is written by Robert Picard and is available from SSC. It is the most efficient way to handle rolling-window problems, and it also has many other applications. To use -rangerun-, you must also install -rangestat-, by Robert Picard, Nick Cox, and Roberto Ferrer, also from SSC.
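
        If you also want the last step described in the quoted appendix (beginning of an increase, peak, and the Most Covered period), something along these lines might work. This is only a sketch based on the description in #3, not the authors' code; the names inc_start, peak and most_covered are mine, and I have not tested it against the paper:
        Code:
        * rough sketch, assuming share_sig_pos has been merged into the default frame as above
        sort date
        * beginning of an increase: positive share preceded by a day with a share of zero
        gen byte inc_start = share_sig_pos > 0 & !missing(share_sig_pos) & share_sig_pos[_n-1] == 0
        * peak: first day of a run with a share equal to 1
        gen byte peak = share_sig_pos == 1 & (_n == 1 | share_sig_pos[_n-1] < 1)
        * Most Covered: days from the start of an increase through the following peak
        * (assumes every increase episode eventually reaches a day with share == 1)
        gen byte most_covered = inc_start
        replace most_covered = 1 if _n > 1 & most_covered[_n-1] == 1 & peak[_n-1] == 0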




        • #5
          Thanks Clyde! I think the code does what I wanted. (still not sure if I understood the authors correctly though)
