  • Creating first-order difference variables in repeated cross-sectional (i.e. pseudo panel) data

    Greetings,

    I'm running Stata 15.1 on macOS. I've created a repeated cross-sectional dataset to assess variation in attitudes toward crime (the dependent variable). My time variable stores the month and year in which each survey (31 in total) was conducted. For my independent variable, I created and merged in (using the time variable) a newspaper issue-salience index that stores the percentage of monthly New York Times articles referring to crime-related issues. My expectation is that in months in which crime is salient in the media, we will see an increase in the percentage of respondents saying crime 'is a serious issue'.

    To prepare the dataset for analysis, I used -collapse- to create a variable that stores the mean percentage of respondents who gave the 'serious issue' response in each survey (i.e., by year/month); a rough sketch of these steps appears after the sample data below. I did the same with the salience index variable (in a separate dataset) and merged it into the collapsed survey dataset. A simple Pearson correlation between the index and the survey response variable showed a strong relationship (r = 0.87). However, a colleague who saw the resulting graph warned me that I 'shouldn't correlate time series with strong autocorrelation' and that, instead, I should 'create first-order difference sequences and correlate those'. I'm not quite sure how to go about doing this. The dataset has no panel ID, so I tried creating one:

    Code:
    gen id=_n
    (note that the id is then simply a number assigned to each survey--31 in total)

    I then entered the following:

    Code:
    xtset id year
    To create the 'first difference' variable I tried the following:

    Code:
    gen indexdiff=D.index
    However, Stata then gave me the message '(31 missing values generated)'.

    What am I doing wrong here and how do I get it right? Thanks for your time!

    Sample data:

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input float(index crime_serious yearmonth year id)
    1.0898919 37.634632 441 1996  1
     1.141105  41.15658 449 1997  2
     .9819449 31.580814 453 1997  3
    1.1344688  35.43958 475 1999  4
    1.2987403   39.7779 487 2000  5
    1.1022217  39.37875 488 2000  6
     1.045117 32.872364 521 2003  7
     .7872596 35.538055 522 2003  8
     .8885489  38.24273 523 2003  9
     .9927688  35.79262 524 2003 10
     .7067459  39.30157 539 2004 11
    1.0929303 36.767914 548 2005 12
    1.0707874  25.04893 572 2007 13
    1.0773966  34.76981 573 2007 14
    1.0685753  29.70381 576 2008 15
     1.118886   27.0324 580 2008 16
     .9239349  31.63132 584 2008 17
     .7300239 23.623867 597 2009 18
     .7975035  28.98842 598 2009 19
    1.1477937 34.304623 613 2011 20
    1.0149189  38.20615 614 2011 21
    1.1804827   34.5046 624 2012 22
    1.3056893  39.55238 648 2014 23
    1.2751036  41.03848 649 2014 24
     1.369863  42.47158 650 2014 25
    1.8246716  52.22675 662 2015 26
     2.096708  48.12559 667 2015 27
    1.6774454  47.23487 668 2015 28
    1.5856438  42.08379 669 2015 29
     2.575059  57.32762 686 2017 30
    2.7088645   64.2695 689 2017 31
    end
    format %tm yearmonth
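
    For reference, the preparation steps look roughly like this (survey.dta and salience.dta are placeholder file names):

    Code:
    * Sketch of the preparation described above; survey.dta and salience.dta
    * are hypothetical file names, and survey weights are omitted
    use survey.dta, clear
    collapse (mean) crime_serious, by(yearmonth)
    merge 1:1 yearmonth using salience.dta, keep(match) nogenerate
    corr crime_serious index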


  • #2
    Unfortunately, the example data you posted is useless for resolving your question because it contains only one observation per id, so all the first differences are undefined. Presumably that is not the case with your actual data set, though I half expect you will find many places where a given id's time series has gaps. The first difference is undefined whenever there is no value for the preceding time period; hence it is always undefined for the first observation in each id.

    Anyway, look for gaps in the time series within ids, and remember that each id's first observation will always have a missing first difference. My guess is that these things will explain it all. If not, please post back with better example data: an example that includes a complete time series for a few ids, rather than just one observation per id across many ids.
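
    For what it's worth, here is a minimal sketch of one way to flag such gaps, using the variable names from your example (gap_before is a made-up name):

    Code:
    * Flag observations whose predecessor month is absent within the same id
    sort id yearmonth
    by id: gen byte gap_before = (yearmonth - yearmonth[_n-1] > 1) if _n > 1
    list id yearmonth gap_before if gap_before == 1, sepby(id)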



    • #3
      Hey Clyde,

      To clarify, there is only one observation per id because each observation is either the mean percentage of respondents who gave the 'crime is a serious issue' response in that specific survey (indexed by year/month and/or the generated id variable) OR the percentage of monthly articles referring to crime in that month. In other words, the data I provided are aggregate data, generated using -collapse (mean) crime_serious, by(yearmonth)-.

      If it helps, here first is the pre-collapsed data, which was created using -egen crime_serious = wtmean(CRIME), by(yearmonth) weight(weight)- (wtmean is from the egenmore package on SSC):

      Code:
      * Example generated by -dataex-. To install: ssc install dataex
      clear
      input float(crime_serious index id yearmonth year)
      37.634632 1.0898919   1 441 1996
      37.634632 1.0898919   2 441 1996
      37.634632 1.0898919   3 441 1996
      37.634632 1.0898919   4 441 1996
      37.634632 1.0898919   5 441 1996
      37.634632 1.0898919   6 441 1996
      37.634632 1.0898919   7 441 1996
      37.634632 1.0898919   8 441 1996
      37.634632 1.0898919   9 441 1996
      37.634632 1.0898919  10 441 1996
      37.634632 1.0898919  11 441 1996
      37.634632 1.0898919  12 441 1996
      37.634632 1.0898919  13 441 1996
      37.634632 1.0898919  14 441 1996
      37.634632 1.0898919  15 441 1996
      37.634632 1.0898919  16 441 1996
      37.634632 1.0898919  17 441 1996
      37.634632 1.0898919  18 441 1996
      37.634632 1.0898919  19 441 1996
      37.634632 1.0898919  20 441 1996
      37.634632 1.0898919  21 441 1996
      37.634632 1.0898919  22 441 1996
      37.634632 1.0898919  23 441 1996
      37.634632 1.0898919  24 441 1996
      37.634632 1.0898919  25 441 1996
      37.634632 1.0898919  26 441 1996
      37.634632 1.0898919  27 441 1996
      37.634632 1.0898919  28 441 1996
      37.634632 1.0898919  29 441 1996
      37.634632 1.0898919  30 441 1996
      37.634632 1.0898919  31 441 1996
      37.634632 1.0898919  32 441 1996
      37.634632 1.0898919  33 441 1996
      37.634632 1.0898919  34 441 1996
      37.634632 1.0898919  35 441 1996
      37.634632 1.0898919  36 441 1996
      37.634632 1.0898919  37 441 1996
      37.634632 1.0898919  38 441 1996
      37.634632 1.0898919  39 441 1996
      37.634632 1.0898919  40 441 1996
      37.634632 1.0898919  41 441 1996
      37.634632 1.0898919  42 441 1996
      37.634632 1.0898919  43 441 1996
      37.634632 1.0898919  44 441 1996
      37.634632 1.0898919  45 441 1996
      37.634632 1.0898919  46 441 1996
      37.634632 1.0898919  47 441 1996
      37.634632 1.0898919  48 441 1996
      37.634632 1.0898919  49 441 1996
      37.634632 1.0898919  50 441 1996
      37.634632 1.0898919  51 441 1996
      37.634632 1.0898919  52 441 1996
      37.634632 1.0898919  53 441 1996
      37.634632 1.0898919  54 441 1996
      37.634632 1.0898919  55 441 1996
      37.634632 1.0898919  56 441 1996
      37.634632 1.0898919  57 441 1996
      37.634632 1.0898919  58 441 1996
      37.634632 1.0898919  59 441 1996
      37.634632 1.0898919  60 441 1996
      37.634632 1.0898919  61 441 1996
      37.634632 1.0898919  62 441 1996
      37.634632 1.0898919  63 441 1996
      37.634632 1.0898919  64 441 1996
      37.634632 1.0898919  65 441 1996
      37.634632 1.0898919  66 441 1996
      37.634632 1.0898919  67 441 1996
      37.634632 1.0898919  68 441 1996
      37.634632 1.0898919  69 441 1996
      37.634632 1.0898919  70 441 1996
      37.634632 1.0898919  71 441 1996
      37.634632 1.0898919  72 441 1996
      37.634632 1.0898919  73 441 1996
      37.634632 1.0898919  74 441 1996
      37.634632 1.0898919  75 441 1996
      37.634632 1.0898919  76 441 1996
      37.634632 1.0898919  77 441 1996
      37.634632 1.0898919  78 441 1996
      37.634632 1.0898919  79 441 1996
      37.634632 1.0898919  80 441 1996
      37.634632 1.0898919  81 441 1996
      37.634632 1.0898919  82 441 1996
      37.634632 1.0898919  83 441 1996
      37.634632 1.0898919  84 441 1996
      37.634632 1.0898919  85 441 1996
      37.634632 1.0898919  86 441 1996
      37.634632 1.0898919  87 441 1996
      37.634632 1.0898919  88 441 1996
      37.634632 1.0898919  89 441 1996
      37.634632 1.0898919  90 441 1996
      37.634632 1.0898919  91 441 1996
      37.634632 1.0898919  92 441 1996
      37.634632 1.0898919  93 441 1996
      37.634632 1.0898919  94 441 1996
      37.634632 1.0898919  95 441 1996
      37.634632 1.0898919  96 441 1996
      37.634632 1.0898919  97 441 1996
      37.634632 1.0898919  98 441 1996
      37.634632 1.0898919  99 441 1996
      37.634632 1.0898919 100 441 1996
      end
      format %tm yearmonth
      And here are data for the original, individual-level variables (where CRIME = 100 indicates the 'serious issue' response and 0 indicates any other response):

      Code:
      * Example generated by -dataex-. To install: ssc install dataex
      clear
      input byte CRIME float(index yearmonth year id)
      100 1.0898919 441 1996   1
        0 1.0898919 441 1996   2
      100 1.0898919 441 1996   3
      100 1.0898919 441 1996   4
        0 1.0898919 441 1996   5
      100 1.0898919 441 1996   6
      100 1.0898919 441 1996   7
      100 1.0898919 441 1996   8
      100 1.0898919 441 1996   9
      100 1.0898919 441 1996  10
      100 1.0898919 441 1996  11
        0 1.0898919 441 1996  12
      100 1.0898919 441 1996  13
        0 1.0898919 441 1996  14
        0 1.0898919 441 1996  15
      100 1.0898919 441 1996  16
      100 1.0898919 441 1996  17
      100 1.0898919 441 1996  18
      100 1.0898919 441 1996  19
        0 1.0898919 441 1996  20
      100 1.0898919 441 1996  21
        0 1.0898919 441 1996  22
        0 1.0898919 441 1996  23
      100 1.0898919 441 1996  24
      100 1.0898919 441 1996  25
        0 1.0898919 441 1996  26
        0 1.0898919 441 1996  27
        0 1.0898919 441 1996  28
      100 1.0898919 441 1996  29
        0 1.0898919 441 1996  30
        0 1.0898919 441 1996  31
        0 1.0898919 441 1996  32
        0 1.0898919 441 1996  33
        0 1.0898919 441 1996  34
      100 1.0898919 441 1996  35
        0 1.0898919 441 1996  36
        0 1.0898919 441 1996  37
        0 1.0898919 441 1996  38
      100 1.0898919 441 1996  39
        0 1.0898919 441 1996  40
        0 1.0898919 441 1996  41
      100 1.0898919 441 1996  42
        0 1.0898919 441 1996  43
        0 1.0898919 441 1996  44
      100 1.0898919 441 1996  45
      100 1.0898919 441 1996  46
        0 1.0898919 441 1996  47
        0 1.0898919 441 1996  48
        0 1.0898919 441 1996  49
        0 1.0898919 441 1996  50
      100 1.0898919 441 1996  51
        0 1.0898919 441 1996  52
      100 1.0898919 441 1996  53
      100 1.0898919 441 1996  54
      100 1.0898919 441 1996  55
      100 1.0898919 441 1996  56
        0 1.0898919 441 1996  57
      100 1.0898919 441 1996  58
        0 1.0898919 441 1996  59
        0 1.0898919 441 1996  60
      100 1.0898919 441 1996  61
        0 1.0898919 441 1996  62
      100 1.0898919 441 1996  63
        0 1.0898919 441 1996  64
        0 1.0898919 441 1996  65
      100 1.0898919 441 1996  66
        0 1.0898919 441 1996  67
        0 1.0898919 441 1996  68
        0 1.0898919 441 1996  69
      100 1.0898919 441 1996  70
      100 1.0898919 441 1996  71
      100 1.0898919 441 1996  72
        0 1.0898919 441 1996  73
        0 1.0898919 441 1996  74
      100 1.0898919 441 1996  75
        0 1.0898919 441 1996  76
        0 1.0898919 441 1996  77
        0 1.0898919 441 1996  78
      100 1.0898919 441 1996  79
        0 1.0898919 441 1996  80
        0 1.0898919 441 1996  81
        0 1.0898919 441 1996  82
        0 1.0898919 441 1996  83
      100 1.0898919 441 1996  84
        0 1.0898919 441 1996  85
      100 1.0898919 441 1996  86
      100 1.0898919 441 1996  87
        0 1.0898919 441 1996  88
        0 1.0898919 441 1996  89
      100 1.0898919 441 1996  90
      100 1.0898919 441 1996  91
        0 1.0898919 441 1996  92
        0 1.0898919 441 1996  93
        0 1.0898919 441 1996  94
        0 1.0898919 441 1996  95
      100 1.0898919 441 1996  96
        0 1.0898919 441 1996  97
        0 1.0898919 441 1996  98
        0 1.0898919 441 1996  99
        0 1.0898919 441 1996 100
      end
      format %tm yearmonth
      I notice that -dataex- sampled only data from the first survey (October 1996), so here is a sample from another year, just to give you an idea:

      Code:
      * Example generated by -dataex-. To install: ssc install dataex
      clear
      input byte CRIME float(index yearmonth year id)
        0 1.0685753 576 2008 9830
      100 1.0685753 576 2008 9831
        0 1.0685753 576 2008 9832
        0 1.0685753 576 2008 9833
        0 1.0685753 576 2008 9834
      100 1.0685753 576 2008 9835
        0 1.0685753 576 2008 9836
      100 1.0685753 576 2008 9837
        0 1.0685753 576 2008 9838
        0 1.0685753 576 2008 9839
        0 1.0685753 576 2008 9840
        0 1.0685753 576 2008 9841
        0 1.0685753 576 2008 9842
      100 1.0685753 576 2008 9843
        0 1.0685753 576 2008 9844
      100 1.0685753 576 2008 9845
        0 1.0685753 576 2008 9846
      100 1.0685753 576 2008 9847
        0 1.0685753 576 2008 9848
        0 1.0685753 576 2008 9849
      100 1.0685753 576 2008 9850
      100 1.0685753 576 2008 9851
      100 1.0685753 576 2008 9852
        0 1.0685753 576 2008 9853
        0 1.0685753 576 2008 9854
      100 1.0685753 576 2008 9855
      100 1.0685753 576 2008 9856
        0 1.0685753 576 2008 9857
        0 1.0685753 576 2008 9858
      100 1.0685753 576 2008 9859
      100 1.0685753 576 2008 9860
        0 1.0685753 576 2008 9861
        0 1.0685753 576 2008 9862
        0 1.0685753 576 2008 9863
        0 1.0685753 576 2008 9864
        0 1.0685753 576 2008 9865
        0 1.0685753 576 2008 9866
        0 1.0685753 576 2008 9867
        0 1.0685753 576 2008 9868
      100 1.0685753 576 2008 9869
        0 1.0685753 576 2008 9870
      100 1.0685753 576 2008 9871
        0 1.0685753 576 2008 9872
      100 1.0685753 576 2008 9873
      100 1.0685753 576 2008 9874
        0 1.0685753 576 2008 9875
        0 1.0685753 576 2008 9876
        0 1.0685753 576 2008 9877
      100 1.0685753 576 2008 9878
      100 1.0685753 576 2008 9879
        0 1.0685753 576 2008 9880
        0 1.0685753 576 2008 9881
        0 1.0685753 576 2008 9882
        0 1.0685753 576 2008 9883
        0 1.0685753 576 2008 9884
        0 1.0685753 576 2008 9885
        0 1.0685753 576 2008 9886
      100 1.0685753 576 2008 9887
        0 1.0685753 576 2008 9888
      100 1.0685753 576 2008 9889
        0 1.0685753 576 2008 9890
        0 1.0685753 576 2008 9891
      100 1.0685753 576 2008 9892
        0 1.0685753 576 2008 9893
        0 1.0685753 576 2008 9894
        0 1.0685753 576 2008 9895
      100 1.0685753 576 2008 9896
        0 1.0685753 576 2008 9897
        0 1.0685753 576 2008 9898
        0 1.0685753 576 2008 9899
        0 1.0685753 576 2008 9900
        0 1.0685753 576 2008 9901
      100 1.0685753 576 2008 9902
      100 1.0685753 576 2008 9903
        0 1.0685753 576 2008 9904
        0 1.0685753 576 2008 9905
      100 1.0685753 576 2008 9906
        0 1.0685753 576 2008 9907
      100 1.0685753 576 2008 9908
        0 1.0685753 576 2008 9909
        0 1.0685753 576 2008 9910
        0 1.0685753 576 2008 9911
        0 1.0685753 576 2008 9912
        0 1.0685753 576 2008 9913
        0 1.0685753 576 2008 9914
        0 1.0685753 576 2008 9915
        0 1.0685753 576 2008 9916
      100 1.0685753 576 2008 9917
        0 1.0685753 576 2008 9918
        0 1.0685753 576 2008 9919
        0 1.0685753 576 2008 9920
        0 1.0685753 576 2008 9921
      100 1.0685753 576 2008 9922
        0 1.0685753 576 2008 9923
        0 1.0685753 576 2008 9924
      100 1.0685753 576 2008 9925
        0 1.0685753 576 2008 9926
        0 1.0685753 576 2008 9927
        0 1.0685753 576 2008 9928
        0 1.0685753 576 2008 9929
      end
      format %tm yearmonth
      If there's anything else I can provide to clarify things further, please let me know. Thanks again!





      • #4
        OK. In your collapsed data, when you -xtset id yearmonth- and then -gen index_diff = D1.index-, you are asking Stata to calculate the difference between the value of index in the current observation and the value of index in the observation for the same id in the immediately preceding month. That is the definition of a first difference. If your data do not contain observations in consecutive months, the first difference is undefined and Stata gives you a missing value for index_diff. In your last two -dataex- excerpts there are only two values of yearmonth altogether, 441 and 576, which are not consecutive. If those are the only two values of yearmonth in your complete data set, then you will get all missing values for index_diff because every first difference is undefined. But I can't tell whether that is what your complete data set looks like or whether you have just picked the results for these two particular dates to exhibit.

        The issue is whether the complete series of dates for any given id contains consecutive values of yearmonth or not. A more helpful data example would be the output of -dataex if inlist(id, 1, 2, 3)-, which would show the complete series of values of yearmonth for each of ids 1, 2, and 3.
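
        In the meantime, a minimal sketch of the mechanics (index_diff is the name used above; the -list- is just for inspection):

        Code:
        xtset id yearmonth
        gen index_diff = D1.index
        * D1.index is missing wherever the same id has no observation at
        * yearmonth - 1, including each id's first observation
        list id yearmonth index index_diff, sepby(id)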



        • #5
          Hey Clyde,

          Here you go:

          Code:
          * Example generated by -dataex-. To install: ssc install dataex
          clear
          input float(crime_serious index yearmonth year id)
          37.634632 1.0898919 441 1996  1
           41.15658  1.141105 449 1997  2
          31.580814  .9819449 453 1997  3
           35.43958 1.1344688 475 1999  4
            39.7779 1.2987403 487 2000  5
           39.37875 1.1022217 488 2000  6
          32.872364  1.045117 521 2003  7
          35.538055  .7872596 522 2003  8
           38.24273  .8885489 523 2003  9
           35.79262  .9927688 524 2003 10
           39.30157  .7067459 539 2004 11
          36.767914 1.0929303 548 2005 12
           25.04893 1.0707874 572 2007 13
           34.76981 1.0773966 573 2007 14
           29.70381 1.0685753 576 2008 15
            27.0324  1.118886 580 2008 16
           31.63132  .9239349 584 2008 17
          23.623867  .7300239 597 2009 18
           28.98842  .7975035 598 2009 19
          34.304623 1.1477937 613 2011 20
           38.20615 1.0149189 614 2011 21
            34.5046 1.1804827 624 2012 22
           39.55238 1.3056893 648 2014 23
           41.03848 1.2751036 649 2014 24
           42.47158  1.369863 650 2014 25
           52.22675 1.8246716 662 2015 26
           48.12559  2.096708 667 2015 27
           47.23487 1.6774454 668 2015 28
           42.08379 1.5856438 669 2015 29
           57.32762  2.575059 686 2017 30
            64.2695 2.7088645 689 2017 31
          end
          format %tm yearmonth



          • #6
            (the above is the collapsed data)



            • #7
              We're going around in circles here. What you show in #5 is not what I asked for in #4 and it doesn't enable me to make progress on solving your problem. You have, in fact, just repeated the data you gave in #1, which, as already noted, is not suitable for calculating first differences within id's as there is only one observation per id. The closest you could come to that is to overlook the id variable and run:
              Code:
              tsset yearmonth
              gen index_diff = D1.index
              That will get you some values that are not missing--but plenty will also be missing because there are many instances where the yearmonth variable has gaps, so that the first difference is not defined for those observations. In the example data there are 20 missing values and 11 non-missing. And visually I can verify that this is exactly correct.
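
              Assuming the two commands above have been run, a quick tally confirms the split (a sketch):

              Code:
              count if missing(index_diff)     // 20 in the example data
              count if !missing(index_diff)    // 11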

              Bear in mind also that the first differences you do get take no account of id: this approach treats the entire data set as a single time series rather than as multiple series, one per id. Is that what you want? Or do you want separate results for each id?





              • #8
                Why would there be 20 missing values? I really just want the survey-to-survey changes for the months that are available: e.g., 41.15658 (1997m6) - 37.63 (1996m10), then 31.58 (1997m10) - 41.15658 (1997m6), and so on. Pardon my naivety, but is that not what first differences are?

                I tried the following and obtained such a variable with (necessarily) 1 missing value:

                Code:
                gen crimediff= crime_serious[_n]-crime_serious[_n-1]
                Which gives me the following variable:

                Code:
                * Example generated by -dataex-. To install: ssc install dataex
                clear
                input float(crimediff yearmonth)
                         . 441
                  3.521946 449
                 -9.575764 453
                  3.858765 475
                 4.3383217 487
                 -.3991547 488
                 -6.506382 521
                 2.6656914 522
                  2.704674 523
                -2.4501076 524
                  3.508953 539
                 -2.533661 548
                -11.718985 572
                  9.720881 573
                    -5.066 576
                -2.6714115 580
                 4.5989227 584
                 -8.007454 597
                  5.364552 598
                  5.316204 613
                 3.9015236 614
                 -3.701546 624
                  5.047783 648
                 1.4860916 649
                 1.4331093 650
                  9.755161 662
                -4.1011543 667
                 -.8907204 668
                 -5.151085 669
                 15.243832 686
                  6.941883 689
                end
                format %tm yearmonth


                Again, my ultimate goal is to test whether the correlation between the original two variables is spurious. Does this get me what I'm looking for? Thanks for your patience with me.



                • #9
                  Why would there be 20 missing values? I really just want the survey-to-survey changes for the months that are available: e.g., 41.15658 (1997m6) - 37.63 (1996m10), then 31.58 (1997m10) - 41.15658 (1997m6), and so on. Pardon my naivety, but is that not what first differences are?
                  No, that's not what is normally meant by first differences. The difference between 1996m10 and 1997m6 is eight months, not one month. But if what you want is just the difference between the consecutive available observations without regard to the time elapsed between them, then your code is basically on the right track. The only problem I see with it is that if you ran this command when the data were not sorted in chronological order you could be calculating differences between random pairs of observations. So if you precede that command by -sort yearmonth- then you have it.

                  And yes, if you do this process with both sequences and correlate the results, you will get an estimate of their association that removes any common trend due simply to the passage of time and is far less distorted by the autocorrelation within each series.
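
                  Putting the pieces together, a minimal sketch using the variable names from earlier in the thread:

                  Code:
                  sort yearmonth
                  gen crimediff = crime_serious - crime_serious[_n-1]
                  gen indexdiff = index - index[_n-1]
                  corr crimediff indexdiff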

