Missing Data in Panel Data Set - How to derive a function from existing data to add missing values

Tim Tolkmitt

Join Date: Jun 2018

Posts: 9
#1

12 Jun 2018, 12:41

Dear all,

I have a dataset with around 800 objects (videos). For 500 of the 800 objects I have cumulative data (views) over 72 periods (not all starting or ending in the same period). I have the 'total' value after 72 periods for all objects. From the existing data I'd like to derive a function that tells me what percentage of the total value is reached after period X, so that I can fill in the missing values. I'm not sure what the best way to do this is.

So far I have calculated the % of total for each object and month and then averaged over all videos. That does not seem very rigorous to me. The values also sometimes drop, which makes no sense for cumulative data; this happens because objects drop out. Any help is very welcome.

Thanks a lot in advance!

Last edited by Tim Tolkmitt; 12 Jun 2018, 12:44.
Tags: fitting function, missing values, panel data
Carlos Teigimiz

Join Date: May 2016

Posts: 26
#2

12 Jun 2018, 18:01

Hi Tim,
If I understand your question correctly, you would like to see how many views all videos have in period t, and what percentage that is of the total number of views of all videos together after 72 periods.
So maybe, instead of calculating the average % of total for each object, you should calculate the total number of views over all objects by period.

Before that, you might want to take care of the missing values. The code below does not work correctly if some videos drop out of the time series.
For this, check out the command - tsfill -, after you tsset your data (e.g. tsset id period).

If you already have the total value after 72 periods for each object, you should use it and generate a variable storing that number of views.

Code:

sort period
* Total number of views of all videos in a specific period
by period: egen total_views_allvideos = total(views)
* Reference point: the maximum over all periods (72 is the last period)
egen total_views_period72 = max(total_views_allvideos)
* Share of the reference point reached in each period
gen pc_reached = total_views_allvideos / total_views_period72

Last edited by Carlos Teigimiz; 12 Jun 2018, 18:11.
Tim Tolkmitt

Join Date: Jun 2018

Posts: 9
#3

13 Jun 2018, 00:44

Hi Carlos,

thanks a lot for your help. Unfortunately I have already tested your approach, and the results are worse than calculating the percent of total per video and then averaging over all videos per period.
I am looking for a method to derive a function that approximates the development of "% of total views" over time by taking each value from the panel data into account, not just the averages. I have not found one so far.

Here is what it looks like:
Carlos Teigimiz

Join Date: May 2016

Posts: 26
#4

13 Jun 2018, 10:53

Hi Tim,

maybe you could try - ssc install carryforward -.
In combination with - tsfill -, this lets you "ignore" that some videos drop out over time.

Alternatively, something like this should do the same:

Code:

* after applying - tsfill -
bysort object (period): replace views = views[_n-1] if missing(views)
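
A minimal end-to-end sketch of the tsfill and carry-forward steps, run on made-up toy data. The variable names id, period and views are assumptions following the tsset example earlier in the thread, and the numbers are purely illustrative.

```stata
* Toy data (hypothetical): video 2 has no row for period 2
clear
input id period views
1 1 10
1 2 25
1 3 30
2 1 5
2 3 40
end

* Declare the panel structure, then add rows for gaps inside each video's range
tsset id period
tsfill

* Carry the last observed cumulative count forward into the new rows
bysort id (period): replace views = views[_n-1] if missing(views)

list, sepby(id)
```

Carrying the last value forward assumes that a video gained no views during the gap, so for cumulative data it is a lower bound rather than an estimate. Note also that tsfill on its own only fills gaps inside each video's observed range; tsfill, full would pad every video to the overall first and last period, and rows before a video's first observation would stay missing after the replace step.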

Best,
Carlos
Tim Tolkmitt

Join Date: Jun 2018

Posts: 9
#5

13 Jun 2018, 11:51

Hi Carlos,
thanks again! But what would be an appropriate method to derive a function that most accurately reflects the trend of the view count over time? So far I have fitted
. nl (percent_of_total_views = {b0=1}*ln(month)+{b1=0.1})
to the "% per video, averaged over all videos per period" series (blue line). But I think this might not be the best solution, as the variance between the videos is not taken into account.
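
One possible way to use every video's observations instead of the per-period averages is to fit the same log curve to the long-format panel and cluster the standard errors by video. The sketch below runs on synthetic data, so the id variable, the number of videos and the noise level are all assumptions made only to keep it runnable.

```stata
* Synthetic long-format panel (hypothetical): 20 videos observed for 72 months
clear
set seed 12345
set obs 20
gen id = _n
expand 72
bysort id: gen month = _n

* Made-up share of total views, kept inside [0, 1]
gen percent_of_total_views = min(1, max(0, 0.1 + 0.21*ln(month) + rnormal(0, 0.03)))

* Same functional form as above, fitted to all video-month rows;
* vce(cluster id) allows for correlation within a video
nl (percent_of_total_views = {b0=1}*ln(month) + {b1=0.1}), vce(cluster id)

* Fitted share for every row, including rows where the outcome is missing
predict pct_hat
```

With real data the videos would have different numbers of observed months, so this fit weights each video-month equally rather than each period average. Nothing in this functional form forces the fitted curve to stay between 0 and 1 or to reach 100% at month 72, which is worth checking before using it to fill in missing values.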

Last edited by Tim Tolkmitt; 13 Jun 2018, 12:08.
Carlos Teigimiz

Join Date: May 2016

Posts: 26
#6

13 Jun 2018, 13:25

Unfortunately, I am not sure I can be of any help with that.

However, I do believe that your procedure
1. Calculating the % of maximum views reached for each individual video in each period
2. Averaging [1] over all videos in a period
is not statistically sound. You should work with the actual number of views for each video, aggregate it over all videos in a period, and then relate it to the aggregate maximum number of views. Working with percentages and then averaging them does not seem to be the right procedure; it will bias your results.
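
A small made-up illustration of how far the two procedures can diverge. The two videos and their counts are hypothetical: video A has reached 10 of its final 100 views, video B 900 of its final 1000.

```stata
* Procedure 1: average the per-video percentages
display "average of percentages: " (10/100 + 900/1000)/2

* Procedure 2: aggregate the views first, then take the percentage
display "aggregate percentage:   " (10 + 900)/(100 + 1000)
```

The first gives 50%, the second about 83%, because the aggregate weights each video by its view count while the average of percentages weights every video equally.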