Dropping certain observations based on classifications and rankings of time-series output

Ben Wiggins

Join Date: May 2017

Posts: 10
#1


17 Jul 2017, 11:39

Good afternoon, everyone.

I have applied a "reshape long" to the dataset at bottom and would like to be able to perform the following actions for each worker.

(With my apologies, the project tags are inconsistent in this test data set in order to properly represent the filing inconsistencies we're dealing with in the data set it represents.)

Actions desired:

1) If "project" is not tagged for any of a worker's observations, leave the observations for that worker as is. (This will be most of them.)
2) If "project" contains values for one of the years, but not both, then drop the year *with* project values. (i.e. drop all Year 1 observations for Worker 3 below)
3) If "project" contains values for both years, then apply the following for all situations, regardless of whether the projects are "A" and "B", "A" and no value, or "B" and no value:

    If one project's output is higher than the other project's output for all years observed, keep the worker but drop all year-observations for the lower-output project for that worker.
        Ex: Drop Worker 5's project B, since its output is always lower than the output of the project with no letter specification. Keep the observations with no letter specification.
        Ex: Drop Worker 13's project A, since its output is always lower than project B for worker 13. Keep the observations for project B.
        Ex: Drop Worker 17's project with no letter specification, since its value is lower than Project A for all years observed. Keep the observations for Project A.

    If there is conflict as to which project is higher output (i.e. the values cross) or there is a tie at any time, drop all years with project values.
        Ex: In the case of worker 10, there is a tie between project A and the project of no letter specification in year 2. Drop both year 1 and year 2 for worker 10. (If there were an additional year 3 observation with no project specifications, we would keep it per step #2 above)
        Ex: In the case of worker 12, project A has higher output in the first year but project B has higher output in the second year. Drop year 1 and year 2 for worker 12.
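
One way rule 3 could be sketched in Stata, shown on a four-worker slice of the example data in which every worker-year carries a project value. The helper variables maxout, istop, ntop, anytie, firsttop, ntopproj and conflict are hypothetical names, and the project with no letter is treated as a project in its own right. This is an illustration under those assumptions, not a tested solution for the full data set.

```stata
* Sketch only: helper variable names are hypothetical.
clear
input byte(worker year) str1 project byte output
5  1 ""  12
5  1 "B" 10
5  2 ""  12
5  2 "B" 11
10 1 "A" 50
10 1 ""  10
10 2 "A" 55
10 2 ""  55
12 1 "A" 40
12 1 "B" 30
12 2 "A" 45
12 2 "B" 50
13 1 "A" 60
13 1 "B" 65
13 2 "A" 65
13 2 "B" 70
end

* Highest output within each worker-year, and how many rows reach it
bysort worker year: egen maxout = max(output)
gen byte istop = output == maxout
bysort worker year: egen ntop = total(istop)

* A tie in any year counts as a conflict for the whole worker
bysort worker: egen byte anytie = max(ntop > 1)

* Count the distinct projects that are ever on top for the worker
bysort worker istop project: gen byte firsttop = (_n == 1) & istop
bysort worker: egen ntopproj = total(firsttop)

* Conflict: a tie somewhere, or the top project changes between years
gen byte conflict = anytie | ntopproj > 1

* Assumption: every worker-year in this slice carries project values
drop if conflict
* Otherwise keep only the consistently higher project
drop if !istop
drop maxout istop ntop anytie firsttop ntopproj conflict
list, sepby(worker)
```

On this slice, workers 10 and 12 are dropped entirely, while workers 5 and 13 keep one row per year. On the full data, both drop commands would need to be restricted to worker-years that carry project values.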

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input byte(worker year) str1 project byte output
1 1 "" 25
1 2 "" 63
2 1 "" 16
2 2 "" 32
3 1 "A" 50
3 1 "B" 5
3 2 "" 90
4 1 "" 28
4 2 "" 5
5 1 "" 12
5 1 "B" 10
5 2 "" 12
5 2 "B" 11
6 1 "" 92
6 2 "" 71
7 1 "" 67
7 2 "" 94
8 1 "" 100
8 2 "" 36
9 1 "" 20
9 2 "" 16
10 1 "A" 50
10 1 "" 10
10 2 "A" 55
10 2 "" 55
11 1 "" 63
11 2 "" 26
12 1 "A" 40
12 1 "B" 30
12 2 "A" 45
12 2 "B" 50
13 1 "A" 60
13 1 "B" 65
13 2 "A" 65
13 2 "B" 70
14 1 "" 49
14 2 "" 16
15 1 "" 10
15 2 "" 55
16 1 "" 55
16 2 "" 47
17 1 "" 70
17 1 "A" 90
17 2 "" 75
17 2 "A" 90
18 1 "" 90
18 2 "" 64
19 1 "" 86
19 2 "" 16
20 1 "" 76
20 2 "" 74
end

Please let me know if any revisions or clarification are needed, and thank you for your time.
Phil Bromiley

Join Date: Apr 2014

Posts: 4348
#2

18 Jul 2017, 12:18

Thank you for providing data. Providing your current Stata code in code delimiters and readable Stata output would also help. It would also show that you'd really tried to solve this problem before asking for help.

So, your first few questions. I suspect you're working backwards - it is probably easier to apply the conditions in the opposite order to what you're doing.

1) If "project" is not tagged for any of a worker's observations, leave the observations for that worker as is. (This will be most of them.)

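* tag = 1 for observations that carry a project letter, missing otherwise
g tag=1 if project=="A" | project=="B"
* hastag = 1 on every observation of a worker with at least one tagged observation
bysort worker: egen hastag=mean(tag)
* numtag = number of tagged observations for the worker
bysort worker: egen numtag=count(tag)
g keep=hastag
*this ids workers with a tag for any observation.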

You can probably work from here.
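
As a rough illustration of how the worker-level flag might be extended to the second condition (tags in some years but not all), here is a minimal sketch on a three-worker slice of the example data. The variables tagged, yeartagged, anytag and alltag are hypothetical names, and the sketch assumes that any nonblank project counts as a tag.

```stata
* Sketch only: helper variable names are hypothetical.
clear
input byte(worker year) str1 project byte output
1  1 ""  25
1  2 ""  63
3  1 "A" 50
3  1 "B" 5
3  2 ""  90
13 1 "A" 60
13 1 "B" 65
13 2 "A" 65
13 2 "B" 70
end

* Assumption: any nonblank project is a tag
gen byte tagged = project != ""
* Does any observation in this worker-year carry a tag?
bysort worker year: egen byte yeartagged = max(tagged)
* Worker level: at least one tagged year, and every year tagged
bysort worker: egen byte anytag = max(yeartagged)
bysort worker: egen byte alltag = min(yeartagged)

* Condition 1 (anytag == 0): nothing to do, the worker is left as is
* Condition 2: tags in some years but not all, so drop the tagged years
drop if anytag == 1 & alltag == 0 & yeartagged == 1
list, sepby(worker)
```

Worker 1 is untouched, worker 3 loses both year 1 rows, and worker 13 (tagged in every year) is left for the third condition.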
Ben Wiggins

Join Date: May 2017

Posts: 10
#3

24 Jul 2017, 14:14

I appreciate your patience, Phil Bromiley. I'm mostly self-taught and have a lot of trouble with this program sometimes. Even the simplest workarounds can drive me crazy. This is partly because Stata is an open sandbox until you know what command to search the help function for. If you have any other resources to recommend, I'm more than happy to use those first.

I've implemented what you suggested. Here is the code I was using, with your suggestions worked in. I think I've captured all the intent of what you did; let me know if not.

The only remaining problem is dropping Worker 12 (all observations) since the top output changes hands. How would I go about dropping all observations for a worker, if all years don't have the same top project?

To make that part of the problem more clear, I've added another two years of observation for Worker 12 (and would need to solve the problem as though there could be dozens of years of observation for any given worker).

So, at this point, if any of a worker's "years" contain a project letter (for the top projects, which we've kept), then every year's letter must be the same, or *all* observations for that worker must be dropped.

Code for reference:

Code:

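* Flag observations that carry a project letter
gen tag = 1 if project == "A" | project == "B"
* Flag every observation in a worker-year that has at least one letter
bysort worker year: egen yearhastag = mean(tag)
* Sort so the highest output comes first within each worker-year
gsort worker year -output
by worker year: gen top = 1 if _n == 1
* In tagged years, drop everything except the top-output observation
gen to_drop = 1 if yearhastag == 1 & top != 1
drop if to_drop == 1
drop to_drop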

And the dataset, after implementing the above code, with the additional observations for worker 12.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input byte(worker year) str1 project byte output double(tag yearhastag top)
1 1 "" 25 . . 1
1 2 "" 63 . . 1
2 1 "" 16 . . 1
2 2 "" 32 . . 1
3 1 "A" 50 1 1 1
3 2 "" 90 . . 1
4 1 "" 28 . . 1
4 2 "" 5 . . 1
5 1 "" 12 . 1 1
5 2 "" 12 . 1 1
6 1 "" 92 . . 1
6 2 "" 71 . . 1
7 1 "" 67 . . 1
7 2 "" 94 . . 1
8 1 "" 100 . . 1
8 2 "" 36 . . 1
9 1 "" 20 . . 1
9 2 "" 16 . . 1
10 1 "A" 50 1 1 1
10 2 "A" 55 1 1 1
11 1 "" 63 . . 1
11 2 "" 26 . . 1
12 1 "A" 40 1 1 1
12 2 "B" 50 1 1 1
12 3 "A" 55 1 1 1
12 4 "A" 60 1 1 1
13 1 "B" 65 1 1 1
13 2 "B" 70 1 1 1
14 1 "" 49 . . 1
14 2 "" 16 . . 1
15 1 "" 10 . . 1
15 2 "" 55 . . 1
16 1 "" 55 . . 1
16 2 "" 47 . . 1
17 1 "A" 90 1 1 1
17 2 "A" 90 1 1 1
18 1 "" 90 . . 1
18 2 "" 64 . . 1
19 1 "" 86 . . 1
19 2 "" 16 . . 1
20 1 "" 76 . . 1
20 2 "" 74 . . 1
end
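
The remaining step (drop a worker entirely when the kept top project letter is not the same in every year) might be sketched as follows, on a three-worker slice of the data after the top-project step. The variables firstobs and nletters are hypothetical names, and the sketch assumes that blank tags are ignored when letters are compared, so a worker with "A" in one year and no letter in another is not flagged. It should work for any number of years per worker.

```stata
* Sketch only: firstobs and nletters are hypothetical helper names.
clear
input byte(worker year) str1 project byte output
3  1 "A" 50
3  2 ""  90
12 1 "A" 40
12 2 "B" 50
12 3 "A" 55
12 4 "A" 60
13 1 "B" 65
13 2 "B" 70
end

* Mark the first row of each nonblank worker-project group
bysort worker project: gen byte firstobs = (_n == 1) & project != ""
* Count the distinct nonblank letters per worker
bysort worker: egen nletters = total(firstobs)

* Assumption: blank tags are ignored when comparing letters
* More than one letter means the top project changed hands
drop if nletters > 1
drop firstobs nletters
list, sepby(worker)
```

Here worker 12 (letters A, B, A, A) is removed in full, while workers 3 and 13 are kept.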