Only use selected observations

rick ert

Join Date: Jul 2017

Posts: 7
#1

26 Jul 2017, 04:19

Hi everybody,

I have quite a large dataset (700K observations and around 20 variables), but I'm only interested in a small part of it. I combined firm financial specifics (Compustat) and director specifics (ISS), and I'm trying to select only firms that appointed directors in two or more different years.

I'm a complete beginner with Stata, but I tried identifying firms by CUSIP code and director appointments by the year a director started (directorsince). However, I can't figure out the commands for selecting only this subset of my database. I could be totally wrong, but I think I need a command that does something like:

only keep observations if directorsince is for at least two different years and the same CUSIP code (firm)

Hopefully I described my problem clearly enough for someone to be able to help me!
Matthew Breckons

Join Date: Jul 2015

Posts: 38
#2

26 Jul 2017, 04:55

Rick - can you create an additional binary variable that flags the firms you want? What format does the 'directorsince' variable take? And if a firm appointed directors in two or more different years, would there be multiple years in the same variable, or would each director have a 'directorsince' value associated with them?
Comment
rick ert

Join Date: Jul 2017

Posts: 7
#3

26 Jul 2017, 05:12

Hi Matthew,

Thanks for replying! The format of the directorsince variable is years (type: double, format %ty). The directorsince variable is linked to the firm (CUSIP code), so it would show like:

cusip (firm ID) ------ directorsince
123456 ------------ 2007
123456 ------------ 2009
123456 ------------ 2011

This would indicate that the same firm hired directors in multiple different years. However, my database contains a lot of observations of firms that hired no directors at all, hired in only one year, or hired several directors in the same year.

Hope this makes it a bit clearer.
Comment

eric_a_booth

Join Date: Apr 2014
Posts: 286

26 Jul 2017, 06:04

Hi Rick - I don't think this is clear. It would help if you showed an actual example of the data, since we don't know what your variable names mean or how your data are structured. Here's an example with a few options that might help:

Code:

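* create some fake data: up to 5 firms, one observation per firm-year
clear
set obs 60
g firm = int(runiform()*5)
bys firm: g year = 2000+_n
* directorsince is non-missing in roughly 15% of firm-years
g directorsince = year if runiform()<.15

* option 1: keep firms with at least 2 non-missing values of directorsince
* (count() counts observations, so this matches "2 different years" only
* when each year appears at most once per firm, as in this fake data)
bys firm (year): egen select = count(directorsince)
keep if select >=2

* option 2 (an alternative to option 1): keep firms whose earliest and
* latest appointment years differ; min() and max() ignore missing values
bys firm (year): egen firstdirector = min(directorsince)
bys firm (year): egen lastdirector = max(directorsince)
keep if firstdirector!=lastdirector & !mi(lastdirector)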

Eric A. Booth | Senior Director of Research | Far Harbor | Austin TX

Comment

rick ert

Join Date: Jul 2017

Posts: 7
#5

26 Jul 2017, 07:04

Hi Eric,

Thanks a lot for trying to help me out. I copied the following part of my dataset:

Code:

cusip      dirsince
03600T104  2004
03600T104  2012
98933Q108  2010
50060P106  2004
50060P106  2008
98933Q108  2016
247361702  2010
910047109  2010
37045V100  2009
053332102  2002
584688105  2003

So from the part of the dataset I copied, I only want to keep the CUSIP codes that have two or more dirsince observations in different years.
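
One way to make this criterion concrete is to count the distinct appointment years per firm and keep firms with a count of two or more. The following is only a minimal sketch using the sample rows above; the variable names first_in_year and n_years are hypothetical, and it assumes cusip is stored as a string and dirsince as a numeric year.

```stata
* Minimal sketch: count distinct appointment years per firm.
* first_in_year and n_years are hypothetical names, not from the thread.
clear
input str9 cusip int dirsince
"03600T104" 2004
"03600T104" 2012
"98933Q108" 2010
"50060P106" 2004
"50060P106" 2008
"98933Q108" 2016
"247361702" 2010
"910047109" 2010
"37045V100" 2009
"053332102" 2002
"584688105" 2003
end

* tag() marks exactly one observation per distinct cusip-year pair
* (observations with a missing dirsince are not tagged)
egen first_in_year = tag(cusip dirsince)

* summing the tags within a firm gives its number of distinct years
bysort cusip: egen n_years = total(first_in_year)

* keep every observation of firms with two or more distinct years
keep if n_years >= 2
list cusip dirsince n_years, sepby(cusip)
```

With these eleven rows, only the firms 03600T104, 50060P106 and 98933Q108 should remain, each with two observations. Keeping the explicit n_years variable also makes it easy to change the threshold later.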
Comment
William Lisowski

Join Date: Dec 2014

Posts: 10150
#6

26 Jul 2017, 08:08

Perhaps this will do what you need.

Code:

by cusip (dirsince), sort: keep if dirsince[1]!=dirsince[_N]
Comment
rick ert

Join Date: Jul 2017

Posts: 7
#7

26 Jul 2017, 10:31

Originally posted by William Lisowski

Perhaps this will do what you need.

Code:

by cusip (dirsince), sort: keep if dirsince[1]!=dirsince[_N]

What exactly does the command do, William? So far it seems to work.
Comment
William Lisowski

Join Date: Dec 2014

Posts: 10150
#8

26 Jul 2017, 11:37

The keys to understanding the command are

the by prefix with the sort option will sort the data by cusip and dirsince, and then run the command following the colon separately for each value of cusip

within each by-group, dirsince[1] is the first observation of dirsince and dirsince[_N] is the last; because the data are sorted by dirsince within cusip, these are the earliest and latest years for that cusip

the keep command will thus keep those observations where the cusip has at least two different values of dirsince
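
To see these points in action, here is a small sketch on made-up rows (the data values and the names first_year, last_year, min_year and max_year are assumptions for illustration, not part of the original dataset). It also shows one caveat worth checking in real data: missing values sort last in Stata, so a firm with one year plus a missing dirsince would pass the dirsince[1]!=dirsince[_N] test. If missing values are possible, egen's min() and max(), as in Eric's earlier example, ignore them.

```stata
* Illustrative data: firm A has two different years, firm B has the
* same year twice, firm C has one year plus a missing value.
clear
input str9 cusip int dirsince
"03600T104" 2004
"03600T104" 2012
"247361702" 2010
"247361702" 2010
"584688105" 2003
"584688105" .
end

* bysort is shorthand for by ..., sort. Inside each cusip group,
* [1] is the first observation and [_N] the last one.
bysort cusip (dirsince): generate first_year = dirsince[1]
bysort cusip (dirsince): generate last_year = dirsince[_N]
list, sepby(cusip)

* Firm C shows first_year = 2003 and last_year = . (missing sorts last),
* so comparing first_year and last_year directly would keep it.
* min() and max() skip missing values, which avoids that.
bysort cusip: egen min_year = min(dirsince)
bysort cusip: egen max_year = max(dirsince)
keep if min_year != max_year
list cusip dirsince, sepby(cusip)
```

If dirsince is never missing in your data, the one-line command and this longer version should select the same firms.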

But you should thoroughly read the output of help by since it is an important tool in Stata and you will need it repeatedly in your work.

I'm sympathetic to you as a new user of Stata - it's a lot to absorb. And even worse if perhaps you are under pressure to produce some output quickly.

When I began using Stata in a serious way, I started, as have others here, by reading my way through the Getting Started with Stata manual relevant to my setup. Chapter 18 then gives suggested further reading, much of which is in the Stata User's Guide, and I worked my way through much of that reading as well. There are a lot of examples to copy and paste into Stata's do-file editor to run yourself, and better yet, to experiment with changing the options to see how the results change.

All of these manuals are included as PDFs in the Stata installation (since version 11) and are accessible from within Stata - for example, through the PDF Documentation section of Stata's Help menu. The objective in doing the reading was not so much to master Stata (several years later and I still won't claim mastery) as to be sure I'd become familiar with a wide variety of important basic techniques, so that when the time came that I needed them, I might recall their existence, if not the full syntax, and know how to find out more about them in the help files and PDF manuals.

Stata supplies exceptionally good documentation that amply repays the time spent studying it - there's just a lot of it. The path I followed surfaces the things you need to know to get started in a hurry and to work effectively.
Comment
rick ert

Join Date: Jul 2017

Posts: 7
#9

26 Jul 2017, 14:51

Thanks a lot for your in-depth explanation, William!
Comment
