Dear Statalist:
I have a dataset of 1.3 million observations of partnerships between two sets of organizations, i and j, where tie==1 indicates that i and j have a partnership and tie==0 indicates that they do not. For most i-j observations tie==0, because most organizations do not partner with each other.

For analytical purposes, I need to keep all the i-j observations with tie==1 and sample a subset of the i-j observations with tie==0. Because I need the subsample to be stratified by a key grouping variable, the syntax I am using is bysort group: sample n if tie != 1, count. This results in a dataset that includes all the original observations with tie==1 and n observations with tie==0 within each group.

THE PROBLEM is that I need a subsample of n tie==0 observations for every observation with tie==1. In other words, for an organization i with two partnerships, I need a subsample that includes the two tie==1 observations and also n*2 observations with tie==0 (i.e. 14 zeros if the sampling n=7); likewise, if an organization has three partnerships, I need the subsample to include the 3 tie==1 observations and 21 tie==0 observations.

Intuitively, the syntax I'd want is bysort group: sample n*realized_count if tie != 1, count, where "realized_count" is a variable I've generated for the total number of ties for each organization i. It seems that I can't use the multiplication operator (*) or a variable name (realized_count) with sample, so this syntax won't work. Alternatively, it seems like I should be able to accomplish my aim with some kind of foreach or forvalues loop, but I can't get that to work either. Any advice?
Thank you in advance,
Ebony
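One possible way to get n zeros per realized tie without sample is to put the tie==0 rows of each organization in random order and keep the first n*realized_count of them. The following is only a minimal sketch, not a tested solution for the real data. It assumes the strata are the organizations i (add group to the bysort list if the draw should also be stratified by group) and that realized_count is constant and nonmissing within i. The toy data, the seed, and the helper variables u and draw are hypothetical and only there so the sketch runs on its own.

```stata
* Toy data (hypothetical): 3 organizations i, 100 candidate partners j each
clear
set seed 12345
set obs 300
generate int i = ceil(_n / 100)
generate int j = mod(_n - 1, 100) + 1

* Organization 1 has 2 ties, organization 2 has 3, organization 3 has 1
generate byte tie = (i == 1 & j <= 2) | (i == 2 & j <= 3) | (i == 3 & j == 1)

* Total number of ties per organization, as described in the post
bysort i: egen realized_count = total(tie)

* Number of zeros to keep per realized tie
local n = 7

* Random sort key; within each organization, number the tie==0 rows in random order
generate double u = runiform()
bysort i tie (u): generate long draw = _n if tie == 0

* Keep every tie==1 row plus the first n*realized_count randomly ordered zeros
* (a missing realized_count would keep all zeros, so check for that first)
keep if tie == 1 | draw <= `n' * realized_count
drop u draw

tabulate i tie
```

With the toy data above, organization 1 keeps its 2 ones plus 14 zeros, organization 2 keeps 3 ones plus 21 zeros, and organization 3 keeps 1 one plus 7 zeros. If an organization has fewer than n*realized_count zeros available, this approach simply keeps all of them.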