Hi,
I have data about students in schools for several years. I can identify the peers of non-Western origin and I would like to see the effect of exposure to a higher share of non-Western peers on the native students' outcomes.
I would like to compare kernel density plots of the standard deviation of the within-school, between-year share of non-Western peers, once using the actual data and once using simulated data.
For the simulation, I use a binomial distribution to randomly assign a non-Western indicator to the peers. I then calculate the within-school, between-year share of non-Western peers using the simulated indicators. I repeat this 1000 times.
When I plot the kernel densities of the standard deviations of the actual and the simulated share of non-Western peers, the right tail of the simulated density is much longer than that of the actual data. In other words, the standard deviation of the simulated share has more outliers than the standard deviation of the actual share.
Now my question is: How can I tell my simulation not to go higher than the max standard deviation of the actual share of peers?
My code is as follows:
Code:
clear all
cd "use"

* Empty file that will collect the simulated standard deviations
clear
gen share_sd = .
save mc_sd_empty.dta, replace

* Actual data: share of non-Westerns by school and year
use mc_data, clear
bys school year: egen share_real = mean(non_western)
bys school: egen overall_share = mean(non_western)
* Drop the ones with zero variation in share of non-Westerns
drop if overall_share == 0
drop overall_share

* Calculating the sd from actual data, for natives
keep if native == 3
bys school year: keep if _n == 1
* Gen sd in schools to later draw the kernel for
bys school: egen share_sd = sd(share_real)
keep share_sd
save actual_sd.dta, replace

* Prepare simulation data
use mc_data, clear
* Drop the ones with zero variation in share of non-Westerns
bys school: egen overall_share = mean(non_western)
drop if overall_share == 0
drop overall_share
* Mean of non-Westerns in each school to use in simulation
bys school: egen p_nonwestern = mean(non_western)
save data_ready, replace

* Program
capture program drop mc
program define mc, rclass
    use data_ready, clear
    * Randomly assign the peers an immigrant status based on binomial
    bys school: gen rand_cohort = rbinomial(1, p_nonwestern) if native != 3
    bys school year: egen share_simulated = mean(rand_cohort)
    keep if native == 3
    bys school year: keep if _n == 1
    bys school: egen share_sd = sd(share_simulated)
    keep share_sd
    * Accumulate the results of every replication in mc_sd.dta
    append using mc_sd.dta
    save mc_sd.dta, replace
end

copy mc_sd_empty.dta mc_sd.dta, replace
set seed 1234
simulate share_sd, reps(1000): mc

* Compare the actual and simulated distributions
use actual_sd, clear
append using mc_sd, gen(simulated)
twoway (kdensity share_sd if simulated==0) (kdensity share_sd if simulated==1)
My data looks like this:
Code:
* Example generated by -dataex-. For more info, type help dataex
clear
input float(id year school native non_western)
 1 2001 100 1 0
 1 2002 100 1 0
 2 2001 100 0 1
 3 2004 101 1 0
 3 2005 101 1 0
 4 2001 100 1 0
 4 2002 100 1 0
 4 2003 100 1 0
 5 2004 101 0 1
 6 2005 101 1 0
 6 2006 101 1 0
 6 2007 101 1 0
 7 2002 100 1 0
 7 2003 100 1 0
 7 2004 100 1 0
 8 2002 100 0 1
 8 2003 100 0 1
 9 2005 101 0 0
10 2005 101 1 0
10 2006 101 1 0
10 2007 101 1 0
end
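To make the actual-data step concrete, here is a minimal sketch of it on the example data above. It is a simplification, not my full code: it keeps every school-year cell instead of only those that contain natives, and it assumes the example data is already in memory. The names share_real and share_sd simply follow the code above.

```stata
* Sketch only: assumes the -dataex- example above has been loaded
preserve
* Share of non-Western students in each school-year cell
collapse (mean) share_real = non_western, by(school year)
* Standard deviation of that share across years, within each school
collapse (sd) share_sd = share_real, by(school)
list school share_sd
restore
```

This leaves one share_sd value per school, which is the quantity whose distribution I compare between the actual and the simulated data.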