Monte Carlo - Simulations: random samples, record means

Caitlyn Best

Join Date: Dec 2020

Posts: 3
#1

Monte Carlo - Simulations: random samples, record means

04 Dec 2020, 14:34

Hello,

I have a dataset with the variable "LeadLevels" (continuous numerical variable, cannot be a negative number, 1104 observations). From this column, I would like to select 10 random samples (without replacement during selection), take the mean of these 10 samples, and then repeat many times (let's say 1000 times). I would like the output means saved in a new document.

So far, Monte Carlo simulation appears to be the best choice. However, I am struggling with coding the command portion (select 10 random samples, average, record result).

. simulate mean, LeadLevel, reps(1000) saving(/Users/caitlynbest/Desktop/Stata output.dta, replace): sample 10, count
invalid 'reps'
r(198);

Any assistance is much appreciated.

Thank you,

Caitlyn
Tags: None
Rich Goldstein

Join Date: Mar 2014

Posts: 4412
#2

04 Dec 2020, 15:35

you have two commas before the colon- you should have only one; but there appear to be other problems: (1) you haven't told the command where to find the mean; (2) why are you mentioning "LeadLevel" as part of the -simulate- part of the command? see

Code:

help simulate
Comment
Caitlyn Best

Join Date: Dec 2020

Posts: 3
#3

04 Dec 2020, 19:19

Thank you for the quick reply.

I am relatively new to Stata, and unfortunately there is no drop-down menu for simulate. I referred to LeadLevel within the -simulate- part of the command because that is how the command is worded if using permute through the drop-down options (which I also tried, unsuccessfully). To simplify things, I removed all other columns from my data set for now, so that LeadLevel is the only column.

At this time, I don't know how to write the command to use the sample pool generated to calculate the mean. I think I need to use r(mean) for this?

Given this example in the help,

. simulate exp list, reps(#): command

I would think that the following code would work:

simulate r(mean), reps(1000) saving(/Users/caitlynbest/Desktop/Stata output.dta, replace) : sample 10, count
command: sample 10, count
_sim_1: r(mean)

However, I have "." in each of the output cells, which should be the recorded averages of each of the pools.

At this point, I am very happy to have the code run - even if there is no usable output.

Any other suggestions for my new / modified code?

Thank you very much

Caitlyn
Comment
Mike Lacy

Join Date: Apr 2014

Posts: 2396
#4

04 Dec 2020, 23:18

-simulate- could not find r(mean) because the command you gave it after the colon, Stata's built-in -sample- command, just takes a sample and does not calculate and return a result called r(mean). The fact that you referred to "r(mean)" didn't tell -simulate- "I'd like you to calculate a mean." Instead, it told -simulate- "Each time you run my program, something will be done to calculate and make visible to you a result called r(mean)." You need to create a program that takes the sample you want, runs a program to obtain the mean you want, and puts that mean into what Stata calls the "return list" of your program, which makes the result visible to -simulate-. -simulate- does not by default know how to do *anything* except repeatedly run your program and keep its results. The same is true of -permute- and -bootstrap-: They can repeatedly do what you tell them and keep the results, but they don't calculate means and the like. The calculations are up to users to create by calling existing Stata programs or by writing their own. Take a look at some of the examples in the PDF documents for -simulate- with this idea in mind, and I think they will make better sense.

Below, I show an example of how to do what you want. I have written my code to be relatively clear but not to be flexible or particularly efficient. I'm just trying to show something close to how you were trying to proceed. I suspect that if we knew *why* you want to do what you're asking for here, there might be a different way to get to that end.

Code:

clear // Create example data for illustration, which you don't // need because you already have the data. set seed 4643 set obs 1104 gen LeadLevel = 100 * runiform() // end creating data // // Define a program to do what you want capture program drop mymean10 program mymean10, rclass preserve // because you'll want your data back sample 10, count // Get the mean from the sample you just took summarize LeadLevel // Assign the value obtained from the summarize // command to the return list of -mymean10- return scalar meanfor10 = r(mean) // Bring back all of your original data so it can be sampled again restore end // // Have -simulate- run your program 1000 times and // accumulate the result called r(meanfor10) simulate r(meanfor10), reps(1000) saving("YourOutputFile.dta"): mymean10
1 like
Comment
Caitlyn Best

Join Date: Dec 2020

Posts: 3
#5

06 Dec 2020, 14:14

Thank you, Mike! I think your solution works - is it appropriate to change the third line to -gen LeadLevel = 1 * runiform()- ? When this line is set to 100, the means seemed to be 100 x greater than I expected, but within expected limits when I change this to 1.

A little more background: I have 1104 observations that represent lead concentrations in normal animals. A small amount of background lead is common and considered normal. I am trying to see if pooled testing would work as a more efficient way to rule out a large number of normal animals with fewer tests (and therefor less expense). Because normal animals rarely have a lead concentration of zero, I am trying to asses the frequency of which 'false positive' pools may occur. The 'threshold' for a positive pool is 0.1/n, where n is the pool size - in this case, 10 samples. My reasoning here is that if I can make pools of 10 normal animals, then see what the mean lead level is for that pool and repeat 1000 times, I can compare the results to the threshold and identify the proportion of positive pools expected ... so that I can create an expected cost for these false positive pools.

I really appreciate your help and input. I have taken graduate stats courses using Stata, however we did not cover these specific commands. The help reference material is useful, however the process of writing the code is still relatively new to me. Lots to learn!
Comment
Mike Lacy

Join Date: Apr 2014

Posts: 2396
#6

06 Dec 2020, 22:56

Per my comment lines above, the first part is just my creation of some fake data to illustrate a solution, and is not something for you to use. Changing that third line is irrelevant because you have your own data and will not be using my code that creates fake data. I just picked "100 * runiform()" out the clear blue sky as some convenient numbers to work with.

So, you want to know how often would the mean of a sample of 10 would be expected to exceed some threshold level when drawn from a population whose mean is at some normal level? If it's reasonable to assume that the lead levels in a population follow a Gaussian (i.e, "normal" in the statistical sense), then simulation here is not needed, as ordinary statistical theory would provide a hand-calculable answer, based on the t-distribution. However, there's certainly nothing wrong with simulating to find a result. The motivation for simulating here should be: 1) You believe your sample of 1104 gives a good representation of the mean, standard deviation, and shape of the distribution of lead levels in a population with no diseased individuals and 2) You think that the population distribution of lead levels does not have a Gaussian/normal shape.
Comment

Nicolas Larrea

Join Date: Jan 2023
Posts: 14

31 Jan 2024, 02:26

Hello! I have used this discussion to help me out with my issue, it has been really helpful, thanks! I have a coding that results in a point estimate. My goal is to use Monte Carlo simulation to repeat the procedur "n" times, obtain a mean of the estimation and thus a more robust value. To this concern I used the simulate command and since i have the same sample always I relied on bsample to use a given percentage of the population with replacement. I am not pretty sure about it because I am having a quite high value of standard errors in the final output of the mean even when I use 99% of the original sample.

Code:

use "${gsdClean}/Pseudo_panel_yearly.dta", clear            
cd "${gsdOut}/3. Partial correlation coefficient"    

************************************************************************************************************************************
************************************************************************************************************************************
**** PROGRAM ******************************************************************************************************************

capture program drop myboot
program myboot, rclass

*** Regression inputs and other parameters
local indvars = "f age age2 ed marital munder_3 tot4_24 older_65 i.region"            
local depvar = "pci"            

*** Age range
local agevar = "age"            
scalar min_age = 23            
scalar max_age = 55

*** Survey years
local yearvar = "year"            
                
*** Other parameters
local cohvar = "age region nivel_ed_a"                    
local idvar = "id"            

*** Survey settings
svyset _n [weight=weight]        
local weight = "weight"            

preserve
bsample round(0.99*_N)

local j = 1
local i=2003
        local year1 = `i'
        local year2 = `i' + 1        
        local diff = `year2'-`year1'
        
        
        collapse `depvar', by(`cohvar' `yearvar')                    
        replace `agevar' = `agevar' - `diff' if `yearvar' == `year2'    
        reshape wide `depvar', i(`cohvar') j(`yearvar')              

        corr `depvar'`year1' `depvar'`year2' if `agevar' >= min_age & `agevar' <= max_age
        scalar rho_s = r(rho)    

        restore, preserve
        
        **** DROP OBS OUTSIDE OF AGE RANGE
        
        drop if (`yearvar' == `year1' & `agevar' < min_age) | (`yearvar' == `year1' & `agevar' > max_age)
        drop if (`yearvar' == `year2' & `agevar' < (min_age +`diff')) | (`yearvar' == `year2' & `agevar' > (max_age +`diff' ))

        **** INCOME MODEL *************************************************************
        **** POINT ESTIMATES OF CONDITIONAL AND UNCONDITIONAL PROBABLITIES
            
        *** regressions to obtain inputs for estimation of rho and probabilities bivariate normal distribution

        *** regression year 1    
        svy: reg `depvar' `indvars' if `yearvar' == `year1'    
        
        predict fitinc`year1'_x`year1' if e(sample) == 1, xb                            
        predict fitinc`year1'_x`year2' if e(sample) == 0 & `yearvar' ==`year2', xb        
        matrix b1 = e(b)                                                        
    
        predict res`year1' if e(sample), residuals                                    
        sum res`year1' if `yearvar' == `year1' [aw=`weight']
        scalar sd_res`year1' = r(sd)                                                
        
        sum `depvar' if e(sample) [aw=`weight']
        scalar vary1 = r(Var)                                                  
        
        *** regression year 2                
        svy: reg `depvar' `indvars' if `yearvar' == `year2'

        predict fitinc`year2'_x`year2' if e(sample) == 1, xb                            
        matrix b2 = e(b)                                                      
        
        predict res`year2' if e(sample), residuals                                  
        sum res`year2' if `yearvar' == `year2' [aw=`weight']
        scalar sd_res`year2' = r(sd)                                                

        sum `depvar' if e(sample) [aw=`weight']
        scalar vary2 = r(Var)                                                    
        
        ** calculate variance-covariance matrix for explanatory variables in year 2
        matrix accum VARX2 = `indvars' if e(sample), deviations noconstant        
        mat VARX2 = VARX2/(r(N)-1)                                              
        mat zero= J(1, colsof(e(b))- 1, 0)                                        
        mat VARX2= VARX2\zero                                                  
        mat zero= J(colsof(e(b)),1,0)                                            
        mat VARX2= VARX2, zero                                                    
        
        mat li VARX2                                                            
                            
        ***** ESTIMATION OF PARTIAL CORRELATION COEFFICIENT RHO **********************
        
        matrix RHO_HAT_`year1'_`year2' = (rho_s*sqrt(vary1*vary2)-b1*VARX2*b2')/(sd_res`year1'*sd_res`year2')        // eq. 5 DL 2019
        mat li RHO_HAT_`year1'_`year2'    
        
        return scalar rho_hat = RHO_HAT_`year1'_`year2'[1,1]
        
        restore, preserve
    
end
        
    simulate rho_bs = r(rho_hat), reps(15): myboot
    summarize

Am I inserting the bsample command in the correct place?

Announcement

Monte Carlo - Simulations: random samples, record means

Comment

Comment

Comment

Comment

Comment

Comment