Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Monte Carlo - Simulations: random samples, record means

    Hello,

    I have a dataset with the variable "LeadLevels" (continuous numerical variable, cannot be a negative number, 1104 observations). From this column, I would like to select 10 random samples (without replacement during selection), take the mean of these 10 samples, and then repeat many times (let's say 1000 times). I would like the output means saved in a new document.

    So far, Monte Carlo simulation appears to be the best choice. However, I am struggling with coding the command portion (select 10 random samples, average, record result).

    . simulate mean, LeadLevel, reps(1000) saving(/Users/caitlynbest/Desktop/Stata output.dta, replace): sample 10, count
    invalid 'reps'
    r(198);

    Any assistance is much appreciated.

    Thank you,

    Caitlyn

  • #2
    you have two commas before the colon- you should have only one; but there appear to be other problems: (1) you haven't told the command where to find the mean; (2) why are you mentioning "LeadLevel" as part of the -simulate- part of the command? see
    Code:
    help simulate

    Comment


    • #3
      Thank you for the quick reply.

      I am relatively new to Stata, and unfortunately there is no drop-down menu for simulate. I referred to LeadLevel within the -simulate- part of the command because that is how the command is worded if using permute through the drop-down options (which I also tried, unsuccessfully). To simplify things, I removed all other columns from my data set for now, so that LeadLevel is the only column.

      At this time, I don't know how to write the command to use the sample pool generated to calculate the mean. I think I need to use r(mean) for this?

      Given this example in the help,

      . simulate exp list, reps(#): command

      I would think that the following code would work:

      simulate r(mean), reps(1000) saving(/Users/caitlynbest/Desktop/Stata output.dta, replace) : sample 10, count
      command: sample 10, count
      _sim_1: r(mean)

      However, I have "." in each of the output cells, which should be the recorded averages of each of the pools.

      At this point, I am very happy to have the code run - even if there is no usable output.

      Any other suggestions for my new / modified code?

      Thank you very much

      Caitlyn

      Comment


      • #4
        -simulate- could not find r(mean) because the command you gave it after the colon, Stata's built-in -sample- command, just takes a sample and does not calculate and return a result called r(mean). The fact that you referred to "r(mean)" didn't tell -simulate- "I'd like you to calculate a mean." Instead, it told -simulate- "Each time you run my program, something will be done to calculate and make visible to you a result called r(mean)." You need to create a program that takes the sample you want, runs a program to obtain the mean you want, and puts that mean into what Stata calls the "return list" of your program, which makes the result visible to -simulate-. -simulate- does not by default know how to do *anything* except repeatedly run your program and keep its results. The same is true of -permute- and -bootstrap-: They can repeatedly do what you tell them and keep the results, but they don't calculate means and the like. The calculations are up to users to create by calling existing Stata programs or by writing their own. Take a look at some of the examples in the PDF documents for -simulate- with this idea in mind, and I think they will make better sense.

        Below, I show an example of how to do what you want. I have written my code to be relatively clear but not to be flexible or particularly efficient. I'm just trying to show something close to how you were trying to proceed. I suspect that if we knew *why* you want to do what you're asking for here, there might be a different way to get to that end.

        Code:
        clear
        // Create example data for illustration, which you don't
        // need because you already have the data.
        set seed 4643
        set obs 1104
        gen LeadLevel = 100 * runiform()
        // end creating data
        //
        // Define a program to do what you want
        capture program drop mymean10
        program mymean10, rclass
        preserve // because you'll want your data back
        sample 10, count
        // Get the mean from the sample you just took
        summarize LeadLevel 
        // Assign the value obtained from the summarize
        // command to the return list of -mymean10-
        return scalar meanfor10 = r(mean)
        // Bring back all of your original data so it can be sampled again
        restore
        end
        //
        // Have -simulate- run your program 1000 times and
        // accumulate the result called r(meanfor10)
        simulate r(meanfor10), reps(1000) saving("YourOutputFile.dta"): mymean10

        Comment


        • #5
          Thank you, Mike! I think your solution works - is it appropriate to change the third line to -gen LeadLevel = 1 * runiform()- ? When this line is set to 100, the means seemed to be 100 x greater than I expected, but within expected limits when I change this to 1.

          A little more background: I have 1104 observations that represent lead concentrations in normal animals. A small amount of background lead is common and considered normal. I am trying to see if pooled testing would work as a more efficient way to rule out a large number of normal animals with fewer tests (and therefor less expense). Because normal animals rarely have a lead concentration of zero, I am trying to asses the frequency of which 'false positive' pools may occur. The 'threshold' for a positive pool is 0.1/n, where n is the pool size - in this case, 10 samples. My reasoning here is that if I can make pools of 10 normal animals, then see what the mean lead level is for that pool and repeat 1000 times, I can compare the results to the threshold and identify the proportion of positive pools expected ... so that I can create an expected cost for these false positive pools.

          I really appreciate your help and input. I have taken graduate stats courses using Stata, however we did not cover these specific commands. The help reference material is useful, however the process of writing the code is still relatively new to me. Lots to learn!

          Comment


          • #6
            Per my comment lines above, the first part is just my creation of some fake data to illustrate a solution, and is not something for you to use. Changing that third line is irrelevant because you have your own data and will not be using my code that creates fake data. I just picked "100 * runiform()" out the clear blue sky as some convenient numbers to work with.

            So, you want to know how often would the mean of a sample of 10 would be expected to exceed some threshold level when drawn from a population whose mean is at some normal level? If it's reasonable to assume that the lead levels in a population follow a Gaussian (i.e, "normal" in the statistical sense), then simulation here is not needed, as ordinary statistical theory would provide a hand-calculable answer, based on the t-distribution. However, there's certainly nothing wrong with simulating to find a result. The motivation for simulating here should be: 1) You believe your sample of 1104 gives a good representation of the mean, standard deviation, and shape of the distribution of lead levels in a population with no diseased individuals and 2) You think that the population distribution of lead levels does not have a Gaussian/normal shape.

            Comment


            • #7
              Hello! I have used this discussion to help me out with my issue, it has been really helpful, thanks! I have a coding that results in a point estimate. My goal is to use Monte Carlo simulation to repeat the procedur "n" times, obtain a mean of the estimation and thus a more robust value. To this concern I used the simulate command and since i have the same sample always I relied on bsample to use a given percentage of the population with replacement. I am not pretty sure about it because I am having a quite high value of standard errors in the final output of the mean even when I use 99% of the original sample.

              Code:
              use "${gsdClean}/Pseudo_panel_yearly.dta", clear            
              cd "${gsdOut}/3. Partial correlation coefficient"    
              
              ************************************************************************************************************************************
              ************************************************************************************************************************************
              **** PROGRAM ******************************************************************************************************************
              
              capture program drop myboot
              program myboot, rclass
              
              *** Regression inputs and other parameters
              local indvars = "f age age2 ed marital munder_3 tot4_24 older_65 i.region"            
              local depvar = "pci"            
              
              *** Age range
              local agevar = "age"            
              scalar min_age = 23            
              scalar max_age = 55
              
              *** Survey years
              local yearvar = "year"            
                              
              *** Other parameters
              local cohvar = "age region nivel_ed_a"                    
              local idvar = "id"            
              
              *** Survey settings
              svyset _n [weight=weight]        
              local weight = "weight"            
              
              preserve
              bsample round(0.99*_N)
              
              local j = 1
              local i=2003
                      local year1 = `i'
                      local year2 = `i' + 1        
                      local diff = `year2'-`year1'
                      
                      
                      collapse `depvar', by(`cohvar' `yearvar')                    
                      replace `agevar' = `agevar' - `diff' if `yearvar' == `year2'    
                      reshape wide `depvar', i(`cohvar') j(`yearvar')              
              
                      corr `depvar'`year1' `depvar'`year2' if `agevar' >= min_age & `agevar' <= max_age
                      scalar rho_s = r(rho)    
              
                      restore, preserve
                      
                      **** DROP OBS OUTSIDE OF AGE RANGE
                      
                      drop if (`yearvar' == `year1' & `agevar' < min_age) | (`yearvar' == `year1' & `agevar' > max_age)
                      drop if (`yearvar' == `year2' & `agevar' < (min_age +`diff')) | (`yearvar' == `year2' & `agevar' > (max_age +`diff' ))
              
                      **** INCOME MODEL *************************************************************
                      **** POINT ESTIMATES OF CONDITIONAL AND UNCONDITIONAL PROBABLITIES
                          
                      *** regressions to obtain inputs for estimation of rho and probabilities bivariate normal distribution
              
                      *** regression year 1    
                      svy: reg `depvar' `indvars' if `yearvar' == `year1'    
                      
                      predict fitinc`year1'_x`year1' if e(sample) == 1, xb                            
                      predict fitinc`year1'_x`year2' if e(sample) == 0 & `yearvar' ==`year2', xb        
                      matrix b1 = e(b)                                                        
                  
                      predict res`year1' if e(sample), residuals                                    
                      sum res`year1' if `yearvar' == `year1' [aw=`weight']
                      scalar sd_res`year1' = r(sd)                                                
                      
                      sum `depvar' if e(sample) [aw=`weight']
                      scalar vary1 = r(Var)                                                  
                      
                      *** regression year 2                
                      svy: reg `depvar' `indvars' if `yearvar' == `year2'
              
                      predict fitinc`year2'_x`year2' if e(sample) == 1, xb                            
                      matrix b2 = e(b)                                                      
                      
                      predict res`year2' if e(sample), residuals                                  
                      sum res`year2' if `yearvar' == `year2' [aw=`weight']
                      scalar sd_res`year2' = r(sd)                                                
              
                      sum `depvar' if e(sample) [aw=`weight']
                      scalar vary2 = r(Var)                                                    
                      
                      ** calculate variance-covariance matrix for explanatory variables in year 2
                      matrix accum VARX2 = `indvars' if e(sample), deviations noconstant        
                      mat VARX2 = VARX2/(r(N)-1)                                              
                      mat zero= J(1, colsof(e(b))- 1, 0)                                        
                      mat VARX2= VARX2\zero                                                  
                      mat zero= J(colsof(e(b)),1,0)                                            
                      mat VARX2= VARX2, zero                                                    
                      
                      mat li VARX2                                                            
                                          
                      ***** ESTIMATION OF PARTIAL CORRELATION COEFFICIENT RHO **********************
                      
                      matrix RHO_HAT_`year1'_`year2' = (rho_s*sqrt(vary1*vary2)-b1*VARX2*b2')/(sd_res`year1'*sd_res`year2')        // eq. 5 DL 2019
                      mat li RHO_HAT_`year1'_`year2'    
                      
                      return scalar rho_hat = RHO_HAT_`year1'_`year2'[1,1]
                      
                      restore, preserve
                  
              end
                      
                  simulate rho_bs = r(rho_hat), reps(15): myboot
                  summarize
              Am I inserting the bsample command in the correct place?

              Comment

              Working...
              X