Several hundred regressions and Stata memory

FernandoRios

Join Date: Apr 2014

Posts: 2408
#16

03 Feb 2022, 20:37

I tried to understand your code, but I am not sure what you are trying to do.
Normally, for a bootstrap, you estimate your N models, obtain the coefficients, do your processing, and drop the estimation altogether (except for the coefficients).
That being said, I'm not sure you want to store your coefficients in memory, so an alternative is to save your estimation results to .ster files, which you can later restore for whatever processing you want to do.

Code:

forval i=1/300 {
    use sortedUzb.dta, clear
    bsample 1477
    qui probit owns_dwelling male age_group high_ed married divorced separated widowed log_monspending hh_size capital
    estimates save Uzb`i', replace
    qui probit owns_dwelling male age_group high_ed married divorced separated widowed log_monspending hh_size capital gen_trust
    estimates save UzbT`i', replace
}

Then, if you want to do something with a particular equation:

Code:

est use Uzb1
* some code
est use Uzb2
* some code
* etc.

Perhaps this will work. It will just create 600 new files in your working directory with the results of ALL your probits.
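
As a rough illustration of the "restore later" step, the sketch below loads one saved set of results and applies it to the comparison sample. The file name sortedKaz.dta is an assumption (only sortedUzb.dta appears above), and the sketch assumes the Kaz data contain the same covariates under the same names.

```stata
* Sketch only: sortedKaz.dta is an assumed file name for the comparison sample
use sortedKaz.dta, clear
estimates use Uzb1              // restore the first saved probit results
predict double owns_hat, pr     // predicted probabilities using Uzb coefficients
summarize owns_hat, meanonly
display "mean predicted probability: " r(mean)
```

Wrapping this in a forvalues loop over `i' would repeat it for each of the saved files.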

HTH
Clyde Schechter

Join Date: Apr 2014

Posts: 29792
#17

03 Feb 2022, 20:48

Certainly the solution in #16 will work. But creating 600 new files to store a lot of information, most of which, I think, is just not going to be used, doesn't make a lot of sense. Not to mention how slow that's going to be.

Like Fernando Rios, I cannot understand what O.P. is trying to do. Maybe we asked the wrong question. Maybe we should ask her what she would code if she didn't need to do any replications and just had to run through the code once. Exactly what would that code consist of? Alternatively, she could describe, or better still show, what the end results she wants would look like for just one iteration.
Farogat WIUT

Join Date: Feb 2018

Posts: 37
#18

04 Feb 2022, 15:56

Okay, the seminal paper for my research can be found under: https://www.semanticscholar.org/pape...daa457c31fb7b2

Here, the authors decompose the difference in stock ownership rates between two countries into so-called 'covariate' and 'coefficient' effects. The model is as follows:

pr_base - pr_i = (pr_base - pr_hat,base) + (pr_hat,base - pr_i)

base - base country (against which the comparison is made), i - comparison country

pr_base - participation rate in the base country, pr_i - participation rate in the comparison country.
Difference (1) is the covariate effect, meaning the difference that we would have if residents in the base country had the same set of characteristics as residents in the comparison country.
Difference (2) is the coefficient effect, meaning the difference that we would have if residents in the base country faced the same set of coefficients as in the comparison country.

What I am missing, and want to find, are the two differences on the right-hand side of the equation, i.e. the covariate and coefficient effects. But in order to find them, I need the counterfactual, that is, pr_hat,base. In the paper, the authors say: "We compute bootstrap standard errors by drawing (with replacement) from the full sample for both countries and repeating the estimation and decomposition two hundred times".

My goal is to use this model to check homeownership (owns_dwelling) and financial inclusion rates (instead of stock ownership, as in the paper above) in post-Soviet countries and a couple of countries from the developed world. I also want to see the pattern when I control for trust variables of households (both general trust, i.e. trust towards people, and specific trust, e.g. trust towards financial institutions). As you've seen, I have two regressions within a loop. The reason is that, if I want to compare the covariate and coefficient effects without and with trust variables, the observations in the samples should be the same, am I right? That's why I first resample Uzb 200 times with size 1477, run the probits each time, and store these 200 sets of coefficients. Then, I try to generate a sample of size 1477 from Kaz 200 times and apply the stored coefficients to predict owns_dwelling. Once I have 200 averaged yhats, i.e. predicted owns_dwelling for Kaz, I can have 200 differences for finding both the covariate and coefficient effects. And then I will be able to find the 95% CI.

Base country - Uzbekistan (Uzb), sample size is 1477, comparison country - Kazakhstan (Kaz), sample size is 971
1. Take a sample of Uzbekistan from the dataset, the size should be 1477
2. Do probit regression, store the coefficients
3. Draw a sample from Kaz dataset, with sample size of 1477 as well
4. Apply stored coefficients from (2)
5. Obtain yhats, average out yhat (the variable of interest) over 1477 observations
This is how we get a so-called counterfactual

Once we get this, we can have 2 differences:
1. Covariate effect: (pr_base - pr_hat,base). Since pr_base is fixed, i.e. comes from the dataset (Uzb), I create a local. I just need to subtract (5) from this base rate.
2. Coefficient effect: (pr_hat,base - pr_i). Again, pr_i is also fixed, from the dataset for the comparison country (Kaz); I need to subtract it from (5).
Why do I need those 200's? To do this process 200 times, and find the 95% confidence interval for these differences.
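
For concreteness, here is a minimal sketch of steps 1 to 5 and the two differences for a single pass over the full samples, with no resampling. It is only an illustration of the description above, not tested code: sortedKaz.dta is an assumed file name, and the scalar names pr_base, pr_i and pr_hat are chosen here to mirror the notation.

```stata
* Single pass, no resampling. sortedKaz.dta is an assumed file name.
use sortedUzb.dta, clear
summarize owns_dwelling, meanonly
scalar pr_base = r(mean)                 // observed rate, base country

probit owns_dwelling male age_group high_ed married divorced ///
    separated widowed log_monspending hh_size capital

use sortedKaz.dta, clear                 // estimation results stay in memory
summarize owns_dwelling, meanonly
scalar pr_i = r(mean)                    // observed rate, comparison country

predict double owns_hat, pr              // Uzb coefficients, Kaz covariates
summarize owns_hat, meanonly
scalar pr_hat = r(mean)                  // counterfactual rate

display "covariate effect   = " pr_base - pr_hat
display "coefficient effect = " pr_hat - pr_i
```

By construction the two displayed numbers sum to pr_base - pr_i, which matches the decomposition written earlier in this post.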

Again, back to the authors: "We compute bootstrap standard errors by drawing (with replacement) from the full sample for both countries and repeating the estimation and decomposition two hundred times". Maybe there's a simpler way to do this with bootstrap that I don't know of. I am also worried that, for comparison purposes, I have to run the probit regressions (2 or 3 regressions; in my loop now I have 2, the second one with gen_trust) on the same set of resampled observations.
Let me know if anything is not clear.

Thanks,
Farogat.

Clyde Schechter

Join Date: Apr 2014
Posts: 29792

#19

04 Feb 2022, 23:25

I'm not entirely sure I understand this, but I think I get the idea. I have no data set to really test this code on, and I haven't used -bootstrap- in a while, so I may be overlooking some subtleties, but I think what you want is something like this:

Code:

clear*
use sorted_Uzb
gen country = "Uzb"
append using sorted_Kaz
replace country = "Kaz" if missing(country)


capture program drop one_rep
program define one_rep, rclass
    probit owns_dwelling male age_group high_ed married divorced ///
        separated widowed log_monspending hh_size capital if country == "Uzb"
    predict owns_hat if country == "Kaz", pr
    summ owns_dwelling if country == Uzb, meanonly
    scalar pr_base = r(mean)
    summ owns_hat, meanonly
    scalar p_i_base = r(mean)
    return scalar covariate_effect = pr_base - p_i_base
    summ owns_dwelling if country == "Kaz", meanonly
    return scalar coefficient_effect = p_i_base - r(mean)

    probit owns_dwelling male age_group high_ed married divorced ///
        separated widowed log_monspending hh_size capital gen_trust ///
        if country == "Uzb"
    drop owns_hat
    predict owns_hat if country == "Kaz", pr
    summ owns_dwelling if country == Uzb, meanonly
    scalar pr_base = r(mean)
    summ owns_hat, meanonly
    scalar p_i_base = r(mean)
    return scalar covariate_effect_trust = pr_base - p_i_base
    summ owns_dwelling if country == "Kaz", meanonly
    return scalar coefficient_effect_trust = p_i_base - r(mean)
    exit
end

bootstrap covariate_effect = r(covariate_effect) ///
    coefficient_effect = r(coefficient_effect) ///
    covariate_effect_trust = r(covariate_effect_trust) ///
    coefficient_effect_trust = r(coefficient_effect_trust), ///
    strata(country) reps(200) seed(1234) saving(runs, replace): one_rep

For the -seed()- option you can pick any integer you like, it doesn't have to be 1234. The -saving()- option will create a new data set, runs.dta, containing the four scalars returned by program one_rep in each iteration. I'm not sure you really need it, but you might come up with other uses for the individual replicate results.
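
If the goal is a 95% interval for each effect, one possible follow-up is sketched below. It assumes the -bootstrap- command above has just run successfully, and that runs.dta stores the replicates under the names given in the expression list (covariate_effect and so on).

```stata
* Percentile-based intervals from the bootstrap results currently in memory
estat bootstrap, percentile

* Or compute them from the saved replicates (this replaces the data in memory)
use runs, clear
_pctile covariate_effect, percentiles(2.5 97.5)
display "95% percentile CI: [" r(r1) ", " r(r2) "]"
```

The second route is where the -saving()- file becomes useful, since the same replicates can be summarized in other ways later.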

Note that I am not specifying the N's for the bootstrap sampling. Neither -bootstrap- nor -bsample- will allow you to sample 1477 cases from a data set with 971 observations. And while I don't have a deep understanding of the bootstrap statistically, I don't think it would be valid to do that. So you'll have to settle for 1477 from Uzbekistan and 971 from Kazakhstan. Since those would be the sizes of the strata in the combined data set that is created at the beginning of the above code, there is no need to specify a sampling size: you will get those N's by default.
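
A quick way to see what the stratified resampling does is to take one draw by hand. This is only a sketch, assuming the combined Uzb and Kaz data set with the country variable is in memory.

```stata
preserve
bsample, strata(country)   // one bootstrap draw within each country
tabulate country           // counts should equal the original strata sizes
restore
```

With the sample sizes quoted earlier in the thread, the table would be expected to show 1477 observations for Uzb and 971 for Kaz.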

I am not saving the actual probit regression coefficients anywhere in this code--there is no reason I can see for doing that, as once the predicted probabilities are calculated, they are not needed for anything else.

I hope this helps.

Last edited by Clyde Schechter; 04 Feb 2022, 23:38.


Farogat WIUT

Join Date: Feb 2018

Posts: 37
#20

06 Feb 2022, 08:42

Clyde, thank you very much! I have tried the code, but the bootstrap part does not seem to work:

type mismatch
an error occurred when bootstrap executed one_rep
r(109);

end of do-file

r(109);

Does it have something to do with the country variable, which is a string?
William Lisowski

Join Date: Dec 2014

Posts: 10150
#21

06 Feb 2022, 09:04

Looking at the code and comparing similar lines shows us the obvious oversight

Code:

summ owns_dwelling if country == Uzb, meanonly

which should have been

Code:

summ owns_dwelling if country == "Uzb", meanonly

since elsewhere we see

Code:

if country == "Uzb"

I leave it to you to check the code thoroughly to see if there are any other similar oversights.
