How can I reduce the computational cost of this estimation?
I am analysing infant mortality due to pollution. For this purpose I use birth microdata in which the outcome equals 1 if the birth ends in mortality (death within 24 hours of birth, or late fetal death).
My instrumental variable is the accumulated or average level of a particular pollutant (PM2.5 or PM10). My regression also includes meteorological controls and characteristics of the mother.
I am conducting this analysis at the municipal district level (my sample contains about 8,000 municipalities). Each municipality has its own measurement of accumulated pollution over the weeks of gestation of the fetus or newborn child, up to the date of death.
My empirical strategy is implemented with the following Stata code:
set maxvar 30000
set min_memory 8g
set niceness 0
set matsize 10000
use "C:\ ...", clear
set dp comma
describe
xi: reg birthclassification accpm10 sexofbirth multiplebirth studylevelsmother yearsmother yearsmother2 foreignnationality monthlyrainfall monthlyt2 monthlyraint2 monthlyrainfall2 monthlyt22 i.monthbirth*i.year i.monthbirth*i.capm
birthclassification: mortality classification (the outcome)
accpm10: accumulated PM10 pollutant
sexofbirth: control for the sex of the newborn, dummy equal to 1 if the child is female
multiplebirth: dummy equal to 1 if the birth is multiple
studylevelsmother: dummy equal to 1 if the mother has higher education
yearsmother: age of the mother at birth
yearsmother2: square of the mother's age at birth
foreignnationality: dummy equal to 1 if mother is native to the corresponding country
monthlyrainfall: monthly average of accumulated rainfall
monthlyt2: monthly average of accumulated temperatures
monthlyraint2: interaction between average rainfall and temperature
monthlyrainfall2: square of average rainfall
monthlyt22: square of average temperatures
Finally, I have included two interactions in my regression. The first is monthbirth x year, written as "i.monthbirth*i.year", with which I hope to capture seasonal patterns and unobserved heterogeneity. The second interacts municipalities with months, written as "i.monthbirth*i.capm", to capture seasonal patterns that differ across municipalities on a monthly basis.
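To make the size of the second interaction concrete, here is a rough count of the indicator variables that xi would create for "i.monthbirth*i.capm". It is only a sketch: it assumes 12 months and exactly 8,000 municipalities (the sample has "about 8,000"), and that every month and municipality appears in the data.

```stata
* Rough count of the dummies xi creates for i.monthbirth*i.capm.
* Assumed values, not taken from the data: 12 months, 8,000 municipalities.
local months = 12
local munis  = 8000

* (months - 1) month dummies + (munis - 1) municipality dummies
* + (months - 1)*(munis - 1) interaction dummies
display (`months' - 1) + (`munis' - 1) + (`months' - 1) * (`munis' - 1)
```

Under these assumptions the count is close to 96,000 regressors, well above both the 10000 set for matsize and the 30000 set for maxvar in the code above.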
My problem is the time the regression takes to run. There are 2,623,692 observations in total (the study period is 2009-2016), and the estimation is extremely costly, taking days. Is there an obvious error in the procedure I am following, or could the specification itself be defective? I suspect that the interaction of municipalities with months is what generates the enormous computational cost. If so, is there any way to reduce it? I look forward to your comments on how to overcome this obstacle.
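One possible way to avoid creating the dummies explicitly is to absorb the two sets of fixed effects instead of expanding them with xi. The following is a minimal sketch, not a tested solution: it assumes the community-contributed command reghdfe (not part of official Stata) and its dependency ftools can be installed from SSC, and it has not been run on this dataset.

```stata
* Assumption: reghdfe and ftools are community-contributed packages
* that must be installed once from SSC before first use.
ssc install ftools
ssc install reghdfe

* Same outcome and controls as the xi: reg command, but the month-by-year
* and month-by-municipality cells are absorbed rather than entered as dummies.
reghdfe birthclassification accpm10 sexofbirth multiplebirth ///
    studylevelsmother yearsmother yearsmother2 foreignnationality ///
    monthlyrainfall monthlyt2 monthlyraint2 monthlyrainfall2 monthlyt22, ///
    absorb(monthbirth#year monthbirth#capm)
```

Absorbing the full monthbirth#year and monthbirth#capm cells also covers the month, year and municipality main effects that "i.monthbirth*i.year" and "i.monthbirth*i.capm" generate, so the coefficients on accpm10 and the other controls should be comparable. The absorbed effects themselves are not reported in the output.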
Thank you
JC