Hi Statalist,
I'm writing with a logistical challenge that there may be an elegant fix for. If someone has run across this and figured it out, I am hoping to benefit from your expertise.
I am outputting a large amount of results to Excel. The results are in the format of medians and IQRs of a continuous variable, according to the values of one binary variable. I'm also presenting these results in a stratified format across a large number of different underlying factors. (I know--this is not the way I love to do things, but it was requested by a stakeholder.)
I've adjusted my code for use with the NLSW88 dataset, with two tweaks in comments below. It seems to work almost exactly the way I want it:
The only issue I am running into is this: Because I am slicing and dicing the data in some cases very finely, I am finding that sometimes there are no observations for individuals where married = yes and c_city == yes. This means that of course, tabstat can't calculate the median wage for individuals where married = yes and c_city == yes. So when I call the matrices r(Stat1) and r(Stat2) after tabstat, only one of those matrices is actually called.
My binary variable is coded as 1 = yes and 2 = no (something that would be very, very confusing to change on the fly given the amount of other code and statistics being provided). This means that in cases where I do have observations for both levels, r(Stat1) represents statistics for c_city = yes, and r(Stat2) represents statistics for c_city = no. BUT, in cases where I don't have any observations for c_city = yes, r(Stat1) instead represents c_city = no.
I know that is probably really confusing to read. Here is (roughly) what the Excel output looks like, if that helps:
The table is reporting a value that seems like it is the median/IQR wage for married individuals living in a center city. But in reality, this is the median/IQR wage for married individuals NOT living in a center city. Because there were no married individuals living in a center city, r(Stat2) became r(Stat1) and now r(Stat2) doesn't exist at all.
This is a long-winded lead-up to my question, which is: Is there any way that I can tell Stata to reorganize these r(Stat) matrices, or reorganize the naming convention, according to missing values in the dataset? Or is there any other way to get around this? Assuming there are not 5, but 50, levels of $desc_var, is there any way for me to do this without running it, seeing which are missing, and then manually coding a separate chunk for those strata that are "one-sided"? I've tried thinking about matlist and other matrix-speciifc language, but Stata doesn't seem to like an attempt to call something that doesn't exist.
This is a weird question I'm having trouble clearly explaining, so please let me know if I can clarify. I place myself at the mercy of Statalist.
I'm writing with a logistical challenge that there may be an elegant fix for. If someone has run across this and figured it out, I am hoping to benefit from your expertise.
I am outputting a large amount of results to Excel. The results are in the format of medians and IQRs of a continuous variable, according to the values of one binary variable. I'm also presenting these results in a stratified format across a large number of different underlying factors. (I know--this is not the way I love to do things, but it was requested by a stakeholder.)
I've adjusted my code for use with the NLSW88 dataset, with two tweaks in comments below. It seems to work almost exactly the way I want it:
Code:
sysuse nlsw88, clear //recoding c_city to assist with my example replace c_city = 2 if c_city == 0 //recoding central city values to assist with my example replace c_city = 2 if married == 1 cd "H:\Location_of_my_output" putexcel set my_excel_output.xlsx, replace putexcel A1 = ("Characteristic") B1 = ("Central city median wage") C1 = ("P25") D1 = ("P75") F1 = ("Non central city median wage") G1 = ("P25") H1 = ("P75") J1 = ("Overall median wage") K1 = ("P25") L1 = ("P75") global desc_vars married never_married collgrad south smsa local j 3 tokenize $desc_vars forvalues i = 1/5 { tabstat wage if ``i'' == 1, by(c_city) statistics(median p25 p75) save return list putexcel A`j' = matrix(r(Stat1)'), rownames nformat(number_d2) putexcel E`j' = matrix(r(Stat2)'), rownames nformat(number_d2) putexcel I`j' = matrix(r(StatTotal)'), rownames nformat(number_d2) local j = `j' + 1 } local i 3 foreach var in $desc_vars{ local label: var lab `var' putexcel A`i' = "`label'" local ++i }
The only issue I am running into is this: Because I am slicing and dicing the data in some cases very finely, I am finding that sometimes there are no observations for individuals where married = yes and c_city == yes. This means that of course, tabstat can't calculate the median wage for individuals where married = yes and c_city == yes. So when I call the matrices r(Stat1) and r(Stat2) after tabstat, only one of those matrices is actually called.
My binary variable is coded as 1 = yes and 2 = no (something that would be very, very confusing to change on the fly given the amount of other code and statistics being provided). This means that in cases where I do have observations for both levels, r(Stat1) represents statistics for c_city = yes, and r(Stat2) represents statistics for c_city = no. BUT, in cases where I don't have any observations for c_city = yes, r(Stat1) instead represents c_city = no.
I know that is probably really confusing to read. Here is (roughly) what the Excel output looks like, if that helps:
Characteristic | Central city median wage | P25 | P75 | Non central city median wage | P25 | P75 | Overall median wage | P25 | P75 |
married | 6.21 | 4.27 | 9.29 | r1 | 6.21 | 4.27 | 9.29 | ||
never_married | 6.28 | 4.33 | 9.57 | 7.21 | 4.90 | 11.61 | 6.82 | 4.35 | 11.03 |
collgrad | 5.79 | 4.14 | 7.98 | 11.27 | 7.50 | 14.26 | 9.68 | 6.63 | 12.44 |
The table is reporting a value that seems like it is the median/IQR wage for married individuals living in a center city. But in reality, this is the median/IQR wage for married individuals NOT living in a center city. Because there were no married individuals living in a center city, r(Stat2) became r(Stat1) and now r(Stat2) doesn't exist at all.
This is a long-winded lead-up to my question, which is: Is there any way that I can tell Stata to reorganize these r(Stat) matrices, or reorganize the naming convention, according to missing values in the dataset? Or is there any other way to get around this? Assuming there are not 5, but 50, levels of $desc_var, is there any way for me to do this without running it, seeing which are missing, and then manually coding a separate chunk for those strata that are "one-sided"? I've tried thinking about matlist and other matrix-speciifc language, but Stata doesn't seem to like an attempt to call something that doesn't exist.
This is a weird question I'm having trouble clearly explaining, so please let me know if I can clarify. I place myself at the mercy of Statalist.
Comment