Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Dealing with missing values when outputting results to Excel using tabstat

    Hi Statalist,

    I'm writing with a logistical challenge that there may be an elegant fix for. If someone has run across this and figured it out, I am hoping to benefit from your expertise.

    I am outputting a large amount of results to Excel. The results are in the format of medians and IQRs of a continuous variable, according to the values of one binary variable. I'm also presenting these results in a stratified format across a large number of different underlying factors. (I know--this is not the way I love to do things, but it was requested by a stakeholder.)

    I've adjusted my code for use with the NLSW88 dataset, with two tweaks in comments below. It seems to work almost exactly the way I want it:

    Code:
    sysuse nlsw88, clear
    
    //recoding c_city to assist with my example
    replace c_city = 2 if c_city == 0
    //recoding central city values to assist with my example
    replace c_city = 2 if married == 1
    
    cd "H:\Location_of_my_output"
    putexcel set my_excel_output.xlsx, replace
    putexcel A1 = ("Characteristic") B1 = ("Central city median wage") C1 = ("P25") D1 = ("P75") F1 = ("Non central city median wage") G1 = ("P25") H1 = ("P75") J1 = ("Overall median wage") K1 = ("P25") L1 = ("P75")
    
    global desc_vars married never_married collgrad south smsa
    local j 3
    tokenize $desc_vars
    forvalues i = 1/5 {
        tabstat wage if ``i'' == 1, by(c_city) statistics(median p25 p75) save
        return list
        putexcel A`j' = matrix(r(Stat1)'), rownames nformat(number_d2)
        putexcel E`j' = matrix(r(Stat2)'), rownames nformat(number_d2)
        putexcel I`j' = matrix(r(StatTotal)'), rownames nformat(number_d2)
        local j = `j' + 1
    }
    
    local i 3
    foreach var in $desc_vars{
        local label: var lab `var'
        putexcel A`i' = "`label'"
        local ++i
    }

    The only issue I am running into is this: Because I am slicing and dicing the data in some cases very finely, I am finding that sometimes there are no observations for individuals where married = yes and c_city == yes. This means that of course, tabstat can't calculate the median wage for individuals where married = yes and c_city == yes. So when I call the matrices r(Stat1) and r(Stat2) after tabstat, only one of those matrices is actually called.


    My binary variable is coded as 1 = yes and 2 = no (something that would be very, very confusing to change on the fly given the amount of other code and statistics being provided). This means that in cases where I do have observations for both levels, r(Stat1) represents statistics for c_city = yes, and r(Stat2) represents statistics for c_city = no. BUT, in cases where I don't have any observations for c_city = yes, r(Stat1) instead represents c_city = no.


    I know that is probably really confusing to read. Here is (roughly) what the Excel output looks like, if that helps:
    Characteristic Central city median wage P25 P75 Non central city median wage P25 P75 Overall median wage P25 P75
    married 6.21 4.27 9.29 r1 6.21 4.27 9.29
    never_married 6.28 4.33 9.57 7.21 4.90 11.61 6.82 4.35 11.03
    collgrad 5.79 4.14 7.98 11.27 7.50 14.26 9.68 6.63 12.44


    The table is reporting a value that seems like it is the median/IQR wage for married individuals living in a center city. But in reality, this is the median/IQR wage for married individuals NOT living in a center city. Because there were no married individuals living in a center city, r(Stat2) became r(Stat1) and now r(Stat2) doesn't exist at all.

    This is a long-winded lead-up to my question, which is: Is there any way that I can tell Stata to reorganize these r(Stat) matrices, or reorganize the naming convention, according to missing values in the dataset? Or is there any other way to get around this? Assuming there are not 5, but 50, levels of $desc_var, is there any way for me to do this without running it, seeing which are missing, and then manually coding a separate chunk for those strata that are "one-sided"? I've tried thinking about matlist and other matrix-speciifc language, but Stata doesn't seem to like an attempt to call something that doesn't exist.

    This is a weird question I'm having trouble clearly explaining, so please let me know if I can clarify. I place myself at the mercy of Statalist.

  • #2
    Originally posted by Maria Sundaram View Post

    This is a weird question I'm having trouble clearly explaining, so please let me know if I can clarify.
    Chances are someone else asked something similar at some point. See https://www.statalist.org/forums/for...ta-in-one-line. The suggestion there would correspond to manipulating the r(stat) matrices, but you need to generalize this beyond 2 \(\times\) 2 matrices.

    Comment


    • #3
      It may be a better way to first using table (and/or collect) to create the table you want and to subsequently export that table to Excel, see "table" -> "Advanced table customization" in the PDF-Manual [R] and the PDF Manual [T]. The commands are complex, perhaps someone else can help.

      Comment


      • #4
        I agree with Dirk: using a command like -table- or -collapse- allows you to replace the original dataset with the results of the command.
        For -table-, use the replace option. It may take you some time fiddling around to find a suitable row/column format.
        For -collapse-, you may need to make some duplicates of the wage variable since you can only apply a single mean/median... conversion to one variable at a time.
        After that, any extra formatting is easy and you will probably be able to simply use -export excel- to get the results out.
        At a conceptual level, it's better to only use the put[docx/excel/pdf] functions for final (and simple) output after you have a fine-tuned a dataset in STATA first.
        Last edited by Koenraad Blot; 27 Feb 2024, 05:19.

        Comment


        • #5
          On second thought (and having tried a few things with the data myself), here's another solution that may be easier to work with your existing code:
          After invoking -tabstat-, use the -confirm- command to check for the existence of the r(Stat2) matrix. You need to precede this by -capture-, otherwise the program will stop when confirm doesn't find the specified matrix. -capture- stores the result of confirm in the global scalar _rc, which is zero when -confirm- is succesfull (ie, the matrix exists) or non-zero otherwise. It should look like this:

          Code:
          capture confirm matrix r(Stat2)
          if !_rc {
          // r(Stat2) exist, normal code execution
          } else {
          // No r(Stat2), adjust your code to putexcel in the right place
          }
          Hope this helps!

          Comment


          • #6
            I really appreciate everyone's thoughts and tips here. I am not as skilled at coding--especially for outputting--as I'd like to be, so am continuing to attempt to implement these. Once I get one of them to run, I'll post an update here. I really do appreciate so much the thoughts!

            Andrew, a big thank you in particular for your link to the other, similar Statalist post. I'd looked around on the forums but had not been using search terminology that would've included results for 2x2 tables. That thread was really helpful.

            Updates hopefully to come soon!

            Comment


            • #7
              Originally posted by Maria Sundaram View Post
              The table is reporting a value that seems like it is the median/IQR wage for married individuals living in a center city. But in reality, this is the median/IQR wage for married individuals NOT living in a center city. Because there were no married individuals living in a center city, r(Stat2) became r(Stat1) and now r(Stat2) doesn't exist at all.
              Assuming that one level is always present, the fix is not that difficult. Following the linked thread in #2, here is some code. This works for a binary -by()- variable coded 1= "yes" and 2= "no", although you can adapt it to 0/1 or some other coding simply by changing how the conditions are defined. Finally, the matrix r(names) left behind by tabstat outputs the value labels. So we temporarily drop these so that we can use the values themselves. If you find some fault in the code which you cannot resolve, post back. Alternatively, if there are situations where both levels of the -by()- indicator are not present, post back and I will write something that covers this. Thanks for the data example in #1.

              Code:
              sysuse nlsw88, clear
              
              capture program drop mat2xN
              program define mat2xN, rclass
              local name1= "`r(name1)'" 
              local name2= "`r(name2)'"
              mat Stat1= r(Stat1)
              mat Stat2= r(Stat2)
              mat StatTotal= r(StatTotal)
              if "`r(name1)'"=="2"{
                  return local name1= "1"
                  return local name2= "`name1'"
                  mat Stat2= Stat1
                  mat Stat1= J(1, `=colsof(Stat1)', .z)
                  return matrix StatTotal=StatTotal
                  return matrix Stat1= Stat1
                  return matrix Stat2= Stat2
              }
              if missing("`r(name2)'") & "`r(name1)'"=="1"{
                  return local name1= "`name1'"
                  return local name2= "2"
                  mat Stat2= J(1, `=colsof(Stat1)', .z)
                  return matrix StatTotal=StatTotal
                  return matrix Stat1= Stat1
                  return matrix Stat2= Stat2
              }
              if !missing("`r(name1)'") & !missing("`r(name2)'"){
                  return local name1= "`name1'"
                  return local name2= "`name2'"
                  return matrix StatTotal=StatTotal
                  return matrix Stat1= Stat1
                  return matrix Stat2= Stat2
              }
              end
              
              
              //recoding c_city to assist with my example
              replace c_city = 2 if c_city == 0
              //recoding central city values to assist with my example
              replace c_city = 2 if married == 1
              
              *cd "H:\Location_of_my_output"
              putexcel set my_excel_output.xlsx, replace
              putexcel A1 = ("Characteristic") B1 = ("Central city median wage") C1 = ("P25") D1 = ("P75") F1 = ("Non central city median wage") G1 = ("P25") H1 = ("P75") J1 = ("Overall median wage") K1 = ("P25") L1 = ("P75")
              
              global desc_vars married never_married collgrad south smsa
              local j 3
              tokenize $desc_vars
              forvalues i = 1/5 {
                  lab values c_city blank
                  tabstat wage if ``i'' == 1, by(c_city) statistics(median p25 p75) save
                  mat2xN
                  return list
                  putexcel A`j' = matrix(r(Stat1)'), rownames nformat(number_d2)
                  putexcel E`j' = matrix(r(Stat2)'), rownames nformat(number_d2)
                  putexcel I`j' = matrix(r(StatTotal)'), rownames nformat(number_d2)
                  local j = `j' + 1
              }
              lab values c_city `:val lab c_city'
              local i 3
              foreach var in $desc_vars{
                  local label: var lab `var'
                  putexcel A`i' = "`label'"
                  local ++i
              }
              Res.:
              Click image for larger version

Name:	Capture.PNG
Views:	1
Size:	38.7 KB
ID:	1744898

              Comment


              • #8
                Thank you so much, Andrew! This program language worked really nicely. I've not ever written a 'program' within a do-file before and was really trying not to do it myself, due to the extremely high likelihood for error on my part. Your example code worked really well and has given me a basic structure for if/when I find myself needing to write another program in the future.

                A few other code updates:

                Koenraad, thank you so much for your suggestion to use capture and confirm, something I was not familiar with and which I am excited to learn about. I was able to get this to work doing line-by-line putexcel statements, but the if/else syntax seemed to bomb when I tried to incorporate it into the loop I had, and it seems like individual putexcel statements don't allow an 'if' option. I think the answer is to write a program as Andrew has done (which I think was implied in the structure of your code), which I hesitated to do because I've never done it before

                Koenraad and Dirk, thank you also very much for your suggestions about table and collect. These worked so nicely! I did have trouble identifying how I could get results for just the value of (for example) married == 1, rather than married == 1 and married == 0, out of table. No doubt there is a functionality for this--you've given me some clues in your replies--and I'll keep reading the help documentation to find the structure that works for this dataset. I am super appreciative of this because I think this functionality will work for other similar outputs that I will need in the future.

                Thanks all, for teaching me a lot more about Stata!

                Maria
                Last edited by Maria Sundaram; 28 Feb 2024, 08:57.

                Comment


                • #9
                  Hi Maria,
                  The -table- syntax allow for the use of if, so it's not more difficult than writing
                  Code:
                  table wage if married == 1
                  -putexcel- syntax does not allow if, but you can put a -putexcel- between curly braces of an -if- command.

                  Last edited by Koenraad Blot; 28 Feb 2024, 10:06.

                  Comment


                  • #10
                    Ah, thank you so much Koenraad! Of course I did not consider the simplest option for restricting table output (kicking myself now). That's great news that if works there!

                    Comment


                    • #11
                      Hi Maria,
                      I have found a much simpler way to get around the problem. It consists of using -tabstat- without the by() option. Instead of using by(), we use -tabstat- 3 times:
                      1) for c_city == 1,
                      2) for c_city == 2,
                      3) for c_city == any value.
                      When used without by(), -tabstat- returns results to the r(StatTotal) matrix. This a column matrix that needs to be transposed to a row matrix like this:
                      Code:
                      tabstat wage [...]
                      matrix c1 = r(StatTotal)'
                      If -tabstat- finds no suitable observations (as in married == 1 & c_city == 1), it will return an error and quit the program. To prevent this, we used -capture- to catch the error and only proceed to -putexcel- when no error occurs (_rc == 0).
                      I also streamlined the foreach loop by using varlist (more readable) and compact notation for incrementing the local j.
                      Code:
                      local j = 3
                      foreach v of varlist  married never_married collgrad south smsa {
                          putexcel A`j' = "`v'"
                          capture tabstat wage if `v' == 1 & c_city == 1, statistics(median p25 p75) save
                          if !_rc {
                              matrix c1 = r(StatTotal)' // transpose matrix r(StatTotal) to horizontal
                              putexcel B`j' = matrix(c1),  nformat(number_d2)
                          }
                          // c_city == 0
                          capture tabstat wage if `v' == 1 & c_city == 2, statistics(median p25 p75) save
                          if !_rc {
                              matrix c2 = r(StatTotal)' // transpose matrix r(StatTotal) to horizontal
                              putexcel F`j' = matrix(c2),  nformat(number_d2)
                          }
                          // c_city == 0 or 1
                          capture tabstat wage if `v' == 1, statistics(median p25 p75) save
                          if !_rc {
                              matrix c = r(StatTotal)' // transpose matrix r(StatTotal) to horizontal
                              putexcel J`j' = matrix(c),  nformat(number_d2)
                           }
                          local ++j
                      }
                      This code generates the same output as Andrew's program.

                      Hope this is useful
                      Koenraad

                      Comment


                      • #12
                        Hi Koenraad, this is really wonderful! I really, really appreciate it and am really excited to find so many great solutions to my "huge amount of output in small amount of time and space" problem. Thank you!!

                        Comment

                        Working...
                        X