Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Excel export performance/speed

    Dear All,

    I am trying to optimize performance of my code that generates the output.
    The simplified example of what I am having is the following code:

    Code:
    clear all
    version 17.0
    set rmsg on
    
    mata
        void setcols() {
            class xl scalar b
            b = xl()
            b.load_book(st_local("file"))
            colwidths = tokens(st_local("colwidths"))
            sheets = b.get_sheets()
            for(i=1;i<=rows(sheets);i++) {
                    b.set_sheet(sheets[i])
                    st_global("c(current_time)") + " " + sheets[i]
                    for(j=1; j<=cols(colwidths); j++) {
                      b.set_column_width(j,j,strtoreal(colwidths[j]))
                    }
            }
    
            b.close_book()
        }
    end
    
    
    program define formatcols
        version 17.0
        syntax , file(string) colwidths(string)
        mata setcols()
    end
    
    local f "C:/temp/benchmark.xlsx"
    
    capture erase "`f'"
    
    forval s=1/20 {
        putexcel set "`f'", open modify sheet("s`s'")
        local cols="ABCDEFGHIJKLMNOPQRSTUVWZXYZ"
        forval i=1/200 {
            forval j=1/`=strlen("`cols'")' {
                putexcel `=substr(`"`cols'"',`j',1)'`i' = "Test value"
            }
        }
        putexcel save
    }
    
    formatcols, file("`f'") colwidths("3.82 31.27 37.55 30.91 15.64 18.36 26.36 40.91 65.18 35.82")
    
    // END OF FILE

    The first (gray) part of the code is data generation, which substitutes for real report-producing code. I am not sure how the content here might affect performance, but I think for illustration purposes it's fine as is here.
    After an Excel file is produced with the -putexcel- statements, I reformat it using a Mata procedure formatcols, since this is the only way I found to adjust the columns widths (or is there a better way?).

    I am concerned about the performance from this point onward. In the example it takes some 11 seconds per sheet to format the 10 columns. In the real project it takes well above 2 minutes per sheet. I can't imagine what is going on. For comparison, it takes a fraction of a second to extract all sheets from the XLSX file (unpack).



    Code:
    . formatcols, file("`f'") colwidths("3.82 31.27 37.55 30.91 15.64 18.36 26.36 40.91 65.18 35.82")
      12:32:52 s1
      12:33:03 s2
      12:33:14 s3
      12:33:25 s4
      12:33:35 s5
      12:33:46 s6
      12:33:57 s7
      12:34:07 s8
      12:34:18 s9
      12:34:29 s10
      12:34:40 s11
      12:34:51 s12
      12:35:01 s13
      12:35:12 s14
      12:35:23 s15
      12:35:34 s16
      12:35:44 s17
      12:35:55 s18
      12:36:06 s19
      12:36:17 s20
    r; t=216.41 12:36:27
    The only thing I've noticed is that the Excel file in the real report contains massive (above 44MB) "styles.xml" file which is repeating the same style definition over and over again. It is easily compressed though, so doesn't bloat the *.xlsx file much.

    Is there a faster way to set the column widths in an Excel file from Stata?


    Thank you, Sergiy Radyakin

  • #2
    Mata's xl() class is currently the only way to change column width (though I would really like -putexcel- to mirror this function). However, you can speed this up substantially by flagging to Stata you wish to keep the file open for writing.

    Code:
    mata
        void setcols() {
            class xl scalar b
            b = xl()
            b.load_book(st_local("file"))
            b.set_mode("open")
            colwidths = tokens(st_local("colwidths"))
            sheets = b.get_sheets()
            for(i=1;i<=rows(sheets);i++) {
                    b.set_sheet(sheets[i])
                    st_global("c(current_time)") + " " + sheets[i]
                    for(j=1; j<=cols(colwidths); j++) {
                      b.set_column_width(j,j,strtoreal(colwidths[j]))
                    }
            }
    
            b.close_book()
        }
    end
    The resulting execution time of -formatcols- shrunk from ~90 seconds to <1 second.

    For those interested, there is an example given of how to use set_mode() to speed up IO in the manual. Note, close_book() is mandatory after using set_mode("open").

    Code:
    help mf_xl##set_mode

    Comment


    • #3
      Thank you Leonardo Guizzetti ,

      your advice is very helpful and solves this issue.

      Since we are having this example, is there any way to avoid using putexcel save in the loop of the grayed code above? Ideally, I would like to populate all sheets of the workbook, then save it all together. But I don't find a way to switch the sheet with putexcel set, sheet() without having to putexcel save the file, which takes time (both for saving and for re-opening later).

      Thank you, Sergiy Radyakin

      Comment


      • #4
        Originally posted by Sergiy Radyakin View Post
        Since we are having this example, is there any way to avoid using putexcel save in the loop of the grayed code above? Ideally, I would like to populate all sheets of the workbook, then save it all together. But I don't find a way to switch the sheet with putexcel set, sheet() without having to putexcel save the file, which takes time (both for saving and for re-opening later).
        You're welcome, Sergiy. To your next question, I don't think you can use the -putexcel, open- open option in quite the same way as with Mata. I'm inferring from the manual examples that it's intention is more to open one sheet at a time, make all the changes, and then only have to save the book once for that sheet. This is as opposed to the default behaviour which otherwise re-saves the file after each command. The alternative then is to replace the for loop with equivalent Mata code.

        Comment

        Working...
        X