
  • Re #180:

    Regarding 1), I tend to doubt that allowing -encode- to handle more than 65,536 distinct values will really accomplish much. At least in my workflow, in the very large data sets where this becomes an issue, the strings in question tend to be unique to each observation, or very nearly so. In that case, the value label, which must be stored in the data set and also be in memory when the data set is in use, will be roughly as large as the original string variable. Yes, if you had a data set with several million observations and a string variable that took on, say, 200,000 distinct values, then you would be able to appreciably shrink the dataset with -encode-, and also get a performance boost on many commands as a result. But in my experience, it is vanishingly rare to see that situation.

    Regarding 2), the use of a RAM drive to save files to memory: -frame copy- fulfills that need.
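
    For anyone who has not used frames this way, a minimal sketch of the -frame copy- approach (the frame name "snapshot" and the use of the auto data are purely illustrative):
    Code:
    sysuse auto, clear
    * duplicate the data currently in memory into a new frame named "snapshot"
    frame copy default snapshot, replace
    * ... destructive work in the default frame ...
    keep if !missing(rep78)
    * switch back to the untouched copy whenever it is needed
    frame change snapshot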

    Regarding 3), the current update of version 18 now has a -favor()- option that allows you to favor speed or memory. I'm told that with -favor(speed)- things go much faster than before. I haven't actually had occasion to use this in a large data set yet, so I can't really comment from personal experience. I'd also point out that even before this, there were user-written versions of -reshape- that were much faster: -greshape- (part of Mauricio Caceres' -gtools- package) and Rafal Raciborski's -tolong- (for wide to long only). Both are available from SSC.
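
    As a hedged illustration of -greshape- (its syntax mirrors official -reshape-; the toy data below are made up):
    Code:
    ssc install gtools, replace
    * tiny wide dataset: one row per id, income in two wide columns
    clear
    input id inc2000 inc2001
    1 10 11
    2 20 21
    end
    greshape long inc, i(id) j(year)    // same i()/j() syntax as official -reshape-
    list, clean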

    Regarding 4), I disagree, pretty strongly. If you are trying to combine two files, and in one of them a variable is a string while in the other it is numeric, that is a data error, pure and simple. Frankly, if I were going to revise -append-, I would do it by eliminating the -force- option, so that this condition would always throw an error and abort execution. It's a condition that should never happen, and when it does, I want to know about it so I can find out why and fix it. I don't want Stata making any repair attempts. After all, while investigating why the data sets are incompatible, I may well uncover other errors in the creation of those data sets that could trip me up later if left undetected now. It's an opportunity for me to find and fix problems in my data and get a better understanding of it. Also, since you are concerned about speed of execution with large files, I'll point out that string-numeric conversions in either direction are very time-consuming.

    FWIW, I think one of the most underrated user-written programs is Mark Chatfield's -precombine-. It allows you to compare files that you plan to -merge- or -append- together and it gives you warnings about any incompatibilities or potential problems that will arise. Using it before you do a -merge- or -append- is, I think, almost always a wise precaution, especially if the files are large and the combining will take a long time. Much better to know in seconds that your files are inconsistently organized than to find out late into a time-consuming -append- or -merge-.

    Regarding 5), anything that could be done with -joinby- can also be done with Robert Picard's -rangejoin- command (just set the interval to be from negative infinity to positive infinity), and it will both be substantially faster and use less memory. -rangejoin- is available from SSC. To use it, you must also install -rangestat-, by Robert Picard, Nick Cox, and Roberto Ferrer, also available from SSC.
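
    A rough sketch of that -joinby- emulation, assuming -rangejoin-'s keyvar/lowvar/highvar syntax (all dataset and variable names here are made up for illustration):
    Code:
    ssc install rangestat, replace
    ssc install rangejoin, replace
    * toy "using" file: the join key plus a numeric key variable for -rangejoin-
    clear
    input groupvar keyvar x
    1 1 10
    1 2 11
    2 3 12
    end
    sort groupvar keyvar
    tempfile pairs_demo
    save `pairs_demo'
    * toy master file
    clear
    input groupvar y
    1 100
    2 200
    end
    * interval bounds wide enough to match everything, i.e. what -joinby groupvar- would do
    generate double lo = -1e300
    generate double hi =  1e300
    rangejoin keyvar lo hi using `pairs_demo', by(groupvar)
    list, clean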






    • Originally posted by Clyde Schechter View Post
      Re #180:

      Regarding 1), I tend to doubt that allowing -encode- to handle more than 65,536 distinct values will really accomplish much. At least in my workflow, in the very large data sets where this becomes an issue, the strings in question tend to be unique to each observation, or very nearly so. In that case, the value label, which must be stored in the data set and also be in memory when the data set is in use, will be roughly as large as the original string variable. Yes, if you had a data set with several million observations and a string variable that took on, say, 200,000 distinct values, then you would be able to appreciably shrink the dataset with -encode-, and also get a performance boost on many commands as a result. But in my experience, it is vanishingly rare to see that situation.
      I agree with this point entirely. A strategy that can be employed to shrink the memory footprint, if absolutely necessary, is to manually create a map of integers and value labels (easily done, if a little tedious), and then use frame aliases to implement the label map, so that only the dictionary dataset has to store the strings.
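
      A rough sketch of that dictionary-frame idea, assuming Stata 18's -frlink-/-fralias- machinery (the auto data and the frame name "dict" are purely illustrative):
      Code:
      sysuse auto, clear
      * dictionary frame: each distinct string once, keyed by an integer
      frame put make, into(dict)
      frame dict {
          duplicates drop
          sort make
          generate long make_id = _n
      }
      * link the main data to the dictionary and alias the integer key into it
      frlink m:1 make, frame(dict)
      fralias add make_id, from(dict)
      * the long string can now be dropped; only the dictionary frame stores it
      drop make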

      Originally posted by Clyde Schechter View Post
      Regarding 2), the use of a RAM drive to save files to memory: -frame copy- fulfills that need.
      RAM disks do much more than this: they allocate a portion of RAM to be used as temporary disk storage. RAM disks can be highly effective when there is a need to work with large datasets, for example as the location of Stata's temp directory, which will always be faster than involving actual disk storage. Perhaps a more common application is simulation. However, I don't think average users of Stata, or even advanced users, need this in their day-to-day work. Furthermore, I think it's best left to the OS (or specialized programs) to handle the creation and operation of RAM disks.
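
      For what it's worth, a quick way to check where Stata is currently writing its temporary files (on most systems this is governed by the STATATMP environment variable, which can be pointed at a RAM disk before Stata is launched):
      Code:
      * show the directory named by STATATMP, if any, and where a tempfile actually lands
      display "`: environment STATATMP'"
      tempfile t
      display "`t'"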



      • Relogged in with my full name.

        Originally posted by Clyde Schechter View Post

        Regarding 1), I tend to doubt that allowing -encode- to handle more than 65,536 distinct values will really accomplish much. At least in my workflow, in the very large data sets where this becomes an issue, the strings in question tend to be unique to each observation, or very nearly so. In that case, the value label, which must be stored in the data set and also be in memory when the data set is in use, will be roughly as large as the original string variable. Yes, if you had a data set with several million observations and a string variable that took on, say, 200,000 distinct values, then you would be able to appreciably shrink the dataset with -encode-, and also get a performance boost on many commands as a result. But in my experience, it is vanishingly rare to see that situation.
        There are many use cases where the 65,536 limit is too small but where a larger limit could make a real difference. Here are just a few:
        ICD-10 codes: ~72k; descriptions are str60, so converting to a long uses 1/15th the memory
        Medical billing codes: ~71k
        Unique city names in the US: ~100k
        First names in the US: ~300k
        Street addresses in a large directory: only 25% are unique, so significant memory savings could be achieved here (probably a 70% reduction in memory)

        Originally posted by Leonardo Guizzetti View Post

        I agree with this point entirely. A strategy that may be employed to shrink memory footprint if absolutely necessary is to manually create a map of integers and value labels (easily done, if a little tedious), and then use frame aliases to effect the label map, while only having to store the dictionary dataset.
        Frame aliases are cool and helpful, but they have limitations. First, -frlink- is very slow: in a test I ran, it took 3x as long to -frlink- a variable as to simply merge in the same data and encode it. Second, you still have to create the linkage variable, which is, as mentioned, tedious, and it requires you to keep all of the different files/frames that you link to. Simply increasing the limit for encoding would simplify this significantly.

        Originally posted by Clyde Schechter View Post
        Regarding 3), the current update of version 18 now has a -favor()- option that allows you to favor speed or memory. I'm told that with -favor(speed)- things go much faster than before. I haven't actually had occasion to use this in a large data set yet, so I can't really comment from personal experience. I'd also point out that even before this, there were user-written versions of -reshape- that were much faster: -greshape- (part of Mauricio Caceres' -gtools- package) and Rafal Raciborski's -tolong- (for wide to long only). Both are available from SSC.
        The challenge with -favor(speed)-, at least with large datasets, is that memory is also at a premium. There are workarounds (see this post for one approach I previously took to reshaping; note that I haven't extensively benchmarked the new reshape options), but in general I think Stata could still optimize many of its core commands (beyond reshape and duplicates), as they don't appear to be much faster than they were in older versions.
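
        For reference, a minimal sketch of the option being discussed, on a toy dataset (per the description above, favor() takes either speed or memory):
        Code:
        clear
        input id inc2000 inc2001
        1 10 11
        2 20 21
        end
        reshape long inc, i(id) j(year) favor(speed)    // trade memory for speed
        reshape wide inc, i(id) j(year) favor(memory)   // trade speed for memory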

        Originally posted by Clyde Schechter View Post
        FWIW, I think one of the most underrated user-written programs is Mark Chatfield's -precombine-. It allows you to compare files that you plan to -merge- or -append- together and it gives you warnings about any incompatibilities or potential problems that will arise. Using it before you do a -merge- or -append- is, I think, almost always a wise precaution, especially if the files are large and the combining will take a long time. Much better to know in seconds that your files are inconsistently organized than to find out late into a time-consuming -append- or -merge-.
        I was not familiar with -precombine-, and it could be useful for this issue, so thank you for sharing. My use case is when I'm forced to append together potentially thousands of files, with hundreds of variables, that are produced by someone else (often by an automated process designed by someone else). Any change in a variable's data type over time can lead to issues with appending. As I mentioned before, the straightforward approach is to import all variables as strings and then convert to numeric afterward, but it's memory inefficient. I would still prefer an option for -append- whereby "force" simply converts the mismatched variable to a string.
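
        A sketch of the import-everything-as-strings workaround mentioned above (the file names are hypothetical):
        Code:
        * read each source file with every column forced to string, so types never clash
        import delimited using "extract_2022.csv", stringcols(_all) clear
        save part_2022, replace
        import delimited using "extract_2023.csv", stringcols(_all) clear
        append using part_2022
        * convert back to numeric where possible, only after everything is combined
        destring, replace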

        Originally posted by Clyde Schechter View Post
        Regarding 5), anything that could be done with -joinby- can also be done with Robert Picard's -rangejoin- command (just set the interval to be from negative infinity to positive infinity), and it will both be substantially faster and use less memory. -rangejoin- is available from SSC. To use it, you must also install -rangestat-, by Robert Picard, Nick Cox, and Roberto Ferrer, also available from SSC.
        I was not familiar with -rangejoin- either, so I tested it. I could imagine using it in the future for specific analyses. I did find it to be faster than -joinby-, but still quite slow. For a quick test I merged 3M observations (lots of duplicates) with 25 matches each, reaching 75M total observations: -joinby- took 208s, -rangejoin- 109s, and my own version (similar to the approach of -rangejoin- but without checking ranges) 55s. For comparison, a 1:m merge of 3M unique values to the same final size of 75M took 10.5s. There have to be some efficiencies that the smart people at Stata could find to speed that up.



        • Stata's -menl- procedure for nonlinear mixed-effects regression has many nice features but, in my experience,
          annoyingly often fails to converge. To optimize the likelihood it uses the algorithm of Lindstrom and Bates from 1990 (1), which
          not only has convergence issues but also provides biased estimates of the variance components
          when they are 'large' (2), which is very typical of biological data. Additionally, it has been reported that for non-continuous
          response variables the estimates obtained with linearization-based algorithms are biased (3).
          That is why the SAEM algorithm, originally described by Kuhn and Lavielle (4), became the method of choice among pharmacometricians
          for fitting complex models. SAEM is implemented in the two major professional (and very expensive) software packages used
          in the pharmaceutical industry for modeling complex data (Monolix and NONMEM). It is also available in R (5).
          In my experience, SAEM really works like a charm.
          Considering the many options available with -menl- that are not available
          elsewhere, it would be wonderful if Stata would consider implementing SAEM as an optimization option for -menl-
          in the next release.

          References:
          1. Lindstrom, M.J. and Bates, D.M. (1990). Nonlinear Mixed Effects Models for Repeated Measures Data. Biometrics, 46, 673-687.

          2. Comets, E. and Mentré, F. (2001). Evaluation of Tests Based on Individual versus Population Modelling to Compare Dissolution Curves. Journal of Biopharmaceutical Statistics, 11(3), 107-123.

          3. Molenberghs, G. and Verbeke, G. (2005). Models for Discrete Longitudinal Data. Springer-Verlag, Berlin.

          4. Kuhn, E. and Lavielle, M. (2005). Maximum Likelihood Estimation in Nonlinear Mixed Effects Models. Computational Statistics & Data Analysis, 49(4), 1020-1038.

          5. Comets, E., Lavenu, A. and Lavielle, M. (2017). Parameter Estimation in Nonlinear Mixed Effect Models Using saemix, an R Implementation of the SAEM Algorithm. Journal of Statistical Software, 80(3), 1-41.



          • I don't know whether this is possible, but it would be nice if one could generate code and its output as pictures from the do-file and the Command window. I am thinking of a prefix command like -codedump, save(filename): statacommand- that would save the code dump to the clipboard, and to a file if the save() option were specified.
            Kind regards

            nhb



            • -margins- / -marginsplot- have great flexibility, but there is a persistent limitation in generating prediction CIs that are bounded by [0,1], as they should be following logistic regression.

              There is an undocumented command -_coef_table- with an undocumented citype(logit) option that will produce the appropriately bounded CIs after a -margins- command:
              _coef_table, citype(logit) cititle(Logistic)
              However, -marginsplot- will only use the untransformed CIs, even after posting the margins obtained with citype(logit).
              I've seen some workarounds that call Jeff Pitblado's transform_margins ado, but there is no straightforward graphing option after that.
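
              To make the workflow concrete, a small example of the command described above (the auto data and the chosen covariates are purely illustrative):
              Code:
              sysuse auto, clear
              logit foreign mpg weight
              margins, at(mpg=(15 25 35))
              * undocumented: recompute the CIs on the logit scale so they stay within [0,1]
              _coef_table, citype(logit) cititle(Logistic)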

              My request: add an option to -marginsplot- similar to citype(logit).

              Thanks for considering.



              • It would be handy to have an on-the-fly method to change the number of decimal places displayed in standard output. For example:

                Code:
                . sysuse auto
                (1978 automobile data)
                
                . corr price-head
                (obs=69)
                
                             |    price      mpg    rep78 headroom
                -------------+------------------------------------
                       price |   1.0000
                         mpg |  -0.4559   1.0000
                       rep78 |   0.0066   0.4023   1.0000
                    headroom |   0.1112  -0.3996  -0.1480   1.0000
                
                . corr price-head, decs(2)
                (obs=69)
                
                             |  price    mpg  rep78 headroom
                -------------+------------------------------------
                       price |   1.00
                         mpg |  -0.46   1.00
                       rep78 |   0.01   0.40   1.00
                    headroom |   0.11  -0.40  -0.15   1.00
                Alternatively one might consider expanding the scope of cformat, pformat, and sformat to other types of displayed results.
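
                For comparison, these are the display-format options that already exist for estimation commands, which the suggestion above would generalize:
                Code:
                sysuse auto, clear
                * per-command control of the decimals shown in estimation output
                regress price mpg weight, cformat(%6.2f) pformat(%4.3f) sformat(%5.2f)
                * or set the coefficient display format globally for later estimation output
                set cformat %6.3f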



                • It would be great to have a way to log commands only to a .txt file while running an entire .do file (not with interactive use) (see using -cmdlog- to export commands to .txt file - Statalist).



                  • Originally posted by Castor Comploj View Post
                    It would be great to have a way to log commands only to a .txt file while running an entire .do file (not with interactive use) (see using -cmdlog- to export commands to .txt file - Statalist).
                    Can you give an example of why you think you need this? By definition, the do-file is the record of commands that were executed, so there is no reason I can see to duplicate this record.



                    • Originally posted by Leonardo Guizzetti View Post

                      Can you give an example of why you think you need this? By definition, the do-file is the record of commands that were executed, so there is no reason I can see to duplicate this record.
                      The default log (-log-), or named sub-logs (-log ..., name(somename)-), can be used to keep an overview of what has been estimated, with many details. When collaborating in a research group, one example where this would be useful is a Jupyter Notebook that is not running Stata: I could read the .txt file of the executed commands into it above the output, so that other researchers on my team can quickly identify mistakes or give suggestions without having to dig through the entire log, the sub-logs, or the .do file. This way, code and output would be in a single file. The same example also applies when working with text apps, such as LaTeX compilers or Google Docs.

                      A possible workaround is running all commands quietly, but having -cmdlog- take care of this would be a solution in its own right.



                      • Once upon a time, if you did -use varlist using file-, Stata would read in all of file and then -keep varlist-. For large files this was slow and chewed up a lot of memory during the process. At some point (version 16?) this was changed so that this same command ran rapidly, actually extracting varlist while reading in the file and never actually reading in all of file. That was a huge improvement. I'd like the analogous thing to be done for the -keepusing()- option of -merge-.

                        This morning, I had a 10 GB file which I needed to -merge- with an 80 GB file, but I only needed to keep two variables from the 80 GB file. When I ran -merge 1:m linkvar using 80gb_file, keepusing(var1 var2)-, I finally decided to kill the process after a little over 3 hours. Also, I was unable to open a rather small file in another instance of Stata because no more memory was available while it was running. So it seems that -merge- is hauling the entire -using- file into memory and struggles to do the sorting needed for the -merge- because the data are so bulky.

                        I went back and did
                        Code:
                        use linkvar var1 var2 using 80gb_file, clear
                        tempfile for_merge
                        save `for_merge'
                        
                        use 10_gb_file, clear
                        
                        merge 1:m linkvar using `for_merge'
                        and the job completed in 2 minutes 35 seconds!

                        OK, it's not a huge deal to do this workaround, but it really makes the -keepusing()- option kind of pointless. I would imagine that whatever was done to speed up -use varlist using file- could be adapted to -merge- with -keepusing()-.



                        • Ability to include lasso in the -mi- command list, so that lasso models can be estimated on multiply imputed data.



                          • Code folding in the do-file editor:
                            1) The View menu allows users to "Unfold all". Can we get a "Fold all" analogue? I often enclose longer blocks in braces so that I can collapse them:
                            Code:
                            if 1 == 1 {
                            * code here....
                            }
                            Being able to collapse all of these with a single button would be a huge help!

                            2) Could users have the option to "auto-fold" all blocks when opening a do-file?

                            3) Keyboard shortcuts for "Unfold all" (and the requested "Fold all") would be helpful too! Better yet, can we allow users to create custom shortcuts? I really miss that feature from Notepad++.



                            • Panel unit root tests should include an option for optimal lag-length selection. There is currently no such option, nor is it clear elsewhere how to determine the optimal lag length in panels in Stata. Eventually this could be combined with an automatic option, such as creating new variables or the like.



                              • A built-in Stata code co-pilot GPT!

                                OpenAI recently released their GPT store (https://chat.openai.com/gpts). There is one GPT for Stata, "The Stata GPT" by Jose RA Sanchez Ameijeiras. I haven't tried it (I'm not a ChatGPT Plus user), so I cannot attest to its usability. But a GPT trained on everything relating to Stata (all commands, all help files, all Statalist posts, and every paper from the Stata Journal), built into Stata 19 for code autocompletion, proofreading, etc., would be highly useful. VS Code has GitHub Copilot (https://github.com/features/copilot) if one is looking for inspiration.

