  • I appreciate all the comments that have already been made. Some of these suggestions are duplicates, in which case consider them seconded; others haven't been mentioned yet. They are listed in no particular order:
    • Interactive output (ex: hover over truncated variable name and see full variable name and see the label)
    • Longer variable names and macro names - it may be cumbersome for some output, but that's the risk of those who define them
    • Unicode support
    • Parallel processing for large datasets (ex: George Vega Yon's parallel command is a good starting point, but getting this officially supported would be great)
    • Update to the FAQ that provides comparisons of related commands and which ones are faster
    • Increased -tabulate- size limits (no more "too many values" errors)
    • Preserve labels and notes following collapse
    • Transparent option with graph output
    • Quietly graph and export images as PNGs (see http://www.stata.com/statalist/archive/2012-04/msg00709.html)
    • Estimated time to completion of commands, with a minimum threshold below which nothing is shown (say, anything estimated at under 2 minutes). If I begin calculating something large, I'd like to know whether it's estimated to take 12 minutes or 12 hours. I know there are lots of challenges with this, but even a warning that a command may take a long time would be helpful
    • Built-in support for JSON data
    • Formal support for spmap or, preferably, improved mapping support that will allow for map layer creation
    • Support for larger data sets (everything Jeph Herrin and László said)
    • More intuitive way of bringing in relational databases - maybe storing multiple relational databases in memory
    • Save a subset of observations - ex: save xxx if x>2
    • Save a random sample of observations - ex: save xxx, sample(.1)
    • Use data but only import a subset of variables - ex: use xxx.dta, keep(x y z)
    • Currency or financial formatting - think of having dollar values on the y axis or being able to output to an Excel file already formatted as currency
    • Better integration with Excel - it would be great to be able to assign formatting to cells and worksheets and to insert Excel charts from Stata (such as inserting the data on worksheet 1 and then referencing that data to create a bar chart on worksheet 2); the majority of the business world functions with MS Office, and automating the output would save a lot of manual cleanup of generated spreadsheets
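
    Until something like -save ... if- exists, the two "save a subset" wishes above can be approximated with -preserve- and -restore-. This is only a minimal sketch of a workaround, assuming the shipped auto dataset and hypothetical tempfile names; it is not the requested syntax:

    ```stata
    * Workaround sketch: save a subset and a random sample
    * without losing the data currently in memory.
    sysuse auto, clear
    tempfile subset randsample      // hypothetical names, for illustration only

    * Stand-in for "save xxx if x>2": keep, save, then restore the full data
    preserve
    keep if mpg > 25
    save `subset'
    restore

    * Stand-in for "save xxx, sample(.1)": -sample 10- keeps a 10% random sample
    preserve
    set seed 2014
    sample 10
    save `randsample'
    restore

    assert _N == 74                 // the full auto data are back in memory
    ```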


    • Originally posted by David Muhlestein
      • Use data but only import a subset of variables - ex: use xxx.dta, keep(x y z)
      Note that this is already possible with -use varlist using filename-; it is one of the first and biggest lessons of http://www.nber.org/stata/efficient/. What would be new is renaming variables on the fly (more important for -merge-), which does not require the values to be sitting in memory, or automatically selecting only the variables that are ever used and kept (not dropped) later. I had a related suggestion on that.
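
      A short sketch of the -use varlist using filename- form, which also accepts -if- and -in-. The auto dataset and the tempfile name are assumptions made for illustration:

      ```stata
      sysuse auto, clear
      tempfile autocopy               // hypothetical file standing in for xxx.dta
      save `autocopy'

      * Load only three variables from the file on disk
      use make price mpg using `autocopy', clear
      describe, short

      * A variable list can be combined with -if- or -in- to subset observations
      use make price if price > 10000 using `autocopy', clear
      use make price in 1/10 using `autocopy', clear
      assert _N == 10 & c(k) == 2     // 10 observations, 2 variables in memory
      ```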


      • Originally posted by László
        Note that this is already possible with -use varlist using filename-; it is one of the first and biggest lessons of http://www.nber.org/stata/efficient/. What would be new is renaming variables on the fly (more important for -merge-), which does not require the values to be sitting in memory, or automatically selecting only the variables that are ever used and kept (not dropped) later. I had a related suggestion on that.
        Somehow I've missed that function. Thank you.


        • My "wish" is in the context of medicine (although this "wish" would apply to any field), for binary outcomes such as mortality. In medical journals, this type of data is typically displayed in a Table as the number of events over the number of patients in the group, i.e. "n/N" (example: 23/130 patients died in the Intervention group while 13/130 patients died in the control group). I frequently need to verify P values, risk ratios, CIs, etc. For this, I frequently use the -csi- command.

          Right now, to use Stata's contingency table commands (such as -csi-), I must first compute the number of non-events by subtracting the events (numerator) from the group total (denominator). While this is not particularly difficult, it is prone to simple arithmetic errors and requires more effort than I would like, especially for many outcomes.

          I "wish" I could enter -csi-like commands using events and total numbers, rather than events and non-events.

          Example (using the above numbers) of how I enter the command currently:
          Code:
          csi 23 13 107 117
          
                           |   Exposed   Unexposed  |      Total
          -----------------+------------------------+------------
                     Cases |        23          13  |         36
                  Noncases |       107         117  |        224
          -----------------+------------------------+------------
                     Total |       130         130  |        260
                           |                        |
                      Risk |  .1769231          .1  |   .1384615
                           |                        |
                           |      Point estimate    |    [95% Conf. Interval]
                           |------------------------+------------------------
           Risk difference |         .0769231       |   -.0065187    .1603649 
                Risk ratio |         1.769231       |    .9374361    3.339083 
           Attr. frac. ex. |         .4347826       |   -.0667393    .7005166 
           Attr. frac. pop |         .2777778       |
                           +-------------------------------------------------
                                         chi2(1) =     3.22  Pr>chi2 = 0.0726
          Example of how I would like to enter the data (I am inventing the option 'total' here, but it could be called anything):
          Code:
          csi 23 13 130 130, total
          
                           |   Exposed   Unexposed  |      Total
          -----------------+------------------------+------------
                     Cases |        23          13  |         36
                  Noncases |       107         117  |        224
          -----------------+------------------------+------------
                     Total |       130         130  |        260
                           |                        |
                      Risk |  .1769231          .1  |   .1384615
                           |                        |
                           |      Point estimate    |    [95% Conf. Interval]
                           |------------------------+------------------------
           Risk difference |         .0769231       |   -.0065187    .1603649 
                Risk ratio |         1.769231       |    .9374361    3.339083 
           Attr. frac. ex. |         .4347826       |   -.0667393    .7005166 
           Attr. frac. pop |         .2777778       |
                           +-------------------------------------------------
                                         chi2(1) =     3.22  Pr>chi2 = 0.0726
          Anybody else wish this? Or, am I missing something and can Stata already do this?
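
          One stopgap that works today, offered as a sketch rather than a real 'total' option: Stata evaluates `=exp' macro expansions before the command line is parsed, so the subtraction can be done inline with the totals typed exactly as reported.

          ```stata
          * Events and totals as reported (23/130 vs 13/130);
          * the non-events are computed inline by `=exp' expansion.
          csi 23 13 `=130-23' `=130-13'
          * Stata expands the line above to: csi 23 13 107 117
          ```

          This removes the mental arithmetic but not the argument order of -csi-, so it is still easy to mix up the four numbers.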


          • Philip, that's a fantastic suggestion! I always end up either using -expand- or adding frequency weights; it's admittedly a minor issue, but annoying nonetheless.
            ____________________________________________________
            Assistant Professor, Department of Biostatistics and Epidemiology
            School of Public Health and Health Sciences
            University of Massachusetts- Amherst


            • Re: Philip Jones' wish: +1. This problem can be solved by writing a wrapper program for csi. It seems to come up often enough to be annoying, but not often enough to motivate me to actually write that wrapper.


              • I agree, this would be useful.

                Along with my previously documented list (http://www.stata.com/statalist/archi.../msg00083.html) I would like to suggest the removal of a feature: the ability to merge m:m. This is a very confusing and, as far as I can tell, totally useless feature. I am quite confident that users have gotten incorrect results without realising it by using this "feature". Users should be pointed to joinby instead. The m:m "feature" could be retained under version control for diehards and people trying to understand why they can't replicate earlier erroneous (!) analyses.
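
                To illustrate the difference, here is a small sketch with two made-up files that each hold two rows for the same id (all variable names and values are invented for the example). -merge m:m- pairs rows by their position within id, while -joinby- forms every pairwise combination:

                ```stata
                clear
                input id str1 a
                1 "x"
                1 "y"
                end
                tempfile left
                save `left'

                clear
                input id str1 b
                1 "p"
                1 "q"
                end
                tempfile right
                save `right'

                * merge m:m matches sequentially within id: x-p and y-q only
                use `left', clear
                merge m:m id using `right'
                assert _N == 2

                * joinby forms all pairwise combinations: x-p, x-q, y-p, y-q
                use `left', clear
                joinby id using `right'
                assert _N == 4
                ```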


                • Clyde's suggestion of a wrapper program was a good one.

                  I have done so and the code is below. Just save as -csti.ado- in your personal folder and it should work.

                  I re-arranged the default entry order for -csi- because for me it makes more sense to enter the numbers as:
                  EVENTS_group1 TOTAL_group1 EVENTS_group2 TOTAL_group2
                  rather than the default.

                  For example, if 23/130 patients died in the Intervention group while 13/127 patients died in the control group, you would type:
                  Code:
                  csti 23 130 13 127
                  All options for csi will be passed along and will work. All of the -csi- "r" results are returned.

                  There is no error checking.

                  Comments and improvements most welcome!

                  If people find it useful, I can make a short help file and upload to SSC. Just let me know.

                  I hope this is helpful for someone.

                  Phil
                  Code:
                  *! version 1.0.7 12sep2014 \ Philip M Jones, [email protected]
                  /* csti.ado: Wrapper program for csi to use total number of patients. */
                  /* Example: for 23 events in 130 patients in one group, 13 events in 127 patients in another group */
                  /* "csti 23 130 13 127" */
                  /* all options for -csi- will continue to work as they are passed along */
                  
                  capture program drop csti
                  program define csti
                      version 13
                      
                      syntax anything [, *]
                      
                      tokenize `anything'
                      local n1 `1'
                      local N1 `2'
                      local n2 `3'
                      local N2 `4'
                      
                      local N_1 = `N1' - `n1'
                      local N_2 = `N2' - `n2'
                      
                      csi `n1' `n2' `N_1' `N_2', `options'
                      
                  end
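
                  Since the wrapper above has no error checking, here is one possible sketch of a checked variant. The name csti_chk and the specific checks are assumptions for illustration, not part of the posted code; it keeps the same argument order (events and total for group 1, then events and total for group 2).

                  ```stata
                  capture program drop csti_chk
                  program define csti_chk       // hypothetical name, not the posted csti
                      version 13
                      syntax anything(id="four counts") [, *]

                      * Expect exactly four tokens: events1 total1 events2 total2
                      local k : word count `anything'
                      if `k' != 4 {
                          display as error "four numbers required: events1 total1 events2 total2"
                          exit 198
                      }

                      * Each token must be a nonnegative integer
                      tokenize `anything'
                      forvalues i = 1/4 {
                          capture confirm integer number ``i''
                          if _rc {
                              display as error "``i'' is not an integer"
                              exit 198
                          }
                          if ``i'' < 0 {
                              display as error "counts must be nonnegative"
                              exit 198
                          }
                      }

                      * Events cannot exceed the group total
                      if `1' > `2' | `3' > `4' {
                          display as error "events cannot exceed the group total"
                          exit 198
                      }

                      * Same reordering as the original wrapper: cases first, then noncases
                      csi `1' `3' `=`2'-`1'' `=`4'-`3'', `options'
                  end

                  csti_chk 23 130 13 127
                  ```

                  With these checks, a call such as csti_chk 23 10 13 127 stops with return code 198 instead of passing a negative count to -csi-.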


                  • Thanks, Philip. This looks great. Can't wait to try it!


                    • I'll add another feature that would be very useful, although I've seen related requests mentioned before: an interactive debugger for Mata. I'm not just talking about -set trace-, -pause-, etc.; I'm referring to the kind of debugger that modern programming environments provide, such as MATLAB, RStudio, and various Python IDEs. The ability to set breakpoints in code and step through it line by line is something SORELY missing from Stata and Mata, but especially Mata.

                      This is another feature that could significantly improve Stata's market share. I often need to write functionality in Stata that uses matrix operations, but the lack of real debugging tools means I almost have to use MATLAB, Python, etc. for such tasks (anyone who has used a modern programming environment will recognize how primitive Stata's programming environment really is). At that point I usually end up doing all of my analysis in those languages instead of Stata.


                      • I know not all topics qualify as "wishes for Stata 14," even if every quirk and question could potentially prompt an improvement or a fix, but I still mention this here: I have a hard time understanding why Stata uses so much memory for -use- or -merge- operations that request only a small subset of observations or variables. Loading two variables "in 1/1000" from my data of 150 million observations, with a total size of 40 GB (counting all the other variables), still takes minutes and, temporarily, many GB of RAM. Even without any indexing and database computer science wizardry (raised under this topic before), I consider this very poor form. Isn't this "fixable in Stata 14" even within Stata's current memory model and file format?


                        • I wish that in Stata 14 relational data files could be opened with the -use- command, or that variables could be read from different files based on a key variable.
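
                          For comparison, the closest existing route is probably -merge- with the keepusing() option, which pulls selected variables from a second file by key. A minimal sketch, assuming the auto dataset split into two hypothetical files keyed on make:

                          ```stata
                          sysuse auto, clear
                          tempfile prices               // hypothetical second "table"
                          preserve
                          keep make price weight
                          save `prices'
                          restore

                          * Master data holds make and mpg; fetch only price by key
                          keep make mpg
                          merge 1:1 make using `prices', keepusing(price) nogenerate
                          assert _N == 74 & c(k) == 3   // make, mpg, price
                          ```

                          This still goes through one file at a time, so it is not the multi-file -use- being wished for.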


                          • I am also annoyed that Stata raises an error (and thus crashes the job) when time-series operators are used on xt data that is no longer sorted on panelid and time. See the example below, with and without the second -xtset- commented out.

                            I am not sure I see why the current behavior is preferred to re-sorting and proceeding. If that is sometimes too costly to be the default (i.e. users might want to be informed that their jobs are inefficient and slow), I would at least welcome a switch in -set- to turn such behavior on.

                            I see no apparent logic in which commands override previously xtset data, and I am a reasonably savvy user. I think it is bad if Stata complains about the sort order when the data should still be xtset.

                            Code:
                            clear all
                            cd ~/Downloads
                            set obs 1000
                            mata:
                                    // Store data directly with st_store()
                                    st_store(.,st_addvar("double",("x1","x2"),1),runiform(st_nobs(),2))
                            end        
                            g long id = floor(_n/10)
                            g byte time = mod(_n,10)
                            xtset id time
                            g l1x1 = l.x1
                            sort x2
                            xtset
                            g l1x1_2 = l.x1
                            exit
                            Last edited by László Sándor; 19 Sep 2014, 13:02. Reason: giving an example


                            • As Stata sorts so much in the background, it would be great to have a more flexible -sort-. Basically, if my command needs sorted data but was called on only a subsample, I would want an option not to sort the data that will never be used. If the sortedby local needs to carry around a second term to remember that the data is sorted on the given variables only within a subsample defined by another variable, so be it. As -sort- already allows [in], maybe the system local is already robust to this…

                              Or would you achieve this now in two steps?
                              Code:
                              sort `touse'
                              count if `touse'
                              sort `touse' sortvar in `=_N-`r(N)'+1'/`=_N'
                              It is still clumsy to use the end of my data now, not the beginning… So maybe it is not that costly to generate another tempvar:
                              Code:
                              gen byte `invtouse' = 1-`touse'
                              By the way, if [in] is much, much faster than [if], why do most of our commands carry around "if `touse'" instead of quickly (?) sorting on `touse' and keeping track of the start and the end of the data to use with [in]?

                              Sorting on a binary variable should be much faster than O(_N log _N), of course, as the order is calculable (i.e. count if `touse' produces a sufficient statistic, r(N)) and can then be imposed. Why can't we specify for -sort- whether the sortvar is binary (or categorical) and not continuous?
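
                              Putting the pieces above together, a self-contained sketch of the "sort the used rows to the top, then work with [in]" idea. The auto dataset and the choice of foreign and price are assumptions for illustration only:

                              ```stata
                              sysuse auto, clear
                              tempvar touse invtouse
                              mark `touse' if foreign              // 1 = observation will be used
                              gen byte `invtouse' = 1 - `touse'

                              sort `invtouse'                      // used observations come first
                              count if `touse'
                              local n = r(N)

                              sort `invtouse' price in 1/`n'       // order only the rows that are used
                              assert `touse' in 1/`n'              // the first n rows are the subsample
                              assert price[1] <= price[`n']        // and they are ordered by price
                              summarize price in 1/`n'
                              ```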
                              Last edited by László Sándor; 22 Sep 2014, 13:04.


                              • I wish there were an option for -egen rowmean()- to compute a value only if there are at least a certain number of non-missing values. For example, when computing a scale with 15 items, I'd only like to compute it for those with ten or more valid values. It's not that big of a deal to use -egen rowmiss()- or -egen rownonmiss()- and -replace- afterwards to clean things up, but it's a common enough situation that I'd like to do it with a single command instead of three.
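
                                For reference, a sketch of the current workaround trimmed to two commands by putting the condition on -egen- itself. The simulated items and variable names are made up for the example:

                                ```stata
                                clear
                                set obs 3
                                set seed 2014
                                forvalues j = 1/15 {
                                    gen item`j' = runiform()
                                }
                                * Leave observation 1 with only nine valid items
                                forvalues j = 1/6 {
                                    replace item`j' = . in 1
                                }

                                * Count valid items, then average only where at least ten are present
                                egen nvalid = rownonmiss(item1-item15)
                                egen scale  = rowmean(item1-item15) if nvalid >= 10

                                assert missing(scale[1]) & !missing(scale[2])
                                ```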
