Wish list for Stata 14

David Muhlestein

Join Date: Sep 2014

Posts: 8
#106

08 Sep 2014, 08:51

I appreciate all the comments that have already been made. Some of these suggestions are duplicates, in which case consider them seconded, and others haven't been mentioned yet. They are not listed in any particular order:

Interactive output (ex: hover over truncated variable name and see full variable name and see the label)

Longer variable names and macro names - it may be cumbersome for some output, but that's the risk of those who define them

Unicode support

Parallel processing for large datasets (ex: George Vega Yon's parallel command is a good starting point, but getting this officially supported would be great)

Update to the FAQ that provides comparisons of related commands and which ones are faster

Increased tab size (no more "too many values" errors)

Preserve labels and notes following collapse

Transparent option with graph output

Quietly graph and export images as pngs (see http://www.stata.com/statalist/archive/2012-04/msg00709.html)

Estimated time to completion of commands - have a minimum time that it doesn't show up (such as anything less than an estimated 2 minutes won't provide an estimate), but if I begin calculating something large I'd like to know if it's estimated to take 12 minutes or 12 hours; I know there are lots of challenges with this, but even a warning that suggests a command may take a long time is helpful

Built-in support for json data

Formal support for spmap or, preferably, improved mapping support that will allow for map layer creation

Support for larger data sets (everything Jeph Herrin and László said)

More intuitive way of bringing in relational databases - maybe storing multiple relational databases in memory

Save a subset of observations - ex: save xxx if x>2

Save a random sample of observations - ex: save xxx, sample(.1)

Use data but only import a subset of variables - ex: use xxx.dta, keep(x y z)

Currency or financial formatting - think of having dollar values on the y axis or being able to output to an Excel file already formatted as currency

Better integration with Excel - it would be great to be able to assign formatting to cells and worksheets and to insert Excel charts from Stata (such as inserting the data on worksheet 1 and then referencing that data and creating a bar chart on worksheet 2); the majority of the business world functions with MS Office, and automating the output would save a lot of manual cleaning up of generated spreadsheets
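
For the two -save- wishes above, a possible stopgap with existing commands is to wrap the selection in -preserve- and -restore-. This is only a sketch: it assumes the shipped auto dataset, and the output names (expensive, tenpct), the cutoff, and the seed are made up for illustration.

```stata
sysuse auto, clear

* save a subset of observations (stand-in for "save xxx if x>2")
preserve
keep if price > 5000
save expensive, replace
restore

* save a 10% random sample (stand-in for "save xxx, sample(.1)")
preserve
set seed 2014          // arbitrary seed, only for reproducibility
sample 10              // keeps 10 percent of the observations
save tenpct, replace
restore
```

After each -restore- the full dataset is back in memory, so this costs one extra copy of the data rather than a reload from disk.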
László Sándor

Join Date: Apr 2014

Posts: 120
#107

08 Sep 2014, 11:08

Originally posted by David Muhlestein

Use data but only import a subset of variables - ex: use xxx.dta, keep(x y z)

Note that this is possible with -use varlist using filename-. That is one of the first and biggest lessons of http://www.nber.org/stata/efficient/. What would be new is if we could rename variables on the fly (more important for -merge-), which does not require the values to be sitting in memory, or have automatic selection of only those variables that are ever used and kept (not dropped) later. I had a related suggestion on that.
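
As a sketch of that syntax (the shipped auto dataset and a temporary copy are stand-ins here, and the cutoff is arbitrary), the varlist and an optional if or in qualifier go before -using-:

```stata
sysuse auto, clear
tempfile full
save `full'

* load three variables only, and only the rows meeting the condition
use make price mpg if price > 5000 using `full', clear
describe, short

* or load only the first ten observations of those variables
use make price mpg in 1/10 using `full', clear
```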
David Muhlestein

Join Date: Sep 2014

Posts: 8
#108

08 Sep 2014, 17:15

Originally posted by László

Note that this is possible with -use varlist using filename-. That is one of the first and biggest lessons of http://www.nber.org/stata/efficient/. What would be new is if we could rename variables on the fly (more important for -merge-), which does not require the values to be sitting in memory, or have automatic selection of only those variables that are ever used and kept (not dropped) later. I had a related suggestion on that.

Somehow I've missed that function. Thank you.

Philip Jones

Join Date: Mar 2014
Posts: 104

#109

10 Sep 2014, 07:06

My "wish" is in the context of medicine (although this "wish" would apply to any field), for binary outcomes such as mortality. In medical journals, this type of data is typically displayed in a Table as the number of events over the number of patients in the group, i.e. "n/N" (example: 23/130 patients died in the Intervention group while 13/130 patients died in the control group). I frequently need to verify P values, risk ratios, CIs, etc. For this, I frequently use the -csi- command.

Right now, to use Stata's contingency table commands (such as -csi-), I must calculate the difference in events between the denominator and the numerator. While this is not particularly difficult, it is prone to simple math errors, and requires more effort than I would like, especially for many outcomes.

I "wish" I could enter -csi-like commands using events and total numbers, rather than events and non-events.

Example (using the above numbers) of how I enter the command currently:

Code:

csi 23 13 107 117

                 |   Exposed   Unexposed  |      Total
-----------------+------------------------+------------
           Cases |        23          13  |         36
        Noncases |       107         117  |        224
-----------------+------------------------+------------
           Total |       130         130  |        260
                 |                        |
            Risk |  .1769231          .1  |   .1384615
                 |                        |
                 |      Point estimate    |    [95% Conf. Interval]
                 |------------------------+------------------------
 Risk difference |         .0769231       |   -.0065187    .1603649 
      Risk ratio |         1.769231       |    .9374361    3.339083 
 Attr. frac. ex. |         .4347826       |   -.0667393    .7005166 
 Attr. frac. pop |         .2777778       |
                 +-------------------------------------------------
                               chi2(1) =     3.22  Pr>chi2 = 0.0726

Example of how I would like to enter the data: (I am inventing the option 'total' here, but it could be called anything):

Code:

csi 23 13 130 130, total

                 |   Exposed   Unexposed  |      Total
-----------------+------------------------+------------
           Cases |        23          13  |         36
        Noncases |       107         117  |        224
-----------------+------------------------+------------
           Total |       130         130  |        260
                 |                        |
            Risk |  .1769231          .1  |   .1384615
                 |                        |
                 |      Point estimate    |    [95% Conf. Interval]
                 |------------------------+------------------------
 Risk difference |         .0769231       |   -.0065187    .1603649 
      Risk ratio |         1.769231       |    .9374361    3.339083 
 Attr. frac. ex. |         .4347826       |   -.0667393    .7005166 
 Attr. frac. pop |         .2777778       |
                 +-------------------------------------------------
                               chi2(1) =     3.22  Pr>chi2 = 0.0726

Anybody else wish this? Or, am I missing something and can Stata already do this?

Andrew Lover

Join Date: Apr 2014

Posts: 182
#110

10 Sep 2014, 07:13

Philip, that's a fantastic suggestion! I always end up either using -expand- or adding frequency weights; it's admittedly a minor issue, but annoying nonetheless.

__________________________________________________ __
Assistant Professor, Department of Biostatistics and Epidemiology
School of Public Health and Health Sciences
University of Massachusetts- Amherst
Clyde Schechter

Join Date: Apr 2014

Posts: 29773
#111

10 Sep 2014, 10:03

Re: Philip Jones' wish: +1. This problem can be solved by writing a wrapper program for csi. It seems to be something that comes up often enough to be annoying, but not often enough to motivate me to actually write that wrapper.
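
Short of a wrapper, one stopgap is to let Stata do the subtraction through inline `=exp' macro expansion, so that the events and totals are typed exactly as they appear in the table. A sketch using Philip's numbers (23/130 and 13/130):

```stata
* events typed as given; non-events computed inline from the totals
csi 23 13 `=130-23' `=130-13'
```

This expands to -csi 23 13 107 117- before the command runs, so the output is the same as in Philip's first example.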
Phil Clayton

Join Date: Mar 2014

Posts: 50
#112

10 Sep 2014, 20:42

I agree, this would be useful.

Along with my previously documented list (http://www.stata.com/statalist/archi.../msg00083.html) I would like to suggest the removal of a feature: the ability to merge m:m. This is a very confusing and, as far as I can tell, totally useless feature. I am quite confident that users have gotten incorrect results without realising it by using this "feature". Users should be pointed to joinby instead. The m:m "feature" could be retained under version control for diehards and people trying to understand why they can't replicate earlier erroneous (!) analyses.
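
To illustrate the difference, here is a small sketch with made-up toy data (the variables id, a, and b are purely illustrative). Within a group, -merge m:m- pairs observations by position, while -joinby- forms every pairwise combination:

```stata
clear
input id str1 a
1 "p"
1 "q"
end
tempfile left
save `left'

clear
input id str1 b
1 "x"
1 "y"
1 "z"
end
tempfile right
save `right'

* m:m pairs rows in order within id and repeats the last row of the shorter side
use `left', clear
merge m:m id using `right'
list, sepby(id)
count                   // should report 3 observations

* joinby forms all combinations within id
use `left', clear
joinby id using `right'
list, sepby(id)
count                   // should report 6 observations
```

The m:m result depends on the sort order within id, which is rarely what anyone intends.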
Philip Jones

Join Date: Mar 2014

Posts: 104
#113

11 Sep 2014, 12:39

Clyde's suggestion of a wrapper program was a good one.

I have done so and the code is below. Just save as -csti.ado- in your personal folder and it should work.

I re-arranged the default entry order for -csi- because for me it makes more sense to enter the numbers as:
EVENTS_group1 TOTAL_group1 EVENTS_group2 TOTAL_group2
rather than the default.

For example, if 23/130 patients died in the Intervention group while 13/127 patients died in the control group, you would type:

Code:

csti 23 130 13 127

All options for csi will be passed along and will work. All of the -csi- "r" results are returned.

There is no error checking.

Comments and improvements most welcome!

If people find it useful, I can make a short help file and upload to SSC. Just let me know.

I hope this is helpful for someone.

Phil

Code:

*! version 1.0.7 12sep2014 \ Philip M Jones, [email protected]
/* csti.ado: Wrapper program for csi to use total number of patients. */
/* Example: for 23 events in 130 patients in one group, 13 events in 127 patients in another group */
/* "csti 23 130 13 127" */
/* all options for -csi- will continue to work as they are passed along */
capture program drop csti
program define csti
    version 13
    syntax anything [, *]
    tokenize `anything'
    local n1 `1'
    local N1 `2'
    local n2 `3'
    local N2 `4'
    local N_1 = `N1' - `n1'
    local N_2 = `N2' - `n2'
    csi `n1' `n2' `N_1' `N_2', `options'
end
Clyde Schechter

Join Date: Apr 2014

Posts: 29773
#114

11 Sep 2014, 13:15

Thanks, Philip. This looks great. Can't wait to try it!
Michael Anbar

Join Date: Aug 2014

Posts: 116
#115

12 Sep 2014, 15:56

I'll add another feature that would be very useful, although I've seen related requests mentioned before: an interactive debugger for Mata. I'm not just talking about -set trace-, -pause-, etc.; I'm referring to the actual debuggers that modern programming environments provide, e.g. MATLAB, RStudio, various Python IDEs, etc. The ability to set breakpoints in code and step through it line by line is something SORELY missing from Stata and Mata, but especially Mata.

This is another feature that could significantly improve Stata's market share. I often need to write functionality in Stata that uses matrix operations, but the lack of real debugging tools means I almost have to use MATLAB, Python, etc. for tasks like this (anyone who has used a modern programming environment realizes how primitive Stata's programming environment really is), at which point I usually end up doing all of my analysis in those languages instead of Stata.
László Sándor

Join Date: Apr 2014

Posts: 120
#116

15 Sep 2014, 16:57

I know not all topics qualify as "wishes for Stata 14," even if all quirks and questions could potentially be relevant for an improvement, "fix" etc., but I still mention this here: I have a hard time understanding why Stata uses so much memory for -use- or -merge- operations where you specify only a small subset of observations or variables. Using two variables "in 1/1000" from my data of 150 million observations with a total size of 40 GB (with all the other variables) still takes minutes and, temporarily, many GB of RAM to load. Even without any indexing and database computer science wizardry (raised under this topic before), I consider this very poor form. Isn't this "fixable in Stata 14" even within Stata's current memory model and file format?
Rasool Bux

Join Date: May 2014

Posts: 1
#117

19 Sep 2014, 01:54

I wish that in Stata 14 relational data files could be opened with the -use- command, or that variables could be read from different files based on a key variable.
László Sándor

Join Date: Apr 2014

Posts: 120
#118

19 Sep 2014, 12:18

I am also annoyed that Stata raises an error (and thus crashes the job) when some time-series operators are used on xt data that is no longer sorted on panelid and time. See the example below with or without commenting out (the second) -xtset-.

I am not sure I see why the current behavior is preferred to re-sorting and proceeding. If that is sometimes too costly to be a default (i.e. users might want to be informed about their jobs being inefficient and slow), at least I would welcome a switch in -set- to turn such behavior on.

I see no apparent logic in which commands override previously xtset data, and I am a reasonably savvy user. I think it is bad if Stata complains about the sort order when the data should still be xtset.

Code:

clear all
cd ~/Downloads
set obs 1000
mata:
// Store data directly with st_store()
st_store(.,st_addvar("double",("x1","x2"),1),runiform(st_nobs(),2))
end
g long id = floor(_n/10)
g byte time = mod(_n,10)
xtset id time
g l1x1 = l.x1
sort x2
xtset
g l1x1_2 = l.x1
exit

Last edited by László Sándor; 19 Sep 2014, 13:02. Reason: giving an example
László Sándor

Join Date: Apr 2014

Posts: 120
#119

22 Sep 2014, 12:35

As Stata is sorting so much in the background, it'd be great to have a more flexible -sort-. Basically, if my command needs sorted data but was called only on a subsample, I would want to have an option not to sort the data that will never be used. If the sortedby local needs to carry around a second term to remember that the data is sorted on the given variables only in a subsample defined by another variable, so be it. As -sort- already allows [in], maybe the system local is already robust to this…

Or you would achieve this now in two steps?:

Code:

sort `touse'
count if `touse'
sort `touse' sortvar in `=_N-`r(N)'+1'/`=_N'

It is still clumsy to use the end of my data now, not the beginning… So maybe it is not that costly to generate another tempvar:

Code:

gen byte `invtouse' = 1-`touse'

By the way, if [in] is much, much faster than [if], why do most of our commands carry around "if `touse'" instead of quickly (?) sorting on `touse' and keeping track of the start and the end of the data to use with in?

Sorting on a binary variable should be much faster than O(_N log _N), of course, as the order is calculable (i.e. count if `touse' produces a sufficient statistic, r(N)) and can then be imposed. Why can't we specify for -sort- whether the sortvar is binary (or categorical) and not continuous?

Last edited by László Sándor; 22 Sep 2014, 13:04.
ben earnhart

Join Date: May 2014

Posts: 1027
#120

22 Sep 2014, 16:52

I wish there were an option for -egen rowmean()- to only compute a value if there are a certain number of non-missing values. For example, when computing a scale with 15 items, I'd only like to compute it for those with ten or more valid values. It's not that big of a deal to use -egen rowmiss()- or -egen rownonmiss()- and -replace- afterwards to clean things up, but it's a common enough situation that I'd like to do it with a single command instead of three.
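
For reference, the three-command version looks roughly like this. It is a sketch on simulated data; the variable names item1-item15, nvalid, and scale are invented for the example.

```stata
clear
set obs 5
set seed 12345
forvalues i = 1/15 {
    generate item`i' = runiform()
}

* blank out six items for the first observation, leaving only 9 valid values
forvalues i = 1/6 {
    replace item`i' = . in 1
}

* the current three-step approach
egen nvalid = rownonmiss(item1-item15)
egen scale  = rowmean(item1-item15)
replace scale = . if nvalid < 10

list nvalid scale
```

The first observation ends up with a missing scale because it has fewer than ten valid items; the wished-for option would fold the last three commands into one.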