Wish list for Stata 14

Michael Anbar

Join Date: Aug 2014

Posts: 116
#121

23 Sep 2014, 18:33

Originally posted by László Sándor

This might be related or unrelated, but there seem to be more and more features of Stata (factor variables, large-N small-T panels etc.) which would benefit greatly from sparse matrices in Mata. One wonders how hard it is to add.

Separate memory spaces came up before, but note that the huge cost of sorting and of preserving and restoring data with many covariates (especially ones unused in a given line) or irrelevant observations comes from the fact that the rest of the big dataset is also moved in memory needlessly.

I'll second this. The lack of sparse matrices is one of the major impediments (but far from the only one) to my colleagues and me using Mata for a wider array of tasks. Mata's requirement that all matrices be full matrices, regardless of how sparse they actually are, imposes needlessly high memory requirements for storing certain matrices. It also imposes upper bounds on matrix size that (for certain problems) are far below those of competitors that do support sparse matrices, such as MATLAB and Python.
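To make the memory point concrete, here is a minimal sketch of what people end up hand-rolling in Mata today: keep only the nonzero entries as (row, column, value) triplets and write the matrix-vector product yourself. The function name spmv() and the toy data are made up for illustration; this is not a built-in Mata feature.

```stata
mata:
// Triplet (coordinate) storage: one row of T per nonzero entry, columns (i, j, value).
// spmv() is a hypothetical helper, not part of Mata.
real colvector spmv(real matrix T, real colvector x, real scalar nrows)
{
    real colvector y
    real scalar k

    y = J(nrows, 1, 0)
    for (k = 1; k <= rows(T); k++) {
        // add value * x[col] to the output row of this nonzero entry
        y[T[k, 1]] = y[T[k, 1]] + T[k, 3] * x[T[k, 2]]
    }
    return(y)
}

// a 4 x 4 matrix with only three nonzero entries
T = (1, 1, 2 \ 2, 3, 5 \ 4, 2, -1)
x = (1 \ 2 \ 3 \ 4)
y = spmv(T, x, 4)
y
end
```

Storage grows with the number of nonzero entries rather than with rows times columns, but every operation beyond this one would have to be written by hand, which is presumably why built-in support is on the wish list.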
Comment
László Sándor

Join Date: Apr 2014

Posts: 120
#122

25 Sep 2014, 19:42

Small thing, but maybe easy to add: why doesn't -collapse- accept stubs for new variable names? Right now you have two choices. You can pass varlists, but then each variable can appear only once (though you can still take means of some variables and sums of others, of course) and none can be renamed during the collapse. Or you can generate several aggregates of the same variable, but then you must specify them one by one just to give each a name.
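For reference, a sketch of the usual workaround: build the target=source pairs in a loop and hand the assembled macro to -collapse-. The prefixes mean_ and sd_ and the choice of the auto dataset are just for illustration.

```stata
sysuse auto, clear

// assemble "(mean) mean_x=x (sd) sd_x=x" for each variable
local spec
foreach v of varlist price mpg weight {
    local spec `spec' (mean) mean_`v'=`v' (sd) sd_`v'=`v'
}

collapse `spec', by(foreign)
describe
```

A stub option would replace the loop with a single option, which is the convenience being asked for.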
1 like
Comment
rollanders

Join Date: Sep 2014

Posts: 33
#123

26 Sep 2014, 02:58

As to consistency: some commands accept "by", others need "over".
In the Browse window I would like to be able to specify a case number that I want to jump to.
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35173
#124

26 Sep 2014, 03:27

rollanders: over() usually means groups are shown side by side within the same block of output or graph panel; by() within different blocks or graph panels. Some graph commands necessarily support both. If you can find violations of this distinction, sing them out.
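A small illustration of that distinction with a graph command that supports both options (the graph names g_over and g_by are arbitrary):

```stata
sysuse auto, clear

// over(): both groups appear side by side within one panel
graph box mpg, over(foreign) name(g_over, replace)

// by(): each group gets its own panel
graph box mpg, by(foreign) name(g_by, replace)
```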
Comment
Ariel Linden

Join Date: Apr 2014

Posts: 153
#125

26 Sep 2014, 09:49

I'll stick with my ongoing request.... give us some data mining tools! Even basic classification and regression tree commands would be welcome.
Comment
László Sándor

Join Date: Apr 2014

Posts: 120
#126

01 Oct 2014, 07:05

I also find -levelsof- woefully slow. At least when I am using -fvrevar- anyway, I would like access to the values of the categorical/factor variable that r(varlist) corresponds to. It is lame to jump through hoops to keep track of which tempvar indicates which case. Or maybe I am missing something that is already there? (Then perhaps one can still improve the documentation.)
Comment
daniel klein

Join Date: Mar 2014

Posts: 3798
#127

01 Oct 2014, 07:42

Although I am not working with huge datasets and hence have never experienced trouble with levelsof, it is obvious that the command could be much faster if it were re-implemented in Mata. I have my own version of levelsof that requires Stata 9.2 (as does the original levelsof). In a fake dataset of 1,000 observations filled with random numbers (runiform()), it is almost 20 times faster. Of course, to put this in perspective, the absolute time for both is (far) below one second on my machine.
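As a rough idea of what a Mata-based approach can look like (a minimal sketch for a numeric variable, not the program mentioned above), one can pull the variable into Mata, take the sorted unique values, and pass them back as a local macro:

```stata
sysuse auto, clear

// selectvar = 0 skips observations with missing rep78, as levelsof does by default;
// uniqrows() returns the sorted distinct values
mata: st_local("levels", invtokens(strofreal(uniqrows(st_data(., "rep78", 0)))'))

display "`levels'"
```

This ignores string variables, if/in qualifiers, and number formatting, all of which a real replacement would need to handle.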

I do not fully get what László wants to do with fvrevar.

Best
Daniel
Comment
László Sándor

Join Date: Apr 2014

Posts: 120
#128

01 Oct 2014, 08:54

Originally posted by daniel klein

I do not fully get what László wants to do with fvrevar.

Thanks, Daniel. -fvrevar- generates the tempvars corresponding to the factor-variable terms, but "maddeningly" it only returns the names of the created variables (plus any existing variables that already correspond to something in the fvvarlist, of course) in a macro. It does not return any other macro saying which level or factor each created variable corresponds to. Maybe this is ill-conceived for a very general -fvrevar-, as it is hard to imagine what to return for terms of v1##v2##c.v3, say. But for simply parsing things like i.v4, it is needlessly hard at the moment to recover the values, even though I think they are often used in conjunction with the new tempvars themselves. It might even be surprising that StataCorp themselves did not need this functionality.
Comment
Sergio Correia

Join Date: Apr 2014

Posts: 420
#129

01 Oct 2014, 11:29

Originally posted by László Sándor

does not return any other macro with any information on which level or factor the created variable corresponds to.

László,

Is this what you are looking for?

Code:

sysuse auto
fvrevar i.turn
char list

In essence, each generated tempvar will carry a char named fvrevar (or tsrevar) holding its canonical name, from which you can recover the group by extracting the number on the left.

Best,
S
1 like
Comment
László Sándor

Join Date: Apr 2014

Posts: 120
#130

01 Oct 2014, 12:05

Thanks, Sergio, that's cool. That said, it is not great if I need to invoke string functions but it's much better than what I thought we had. I should check -char- more often.

It is also kind of interesting why Stata generates an all-0 variable for the base case, but fine for me.
Comment
Joseph Luchman

Join Date: Mar 2014

Posts: 114
#131

01 Oct 2014, 12:11

László,

Another option is:

Code:

sysuse auto
fvexpand i.turn
display "`r(varlist)'"

You can apply string functions to the varlist returned by fvexpand (which produces variable names in the same order as fvrevar).
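A minimal sketch of that string handling, assuming each term returned by fvexpand starts with the level number (for example 0b.foreign and 1.foreign); the local name levels is arbitrary:

```stata
sysuse auto, clear
fvexpand i.foreign

local levels
foreach term in `r(varlist)' {
    // keep the leading digits of each term; the "b" base marker and the name are dropped
    if regexm("`term'", "^([0-9]+)") {
        local levels `levels' `=regexs(1)'
    }
}
display "`levels'"
```

The braces matter here: they ensure regexs(1) is evaluated only after regexm() has run on the current term. Interactions and continuous terms would need more careful parsing than this.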

- joe

Joseph Nicholas Luchman, Ph.D., PStat® (American Statistical Association)
----
Research Fellow
Fors Marsh
----
Version 18.0 MP
1 like
Comment
László Sándor

Join Date: Apr 2014

Posts: 120
#132

01 Oct 2014, 13:09

Originally posted by Joseph Luchman

You can apply string functions to the varlist returned by fvexpand (which produces variable names in the same order as fvrevar).

Thanks, Joe. I overlooked -fvexpand-, though I think it will tabulate the variable once again in the background, which is wasteful in big data.

Last edited by László Sándor; 01 Oct 2014, 13:22.
Comment
László Sándor

Join Date: Apr 2014

Posts: 120
#133

01 Oct 2014, 13:12

Something else, simple but useful: -mvencode- is fast and powerful, but not useful for the case when I also want to have a dummy variable to indicate observations which were originally missing. I think it is quite common to deal with missing values this way (i.e. including the observations, but soaking up their average difference in a single dummy, which is hopefully good enough). A new stub option for -mvencode- would be welcome to do this, and do this fast.
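For comparison, a sketch of the two-step workaround this option would replace. The miss_ prefix and the fill value 0 are arbitrary choices for illustration:

```stata
sysuse auto, clear

// first record which observations were missing, one indicator per variable
foreach v of varlist rep78 {
    generate byte miss_`v' = missing(`v')
}

// then recode the missing values to a number
mvencode rep78, mv(0)
```

With many variables the loop adds one pass over the data per variable, which is where a built-in stub option could save time.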
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 29773
#134

01 Oct 2014, 13:25

"Soaking up their average difference in a single dummy" can be a dangerous way to deal with missing values. It can bias the coefficient estimates associated with the non-missing levels of the variable. There are circumstances where it raises no problems, but I think they are the exception rather than the rule. From that perspective, IMO, it may be a good thing that it is not particularly convenient to do this.
1 like
Comment
Alex Gamma

Join Date: Mar 2014

Posts: 18
#135

01 Oct 2014, 13:48

Some big and smaller wishes:
provide tools for Bayesian analysis

implement machine learning algorithms: LASSO, trees, SVM, splines...

alpha blending / graph marker transparency

make set trace on/off into a toggle switch (or, for nostalgic reasons, shorten them to 'tron' / 'troff')

bring back Stata 12's highlighting for matched brackets in the do-file editor!

integrate and extend description of regex capabilities in official documentation

And seconding some previous wishes:
+1 for "An option for "tab" that will display both value labels and unlabeled values"

+1 for "Nice code completion, especially I would like to get, as in RStudio, a list of options after typing a comma for each command. So for instance hitting Tab after typing graph box, would open a list of available options"

+1 for "a replace option for generate."

+1 for saving bookmarks in code

+1 for adding breakpoints for debugging
1 like
Comment
