Wish list for Stata 14

Jeph Herrin

Join Date: Apr 2014

Posts: 332
#136

01 Oct 2014, 14:01

+1 for "a replace option for generate."

And a -replace- functionality for -egen-, even if it's -regen-.
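
Until something like that exists, the usual workaround is to drop the variable first. A minimal sketch, assuming the shipped auto.dta as stand-in data and a hypothetical variable name mprice:

```stata
sysuse auto, clear

* first pass: overall mean of price
egen double mprice = mean(price)

* a second -egen- with the same name would stop with "already defined",
* so drop the old variable first; -capture- keeps this safe on a first run
capture drop mprice
egen double mprice = mean(price), by(foreign)
```

This is only the manual equivalent of the wished-for option; it does not preserve the variable's position, label, or notes.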
Comment
László Sándor

Join Date: Apr 2014

Posts: 120
#137

01 Oct 2014, 16:50

Originally posted by Clyde Schechter View Post

"Soaking up their average difference in a single dummy" can be a dangerous way to deal with missing values. It can bias the coefficient estimates associated with the non-missing levels of the variable. There are circumstances where it raises no problems, but I think they are the exception rather than the rule. From that perspective, IMO, it may be a good thing that it is not particularly convenient to do this.

I hear you, Clyde. Though I still think it is a common practice, and needlessly slow and clumsy now. If we are worried about forces driving missing variables, why is it that much better to use only non-missing observations? Is that such a great easy default, based on more robust assumptions? I think the dummied version recovers some statistical power on the observed covariates' effects. No, it should not much affect the coefficient on the variable that is missing, true.
Comment
Richard Williams

Join Date: Apr 2014

Posts: 4886
#138

01 Oct 2014, 20:29

I used and taught the missing data dummy approach for years. But then Allison showed that it was usually (but not always) worse than doing nothing at all, i.e. using listwise: http://www.amazon.com/Missing-Quanti.../dp/0761916725

There was a discussion of this some years ago:

http://www.stata.com/statalist/archi.../msg00024.html

http://www.stata.com/statalist/archi.../msg00030.html
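
For readers who have not seen the two approaches side by side, here is a minimal sketch, assuming auto.dta as stand-in data (rep78 has some missing values there) and a hypothetical extra category code of 9. It only shows the mechanics; it says nothing about which estimates are less biased.

```stata
sysuse auto, clear

* listwise deletion: observations with missing rep78 are dropped
regress price mpg i.rep78

* missing-data dummy: recode missing to an extra category (9 is an
* arbitrary, hypothetical code) so that every observation is kept
generate byte rep78_cat = cond(missing(rep78), 9, rep78)
regress price mpg i.rep78_cat
```

The coefficient on the 9 category is the "single dummy" that soaks up the average difference of the observations with missing rep78.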

-------------------------------------------
Richard Williams, Notre Dame Dept of Sociology
StataNow Version: 18.5 MP (2 processor)
WWW: https://www3.nd.edu/~rwilliam
1 like
Comment
Andrew Lover

Join Date: Apr 2014

Posts: 182
#139

02 Oct 2014, 02:46

Hi Alex,

Do you know about -winbugsfromstata- (SSC)? It's maybe pretty dated, but there's a bit on the web, along with a Stata J article (Volume 6 Number 4: pp. 530-549).

re: LASSO, while not comprehensive, check out -lars- (SSC) for least angle methods.

____________________________________________________
Assistant Professor, Department of Biostatistics and Epidemiology
School of Public Health and Health Sciences
University of Massachusetts- Amherst
Comment
László Sándor

Join Date: Apr 2014

Posts: 120
#140

02 Oct 2014, 09:27

Originally posted by Richard Williams View Post

I used and taught the missing data dummy approach for years. But then Allison showed that it was usually (but not always) worse than doing nothing at all, i.e. using listwise: http://www.amazon.com/Missing-Quanti.../dp/0761916725

There was a discussion of this some years ago:

http://www.stata.com/statalist/archi.../msg00024.html

http://www.stata.com/statalist/archi.../msg00030.html

These are great points, thanks. Let me note that I could use this when the data is inherently missing, not just unobserved. So it could be faster in Stata. (I am even OK with Stata raising a warning about this, though it rarely does about other dangerous practices.)

I also wonder how this relates to event studies, where maybe the data is not inherently missing. I mean, if you observe only 2000-2007 for some treatment, then an analysis of outcomes on leads and lags of treatment would necessarily constrain yourself to a few years in the middle where all leads and lags exist. You are saying that this approach cannot be sensibly extended to more years, when 2006 could still be used even if I cannot control for, say, two leads any longer? Then I can only use lags but no leads as controls for any of the other years either? Prudent, though a bit dispiriting. Always better than an invalid analysis though. Thanks.
Comment
daniel klein

Join Date: Mar 2014

Posts: 3798
#141

02 Oct 2014, 10:00

As interesting as these statistical issues might be, would it not be better to start a new thread and focus on the topic here?

Perhaps there is no need to go as far as Sergiy suggested, but let's do all of us a favor and keep things where we and others can find them later.

Best
Daniel

Finally, on this thread, may I humbly suggest splitting suggestions from wishes? Some suggestions are actually resolved quickly by other users pointing to already available functionality, but such suggestions really clutter this thread. I tend to think of a wish in this context as something that is not doable by the user in principle, but something that should be relatively easy to do for developers having access to internals. For example, if the list of the variables can be exposed so that the user can pick variable names from it, why not expose the list of globals? The rest, (I wish Stata (program) did my job, or I wish Stata (program) was smart enough to understand what I want from it) I put into the group dreams, which is not something that is worth discussing. But I think what could help is some weighting of features (easily done here in the forum with opinion polls), such as "what do you prefer 3D charts, or Mata debugger?". Both are useful, and maybe even equivalent in man-hours. But the market imho will strongly signal the former, since the latter is interesting only to a few developers. With some other features it is less clear.

Best regards, Sergiy Radyakin
Comment
Sergiy Radyakin

Join Date: Apr 2014

Posts: 1862
#142

02 Oct 2014, 13:03

Originally posted by daniel klein View Post

Although I am not working with huge datasets and hence have never experienced trouble with levelsof, it is obvious that the command could be much faster if it was re-implemented in Mata. I have my own version of levelsof that requires Stata 9.2 (as does the original levelsof). In a fake dataset using 1,000 observations filled with random numbers (runiform()), it is almost 20 times faster. Of course, to put this in perspective, the absolute time for both is (far) below one second on my machine.

I do not fully get what László wants to do with fvrevar.

Best
Daniel

Daniel, your example is one which is not really levelsof's playground. While technically it works with anything, the optimization will kick in for categorical variables with a relatively small number of categories (relative to the size of the dataset). I have just spent 5 minutes on a totally non-optimized Mata version of levelsof and then half an hour on testing various cases. The only case when it can beat the original levelsof is your example of all different values. Of course it could be because of my inefficient approach though (see attached log).

This seems to come from the fact that levelsof sorts the whole dataset, while a fast implementation would sort only unique values. (If you don't want to read the code, just notice the sortpreserve marker in the declaration). This is quite obvious, and I don't think a command that useful and basic was overlooked by developers. In fact this mode is activated only if a much faster mode of getting levels with a fast/builtin command tabulate fails. It fails if there are too many levels (more than matsize). If your matsize is default (I guess 400, may depend on Stata flavor), then by creating a dataset of 1000 observations of all different random numbers, you are forcing levelsof into a very special case, where it has to do the job twice (first try the fast method, establish it doesn't work, second fall back to the alternative).

Another source of perceived slowness of levelsof is that it serializes the levels into a string (which is totally not required in most of my tasks). String operations are slow. Moreover getting each level later would also be slow (using word i or foreach). It is not clear how your procedure reports the results (string, matrix?).

Directly using

Code:

quietly tabulate x, matrow(X)

should be a good alternative in many cases, when I know the variable is numeric and has few codes.
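
To make the point about avoiding the string round trip concrete, here is a minimal sketch of looping over the rows of that matrix directly. It assumes auto.dta and its variable rep78 as stand-ins for x; the local names k and lev are hypothetical.

```stata
sysuse auto, clear

* one row of X per distinct nonmissing value of rep78
quietly tabulate rep78, matrow(X)
local k = r(r)

* read each level straight from the matrix, no macro list to parse
forvalues i = 1/`k' {
    local lev = X[`i', 1]
    display "level `i' of rep78 is `lev'"
}
```

This only works while the number of levels fits in a matrix, which is the same limit described above.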

I am pretty sure that if you can beat tabulate in the above task, StataCorp would be interested to know about your approach. I definitely am, and would like to see (at least a compiled version of) your routine.

@Laszlo

I also find -levelsof- woefully slow.

Compared to what?

What I find strange however, is that tabulate is not parallelized. I don't see a reason why.

Code:

. quietly tab occup in 1/22460000
r; t=1.69 13:38:30
. quietly tab occup in 1/22460000
r; t=1.71 13:38:39
. quietly tab occup
r; t=3.40 13:38:45
. quietly tab occup
r; t=3.33 13:38:50

Tabulation on half data takes half as much time as on full data (provided some uniformity assumption about how unique values are distributed across the data). So two processors should produce the list of unique values twice as fast (considering merging the two lists negligible compared to the task of looking through the data).

It should be possible to write a plugin for that, but Stata plugins can't address Stata dataset in multiple threads (at least they can't write, not so clear about reading). So this would not be without a few hurdles.

Finally it is sometimes possible to exploit some a priori knowledge about the data to determine the levels (for example if you know the codes are 1,2,3 and first three observations are 1,2,3, you don't need to look through millions of observations to follow).

Best regards, Sergiy Radyakin.
Attached Files

flevelsof.txt (743 Bytes, 1 view)
1 like
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35173
#143

02 Oct 2014, 13:47

I have a personal perspective on levelsof, as its original author (under different names, but that's not material). Naturally, as it is now an official command, its programming and documentation are totally the responsibility of StataCorp. Equally naturally, my personal reasons for writing it in the first place need not be identical to, or even relevant to, anyone's reasons for using it now.

But in essence I see two main uses for levelsof, at least as intended.

The first was as a display command to show which distinct values are present, and to show really concisely, more concisely even than tabulate, which values are, and sometimes which are not, present in the data.

The second was to provide output of a returned value or equivalently a local macro including a list of those distinct levels for use in looping, especially when there is some irregularity to the distinct levels.

With any variable with a large number of distinct levels, the benefits in using levelsof for either purpose are likely to be much diminished, to the point that it may be wondered why people are using levelsof at all. Sometimes a display of e.g. hundreds of distinct values can be useful, but not often. If the aim is to loop over the distinct values, there are likely to be better ways to do it, most notably statsby or using egen, group() to construct a looping variable.
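
A minimal sketch of the egen, group() route, assuming auto.dta as stand-in data; the names g and G are hypothetical. The grouped variable runs 1, 2, ..., G whatever the original codes are, so a plain forvalues loop replaces the macro list.

```stata
sysuse auto, clear

* map the distinct values of rep78 to consecutive integers 1..G
egen long g = group(rep78)
summarize g, meanonly
local G = r(max)

* loop over groups by number rather than over a list of levels
forvalues i = 1/`G' {
    quietly summarize price if g == `i'
    display "group `i': mean price = " %9.2f r(mean)
}
```

Observations with missing rep78 get missing g and so are skipped by the loop.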
1 like
Comment
László Sándor

Join Date: Apr 2014

Posts: 120
#144

03 Oct 2014, 09:40

Originally posted by Sergiy Radyakin View Post

Tabulation on half data takes half as much time as on full data (provided some uniformity assumption about how unique values are distributed across the data). So two processors should produce the list of unique values twice as fast (considering merging the two lists negligible compared to the task of looking through the data).

Another suggestion for version 14, then (I *do* try to keep these posts on-topic), is to increase transparency on MP support. As someone who cajoled his institution into buying two 64-core licenses, I am embarrassed by how hit-and-miss the MP benefits are (or on the hardware side: requesting 8 8-core chips on a compute node, and the corresponding memory). -tabulate- is indeed much faster than many of its alternatives, but I am still dismayed if it's not parallelized. Yet it is much, much faster than collapsing the relevant data (-preserve- and -restore- are costly on many systems and with larger data) or trying -egen-. I was lazy with -by: egen, mean()- last night and wasted nine hours without -egen- completing. I have no relevant comparison, but -tabulate- can only be faster, though parsing the resulting matrices is a bit clumsy.
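
For the group-mean case specifically, one common alternative to -egen- is a running sum under -by:-. A minimal sketch, assuming auto.dta as stand-in data and hypothetical variable names m_fast and m_egen; whether it is actually faster on a given large dataset is an assumption to check with timings.

```stata
sysuse auto, clear

* running sum divided by running count of nonmissing values;
* in the last observation of each group this is the group mean
bysort foreign: generate double m_fast = sum(price) / sum(!missing(price))

* copy the last observation's value to the whole group
by foreign: replace m_fast = m_fast[_N]

* the -egen- equivalent, kept only for comparison
by foreign: egen double m_egen = mean(price)
```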

I have no relevant comparison of -levelsof- to -tabulate-.

If -tabulate- is this much faster, I would like to see -tabulate, summarize()- also produce a matrix for later use. This goes back to an earlier wish on building in a fast version of -binscatter- (and its -fastxtile-).
Comment
Phil Bromiley

Join Date: Apr 2014

Posts: 4348
#145

03 Oct 2014, 14:00

I would like a Stata output procedure that, like outreg2, wrote to Word files. It would also be nice if we had an easy way to export formatted correlation matrices to Word.
1 like
Comment
László Sándor

Join Date: Apr 2014

Posts: 120
#146

03 Oct 2014, 20:46

MathWorks is now marketing MapReduce on the desktop and MapReduce on Hadoop for Matlab. If only something like it would be easy to do for StataCorp too. http://www.mathworks.com/discovery/m...ce-hadoop.html

Last edited by László Sándor; 03 Oct 2014, 21:23.
Comment
Alan Neustadtl

Join Date: Mar 2014

Posts: 107
#147

06 Oct 2014, 20:14

How about factor variables for the left hand side of models. Something like:

Code:

logistic i.sex i.chd c.income

Best,
Alan
Comment
Joseph Coveney

Join Date: Apr 2014

Posts: 4347
#148

06 Oct 2014, 21:33

Originally posted by Alan Neustadtl View Post

How about factor variables for the left hand side of models. Something like:

Code:

logistic i.sex i.chd c.income

Best,
Alan

Stata actually doesn't prohibit using factor variables on the left-hand side. It's your option, as the author of an estimation command, to have your command forbid factor variables there. It's done with the _fv_check_depvar command.
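
A minimal sketch of how a user-written command might do that. The wrapper name myreg is hypothetical, auto.dta is stand-in data, and calling _fv_check_depvar with just the dependent variable is my assumption about its simplest form.

```stata
capture program drop myreg
program define myreg, eclass
    version 11
    * fv lets factor-variable notation through the parser
    syntax varlist(numeric fv min=2) [if] [in]
    gettoken depvar indepvars : varlist

    * the author's choice: refuse factor variables on the left-hand side
    _fv_check_depvar `depvar'

    regress `depvar' `indepvars' `if' `in'
end

sysuse auto, clear
myreg price mpg i.foreign
capture noisily myreg i.rep78 mpg
```

Leaving out the _fv_check_depvar line is what would let a factor variable through to whatever the command does next.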
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 29773
#149

06 Oct 2014, 21:44

I understand Joseph Coveney's response: if you are writing your own estimation procedure, you can have factor variables on the left side if you want to.

But I don't get Alan's original question and his example. Just what would -logit i.sex i.chd c.income- mean? Logistic regression implies that the dependent variable is not only categorical, but specifically a dichotomy. And if you wrote -regress i.something i.predictor c.other_predictor-, what would you want regress to do? It seems to me that all of the built-in estimation commands uniquely determine whether their dependent variables are categorical or not. Perhaps the exception is Poisson which will accept (and use as continuous) a continuous outcome variable even though it is nominally (no pun intended) a procedure for estimating count variables.
Comment
Richard Williams

Join Date: Apr 2014

Posts: 4886
#150

06 Oct 2014, 21:45

Originally posted by Alan Neustadtl View Post

How about factor variables for the left hand side of models. Something like:

Code:

logistic i.sex i.chd c.income

Best,
Alan

I suppose that would be mildly advantageous if your response variable is coded 1,2 rather than 0, 1. It would be a disaster if your response variable was coded 0, zillions of values besides zero, because each of those non-zero values would get treated as a unique value rather than as 1. So I am inclined to think it wouldn't be a good idea, but I suppose it wouldn't matter that much.
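
For the 1,2 coding case, the current workaround is a one-line recode. A minimal sketch, assuming auto.dta as stand-in data and hypothetical variables y12 (a 1/2 coded response built for the example) and y01:

```stata
sysuse auto, clear

* build a hypothetical response coded 1/2 instead of 0/1
generate byte y12 = foreign + 1

* logit treats every nonzero, nonmissing value as a positive outcome,
* so the 1/2 version has no variation and the command stops
capture noisily logit y12 mpg

* explicit 0/1 recode that keeps missing values missing
generate byte y01 = (y12 == 2) if !missing(y12)
logit y01 mpg
```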

-------------------------------------------
Richard Williams, Notre Dame Dept of Sociology
StataNow Version: 18.5 MP (2 processor)
WWW: https://www3.nd.edu/~rwilliam
Comment