Wish list for Stata 14

Imed Limam

Join Date: May 2014

Posts: 39
#151

07 Oct 2014, 03:13

Contributing to this very interesting debate, I hope that the error messages become more explicit and helpful so that debugging becomes easier. r(????) messages are too general to be useful and do not point to where the error took place. Work on this part of Stata would save a great deal of effort and time.
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35696
#152

07 Oct 2014, 04:37

Easier debugging is a common request. In the abstract, I imagine we all agree.

But I disagree that existing error messages are "too general to be useful". That in turn is too exaggerated to be helpful. I benefit from error messages all the time.

What is trickier here is to move towards the program being smart enough to tell you what you should have written. That is a very difficult, ultimately impossible, goal.

Longer error messages would not necessarily be more helpful. If they appeared by default they would more often be irritating than helpful. Did you know that you can click on an error code to see a longer message?

Knowing where an error occurred is indeed a key part of debugging. Did you know about set trace? A common complaint is that that produces far too much output. The common request that error messages be of the form

error on line 19 of program foo
called at line 42 of program bar
....
called at line 666 of program myprog

is, I understand, on StataCorp's long-term to do list in some form or another. I understand it's trickier than one imagines for reasons that depend on Stata's internals.

P.S. Please see FAQ Advice Section 18.
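
For readers who have not used set trace, here is a minimal sketch of one way to keep its output manageable. The program name myprog is hypothetical and exists only for illustration; set tracedepth limits how far into nested programs the trace descends, and 32000 is the documented default that the last line restores.

```stata
* Hypothetical wrapper program, defined only so there is something to trace
capture program drop myprog
program define myprog
    sysuse auto, clear
    regress price i.foreign weight
end

set tracedepth 1       // trace myprog itself, not the programs it calls
set trace on
myprog
set trace off
set tracedepth 32000   // restore the default depth
```

With a depth of 1 the trace lists the lines of myprog as they run, which is often enough to see which line failed.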
Comment
Imed Limam

Join Date: May 2014

Posts: 39
#153

07 Oct 2014, 07:09

I agree, but there has to be a better middle ground. Being too concise by default is not helpful either. Thank you for pointing me to FAQ Section 18. Regards.
Comment
Richard Williams

Join Date: Apr 2014

Posts: 4987
#154

07 Oct 2014, 07:30

Part of the problem is that the real error often isn't what Stata thinks is the error. For example, if you run this as a do file,

Code:

sysuse auto, clear
reg price i.foreign///
weight

Stata complains

Code:

. reg price i.foreign///
/ invalid name
r(198);

I've seen people spend hours trying to figure out what Stata is whining about, only to finally realize that they need a space after foreign.
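
For comparison, a corrected version of the do-file above might look like the sketch below. The only change is the space before the /// continuation marker; the indentation of the continued line is a style choice, not a requirement.

```stata
sysuse auto, clear
* The space before /// lets Stata read it as a line continuation
reg price i.foreign ///
    weight
```

Like the original, this needs to be run from a do-file, because /// is not recognized when typed interactively in the Command window.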

Occasionally it is possible to suggest a clearer error message to StataCorp, and it will make the change when asked.

Having said all that, I agree that it would be wonderful if there was something better than -set trace on-, which can often overwhelm you with its output.

-------------------------------------------
Richard Williams, Notre Dame Dept of Sociology
StataNow Version: 19.5 MP (2 processor)
EMAIL: [email protected]
WWW: https://www3.nd.edu/~rwilliam
Comment
Alan Neustadtl

Join Date: Mar 2014

Posts: 107
#155

07 Oct 2014, 15:54

Originally posted by Clyde Schechter

But I don't get Alan's original question and his example. Just what would -logit i.sex i.chd c.income- mean? Logistic regression implies that the dependent variable is not only categorical, but specifically a dichotomy. And if you wrote -regress i.something i.predictor c.other_predictor-, what would you want regress to do? It seems to me that all of the built-in estimation commands uniquely determine whether their dependent variables are categorical or not. Perhaps the exception is Poisson which will accept (and use as continuous) a continuous outcome variable even though it is nominally (no pun intended) a procedure for estimating count variables.

While there may be no general prohibition when estimating models, Stata-written estimation routines like -logistic- do not allow factor variables on the LHS. So it strikes me as useful for the factor notation we have become used to on the right-hand side to be allowed on the left-hand side as well, generating an error if the DV is not dichotomous.

In many of the datasets I use the dichotomous measures are coded 1 and 2. In these situations Stata ends with an error. Consider the example below using data from the General Social Survey where the variable sex is coded with male=1 and female=2:

Code:

. logistic i.sex c.educ
depvar may not be a factor variable
r(198);

Best,
Alan
Comment
Joseph Coveney

Join Date: Apr 2014

Posts: 4410
#156

07 Oct 2014, 19:18

Originally posted by Alan Neustadtl

In many of the datasets I use the dichotomous measures are coded 1 and 2. In these situations Stata ends with an error.

Well, you could always use gsem in such cases.

Code:

sysuse auto
recode foreign (0=1) (1=2)
gsem (i.foreign <- c.displacement), logit nolog
Comment
Steve Samuels

Join Date: Mar 2014

Posts: 1786
#157

07 Oct 2014, 19:49

I believe that the proper numbering for binary responses, especially "Yes/No" in a questionnaire should be 1 "Yes" 2 "No". This is natural phrasing in ordinary language, whereas 0 "No" 1 "Yes" is not. And, in writing questionnaires, the more natural and unsurprising the phrasing, the better. That said, I've never been tempted to use the 1-2 values in an analysis, on either side of a regression equation. For a binary outcome Y with probability \(P\) , a 1-2 coding would destroy the simple theoretical relation

\[
E(Y) = P
\]
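
To spell out the arithmetic, assume for illustration that under the 1-2 coding the response coded 1 occurs with probability \(P\) and the response coded 2 with probability \(1 - P\) (this labeling is an assumption, not something stated above). Then

\[
E(Y) = 1 \cdot P + 2 \cdot (1 - P) = 2 - P
\]

whereas under 0-1 coding with \(P = \Pr(Y = 1)\) the mean is \(0 \cdot (1 - P) + 1 \cdot P = P\).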

Last edited by Steve Samuels; 07 Oct 2014, 20:02.

Steve Samuels
Statistical Consulting
[email protected]

Stata 14.2
Comment
Joseph Coveney

Join Date: Apr 2014

Posts: 4410
#158

07 Oct 2014, 22:43

Originally posted by Steve Samuels

For a binary outcome Y with probability P, a 1-2 coding would destroy the simple theoretical relation E(Y) = P

I'm reluctant to continue the thread drift, but with Stata's factor variables you can have your 1-2 coding cake and eat it too:

Code:

clear *
set more off
set seed `=date("2014-10-08", "YMD")'
quietly set obs 200
generate byte response = floor((2 - 1 + 1) * runiform() + 1)
generate double predictor = runiform()
gsem (ib2.response <- c.predictor), logit nolog // <- Here
recode response (2 = 0)
logit response c.predictor, nolog // <- Confirmed here

Also, it's not uncommon in clinical trial data-collection ("case report") forms for patient eligibility to have responses to items in the top-half of the page coded 1 = Yes and 2 = No for inclusion criteria, and 2 = No [that is, left or first] and 1 = Yes for exclusion criteria in the bottom half of the page.

And, for some reason, I've always believed that the 1-2 coding (instead of 0-1 coding) harks back to the days when SAS was the exclusive software package used for clinical trial data. (Namely, because of PROC LOGISTIC's default coding for the response variable.)
Comment
Maarten Buis

Join Date: Mar 2014

Posts: 3456
#159

08 Oct 2014, 01:58

Originally posted by Steve Samuels

I believe that the proper numbering for binary responses, especially "Yes/No" in a questionnaire should be 1 "Yes" 2 "No".

I don't understand that statement. I always read 0 as "false" and 1 as "true". So the first thing I do when I open a new dataset is change variables like sex (1 "male" 2 "female") into a more sensible (to me) variable female (0 "false, so male", 1 "true so female"). The 0=false and 1=true convention derives from Boolean logic. Where does the 1=yes, 2=no convention come from?

---------------------------------
Maarten L. Buis
University of Konstanz
Department of history and sociology
box 40
78457 Konstanz
Germany
http://www.maartenbuis.nl
---------------------------------
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35696
#160

08 Oct 2014, 03:40

I guess this hinges on the distinction between what a person filling in [filling out] a form sees and what the researcher uses. Given

Are you

0. a new learner?
1. an experienced learner?

there is just too much scope for somebody not familiar with Boolean logic to feel offended, confused or puzzled. Naturally, don't do that then! is one answer, that is, don't show numeric codes to someone taking a survey.
Comment
László Sándor

Join Date: Apr 2014

Posts: 120
#161

15 Oct 2014, 13:34

Originally posted by László Sándor

MathWorks is now marketing MapReduce on the desktop and MapReduce on Hadoop for Matlab. If only something like it would be easy to do for StataCorp too. http://www.mathworks.com/discovery/m...ce-hadoop.html

And by the way, Hadoop and Spark for Python are here too.
http://continuum.io/anaconda-cluster
If only Stata code (and licenses!) could work similarly on clusters or rented hardware like Amazon Web Services.
Comment
Matthew White

Join Date: Apr 2014

Posts: 29
#162

31 Oct 2014, 16:38

I've found Stata 13's introduction of Java plug-ins very useful. Here are some notes from my experience...

From [P] java (my emphasis):

When a programmer is developing and testing a Java program, it is important to understand when
the JRE is loaded and its effect.

The JRE loads the first time that it is needed. That can happen if internal Stata functionality requires
Java or if Java is needed for some user-written command. Java’s classpath is set when the JRE is
loaded, and it cannot be modified afterward (that is, modifying the ado-path after the JRE has loaded
will not change the classpath). For the end user who is consuming a completed Java plugin, the
process of how Java plugins are loaded is not important because it happens transparently. However,
for the programmer who is modifying and testing code, it is very important to understand the process.
Assume you are implementing a Java method named mymethod(). You have compiled it, placed
the class or JAR file in the correct location, and call it for the first time using javacall. Perhaps it
executes correctly, but you want to make some modification. You edit the source code, compile it,
and copy it to the correct location. If you are using the same Stata session, your changes will not be
reflected when you call it again. To reload a Java plugin, Stata must be restarted.

When writing anything but the simplest Java classes, I find myself restarting Stata frequently, which is cumbersome and slows down development. Part of the reason for this is my profile.do, which takes several seconds to complete. This wait time is normally acceptable, but is less convenient when I'm restarting Stata relatively often. Even beyond my profile.do, reopening my do-file editor windows and resetting the working directory is a hassle.

With this in mind, it would be great to have a Stata command to reload the JRE.

Part of what my profile.do does is set my ado-path, which contains about 75 directories, as my ado-files are scattered across project directories and Git repositories. It also sets my PERSONAL system directory outside the default C:\ado\personal: I like keeping PERSONAL on Dropbox in order to facilitate ado-file consistency across machines. Yet all my calls to the adopath and sysdir commands seem to be processed after the JRE is loaded fairly early on, so javacall looks only in C:\ado for Java files. Again, a command to reload the JRE with the current ado-path would address this.

Less importantly, it would be nice to be able to get/set string scalars. Especially when working with difficult strings, locals sometimes aren't the right option, so I find myself returning values from Java to Stata locals, then using Mata to copy the locals to string scalars.
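
The workaround described above, copying a local into a string scalar through Mata, might look like the sketch below. The names fromjava and s_fromjava are hypothetical, and the local is filled in by hand here as a stand-in for a value that a Java method would have returned.

```stata
* Stand-in for a value that a Java plugin returned in a local macro
local fromjava "a value returned from Java"

* Copy the local into a string scalar via Mata
mata: st_strscalar("s_fromjava", st_local("fromjava"))

display s_fromjava
```

Going through st_local() and st_strscalar() means the string is never re-expanded as a macro on the Stata side, which is presumably why this route helps with difficult strings.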

It'd also be nice to get/set stored results.

All in all, I'm very glad that Stata has this Java integration. A few changes would help the experience flow better.
Comment
László Sándor

Join Date: Apr 2014

Posts: 120
#163

06 Nov 2014, 08:45

I wonder if anyone would revisit -areg-. Though I would not be surprised if it had generated long discussions before, I am just not aware of them.

Basically, I am not sure what to think about the behavior that -areg- is able (even defaults) to predict out of sample — but not the absorbed fixed effects. I understand why the latter is necessary if the absorbed variable is itself missing, but then even the other fitted values are hard to make sense of, unless one has some strong priors that the fixed effects have mean zero when the absorbvar is missing.

In other cases, when I fit the model on a subsample, but all variables are observed out of sample, even the absorbed one, I would love to be able to predict the values incl. "d", the fixed effects.

If it's algorithmically impractical to change the behavior of "predict, d" (i.e. the speed of -areg- comes from transforming only e(sample)), then I would revisit what "predict, xb" defaults to after areg, without a warning.

Though I understand, part of my concern is what "predict, xb" is useful for at all, without also running "predict, d".
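
A minimal sketch of the situation described above, using the auto dataset as an arbitrary stand-in: the model is fit on a subsample, and the two predictions are then compared inside and outside e(sample). The variable names xb_hat, d_hat and insample are hypothetical, and no particular output is asserted here.

```stata
sysuse auto, clear

* Fit on a subsample (domestic cars), absorbing repair-record categories
areg price weight if foreign == 0, absorb(rep78)

predict double xb_hat, xb   // linear prediction, excluding the absorbed effects
predict double d_hat, d     // the absorbed fixed effects

* Compare how many predictions are missing in and out of the estimation sample
generate byte insample = e(sample)
count if missing(xb_hat) & !insample
count if missing(d_hat) & !insample
```

Comparing the two counts shows directly how far the fitted values and the fixed effects extend beyond the estimation sample.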

By the way, if StataCorp adopts the code from -reghdfe-, its similar behavior might also be revisited.
Comment
Matthieu Gomez

Join Date: Nov 2014

Posts: 11
#164

12 Nov 2014, 12:31

I have never participated in Statalist but I registered after reading this thread. Here is my wishlist in order of importance:

- A faster "sort" and "by: egen sum". These functions could be made 10x faster, as shown by the performance of the R packages data.table and dplyr, or the Python library pandas. I have attached some benchmarks. Since sort and by: sum are used across a wide range of commands, this would make Stata instantly better with large datasets.
- areg with an arbitrary number of fixed effects, multiple clusters, and iv. Basically what already exists in R with the package lfe (http://cran.r-project.org/web/packages/lfe/index.html)
- A faster csv reader. Again, the Python and R readers (pandas pd.read_csv and data.table fread) are 10x faster than Stata's import delimited.
- New functions gzipuse and gzipsave that would read and write gzip files using a named pipe (for reference: http://www.nber.org/stata/efficient/pipes.html, http://www.stata.com/statalist/archi.../msg00867.html, http://hsphsun3.harvard.edu/cgi-bin/...ticle-183.html)
- An option to preserve / restore in RAM.
- The command append should have an option to coerce variables with conflicting types. In particular, numeric in master vs string in using should coerce everything to strings.
- Remove merge m:m or at least improve the documentation. The documentation in help merge just says "Many-to-many merge", which is meaningless, if not wrong. The documentation should explain what m:m does (at least as clearly as in the former documentation of merge) and redirect users to joinby. It would also be nice if a new version of joinby could have the same syntax as merge (including default options).
- A lighter set trace on. On error, this mode would print the last command, the stack of function calls leading to it, and all stored macros and scalars.
- No restriction on the length of variable names.
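
To illustrate the merge m:m point in the list above, here is a small sketch with made-up data (the variables id, x and y and the values are invented for illustration). Within each id, merge m:m pairs observations sequentially, while joinby forms every pairwise combination, which is usually what people expect from a many-to-many join.

```stata
clear
input id x
1 10
1 11
end
tempfile left
save `left'

clear
input id y
1 100
1 101
end
tempfile right
save `right'

* merge m:m pairs the first with the first, the second with the second
use `left', clear
merge m:m id using `right'
count

* joinby forms all pairwise combinations within id
use `left', clear
joinby id using `right'
count
```

With two observations per id on each side, the first count is 2 and the second is 4.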

Benchmarks · Stata to R

http://www.princeton.edu

Last edited by Matthieu Gomez; 12 Nov 2014, 12:50.
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35696
#165

12 Nov 2014, 13:04

Matthieu:

Anyone wishing to follow your benchmark results would need to install user-written commands you use other than your own, which you name.

I spotted distinct and reg2hdfe; there may be others.

(Please register using your full name, Matthieu Gomez.)
Comment