
  • Originally posted by László Sándor

    This might be related or unrelated, but there seem to be more and more features of Stata (factor variables, large-N small-T panels etc.) which would benefit greatly from sparse matrices in Mata. One wonders how hard it is to add.

    Separate memory spaces came up before, but note that the huge costs of sorting and preserving-restoring in data with many covariates (esp. if unused in a line) or irrelevant observations come from the fact that the rest of the big data is also moved in memory needlessly.
    I'll second this. The lack of sparse matrices is one of the major impediments (though far from the only one) to my colleagues and me using Mata for a wider array of tasks. Mata's requirement that every matrix be stored as a full matrix, regardless of how sparse it actually is, imposes needlessly high memory requirements for storing certain matrices, and it caps matrix sizes (for certain problems) far below the limits of competitors that do support sparse matrices, e.g. MATLAB and Python.
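
    Until something like that exists, one hand-rolled workaround is to keep only the nonzero entries as (row, column, value) triplets and loop over them. The sketch below illustrates a matrix-vector product on that assumption; it is not a built-in Mata facility, and the names T, x and y are made up for the example.

    ```stata
    mata:
    // Triplet storage for a 4 x 4 matrix with three nonzero entries:
    // each row of T is (row index, column index, value).
    T = (1, 1, 2 \ 2, 3, 5 \ 4, 2, 7)
    x = (1 \ 2 \ 3 \ 4)
    y = J(4, 1, 0)
    // y = A*x, touching only the stored nonzeros
    for (k = 1; k <= rows(T); k++) {
        y[T[k, 1]] = y[T[k, 1]] + T[k, 3] * x[T[k, 2]]
    }
    y
    end
    ```

    Storage grows with the number of nonzeros rather than with rows times columns, but every operation has to be coded by hand, which is exactly the inconvenience described above.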



    • Small thing, but maybe easy to add: why doesn't -collapse- accept stubs for newvarnames? At the moment you can either pass plain varlists, in which case each variable can be used only once (though you can still take means of some variables and sums of others, of course) and none is renamed during the collapse, or you can name the aggregates, but then you must spell out each newvar = varname pair one by one.
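
      For now, a stub can be emulated by building the newvar = varname pairs in a loop. This is only a sketch of that workaround, not an existing -collapse- option; the stubs mean_ and sd_ and the auto dataset are arbitrary choices for illustration.

      ```stata
      sysuse auto, clear
      local vars price mpg weight
      local meanspec
      local sdspec
      * Build "stub_var = var" pairs for each statistic
      foreach v of local vars {
          local meanspec `meanspec' mean_`v' = `v'
          local sdspec   `sdspec'   sd_`v' = `v'
      }
      * Each variable now appears under two statistics, each with its own name
      collapse (mean) `meanspec' (sd) `sdspec', by(foreign)
      describe
      ```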



      • As to consistency: some commands accept "by", others need "over".
        In the Browse window I would like to be able to specify a case number that I want to jump to.



        • rollanders: over() usually means groups are shown side by side within the same block of output or graph panel; by() within different blocks or graph panels. Some graph commands necessarily support both. If you can find violations of this distinction, sing them out.
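
          A small illustration of that distinction with a graph command that supports both options, using the auto dataset purely as an example:

          ```stata
          sysuse auto, clear
          * over(): both groups share one plot region, side by side
          graph box mpg, over(foreign)
          * by(): one panel per group
          graph box mpg, by(foreign)
          * both at once: rep78 categories within panels defined by foreign
          graph box mpg, over(rep78) by(foreign)
          ```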



          • I'll stick with my ongoing request.... give us some data mining tools! Even basic classification and regression tree commands would be welcome.



            • I also find -levelsof- woefully slow. At least when I am using -fvrevar- anyway, I would like access to the values of the categorical/factor variable that r(varlist) corresponds to. It is lame to jump through hoops to keep track of which tempvar indicates which case. Or maybe I am missing something that is already there? (Then perhaps one can still improve the documentation.)



              • Although I am not working with huge datasets and hence have never experienced trouble with levelsof, it is obvious that the command could be much faster if it were re-implemented in Mata. I have my own version of levelsof that requires Stata 9.2 (as does the original levelsof). In a fake dataset of 1,000 observations filled with random numbers (runiform()), it is almost 20 times faster. Of course, to put this in perspective, the absolute time for both is (far) below one second on my machine.

                I do not fully get what László wants to do with fvrevar.
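
                To give an idea of what a Mata-based approach can look like, here is a minimal sketch for a numeric variable. It is not my actual program and makes no claim about speed or about running under Stata 9.2; rep78 from the auto dataset and the local name levels are assumptions for the example.

                ```stata
                sysuse auto, clear
                mata:
                // distinct values in ascending order; missing values sort last
                levels = uniqrows(st_data(., "rep78"))
                // drop missing values, as levelsof does by default
                levels = select(levels, levels :< .)
                // hand the result back to Stata as a space-separated local
                st_local("levels", invtokens(strofreal(levels')))
                end
                display "`levels'"
                ```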

                Best
                Daniel



                • Originally posted by daniel klein
                  I do not fully get what László wants to do with fvrevar.

                  Thanks, Daniel. -fvrevar- generates the tempvars corresponding to fvvar use, but "maddeningly" it only returns the names of the created variables (plus any existing variables that already correspond to something in the fvvarlist, of course) in a macro, and it does not return any other macro with information on which level or factor each created variable corresponds to. Maybe this is ill-construed for a very general -fvrevar-, as it is hard to imagine what to return for terms of v1##v2##c.v3, say. But for simply parsing things like i.v4, it is needlessly hard at the moment to recover the values, even though I think they are often used in conjunction with the new tempvars themselves. It might even be surprising that StataCorp themselves did not need this functionality.



                  • Originally posted by László Sándor
                    does not return any other macro with any information on which level or factor the created variable corresponds to.
                    László,

                    Is this what you are looking for?

                    Code:
                    sysuse auto
                    fvrevar i.turn
                    char list
                    In essence, each generated tempvar will have a char called either fvrevar or tsrevar that holds its canonical name, and you can recover the group by extracting the number on the left of that name.
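
                    A rough sketch of that extraction, assuming the char is named fvrevar and the canonical names look like 31b.turn or 32.turn (only a trailing "b" base marker is stripped here; other operators would need more care):

                    ```stata
                    sysuse auto, clear
                    fvrevar i.turn
                    local tvars `r(varlist)'
                    foreach v of local tvars {
                        * canonical name stored in the tempvar's characteristic
                        local term : char `v'[fvrevar]
                        * everything before the first dot, e.g. 31b or 32
                        gettoken lvl : term, parse(".")
                        * drop the base-level marker if present
                        local lvl = subinstr("`lvl'", "b", "", 1)
                        display "`v' corresponds to `term' (turn == `lvl')"
                    }
                    ```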

                    Best,
                    S



                    • Thanks, Sergio, that's cool. That said, it is not great if I need to invoke string functions but it's much better than what I thought we had. I should check -char- more often.

                      It is also kind of interesting that Stata generates an all-0 variable for the base case, but that is fine for me.



                      • László,

                        Another option is:

                        Code:
                        sysuse auto
                        fvexpand i.turn
                        display "`r(varlist)'"
                        You can apply string functions to the varlist returned by fvexpand (which produces variable names in the same order as fvrevar).
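
                        For instance, the two lists can be paired by position. This sketch relies on the matching order just described; rep78 and the local names are only illustrative.

                        ```stata
                        sysuse auto, clear
                        fvexpand i.rep78
                        local terms `r(varlist)'
                        fvrevar i.rep78
                        local tvars `r(varlist)'
                        local k : word count `terms'
                        forvalues j = 1/`k' {
                            local term : word `j' of `terms'
                            local tvar : word `j' of `tvars'
                            * level = text before the dot, minus the base marker "b"
                            gettoken lvl : term, parse(".")
                            local lvl = subinstr("`lvl'", "b", "", 1)
                            display "`tvar' marks rep78 == `lvl'"
                        }
                        ```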

                        - joe
                        Joseph Nicholas Luchman, Ph.D., PStat® (American Statistical Association)
                        ----
                        Research Fellow
                        Fors Marsh

                        ----
                        Version 18.0 MP



                        • Originally posted by Joseph Luchman
                          You can apply string functions to the varlist returned by fvexpand (which produces variable names in the same order as fvrevar).
                          Thanks, Joe. I overlooked -fvexpand-, though I think it will tabulate the variable once again in the background, which is wasteful in big data.
                          Last edited by László Sándor; 01 Oct 2014, 13:22.



                          • Something else, simple but useful: -mvencode- is fast and powerful, but not useful for the case when I also want to have a dummy variable to indicate observations which were originally missing. I think it is quite common to deal with missing values this way (i.e. including the observations, but soaking up their average difference in a single dummy, which is hopefully good enough). A new stub option for -mvencode- would be welcome to do this, and do this fast.
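
                            What I currently do by hand looks roughly like this. It is a sketch of the two-step workaround, not a proposed syntax; the miss_ stub, the replacement value 0 and the variable rep78 are arbitrary choices.

                            ```stata
                            sysuse auto, clear
                            local vars rep78
                            * Step 1: flag originally missing observations, one dummy per variable
                            foreach v of local vars {
                                generate byte miss_`v' = missing(`v')
                            }
                            * Step 2: recode the missing values themselves
                            mvencode `vars', mv(0)
                            ```

                            The extra pass over the data in step 1 is what a stub option could avoid.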



                            • "Soaking up their average difference in a single dummy" can be a dangerous way to deal with missing values. It can bias the coefficient estimates associated with the non-missing levels of the variable. There are circumstances where it raises no problems, but I think they are the exception rather than the rule. From that perspective, IMO, it may be a good thing that it is not particularly convenient to do this.



                              • Some big and smaller wishes:
                                • provide tools for Bayesian analysis
                                • implement machine learning algorithms: LASSO, trees, SVM, splines...
                                • alpha blending / graph marker transparency
                                • make set trace on/off into a toggle switch (or, for nostalgic reasons, shorten them to 'tron' / 'troff')
                                • bring back Stata 12's highlighting for matched brackets in the do-file editor!
                                • integrate and extend description of regex capabilities in official documentation
                                And seconding some previous wishes:
                                • +1 for "An option for "tab" that will display both value labels and unlabeled values"
                                • +1 "Nice code completion, especially I would like to get the RStudio equivalent: a list of options after typing the comma for each command. So for instance hitting Tab after typing graph box, would open a list of available options"
                                • +1 for "a replace option for generate."
                                • +1 for saving bookmarks in code
                                • +1 for adding breakpoints for debugging
