
  • As someone who does quasi-experimental research, I would appreciate formal Stata implementation for procedures such as augmented synthetic-controls, robust synthetic controls, and similar estimators. In Stata 17, Stata formally implemented Difference-in-Differences estimators, so I'd like to see formal extensions in this area too, given recent methodological advances.



    • I frequently work on several do files at once. It would be nice to be able to apply the Find command to all of them at once rather than just the one that is open.

      I should clarify this a bit. The do file editor allows multiple files to be active at once, each with its own tab. One file is open. I want to be able to do a search on all tabbed files at once.
      Last edited by Dick Campbell; 18 Sep 2021, 15:04.
      Richard T. Campbell
      Emeritus Professor of Biostatistics and Sociology
      University of Illinois at Chicago



      • This post concerns latent profile analysis. It's a repost of an earlier post I made on the Stata 17 wishlist. Basically, the default options will nudge less experienced users into making a very restrictive set of assumptions. Moreover, the manual example does not make clear how restrictive the assumptions are.

        Stata's defaults are to A) assume that the error variances are equal across the latent classes and B) assume a diagonal covariance structure for Gaussian indicators. I think assumption A is the more problematic one, so I'll deal with it first. Consider the graphic below from Masyn's chapter on mixture models, which is referenced in the SEM examples dealing with LCA and LPA.

        Issue A

        [Figure: from Masyn's mixture-models chapter. Panel a) shows a scatterplot of two indicators, Y1 and Y2; panel b) overlays a 3-class LPA solution with class-invariant error variances, drawn as equal-sized circles.]

        Panel a) in the diagram shows a dataset with 2 indicators, Y1 and Y2. Panel b) illustrates a 3-class LPA model with equal error variances. Do you see how the circles are the same size? The center of each circle represents the means of the indicators, and the diameters along the x- and y-axes represent the error variance of each indicator (they happen to be equal in this example, but they need not be). That's what equal error variances means.

        That's just a sample dataset. Knowing nothing about what Y1 and Y2 are, maybe it's not absurd to suppose that the group of dots might stem from three separate sources. That's fine. The thing is that with real data, you might not be able to make this assumption. However, if you don't override Stata's default assumptions, you will be telling Stata to take your magic multidimensional cookie cutter and to cut out k cookies of equal size from your data. If you relax that default, you tell Stata that it can resize the magic cookie cutter as appropriate after each stamp.

        For clarity, I show Stata's default behavior with code after I discuss issue B.

        Issue B
        Now, let's deal with case b). Consider the diagram below, which is borrowed from the manual for the R package flexmix. This is an artificial (I think) dataset with two dimensions; for the purposes of this post, you could simply think of them as physical x and y coordinates. Both panels represent the results of latent profile models with 4 classes. Each color represents the observations assigned to one latent class.

        The model for the left panel had diagonal covariance structure for the Gaussian indicators, i.e. within each class, all the Gaussian indicators have 0 correlation. That's described starting on pg 14 of the manual. (NB: I believe flexmix's default is to assume unequal error variances across classes for Gaussian indicators. You can see that the size of each circle is different. So, there's precedent for not using equal error variance across classes as the default.) Note classes 1 and 4. There's a small swathe of points running diagonally. The first model broke that group into two distinct classes.

        On the right panel, the model has unstructured covariance, i.e. within each class, a correlation between the (errors of) each indicator variable is explicitly modeled. It can turn out to be 0, as with class 3 on the right. However, note that the left panel's classes 1 and 4 have become one class, #2, on the right. See how the ellipse is slanted - that tilt represents the correlation. In magic multidimensional cookie cutter terms: Stata's default behavior is to cut (multidimensional) ellipses at only angles of 0 or 90 degrees. Relaxing Stata's default behavior lets it tilt the cookie cutter as appropriate with each cut.

        My ask
        Relax Stata's default behavior when fitting latent profile models, and clarify in the manual that you need to explore models with fewer constraints. SEM example 52 mentions only that the final model relaxed both constraints described above, but not why you need to do this or what it does. I believe it would be better if the default were the least restrictive set of assumptions.

        This is one recent example where a new poster fit LPMs with only Stata's default (and restrictive) assumptions. I link the post not to criticize the user. Again, they were nudged into this action by Stata's defaults.

        Code example for case A
        Code:
        use https://www.stata-press.com/data/r16/gsem_lca2
        gsem (glucose insulin sspg <- _cons), lclass(C 2) byparm 
          ... (output omitted) ...
          --------------------------------------------------------------------
           var(e.glucose)|
                        C|
                        1|   191.5596   23.83815      150.0992    244.4723
                        2|   191.5596   23.83815      150.0992    244.4723
           var(e.insulin)|
                        C|
                        1|   119.0542   14.00336      94.54204    149.9217
                        2|   119.0542   14.00336      94.54204    149.9217
            var(e.sspg)#C|
                        1|   55.91283   6.713667      44.18801     70.7487
                        2|   55.91283   6.713667      44.18801     70.7487
          --------------------------------------------------------------------
        The error-variance estimates shown are for each of the two latent classes. See how, for every indicator, they're equal across the classes? Now, with unequal error variances:

        Code:
        gsem (glucose insulin sspg <- _cons), lclass(C 2) lcinvariant(none) byparm  
         ... (output omitted) ...
         --------------------------------------------------------------------
          var(e.glucose)|
                       C|
                       1|   22.62693    4.35593       15.5153    32.99827
                       2|   1263.401   223.8804      892.6978    1788.043
          var(e.insulin)|
                       C|
                       1|   26.36603   4.285562      19.17298    36.25767
                       2|   283.2775   50.93803       199.137    402.9697
           var(e.sspg)#C|
                       1|   25.26045   5.003334       17.1334    37.24247
                       2|   70.49358    12.7819       49.4094    100.5749
         --------------------------------------------------------------------
        I don't feel it's necessary to demonstrate with code, but the option covstructure(unstructured) will fit LPMs where the error terms for each indicator are allowed to correlate within each latent class. For each class, you'll see the covariance between each indicator at the end of the results table. Recall that you can convert covariance to correlation: rho = covariance(x1, x2) / sqrt[Var(x1) * Var(x2)].

        Be aware that it can be very hard to answer a question without sample data. You can use the dataex command for this. Type help dataex at the command line.

        When presenting code or results, please use code delimiters to format them. Use the # button on the formatting toolbar, between the " (double quote) and <> buttons.



        • Adding a wish for support of the OpenType font format (so the same OpenType fonts, and their features, can be used in Stata plots, LuaLaTeX, and Adobe InDesign).
          Last edited by Bjarte Aagnes; 22 Sep 2021, 10:18.



          • Originally posted by Giovanni Russo View Post
            An expansion of the SEM and GSEM suite, the 3-step approach for LCA and other estimation methods that are less sensitive to deviations from normality. Mplus and LatentGold offer a wider set of options
            Can you clarify what you mean by other estimation methods that are less sensitive to deviations from normality in this context? With binary indicators, I don't see what normality would mean. You can treat continuous indicators as Gaussian, but you can also treat them as Poisson, negative binomial, as any type of survival model, etc.
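            To illustrate that flexibility, here is a minimal sketch of a gsem latent class model that mixes families across indicators. The variable names are hypothetical; the point is only that each indicator can take its own family and link:

            Code:
            * hypothetical indicators: a count, a binary item, and a continuous one
            gsem (visits <- _cons, poisson)  ///
                 (smoker <- _cons, logit)    ///
                 (bmi    <- _cons),          /// Gaussian by default
                 lclass(C 2)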

            I would second the statement about 3-step approaches. For other readers: very often, after we fit an LCA, we want to know how other variables that weren't used in the LCA model are related to class membership. For example, say we fit an LCA on profiles of adolescent risk behavior, and say we found some subtypes with qualitatively different types of risk. How is, say, being raised by a single parent related to risk profile membership?

            Many readers will go and do modal class assignment, i.e. predict the latent class membership probabilities, then take the class with the highest probability and assume each person belongs to that class. Then you tabulate Y by K, so you have E(Y | K = k). This is theoretically erroneous, because we don't know which class someone belongs to; we only know the vector of probabilities that they belong to each latent class. Now, after the LCA model, we may be relatively certain about which classes people belong to (i.e. high entropy), and this exercise would be only slightly wrong but still useful. However, we aren't always certain. It's been shown by smarter people than I that this classification uncertainty will bias your estimates.
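            For concreteness, here is a sketch of that (flawed) modal-assignment workflow, assuming a 3-class gsem model has just been fit and using a hypothetical covariate singleparent:

            Code:
            * after fitting, e.g., gsem (...), lclass(C 3)
            predict double pr*, classposteriorpr      // posterior membership probabilities
            egen double maxpr = rowmax(pr1 pr2 pr3)
            generate byte modal = .
            forvalues k = 1/3 {
                replace modal = `k' if float(maxpr) == float(pr`k')
            }
            tabulate singleparent modal, column       // naive E(Y | K = k); ignores uncertainty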

            One way around this is to go and fit a latent class regression. Say K refers to latent class membership, Indicators is the vector of indicators of the latent class (e.g. going with my prior example, you might use sex, alcohol consumption, smoking, other drug use, etc as indicators), and Y is a vector of covariates which might influence latent class membership but aren't indicators (e.g. single parent, income). In a LCA, you estimate E(Indicators | K = k). In a latent class regression, you simultaneously fit an LCA and estimate P(K = k | Y).

            The problem with latent class regression is: what if you have a lot of indicators? Also, what if your latent class characteristics change substantially when you introduce Y? Three-step approaches will fit an LCA model and then tabulate Y by K while also correcting for classification uncertainty. A whole bunch of articles can be found if you Google this; many are written by Jeroen Vermunt and colleagues. Quite frankly, it's taken me a long time to understand what they're talking about, and I still can't follow their algebra for how they correct for classification uncertainty. I can gather that it's not straightforward to implement in Stata (or at least it's beyond my math and programming skill), so I haven't tried.

            And speaking of entropy, that calculation is fairly straightforward to implement, and many forum members have given code. However, I'd like to see it implemented in Stata 18; I made this request in a post on the wishlist for 17. As another part of that wishlist: when doing latent class analysis with binary indicators, we will sometimes have situations where the class-specific proportion of an indicator is 0 or 1. That corresponds to logit intercepts of +/- infinity, and it will prevent convergence under Stata's default convergence criteria. Mplus (and possibly the R package poLCA, plus the Penn State LCA plugin for Stata) will constrain such logit intercepts to +/- 15 as appropriate, then declare convergence while providing a warning. I'd like to see this implemented as default behavior in Stata, with a fairly prominent warning. I'd also like the manual to outline this case, describe why it happens, and warn that too many such constraints are a sign that you're trying to extract too many latent classes (i.e. drop this model and go back to the one with k-1 latent classes).
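            For what it's worth, a hedged sketch of the normalized entropy calculation, E = 1 - [sum over i,k of -p_ik * ln(p_ik)] / (N * ln K), assuming a 2-class gsem model has just been fit:

            Code:
            * after fitting, e.g., gsem (...), lclass(C 2)
            predict double pr*, classposteriorpr
            generate double plogp = 0
            forvalues k = 1/2 {
                replace plogp = plogp - cond(pr`k' > 0, pr`k' * ln(pr`k'), 0)
            }
            quietly summarize plogp
            display "Normalized entropy = " 1 - r(sum) / (r(N) * ln(2))

            Values near 1 indicate well-separated classes; values near 0 indicate heavy classification uncertainty.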

            All 3 of the issues here (latent classes vs distal outcomes, entropy, and non-convergence due to logit intercepts wandering off to infinity) are fairly frequent issues raised on the forum.

            Ideally, I would also like to see the bootstrap likelihood ratio test for k vs k-1 latent classes implemented. That seems to be a well-accepted test, but it appears complex to implement and also very processor-intensive to execute.



              • I was referring to alternatives to ML for estimating SEM models that are less sensitive to deviations from the normality assumption, for example Diagonally Weighted Least Squares (DWLS), also referred to as WLSM or WLSMV.



              • Originally posted by Giovanni Russo View Post
                I was referring to alternative to ML to estimate SEM models which are less sensitive to deviations from normality assumption, for example Diagonally Weighted Least Squares (DWLS) also referred to as WLSM or WLSMV.
                That makes sense. Stata does have an asymptotic distribution-free (ADF) estimator for traditional SEM. I am under the impression that it is a type of weighted least squares estimator. However, this isn't my specialty; perhaps someone more knowledgeable can comment.



                • I would like to suggest that StataCorp reconsider, from a computer science perspective, how it abstracts and organizes its commands and functionalities. What I mean is that there are features added long ago that, if one were redesigning Stata today, one would likely never implement in their current form, and there are functionalities that other software and languages have which Stata lacks. To give some examples:

                  Code:
                  sort
                  and
                  Code:
                  gsort
                  have no relative areas of strength over the other, as best I can tell; the only difference is that sort's functionality is merely a subset of gsort's. If I were to re-configure the sorting functionality in Stata, there seems to be no reason I would keep it as-is. Why should the sort command be restricted to sorting only in ascending order? If I want to sort in a different direction, it would be much simpler, and a better abstraction in principle, to have that feature shared with the same command that does ascending sorting, particularly when the command that does descending sorts can already do ascending sorts as well.
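                  To illustrate the overlap (using the auto dataset that ships with Stata):

                  Code:
                  sysuse auto, clear
                  sort price               // sort: ascending only
                  gsort price              // gsort: the same ascending sort...
                  gsort -price             // ...plus descending,
                  gsort foreign -mpg       // and mixed directions across variables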

                  Other examples include the distinction between gen and egen, or the fact that one is unable to use the merge command to join two datasets on key variables that have different names, a feature present in many alternatives to Stata. These examples may not seem like a big concern, but their improvement would nevertheless represent quality-of-life gains over the long run. Indeed, in my experience, it is little distinctions like these that frustrate new users of Stata the most. Refining the functionality of existing commands to better and more natural levels of abstraction would certainly improve Stata's long-term accessibility (even though, you could suppose, a few veterans might grumble about changes to core functionality on account of having to re-learn something they have known for years). I don't think StataCorp needs to throw everything away, of course, but some introspection on how core features might benefit from small changes or minor improvements would be very welcome.
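                  As a sketch of the current workaround for merging on differently named keys (the file and variable names here are hypothetical), you must rename one side's key variable before merging:

                  Code:
                  * hypothetical files: master keys on id, using data keys on person_id
                  use people_master, clear
                  rename id person_id              // align key names before merging
                  merge 1:1 person_id using people_using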


                  Originally posted by Jared Greathouse View Post
                  As someone who does quasi-experimental research, I would appreciate formal Stata implementation for procedures such as augmented synthetic-controls, robust synthetic controls, and similar estimators. In Stata 17, Stata formally implemented Difference-in-Differences estimators, so I'd like to see formal extensions in this area too, given recent methodological advances.
                  I would also like to say I second this suggestion by Jared. I know that there exists an R package that can do this already. If you have read Causal Inference: a Mixtape by Scott Cunningham and look at his example code for implementing these methods in Stata and in R, the Stata example is 3 pages long, whereas the R example is only a few lines when using the package. Something in Stata that incorporates these methodologies would be great.



                  • In interactive use, Stata demands /// for multi-line commands, except when { } are required in loops or if-statements.

                    Other languages use indent (e.g. Python) or allow for added brackets to indicate where a multi-line command starts and where it ends (e.g. R).

                    It shouldn’t be too difficult to allow for such an option, even in interactive use, I hope? It would at least ease coding (and maybe make code prettier too).



                    • The parts of the above that I understand are not correct ("///" can only be used in do-files, not in interactive use), and the rest is confusing; please clarify.



                      • Originally posted by Christopher Bratt View Post
                        In interactive use, Stata demands /// for multi-line commands, except when { } are required in loops or if-statements.

                        Other languages use indent (e.g. Python) or allow for added brackets to indicate where a multi-line command starts and where it ends (e.g. R).

                        It shouldn’t be too difficult to allow for such an option, even in interactive use, I hope? It would at least ease coding (and maybe make code prettier too).
                        If you really don't want the added /// (or /* */) for multi-line commands, you already have the option to change the delimiter to a semicolon.

                        Code:
                        #delim ;
                        Your long multi line command;
                        #delim cr  // to revert to carriage return



                        • It would be convenient if there were a single command that would copy a label from one frame to another.



                          • Responding to #115 and #116:

                            ("///" can only be used in do files, not in interactive use
                            You can use /// in interactive use: when running parts of code in a do-file.

                             I assume most people develop a do-file interactively; alternatively, one can use a do-file to code interactively without keeping the do-file afterwards. In Stata's do-file editor, running part of the code in a do-file requires that the user select the code in question (this is a bit cumbersome; running parts of a do-file is easier in external editors).

                            you have the option already to change the delimiter to a semicolon.
                            Not for interactive use; only when you run the whole do-file. (An earlier request at Statalist, not by me, was that Stata should be more consistent and allow for the semicolon in interactive use.)

                            the rest is confusing; please clarify
                            Take this code:

                            Code:
                            tabplot disagree_home workplace2,              ///
                               title("Use of patients' home",              ///
                                      size(medlarge))                      ///
                               xtitle("")                                  ///
                               b1title("Nurses' workplace")                ///
                               subtitle("") ytitle("")                     ///
                               percent(workplace2)                         ///
                               showval separate(disagree_home)             ///
                               bar1(bfcolor(green) blcolor(green))         ///
                               bar2(bfcolor(green*0.1) blcolor(green*0.3)) ///
                               bar3(bfcolor(red*0.2) blcolor(red*0.3))     ///
                               bar4(bfcolor(red*0.5) blcolor(red*0.6))     ///
                               bar5(bfcolor(red*0.9) blcolor(red))         ///
                               scheme(s1color) yreverse aspect(1)          ///
                               name(tabplot1, replace) nodraw
                            I would prefer to be able to use some sort of brackets: ( ) [ ] { } to show where the command starts and where it ends, like I do when coding in R.

                            Or, see below. Even when the code makes clear where the command starts and where it ends, Stata needs its ///.
                            Brackets -- here, left parenthesis at the start, then right parenthesis at the end -- make it clear where the code starts and where it ends.

                            Code:
                            runmplus(                                               ///
                                predage_r2 lkrspag_r2 trtbdag_r2 c_age c_agesq      ///
                                country pspwght,                                    ///
                                saveinputfile(mplusin) saveinputdatafile(mplusin)   ///
                                savelogfile(e01_5a_MNLFA_all_MI)                    ///
                                variable(                                           ///
                                    weight      = pspwght;                          ///
                                    categorical = predage_r2 lkrspag_r2 trtbdag_r2; ///
                                    constraint  = c_age c_agesq;                    ///
                                    cluster     = country;                          ///
                                )                                                   ///
                                analysis(                                           ///
                                    type      = complex;                            ///
                                    estimator = mlr;                                ///
                                    link      = logit;                              ///
                                )                                                   ///
                                model(                                              ///
                                    discrim BY predage_r2*2.53025;                  ///
                                    discrim BY lkrspag_r2*6.52504;                  ///
                                    discrim BY trtbdag_r2*4.80997;                  ///
                                                                                    ///
                                    discrim ON c_age*-0.05025;                      ///
                                    discrim ON c_agesq*0.04353;                     ///
                                                                                    ///
                                    [ discrim@0 ];                                  ///
                                                                                    ///
                                    [ predage_r2$1*1.33603 ];                       ///
                                    [ lkrspag_r2$1*2.48037 ];                       ///
                                    [ trtbdag_r2$1*3.12506 ];                       ///
                                                                                    ///
                                    discrim*999 (v_disc);                           ///
                                                                                    ///
                                model constraint:                                   ///
                                    new(v_disc1*0.01080);                           ///
                                    new(v_disc2*-0.00116);                          ///
                                    v_disc = exp(v_disc1*c_age + v_disc2*c_agesq);  ///
                                )                                                   ///
                                output(svalues);                                    ///
                                savedata(                                           ///
                                save=fscores;                                       ///
                                file=mnlfa0.dat;                                    ///
                                )                                                   ///
                            )
                            I don't want to have to type all the ///, and I would prefer not having to look at them.

                            I have experience with coding only in Stata and R. R uses brackets to indicate start and stop of a specific command (semicolon can also be used, most convenient for separating two commands on one line.)

                            Since I don't code in Python, I don't know much about it. But I think its use of indentation is elegant. An indented line means: "Code continues!"

                            Code:
                            if a==1:
                                print(a)
                                if b==2:
                                    print(b)
                            print('end')



                            • Braces without content are allowed. I just tried this in the do-file editor, which indented automatically. I might use it more. (tabplot is from the Stata Journal, and just an example here.)

                              Code:
                              sysuse auto, clear 
                              
                              {
                                  tabplot foreign rep78 
                              }
                              I am not fond of the effect of lots of lines ending /// but sometimes it is the least unattractive choice. Without quite putting my finger on why, I dislike ; as a delimiter in Stata but am happy to use it sometimes in Mata. I think it's because needing to type #delimit ; and #delimit cr is ugly as well as irritating.



                              • As a note to #118 above:

                                You can use /// in interactive use: when running parts of code in a do-file.
                                In discussing Stata the term "interactive" is generally reserved for commands typed one-at-a-time in the Stata Command window, or generated from the menus. (This is especially true for Mata.) Submitting do-files, or portions of do-files, is typically not considered interactive. In section 16.1.2 of the Stata User's Guide PDF this distinction is reinforced with the implication that interactive use of /// is different than use in a do-file.

                                The /* */, //, and /// comment indicators can be used in do-files and ado-files only; you may not use them interactively.

