
  • Structural equation modeling: Whether and how to estimate margins w.r.t. a latent variable

    I’d like to get input from the Stata community on the advisability and potential utility of estimating the margins of Observable Endogenous (OEn) variables with respect to a Latent Endogenous (LEn) variable. The specific questions I am asking are at the bottom of this post. Any comments, positive or negative, will be greatly appreciated.

    Post-estimation utilities for -sem- and -gsem- facilitate estimating margins of the OEn w.r.t. any Observable Exogenous (OEx) variable. Searching Statalist for discussions of estimating margins after -sem- or -gsem-, I found a 2014 thread initiated by Jan Hultgren; Jeff Pitblado (StataCorp)'s responses there are particularly helpful. I've also searched for this question in various textbooks about SEM, including Anders Skrondal and Sophia Rabe-Hesketh's Generalized Latent Variable Modeling (2004). But so far I have not found a discussion of whether to estimate a margin w.r.t. an LEn or, if that were desired, how best to approach the task.

    Here is a toy example to motivate the issue and, hopefully, to show why it might be pertinent.


    I. Preliminaries
    Consider an extremely simple SEM using Stata’s demo data, auto.dta.
    Code:
    sysuse auto, clear
    replace price = price/1000
    lab var price "Price in $1,000's"
    sem  (price <- foreign L)  (mpg <- foreign L)  (L <- displ gear turn trunk )
    The above model is an instance of a class of SEM models called MIMIC models. See examples 10 and 36g in Stata’s [SEM] PDF documentation. Here are the estimation results:
    Code:
    sem (price <- foreign L) (mpg <- foreign L) (L <- displ gear turn trunk )
     
    Endogenous variables
      Observed: price mpg
      Latent:   L
     
    Exogenous variables
      Observed: foreign displacement gear_ratio turn trunk
     
    Fitting target model:
    Iteration 0:   log likelihood = -1300.1204  (not concave)
    <snip>
    Iteration 13:  log likelihood = -1196.7388 
    Iteration 14:  log likelihood = -1196.7388 
     
    Structural equation model                                   Number of obs = 74
    Estimation method: ml
     
    Log likelihood = -1196.7388
     
     ( 1)  [price]L = 1
    --------------------------------------------------------------------------------
                   |                 OIM
                   | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
    ---------------+----------------------------------------------------------------
    Structural     |
      price        |
                 L |          1  (constrained)
           foreign |   4.042227   .8126842     4.97   0.000     2.449395    5.635059
             _cons |  -3.554518   3.569106    -1.00   0.319    -10.54984    3.440801
      -------------+----------------------------------------------------------------
      mpg          |
                 L |    -2.0604   .3596857    -5.73   0.000    -2.765371   -1.355429
           foreign |  -2.739423   1.369886    -2.00   0.046    -5.424351   -.0544954
             _cons |   39.66227   7.471028     5.31   0.000     25.01933    54.30522
      -------------+----------------------------------------------------------------
      L            |
      displacement |   .0137306   .0049636     2.77   0.006     .0040021    .0234591
        gear_ratio |  -.9408139   .7621917    -1.23   0.217    -2.434682    .5530545
              turn |   .1976456   .0710133     2.78   0.005     .0584622     .336829
             trunk |   .0588124   .0533685     1.10   0.270    -.0457879    .1634126
    ---------------+----------------------------------------------------------------
       var(e.price)|   4.576513    .907895                      3.102217    6.751453
         var(e.mpg)|   10.99188    2.81405                      6.655097    18.15472
           var(e.L)|   .5206339   .4753274                      .0869769    3.116456
    --------------------------------------------------------------------------------
    LR test of model vs. saturated: chi2(3) = 9.17              Prob > chi2 = 0.0271
    est store sem_orig
    After this estimation, we can estimate and plot the margins of both OEn variables w.r.t. one of the OEx variables, -displacement-. (I use Roger Newson’s utility -regaxis-, available from SSC.)
    Code:
    which regaxis
    regaxis displ , lticks(atnumlist) maxticks(7)
    margins , at(displ = (`atnumlist'))
    marginsplot,   ///
      title(Predictive margins and 95% CIs from -sem-)  ///
      subtitle(Conventional margins of OEn variables w.r.t. a single OEx variable)  ///
      note(At mean values of other OEx variables, span)  ///
      ytitle(Predicted values of Price and MPG)  ///
      legend(order(3 "Predicted Price" "in $1,000's" 4 "Predicted Miles-" "Per-Gallon"))  ///
    name(OEx_impact_on_OEn, replace)
    This code produces the following margin plot.
    [Graph: OEx_impact_on_OEn.png — predictive margins of price and mpg w.r.t. displacement, with 95% CIs]



    II. Narrative/Motivation/"Theory"
    In the above model, can the latent variable, L, be interpreted as vehicle "quality"? (Identification problems abound. Please suspend your disbelief here.)

    If the latent variable L represents something real, but not directly observable, like "Vehicle Quality", then it is of interest to produce a graph like the one above with respect to "Quality", an LEn variable. Such a graph would support a narrative that the cause part of the MIMIC model estimates a sort of production function for "Quality", while the indicator part of the MIMIC model can be thought of as a hedonic index of quality. Again suspending our disbelief, imagine that -price- reflects willingness-to-pay, the higher the better, while -mpg- reflects the social value of the car, the lower the better.

    III. Stata Problem
    To my knowledge, neither -sem-'s nor -gsem-'s postestimation commands enable predicting the values of the OEn indicator variables over a range of values of an LEn variable like the L in this model.

    As Jeff Pitblado (StataCorp) emphasizes in his Statalist posts on margins with latent variables (ibid), the unobserved LEn are inherently stochastic, being the result of adding a stochastic error term to a function of observed variables, some of which might also be endogenous and therefore stochastic. Thus the problem is to estimate margins w.r.t. an unobserved random variable. It might seem that a plausible approach would be to derive variation in the predicted value of L from variation in the observed values of the OEx on which it depends. However, although such an approach might give the correct margins, it does not allow an unambiguous estimate of the standard errors of the margins.

    To see this ambiguity, note that the observed values of the four OEx variables in the demo data are quite different between the Buick Century and the Buick Opel. Here I use my own -mluwild- which can be downloaded from inside Stata.
    Code:
    net install mluwild, from(http://digital.cgdev.org/doc/stata/MO/Misc)
    which mluwild
    *Extract the coefficients of these four OEx variables from the -sem- results
    est restore sem_orig
    mluwild e(b)["y1","L:*"]
    mat def Lcoefs = r(submat)
    matlist Lcoefs
     
    *Use the matrix -Lcoefs- to score the four OEx variables
    *for the Buick Century and the Buick Opel, creating predicted values of L.
    *(This -matrix score- command works by matching the column names of the 
    *matrix of coefficients to the names of variables in the data.
    *It is a cool feature of Stata.)
    matrix score L_hat = Lcoefs if make =="Buick Century" | make=="Buick Opel"
    Now note that, when we calculate the margins of the OEn variables -price- and -mpg- for these two different observations:
    • The two estimates of the predicted -price- are almost the same at $6,366, but the standard error of the prediction is more than twice as large for the Opel.
    • The two estimates of the predicted -mpg- are almost the same at 20.9, but the standard error of the prediction is about three times larger for the Opel.

    Code:
    . margins , at(displ=196 gear == 2.93  turn==40  trunk==16) at(displ=304 gear == 2.87  turn==34  trunk==10)
     
    Predictive margins                                          Number of obs = 74
    Model VCE: OIM
     
    1._predict: Linear prediction (Price in $1,000's), predict(xb(price))
    2._predict: Linear prediction (Mileage (mpg)), predict(xb(mpg))
     
    1._at: displacement =  196
           gear_ratio   = 2.93
           turn         =   40
           trunk        =   16
    2._at: displacement =  304
           gear_ratio   = 2.87
           turn         =   34
           trunk        =   10
     
    ------------------------------------------------------------------------------
                 |            Delta-method
                 |     Margin   std. err.      z    P>|z|     [95% conf. interval]
    -------------+----------------------------------------------------------------
    _predict#_at |
            1 1  |   6.365587   .2626448    24.24   0.000     5.850813    6.880362
            1 2  |   6.366048   .6387133     9.97   0.000     5.114193    7.617903
            2 1  |   20.88454   .4258567    49.04   0.000     20.04987     21.7192
            2 2  |   20.88359   1.251274    16.69   0.000     18.43113    23.33604
    ------------------------------------------------------------------------------
    Since the standard errors of the predictions are necessary for inferences about the statistical significance of the impacts of changes in the latent variable, it seems that in order to proceed we must assume that the latent variable is fixed-in-repeated-samples, not random. Thus our estimates of the margins w.r.t. the latent variable will be conditional on this assumption.


    IV. Proposed solution
    The work-around I propose is as follows (a condensed code sketch appears after the list):
    1. Estimate the predicted value of L for each observation in the sample, based only on the causal component of the -sem- model, using the command -predict LfromOEx, xblatent-.
    2. Assume that LfromOEx is "fixed-in-repeated-samples", as if it were an Observed Exogenous variable.
    3. Fit the OEn indicators to this estimate of L using one of Stata's multivariate regression commands, such as -reg3- or -sureg-. To replicate -sem-'s anchoring of each LEn to one of the OEn, constrain the coefficient of L to equal 1.0 in one of the indicator equations.
    4. Optionally: replace the e(b) and e(V) matrices in the -reg3- or -sureg- ereturn space with those from the -sem-, which to Stata then look as if they came from -reg3-, creating a "Frankenstein" set of estimates.
    5. Apply -margins- and -marginsplot- to the resulting -reg3- or -sureg- estimates to obtain the impact on the OEn (here -price- and -mpg-) of changes in the LEn (here "Quality").
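    In code, steps (1)-(3) amount to the following condensed sketch, run immediately after the -sem- and -est store- commands above (the attached do-file has the complete version):
    Code:
    *Step (1): predict L from the causal (OEx) part of the model only
    predict LfromOEx, xblatent
    *Step (2): treat the prediction as if it were an observed exogenous variable
    clonevar L = LfromOEx
    *Step (3): fit the indicators to L, anchoring L at 1 in the price equation as -sem- did
    constraint 1 _b[price:L] == 1
    reg3 (price L foreign)  (mpg L foreign), constraint(1)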
    The attached DO file implements this proposed solution both with and without step (4). Here is the -marginsplot- result of executing steps (1), (2), (3) and (5), without using the Frankenstein estimates.

[Graph: eb_eV_from_reg3.png — margins of price and mpg w.r.t. L, standard errors from -reg3- without modification]
    And here is the proposed solution using all five steps and thus predicting the margins based on the less precise Frankenstein estimates.
[Graph: eb_eV_from_sem.png — margins of price and mpg w.r.t. L, standard errors transplanted from -sem-]
    As you see, the two sets of estimates produce identical estimated margins for -price- and -mpg- at each estimated value of L, but very different standard errors and confidence intervals.


    V. Questions [Should this be a "poll" :-) ]:
    (1) Is the objective of estimating the relationship between the hedonic "quality" of these cars and the indicator variables price and mpg understandable? Suspending our disbelief regarding identification, does this objective make sense?
    (2) Is there a way to compute margins with respect to a latent variable other than by estimating/predicting the latent variable and then assuming it is fixed in repeated samples?
    (3) Is it meaningful to present margins of the OEn which are conditional on the assumption that L is fixed in repeated samples at the values estimated from the causal part of the model?
    (4) After -gsem-, Stata offers only the Empirical Bayes method of estimating the values of the latent variable. Since the Empirical Bayes method uses the OEn variables as well as the OEx variables, the assumption that estimates of the LEn based on the Empirical Bayes method are fixed in repeated samples seems to be a heavier lift. Do folks agree that this distinction argues for using -sem- rather than -gsem-?
    (Or to implement this "solution" after -gsem-, one could simply use -matrix score- to score the OEx variables with their -gsem- estimated coefficients. This should produce results with properties similar to those produced by -predict ..., xblatent- after -sem-.)
    Based on the above and on intuition (and on my experience with large data sets), the estimated coefficients of L from -reg3- without modification are almost the same as the estimated coefficients from the Frankenstein model, but the standard errors are much larger in the Frankenstein model. (Compare graphs named eb_eV_from_reg3 and eb_eV_from_sem.)

    (5) Conditional on assuming the estimated values of L are fixed, as I believe we must do to compute the margins of the OEn w.r.t. L, are we justified in making statistical inferences using the much smaller standard errors from -reg3- without modification?
    (6) Since replacing the -reg3- results with the e(b) and e(V) values from the -sem- estimates yields larger (and thus more conservative) standard errors, are we on firmer ground to use these Frankenstein results for inference, justifying them as restoring part of the stochastic characteristics inherent in the original -sem- model?

    Again, any comments would be appreciated.
    Attached Files

  • #2
    Whoops! Please ignore the sentence reading:

    Again suspending our disbelief, imagine that -price- reflects willingness-to-pay, the higher the better, while -mpg- reflects the social value of the car, the lower the better.
    Replace it with:

    Again suspending our disbelief, imagine that cars viewed as of higher "quality" have higher prices and get fewer miles to the gallon.

    Comment


    • #3
      OK, I'm trying again to see if anyone is interested in this question.

      Here's the code, without the digressions. The code uses the following user-contributed programs:

      From SSC:
      regaxis
      erepost
      From http://digital.cgdev.org/doc/stata/MO/Misc
      mluwild
      grc1leg2 (only used to assemble the three created -marginsplot-s into a single graph)


      Code:
      *    Estimate a MIMIC model using -sem-
      sysuse auto, clear
      replace price = price/1000
      lab var price "Price in $1,000's"
      sem  (price <- foreign L)  (mpg <- foreign L)  (L <- displ gear turn trunk )
      est store sem_orig    //  store for the comparisons with the -reg3- results below
      
      *    Predict values of the latent variable based only on the Observed Exogenous variables
      predict LfromOEx, xblatent
      clonevar L = LfromOEx
          lab var L "Predicted values of L from the Observed Exogenous variables"
      
      *    Use -reg3- to fit OLS regressions of each indicator on the predicted latent variable
      constraint 1 _b[price:L] == 1
      reg3 (price L foreign )  (mpg  L foreign ) , constraint(1)
          est store LfromOEx, title(Unmodified -reg3- estimates of price and mpg on LfromOEx)
      
      *    Compare the two sets of estimates.  The coefficients match (by construction), 
      *    but the standard errors are much smaller in -reg3- estimates,
      *    because -reg3- ignores the stochastic origin of fitted L. 
      est tab sem_orig ., se keep(price: mpg: ) stat(N ll r2_1 r2_2)
      
      *    Using -reg3- estimates, we compute the margins of the indicators 
      *    over the range of the predicted values of the latent variable.
      regaxis L , lticks(atnumlist) maxticks(7)
      margins , at(L = (`atnumlist'))
      
      *    The marginsplot shows impact on the indicators of variation of L 
      *    and shows the narrow confidence intervals for those margins 
      *    that are a consequence of suppressing the stochastic nature of L.
      marginsplot,   ///
          title(Predictive margins and 95% CIs from -sem-)  ///
          subtitle("Margins of OEn variables w.r.t. L:"  ///
              "Standard errors are from -reg3- without modification")  ///
          note("Assumes L is fixed at estimated values and other OEn variables" "(i.e. -foreign-) are fixed at their mean", span)  ///
          ytitle(Predicted values of Price and MPG)  ///
          legend(order(4 "Predicted Price" "in $1,000's" 3 "Predicted Miles-" "Per-Gallon"))  ///
          plot1opts(pstyle(p2)) plot2opts(pstyle(p1))  ///
          ci1opts(pstyle(p2)) ci2opts(pstyle(p1))  ///
          name(eb_eV_from_reg3, replace)
          
      *    A more conservative approach to estimating the standard errors of the 
      *    coefficients linking the indicators to L is to use the portion of the 
      *    -sem- vce that relates to them instead of the vce matrix from -reg3-.
      
      *    The vce from -reg3- is:
      
      matlist e(V)
      
          
      *    Create a "Frankenstein" set of estimates by , getting and assembling the relevant pieces of the e(V) matrix from -sem-  ...
      est restore sem_orig    
      
      which mluwild
      mluwild e(V)["price:*","price:*"]  
          mat def V_pxp = r(submat)
          matlist V_pxp
      
      mluwild e(V)["price:*","mpg:*"]  
          mat def V_pxm = r(submat)
          matlist V_pxm
          
      mluwild e(V)["mpg:*","price:*"]  
          mat def V_mxp = r(submat)
          matlist V_mxp
          
      mluwild e(V)["mpg:*","mpg:*"]  
          mat def V_mxm = r(submat)
          matlist V_mxm
          
      mat def eV_from_sem =  ( V_pxp , V_pxm \ V_mxp , V_mxm )
      
      matlist eV_from_sem
      
      
      *    ... and then stuffing them into the -reg3- estimates to make the Frankenstein estimates:
      est restore LfromOEx
      erepost V = eV_from_sem, rename
      
      est store reg3_with_sem_eV
      
      *    Compare the reg3 estimates and the Frankenstein estimate (reg3_with_sem_eV)
      *    to the original -sem- estimates of the indicator model
      est table sem_orig LfromOEx reg3_with_sem_eV, t keep(price: mpg: ) stat(N r2_1 r2_2)
      
      *    compute the margins and make the marginsplot graph using the Frankenstein estimates
      regaxis L , lticks(atnumlist) maxticks(7)
      margins , at(L = (`atnumlist'))
      marginsplot,   ///
          title(Predictive margins and 95% CIs from -sem-)  ///
          subtitle("Margins of OEn variables w.r.t. L:"  ///
              "Standard errors are transplanted into -reg3- from -sem- results")  ///
          note("Assumes L is fixed at estimated values and other OEn variables" "(i.e. -foreign-) are fixed at their mean", span)  ///
          ytitle(Predicted values of Price and MPG)  ///
          legend(order(4 "Predicted Price" "in $1,000's" 3 "Predicted Miles-" "Per-Gallon"))  ///
          plot1opts(pstyle(p2)) plot2opts(pstyle(p1))  ///
          ci1opts(pstyle(p2)) ci2opts(pstyle(p1))  ///
          name(frankenstein, replace)
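      *    Note: the graph OEx_impact_on_OEn below was created by the -marginsplot-
      *    code in post #1; run that code first (or drop it from the -grc1leg2- list).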
          
      grc1leg2 OEx_impact_on_OEn  eb_eV_from_reg3 frankenstein,  ///
          altshrink ycommon maintotoptitle ytol1title  ///
          holes(2) ring(0) pos(2) lcols(1) lxoffset(-15) lyoffset(-15)  ///
          labsize(small) symysize(small) name(compare_sem_margins, replace)
      This code produces these three -marginsplot-s. The upper left panel shows the conventional -marginsplot- of the indicator variables against one of the observable exogenous variables, displacement. The bottom two panels show the marginsplots constructed for the latent variable. The lines in the two bottom panels are identical, but the confidence intervals are extremely different.
[Graph: compare_sem_margins.png — the three -marginsplot-s combined by -grc1leg2-]

      So my question is whether this procedure for estimating the impact of a change in a latent variable on the indicator variables makes sense, and whether it makes more sense to use the version shown in the lower left panel, which has much smaller standard errors because it ignores the stochastic nature of the latent variable, or the version shown in the lower right panel, which has larger standard errors because they are transplanted from the -sem- estimation.

      Thanks for any reactions.

      Comment


      • #4
        Writing with a point of information. I'm not that familiar with traditional SEM models, as I mainly deal with IRT models. However, I do at least know that in many SEM models, we are building a measurement model: we think there's a latent variable (i.e. one we can't observe directly) whose value we can infer from some indicators. If you took displacement, gear ratio, turn circle, and trunk space as indicators of vehicle quality, the syntax should look more like:

        Code:
        sem (L -> displ gear turn trunk)
        Note the direction of the arrow. Internally, I interpret this as L (the latent variable) causing responses to the 4 specified indicators. In the original post, the direction of the arrow is reversed. Stata will interpret that as regressing L on those 4 variables (now treated as independent variables). It knows L is latent. In the original syntax, L (and the observed variable foreign) cause responses to price and mpg, so those are the two indicators of L.

        I haven't yet digested the full example. But if you meant to say that there's a latent variable L (and its indicators are displacement, gear ratio, turn circle, and trunk space), and I want to regress L on price and foreign, then I think the syntax is more like this:

        Code:
        sem (L -> displ gear turn trunk) (L <- price foreign)
        My current copy of Stata is on a remote server from which I can't easily extract output. However, if you run that code on the auto dataset, you'll see a couple of headers in the output table: Structural and Measurement. Measurement is the bit where Stata tells you how the indicators load on the latent variable. Structural is where you have the results of regressing the latent variable on the independent ones. You'll note that in SEM example 10, the output table also has Structural and Measurement headers, whereas the output table in post #1 only has a Structural header.

        Now, how do you get marginal effects of the independent variables (foreign and price; I converted mine to $1,000 units, but it doesn't matter except for the scale) on the latent variable quality? I'm not sure, so the question is still open. The manual for sem postestimation says that margins are only available with respect to the observed endogenous variables. I don't exactly know my SEM terminology, but observed variables contrast with latent variables ... so you can't get those predictions from margins? Given that this is a simple linear regression so far, the marginal effects should correspond to the regression coefficients. I don't exactly know how you interpret those coefficients in terms of scale. In IRT, we would normally constrain the variance of the latent variable to 1. In theory, if you predicted the latent variable and then used those predictions in a separate regression model, I think that any coefficients should be interpretable as Z-scores, e.g. a $1k increase in price produces a beta-standard-deviation change in Quality. However, in a MIMIC model, if you try to constrain var(L@1), Stata tells you that

        invalid specification of variance of 'L';
        'L' is a latent dependent variable
        You can constrain the error of L to 1. I'm not exactly sure why this works. When I've tested a generalized (IRT) MIMIC model in Stata with that constraint against results from a similar model in an R package, I believe I got the same results.
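        For example, a minimal sketch of that constraint on the model above (the var(e.L@1) option is the only change; I have not re-verified the results here):

        Code:
        sem (L -> displ gear turn trunk) (L <- price foreign), var(e.L@1)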

        In theory, predicting the latent variable and then using the prediction in a regression ignores the uncertainty in the latent variable. Mead already covered those objections above. I don't know how much weight to give that objection. That is, I don't know whether it's like a dirty instrumental variable (biases the results, possibly fatally), or more like the way linear probability models are usually close enough to logistic regression (although we've had logistic regression for decades now, and it's not computationally difficult, so there's usually no reason not to use it).

        Comment


        • #5
          I hope I'm not butchering this explanation. In multiple imputation by chained equations, we estimate the mean and variance of each variable conditional on whatever other variables we put into the imputation model. We then make several random draws, fit our regression model to each completed data set, and pool the results via Rubin's rules.

          When you fit an SEM, be it a generalized or a traditional one, you can get each observation's predicted mean of the latent variable and the standard error of that prediction. I think you can tell where I'm going with this. In IRT, this is called drawing plausible values of the latent trait.

          Basically, if you want to fit a regression model to the latent variable but also account for the uncertainty, and you want a margins plot badly enough that you're willing to go through the trouble of randomly generating some number of values and then manually declaring them as multiply imputed data ... then this might be one option. Basically, fit a measurement model, then predict the latent variable and its SE, then make your draws, then do something like mi import wide. That seems like one solution, albeit a tedious one.
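          If it helps, here is a minimal sketch of that idea. It assumes the measurement model has already produced a predicted latent mean (Lhat) and its standard error (Lse); those variable names, the number of draws, and the final regression are illustrative only.

          Code:
          * Plausible-values sketch: Lhat and Lse are assumed to hold the predicted
          * mean and standard error of the latent variable from the measurement model
          set seed 12345
          gen double L = .                            // the latent variable itself is never observed
          forvalues m = 1/5 {
              gen double L`m' = Lhat + Lse*rnormal()  // one plausible value per imputation
          }
          mi import wide, imputed(L = L1 L2 L3 L4 L5) drop
          mi estimate: regress price L foreign        // pooled across the draws via Rubin's rules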

          Comment


          • #6
            Thanks to Weiwen Ng and to Richard Williams (personal communication) for suggesting a bootstrapping approach to my problem. I'm thinking about that and hope to have some kind of useful response to the suggestion.

            In the meantime, because a simple example of a MIMIC model using Stata's -auto.dta- might be useful to others, this post hopefully clarifies my posts #1 and #2 and provides an even simpler example of a MIMIC model using the same data.

            My syntax for the -sem- command in posts #1 and #2 is consistent with my intention. The idea behind my example model is that both “price” and “mpg” reflect the unobservable cost of vehicle ownership over the vehicle’s lifespan. Thus, in the MIMIC/SEM context, “price” and “mpg” are both observable, endogenous indicators (OEn) of a latent variable L which represents “vehicle lifetime cost”. I am assuming that the variables “displacement”, “gear”, “turn” and “trunk” are the exogenous variables which are the “causes” of L. In posts #1 and #2 I also assume that “foreign” is exogenous. In Stata’s -sem- parlance, these are all OEx variables. By “exogenous”, I mean “independent of the error terms”.

            Thanks to Weiwen Ng's post #4 above, I now notice what I believe is an anomaly in the way Stata labels -sem- output. The following -sem- syntax, which corrects the directions of the arrows relative to Weiwen's post #4, produces output that clearly labels the “cause” equation as “Structural” and the “indicator” equations as “Measurement”, as Weiwen points out. (Note that -sem- does not successfully estimate the MIMIC models in these posts unless the "price" variable is scaled to thousands. In a post I can't find now, Weiwen points out the vulnerability of -sem- estimation to widely varying scales of the variables, a feature common to many ML estimators.)

            Code:
            sysuse auto, clear
            replace price = price/1000
            lab var price "Price in $1,000's"
            Results:
            Code:
            . sem  (L <- displ gear turn trunk ) (price mpg <- L)
             
            Endogenous variables
              Measurement: price mpg
              Latent:      L
             
            Exogenous variables
              Observed: displacement gear_ratio turn trunk
             
            Fitting target model:
            Iteration 0:   log likelihood = -3056.1025  (not concave)
            <snip>
            Iteration 15:  log likelihood =  -1190.846
            Iteration 16:  log likelihood =  -1190.846
             
            Structural equation model                                   Number of obs = 74
            Estimation method: ml
             
            Log likelihood = -1190.846
             
             ( 1)  [price]L = 1
            --------------------------------------------------------------------------------
                           |                 OIM
                           | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
            ---------------+----------------------------------------------------------------
            Structural     |
              L            |
              displacement |   .0063288   .0040196     1.57   0.115    -.0015494     .014207
                gear_ratio |  -.0733574   .5003218    -0.15   0.883     -1.05397    .9072554
                      turn |   .1230058   .0505828     2.43   0.015     .0238653    .2221463
                     trunk |   .0626664    .039816     1.57   0.116    -.0153717    .1407044
            ---------------+----------------------------------------------------------------
            Measurement    |
              price        |
                         L |          1  (constrained)
                     _cons |  -.6013418   2.946496    -0.20   0.838    -6.376368    5.173684
              -------------+----------------------------------------------------------------
              mpg          |
                         L |  -3.480387   .8865583    -3.93   0.000    -5.218009   -1.742765
                     _cons |   44.84768   8.562884     5.24   0.000     28.06474    61.63062
            ---------------+----------------------------------------------------------------
               var(e.price)|   6.315486   1.114906                      4.468273    8.926348
                 var(e.mpg)|   5.565638   5.005183                      .9550556    32.43405
                   var(e.L)|   .6662495   .3762295                      .2202751    2.015155
            --------------------------------------------------------------------------------
            LR test of model vs. saturated: chi2(3) = 11.80             Prob > chi2 = 0.0081
            However, when I include the OEx variable “foreign” among the determinants of the two indicator variables, as I did in posts #1 and #2 above, Stata suppresses the label “Measurement”, as follows:
            Code:
            . sem  (L <- displ gear turn trunk ) (price mpg <- foreign L)
             
            Endogenous variables
              Observed: price mpg
              Latent:   L
             
            Exogenous variables
              Observed: displacement gear_ratio turn trunk foreign
             
            Fitting target model:
            Iteration 0:   log likelihood = -1300.1204  (not concave)
            <snip>
            Iteration 13:  log likelihood = -1196.7388
            Iteration 14:  log likelihood = -1196.7388
             
            Structural equation model                                   Number of obs = 74
            Estimation method: ml
             
            Log likelihood = -1196.7388
             
             ( 1)  [price]L = 1
            --------------------------------------------------------------------------------
                           |                 OIM
                           | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
            ---------------+----------------------------------------------------------------
            Structural     |
              price        |
                         L |          1  (constrained)
                   foreign |   4.042227   .8126842     4.97   0.000     2.449395    5.635059
                     _cons |  -3.554518   3.569106    -1.00   0.319    -10.54984    3.440801
              -------------+----------------------------------------------------------------
              mpg          |
                         L |    -2.0604   .3596857    -5.73   0.000    -2.765371   -1.355429
                   foreign |  -2.739423   1.369886    -2.00   0.046    -5.424351   -.0544954
                     _cons |   39.66227   7.471028     5.31   0.000     25.01933    54.30522
              -------------+----------------------------------------------------------------
              L            |
              displacement |   .0137306   .0049636     2.77   0.006     .0040021    .0234591
                gear_ratio |  -.9408139   .7621917    -1.23   0.217    -2.434682    .5530545
                      turn |   .1976456   .0710133     2.78   0.005     .0584622     .336829
                     trunk |   .0588124   .0533685     1.10   0.270    -.0457879    .1634126
            ---------------+----------------------------------------------------------------
               var(e.price)|   4.576513    .907895                      3.102217    6.751453
                 var(e.mpg)|   10.99188    2.81405                      6.655097    18.15472
                   var(e.L)|   .5206339   .4753274                      .0869769    3.116456
            --------------------------------------------------------------------------------
            I think it's appropriate to describe the two equations specified by “(price mpg <- foreign L)” as “measurement” equations, despite their inclusion of the OEx variable “foreign” among their right-hand-side variables. I’m not sure whether -sem-‘s omission of the label “Measurement” in -sem-‘s output is intentional or a small formatting bug.

            Since my two posts #1 and #2 above, I have realized that Stata’s -auto.dta- data contains a third indicator of “lifetime vehicle cost”: the variable -rep78-, an ordinal variable capturing “Repair record 1978”. Conceptually, “lifetime vehicle cost” should be positively correlated with the variable “price” and negatively correlated with the variables “mpg” and “rep78”. Also, on reflection, why not include the OEx “foreign” in the “cause” or “structural” equation that determines L, along with the other four OEx variables?

            If we blithely ignore -rep78-‘s ordinal nature and pretend it is cardinal, we can estimate a MIMIC model of lifetime vehicle cost with the syntax:
            Code:
            sysuse auto2, clear
            replace price = price/1000
            lab var price "Price in $1,000's"
            sem  (displ gear turn trunk foreign -> L) (L -> price mpg rep78)
            Or, if one prefers the econometrics convention of placing dependent variables on the left-hand-side, the -sem- syntax could be written:
            Code:
            sem  (L <- displ gear turn trunk foreign) (price mpg rep78 <- L)
            Either way the result is:
            Code:
            . sem  (L <- displ gear turn trunk foreign) (price mpg rep78 <- L)
            note: The following observed variable name will be treated as a latent variable: L.  If this is not your intention use
                  the nocapslatent option, or identify the latent variable names in the latent() option.
            (5 observations with missing values excluded)
             
            Endogenous variables
              Measurement: price mpg rep78
              Latent:      L
             
            Exogenous variables
              Observed: displacement gear_ratio turn trunk foreign
             
            Fitting target model:
            Iteration 0:   log likelihood = -2066.3226  (not concave)
            Iteration 1:   log likelihood = -1271.1914  (not concave)
            <snip>
            Iteration 15:  log likelihood = -1204.5286  (not concave)
            Iteration 16:  log likelihood = -1204.2362
            Iteration 17:  log likelihood = -1203.7643
            Iteration 18:  log likelihood = -1203.7532
            Iteration 19:  log likelihood = -1203.7531
             
            Structural equation model                                   Number of obs = 69
            Estimation method: ml
             
            Log likelihood = -1203.7531
             
             ( 1)  [price]L = 1
            --------------------------------------------------------------------------------
                           |                 OIM
                           | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
            ---------------+----------------------------------------------------------------
            Structural     |
              L            |
              displacement |   .0092242   .0045904     2.01   0.044     .0002273    .0182212
                gear_ratio |  -.8224961    .603289    -1.36   0.173    -2.004921    .3599286
                      turn |   .1643932   .0591321     2.78   0.005     .0484964      .28029
                     trunk |   .0287036   .0405003     0.71   0.478    -.0506755    .1080828
                   foreign |   1.086969   .5914699     1.84   0.066    -.0722912    2.246228
            ---------------+----------------------------------------------------------------
            Measurement    |
              price        |
                         L |          1  (constrained)
                     _cons |  -.4864191   2.847152    -0.17   0.864    -6.066734    5.093896
              -------------+----------------------------------------------------------------
              mpg          |
                         L |  -2.921838   .6679504    -4.37   0.000    -4.230997   -1.612679
                     _cons |   40.66882   7.876091     5.16   0.000     25.23197    56.10568
              -------------+----------------------------------------------------------------
              rep78        |
                         L |  -.2311959   .0940379    -2.46   0.014    -.4155068    -.046885
                     _cons |   4.939195   .8118177     6.08   0.000     3.348062    6.530329
            ---------------+----------------------------------------------------------------
               var(e.price)|   5.691022   1.062862                      3.946561    8.206572
                 var(e.mpg)|   11.13628   4.038859                       5.47054    22.66994
               var(e.rep78)|   .8231382   .1479877                      .5786817    1.170862
                   var(e.L)|   .1509371   .3507742                      .0015872    14.35384
            --------------------------------------------------------------------------------
            LR test of model vs. saturated: chi2(10) = 52.53            Prob > chi2 = 0.0000
            In the real world the variables I have designated OEx seem likely to be jointly determined with the car’s “price”, “mpg” and “rep78” as part of a firm’s profit-maximizing decisions in an imperfectly competitive market. However, for the purposes of this demo model, let’s assume that these OEx are instead exogenous.

            If one can live with the assumption that the OEx variables are exogenous, this little demo MIMIC model “works” quite well. It converges reliably and quickly. The latent variable, L, has the anticipated relationship with the indicators, with statistically significant negative coefficients for both -mpg- and -rep78-. With the latent variable “anchored” by the observed price, units of the latent variable can be interpreted as thousands of dollars. The estimated coefficients of L in the indicator equations suggest that either a decline in -mpg- of 2.92 miles per gallon or a decline in the repair record of 0.23 index points is valued by the consumer about the same as a $1,000 increase in the purchase price. A decline of a full point in the repair record is thus valued by consumers at roughly $4,348 = 1000*(1/0.23).
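            For readers who want a delta-method standard error around that dollar figure, here is a minimal sketch run right after the -sem- above. It assumes the loading is stored as _b[rep78:L] in e(b); check the column names with -matrix list e(b)- first.
            Code:
            * Dollar value of a one-point improvement in rep78, with a delta-method SE
            nlcom (dollars_per_rep78_point: 1000*(1/(-_b[rep78:L])))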

            Assuming the average car was driven about 10,000 miles per year in 1978 and that gasoline cost 63 cents per gallon, the annual cost of reducing a car’s mileage by 2.92 miles per gallon would be: 10,000 * (1/2.92) * 0.63 = $2,157. So the implication of these estimates is that those buying cars in the US in 1978 did not fully appreciate the cost they would incur over time by buying a car with lower gas mileage.

            This MIMIC model is related to the theory of hedonic pricing as in these references:

            Rosen, S. (1974). "Hedonic prices and implicit markets: product differentiation in pure competition". Journal of Political Economy, 82(1): 34–55.

            Reis, Hugo J.; Silva, J. M. C. Santos (2006). "Hedonic Price Indexes for New Passenger Cars in Portugal (1997–2003)"

            The same model can be estimated with -gsem- if one uses the from(sem_eb, skip) option, where the matrix sem_eb holds the coefficient vector e(b) from the -sem- model at convergence. The coefficients and standard errors from the -gsem- model exactly match those from the -sem- model, as expected.
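            For concreteness, that cross-check amounts to something like the following sketch (the model statement is the same as above; only the from() option is new):
            Code:
            sem  (L <- displ gear turn trunk foreign) (price mpg rep78 <- L)
            matrix sem_eb = e(b)        // coefficient vector at the -sem- solution
            gsem (L <- displ gear turn trunk foreign) (price mpg rep78 <- L), from(sem_eb, skip)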

            Comment


            • #7
              Hello, Statalist. I want to impute a subcategory of a variable. How do I do that? If I create a dummy variable, it still gives me imputed values for the other categories.

              Comment


              • #8
                Originally posted by KAYDEN MARLI View Post
                Hello, Statalist. I want to impute a subcategory of a variable. How do I do that? If I create a dummy variable, it still gives me imputed values for the other categories.
                Please ask this as a new question and provide more information. If you bump a question that's already been answered, it's possible that people who could help you are missing your question. For instance, people who know structural equation modeling would be drawn to this sort of question, and SEM and multiple imputation probably don't overlap much so the MI specialists could be skipping over this.

                Comment


                • #9
                  input byte p04sex double age byte(p11etnic p15aho) str1 p15cdipl byte(p18wstat p23bstaa)
                  2 3 3 . " " . 1
                  2 20 5 9 "1" 4 1
                  1 37 4 7 "1" 2 2
                  1 24 4 6 "1" 4 1
                  2 35 4 6 "2" . 2
                  1 44 3 6 "1" . 1
                  2 32 5 11 "2" 4 1
                  2 42 3 . " " 4 1
                  2 2 4 . " " . 1
                  1 57 3 8 "1" 4 2
                  2 57 3 8 "1" 4 3
                  2 18 5 6 "1" . 1
                  2 27 8 11 "2" . 1
                  2 4 8 . " " . 1
                  1 37 8 7 "2" . 1
                  2 7 5 4 "2" . 1
                  2 53 5 6 "1" . 2
                  1 21 8 5 "2" 4 1
                  1 22 8 5 "2" 4 1
                  1 16 8 6 "2" 4 1
                  2 31 4 6 "2" . 1
                  2 9 4 4 "2" . 1
                  2 57 3 8 "1" 4 2
                  2 6 8 2 "2" . 1
                  2 51 8 10 "1" 4 2
                  2 12 5 . " " . 1
                  1 5 5 2 "2" . 1
                  2 6 5 1 " " . 1
                  2 50 8 6 "1" . 2
                  1 53 6 9 "7" 1 2
                  1 23 4 10 "2" 4 1
                  1 2 3 . " " . 1
                  2 58 3 6 "1" 4 2
                  1 22 8 9 "1" 4 1
                  1 59 5 7 "1" 4 2
                  2 9 4 . " " . 1
                  2 21 5 7 "1" 4 1
                  2 35 5 8 "1" 4 2
                  1 6 3 2 "2" . 1
                  2 22 6 9 "2" 3 1
                  2 12 6 4 "1" . 1
                  1 46 6 6 "2" 4 2
                  2 17 6 4 "2" . 1
                  2 18 4 11 "2" . 1
                  2 6 4 4 "2" . 1
                  2 50 4 4 "2" . 2
                  1 27 3 4 "2" 2 1
                  1 45 8 . " " 2 2
                  2 43 4 6 "1" 4 2
                  2 18 8 6 "2" . 1
                  1 17 8 6 "2" . 1
                  2 39 8 10 "1" 4 1
                  1 59 6 . " " 2 2
                  2 37 8 6 "2" 4 2
                  2 6 8 4 "2" . 1
                  1 43 8 11 "2" 4 2
                  2 17 5 7 "2" . 1
                  1 48 5 . " " 4 2
                  1 42 3 6 "2" 2 1
                  2 38 8 10 "2" 4 1
                  1 50 4 6 "1" 4 2
                  2 29 4 6 "2" 4 3
                  1 46 8 6 "2" . 2
                  2 3 8 . " " . 1
                  1 5 3 2 "9" . 1
                  2 13 5 6 "2" . 1
                  1 8 4 4 "2" . 1
                  1 34 4 6 "1" 2 2
                  2 22 5 8 "2" . 1
                  1 41 3 4 "2" 4 2
                  2 13 3 4 "2" . 1
                  2 41 6 . " " . 2
                  1 33 5 6 "2" 4 2
                  1 4 5 . " " . 1
                  1 32 5 6 "1" 4 2
                  2 10 5 4 "2" . 1
                  2 22 5 8 "1" . 1
                  1 14 8 4 "1" . 1
                  2 24 4 11 "2" . 2
                  1 27 5 9 "2" 2 2
                  2 28 5 11 "2" 4 2
                  2 25 5 7 "1" 4 2
                  1 38 5 11 "2" 4 2
                  2 45 4 7 "1" . 2
                  1 19 4 9 "2" . 1
                  1 14 5 6 "7" . 1
                  2 40 3 6 "2" . 9
                  2 10 3 4 "2" . 1
                  1 47 6 6 "1" 2 2
                  1 52 5 11 "2" 4 2
                  1 28 5 6 "2" 4 1
                  2 21 5 8 "1" . 1
                  2 19 5 6 "1" . 1
                  2 13 5 6 "2" . 1
                  2 23 5 9 "1" . 1
                  2 12 5 4 "2" . 1
                  1 13 5 6 "2" . 1
                  1 28 8 6 "1" 4 1
                  2 25 8 6 "2" 4 1
                  2 6 8 2 " " . 1



                  . describe p04sex age p11etnic p15aho p15cdipl p18wstat p23bstaa

                                  storage   display    value
                  variable name     type    format     label      variable label
                  -------------------------------------------------------------------------------------------
                  p04sex            byte    %10.0g                Sex
                  age               double  %12.0g                Age
                  p11etnic          byte    %10.0g                Ethnicity
                  p15aho            byte    %10.0g                Type of highest formal education
                  p15cdipl          str1    %1s                   Certficate attained?
                  p18wstat          byte    %10.0g                Status in employment
                  p23bstaa          byte    %10.0g                Marital status

                  Comment
