
  • Structural equation modeling: Whether and how to estimate margins w.r.t. a latent variable

    I’d like to get input from the Stata community on the advisability and potential utility of estimating the margins of Observable Endogenous (OEn) variables with respect to a Latent Endogenous (LEn) variable. The specific questions I am asking are at the bottom of this post. Any comments, positive or negative, will be greatly appreciated.

    Post-estimation utilities for -sem- and -gsem- facilitate estimating margins of the OEn w.r.t. any Observable Exogenous (OEx) variable. Searching Statalist for discussions of estimating margins after -sem- or -gsem-, I found a 2014 thread initiated by Jan Hultgren; Jeff Pitblado (StataCorp)'s responses there are particularly helpful. I've also searched for this question in various textbooks about SEM, including Anders Skrondal and Sophia Rabe-Hesketh's Generalized Latent Variable Modeling (2004). But so far I have not found a discussion of whether to estimate a margin w.r.t. an LEn or, if that were desired, how best to approach the task.

    Here is a toy example to motivate the issue and, hopefully, to show why it might be pertinent.


    I. Preliminaries
    Consider an extremely simple SEM using Stata’s demo data, auto.dta.
    Code:
    sysuse auto, clear
    replace price = price/1000
    lab var price "Price in $1,000's"
    sem  (price <- foreign L)  (mpg <- foreign L)  (L <- displ gear turn trunk )
    The above model is an instance of a class of SEM models called MIMIC models. See examples 10 and 36g in Stata’s [SEM] PDF documentation. Here are the estimation results:
    Code:
    sem (price <- foreign L) (mpg <- foreign L) (L <- displ gear turn trunk )
     
    Endogenous variables
      Observed: price mpg
      Latent:   L
     
    Exogenous variables
      Observed: foreign displacement gear_ratio turn trunk
     
    Fitting target model:
    Iteration 0:   log likelihood = -1300.1204  (not concave)
    <snip>
    Iteration 13:  log likelihood = -1196.7388 
    Iteration 14:  log likelihood = -1196.7388 
     
    Structural equation model                                   Number of obs = 74
    Estimation method: ml
     
    Log likelihood = -1196.7388
     
     ( 1)  [price]L = 1
    --------------------------------------------------------------------------------
                   |                 OIM
                   | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
    ---------------+----------------------------------------------------------------
    Structural     |
      price        |
                 L |          1  (constrained)
           foreign |   4.042227   .8126842     4.97   0.000     2.449395    5.635059
             _cons |  -3.554518   3.569106    -1.00   0.319    -10.54984    3.440801
      -------------+----------------------------------------------------------------
      mpg          |
                 L |    -2.0604   .3596857    -5.73   0.000    -2.765371   -1.355429
           foreign |  -2.739423   1.369886    -2.00   0.046    -5.424351   -.0544954
             _cons |   39.66227   7.471028     5.31   0.000     25.01933    54.30522
      -------------+----------------------------------------------------------------
      L            |
      displacement |   .0137306   .0049636     2.77   0.006     .0040021    .0234591
        gear_ratio |  -.9408139   .7621917    -1.23   0.217    -2.434682    .5530545
              turn |   .1976456   .0710133     2.78   0.005     .0584622     .336829
             trunk |   .0588124   .0533685     1.10   0.270    -.0457879    .1634126
    ---------------+----------------------------------------------------------------
       var(e.price)|   4.576513    .907895                      3.102217    6.751453
         var(e.mpg)|   10.99188    2.81405                      6.655097    18.15472
           var(e.L)|   .5206339   .4753274                      .0869769    3.116456
    --------------------------------------------------------------------------------
    LR test of model vs. saturated: chi2(3) = 9.17              Prob > chi2 = 0.0271
    est store sem_orig
    After this estimation, we can estimate and plot the margins of both OEn variables w.r.t. one of the OEx variables, -displacement-. (I use Roger Newson’s utility -regaxis-, available from SSC.)
    Code:
    which regaxis
    regaxis displ , lticks(atnumlist) maxticks(7)
    margins , at(displ = (`atnumlist'))
    marginsplot,   ///
      title(Predictive margins and 95% CIs from -sem-)  ///
      subtitle(Conventional margins of OEn variables w.r.t. a single OEx variable)  ///
      note(At mean values of other OEx variables, span)  ///
      ytitle(Predicted values of Price and MPG)  ///
      legend(order(3 "Predicted Price" "in $1,000's" 4 "Predicted Miles-" "Per-Gallon"))  ///
    name(OEx_impact_on_OEn, replace)
    This code produces the following margin plot.
    [Graph: OEx_impact_on_OEn.png — predictive margins of price and mpg w.r.t. displacement, with 95% CIs]



    II. Narrative/Motivation/"Theory"
    In the above model, can the latent variable, L, be interpreted as vehicle "quality"? (Identification problems abound. Please suspend your disbelief here.)

    If the latent variable L represents something real, but not directly observable, like "Vehicle Quality", then it is of interest to produce a graph like the one above with respect to "Quality", an LEn variable. Such a graph would support a narrative that the cause part of the MIMIC model estimates a sort of production function for "Quality", while the indicator part of the MIMIC model can be thought of as a hedonic index of quality. Again suspending our disbelief, imagine that -price- reflects willingness-to-pay, the higher the better, while -mpg- reflects the social value of the car, the lower the better.

    III. Stata Problem
    To my knowledge, neither -sem-'s nor -gsem-'s postestimation commands enable predicting the values of the OEn indicator variables over a range of values of an LEn variable like the L in this model.

    As Jeff Pitblado (StataCorp) emphasizes in his Statalist posts on margins with latent variables (ibid), the unobserved LEn are inherently stochastic, being the result of adding a stochastic error term to a function of observed variables, some of which might also be endogenous and therefore stochastic. Thus the problem is to estimate margins w.r.t. an unobserved random variable. It might seem that a plausible approach would be to derive variation in the predicted value of L from variation in the observed values of the OEx on which it depends. However, although such an approach might give the correct margins, it does not allow an unambiguous estimate of the standard errors of the margins.

    To see this ambiguity, note that the observed values of the four OEx variables in the demo data are quite different between the Buick Century and the Buick Opel. Here I use my own -mluwild- which can be downloaded from inside Stata.
    Code:
    net install mluwild, from(http://digital.cgdev.org/doc/stata/MO/Misc)
    which mluwild
    *Extract the coefficients of these four OEx variables from the -sem- results
    est restore sem_orig
    mluwild e(b)["y1","L:*"]
    mat def Lcoefs = r(submat)
    matlist Lcoefs
     
    *Use the matrix -Lcoefs- to score the four OEx variables
    *for the Buick Century and the Buick Opel, creating predicted values of L.
    *(This -matrix score- command works by matching the column names of the 
    *matrix of coefficients to the names of variables in the data.
    *It is a cool feature of Stata.)
    matrix score L_hat = Lcoefs if make =="Buick Century" | make=="Buick Opel"
    Now note that, when we calculate the margins of the OEn variables -price- and -mpg- for these two different observations:
    • The two estimates of the predicted -price- are almost the same at $6,366, but the standard error of the prediction is more than twice as large for the Opel.
    • The two estimates of the predicted -mpg- are almost the same at 20.9, but the standard error of the prediction is about three times larger for the Opel.

    Code:
    . margins , at(displ=196 gear == 2.93  turn==40  trunk==16) at(displ=304 gear == 2.87  turn==34  trunk==10)
     
    Predictive margins                                          Number of obs = 74
    Model VCE: OIM
     
    1._predict: Linear prediction (Price in $1,000's), predict(xb(price))
    2._predict: Linear prediction (Mileage (mpg)), predict(xb(mpg))
     
    1._at: displacement =  196
           gear_ratio   = 2.93
           turn         =   40
           trunk        =   16
    2._at: displacement =  304
           gear_ratio   = 2.87
           turn         =   34
           trunk        =   10
     
    ------------------------------------------------------------------------------
                 |            Delta-method
                 |     Margin   std. err.      z    P>|z|     [95% conf. interval]
    -------------+----------------------------------------------------------------
    _predict#_at |
            1 1  |   6.365587   .2626448    24.24   0.000     5.850813    6.880362
            1 2  |   6.366048   .6387133     9.97   0.000     5.114193    7.617903
            2 1  |   20.88454   .4258567    49.04   0.000     20.04987     21.7192
            2 2  |   20.88359   1.251274    16.69   0.000     18.43113    23.33604
    ------------------------------------------------------------------------------
    Since the standard errors of the predictions are necessary for inferences about the statistical significance of the impacts of changes in the latent variable, it seems that in order to proceed we must assume that the latent variable is fixed-in-repeated-samples, not random. Thus our estimates of the margins w.r.t. the latent variable will be conditional on this assumption.


    IV. Proposed solution
    The work-around I propose is as follows (a condensed code sketch appears after the list):
    1. Estimate the predicted value of L for each observation in the sample, based only on the causal component of the -sem- model, using the command -predict LfromOEx, xblatent-.
    2. Assume that LfromOEx is "fixed-in-repeated-samples", as if it were an Observed Exogenous variable.
    3. Fit the OEn indicators to this estimate of L using one of Stata's multivariate regression commands, such as -reg3- or -sureg-. To replicate -sem-'s anchoring of each LEn to one of the OEn, constrain the coefficient of L to equal 1.0 in one of the indicator equations.
    4. Optionally: replace the e(b) and e(V) matrices in the -reg3- or -sureg- ereturn space with those from the -sem-, which to Stata then look as if they came from -reg3-, creating a "Frankenstein" set of estimates.
    5. Apply -margins- and -marginsplot- to the resulting -reg3- or -sureg- estimates to obtain the impact on the OEn (here -price- and -mpg-) of changes in the LEn (here "Quality").
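    In code, steps (1)-(3) amount to the following condensed sketch, run immediately after the -sem- and -est store- commands above (the attached do-file has the complete version):
    Code:
    *Step (1): predict L from the causal (OEx) part of the model only
    predict LfromOEx, xblatent
    *Step (2): treat the prediction as if it were an observed exogenous variable
    clonevar L = LfromOEx
    *Step (3): fit the indicators to L, anchoring L at 1 in the price equation as -sem- did
    constraint 1 _b[price:L] == 1
    reg3 (price L foreign)  (mpg L foreign), constraint(1)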
    The attached DO file implements this proposed solution both with and without step (4). Here is the -marginsplot- result of executing steps (1), (2), (3) and (5), without using the Frankenstein estimates.

[Graph: eb_eV_from_reg3.png — margins of price and mpg w.r.t. L, standard errors from -reg3- without modification]
    And here is the proposed solution using all five steps and thus predicting the margins based on the less precise Frankenstein estimates.
[Graph: eb_eV_from_sem.png — margins of price and mpg w.r.t. L, standard errors transplanted from -sem-]
    As you see, the two sets of estimates produce identical estimated margins for -price- and -mpg- at each estimated value of L, but very different standard errors and confidence intervals.


    V. Questions [Should this be a "poll" :-) ]:
    (1) Is the objective of estimating the relationship between the hedonic "quality" of these cars and the indicator variables price and mpg understandable? Suspending our disbelief regarding identification, does this objective make sense?
    (2) Is there a way to compute margins with respect to a latent variable other than by estimating/predicting the latent variable and then assuming it is fixed in repeated samples?
    (3) Is it meaningful to present margins of the OEn which are conditional on the assumption that L is fixed in repeated samples at the values estimated from the causal part of the model?
    (4) After -gsem-, Stata offers only the Empirical Bayes method of estimating the values of the latent variable. Since the Empirical Bayes method uses the OEn variables as well as the OEx variables, the assumption that estimates of the LEn based on the Empirical Bayes method are fixed in repeated samples seems to be a heavier lift. Do folks agree that this distinction argues for using -sem- rather than -gsem-?
    (Or to implement this "solution" after -gsem-, one could simply use -matrix score- to score the OEx variables with their -gsem- estimated coefficients. This should produce results with properties similar to those produced by -predict ..., xblatent- after -sem-.)
    Based on the above and on intuition (and on my experience with large data sets), the estimated coefficients of L from -reg3- without modification are almost the same as the estimated coefficients from the Frankenstein model, but the standard errors are much larger in the Frankenstein model. (Compare graphs named eb_eV_from_reg3 and eb_eV_from_sem.)

    (5) Conditional on assuming the estimated values of L are fixed, as I believe we must do to compute the margins of the OEn w.r.t. L, are we justified in making statistical inferences using the much smaller standard errors from -reg3- without modification?
    (6) Since replacing the -reg3- results with the e(b) and e(V) values from the -sem- estimates yields larger (and thus more conservative) standard errors, are we on firmer ground to use these Frankenstein results for inference, justifying them as restoring part of the stochastic characteristics inherent in the original -sem- model?

    Again, any comments would be appreciated.
    Attached Files

  • #2
    Whoops! Please ignore the sentence reading:

    Again suspending our disbelief, imagine that -price- reflects willingness-to-pay, the higher the better, while -mpg- reflects the social value of the car, the lower the better.
    Replace it with:

    Again suspending our disbelief, imagine that cars viewed as of higher "quality" have higher prices and get fewer miles to the gallon.

    Comment


    • #3
      OK, I'm trying again to see if anyone is interested in this question.

      Here's the code, without the digressions. The code uses the following user-contributed programs:

      From SSC:
      regaxis
      erepost
      From http://digital.cgdev.org/doc/stata/MO/Misc
      mluwild
      grc1leg2 (only used to assemble the three created -marginsplot-s into a single graph)


      Code:
      *    Estimate a MIMIC model using -sem-
      sysuse auto, clear
      replace price = price/1000
      lab var price "Price in $1,000's"
      sem  (price <- foreign L)  (mpg <- foreign L)  (L <- displ gear turn trunk )
      est store sem_orig    //  store for the comparisons with the -reg3- results below
      
      *    Predict values of the latent variable based only on the Observed Exogenous variables
      predict LfromOEx, xblatent
      clonevar L = LfromOEx
          lab var L "Predicted values of L from the Observed Exogenous variables"
      
      *    Use -reg3- to fit OLS regressions of each indicator on the predicted latent variable
      constraint 1 _b[price:L] == 1
      reg3 (price L foreign )  (mpg  L foreign ) , constraint(1)
          est store LfromOEx, title(Unmodified -reg3- estimates of price and mpg on LfromOEx)
      
      *    Compare the two sets of estimates.  The coefficients match (by construction), 
      *    but the standard errors are much smaller in -reg3- estimates,
      *    because -reg3- ignores the stochastic origin of fitted L. 
      est tab sem_orig ., se keep(price: mpg: ) stat(N ll r2_1 r2_2)
      
      *    Using -reg3- estimates, we compute the margins of the indicators 
      *    over the range of the predicted values of the latent variable.
      regaxis L , lticks(atnumlist) maxticks(7)
      margins , at(L = (`atnumlist'))
      
      *    The marginsplot shows impact on the indicators of variation of L 
      *    and shows the narrow confidence intervals for those margins 
      *    that are a consequence of suppressing the stochastic nature of L.
      marginsplot,   ///
          title(Predictive margins and 95% CIs from -sem-)  ///
          subtitle("Margins of OEn variables w.r.t. L:"  ///
              "Standard errors are from -reg3- without modification")  ///
          note("Assumes L is fixed at estimated values and other OEn variables" "(i.e. -foreign-) are fixed at their mean", span)  ///
          ytitle(Predicted values of Price and MPG)  ///
          legend(order(4 "Predicted Price" "in $1,000's" 3 "Predicted Miles-" "Per-Gallon"))  ///
          plot1opts(pstyle(p2)) plot2opts(pstyle(p1))  ///
          ci1opts(pstyle(p2)) ci2opts(pstyle(p1))  ///
          name(eb_eV_from_reg3, replace)
          
      *    A more conservative approach to estimating the standard errors of the 
      *    coefficients linking the indicators to L is to use the portion of the 
      *    -sem- vce that relates to them instead of the vce matrix from -reg3-.
      
      *    The vce from -reg3- is:
      
      matlist e(V)
      
          
      *    Create a "Frankenstein" set of estimates by , getting and assembling the relevant pieces of the e(V) matrix from -sem-  ...
      est restore sem_orig    
      
      which mluwild
      mluwild e(V)["price:*","price:*"]  
          mat def V_pxp = r(submat)
          matlist V_pxp
      
      mluwild e(V)["price:*","mpg:*"]  
          mat def V_pxm = r(submat)
          matlist V_pxm
          
      mluwild e(V)["mpg:*","price:*"]  
          mat def V_mxp = r(submat)
          matlist V_mxp
          
      mluwild e(V)["mpg:*","mpg:*"]  
          mat def V_mxm = r(submat)
          matlist V_mxm
          
      mat def eV_from_sem =  ( V_pxp , V_pxm \ V_mxp , V_mxm )
      
      matlist eV_from_sem
      
      
      *    ... and then stuffing them into the -reg3- estimates to make the Frankenstein estimates:
      est restore LfromOEx
      erepost V = eV_from_sem, rename
      
      est store reg3_with_sem_eV
      
      *    Compare the reg3 estimates and the Frankenstein estimate (reg3_with_sem_eV)
      *    to the original -sem- estimates of the indicator model
      est table sem_orig LfromOEx reg3_with_sem_eV, t keep(price: mpg: ) stat(N r2_1 r2_2)
      
      *    compute the margins and make the marginsplot graph using the Frankenstein estimates
      regaxis L , lticks(atnumlist) maxticks(7)
      margins , at(L = (`atnumlist'))
      marginsplot,   ///
          title(Predictive margins and 95% CIs from -sem-)  ///
          subtitle("Margins of OEn variables w.r.t. L:"  ///
              "Standard errors are transplanted into -reg3- from -sem- results")  ///
          note("Assumes L is fixed at estimated values and other OEn variables" "(i.e. -foreign-) are fixed at their mean", span)  ///
          ytitle(Predicted values of Price and MPG)  ///
          legend(order(4 "Predicted Price" "in $1,000's" 3 "Predicted Miles-" "Per-Gallon"))  ///
          plot1opts(pstyle(p2)) plot2opts(pstyle(p1))  ///
          ci1opts(pstyle(p2)) ci2opts(pstyle(p1))  ///
          name(frankenstein, replace)
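      *    Note: the graph OEx_impact_on_OEn below was created by the -marginsplot-
      *    code in post #1; run that code first (or drop it from the -grc1leg2- list).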
          
      grc1leg2 OEx_impact_on_OEn  eb_eV_from_reg3 frankenstein,  ///
          altshrink ycommon maintotoptitle ytol1title  ///
          holes(2) ring(0) pos(2) lcols(1) lxoffset(-15) lyoffset(-15)  ///
          labsize(small) symysize(small) name(compare_sem_margins, replace)
      This code produces these three -marginsplot-s. The upper left panel shows the conventional -marginsplot- of the indicator variables against one of the observable exogenous variables, displacement. The bottom two panels show the marginsplots constructed for the latent variable. The lines in the two bottom panels are identical, but the confidence intervals are extremely different.
[Graph: compare_sem_margins.png — the three -marginsplot-s combined by -grc1leg2-]

      So my question is whether this procedure for estimating the impact of a change in a latent variable on the indicator variables makes sense, and whether it makes more sense to use the version shown in the lower left panel, which has much smaller standard errors because it ignores the stochastic nature of the latent variable, or the version shown in the lower right panel, which has larger standard errors because they are transplanted from the -sem- estimation.

      Thanks for any reactions.

      Comment


      • #4
        Writing with a point of information. I'm not that familiar with traditional SEM models, as I mainly deal with IRT models. However, I do at least know that in many SEM models, we are building a measurement model: we think there's a latent variable (i.e. one we can't observe directly) whose value we can infer from some indicators. If you took displacement, gear ratio, turn circle, and trunk space as indicators of vehicle quality, the syntax should look more like:

        Code:
        sem (L -> displ gear turn trunk)
        Note the direction of the arrow. Internally, I interpret this as L (the latent variable) causing responses to the 4 specified indicators. In the original post, the direction of the arrow is reversed. Stata will interpret that as regressing L on those 4 variables (now treated as independent variables). It knows L is latent. In the original syntax, L (and the observed variable foreign) cause responses to price and mpg, so those are the two indicators of L.

        I haven't yet digested the full example. But if you meant to say that there's a latent variable L (and its indicators are displacement, gear ratio, turn circle, and trunk space), and I want to regress L on price and foreign, then I think the syntax is more like this:

        Code:
        sem (L -> displ gear turn trunk) (L <- price foreign)
        My current copy of Stata is on a remote server from which I can't easily extract output. However, if you run that code on the auto dataset, you'll see a couple of headers in the output table: Structural and Measurement. Measurement is the bit where Stata tells you how the indicators load on the latent variable. Structural is where you have the results of regressing the latent variable on the independent ones. You'll note that in SEM example 10, the output table also has Structural and Measurement headers, whereas the output table in post #1 only has a Structural header.

        Now, how do you get marginal effects of the independent variables (foreign and price; I converted mine to $1,000 units, but it doesn't matter except for the scale) on the latent variable quality? I'm not sure, so the question is still open. The manual for sem postestimation says that margins are only available with respect to the observed endogenous variables. I don't exactly know my SEM terminology, but observed variables contrast with latent variables ... so you can't get those predictions from margins? Given that this is a simple linear regression so far, the marginal effects should correspond to the regression coefficients. I don't exactly know how you interpret those coefficients in terms of scale. In IRT, we would normally constrain the variance of the latent variable to 1. In theory, if you predicted the latent variable and then used those predictions in a separate regression model, I think that any coefficients should be interpretable as Z-scores, e.g. a $1k increase in price produces a beta-standard-deviation change in Quality. However, in a MIMIC model, if you try to constrain var(L@1), Stata tells you that

        invalid specification of variance of 'L';
        'L' is a latent dependent variable
        You can constrain the error of L to 1. I'm not exactly sure why this works. When I've tested a generalized (IRT) MIMIC model in Stata with that constraint against results from a similar model in an R package, I believe I got the same results.
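        For example, a minimal sketch of that constraint on the model above (the var(e.L@1) option is the only change; I have not re-verified the results here):

        Code:
        sem (L -> displ gear turn trunk) (L <- price foreign), var(e.L@1)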

        In theory, predicting the latent variable and then using the prediction in a regression ignores the uncertainty in the latent variable. Mead already covered those objections above. I don't know how much weight to give that objection. That is, I don't know whether it's like a dirty instrumental variable (biases the results, possibly fatally), or more like the way linear probability models are usually close enough to logistic regression (although we've had logistic regression for decades now, and it's not computationally difficult, so there's usually no reason not to use it).

        Comment


        • #5
          I hope I'm not butchering this explanation. In multiple imputation by chained equations, we estimate the mean and variance of each variable conditional on whatever other variables we put into the imputation model. We then make several random draws, fit our regression model to each completed data set, and pool the results via Rubin's rules.

          When you fit an SEM, be it a generalized or a traditional one, you can get each observation's predicted mean of the latent variable and the standard error of that prediction. I think you can tell where I'm going with this. In IRT, this is called drawing plausible values of the latent trait.

          Basically, if you want to fit a regression model to the latent variable but also account for the uncertainty, and you want a margins plot badly enough that you're willing to go through the trouble of randomly generating some number of values and then manually declaring them as multiply imputed data ... then this might be one option. Basically, fit a measurement model, then predict the latent variable and its SE, then make your draws, then do something like mi import wide. That seems like one solution, albeit a tedious one.
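          If it helps, here is a minimal sketch of that idea. It assumes the measurement model has already produced a predicted latent mean (Lhat) and its standard error (Lse); those variable names, the number of draws, and the final regression are illustrative only.

          Code:
          * Plausible-values sketch: Lhat and Lse are assumed to hold the predicted
          * mean and standard error of the latent variable from the measurement model
          set seed 12345
          gen double L = .                            // the latent variable itself is never observed
          forvalues m = 1/5 {
              gen double L`m' = Lhat + Lse*rnormal()  // one plausible value per imputation
          }
          mi import wide, imputed(L = L1 L2 L3 L4 L5) drop
          mi estimate: regress price L foreign        // pooled across the draws via Rubin's rules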

          Comment


          • #6
            Thanks to Weiwen Ng and to Richard Williams (personal communication) for suggesting a bootstrapping approach to my problem. I'm thinking about that and hope to have some kind of useful response to the suggestion.

            In the meantime, because a simple example of a MIMIC model using Stata's -auto.dta- might be useful to others, this post hopefully clarifies my posts #1 and #2 and provides an even simpler example of a MIMIC model using the same data.

            My syntax for the -sem- command in posts #1 and #2 is consistent with my intention. The idea behind my example model is that both “price” and “mpg” reflect the unobservable cost of vehicle ownership over the vehicle’s lifespan. Thus, in the MIMIC/SEM context, “price” and “mpg” are both observable, endogenous indicators (OEn) of a latent variable L which represents “vehicle lifetime cost”. I am assuming that the variables “displacement”, “gear”, “turn” and “trunk” are the exogenous variables which are the “causes” of L. In posts #1 and #2 I also assume that “foreign” is exogenous. In Stata’s -sem- parlance, these are all OEx variables. By “exogenous”, I mean “independent of the error terms”.

            Thanks to Weiwen Ng's post #4 above, I now notice what I believe is an anomaly in the way Stata labels -sem- output. The following -sem- syntax, which corrects the directions of the arrows relative to Weiwen's post #4, produces output that clearly labels the “cause” equation as “Structural” and the “indicator” equations as “Measurement”, as Weiwen points out. (Note that -sem- does not successfully estimate the MIMIC models in these posts unless the "price" variable is scaled to thousands. In a post I can't find now, Weiwen points out the vulnerability of -sem- estimation to widely varying scales of the variables, a feature common to many ML estimators.)

            Code:
            sysuse auto, clear
            replace price = price/1000
            lab var price "Price in $1,000's"
            Results:
            Code:
            . sem  (L <- displ gear turn trunk ) (price mpg <- L)
             
            Endogenous variables
              Measurement: price mpg
              Latent:      L
             
            Exogenous variables
              Observed: displacement gear_ratio turn trunk
             
            Fitting target model:
            Iteration 0:   log likelihood = -3056.1025  (not concave)
            <snip>
            Iteration 15:  log likelihood =  -1190.846
            Iteration 16:  log likelihood =  -1190.846
             
            Structural equation model                                   Number of obs = 74
            Estimation method: ml
             
            Log likelihood = -1190.846
             
             ( 1)  [price]L = 1
            --------------------------------------------------------------------------------
                           |                 OIM
                           | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
            ---------------+----------------------------------------------------------------
            Structural     |
              L            |
              displacement |   .0063288   .0040196     1.57   0.115    -.0015494     .014207
                gear_ratio |  -.0733574   .5003218    -0.15   0.883     -1.05397    .9072554
                      turn |   .1230058   .0505828     2.43   0.015     .0238653    .2221463
                     trunk |   .0626664    .039816     1.57   0.116    -.0153717    .1407044
            ---------------+----------------------------------------------------------------
            Measurement    |
              price        |
                         L |          1  (constrained)
                     _cons |  -.6013418   2.946496    -0.20   0.838    -6.376368    5.173684
              -------------+----------------------------------------------------------------
              mpg          |
                         L |  -3.480387   .8865583    -3.93   0.000    -5.218009   -1.742765
                     _cons |   44.84768   8.562884     5.24   0.000     28.06474    61.63062
            ---------------+----------------------------------------------------------------
               var(e.price)|   6.315486   1.114906                      4.468273    8.926348
                 var(e.mpg)|   5.565638   5.005183                      .9550556    32.43405
                   var(e.L)|   .6662495   .3762295                      .2202751    2.015155
            --------------------------------------------------------------------------------
            LR test of model vs. saturated: chi2(3) = 11.80             Prob > chi2 = 0.0081
            However, when I include the OEx variable “foreign” among the determinants of the two indicator variables, as I did in posts #1 and #2 above, Stata suppresses the label “Measurement”, as follows:
            Code:
            . sem  (L <- displ gear turn trunk ) (price mpg <- foreign L)
             
            Endogenous variables
              Observed: price mpg
              Latent:   L
             
            Exogenous variables
              Observed: displacement gear_ratio turn trunk foreign
             
            Fitting target model:
            Iteration 0:   log likelihood = -1300.1204  (not concave)
            <snip>
            Iteration 13:  log likelihood = -1196.7388
            Iteration 14:  log likelihood = -1196.7388
             
            Structural equation model                                   Number of obs = 74
            Estimation method: ml
             
            Log likelihood = -1196.7388
             
             ( 1)  [price]L = 1
            --------------------------------------------------------------------------------
                           |                 OIM
                           | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
            ---------------+----------------------------------------------------------------
            Structural     |
              price        |
                         L |          1  (constrained)
                   foreign |   4.042227   .8126842     4.97   0.000     2.449395    5.635059
                     _cons |  -3.554518   3.569106    -1.00   0.319    -10.54984    3.440801
              -------------+----------------------------------------------------------------
              mpg          |
                         L |    -2.0604   .3596857    -5.73   0.000    -2.765371   -1.355429
                   foreign |  -2.739423   1.369886    -2.00   0.046    -5.424351   -.0544954
                     _cons |   39.66227   7.471028     5.31   0.000     25.01933    54.30522
              -------------+----------------------------------------------------------------
              L            |
              displacement |   .0137306   .0049636     2.77   0.006     .0040021    .0234591
                gear_ratio |  -.9408139   .7621917    -1.23   0.217    -2.434682    .5530545
                      turn |   .1976456   .0710133     2.78   0.005     .0584622     .336829
                     trunk |   .0588124   .0533685     1.10   0.270    -.0457879    .1634126
            ---------------+----------------------------------------------------------------
               var(e.price)|   4.576513    .907895                      3.102217    6.751453
                 var(e.mpg)|   10.99188    2.81405                      6.655097    18.15472
                   var(e.L)|   .5206339   .4753274                      .0869769    3.116456
            --------------------------------------------------------------------------------
            I think it's appropriate to describe the two equations specified by “(price mpg <- foreign L)” as “measurement” equations, despite their inclusion of the OEx variable “foreign” among their right-hand-side variables. I’m not sure whether -sem-‘s omission of the label “Measurement” in -sem-‘s output is intentional or a small formatting bug.

            Since my two posts #1 and #2 above, I have realized that Stata’s -auto.dta- data contains a third indicator of “lifetime vehicle cost”: the variable -rep78-, an ordinal variable capturing “Repair record 1978”. Conceptually, “lifetime vehicle cost” should be positively correlated with the variable “price” and negatively correlated with the variables “mpg” and “rep78”. Also, on reflection, why not include the OEx “foreign” in the “cause” or “structural” equation that determines L, along with the other four OEx variables?

            If we blithely ignore -rep78-‘s ordinal nature and pretend it is cardinal, we can estimate a MIMIC model of lifetime vehicle cost with the syntax:
            Code:
            sysuse auto2, clear
            replace price = price/1000
            lab var price "Price in $1,000's"
            sem  (displ gear turn trunk foreign -> L) (L -> price mpg rep78)
            Or, if one prefers the econometrics convention of placing dependent variables on the left-hand-side, the -sem- syntax could be written:
            Code:
            sem  (L <- displ gear turn trunk foreign) (price mpg rep78 <- L)
            Either way the result is:
            Code:
            . sem  (L <- displ gear turn trunk foreign) (price mpg rep78 <- L)
            note: The following observed variable name will be treated as a latent variable: L.  If this is not your intention use
                  the nocapslatent option, or identify the latent variable names in the latent() option.
            (5 observations with missing values excluded)
             
            Endogenous variables
              Measurement: price mpg rep78
              Latent:      L
             
            Exogenous variables
              Observed: displacement gear_ratio turn trunk foreign
             
            Fitting target model:
            Iteration 0:   log likelihood = -2066.3226  (not concave)
            Iteration 1:   log likelihood = -1271.1914  (not concave)
            <snip>
            Iteration 15:  log likelihood = -1204.5286  (not concave)
            Iteration 16:  log likelihood = -1204.2362
            Iteration 17:  log likelihood = -1203.7643
            Iteration 18:  log likelihood = -1203.7532
            Iteration 19:  log likelihood = -1203.7531
             
            Structural equation model                                   Number of obs = 69
            Estimation method: ml
             
            Log likelihood = -1203.7531
             
             ( 1)  [price]L = 1
            --------------------------------------------------------------------------------
                           |                 OIM
                           | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
            ---------------+----------------------------------------------------------------
            Structural     |
              L            |
              displacement |   .0092242   .0045904     2.01   0.044     .0002273    .0182212
                gear_ratio |  -.8224961    .603289    -1.36   0.173    -2.004921    .3599286
                      turn |   .1643932   .0591321     2.78   0.005     .0484964      .28029
                     trunk |   .0287036   .0405003     0.71   0.478    -.0506755    .1080828
                   foreign |   1.086969   .5914699     1.84   0.066    -.0722912    2.246228
            ---------------+----------------------------------------------------------------
            Measurement    |
              price        |
                         L |          1  (constrained)
                     _cons |  -.4864191   2.847152    -0.17   0.864    -6.066734    5.093896
              -------------+----------------------------------------------------------------
              mpg          |
                         L |  -2.921838   .6679504    -4.37   0.000    -4.230997   -1.612679
                     _cons |   40.66882   7.876091     5.16   0.000     25.23197    56.10568
              -------------+----------------------------------------------------------------
              rep78        |
                         L |  -.2311959   .0940379    -2.46   0.014    -.4155068    -.046885
                     _cons |   4.939195   .8118177     6.08   0.000     3.348062    6.530329
            ---------------+----------------------------------------------------------------
               var(e.price)|   5.691022   1.062862                      3.946561    8.206572
                 var(e.mpg)|   11.13628   4.038859                       5.47054    22.66994
               var(e.rep78)|   .8231382   .1479877                      .5786817    1.170862
                   var(e.L)|   .1509371   .3507742                      .0015872    14.35384
            --------------------------------------------------------------------------------
            LR test of model vs. saturated: chi2(10) = 52.53            Prob > chi2 = 0.0000
            In the real world the variables I have designated OEx seem likely to be jointly determined with the car’s “price”, “mpg” and “rep78” as part of a firm’s profit-maximizing decisions in an imperfectly competitive market. However, for the purposes of this demo model, let’s assume that these OEx are instead exogenous.

            If one can live with the assumption that the OEx variables are exogenous, this little demo MIMIC model “works” quite well. It converges reliably and quickly. The latent variable, L, has the anticipated relationship with the indicators, with statistically significant negative coefficients for both -mpg- and -rep78-. With the latent variable “anchored” by the observed price, units of the latent variable can be interpreted as thousands of dollars. The estimated coefficients of L in the indicator equations suggest that either a decline in -mpg- of 2.92 miles per gallon or a decline in the repair record of 0.23 index points is valued by the consumer about the same as a $1,000 increase in the purchase price. A decline of a full point in the repair record is thus valued by consumers at roughly $4,348 = 1000*(1/0.23).
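            For readers who want a delta-method standard error around that dollar figure, here is a minimal sketch run right after the -sem- above. It assumes the loading is stored as _b[rep78:L] in e(b); check the column names with -matrix list e(b)- first.
            Code:
            * Dollar value of a one-point improvement in rep78, with a delta-method SE
            nlcom (dollars_per_rep78_point: 1000*(1/(-_b[rep78:L])))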

            Assuming the average car was driven about 10,000 miles per year in 1978 and that gasoline cost 63 cents per gallon, the annual cost of reducing a car’s mileage by 2.92 miles per gallon would be: 10,000 * (1/2.92) * 0.63 = $2,157. So the implication of these estimates is that those buying cars in the US in 1978 did not fully appreciate the cost they would incur over time by buying a car with lower gas mileage.

            This MIMIC model is related to the theory of hedonic pricing as in these references:

            Rosen, S. (1974). "Hedonic prices and implicit markets: product differentiation in pure competition". Journal of Political Economy, 82(1): 34–55.

            Reis, Hugo J.; Silva, J. M. C. Santos (2006). "Hedonic Price Indexes for New Passenger Cars in Portugal (1997–2003)"

            The same model can be estimated with -gsem- if one uses the from(sem_eb, skip) option, where the matrix sem_eb holds the coefficient vector e(b) from the -sem- model at convergence. The coefficients and standard errors from the -gsem- model exactly match those from the -sem- model, as expected.
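            For concreteness, that cross-check amounts to something like the following sketch (the model statement is the same as above; only the from() option is new):
            Code:
            sem  (L <- displ gear turn trunk foreign) (price mpg rep78 <- L)
            matrix sem_eb = e(b)        // coefficient vector at the -sem- solution
            gsem (L <- displ gear turn trunk foreign) (price mpg rep78 <- L), from(sem_eb, skip)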

            Comment


            • #7
              Hello, Statalist. I want to impute a subcategory of a variable. How do I do that? If I create a dummy variable, it still gives me imputed values for the other categories.

              Comment


              • #8
                Originally posted by KAYDEN MARLI View Post
                Hello, Statalist. I want to impute a subcategory of a variable. How do I do that? If I create a dummy variable, it still gives me imputed values for the other categories.
                Please ask this as a new question and provide more information. If you bump a question that's already been answered, it's possible that people who could help you are missing your question. For instance, people who know structural equation modeling would be drawn to this sort of question, and SEM and multiple imputation probably don't overlap much so the MI specialists could be skipping over this.

                Comment


                • #9
                  input byte p04sex double age byte(p11etnic p15aho) str1 p15cdipl byte(p18wstat p23bstaa)
                  2 3 3 . " " . 1
                  2 20 5 9 "1" 4 1
                  1 37 4 7 "1" 2 2
                  1 24 4 6 "1" 4 1
                  2 35 4 6 "2" . 2
                  1 44 3 6 "1" . 1
                  2 32 5 11 "2" 4 1
                  2 42 3 . " " 4 1
                  2 2 4 . " " . 1
                  1 57 3 8 "1" 4 2
                  2 57 3 8 "1" 4 3
                  2 18 5 6 "1" . 1
                  2 27 8 11 "2" . 1
                  2 4 8 . " " . 1
                  1 37 8 7 "2" . 1
                  2 7 5 4 "2" . 1
                  2 53 5 6 "1" . 2
                  1 21 8 5 "2" 4 1
                  1 22 8 5 "2" 4 1
                  1 16 8 6 "2" 4 1
                  2 31 4 6 "2" . 1
                  2 9 4 4 "2" . 1
                  2 57 3 8 "1" 4 2
                  2 6 8 2 "2" . 1
                  2 51 8 10 "1" 4 2
                  2 12 5 . " " . 1
                  1 5 5 2 "2" . 1
                  2 6 5 1 " " . 1
                  2 50 8 6 "1" . 2
                  1 53 6 9 "7" 1 2
                  1 23 4 10 "2" 4 1
                  1 2 3 . " " . 1
                  2 58 3 6 "1" 4 2
                  1 22 8 9 "1" 4 1
                  1 59 5 7 "1" 4 2
                  2 9 4 . " " . 1
                  2 21 5 7 "1" 4 1
                  2 35 5 8 "1" 4 2
                  1 6 3 2 "2" . 1
                  2 22 6 9 "2" 3 1
                  2 12 6 4 "1" . 1
                  1 46 6 6 "2" 4 2
                  2 17 6 4 "2" . 1
                  2 18 4 11 "2" . 1
                  2 6 4 4 "2" . 1
                  2 50 4 4 "2" . 2
                  1 27 3 4 "2" 2 1
                  1 45 8 . " " 2 2
                  2 43 4 6 "1" 4 2
                  2 18 8 6 "2" . 1
                  1 17 8 6 "2" . 1
                  2 39 8 10 "1" 4 1
                  1 59 6 . " " 2 2
                  2 37 8 6 "2" 4 2
                  2 6 8 4 "2" . 1
                  1 43 8 11 "2" 4 2
                  2 17 5 7 "2" . 1
                  1 48 5 . " " 4 2
                  1 42 3 6 "2" 2 1
                  2 38 8 10 "2" 4 1
                  1 50 4 6 "1" 4 2
                  2 29 4 6 "2" 4 3
                  1 46 8 6 "2" . 2
                  2 3 8 . " " . 1
                  1 5 3 2 "9" . 1
                  2 13 5 6 "2" . 1
                  1 8 4 4 "2" . 1
                  1 34 4 6 "1" 2 2
                  2 22 5 8 "2" . 1
                  1 41 3 4 "2" 4 2
                  2 13 3 4 "2" . 1
                  2 41 6 . " " . 2
                  1 33 5 6 "2" 4 2
                  1 4 5 . " " . 1
                  1 32 5 6 "1" 4 2
                  2 10 5 4 "2" . 1
                  2 22 5 8 "1" . 1
                  1 14 8 4 "1" . 1
                  2 24 4 11 "2" . 2
                  1 27 5 9 "2" 2 2
                  2 28 5 11 "2" 4 2
                  2 25 5 7 "1" 4 2
                  1 38 5 11 "2" 4 2
                  2 45 4 7 "1" . 2
                  1 19 4 9 "2" . 1
                  1 14 5 6 "7" . 1
                  2 40 3 6 "2" . 9
                  2 10 3 4 "2" . 1
                  1 47 6 6 "1" 2 2
                  1 52 5 11 "2" 4 2
                  1 28 5 6 "2" 4 1
                  2 21 5 8 "1" . 1
                  2 19 5 6 "1" . 1
                  2 13 5 6 "2" . 1
                  2 23 5 9 "1" . 1
                  2 12 5 4 "2" . 1
                  1 13 5 6 "2" . 1
                  1 28 8 6 "1" 4 1
                  2 25 8 6 "2" 4 1
                  2 6 8 2 " " . 1



                  . describe p04sex age p11etnic p15aho p15cdipl p18wstat p23bstaa

                                  storage   display    value
                  variable name     type    format     label      variable label
                  -------------------------------------------------------------------------------------------
                  p04sex            byte    %10.0g                Sex
                  age               double  %12.0g                Age
                  p11etnic          byte    %10.0g                Ethnicity
                  p15aho            byte    %10.0g                Type of highest formal education
                  p15cdipl          str1    %1s                   Certficate attained?
                  p18wstat          byte    %10.0g                Status in employment
                  p23bstaa          byte    %10.0g                Marital status

                  Comment
