Imputation of scale items for an RCT

LA Sheira

Join Date: Feb 2020

Posts: 6
#1

09 Feb 2024, 17:48

I am working with data from 2 arms (intervention/control) and 3 study waves (v1, v2, v3). I am trying to impute the 23 items of a scale (rather than the summed scale itself). The missingness is random across the items from what I can tell. Each item has a Likert response from 0-4, and the 23 imputed items will be summed to give a score ranging from 0-92. I was unable to get mi impute chained with ologit to converge, but found a workaround in truncreg, which doesn't give me ordinal items per se, but the imputed values are within range and usable for the sum. My current code looks like this (I include the overall scale, with mean-imputed values for those with missingness, just so they are not kicked out of the model; my review of the literature recommends including the overall scale):

mi impute chained (truncreg, ll(0) ul(4)) item1 item2 item3....item23 = overall_score educ marstat age food_insecure_i sex, by(arm visit) add(20) replace force rseed (12345)

Here is the issue: in an ideal world, I would use items 2-23 in the prediction model to impute item 1, and so forth. Is there a computationally efficient way to do this? I have two thoughts:

1: If I run mi impute 23 times, would I need to mi export each data set then merge 23 datasets in order to subsequently run mi estimate commands properly? This seems computationally intensive and unnecessary.

and/or

2: These other scale items/potential predictor variables are not all complete (that is the problem I am dealing with); however, the observed values are informative for predicting the missing values. But I cannot impute across all the different possible combinations of missingness (for example, if someone is missing Q1 and Q3, and I am imputing Q1 and including Q3 as a predictor for everyone, their Q1 value will not get imputed, since they are missing Q3 and will be kicked out of the prediction model). Is there some specification to avoid this?

I need to keep the by arm and visit specification as the intervention will impact these scores and their changes over time, so I don't necessarily want to use P1V3 to predict P1V1, etc.

I hope this is all clear, and I appreciate any advice and guidance anyone can give.
Tags: imputation, missing data, multiple imputation, scale development
Tiago Pereira

Join Date: Jan 2016

Posts: 365
#2

11 Feb 2024, 05:30

Hello, LA Sheira.

It is difficult to help you without the original data. MI is tricky and requires careful assessment.

If you have an RCT, repeated measurements and the assumption of missing at random is acceptable, you do not need to impute data. You can use mixed effects models (e.g., -mixed-), which will naturally account for missing data.
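A minimal sketch of what such a model might look like, assuming the data are in long format (one row per participant and visit). Here id is a hypothetical participant identifier that the thread never names; overall_score, arm, and visit are the names from the first post. This is an illustration under those assumptions, not a recommended final specification.

```stata
* Sketch only: id is a hypothetical participant identifier.
* Rows with a missing overall_score are dropped, but the participant's
* other visits still contribute to the likelihood.
mixed overall_score i.arm##i.visit || id:, reml

* Joint test of the arm-by-visit interaction
testparm i.arm#i.visit
```

Note that this treats the summed score as the unit of missingness, so a visit with even one missing item would count as missing unless the score is computed some other way.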

Check these papers:

1. Peters, S. A., Bots, M. L., den Ruijter, H. M., Palmer, M. K., Grobbee, D. E., Crouse III, J. R., ... & METEOR study group. (2012). Multiple imputation of missing repeated outcome measurements did not add to linear mixed-effects models. Journal of clinical epidemiology, 65(6), 686-695.
2. Twisk, J., de Boer, M., de Vente, W., & Heymans, M. (2013). Multiple imputation of missing values was not necessary before performing a longitudinal mixed-model analysis. Journal of clinical epidemiology, 66(9), 1022-1028.

Hope this helps.

Tiago
daniel klein

Join Date: Mar 2014

Posts: 3806
#3

11 Feb 2024, 05:59

I will only briefly comment on the following:

Originally posted by LA Sheira

[...] (I include the overall scale with mean imputed values for those with missingness just to not kick them out of the model, as including the overall scale from my review of literature is recommended):

mi impute chained (truncreg, ll(0) ul(4)) item1 item2 item3....item23 = overall_score educ marstat age food_insecure_i sex, by(arm visit) add(20) replace force rseed (12345)

Don't do that!

Do not use the force option. The two most common reasons people think they need force are (a) listing variables with missing values on the right-hand side of the equals sign and/or (b) having "hard" missing values, i.e., .a, .b, etc. And, of course, because Stata suggests the option in the error message, which I think is a mistake. You never want missing imputed values unless you are doing some sort of methodological research on the effects of having missing imputed values.

If you want the overall score in the model, keep it as is, i.e., with all missing values, and list it among the other imputed variables to the left side of the equals sign.
LA Sheira

Join Date: Feb 2020

Posts: 6
#4

11 Feb 2024, 11:43

Originally posted by daniel klein

I will only briefly comment on the following:

Don't do that!

Do not use the force option. The two most common reasons people think they need force are (a) listing variables with missing values on the right-hand side of the equals sign and/or (b) having "hard" missing values, i.e., .a, .b, etc. And, of course, because Stata suggests the option in the error message, which I think is a mistake. You never want missing imputed values unless you are doing some sort of methodological research on the effects of having missing imputed values.

If you want the overall score in the model, keep it as is, i.e., with all missing values, and list it among the other imputed variables to the left side of the equals sign.

Interesting, as this is one of the sources of my issue. I do have missing values on the right-hand side; however, I want to gather the information from non-missing people to impute. This is indeed what Stata suggests; why do you think it is a mistake? If someone is missing, say, 2 items and their mother's education, I don't think that is a good reason to kick them out of the analysis completely and not impute.
daniel klein

Join Date: Mar 2014

Posts: 3806
#5

11 Feb 2024, 13:22

I think you got me wrong. You are not supposed to exclude any observations from your imputation model. You need to figure out why you have missing imputed values and act accordingly. There are usually two possibilities:

1. Your variables contain "hard" missing values, i.e., .a, .b, etc. If so, recode to "soft" missing values, i.e., . (also called sysmis).
2. Variables on the right-hand side of the equals sign contain missing values. If so, register them imputed and put them on the left-hand side.
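Both fixes can be sketched as follows. This assumes the variables are numeric, that the data have not yet been mi set, and that the 23 items are stored contiguously so that item1-item23 expands to all of them; the variable names are taken from the command in the first post, and the list should be trimmed to whichever variables actually contain missing values.

```stata
* 1. Turn "hard" missing values (.a, .b, ...) into "soft" missing (.)
foreach v of varlist item1-item23 educ marstat age food_insecure_i sex {
    replace `v' = . if missing(`v')
}

* 2. Register every variable with missing values as imputed, so it can
*    go on the left-hand side of the equals sign in mi impute chained
mi set flong
mi register imputed item1-item23 educ marstat age food_insecure_i sex
```

If the data are already mi set, the recoding would have to happen in the original data (m=0) and be followed by mi update.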
LA Sheira

Join Date: Feb 2020

Posts: 6
#6

11 Feb 2024, 16:50

Originally posted by daniel klein

I think you got me wrong. You are not supposed to exclude any observations from your imputation model. You need to figure out why you have missing imputed values and act accordingly. There are usually two possibilities:

1. Your variables contain "hard" missing values, i.e., .a, .b, etc. If so, recode to "soft" missing values, i.e., . (also called sysmis).
2. Variables on the right-hand side of the equals sign contain missing values. If so, register them imputed and put them on the left-hand side.

Got it--I don't want to impute (in this case/example) mother's education. I want to use it on the right side to impute my missing left-side items of the scale. Unless I am mistaken, Stata kicks out of the imputation model (what I meant by exclude; excuse the poor choice of language there) those data points (participants, in my situation) with missingness on the right-hand-side variables. I don't particularly want or need to impute these missing soc-dem items.

None of this, respectfully, gets me closer to my primary issue which is how to use (the incomplete) data from the other 22 items to impute each item, iteratively, on this 23 item scale.
Clyde Schechter

Join Date: Apr 2014

Posts: 29796
#7

11 Feb 2024, 19:03

None of this, respectfully, gets me closer to my primary issue which is how to use (the incomplete) data from the other 22 items to impute each item, iteratively, on this 23 item scale.

If I understand correctly what you are asking, the answer is you don't have to do anything. Your command

Code:

mi impute chained (truncreg, ll(0) ul(4)) item1 item2 item3....item23 = overall_score educ marstat age food_insecure_i sex, by(arm visit) add(20) replace force rseed (12345)

will cause Stata to impute values of item 1 from items 2 through 23 and overall_score, educ, marstat, age, food_insecure_i and sex. It will also impute item2 from item1 and items 3 through 23 and overall_score through sex. It imputes item3 from items 1, 2, and 4 through 23 as well as overall_score through sex. And so on. In other words, and probably more simply, the syntax has to be understood that every variable on the left side of the = is also (implicitly) on the right side. Not only is this the default behavior of -mi impute chained-, as far as I know there is no way to prevent it even if you wanted to.
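One way to see this for yourself is the dryrun option of -mi impute chained-, which displays the conditional models without imputing anything. The sketch below uses auto.dta rather than the scale data, with missing values introduced artificially; weight is chosen as the right-hand-side variable only because it is fully observed in that dataset.

```stata
sysuse auto, clear
replace price = . in 1/10
replace mpg   = . in 5/15

mi set flong
mi register imputed price mpg

* dryrun shows the conditional specifications and stops there
mi impute chained (regress) price mpg = weight, dryrun
```

The listing of conditional models should show price being predicted from mpg and weight, and mpg from price and weight, that is, each left-hand-side variable appearing as a predictor of the other.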
LA Sheira

Join Date: Feb 2020

Posts: 6
#8

11 Feb 2024, 21:56

Originally posted by Clyde Schechter

If I understand correctly what you are asking, the answer is you don't have to do anything. Your command

Code:

mi impute chained (truncreg, ll(0) ul(4)) item1 item2 item3....item23 = overall_score educ marstat age food_insecure_i sex, by(arm visit) add(20) replace force rseed (12345)

will cause Stata to impute values of item 1 from items 2 through 23 and overall_score, educ, marstat, age, food_insecure_i and sex. It will also impute item2 from item1 and items 3 through 23 and overall_score through sex. It imputes item3 from items 1, 2, and 4 through 23 as well as overall_score through sex. And so on. In other words, and probably more simply, the syntax has to be understood that every variable on the left side of the = is also (implicitly) on the right side. Not only is this the default behavior of -mi impute chained-, as far as I know there is no way to prevent it even if you wanted to.

Wow--this is a great and interesting development and completely addresses my problem! Thanks Clyde!
daniel klein

Join Date: Mar 2014

Posts: 3806
#9

12 Feb 2024, 02:07

Let me start with this one:

Originally posted by LA Sheira

I don't want to impute (in this case/example) mother's education.

Yes, you do want to impute mother's education. You might want to exclude the respective observations with missing values after imputation and before the analyses, but there is no good reason to do that either. Think about it this way: There is an association between mother's education and the scale items. Otherwise, why put mother's education in the model? But associations work both ways. If mother's education is informative for imputing missing scale items, then scale items are also informative for imputing mother's education. If you put variables with missing values on the right-hand side of the equals sign, you are throwing away information. Why would you do that?

More alarmingly, reading

Originally posted by LA Sheira

I want to use it on the right side to impute my missing left side items of the scale.

and also

Originally posted by Clyde Schechter

Your command [...] will cause Stata to impute values of [...] and overall_score, educ, marstat, age, food_insecure_i and sex. And so on. In other words, and probably more simply, the syntax has to be understood that every variable on the left side of the = is also (implicitly) on the right side.

there appears to be a widespread deep misunderstanding of how mi's syntax actually works. So let's clarify.

Here is a simple example:

Code:

. clear
. sysuse auto
(1978 automobile data)
.
. replace price = . in 1/10
(10 real changes made, 10 to missing)
. replace mpg = . in 5/15
(11 real changes made, 11 to missing)
.
. list price mpg rep78 in 1/15

     +----------------------+
     |  price   mpg   rep78 |
     |----------------------|
  1. |      .    22       3 |
  2. |      .    17       3 |
  3. |      .    22       . |
  4. |      .    20       3 |
  5. |      .     .       4 |
     |----------------------|
  6. |      .     .       3 |
  7. |      .     .       . |
  8. |      .     .       3 |
  9. |      .     .       3 |
 10. |      .     .       3 |
     |----------------------|
 11. | 11,385     .       3 |
 12. | 14,500     .       2 |
 13. | 15,906     .       3 |
 14. |  3,299     .       3 |
 15. |  5,705     .       4 |
     +----------------------+

Here, I load auto.dta and introduce some missing values in price and mpg. Next, I use mi set and register both variables imputed. Notice that I do not register rep78, which also has missing values, imputed:

Code:

. mi set flong
. mi register imputed price mpg
(15 m=0 obs now marked as incomplete)

I am going to use rep78 on the right-hand side to impute missing values in price and mpg. Let's force our way through the imputation as it is so often done:

Code:

. mi impute chained (regress) price mpg = i.rep78 , add(1) rseed(42) force

Conditional models:
             price: regress price mpg i.rep78
               mpg: regress mpg price i.rep78

Wait. Let's pause for a second. Stata is telling us how it sets up the conditional models. Note that rep78 is never listed as an outcome variable. That means, missing values in rep78 are not going to be imputed. The reason is that I have listed rep78 on the right-hand side of the equals sign. Stata is reminding me of that fact, again, after finishing the forced imputations:

Code:

Note: Right-hand-side variables (or weights) have missing values; model parameters estimated using listwise deletion.

So Stata omits the respective observations from the conditional models. Here is what the imputed data look like:

Code:

. list price mpg rep78 if _mi_m == 1 & _mi_id <= 15

     +---------------------------+
     |   price       mpg   rep78 |
     |---------------------------|
 75. | 7,912.7        22       3 |
 76. |  10,269        17       3 |
 77. |       .        22       . |
 78. |  10,165        20       3 |
 79. | 6,969.8    9.3168       4 |
     |---------------------------|
 80. | 7,219.8   22.4681       3 |
 81. |       .         .       . |
 82. | 7,055.6   17.8603       3 |
 83. |  11,132   10.4983       3 |
 84. |  10,778   18.7615       3 |
     |---------------------------|
 85. |  11,385   17.5622       3 |
 86. |  14,500    2.2238       2 |
 87. |  15,906   9.87846       3 |
 88. |   3,299   24.9592       3 |
 89. |   5,705    10.213       4 |
     +---------------------------+

See how none of the missing values of rep78 was imputed? See also that missing values in rep78 cause missing imputed values in price and mpg? Look at observation 77. The missing value of price is not imputed even though the value of mpg is observed! But, because there is also a missing value on rep78, that observation is not used in the model to impute price.

Now, if this behavior is really what you want, for whatever obscure reason, then go ahead. I stand by my original suggestion: you never want that and Stata should not suggest you use the force option.

tl;dr: Register all variables with missing values imputed and put them on the left-hand side of mi impute. The right-hand side of the equals sign in mi impute should be reserved for variables with no missing values. If you do not have fully observed variables, omit the right-hand side; there is nothing wrong with that.
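For comparison, the same example could be rerun along the lines of the tl;dr: rep78 registered as imputed and moved to the left-hand side, no right-hand side, and no force. Imputing rep78 by predictive mean matching is an assumption made for this sketch (it keeps imputed values among the observed categories); it is not part of the example above.

```stata
sysuse auto, clear
replace price = . in 1/10
replace mpg   = . in 5/15

mi set flong
mi register imputed price mpg rep78

* All three variables on the left; nothing after the variable list
mi impute chained (regress) price mpg (pmm, knn(5)) rep78, add(1) rseed(42)

* Should report 0 if no missing imputed values remain
count if _mi_m == 1 & missing(price, mpg, rep78)
```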

Last edited by daniel klein; 12 Feb 2024, 02:23. Reason: fixing lots of typos; might be worth writing this up as a Tip in the Stata Journal ...
Clyde Schechter

Join Date: Apr 2014

Posts: 29796
#10

12 Feb 2024, 11:54

Your command [...] will cause Stata to impute values of [...] and overall_score, educ, marstat, age, food_insecure_i and sex. And so on. In other words, and probably more simply, the syntax has to be understood that every variable on the left side of the = is also (implicitly) on the right side.

I think the ellipses that daniel klein has inserted in my comment have changed its meaning. The original, full sentence is:

Your command
Code:

mi impute chained (truncreg, ll(0) ul(4)) item1 item2 item3....item23 = overall_score educ marstat age food_insecure_i sex, by(arm visit) add(20) replace force rseed (12345)
will cause Stata to impute values of item 1 from [emphasis added] items 2 through 23 and overall_score, educ, marstat, age, food_insecure_i and sex. It will also impute item2 from item1 and items 3 through 23 and overall_score through sex.

The meanings are quite different. In particular, the word "from" in my full sentence, elided in the version in #9, syntactically disrupts any claim that missing values of the right hand side variables will be imputed. It only says that the left hand side variables are also used to impute missing values of other left hand side variables. There is nothing in my original sentence that implies that missing values of right hand side variables will be imputed.

Daniel Klein goes on to point out that when right hand side variables themselves contain missing values, this causes problems, including failure to impute left hand side variables in observations where the right hand side variable(s) has missing values. This is correct, and in no way contradicts what I said. Let me be clear: I agree completely with Daniel's advice that all variables with missing values should be -mi register-ed as imputed and should appear on the left hand side. But my comments in #7 did not address that issue at all: Daniel made those points, very clearly, in #3, and O.P. rejected the advice in #4 and #6. That rejection will, in my opinion, be to her detriment, but given her apparent determination to proceed I did not express my opinion on the issue in #7.

tl;dr I agree with all of Daniel Klein's advice in this thread. However, he appears to have misunderstood the sentence I wrote in #7 that he quoted in #9. In particular, the phrases that he elides in his quote change the meaning of what I wrote. Without the ellipses, my sentence neither agrees nor disagrees with what he said and deals with a different aspect of the problem altogether.
daniel klein

Join Date: Mar 2014

Posts: 3806
#11

12 Feb 2024, 12:48

I apologize to Clyde Schechter for completely misrepresenting his post. It was an honest misunderstanding from my side. I can't even blame it on English not being my mother tongue. It was just sloppy reading on my part.
Clyde Schechter

Join Date: Apr 2014

Posts: 29796
#12

12 Feb 2024, 13:11

Thanks. No problem. I think we've all misread things here from time to time.

Cheers!
LA Sheira

Join Date: Feb 2020

Posts: 6
#13

13 Feb 2024, 12:12

Thanks for the continued conversation! The illustration from Daniel in #9 is interesting--my model was actually showing complete imputation despite missingness on the right hand side variables. Let me tinker with it to understand why as well as putting the variables on the left hand side as Daniel suggests to see how the results compare.

The variables I wanted to use in the imputation were the responses to the other 22 items--not necessarily soc-dem characteristics--as I suspect the items are most informative for predicting values. With this study sample there is not so much variation in soc-dem characteristics--they are all students from the same grade in school and similar SES in an ethnically homogenous country, hence my dedication to trying to figure out how to include the responses to the other items as I knew deep down that the soc-dem alone were not overwhelmingly informative. I didn't include all of this information in the original post as I thought they would unnecessarily complicate the explanation of my question. Thanks!

Edit to add: none of these soc-dem characteristics will be in my model either, as it is a simple model with just arm and time point given we have baseline balance.
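A rough sketch of how the summed score and that analysis model might be set up once the items are imputed. The names total (the summed scale) and id (participant identifier) are hypothetical, the random-intercept model is only one possible reading of "arm and time point", and item1-item23 is assumed to expand to all 23 items in dataset order.

```stata
* Build the 0-92 score within each imputed dataset
mi passive: egen total = rowtotal(item1-item23)

* Arm, visit, and their interaction only; results pooled with Rubin's rules
mi estimate: mixed total i.arm##i.visit || id:
```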

Last edited by LA Sheira; 13 Feb 2024, 12:18.