
  • Exporting imputed data from Stata and working with imputed data in Stata

    Hi. I'm trying to understand the best way to work with imputed data in Stata after imputation, specifically exporting the imputed data and performing regressions on the imputed data. I am using Stata/SE 15.1 for Windows.

    I have 11 categorical variables and 355 respondents, with roughly 20% of each variable's data missing. I have used the Statistics>Multiple imputation>Multiple imputation control panel to impute the missing values. I have several questions about working with the data following imputation.

    1) I have imputed the data 20 times and wanted to export the imputed sets to Excel so that I can average the 11 variable responses per respondent across the 20 data sets and develop a score from the overall average per respondent. After imputation, I've used the File>Export>Data to Excel spreadsheet command, but I'm having difficulty understanding the exported Excel file. The first portion is obviously my original data with the missing values, but the groups of imputed values below don't seem to correspond to the missing values in the original data set (at the top of the spreadsheet). Am I mistaken or not understanding the imputation output? Is there a cleaner way to export the data where I get 20 complete sets of data with the imputed values included? I've played with multiple export commands and can't seem to find a solution. I've attached the associated file. Imputed B24 data.xls Any insight would be greatly appreciated.

    2) Following imputation, I want to perform various analyses on the imputed data. For instance, I want to regress the 11 imputed variables in the attached spreadsheet on 'age'. For a logistic regression I know to use the 'logit' command, but I'm uncertain how to reference my newly imputed data in the command line. I chose 'marginal long' as the data type when conducting the imputation, and thought I could refer to the new data as 'mlong', but I continue to get an r(111) error (variable not found).

    3) This last part isn't really a problem, just something I need clarification on as I'm new to multiple imputation. When I regress the newly imputed data on other variables (such as age), Stata will use all 20 imputed data sets to run the regressions, correct? I want to make sure I understand this so I can explain it adequately.

    If clarification is needed or I need to move this post to another forum, please let me know. Thank you very much in advance for any advice that is given.

  • #2
    I pretty much stopped at

    1) I have imputed the data 20 times and wanted to export the imputed sets to Excel
    and so should you. This may sound harsh but I really mean to help here. One of the most important things to realize when you start using Stata (or basically any other serious statistical software) is this: Stata is a statistical software package; Excel is not. You will never ever have to or want to use Excel for anything involving statistics. It follows that you can do everything you want here in Stata. mi is not just for performing imputations; it also has data management tools and supports many estimation commands.

    The second thing to realize (that I wish I had realized earlier) is that Stata's documentation is so much more than a list of technical details. I believe that you will benefit a lot from reading through the [MI] manual that is also included in your installation of Stata.

    Follow the advice in

    Code:
    help mi
    and start by reading [MI] Intro substantive. Reading through the theoretical discussions and worked examples will give you a pretty good idea of multiple imputations in general and how to perform the imputations as well as the appropriate analysis in Stata.

    Concerning posts on Statalist, a simple principle is: exact code, exact output, and examples say more than a thousand words. This is why you will want to use syntax (or copy/log the syntax that Stata creates for you when you navigate the menus). For example, when you say

    I have used the Statistics>Multiple imputation>Multiple imputation control panel to impute the missing values
    we have no idea about the mi-style that you have chosen, the variables that you have registered (or did not register), the imputation model that you have used, whether you have set a random seed to replicate results, ... Moreover, you do not have the chance to have additional potential problems pointed out that you did not even realize. Compare your verbal descriptions with something along the lines of

    Code:
    set seed 42
    mi set mlong
    mi register imputed x1 x2 x3
    mi impute chained (regress) x1 x2 (logit) x3 = y , add(20)
    Stating problems and attempted solutions in this fashion, that is, in terms of code that you have written [copy the exact code whenever possible] makes it easier for both you and others. Again, this is not pedantry; it really helps you get better answers.

    Best
    Daniel
    Last edited by daniel klein; 12 Sep 2019, 15:30.



    • #3
      Daniel,

      Thank you for the response. I took your advice and read through the relevant portions of the MI manual. Rather than using the soft buttons to conduct the MI, I used the following code:

      . mi set mlong

      . mi register imputed slow_fast meth_cas anal_int reas_felt prec_appr sol_coll diff_easy rav_rseek opt_appr calm_wor info_uninfo
      (107 m=0 obs. now marked as incomplete)

      . set seed 1837

      . mi impute mvn slow_fast meth_cas anal_int reas_felt prec_appr sol_coll diff_easy rav_rseek opt_appr calm_wor info_uninfo, add(20)

      I've marked my variables in blue as there are 11 of them and I didn't want to cause confusion.

      My first question has not changed much. I still need to average the 11 variables per respondent to give each respondent a score. I understand that this forum does not approve of Excel to do that, so is there a way to average the 11 variables per respondent in each of the 20 imputed data sets Stata has created for me? I could then average the 20 averages per respondent together to come up with a final average per respondent.

      If there is not a way to do this in Stata, is there a way to export the 20 imputed data sets to Excel so I can perform the averaging there? I understand this is unorthodox, but at this moment I can't move forward with my analyses until I develop this averaged score per respondent.

      Any help is appreciated. Thank you.



      • #4
        First thing I notice is that you have not put more than your 11 (predictor?) variables into the imputation model. If you are planning on doing any analyses that involve other variables, those variables must be in the imputation model, too.

        Concerning your question, have you looked at mi passive? You probably want

        Code:
        mi passive : egen score = rowmean(slow_fast ... info_uninfo)
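
        A minimal sketch of the step after that, using the mheart5 example data rather than your variables. The rowmean of age and bmi is meaningless and only stands in for your 11 items, and female as the predictor is likewise just an illustration. mi passive creates the score within each imputed dataset; mi estimate then fits the model on every imputed dataset and pools the results, so nothing has to be averaged by hand:

        Code:
        * example data shipped with Stata; age and bmi contain missing values
        webuse mheart5, clear
        mi set mlong
        mi register imputed age bmi
        set seed 29390
        mi impute mvn age bmi = attack smokes hsgrad female, add(10)
        * illustrative score only: row mean of the two imputed variables
        mi passive: egen score = rowmean(age bmi)
        * fit the regression in each imputation and combine the estimates
        mi estimate: regress score female
        Note that female is part of the imputation model here, which matches the point made above about including analysis variables when imputing.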
        Best
        Daniel



        • #5
          My plan was to create the averaged score I discussed in the previous post to use as the dependent variable, then use several demographic variables (age, gender, education level, etc.) as independent variables. I would be regressing the averaged score on the demographic variables. Since all 11 variables I have listed in the above post have missing values, I thought I should use MI for all of them. I did find this excerpt from the link below, though, that would seem to point me in the direction of your suggestion:

          https://stats.idre.ucla.edu/stata/seminars/mi_in_stata_pt1_new/:

          "The second command is mi impute mvn where the user specifies the imputation model to be used and the number of imputed datasets to be created.

          mi impute mvn female write read math progcat1 progcat2 science = socst, add(10) rseed (53421)

          On the mi impute mvn command line we can use the add option to specify the number of imputations to be performed. In this example we chose 10 imputations. Variables on the left side of the equal sign have missing information, while the right side is reserved for variables with no missing information and are therefore solely considered “predictors” of missing values. As you can see, even through science is an auxiliary variable, science must be included as a variable to be imputed."


          To be clear, you're suggesting I include any associated variables into the code to the right of the equal sign (indicating they are predictor variables for the variables to be imputed and don't require imputation themselves), right?

          Or are you saying that if I plan to do, for example, a linear regression following imputation with the averaged scores (DV) and various demographic variables (IVs), I need to include those IVs to the right of the equal sign?

          Ideally I would like to have the averaged scores as variables in their own right so I can use them in a number of analyses.

          Thank you for the mi passive suggestion. I am researching that right now.



          • #6
            I think I understand part of my problem. In my previous post (directly above) I asked the following:

            "To be clear, you're suggesting I include any associated variables into the code to the right of the equal sign (indicating they are predictor variables for the variables to be imputed and don't require imputation themselves), right?

            Or are you saying that if I plan to do, for example, a linear regression following imputation with the averaged scores (DV) and various demographic variables (IVs), I need to include those IVs to the right of the equal sign?"


            I think I was describing the same thing in two different ways. I've made adjustments to my code and have gotten the following result:

            . mi set mlong

            . mi register imputed slow_fast meth_cas anal_int reas_felt prec_appr sol_coll diff_easy averse_prone opt_appr calm_wor info_uninfo
            (107 m=0 obs. now marked as incomplete)

            . set seed 1837

            . mi impute mvn slow_fast meth_cas anal_int reas_felt prec_appr sol_coll diff_easy ave
            > rse_prone opt_appr calm_wor info_uninfo = age gender edu_years yown_years own_crop o
            > wn_live own_wild own_invest own_rec own_other prop_time prop_income, add(10)
            'edu_years' found where numeric variable expected
            r(7);


            I've highlighted the problem in blue. I'm uncertain why Stata expects there to be a numeric value there. I've been unable to find any mention of an upper limit on predictor variables in MI. Does that error code make sense to you?

            I understand my questions are probably very elementary. I'm very new to this and very much appreciate your patience thus far.



            • #7
              The problem with basic questions is that you would benefit a lot more from on-site help; a forum is fine but unlikely to get you all the way. Do you have anyone at your institution with some experience in both MI and Stata? Ideally, that person also has some basic understanding of what you are trying to do.

              It appears that you only impute missing values in your dependent variable. It further appears as if you are planning to go for some linear model. In that case, you might want to consider sticking with SEM and possibly FIML (full information maximum likelihood) instead of MI; the latter is usually a lot simpler to implement. I will stick with your questions regarding MI here but note that it is unlikely that I will be able to walk you through the whole process. However, it appears as if you are making some progress which is good.

              In your current code, you use the mvn (multivariate normal) imputation method; this is fine if all of your variables with missing values are (quasi-)continuous. If you have any categorical variables with missing values, you probably want to switch to chained equations. I cannot tell whether this is the case from your code. However, I guess that gender is a categorical variable. If it has values other than 0 and 1, you want to use factor variable notation, that is, type i.gender.

              Concerning the question about edu_years, it appears that this is a string variable; those will not work in mi (or any other statistical analyses). You need numeric variables.
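
              If edu_years holds numbers that are merely stored as text, destring is one way to convert it. The sketch below uses a small made-up dataset (the three values are hypothetical, since I cannot see your data) and is probably best run before mi set. If the variable contains entries such as "12 years", destring will refuse to convert it and the values need cleaning first.

              Code:
              * toy data: a string variable holding numeric text (values are made up)
              clear
              input str2 edu_years
              "12"
              "16"
              "9"
              end
              * storage type str# confirms the variable is a string
              describe edu_years
              * convert in place; the variable is left untouched if nonnumeric characters are found
              destring edu_years, replace
              describe edu_years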

              Best
              Daniel



              • #8
                Thank you for continuing to help, Daniel. Sorry for the late response. I managed to find someone to help me in person. I really appreciate all of your effort.



                • #9
                  How can I save multiple imputed datasets?





                    • #11
                      Here is a way for 10 imputed datasets.

                      Code:
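                      webuse mheart5
                      mi set mlong
                      mi register imputed age bmi
                      set seed 29390
                      mi impute mvn age bmi = attack smokes hsgrad female, add(10)
                      * keep a copy of the full mi data so it can be brought back after each extraction
                      preserve
                      forval i=1/10{
                          * replace the data in memory with one imputation as an ordinary dataset
                          mi extract `i', clear
                          save mi_dataset`i'
                          * bring back the full mi data and preserve it again for the next pass
                          restore, preserve
                      }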
                      The datasets will be saved as mi_dataset1.dta, mi_dataset2.dta,..., mi_dataset10.dta in your current directory.



                      • #12
                        Hi everyone:

                        I'm starting to work with mi.

                        If I request 5 imputations and extract them, I will have 5 datasets. But when I run, for example, a regression, it will be calculated using the 5 different datasets. Is that correct?

                        If so, can I save a single dataset (the product of the 5 imputations)?

                        Thank you very much.
                        Last edited by Oc td ds; 30 Oct 2022, 09:20.



                        • #13
                          Andrew Musau

                          Dear Andrew,
                          I just tried to follow your steps, but I couldn't understand what this part does (the loop over consecutive values). What is it for?


                          forval i=1/10{
                          mi extract `i', clear
                          save mi_dataset`i'
                          restore, preserve
                          }



                          Is it always necessary to use the loop to save the new values? Why?

                          Thanks in advance
                          Best,
                          Carolina



                          • #14
                            Originally posted by Carolina Hincapie
                            Andrew Musau

                            Dear Andrew,
                            I just tried to follow your steps
                            Why?

                            Originally posted by Carolina Hincapie
                            Is it always necessary to use the loop to save the new values? Why?
                            No; it is never necessary. If you want individual datasets, use flongsep style. As said, there is rarely a need to save the imputed datasets separately (from each other and/or from the original data).
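
                            A rough sketch of the flongsep route, again with the mheart5 example data. The name mi_heart is a made-up placeholder, and the exact file names that Stata writes are best checked in help mi styles rather than taken from me. mi estimate works the same way in every style, so converting is only needed if you really want separate files.

                            Code:
                            webuse mheart5, clear
                            mi set mlong
                            mi register imputed age bmi
                            set seed 29390
                            mi impute mvn age bmi = attack smokes hsgrad female, add(10)
                            * mi_heart is a hypothetical name; each imputation is stored in its own file
                            mi convert flongsep mi_heart, clear
                            * confirm the style and the number of imputations
                            mi describe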

                            I answer very briefly here because I believe it is in your best interest to stick with your original question/post.



                            • #15
                              Originally posted by Carolina Hincapie
                              Andrew Musau

                              Dear Andrew,
                              I just tried to follow your steps, but I couldn't understand what this part does (the loop over consecutive values). What is it for?


                              forval i=1/10{
                              mi extract `i', clear
                              save mi_dataset`i'
                              restore, preserve
                              }
                              The imputed datasets are indexed 1 through \(m\), where \(m\) is the total number of imputations.


                              Is it always necessary to use the loop to save the new values? Why?
                              No, for some reason, the OP wanted to extract the imputed datasets. As Daniel notes, there does not appear to be any good reason to do this. Perhaps you can specify what you want to do and you will get some advice on that.

