
  • Is using polychoric correlations appropriate with my binary variables?

    Dear Stata Users,

    I have 19 variables that ask individuals whether or not they experienced a specific event (0 = no, 1 = yes). Examples of events include divorce, death of a loved one, injury...
    I am using these 19 variables as predictors of a health outcome. I want to reduce the number of predictors; usually I've used factor analysis to identify latent variables that I then use as predictors instead of the 19 separate variables.

    I realize that factor analysis is neither recommended nor appropriate for binary variables. In my searches I came across the following documentation from UCLA about polychoric correlations. I also read the Stata help file after installing the polychoric package by Stas Kolenikov.
    Based on this brief description of my data, do you think that using polychoric correlations is appropriate? Would it actually make sense to create, out of the correlations, latent variables that group predictors measuring similar life events? For example, one identified factor could be family life events, representing divorce, death, or serious sickness of a family member. Another identified factor could be resources, representing loss of a job, lack of transportation, or loss of health insurance. FYI: my life events are less intuitive to categorize into latent variables than the examples I provided.

    Any other suggestions are welcome.

    Thank you for your time,
    Patrick

  • #2
    If you cannot cleanly specify all of the latent factors a priori, then it's probably not a good idea to use confirmatory factor analysis. If your analysis is exploratory, then you could use the official Stata tetrachoric instead (all binary predictors), followed by factormat, and manually cull latent factors from a rotated factor matrix. But if it's exploratory, then I would skip factor analysis altogether, and just use all 19 predictors if possible, or forward selection (help stepwise), or LASSO (user-written on SSC: search lars), or something along those lines.
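
    For what it's worth, the tetrachoric-then-factormat workflow described above might look roughly like the sketch below. This is a sketch only, not code from the thread: ev1-ev19 are placeholder names for the 19 binary event indicators, and the number of retained factors is left to the default retention criterion.

    ```stata
    * Sketch of the exploratory workflow; ev1-ev19 are hypothetical variable names.
    tetrachoric ev1-ev19, posdef    // tetrachoric correlations among binary items
    matrix Rho = r(Rho)             // save the returned correlation matrix
    factormat Rho, n(`=_N')         // exploratory factor analysis on that matrix
    rotate, varimax                 // rotated loadings, to cull factors manually
    ```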



    • #3
      Thank you for your feedback, Joseph; it gives me some things to consider. My analysis is exploratory, and I plan to use latent variables to predict different trajectories. I want to use the same factor structure for 16 waves of data to predict different health outcomes at each wave.
      I think for this sentence
      But if it's exploratory, then I would skip factor analysis altogether,
      you meant to say
      But if it isn't exploratory, then I would skip factor analysis...
      Am I right?

      Thanks again
      Patrick



      • #4
        I would probably hesitate to use, as a first resort, exploratory factor analysis on a tetrachoric correlation matrix to pan for interpretable latent factors that have a stable structure over 16 waves of data. I would most likely just use the observed variables as-is, as factor-variable predictors in a suitable regression model.



        • #5
          Actually, maybe you should consider latent class analysis, rather than factor analysis.

          In factor analysis, the assumption is that there are continuous latent variables behind your indicators, and that each indicator loads differently on those latent variables. Now, if you were to run your factor analysis on the tetrachoric correlation matrix, the model would estimate, and you could then predict factor scores. So yes, this is technically possible. But I am not sure that you should do it, and clearly Joseph is voting no.

          In LCA, you assume a categorical (not necessarily ordinal) latent variable behind your indicators. One thing LCA can show is whether certain traumas group together, and how they group together. Unfortunately, LCA isn't available natively in Stata, but Penn State University's Methodology Center has a do-file written for LCA (why they don't make it an ado-file you can install, I don't know), and there is the gllamm command, whose syntax is much denser. One weakness of LCA, of course, is that because it assumes a categorical latent variable, you don't necessarily get a sense of how strong the latent trait is. You can subjectively decide, from examining which indicators load on each latent class, whether some classes are more severe than others.

          However, in any case, if you don't have a theoretical framework, then it may be just as well to include all 19 of your indicators as simple factor variables in a regression model, or even to create a sum of the 19 to start.
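          A minimal sketch of that sum-score alternative, assuming the 19 binary indicators and the outcome carry the placeholder names ev1-ev19 and health (neither is from the thread):

          ```stata
          * Sum-score sketch; ev1-ev19 and health are hypothetical variable names.
          egen byte nevents = rowtotal(ev1-ev19)   // count of events experienced
          regress health nevents                   // simple model using the count
          ```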
          Be aware that it can be very hard to answer a question without sample data. You can provide an example of your data with the dataex command; type help dataex at the command line.

          When presenting code or results, please use code delimiters to format them. Use the # button on the formatting toolbar, between the " (double quote) and <> buttons.



          • #6
            Hello Joseph and Weiwen,

            I really appreciate your comments. At this point I am leaning toward simplifying everything by creating a total sum score or using the 19 factor variables in a regression model.
            FYI: I attempted factor analysis on the polychoric and tetrachoric correlations, and as expected each wave of data yielded a different number of factors.


            Thanks again!



            • #7
              Originally posted by Patrick Abi Nader:
              FYI: I attempted the polychoric and tetrachoric correlations to conduct factor analysis and as expected each wave of data yielded a different number of factors.
              You can get an idea of what to expect just by creating a dataset with known parameters and seeing what results you get with factor analysis on the tetrachoric correlation matrixes of the questionnaire's items over waves of data collection. In the example below, I've created a fictitious dataset with four latent factors that I have made to be invariant over several waves of data collection. Each latent factor has four indicator variables. Each indicator variable is first constructed as a normally distributed latent variable that underlies a corresponding binary manifest variable. I have made the binary manifest variables so that there is no systematic leptokurtosis or skew in them.

              When you perform exploratory factor analysis using the default options for estimation method and factor retention, even on the unobserved multivariate normal indicator variables (that underlie the binary survey responses), you get a varying number of retained factors across the waves of data collection, typically with more retained factors than are known to be—i.e., were deliberately made to have been—present. With this ideal fictional dataset with a known number of latent factors, you can inspect the rotated factor matrix and see that a couple of the extra retained factors don't have much loading, but, with a real dataset, if you don't know how many factors are supposed to be present and which indicator variables are supposed to load together, then it would be a challenge to make sense of these retained factors, let alone to assess their roles in predicting different trajectories in the outcome variable.

              It gets worse when performing exploratory factor analysis on the tetrachoric correlation matrixes of the dichotomized manifest variables: nonpositive-definite matrixes that have to be fixed up, many more retained latent factors than are present, more variation in the number of retained latent factors, Heywood cases. And this with just four waves of data collection and not 16 as in your case. You can see why I would shy away from this as a first-line approach.

              With the binary responses on the survey instrument's items used as factor-variable predictors in a longitudinal regression model, in an exploratory analysis, you can filter on, say, p-values of the questionnaire item × wave interaction term or some other criterion or set of criteria as judged appropriate for assessing the item's ability to predict different trajectories. If desired, you can then, in the context of exploratory analysis, take a closer look for patterns in the nature of the questionnaire's items that shake out.

. version 14.2

. 
. clear *

. set more off

. set seed 1374263

. 
. // Participants
. quietly set obs 250

. generate int pid = _n

. generate double u = rnormal()

. 
. quietly expand 4

. bysort pid: generate byte tim = _n

. sort pid tim

. 
. // Instrument items
. tempname On Off Corr

. matrix define `On' = J(4, 4, 0.75) + I(4) * 0.25

. matrix define `Off' = J(4, 4, 0.25)

. matrix define `Corr' = ( ///
>         `On', `Off', `Off', `Off' \ ///
>         `Off', `On', `Off', `Off' \ ///
>         `Off', `Off', `On', `Off' \ ///
>         `Off', `Off', `Off', `On')

. 
. forvalues i = 1/16 {
  2.         local predictor_list `predictor_list' lat`i'
  3. }

. 
. quietly drawnorm `predictor_list', double corr(`Corr')

. forvalues i = 1/16 {
  2.         generate byte man`i' = lat`i' > 0
  3. }

. 
. // Outcome
. local predictor_list

. forvalues i = 1/16 {
  2.         local predictor_list `predictor_list' 0 * man`i' + `i' * man`i' / 96 * tim +
  3. }

. 
. generate double out = u + tim / 8 + `predictor_list' rnormal()

. 
. *
. * Regression model with manifest variables as factor-variable predictors
. *
. quietly xtreg out i.man*##i.tim, i(pid) re // regression table is worth inspecting in its own right

. // Exploration
. forvalues i = 1/16 {
  2.         quietly testparm man`i'#tim
  3.         display in smcl as text "Instrument item " as result `i' ///
>                 " as predicting trajectory, p = " as result %04.2f r(p)
  4. }
Instrument item 1 as predicting trajectory, p = 0.39
Instrument item 2 as predicting trajectory, p = 0.51
Instrument item 3 as predicting trajectory, p = 0.66
Instrument item 4 as predicting trajectory, p = 0.02
Instrument item 5 as predicting trajectory, p = 1.00
Instrument item 6 as predicting trajectory, p = 0.66
Instrument item 7 as predicting trajectory, p = 0.48
Instrument item 8 as predicting trajectory, p = 0.16
Instrument item 9 as predicting trajectory, p = 0.22
Instrument item 10 as predicting trajectory, p = 0.10
Instrument item 11 as predicting trajectory, p = 0.23
Instrument item 12 as predicting trajectory, p = 0.26
Instrument item 13 as predicting trajectory, p = 0.08
Instrument item 14 as predicting trajectory, p = 0.08
Instrument item 15 as predicting trajectory, p = 0.02
Instrument item 16 as predicting trajectory, p = 0.03

. 
. *
. * Panning for gold nuggets
. *
. // EFA on actual latent standard-normal variables behind binary manifest variables
. forvalues tim = 1/4 {
  2.         quietly factor lat* if tim == `tim'
  3.         display in smcl as text "Wave = `tim'", "Factors retained = " e(f)
  4. }
Wave = 1 Factors retained = 5
Wave = 2 Factors retained = 6
Wave = 3 Factors retained = 6
Wave = 4 Factors retained = 5

. 
. // EFA on tetrachoric correlation matrix of binary manifest variables
. forvalues tim = 1/4 {
  2.         quietly tetrachoric man* if tim == `tim', posdef
  3.         quietly factormat r(Rho), n(250)
  4.         display in smcl as text "Wave = `tim'", "Factors retained = " e(f), e(heywood)
  5. }
Wave = 1 Factors retained = 14 Heywood case
Wave = 2 Factors retained = 15 Heywood case
Wave = 3 Factors retained = 15 Heywood case
Wave = 4 Factors retained = 11 .

. 
. exit

end of do-file



              • #8
                Hello Joseph,

                Thank you for providing a very comprehensive example explaining the limitations of conducting factor analysis on binary variables across multiple waves of data.
                I appreciate your time.

                Patrick



                • #9
                  Hello Joseph,

                  I hope this message finds you well!

                  Do you have any references that I can read and possibly cite on why factor analysis based on polychoric and tetrachoric correlations does not work well with binary variables, especially when we expect a similar factor structure across multiple time points?

                  Thank you for your time

                  Patrick

