How to interpret regression coefficients after pca with dummy variables?

Berend Nijhuis

Join Date: Mar 2018

Posts: 17
#1

26 Sep 2018, 06:36

Dear community,

In my research I've performed a principal component analysis on several independent variables. All of these independent variables are dummy variables (i.e. they take the value 0 or 1). The outcome of the analysis was that eight variables (about process quality) could be reduced to two components that together explain 85% of the variance. However, when I use these two components in a linear regression, how should I interpret them? Is it still right to say that when people rate the process quality as bad (coded as 1 in the dummy), the effect on the dependent variable is xxx percentage points compared with when people rate the process quality as good?

Thanks in advance.

Kind regards,

Berend
Bruce Weaver

Join Date: May 2014

Posts: 1119
#2

26 Sep 2018, 07:12

Hello Berend. As I read your post, I can't help but wonder if this is an example of the so-called XY problem (see the last line of point 9 in the FAQ). In other words, I think it would help if you provided more info about your data and the questions you are hoping to address. For example:
What is your outcome variable?

What are the explanatory variables?

What is the sample size?

Is it an experiment, or an observational study?

What are the main questions you want to address?

Also, when I hear people talk about doing PCA prior to regression, I am always reminded of this article by Hadi & Ling (1998).

HTH.

--
Bruce Weaver
Version: Stata/MP 18.5 (Windows)
Berend Nijhuis

Join Date: Mar 2018

Posts: 17
#3

26 Sep 2018, 07:27

Dear Bruce,

I'm sorry for not providing more information about my topic.
My research is about explanations for cost underruns and overruns in building construction. In the regression, the dependent variable is the percentage profit or loss on the building costs. This is obtained with the formula (budgeted costs - realized costs) / (budgeted costs).
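
To make the definition concrete, here is a minimal Python sketch (not Stata) of that calculation. The function name and the project figures are made up for illustration only; they are not taken from the data.

```python
def relative_result(budgeted_costs: float, realized_costs: float) -> float:
    """Profit (+) or loss (-) as a percentage of the budgeted costs."""
    if budgeted_costs <= 0:
        raise ValueError("budgeted_costs must be positive")
    return 100.0 * (budgeted_costs - realized_costs) / budgeted_costs


# Hypothetical projects, not real data.
underrun = relative_result(1_000_000, 950_000)    # costs below budget: positive value
overrun = relative_result(1_000_000, 1_100_000)   # costs above budget: negative value
```

With this sign convention a cost underrun shows up as a positive percentage and a cost overrun as a negative one.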

The independent variables are obtained via a survey. One question was about process quality and consisted of 8 questions where the participants could answer whether the process quality on the eight different topics was bad or good (so 0 or 1).

Since my sample consists of only 87 projects, these 8 variables need to be reduced to fewer variables because of the limited degrees of freedom. Therefore I conducted a PCA, which led to the conclusion that the 8 variables could be reduced to 2 components.

The study is based on real project data from a company that has only 87 projects over the past 5 years. The interest of the study is whether for example the process quality could explain the variation in the relative profit or loss on the project.

Both components of process quality are significant in the regression. However, I don't know how I should interpret them. Is it valid to say that when people rate the process quality as good, this leads to an xxx percentage point increase in the profit on the project?

I hope this post clarifies my research a bit.

With kind regards,

Berend Nijhuis
Bruce Weaver

Join Date: May 2014

Posts: 1119
#4

26 Sep 2018, 07:49

Thanks Berend, that is helpful. What are the 8 topics? Would it make sense to combine them into a single measure, such as the mean (i.e., the proportion of topics flagged as good)? Or are there some topics that are of specific interest?

--
Bruce Weaver
Version: Stata/MP 18.5 (Windows)
Berend Nijhuis

Join Date: Mar 2018

Posts: 17
#5

26 Sep 2018, 07:54

Dear Bruce,

The respondents had to rate the process quality on eight different aspects. These were the eight topics:
- Goal of the project
- Achievable planning
- Quality of the project
- Financial result of the project
- Satisfaction during the project
- Customer satisfaction
- Communication between different parties
- Honor agreements

The principal component analysis suggested using 2 components instead of 1, since with only one component too much of the explained variance would be lost. Therefore I think it wouldn't be a good idea to combine them into a single measure.

With kind regards,

Berend Nijhuis
Bruce Weaver

Join Date: May 2014

Posts: 1119
#6

26 Sep 2018, 08:24

Earlier, you said:

One question was about process quality and consisted of 8 questions where the participants could answer whether the process quality on the eight different topics was bad or good (so 0 or 1).

Does this mean there are other explanatory variables you want to include in your model? If not, what is wrong with a model that includes all 8 dichotomous variables? I would be the first to recognize that the so-called "rule of 10" does not always work for linear regression. But in some circumstances, 10 observations per explanatory variable can yield a fairly decent model. And with n = 87, you've got nearly 10 observations per variable (if these 8 variables are the only ones in the model).

Did the questions on the survey ask people to choose between bad and good (0 vs 1)? Or were there more response options, with the variables being dichotomized later?

HTH.

--
Bruce Weaver
Version: Stata/MP 18.5 (Windows)
Berend Nijhuis

Join Date: Mar 2018

Posts: 17
#7

26 Sep 2018, 08:35

Dear Bruce,

Yes, the whole survey consisted of 26 questions about 4 topics (complexity, process quality, delay, and changes). Furthermore, there are 5 explanatory variables obtained from the company's data. Together this would yield a model with 31 variables, which is too many for only 87 observations. According to the PCA, the questions on complexity, delay, and changes could each be reduced to a single component, but for process quality two components are needed.

The questions on the survey asked the respondents to choose on a 5-point Likert scale (very bad, bad, neutral, good, very good). The results were dichotomized into bad (very bad and bad) and good (neutral, good, very good), since the interest is mainly in the effect of a negative rating on the different topics.
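
The recoding described above can be sketched as follows. This is a Python illustration, not the Stata code actually used, and it assumes the five answers are stored as the integers 1 to 5 and that "good" is coded as 1 (as in #3); both the storage format and the names are assumptions.

```python
# Assumed numeric coding of the five answer options.
LIKERT_LABELS = {1: "very bad", 2: "bad", 3: "neutral", 4: "good", 5: "very good"}


def dichotomize(score: int) -> int:
    """Map a 5-point rating to 0 (bad: 1-2) or 1 (good: 3-5)."""
    if score not in LIKERT_LABELS:
        raise ValueError(f"unexpected Likert score: {score!r}")
    return 0 if score <= 2 else 1


# Hypothetical ratings for the eight topics of one project.
ratings = [1, 2, 3, 4, 5, 3, 2, 4]
dummies = [dichotomize(r) for r in ratings]
```

Note that the dummy no longer distinguishes "neutral" from "very good", or "very bad" from "bad".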

In total, I have 5 components in my regression and 5 other variables so that there are nearly 10 observations per variable. However, I don't know how I should interpret the 5 components.

With kind regards,

Berend Nijhuis
Bruce Weaver

Join Date: May 2014

Posts: 1119
#8

26 Sep 2018, 15:51

Different strokes for different folks, perhaps. But if I just wanted a general measure of process quality*, I would use the original 5-point scores, compute a mean across the 8 topics (assuming no items require reverse-coding), and use that mean as the process quality variable in my model. I'd then use -margins- and -marginsplot- to plot the relationship between process quality and Y. Having one general measure of process quality would free up degrees of freedom, allowing you to include some of those other variables you mentioned. (But of course, if you have specific interest in one or more of the 8 topics individually, this won't fly.)
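
As a rough illustration of the mean-score idea, here is a small Python sketch (not Stata, and not a substitute for -margins-). It assumes the eight items are stored as integers 1 to 5, and it fits a one-predictor least-squares line only to keep the example short; a real model would include the other explanatory variables. All names and numbers are hypothetical.

```python
from statistics import mean


def process_quality(item_scores):
    """Mean of the eight 5-point item scores for one project."""
    if len(item_scores) != 8:
        raise ValueError("expected exactly 8 item scores")
    return mean(item_scores)


def ols_slope_intercept(x, y):
    """Least-squares slope and intercept for a single predictor."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx


# Hypothetical item scores for three projects and their relative results (%).
projects = [
    [2, 2, 3, 1, 2, 2, 3, 1],
    [3, 4, 3, 3, 4, 3, 3, 3],
    [5, 4, 5, 4, 4, 5, 4, 5],
]
quality = [process_quality(p) for p in projects]
result_pct = [-6.0, 1.0, 4.0]
slope, intercept = ols_slope_intercept(quality, result_pct)
```

Here the slope reads directly as the change in the relative result (in percentage points) per one-point increase in mean process quality, which is the kind of statement a component score does not give you so easily.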

For reasons discussed in the Hadi & Ling article, and because interpretation is well nigh impossible, I would eschew entirely use of PCA.
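
To illustrate why the interpretation is awkward, here is a rough Python sketch. It assumes the component scores are plain weighted sums of the standardized items (a PCA on the correlation matrix, with scores not rescaled afterwards); under that assumption, switching one dummy from 0 to 1 while holding the other seven fixed moves every component at once. The coefficients, weights, and standard deviation below are invented for the example.

```python
def implied_item_effect(betas, weights_for_item, item_sd):
    """Change in predicted Y when one dummy goes from 0 to 1.

    betas:            regression coefficients of the components
    weights_for_item: this item's scoring weight on each component
    item_sd:          sample standard deviation of the dummy
    Assumes score_k = sum_j weight_kj * (x_j - mean_j) / sd_j.
    """
    if len(betas) != len(weights_for_item):
        raise ValueError("need one weight per component")
    if item_sd <= 0:
        raise ValueError("item_sd must be positive")
    return sum(b * w for b, w in zip(betas, weights_for_item)) / item_sd


# Hypothetical values for a single item and two components.
effect = implied_item_effect(betas=[2.0, -1.0],
                             weights_for_item=[0.4, 0.3],
                             item_sd=0.5)   # about 1.0 percentage point
```

So a coefficient on a component is not "the effect of rating process quality as good"; it is a blend that only becomes a per-item statement after combining it with that item's weights on all retained components.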

* What you said in #3 suggested that maybe you do want a general measure of process quality.

The interest of the study is whether for example the process quality could explain the variation in the relative profit or loss on the project.

HTH.

--
Bruce Weaver
Version: Stata/MP 18.5 (Windows)
