Coding dummy variables

Denila Jinny

Join Date: Jun 2014

Posts: 25
#1

13 Jul 2018, 09:57

Hi, I have a categorical variable "Source of Funding". It has 4 categories: 1. Institutional Sources (IS) 2. Non-Institutional Sources (NIS) 3. Both IS and NIS 4. No funding. How do I code this as dummy variables? Generally, I can have 3 dummy variables: one for IS, one for NIS and another for both. My question is whether I can have just 2 dummies (one for IS and one for NIS), both coded 1 when the firm is funded by both IS and NIS. If yes, how do I interpret the results from a regression model?
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17671
#2

13 Jul 2018, 11:03

Denila:
you may want to try something along the following lines:

Code:

set obs 10
g Source_of_Funding=1 in 1/3
replace Source_of_Funding=2 in 4/6
replace Source_of_Funding=3 in 7/8
replace Source_of_Funding=4 in 9/10
label define Source_of_Funding 1 "Institutional Sources (IS)" 2 "Non-Institutional Sources (NIS)" 3 "Both IS and NIS" 4 "No funding"
label val Source_of_Funding Source_of_Funding
tab Source_of_Funding

You can group levels of a categorical variable provided that doing so makes sense in your research field and/or that a given level contains so few observations that it becomes practically immaterial.
As far as the recoded categorical predictor is concerned, its contribution to explaining the variation in the conditional mean of the dependent variable changes accordingly (you do not provide further details, so this is unavoidably general advice).

Kind regards,
Carlo
(StataNow 18.5)
Bruce Weaver

Join Date: May 2014

Posts: 1116
#3

13 Jul 2018, 11:12

Hello Denila. I see you're a new member of Statalist, so it may be that you are also a (relatively) new user of Stata. If so, I wonder if you have read about factor variables yet. I suspect that whatever it is you want to do with your set of indicator variables can be achieved more efficiently via factor variable notation. To get started, type the following in the Command window and hit Enter:

Code:

help fvvarlist

Pay attention to the i. prefix and the ib#. prefix, which can be used to specify a base level (a.k.a. a reference category).

Here's a simple (and silly) example using the auto dataset that comes with Stata.

Code:

clear *
sysuse auto
tab rep78
regress mpg i.rep78     // 1st level as referent (the default)
regress mpg ib3.rep78   // 3rd level as the referent
regress mpg ib5.rep78   // 5th level as the referent

HTH.

--
Bruce Weaver
Version: Stata/MP 18.5 (Windows)
Denila Jinny

Join Date: Jun 2014

Posts: 25
#4

16 Jul 2018, 00:41

Dear Bruce, thank you for your reply. I am aware of this option. Let me try to explain my predicament more clearly.
The data, if coded as per your suggestion, will have 3 dummy variables, say D1 for IS, D2 for NIS and D3 for Both, with category 4 (no funding) as the base.
My question is: when a firm borrows from both sources (IS and NIS), can the dummy for IS and the dummy for NIS both be coded 1, leaving out D3?
So, there are 2 ways I can code this data. One is, when a firm borrows from both,
D1=0; D2=0; D3=1
The other is,
D1=1; D2=1
The second leaves out D3 completely. It still indicates that the firm has borrowed from both IS and NIS.

My concern is that I have been taught that the number of dummy variables should be the number of categories minus 1. So, I should actually have 3 dummy variables. But my PhD supervisor suggests that I should follow the second way of coding to keep it simple. My question is: is it technically right to do so, and if it is, how do I interpret the results?

I hope this makes things a little bit clearer.

Last edited by Denila Jinny; 16 Jul 2018, 00:47.
Bruce Weaver

Join Date: May 2014

Posts: 1116
#5

16 Jul 2018, 06:59

Hi Denila. Sorry for not reading #1 carefully enough the first time around. Given that IS and NIS are not mutually exclusive, I think you would need to include their interaction to account for the possibility of both occurring. E.g.,

Code:

* Mimic the IS & NIS data
clear *
sysuse auto
tab rep78
generate rep45 = rep78 > 3
rename (mpg foreign rep45) (Y IS NIS)
* Regression model with IS, NIS and their interaction
regress Y IS##NIS
* Use -margins- to show the cell means & simple main effects
margins IS#NIS
margins IS@NIS, contrast(nowald effects)
margins NIS@IS, contrast(nowald effects)
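
If the real data hold funding source as a single four-level variable, as in the toy data in #2, the two indicators could be derived along these lines. This is only a sketch: it assumes the variable name Source_of_Funding and the level codes 1 to 4 from #2, and the toy observations stand in for the actual data.

```stata
* Toy data with the four funding categories from #2
clear
set obs 10
generate byte Source_of_Funding = 1 in 1/3
replace Source_of_Funding = 2 in 4/6
replace Source_of_Funding = 3 in 7/8
replace Source_of_Funding = 4 in 9/10

* IS  = 1 for "IS only" (1) and "Both" (3)
* NIS = 1 for "NIS only" (2) and "Both" (3)
* The -if- qualifier keeps missing categories missing instead of coding them 0
generate byte IS  = inlist(Source_of_Funding, 1, 3) if !missing(Source_of_Funding)
generate byte NIS = inlist(Source_of_Funding, 2, 3) if !missing(Source_of_Funding)

* Check the coding against the original categories
list Source_of_Funding IS NIS, sepby(Source_of_Funding)
```

With IS and NIS coded this way, firms in the "Both" category have IS = 1 and NIS = 1, and the model with the interaction would be specified as in the example above.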

HTH.

--
Bruce Weaver
Version: Stata/MP 18.5 (Windows)
Denila Jinny

Join Date: Jun 2014

Posts: 25
#6

16 Jul 2018, 07:46

Hello Bruce, I have information on whether the firms have borrowed from both the sources. Should I still use interactions?
Bruce Weaver

Join Date: May 2014

Posts: 1116
#7

16 Jul 2018, 07:55

Hi Denila. Yes, you need to include the interaction to account for all 4 possibilities: neither, IS only, NIS only, both.
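
Written out for a generic outcome Y, the four possibilities correspond to four cell means. The symbols below are notation assumed here for illustration; they do not come from any fitted model in this thread.

```latex
\[
E[Y \mid \mathit{IS}, \mathit{NIS}]
  = \beta_0 + \beta_1\,\mathit{IS} + \beta_2\,\mathit{NIS}
    + \beta_3\,(\mathit{IS} \times \mathit{NIS})
\]
\[
\begin{array}{lcl}
\text{neither}  & (\mathit{IS}=0,\ \mathit{NIS}=0) & \beta_0 \\
\text{IS only}  & (\mathit{IS}=1,\ \mathit{NIS}=0) & \beta_0 + \beta_1 \\
\text{NIS only} & (\mathit{IS}=0,\ \mathit{NIS}=1) & \beta_0 + \beta_2 \\
\text{both}     & (\mathit{IS}=1,\ \mathit{NIS}=1) & \beta_0 + \beta_1 + \beta_2 + \beta_3
\end{array}
\]
```

If the interaction term is left out, the mean for firms that borrow from both sources is forced to equal \beta_0 + \beta_1 + \beta_2. In the three-dummy coding described in #4 (D1, D2 and D3, with no funding as the base), the coefficient on D3 would correspond to \beta_1 + \beta_2 + \beta_3.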

--
Bruce Weaver
Version: Stata/MP 18.5 (Windows)
Denila Jinny

Join Date: Jun 2014

Posts: 25
#8

16 Jul 2018, 08:28

Bruce, thank you for patiently replying to my queries.

I do not understand what difference using an interaction variable would make when I already have a separate dummy to indicate if the firm has borrowed from both the sources. Can you please elaborate? Thank you.

A glimpse of the data:

Firm ID   IS   NIS   BOTH
1         1    0     0
2         0    1     0
3         0    0     1
4         0    0     0

Firm 4 doesn't borrow from any external source; that category is the base.

Could you explain a bit more why you are suggesting that I need an interaction term? Thank you so much.

Bruce Weaver

Join Date: May 2014
Posts: 1116
#9

16 Jul 2018, 08:59

By including the interaction of the two dichotomous explanatory variables, you are able to separate the effects of the two and get 2*2 = 4 cell means. Did you try the example in #5? The first -margins- command displays the 4 cell means for that example. If you do not include the interaction, you will not separate the effects of the two dichotomous variables. Here is a cleaned-up version of the demo in #5 with some more comments added. Perhaps it will help to clarify things.

Code:

* Mimic the IS & NIS data
clear *
sysuse auto
tab rep78
* Compute new variables to eliminate labels
generate Y = mpg
generate byte IS = foreign
generate byte NIS = rep78 > 3 if !missing(rep78)  // Stata treats missing as > 3, so exclude it
tabulate IS NIS  // Show the n for each of the 4 cells

* The rest of the code mimics your analysis.

* Regression model with IS, NIS and their interaction
regress Y IS##NIS
* Use -margins- to show the cell means & simple main effects
margins IS#NIS
* This table shows the means of the 4 cells obtained by
* the factorial combination of the two dichotomous variables (IS & NIS).
margins IS@NIS, contrast(nowald effects)
* Compare "(1 vs base) 0" to "1.IS" in regression output
margins NIS@IS, contrast(nowald effects)
* Compare "(1 vs base) 0" to "1.NIS" in regression output
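
On the question in #8 about a separate BOTH dummy: with "neither" as the base, three indicators (IS only, NIS only, both) and the IS##NIS specification should fit the same four cell means and differ only in how the coefficients are parameterized. The sketch below checks this on the same mimic data. The variable `source` and the names `yhat_cat`, `yhat_int` and `b_both` are hypothetical names introduced for this illustration.

```stata
* Sketch: 3-dummy coding vs. IS##NIS coding on the mimic data
clear *
sysuse auto
generate Y = mpg
generate byte IS  = foreign
generate byte NIS = rep78 > 3 if !missing(rep78)   // keep missing rep78 as missing

* Hypothetical 4-level variable: 1 = IS only, 2 = NIS only, 3 = both, 4 = neither
generate byte source = 4 if IS == 0 & NIS == 0
replace source = 1 if IS == 1 & NIS == 0
replace source = 2 if IS == 0 & NIS == 1
replace source = 3 if IS == 1 & NIS == 1

* Coding 1: three indicators, "neither" as the base
regress Y ib4.source
predict double yhat_cat
scalar b_both = _b[3.source]

* Coding 2: two indicators plus their interaction
regress Y IS##NIS
predict double yhat_int

* The fitted values agree observation by observation
assert reldif(yhat_cat, yhat_int) < 1e-8 if !missing(source)

* The "both" coefficient equals the sum of the two main effects and the interaction
display b_both
display _b[1.IS] + _b[1.NIS] + _b[1.IS#1.NIS]
```

Coding "both" firms as IS = 1 and NIS = 1 without the interaction is a different, more restrictive model: it assumes the effect of borrowing from both sources is exactly the sum of the two separate effects.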

HTH.

--
Bruce Weaver
Version: Stata/MP 18.5 (Windows)


Denila Jinny

Join Date: Jun 2014

Posts: 25
#10

17 Jul 2018, 04:53

Hello Bruce, Thank you for patiently helping me out. I finally understand what you say. Thank you so much once again.