Create factor dummy variables from a long variable

Lola Nait

Join Date: Nov 2019

Posts: 12
#1


12 Nov 2019, 02:05

Hello, I have a variable called categories which, oddly, is stored as a numeric long (it was a string variable, and I used the encode command).
What I want to do is create four category dummy variables:
below30
30to60
60to90
90above
(with below30 as the reference category, of course).

How can I do it?


Carlo Lazzaro

Join Date: Apr 2014
Posts: 17600

12 Nov 2019, 02:11

Lola:
do you mean something along the following lines?

Code:

. set obs 10
number of observations (_N) was 0, now 10

. g cat=1 in 1/3
(7 missing values generated)

. replace cat=2 in 4/6
(3 real changes made)

. replace cat=3 in 7/8
(2 real changes made)

. replace cat=4 if cat==.
(2 real changes made)


. tab cat, gen(cat_dummies)

        cat |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |          3       30.00       30.00
          2 |          3       30.00       60.00
          3 |          2       20.00       80.00
          4 |          2       20.00      100.00
------------+-----------------------------------
      Total |         10      100.00

. list

     +-------------------------------------------------+
     | cat   cat_du~1   cat_du~2   cat_du~3   cat_du~4 |
     |-------------------------------------------------|
  1. |   1          1          0          0          0 |
  2. |   1          1          0          0          0 |
  3. |   1          1          0          0          0 |
  4. |   2          0          1          0          0 |
  5. |   2          0          1          0          0 |
     |-------------------------------------------------|
  6. |   2          0          1          0          0 |
  7. |   3          0          0          1          0 |
  8. |   3          0          0          1          0 |
  9. |   4          0          0          0          1 |
 10. |   4          0          0          0          1 |
     +-------------------------------------------------+

.

Kind regards,
Carlo
(StataNow 18.5)


Lola Nait

Join Date: Nov 2019
Posts: 12

12 Nov 2019, 03:13


Hi Carlo, thank you very much for your reply!

I don't really understand what you did, but I have a table like this in a CSV file:

Country   gdpgrowth   Debttogdpratio
France    2.5         0-30%
Uk        1.4         30-60%
.....

In my file, debttogdpratio is stored as a string variable.

So I want a regression of the form reg gdpgrowth 30to60 60to90 above90, where 30to60, 60to90 and above90 are the categories of debttogdpratio.
I just don't know how to create these categories, given the nature of my variable debttogdpratio.


Nick Cox

Join Date: Mar 2014

Posts: 35209
#4

12 Nov 2019, 03:32

In #1 you showed a numeric variable called categories with four distinct values. So the implication of #2 from Carlo Lazzaro is that

Code:

tab categories, gen(categories)

will generate 4 new indicator variables (you say dummy variables).

Doing the same on the string variable debttogdpratio is a bad idea as this output shows. I see no (consistent) examples of this variable in posts to date but tabulate will use the alphanumeric sort order of the distinct strings, and it's likely that new variables will be named out of order:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str8 whatever
"below 30"
"30-60"
"60-90"
"above 90"
end

tab whatever, gen(whatever)

   whatever |      Freq.     Percent        Cum.
------------+-----------------------------------
      30-60 |          1       25.00       25.00
      60-90 |          1       25.00       50.00
   above 90 |          1       25.00       75.00
   below 30 |          1       25.00      100.00
------------+-----------------------------------
      Total |          4      100.00

list

     +------------------------------------------------------+
     | whatever   whatev~1   whatev~2   whatev~3   whatev~4 |
     |------------------------------------------------------|
  1. | below 30          0          0          0          1 |
  2. |    30-60          1          0          0          0 |
  3. |    60-90          0          1          0          0 |
  4. | above 90          0          0          1          0 |
     +------------------------------------------------------+
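If the string categories have to be used, one way around the alphabetical ordering is to define the value label before calling encode, so that the numeric codes follow the intended order. This is a minimal sketch, assuming the four category strings from the example above; the names ratio_order and ratio_cat are made up for illustration.

```stata
clear
input str8 whatever
"below 30"
"30-60"
"60-90"
"above 90"
end

* Define the value label first; encode then reuses these codes
* instead of assigning codes in alphanumeric order.
label define ratio_order 1 "below 30" 2 "30-60" 3 "60-90" 4 "above 90"
encode whatever, generate(ratio_cat) label(ratio_order)

* Rows now appear in the intended order
tabulate ratio_cat
tabulate ratio_cat, nolabel
```

The resulting numeric variable can then be used with factor-variable notation, which strings cannot.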

That said,

1. Unless you have a very old version of Stata, in which case you should be telling us about it, use factor variable notation, not explicit indicator variables.

2. Unless you have no such information, the precise debt/gdp ratio is better used as a single predictor. Why degrade it arbitrarily to four classes? (If need be, use its logarithm.) Also, the class boundaries are ambiguous. Which way would 60% jump?

You'd get faster answers as you want them with

* more careful proof-reading of your posts, which are messy (e.g. lacking in punctuation)

* explicit data examples using dataex, as we request in the FAQ Advice.

P.S. This is the 5th thread you've started since 7 November. You have yet to close any of the previous four threads.

Last edited by Nick Cox; 12 Nov 2019, 03:37.
Lola Nait

Join Date: Nov 2019

Posts: 12
#5

12 Nov 2019, 04:26


Hi Nick.

Thank you for your reply.

Basically the only thing I am trying to do is do a regression with categories as I want to get the coefficient on each category. By that I mean I want to see how having a debt-to-GDP ratio between 0-30 affects GDP growth, how having a debt-to-GDP ratio between 30-60 affects GDP growth, and so on.
I don't know what factor variable notation is.

(I will close my old posts; I didn't know I had to do it.)
Marcos Almeida

Join Date: Apr 2014

Posts: 4047
#6

12 Nov 2019, 04:51

Basically the only thing I am trying to do is do a regression with categories as I want to get the coefficient on each category.

Maybe you wish something like:

Code:

sysuse auto
regress mpg weight i.rep78 foreign
margins rep78
marginsplot

Hopefully that helps

Best regards,

Marcos
Nick Cox

Join Date: Mar 2014

Posts: 35209
#7

12 Nov 2019, 06:03

#5

Basically the only thing I am trying to do is do a regression with categories as I want to get the coefficient on each category. By that I mean I want to see how having a debt-to-GDP ratio between 0-30 affects GDP growth, how having a debt-to-GDP ratio between 30-60 affects GDP growth, and so on.
I don't know what factor variable notation is.

That's evident. I am suggesting that there are better ways to use debt/GDP ratio as a predictor. It's not born categorical.

Code:

. search factor variable

leads to documentation.
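For readers new to the notation, here is a minimal sketch using the auto dataset that ships with Stata (the same one used in #6). The i. prefix tells Stata to treat a numeric variable as categorical and build the indicators itself, and ib#. picks which level is the reference. The scalar name r2_default is made up for illustration.

```stata
sysuse auto, clear

* i.rep78 expands into one indicator per level, omitting the
* lowest level as the reference category by default.
regress mpg i.rep78
scalar r2_default = e(r2)

* ib3.rep78 makes level 3 the reference instead. The coefficients
* change, because they are now contrasts with level 3, but the
* overall fit does not.
regress mpg ib3.rep78
display "R-squared, default base: " r2_default
display "R-squared, base level 3: " e(r2)
```

No new variables are created in the dataset, which is one reason this is usually preferred over explicit indicator variables.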

(I will close my old posts; I didn't know I had to do it.)

It's not compulsory, but it is a really good idea to tell people what worked (or didn't work) and to give thanks. Again, we do explain in the FAQ Advice:

16.1 Close by giving a summary and thanks

Trying to wrap up a thread you started is helpful, especially if you report what solved your problem. You can then thank those who tried to help. Conversely, ignoring answers is less sociable, even if those answers did not solve your problem. "Thanks in advance" does not absolve you from either expectation.

Please note that a Like on a post is not publicly visible as coming from you and, while friendly, also does not absolve you from either expectation.
Lola Nait

Join Date: Nov 2019

Posts: 12
#8

12 Nov 2019, 12:59


Hi Marcos,
when I try to do that I get the error message: 'string variables may not be used as factor variables'
Nick Cox

Join Date: Mar 2014

Posts: 35209
#9

12 Nov 2019, 13:08

You are not giving us the exact code you used, but we can guess that you used the string variable debttogdpratio.

So, use categories instead. You've already told us -- and more crucially Stata has told you -- that it is numeric. The information is there in #1.
Marcos Almeida

Join Date: Apr 2014

Posts: 4047
#10

12 Nov 2019, 13:15

If, as you mentioned in #5, you wish to perform a regression analysis "with categories", you shan't use string variables in the command.

Best regards,

Marcos

Carlo Lazzaro

Join Date: Apr 2014
Posts: 17600

#11

12 Nov 2019, 13:20

Lola:
string variables cannot be included in -regress-.
While I share Nick Cox's wise comment that the debt/GDP ratio was not born categorical (#7), if you still want to stick with the categorical flavour, you can do something along the following lines:

Code:

.  set obs 10
. g cat=1 in 1/3
. replace cat=2 in 4/6
. replace cat=3 in 7/8
. replace cat=4 if cat==.
. tab cat, gen(cat_dummies)

        cat |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |          3       30.00       30.00
          2 |          3       30.00       60.00
          3 |          2       20.00       80.00
          4 |          2       20.00      100.00
------------+-----------------------------------
      Total |         10      100.00

. g y=runiform()*1000

. reg y i.cat_dummies1 1.cat_dummies2 i.cat_dummies3

      Source |       SS           df       MS      Number of obs   =        10
-------------+----------------------------------   F(3, 6)         =      1.49
       Model |  345909.264         3  115303.088   Prob > F        =    0.3105
    Residual |  465715.256         6  77619.2094   R-squared       =    0.4262
-------------+----------------------------------   Adj R-squared   =    0.1393
       Total |   811624.52         9  90180.5022   Root MSE        =     278.6

--------------------------------------------------------------------------------
             y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
---------------+----------------------------------------------------------------
1.cat_dummies1 |  -464.7458   254.3279    -1.83   0.117    -1087.064    157.5721
1.cat_dummies2 |  -299.4321   254.3279    -1.18   0.284      -921.75    322.8858
1.cat_dummies3 |  -518.3106   278.6022    -1.86   0.112    -1200.026    163.4046
         _cons |   715.5471   197.0015     3.63   0.011     233.5017    1197.592
--------------------------------------------------------------------------------

Just an aside: you cannot include all 4 indicator variables alongside the constant (the dummy variable trap; see https://en.wikipedia.org/wiki/Dummy_...le_(statistics)).
If you do, Stata will omit one of them because of perfect collinearity:

Code:

. reg y i.cat_dummies1 1.cat_dummies2 i.cat_dummies3 i.cat_dummies4
note: 1.cat_dummies4 omitted because of collinearity

      Source |       SS           df       MS      Number of obs   =        10
-------------+----------------------------------   F(3, 6)         =      1.49
       Model |  345909.264         3  115303.088   Prob > F        =    0.3105
    Residual |  465715.256         6  77619.2094   R-squared       =    0.4262
-------------+----------------------------------   Adj R-squared   =    0.1393
       Total |   811624.52         9  90180.5022   Root MSE        =     278.6

--------------------------------------------------------------------------------
             y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
---------------+----------------------------------------------------------------
1.cat_dummies1 |  -464.7458   254.3279    -1.83   0.117    -1087.064    157.5721
1.cat_dummies2 |  -299.4321   254.3279    -1.18   0.284      -921.75    322.8858
1.cat_dummies3 |  -518.3106   278.6022    -1.86   0.112    -1200.026    163.4046
1.cat_dummies4 |          0  (omitted)
         _cons |   715.5471   197.0015     3.63   0.011     233.5017    1197.592
--------------------------------------------------------------------------------

.
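In the regressions above, leaving out cat_dummies4 makes category 4 the reference. To make category 1 (below30 in the original question) the reference instead, either include indicators 2 to 4 by hand or use ib1.cat. The sketch below, which assumes the same toy data with an arbitrary seed and a made-up scalar name b2_manual, suggests the two routes give the same coefficients.

```stata
clear
set seed 2019
set obs 10
generate cat = 1 in 1/3
replace cat = 2 in 4/6
replace cat = 3 in 7/8
replace cat = 4 if missing(cat)
generate y = runiform()*1000
quietly tabulate cat, generate(cat_dummies)

* Route 1: explicit indicators, leaving out category 1
regress y cat_dummies2 cat_dummies3 cat_dummies4
scalar b2_manual = _b[cat_dummies2]

* Route 2: factor-variable notation with level 1 as the base
regress y ib1.cat
display "manual: " b2_manual "   factor notation: " _b[2.cat]
```

Each coefficient is then the difference in mean y between that category and category 1.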

Kind regards,
Carlo
(StataNow 18.5)


julian mwanana

Join Date: Apr 2019

Posts: 17
#12

12 Nov 2019, 23:17

Greetings all,
Please help!
In my data analysis I have generated seven dummy variables. I omitted one dummy from the model (only six entered the regression), but the output indicates that five dummies are omitted due to collinearity. Why has this happened, and how can I fix it? The correlation matrix shows that all the dummy variables are significant.
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17600
#13

13 Nov 2019, 01:05

Julian:
without seeing what you typed and what Stata gave you back, it is virtually impossible to reply positively.
A tentative answer would be that multicollinearity is simply a matter of fact related to your dataset (as per the correlation matrix): all you can do is change your model specification, with no guarantee that the problem will be solved.

Kind regards,
Carlo
(StataNow 18.5)
julian mwanana

Join Date: Apr 2019

Posts: 17
#14

13 Nov 2019, 01:21

Thanks for your quick response. This is how I generated the dummy variables: tabulate industry, gen(inddummy), which gave me eight dummy variables. In the regression I included seven of them (i.e. inddummy1-inddummy7) using this command: xtreg aroa ln_TA tmtsize leverage inddummy1-inddummy7 agedivc edudiv tenuredivc av_tenure, fe
Output: inddummy1-inddummy7 omitted because of collinearity

Last edited by julian mwanana; 13 Nov 2019, 01:37.
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17600
#15

13 Nov 2019, 01:40

Julian:
the issue is not the (correct) way the dummies were generated, but that they have a hard time living together on the right-hand side of your regression.
As an aside, posters are kindly requested to tell interested listers if they use a community-contributed programme (like -asdoc-) (please see the FAQ). Thanks.
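One possible explanation, offered as an assumption because Julian's data are not shown: if industry does not change within a panel unit over time, the industry indicators are perfectly collinear with the fixed effects, and -xtreg, fe- will omit all of them. The sketch below uses simulated data with hypothetical variable names (firm, year, industry, x, y) to illustrate that behaviour.

```stata
clear
set seed 12345
set obs 40                            // 40 hypothetical firms
generate firm = _n
generate industry = mod(_n, 4) + 1    // time-invariant code, 1 to 4
expand 5                              // 5 yearly observations per firm
bysort firm: generate year = 2000 + _n
generate x = rnormal()
generate y = 1 + 0.5*x + industry + rnormal()
xtset firm year

* Pooled OLS can estimate the industry coefficients
regress y x i.industry

* The fixed-effects estimator removes everything constant within
* firm, so every industry indicator is omitted as collinear.
xtreg y x i.industry, fe
```

If that is what is happening, the omission is a property of the estimator and not something to fix; a random-effects model (-xtreg, re-) can estimate such coefficients, under stronger assumptions.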

Kind regards,
Carlo
(StataNow 18.5)