Question regarding turning numerical variables into factor variables

Luke Schreuder

Join Date: Mar 2022

Posts: 2
#1

06 Mar 2022, 08:43

Dear all,

I am new to this forum, so do let me know if I'm not making use of its users' help and expertise in the right way.
I am trying to make a regression analysis of several variables on CEO's salary across 10 years of panel data.

One of the control variables will be the industry the CEO is active in, indicated by the SIC description.
I have tried to simply include it by using the following code:

encode SICDescription, gen(nSICDescription)
reg SalaryCEO variable1 variable2 variable3 i.nSICDescription

Please let me know first of all if this is a legitimate way to use the description as a control variable and secondly, as there are 354 different industries recognised across 2412 companies, would you recommend consolidating into bigger groups and is there a statistical way to organise/explain this?

Another control variable I want to use for predicting the salary is the year in which it was earned. This is a 'continuous' variable ranging from 2010 to 2020, but I assume it has to be used as a categorical variable too as the numerical value of the year does not provide any information in itself. Would the following syntax be a good way to go about using it?

tostring Year, gen(stYear)
encode stYear, gen(nstYear)
reg SalaryCEO variable1 variable2 variable3 i.nSICDescription i.Year

The results do not seem unexpected, but the syntax seems irregular and inefficient, to say the least.

Final question regarding this subject.
The profit/loss variable obviously cannot be normalised using its logged values, as it contains negative values. Would it be correct to generate a separate loss variable, move all negative values of the original variable into it as positive entries, then take the logarithm of both variables and use the new logged values in the regression?

Thank you very much in advance for helping me understand these issues and, as mentioned before, do let me know if I should be using this board in a different way!

Best,

Luke

Last edited by Luke Schreuder; 06 Mar 2022, 09:04. Reason: Factor variables
Tags: categorical, factor variables, regression, syntax, variable transformation
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17708
#2

06 Mar 2022, 09:22

Luke:
welcome to this forum.
1) if you have panel data with a continuous regressand, -regress- should not be your first choice. See -xtreg- instead;
2)

Code:

encode SICDescription, gen(nSICDescription)

is correct if your original variable was in -string- format. In addition, see the cautionary note on -encode- in its entry in the Stata .pdf manual;
3) if -Year- is already numeric, just go -i.Year-;
4) normality is a weak requirement for the residual distribution only. I would stick with your -profit_loss- in its original metric.

Kind regards,
Carlo
(Stata 19.0)
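
For readers following points 1) to 3), a minimal sketch of how the pieces might fit together is shown below. It assumes a numeric firm identifier, here called CompanyID, which is hypothetical and not named in the thread. The choice of random effects is only illustrative: with -fe-, any industry indicator that never changes within a firm would be omitted as collinear with the firm effects.

```stata
* Sketch only; CompanyID is a hypothetical numeric firm identifier.
* encode is appropriate here because SICDescription is a string variable.
encode SICDescription, gen(nSICDescription)

* Declare the panel structure: firm as panel unit, Year as time variable.
xtset CompanyID Year

* Random-effects panel regression; the i. prefix marks categorical regressors.
* Standard errors clustered on firm (an illustrative choice, not from the thread).
xtreg SalaryCEO variable1 variable2 variable3 i.nSICDescription i.Year, re vce(cluster CompanyID)
```

Whether -re- or -fe- suits the data is a modelling decision the thread does not settle; the sketch only shows where each piece of syntax goes.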
Luke Schreuder

Join Date: Mar 2022

Posts: 2
#3

06 Mar 2022, 09:44

Dear Carlo,

Thank you for your quick and concise answer. I am quite unfamiliar with using -xtreg-, so I will make sure to read into the manual (as well as revise the encode manual as you advise).
As I understand from your answer, Stata will no longer assign meaning to the numerical value of the observations if the i. prefix is used in the regression command, so I'll skip the tostring transformation on the Year variable.

I'll continue working on the regression with this new information and return to the forum once I have reached my desired outcome or additional questions arise!

Kind regards,
Luke
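
The difference between treating Year as categorical and as continuous can be sketched as follows. This is an illustration under the thread's assumption that Year is already stored as a numeric variable; the variable names are those used in the original post.

```stata
* Year is assumed to be numeric already, as discussed in the thread.
* i.Year enters one indicator per year (relative to a base year),
* so no tostring/encode round trip is needed.
regress SalaryCEO variable1 variable2 variable3 i.nSICDescription i.Year

* Joint test that the coefficients on the year indicators are all zero.
testparm i.Year

* By contrast, c.Year treats the year as a continuous regressor (a linear trend).
regress SalaryCEO variable1 variable2 variable3 i.nSICDescription c.Year
```

Note that running -tostring- followed by -encode- on Year would produce integer codes starting at 1 with the years attached as value labels, which is why the round trip adds nothing over -i.Year-.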