How to recode variables that have greater or less than values?

Ariella Meltzer

Join Date: Jul 2023

Posts: 1
#1

03 Jul 2023, 11:39

Hi everyone. This is my first time posting on this forum, so my apologies if I miss anything or post incorrectly. I am trying to create a compliance variable to measure whether class sizes in an urban city comply with a new class size mandate. The variable "MaximumClassSize" has many values ranging from 1 to 34, but it also has values listed as "<15", ">15", "<34", etc. How should I recode these while still keeping it as a non-categorical variable?

So far, I've done this:

destring MaximumClassSize, generate(MaximumClassSizenum) force
list MaximumClassSize MaximumClassSizenum if missing(MaximumClassSizenum)
destring MaximumClassSize, replace ignore("<15"|"<6"|">34"|">15")

What should I do from here? I want to retain these less-than and greater-than values, but I don't know how.

Last edited by Ariella Meltzer; 03 Jul 2023, 11:43.
Tags: beginner, recode, syntax
Daniel Schaefer

Join Date: Mar 2020

Posts: 808
#2

03 Jul 2023, 12:07

Hi Ariella, welcome to the forum. It would be helpful if you could provide the error along with the commands that you've already tried. Could you also provide a data example with the -dataex- command? This should make answering your question much easier. Assuming these are string values (and we aren't looking at value labels) it seems like you will need to change instances with extra non-numeric characters like "<34" to a single numerical value, or you will have to assign those strings the missing value in your new column. Both of these options have drawbacks, and I wonder what you think would be most appropriate for your data. For example, would it be appropriate to change "<34" to "34"? Or would "33" be more appropriate?
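
As a minimal sketch of how one might see exactly which entries are non-numeric before deciding what to do with them (the toy data and the name size_num are assumptions for illustration, not the real file): real() returns missing for any string that is not a plain number, so the offending strings can be tabulated directly. Note also that the ignore() option of -destring- takes a set of characters to strip rather than a list of whole strings, so something like ignore("<>") would turn "<15" into 15 and silently discard the inequality.

```stata
* Toy data standing in for the real file (values are illustrative only)
clear
input str4 MaximumClassSize
"12"
"<15"
">15"
"34"
"<34"
end

* real() returns missing for strings that are not plain numbers
generate double size_num = real(MaximumClassSize)

* Show the raw strings that did not convert, with their frequencies
tabulate MaximumClassSize if missing(size_num)
```

The original string variable is left untouched, so no information is lost while deciding how to treat the "<" and ">" entries.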
Nick Cox

Join Date: Mar 2014

Posts: 35429
#3

03 Jul 2023, 16:36

Cross-posted at https://www.reddit.com/r/stata/comme...as_greater_or/

Shen YANG

Join Date: Apr 2023

Posts: 41
#4

04 Jul 2023, 00:43

Hi, Ariella. Are you saying that some observations of MaximumClassSize do not have a value between 1 and 34, and were instead deliberately assigned "<15", ">15" or "<34" for some reason?

If so, the storage type of MaximumClassSize must be string (as Daniel said above). In my work environment, we normally retain the numeric values and use value labels to categorize the variable. For example:

Variable name: GENDER

Value	Value label
1	Male
2	Female
99	Other

In your case, I guess you can replace the "<15", ">15" and "<34" values with specific numbers, such as "14.9999", "15.0001" and "33.9999", then destring MaximumClassSize and apply the value labels.

Code:

replace MaximumClassSize = "14.9999" if MaximumClassSize == "<15"
replace MaximumClassSize = "15.0001" if MaximumClassSize == ">15"
replace MaximumClassSize = "33.9999" if MaximumClassSize == "<34"
destring MaximumClassSize, replace

label define ClassLabel   1    "Label_1"  ///
                          2    "Label_2" ///
                          3    "Label_3" ///
                                  ...
                                  ...
                                  ...
                          14.9999    "Label_<15" ///
                          15.0001    "Label_>15" ///
                          33.9999    "Label_<34" ///
                          34    "Label_34" 

label values MaximumClassSize ClassLabel


Clyde Schechter

Join Date: Apr 2014

Posts: 29953
#5

04 Jul 2023, 01:08

Unfortunately, the code in #4 will not work, because Stata only allows you to label integer values and extended missing values.

I think the optimal handling of your situation depends on whether this maximum class size variable is going to be an explanatory variable in your analyses, or an outcome variable.

If it is going to be an outcome variable, you must create additional variables lower bound and upper bound. For the values of maximum class size that are simply numeric, set both lower and upper bound equal to the value of maximum class size. For the values of maximum class size that were coded as < 15, set the lower bound to 0 and the upper bound to 14. Do the analogous thing for < 34. And for >15, set the lower bound to 16 and the upper bound to missing value. Then you can use the lower and upper bound variables in a censored regression model (-help intreg-).
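
The bounds described above can be sketched roughly as follows. This is only an illustration under assumptions: the toy data, the variable names lower and upper, and the predictor x are all hypothetical, and the code assumes every censored entry is a "<" or ">" followed by a whole number.

```stata
* Toy data; x is a hypothetical predictor
clear
input str4 MaximumClassSize x
"12"  1
"<15" 2
">15" 3
"34"  4
"<34" 5
"20"  6
end

* Plain numbers: both bounds equal the value (missing for "<k" and ">k")
generate double lower = real(MaximumClassSize)
generate double upper = lower

* "<k": somewhere from 0 up to k - 1
replace lower = 0 if substr(MaximumClassSize, 1, 1) == "<"
replace upper = real(substr(MaximumClassSize, 2, .)) - 1 ///
    if substr(MaximumClassSize, 1, 1) == "<"

* ">k": at least k + 1, with no known upper limit (upper stays missing)
replace lower = real(substr(MaximumClassSize, 2, .)) + 1 ///
    if substr(MaximumClassSize, 1, 1) == ">"

list MaximumClassSize lower upper

* On the real data, the model would then be fit along the lines of:
* intreg lower upper x
```

The -intreg- call is left commented out because six toy observations are too few to fit anything meaningful; -intreg- treats a missing upper bound as right-censoring, which is why upper is left missing for the ">" entries.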

If, however, it is going to be an explanatory variable, then I am not aware of any procedure that will work with censored predictors. So in this case, I would just set these values to missing, which will result in their being excluded from analysis. If you want to include them in descriptive statistics, you can recode <15 as .a, >15 as .b, and <34 as .c, and then create a value label along the lines of #4, but with .a, .b, and .c in place of 14.9999, 15.0001, and 33.9999. Then when you -tab maximum_class_size, miss- these will be listed. But you still won't be able to do any calculations with these values.
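
A minimal sketch of the extended-missing-value approach, again with toy data and hypothetical names (size_num, sizelbl):

```stata
* Toy data standing in for the real file
clear
input str4 MaximumClassSize
"12"
"<15"
">15"
"34"
"<34"
end

* Plain numbers convert; the censored strings become system missing
generate size_num = real(MaximumClassSize)

* Give each censored category its own extended missing value
replace size_num = .a if MaximumClassSize == "<15"
replace size_num = .b if MaximumClassSize == ">15"
replace size_num = .c if MaximumClassSize == "<34"

* Extended missing values can carry value labels
label define sizelbl .a "<15" .b ">15" .c "<34"
label values size_num sizelbl

* The labeled categories show up only when missing values are requested
tabulate size_num, missing

* Calculations use the two ordinary numeric values only
summarize size_num
```

Any other non-numeric string that happens to be in the real data would be left as plain system missing (.), which makes it easy to spot categories that were overlooked.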

Last edited by Clyde Schechter; 04 Jul 2023, 01:11.
Shen YANG

Join Date: Apr 2023

Posts: 41
#6

05 Jul 2023, 18:30

Thanks to Clyde Schechter for the correction and apologies for my carelessness.

Perhaps a few minor changes to my code would fix this error as well. For example, change 14.9999 to 149999 (it seems a bit speculative though).
Daniel Schaefer

Join Date: Mar 2020

Posts: 808
#7

05 Jul 2023, 19:40

Quoting #5: "So in this case, I would just set these values to missing, which will result in their being excluded from analysis."

It doesn't really look like OP is following this thread, but I will add that it may be possible to impute these missing values.