  • How to recode variables that have greater or less than values?

    Hi everyone. This is my first time posting on this forum, so my apologies if I miss anything or post incorrectly. I am trying to create a compliance variable that measures whether class sizes in an urban city comply with a new class size mandate. The variable "MaximumClassSize" has many values ranging from 1 to 34, but it also has values listed as "<15", ">15", "<34", etc. How should I recode these while still keeping it as a non-categorical variable?

    So far, I've done this:

    destring MaximumClassSize, generate(MaximumClassSizenum) force
    list MaximumClassSize MaximumClassSizenum if missing(MaximumClassSizenum)
    destring MaximumClassSize, replace ignore("<15"|"<6"|">34"|">15")

    What should I do from here? I want to retain these less-than and greater-than values, but I don't know how.
    Last edited by Ariella Meltzer; 03 Jul 2023, 11:43.

  • #2
    Hi Ariella, welcome to the forum. It would be helpful if you could provide the error message along with the commands that you've already tried. Could you also provide a data example with the -dataex- command? This should make answering your question much easier. Assuming these are string values (and we aren't looking at value labels), it seems you will need either to change entries with extra non-numeric characters, like "<34", to a single numeric value, or to assign those strings a missing value in your new variable. Both options have drawbacks, and I wonder which you think would be most appropriate for your data. For example, would it be appropriate to change "<34" to "34"? Or would "33" be more appropriate?
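
    As a rough sketch of how that information could be gathered (assuming MaximumClassSize is stored as a string, and that the first 20 observations are enough for an example; both are assumptions), something like this would produce a data example and show which non-numeric codes actually occur:

    ```stata
    * Post a small, copyable data example (built in from Stata 15.1; older versions: ssc install dataex)
    dataex MaximumClassSize in 1/20

    * List only the entries that cannot be read as numbers, such as "<15" or ">34"
    tabulate MaximumClassSize if missing(real(MaximumClassSize))
    ```

    The second command relies on real() returning missing for any string that is not a valid number.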


    • #3
      Cross-posted at https://www.reddit.com/r/stata/comme...as_greater_or/


      • #4
        Hi, Ariella. Are you saying that some observations of MaximumClassSize do not hold values between 1 and 34, but were purposely assigned "<15", ">15" or "<34" for some reason?

        If so, the storage type of MaximumClassSize must be string (as Daniel said above). In my work environment, we normally retain the numeric values and use value labels to categorize the variable. For example:

        Variable name: GENDER

        Value   Value label
        1       Male
        2       Female
        99      Other

        In your case, I guess you can replace the "<15", ">15" and "<34" values with some specific numbers, such as "14.9999", "15.0001" and "33.9999", then destring the MaximumClassSize and apply the value labels.

        Code:
        replace MaximumClassSize = "14.9999" if MaximumClassSize == "<15"
        replace MaximumClassSize = "15.0001" if MaximumClassSize == ">15"
        replace MaximumClassSize = "33.9999" if MaximumClassSize == "<34"
        destring MaximumClassSize, replace
        
        label define ClassLabel   1    "Label_1"  ///
                                  2    "Label_2" ///
                                  3    "Label_3" ///
                                          ...
                                          ...
                                          ...
                                  14.9999    "Label_<15" ///
                                  15.0001    "Label_>15" ///
                                  33.9999    "Label_<34" ///
                                  34    "Label_34" 
        
        label values MaximumClassSize ClassLabel


        • #5
          Unfortunately, the code in #4 will not work, because Stata only allows you to label integer values and extended missing values.

          I think the optimal handling of your situation depends on whether this maximum class size variable is going to be an explanatory variable in your analyses, or an outcome variable.

          If it is going to be an outcome variable, you must create additional variables lower bound and upper bound. For the values of maximum class size that are simply numeric, set both lower and upper bound equal to the value of maximum class size. For the values of maximum class size that were coded as < 15, set the lower bound to 0 and the upper bound to 14. Do the analogous thing for < 34. And for >15, set the lower bound to 16 and the upper bound to missing value. Then you can use the lower and upper bound variables in a censored regression model (-help intreg-).
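
          A minimal sketch of that bound construction, under several assumptions: MaximumClassSize is still the original string variable, class sizes are whole numbers (so "<15" means at most 14 and ">15" means at least 16), and raw, lb, ub, x1 and x2 are hypothetical names that do not come from the original post.

          ```stata
          * Work on a trimmed copy so stray spaces do not break the comparisons
          generate raw = trim(MaximumClassSize)

          * Plain numbers: both bounds equal the value (real() gives missing for "<15" etc.)
          generate lb = real(raw)
          generate ub = real(raw)

          * "<k": lower bound 0, upper bound k - 1
          replace lb = 0 if substr(raw, 1, 1) == "<"
          replace ub = real(substr(raw, 2, .)) - 1 if substr(raw, 1, 1) == "<"

          * ">k": lower bound k + 1, upper bound left missing (right-censored)
          replace lb = real(substr(raw, 2, .)) + 1 if substr(raw, 1, 1) == ">"

          * Interval regression; x1 and x2 stand in for whatever predictors are used
          intreg lb ub x1 x2
          ```

          Parsing the number after the sign avoids writing one pair of -replace- commands per code, which may help if other codes such as "<6" or ">34" also occur. It is worth checking the result with -list raw lb ub if missing(real(raw))- before modeling.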

          If, however, it is going to be an explanatory variable, then I am not aware of any procedure that will work with censored predictors. So in this case, I would just set these values to missing, which will result in their being excluded from analysis. If you want to include them in descriptive statistics, you can recode <15 as .a, >15 as .b, and <34 as .c, and then create a value label along the lines of #4, but with .a, .b, and .c in place of 14.9999, 15.0001, and 33.9999. Then when you -tab maximum_class_size, miss- these will be listed. But you still won't be able to do any calculations with these values.
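
          A sketch of the extended-missing-value approach, assuming MaximumClassSize is still the original string variable; size_num and censored are hypothetical names, and any other codes in the data would need analogous lines:

          ```stata
          * Numeric copy; force turns non-numeric entries into system missing (.)
          destring MaximumClassSize, generate(size_num) force

          * Give each censored code its own extended missing value
          replace size_num = .a if trim(MaximumClassSize) == "<15"
          replace size_num = .b if trim(MaximumClassSize) == ">15"
          replace size_num = .c if trim(MaximumClassSize) == "<34"

          * Value labels may be attached to extended missing values
          label define censored .a "<15" .b ">15" .c "<34"
          label values size_num censored

          * The censored codes now appear as labeled rows in the table
          tabulate size_num, missing
          ```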
          Last edited by Clyde Schechter; 04 Jul 2023, 01:11.


          • #6
            Thanks to Clyde Schechter for the correction and apologies for my carelessness.

            Perhaps a few minor changes to my code would fix this error as well. For example, change 14.9999 to 149999 (it seems a bit speculative though).


            • #7
              From #5: "So in this case, I would just set these values to missing, which will result in their being excluded from analysis."

              It doesn't really look like OP is following this thread, but I will add that it may be possible to impute these missing values.
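
              One possible route, offered only as a sketch, is Stata's -mi impute intreg-, which imputes a variable from interval bounds. This assumes lower- and upper-bound variables built as described in #5 (here called lb and ub, hypothetical names), hypothetical predictors x1 and x2, and an arbitrary choice of 20 imputations and seed:

              ```stata
              * Declare the multiple-imputation style
              mi set wide

              * Create size_imp, imputing censored values within [lb, ub]
              mi impute intreg size_imp x1 x2, ll(lb) ul(ub) add(20) rseed(12345)
              ```

              Whether an imputation model is defensible here depends on why the values were censored, so this is a starting point rather than a recommendation.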
