  • Dropping Variables Based on Condition and in a Loop

    Hello, I need help learning how to drop a variable if one or more of its observations are greater than a number x. I tried googling this, but what I found either didn't help, was for one specific variable, or applied to all variables at once. I only want to look at 100 of the 150 variables I have in my dataset.

    Ex. I have variables dist_stop1, dist_stop2, and so on up to dist_stop100. I want to be able to write a loop that will drop variable "dist_stopP" (with P being a number 1-100) if that variable has an observation bigger than, say, 33.315.
    So if dist_stop4 has observations smaller than 30 and one observation that is 40.56, then the loop will drop dist_stop4 and move on to dist_stop5.
    All of the dist_stop variables are float types.

    I have been googling and reading for an hour but nothing is helping. Please help me. Thank you for reading and helping.

  • #2
    Code:
    forval x = 1/100{
        sum dist_stop`x',meanonly
        if r(max)>33.315 drop dist_stop`x'
    }
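
    A hedged variant of the same idea, for readers who want to try it on toy data first. The three toy variables and the r(N) guard are assumptions added for illustration, not part of the original answer. The guard matters because a variable with no nonmissing values leaves r(max) missing, and Stata treats missing as larger than any number, so such a variable would otherwise be dropped too.

    ```stata
    * Toy data (hypothetical values, for illustration only)
    clear
    set obs 3
    * 10, 20, 30: every value is below the cutoff, so this one should stay
    generate float dist_stop1 = 10 * _n
    * 20, 40, 60: the maximum exceeds 33.315, so this one should be dropped
    generate float dist_stop2 = 20 * _n
    * all missing: kept only because of the r(N) > 0 guard below
    generate float dist_stop3 = .

    * Loop over whatever dist_stop* variables exist; the varlist is
    * expanded once, before the loop starts, so dropping inside is safe
    foreach v of varlist dist_stop* {
        summarize `v', meanonly
        * r(N) is the number of nonmissing observations summarize used
        if r(N) > 0 & r(max) > 33.315 {
            drop `v'
        }
    }

    * Show which variables survived
    describe, simple
    ```

    Note that dist_stop* matches every variable whose name starts with dist_stop, so the forval version is the safer choice if only a specific numeric range should be checked.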



    • #3
      Hi Ali Atia, thank you for posting the reply. Could I trouble you for an explanation of the lines? It is just for my understanding, as the code works after I made some specific tweaks for my dataset.

      "forval x = 1/100{" I see that this is iterating x from 1 to 100 so that is good.

      "sum dist_stop`x', meanonly" what do sum and meanonly mean and do in this step?


      "if r(max)>33.315 drop dist_stop`x'" what does r(max) mean? I can guess it means the max value of the variable. Is r(blah) a common command, and do you have any info on r(blah)?

      Thank you again for helping me, and even more so if you do see this.

      If anyone else sees this and wants to answer, please do! I am just trying to learn and process the syntax, as Stata is a lot different from the other languages/programs I have looked at.



      • #4
        sum is an abbreviation of the command -summarize-, which is used to calculate univariate summary statistics. The meanonly option suppresses display of the output of that command and specifies that the variance should not be calculated. That is useful here because neither the displayed output nor the variance is required, and omitting them speeds up the loop.

        r(max) is a reference to a result stored by the previous summarize command, in this case the maximum. Many commands leave behind stored results, which are usually described at the end of a command's helpfile. For instance, -help summarize- notes that:

        Code:
            summarize stores the following in r():
        
            Scalars  
              r(N)           number of observations
              r(mean)        mean
              r(skewness)    skewness (detail only)
              r(min)         minimum
              r(max)         maximum
              r(sum_w)       sum of the weights
              r(p1)          1st percentile (detail only)
              r(p5)          5th percentile (detail only)
              r(p10)         10th percentile (detail only)
              r(p25)         25th percentile (detail only)
              r(p50)         50th percentile (detail only)
              r(p75)         75th percentile (detail only)
              r(p90)         90th percentile (detail only)
              r(p95)         95th percentile (detail only)
              r(p99)         99th percentile (detail only)
              r(Var)         variance
              r(kurtosis)    kurtosis (detail only)
              r(sum)         sum of variable
              r(sd)          standard deviation
        See pages 29-33 of this documentation file (https://www.stata.com/manuals/u18.pdf) for more information on accessing results calculated and stored by other programs.
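
        A small sketch of how to inspect stored results directly, using the auto dataset that ships with Stata (the choice of dataset and of the variable price is just an assumption for illustration). After an r-class command such as summarize, -return list- shows whatever is currently held in r(). With meanonly, expect fewer entries than in the full table above, since the variance-based and detail-only results are not computed.

        ```stata
        * Load the example dataset shipped with Stata
        sysuse auto, clear

        * Run summarize without output; results are still stored in r()
        summarize price, meanonly

        * List everything currently held in r()
        return list

        * Use a stored result in an expression
        display "largest price: " r(max)

        * r() is overwritten by the next r-class command, so copy a value
        * into a local macro if it is needed later
        local pmax = r(max)
        summarize mpg, meanonly
        display "price max saved earlier: `pmax'"
        display "mpg max now in r(max): " r(max)
        ```

        The last two lines illustrate why results are usually copied out right after the command that produced them.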
        Last edited by Ali Atia; 08 Jul 2021, 15:31.



        • #5
          Ali Atia you are awesome! Thank you for continuing to help educate me.



          • #6
            Ali Atia, if you can still see this, could you help me with a related question? I also want to see which observations are within, say, 1000 units of each stop represented by the many dist_stopx variables. I was thinking of making a loop that would print which observations are kept. The observations are identified by a variable called lot_id. I have set up a loop but I can't get it right. Could you help? I based it on what you so kindly provided earlier.

            forval x = 1/91{
                drop if dist_stop`x' > 1000
            }

            My loop drops all observations that have a value greater than 1000 for the variable at each iteration. My question is how to do this correctly, as I feel that dropping would affect every subsequent iteration, and I don't know how to get Stata to display the observations that match my criterion at each iteration.

            What I am hoping to see is, for each variable x, which observations remained. So it would look like

            dist_stop1
            1232, 14321, 534123, 8493, 8445

            dist_stop2
            8949, 6456, 8941, 3849

            and so on. The numbers after the dist_stopx would be the value of lot_id for that observation.

            Do you know how to do that, or how to get Stata to print something like it? It doesn't have to be as clean as my example above. I just need the variable name and the lot_id values next to each other.
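
            One possible approach, sketched under assumptions rather than taken from a reply in this thread: instead of dropping observations (which would indeed carry over into every later iteration), list lot_id with an if condition inside the loop. The toy data below are hypothetical, and the loop runs over only two variables to keep the example short.

            ```stata
            * Toy data (hypothetical values, for illustration only)
            clear
            input long lot_id float dist_stop1 float dist_stop2
            1232   500  2000
            8949  1500   800
            8445   900   950
            end

            * Nothing is dropped, so each iteration sees the full dataset
            forval x = 1/2 {
                * Print the variable name as a heading
                display as text _newline "dist_stop`x'"
                * Show lot_id only for observations within 1000 units;
                * missing distances count as larger than 1000, so they are excluded
                list lot_id if dist_stop`x' <= 1000, noobs clean noheader
            }
            ```

            For the real data, the loop bounds would become 1/91 and the toy -input- block would be removed.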

