Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Keep if first X numbers contain certain numbers

    Hello everyone,

    I have a question regarding data cleaning. My dataset consists of several composite numbers or rather SIC codes (with varying lengths, but the composites always consist of a varying amount of 4-digit numbers in a row). An example would look as follows:
    # newvar
    1 38412834
    2 28342834
    3 283628342834
    4 2835672628346289
    5 28342834
    6 28343081
    7 283428342834
    8 55412834
    9 51222834
    10 2834283620262834
    Now I would like to filter out (i.e. keep) those entries that contain the values 283 or 384 in the first three digits. My attempt for the code goes as follows:

    encode oldvar, gen(newvar) --------I first convert the string to a numeric variable
    keep if substr(string(newvar), 1, 3) == "283" | substr(string(newvar), 1, 3) == "384" -------and then keep those entries that either start with "283" or "384"

    Unfortunately, the code only keeps those entries with "283" in the first three digits. Does anyone see the mistake I made and would be so kind and help me?

    Thank you so much!

    Best,
    Carolin

  • #2
    encode is not the same as e.g. -destring-, use oldvar to keep the desired observations,
    Code:
    keep if inlist(substr(oldvar,1,3),"283","384")

    Comment


    • #3
      Actually, it may not be a good idea for O.P. to convert this variable to numeric even with -destring-. In the example shown, the longest of them has 16 digits, which is the maximum number of digits that can be held in a -double-. If there are any values in oldvar that are even longer, then they would inevitably lose some low order digits when converted with -destring-. (And -destring- will refuse to do it without the -force- option.) If 16 digits is the longest, then it will work, but it is unclear what use such a variable would be: these "numbers" are just arbitrary codes and there is no meaningful arithmetic that can be performed with them. If a numeric variable is required as a surrogate, say for using -xtset- or something like that, then, actually, -encode- would be the appropriate command for that.

      Comment


      • #4
        Thank you so much, that already helped a lot! Indeed my data structure is a bit tricky, since the variable I am looking at is the composite of company SIC codes and the number of SIC codes combined can range up to 10 companies (so 10x4 digits). Is there a way that I can more finely filter my results? Let's say keep in list if "2834" and "2834" or "2836" so that only SIC codes with "28342834" or "28342836" are left. I have 20 SIC codes so that's why unfortunately it would be quite tedious to enter these conditions manually. Thanks again!

        Comment


        • #5
          I do not get what you are asking. If #2 works for 3 digits, what is the difficulty in extending it to 8 digits? Here is an alternative using regular expressions.

          Code:
          * Example generated by -dataex-. For more info, type help dataex
          clear
          input str2 var1 str16 var2
          "1"  "38412834"        
          "2"  "28342834"        
          "3"  "283628342834"    
          "4"  "2835672628346289"
          "5"  "28342834"        
          "6"  "28343081"        
          "7"  "283428342834"    
          "8"  "55412834"        
          "9"  "51222834"        
          "10" "2834283620262834"
          end
          
          keep if regexm(strtrim(var2),"(^28342834|^28342836)")
          Res.:

          Code:
          . l
          
               +-------------------------+
               | var1               var2 |
               |-------------------------|
            1. |    2           28342834 |
            2. |    5           28342834 |
            3. |    7       283428342834 |
            4. |   10   2834283620262834 |
               +-------------------------+

          Comment

          Working...
          X