Hello everyone,
I have a question regarding data cleaning. My dataset contains composite SIC codes of varying length; each entry consists of a varying number of 4-digit codes in a row. An example looks as follows:
# | newvar |
1 | 38412834 |
2 | 28342834 |
3 | 283628342834 |
4 | 2835672628346289 |
5 | 28342834 |
6 | 28343081 |
7 | 283428342834 |
8 | 55412834 |
9 | 51222834 |
10 | 2834283620262834 |
Now I would like to keep only those entries whose first three digits are 283 or 384. My attempt at the code is as follows:
encode oldvar, gen(newvar) // I first convert the string to a numeric variable
keep if substr(string(newvar), 1, 3) == "283" | substr(string(newvar), 1, 3) == "384" // and then keep those entries that start with either "283" or "384"
Unfortunately, the code only keeps those entries with "283" in the first three digits. Does anyone see the mistake I made and would be so kind as to help me?
Thank you so much!
Best,
Carolin
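A minimal sketch of one way to approach the filter, under stated assumptions: oldvar is taken to be the original string variable, and the toy data below are hypothetical rows copied from the example listing. One thing worth checking in the attempt above is that encode assigns integer codes 1, 2, 3, ... with value labels, so string(newvar) returns that code rather than the SIC digits. Testing the prefix on the string variable directly avoids the numeric round trip:

```stata
* Hypothetical toy data mirroring part of the example listing;
* oldvar is assumed to be stored as a string
clear
input str16 oldvar
"38412834"
"28342834"
"283628342834"
"2835672628346289"
"55412834"
"51222834"
end

* Test the first three characters of the string itself,
* without encode or string()
keep if inlist(substr(oldvar, 1, 3), "283", "384")
list
```

On this toy data the first four rows remain and the two entries starting with 554 and 512 are dropped. Staying with the string also avoids precision issues that can arise when 16-digit composites are stored as numbers.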