Dear Statalist members,
I am currently writing a command that implements pattern matching, with a syntax similar to switch expressions in other programming languages but with usefulness and exhaustiveness checks similar to those of Rust's match expression (the second example explains these terms).
This is my first time writing a package in Stata and coding in Mata, so I am looking for opinions on the concept and the syntax. Does it sound useful to you? What features do you think would be required? Would you use it once I stabilize a first version?
The code is available on GitHub if you want to take a look: https://github.com/MaelAstruc/stata_match
Rather than trying to explain it with a long essay, here is a short example creating a new variable from the rep78 variable in the auto dataset:
Code:
    sysuse auto, clear

    * Usual way with 'replace newvar = value if condition'
    gen var_1 = ""
    replace var_1 = "very low" if rep78 == 1
    replace var_1 = "low" if rep78 == 2
    replace var_1 = "mid" if rep78 == 3
    replace var_1 = "high" if rep78 == 4
    replace var_1 = "very high" if rep78 == 5
    replace var_1 = "missing" if rep78 == .

    * With the match command: match newvar, variables(other_var) body(condition => value)
    gen var_2 = ""
    match var_2, variables(rep78) body( ///
        1 => "very low", ///
        2 => "low", ///
        3 => "mid", ///
        4 => "high", ///
        5 => "very high", ///
        . => "missing", ///
    )

    assert var_1 == var_2

Here the variable var_2 is filled by matching rep78 against the constant patterns '1', '2', '3', '4', '5' and '.'. Each pattern is followed by an arrow '=>' and the value to assign, and the arms are separated by commas.
To explain the nebulous terms of usefulness and exhaustiveness, here are two short examples:
Code:
    // Usefulness
    gen var_3 = ""
    match var_3, variables(rep78) body( ///
        1 => "very low", ///
        2 => "low", ///
        3 => "mid", ///
        4 => "high", ///
        5 => "very high", ///
        1 => "also very low", ///
        . => "missing", ///
    )
    * Warning : Arm 6 has overlaps
    * Arm 1: 1

    // Exhaustiveness
    gen var_4 = ""
    match var_4, variables(rep78) body( ///
        1 => "very low", ///
        2 => "low", ///
        3 => "mid", ///
        4 => "high", ///
    )
    * Warning : Missing values
    * 5
    * .

The main interest of the match command is to ensure that every case is covered by one of the arms (exhaustiveness) and that no arm's condition overlaps with that of another arm (usefulness).
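For readers who want to see what these two checks amount to, they can be approximated by hand with built-in commands. This is only a rough sketch on the auto dataset, not a description of how the match command is implemented; the variable name n_arms is made up for the illustration.

```stata
sysuse auto, clear

* Exhaustiveness: which levels of rep78 are left out by arms 1 to 4?
* inlist() returns 0 for missing values, so they are reported as uncovered
levelsof rep78 if !inlist(rep78, 1, 2, 3, 4), missing
count if !inlist(rep78, 1, 2, 3, 4)
assert r(N) > 0

* Usefulness: count how many arms each observation satisfies
* (n_arms is a hypothetical helper variable, one term per arm)
gen byte n_arms = (rep78 == 1) + (rep78 == 2) + (rep78 == 3) + ///
    (rep78 == 4) + (rep78 == 5) + (rep78 == 1) + (rep78 == .)

* Observations satisfying more than one arm reveal the overlap on '1'
count if n_arms > 1
assert r(N) > 0
list rep78 n_arms if n_arms > 1

* Every observation satisfies at least one arm, so these seven arms are exhaustive
assert n_arms >= 1
```

Doing this by hand for every new variable is tedious and easy to forget, which is the motivation for having the command run the checks automatically.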
The previous examples are quite straightforward (even if I often forget the missing values), but the match command also supports other patterns:
- Range pattern '~' to match on a range of values, such as:
  - '1~3' equivalent to 'rep78 >= 1 & rep78 <= 3'
  - '1!~3' equivalent to 'rep78 > 1 & rep78 <= 3'
  - '1~!3' equivalent to 'rep78 >= 1 & rep78 < 3'
  - '1!!3' equivalent to 'rep78 > 1 & rep78 < 3'
  - If the minimum or the maximum value is not specified, the pattern uses the minimum or maximum value of the variable
  - Note that this avoids matching missing values, which a plain 'rep78 >= 3' would include
- Or pattern '|' to combine multiple patterns
  - '4~5 | .' is equivalent to '(rep78 >= 4 & rep78 <= 5) | rep78 == .'
- Wildcard pattern '_' to match all the remaining values
  - This can be seen as a default value
- Tuple pattern '(..., ...)' to match multiple variables at the same time
Code:
    gen var_5 = ""
    replace var_5 = "case 1" if rep78 < 3 & price < 10000
    replace var_5 = "case 2" if rep78 < 3 & price >= 10000
    replace var_5 = "case 3" if rep78 >= 3
    replace var_5 = "missing" if rep78 == . | price == .

    gen var_6 = ""
    match var_6, variables(rep78, price) body( ///
        (~!3, ~!10000) => "case 1", ///
        (~!3, 10000~) => "case 2", ///
        (3~, _) => "case 3", ///
        (., _) | (_, .) => "missing", ///
    )

    assert var_5 == var_6
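The remark about missing values in range patterns relies on the fact that Stata treats missing values as larger than any number. The sketch below illustrates this with built-in commands only. Reading the open-ended pattern '3~' as "from 3 up to the observed maximum" follows the description in the list above, and the local macro names are arbitrary.

```stata
sysuse auto, clear

* A plain comparison also selects missing values, which sort above every number
count if rep78 >= 3
local n_ge = r(N)

* Assumed equivalent of the '3~' pattern: bounded by the observed maximum
summarize rep78, meanonly
local rep_max = r(max)
count if inrange(rep78, 3, `rep_max')
local n_range = r(N)

* The two counts differ exactly by the number of missing values
count if missing(rep78)
assert `n_ge' - `n_range' == r(N)
```

This is why the "case 3" line in the replace version above has to be followed by a separate line for missing values, while the '3~' pattern leaves them for the last arm.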
I hope these examples make the intent of the command clearer and show how it could improve data analysis by catching, as early as possible, some of the mistakes that can appear in our code.
Other possibilities exist, such as chains of cond() functions or the recode command (see an earlier discussion on the subject), but this approach seems more flexible and adds these checks. I only recently discovered the advanced features of recode, and the two commands seem close in purpose, with ranges, wildcards and overlap checks, but I am not sure whether recode could be adapted to support all these features or whether the match command could be used to relabel.
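For comparison, here is how the two built-in alternatives mentioned above might look on the auto dataset. This is only a sketch: the variable names var_cond and rep_grp and the grouping used for recode are chosen for illustration.

```stata
sysuse auto, clear

* Chain of cond() calls: the last branch separates missing values from the rest
gen var_cond = cond(rep78 == 1, "very low", ///
               cond(rep78 == 2, "low", ///
               cond(rep78 == 3, "mid", ///
               cond(rep78 == 4, "high", ///
               cond(rep78 == 5, "very high", ///
               cond(missing(rep78), "missing", ""))))))

* recode with ranges, a rule for missing values and value labels
recode rep78 (1/2 = 1 "low") (3 = 2 "mid") (4/5 = 3 "high") ///
    (missing = 4 "missing"), generate(rep_grp)

* Both versions handle the missing values explicitly
assert var_cond == "missing" if missing(rep78)
assert rep_grp == 4 if missing(rep78)
```

The cond() chain is not checked for forgotten cases. A value that none of the conditions handles would silently receive the empty string.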
Regarding the limitations:
- Although the examples shown here work with the code on GitHub, I have not yet created a proper package that can be installed.
- There is a performance cost because the levels are determined at run time. I still need to do proper profiling, but the first measurements suggest that the checks are relatively cheap.
- For now the command only supports numeric and string variables, but I plan to introduce support for dates and encoded variables.
Thank you for reading this and for your help.