Re #180:
Regarding 1), I tend to doubt that allowing -encode- to handle more than 65,536 distinct values will really accomplish much. At least in my workflow, in the very large data sets where this becomes an issue, the strings in question tend to be unique to each observation, or very nearly so. In that case, the value label, which must be stored in the data set and also be in memory when the data set is in use, will be roughly as large as the original string variable. Yes, if you had a data set with several million observations and a string variable that took on, say, 200,000 distinct values, then you would be able to appreciably shrink the data set with -encode-, and also get a performance boost on many commands as a result. But in my experience, it is vanishingly rare to see that situation.
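To see the value-label point concretely, here is a quick sketch using the auto data that ships with Stata (in auto, make is unique to every observation, which is exactly the situation I'm describing):

    sysuse auto, clear
    encode make, gen(make_code)    // creates value label make_code holding every distinct string
    label list make_code           // the label is essentially the whole string variable, stored again
    describe make make_code        // make_code is numeric, but the strings still live in its label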
Regarding 2), the use of a RAM drive to save files to memory: -frame copy- fulfills that need.
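For instance, instead of writing an intermediate file to disk (or to a RAM drive), something along these lines keeps a working copy entirely in memory (snapshot is just an arbitrary frame name I made up):

    frame copy default snapshot    // in-memory copy of the data in the current (default) frame
    // ... destructive data management in the default frame ...
    frame change snapshot          // switch to the saved copy whenever it is needed
    frame drop snapshot            // discard it when done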
Regarding 3), the current update of version 18 now gives -reshape- a -favor()- option that allows you to favor speed or memory. I'm told that with -favor(speed)- things go much faster than before. I haven't yet had occasion to use it on a large data set, so I can't comment from personal experience. I'd also point out that even before this, there were user-written alternatives to -reshape- that were much faster: -greshape- (part of Mauricio Caceres' -gtools- package) and Rafal Raciborski's -tolong- (for wide-to-long only). Both are available from SSC.
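For concreteness, the calls would look something like this (id, year, and inc are made-up names; I'm taking the -favor(speed)- syntax from the update announcement rather than from my own use):

    * official command, with the new option
    reshape long inc, i(id) j(year) favor(speed)

    * -gtools- alternative (ssc install gtools)
    greshape long inc, i(id) j(year)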
Regarding 4), I disagree, pretty strongly. If you are trying to combine two files, and a variable is a string in one of them but numeric in the other, that is a data error, pure and simple. Frankly, if I were going to revise -append-, I would eliminate the -force- option so that this condition would always throw an error and abort execution. It's a condition that should never happen, and when it does, I want to know about it so I can find out why and fix it. I don't want Stata making any repair attempts. After all, while investigating why the data sets are incompatible, I may well uncover other errors in the creation of those data sets that could trip me up later if left undetected now. It's an opportunity to find and fix problems in my data and to understand the data better. Also, since you are concerned about speed of execution with large files, I'll point out that string-numeric conversions in either direction are very time consuming.
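To illustrate what I mean by fixing it at the source: if the offending variable turned out to be an id that was accidentally read in as a string in one file, I would repair that file and then -append- without -force- (file and variable names here are invented):

    use second_file, clear
    destring patient_id, replace    // refuses to convert if any value contains non-numeric characters
    save second_file, replace

    use first_file, clear
    append using second_file        // no -force- needed once the storage types agree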
FWIW, I think one of the most underrated user-written programs is Mark Chatfield's -precombine-. It allows you to compare files that you plan to -merge- or -append- together and it gives you warnings about any incompatibilities or potential problems that will arise. Using it before you do a -merge- or -append- is, I think, almost always a wise precaution, especially if the files are large and the combining will take a long time. Much better to know in seconds that your files are inconsistently organized than to find out late into a time-consuming -append- or -merge-.
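If I remember the syntax correctly, the basic call is just the list of data sets you intend to combine, something like:

    precombine file1 file2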
Regarding 5), anything that could be done with -joinby- can also be done with Robert Picard's -rangejoin- command (just set the interval to be from negative infinity to positive infinity), and it will both be substantially faster and use less memory. -rangejoin- is available from SSC. To use it, you must also install -rangestat-, by Robert Picard, Nick Cox, and Roberto Ferrer, also available from SSC.
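So, for example, what -joinby famid using kids- does could be written roughly as below (famid, kidid, and kids are made-up names; kidid is just some numeric key variable in the using file, and, if I recall the help file correctly, missing interval bounds are read as minus and plus infinity):

    * instead of:  joinby famid using kids
    rangejoin kidid . . using kids, by(famid)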