I'm having trouble reconciling in my head how to work with the st_data/st_sdata functions, in conjunction with -marksample-, on large datasets. I'm writing a program that uses Mata to operate on two string variables, -s1- and -s2-. The Mata function loops through both variables, calls some function on the pair s1[i], s2[i] and returns an integer in a third variable -n-. The datasets I'm working with have millions of observations in s1 and s2.
These are the points I'm having trouble reconciling in my head:
1. Because these datasets are large, I want to avoid using st_sdata() naively to create a copy of my string variables, e.g.
Code:
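s = st_sdata(., ("s1 s2"), touse) // assume -touse- is a Mata string scalar containing "`touse'" from the Stata program
and then iterating over each row of this string matrix, performing my operation, saving the result in some matrix, then writing that result to -n- in the Stata dataset.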
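To make point 1 concrete, here is a minimal sketch of the copy-everything approach. The function name demo_copy() and the string-length difference are placeholders of mine, standing in for the real operation on each pair; it also assumes -n- already exists as a numeric variable.

```mata
mata:
// Hypothetical sketch: copy both string variables, compute, write back.
void demo_copy(string scalar s1, string scalar s2,
               string scalar n,  string scalar touse)
{
    string matrix  S
    real colvector res
    real scalar    i

    // full copy of the selected rows of both string variables
    S   = st_sdata(., (s1, s2), touse)
    res = J(rows(S), 1, .)

    for (i = 1; i <= rows(S); i++) {
        // placeholder computation; the real pairwise function goes here
        res[i] = abs(strlen(S[i, 1]) - strlen(S[i, 2]))
    }

    // write results only to the observations marked by touse
    st_store(., n, touse, res)
}
end
```

The memory cost is the concern here: S holds a second copy of every selected string.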
2. After reading the documentation for st_view() and William Gould's articles, "Mata Matters: Using views onto the data" (http://www.stata-journal.com/sjpdf.h...iclenum=pr0019) and "Mata Matters: Creating new variables—sounds boring, isn’t" (http://www.stata-journal.com/sjpdf.h...iclenum=pr0021), it seems at first blush like views and st_sview() are what I want to use. However, the documentation for st_view() states
"Do not use views as a substitute for scalars. If you are going to loop through the data an observation at a time, and if every usage you will make of X is in scalar calculations, use _st_data(). There is nothing faster for that problem."
This makes me think I should use _st_sdata() because my use case is exactly what that paragraph refers to.
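For comparison, this is roughly what the view-based version from point 2 would look like. It is only a sketch: demo_views() is a made-up name, the string-length difference is a placeholder for the real operation, and -n- is assumed to exist already as a numeric variable.

```mata
mata:
// Hypothetical sketch: views restricted to the observations marked by touse.
void demo_views(string scalar s1, string scalar s2,
                string scalar n,  string scalar touse)
{
    string matrix S
    real matrix   N
    real scalar   i

    st_sview(S, ., (s1, s2), touse)   // string view, selected rows only
    st_view(N, ., n, touse)           // numeric view onto the result variable

    for (i = 1; i <= rows(S); i++) {
        // assigning to a view element writes straight into the dataset
        N[i, 1] = abs(strlen(S[i, 1]) - strlen(S[i, 2]))
    }
}
end
```

The views honor `touse' without copying the strings, but every S[i, j] access goes through the view machinery, which is the per-element overhead the quoted paragraph warns about.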
3. However, _st_sdata() doesn't support a selectvar (i.e. touse), so I'm a bit stuck. The solution I can think of is to a) use st_varindex() to get the numeric indexes of the variables s1, s2, and `touse', b) loop over all observations from 1 to st_nobs() and use _st_data() to get the value of `touse', c) if it's 1, use _st_sdata() to get the values of the strings s1 and s2 and apply my function to them, and d) either store the result in a matrix, which I would write to a Stata variable after the loop completes, or write the result to that observation of n on each loop iteration using st_store().
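A minimal sketch of the loop described in point 3, using the write-each-iteration variant. Assumptions: demo_loop_all() is a hypothetical name, the string-length difference is a placeholder for the real function, -n- already exists as a numeric variable, and I use the scalar _st_store() rather than st_store() for the per-observation write.

```mata
mata:
// Hypothetical sketch: scan every observation, act only where touse != 0.
void demo_loop_all(string scalar s1, string scalar s2,
                   string scalar n,  string scalar touse)
{
    real scalar i, j1, j2, jn, jt

    // resolve names to variable indexes once, outside the loop
    j1 = st_varindex(s1)
    j2 = st_varindex(s2)
    jn = st_varindex(n)
    jt = st_varindex(touse)

    for (i = 1; i <= st_nobs(); i++) {
        if (_st_data(i, jt) != 0) {
            // placeholder computation; the real pairwise function goes here
            _st_store(i, jn,
                abs(strlen(_st_sdata(i, j1)) - strlen(_st_sdata(i, j2))))
        }
    }
}
end
```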
I don't quite like my solution in #3, though, because it requires looping through every observation in the dataset, regardless of how many observations are specified in `touse'. If I call
Code:
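myprogram s1 s2 in 1/10, gen(newvar)
on a dataset with 10 million observations, my solution in #3 would still require looping through every observation.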
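One direction I have considered, as a sketch only, is to collect the observation numbers where `touse' is nonzero and loop over just those. This assumes selectindex() is available (I believe it needs a reasonably recent Stata), that -n- already exists as a numeric variable, and again uses a hypothetical function name and a placeholder computation.

```mata
mata:
// Hypothetical sketch: visit only the observations marked by touse.
void demo_loop_marked(string scalar s1, string scalar s2,
                      string scalar n,  string scalar touse)
{
    real colvector idx
    real scalar    k, i, j1, j2, jn

    j1 = st_varindex(s1)
    j2 = st_varindex(s2)
    jn = st_varindex(n)

    // observation numbers where touse is nonzero
    idx = selectindex(st_data(., touse))

    for (k = 1; k <= rows(idx); k++) {
        i = idx[k]
        // placeholder computation; the real pairwise function goes here
        _st_store(i, jn,
            abs(strlen(_st_sdata(i, j1)) - strlen(_st_sdata(i, j2))))
    }
}
end
```

This still copies the single numeric `touse' column in order to build idx, so it trades one pass over a numeric variable for skipping the string accesses on unselected observations. I don't know whether that is actually the better trade.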
In short, I'm unsure how to reconcile a) being efficient and using _st_sdata() and _st_data() when looping through every observation, and b) avoiding looping through every observation when working with a subset of the data, specified by marksample and touse.
Thank you for the help,
Michael Anbar