How do I reconcile using st_data() to efficiently loop through observations in large datasets while working with marksample and subsets?

Michael Anbar

Join Date: Aug 2014

Posts: 116
#1

26 Sep 2014, 09:31

I'm having trouble reconciling in my head how to work with the st_data/st_sdata functions, in conjunction with -marksample-, on large datasets. I'm writing a program that uses Mata to operate on two string variables, -s1- and -s2-. The Mata function loops through both variables, calls some function on the pair s1[i], s2[i] and returns an integer in a third variable -n-. The datasets I'm working with have millions of observations in s1 and s2.

These are the points I'm having trouble reconciling in my head:

1. Because these datasets are large, I want to avoid using st_sdata() naively to create a copy of my string variables, e.g.

Code:

s = st_sdata(., ("s1 s2"), touse) // assume -touse- is a Mata string scalar containing "`touse'" from the Stata program

and then iterating over each row of this string matrix, performing my operation, saving the result in some matrix, then writing that result to -n- in the Stata dataset.

2. After reading the documentation for st_view() and William Gould's articles, "Mata Matters: Using views onto the data" (http://www.stata-journal.com/sjpdf.h...iclenum=pr0019) and "Mata Matters: Creating new variables—sounds boring, isn’t" (http://www.stata-journal.com/sjpdf.h...iclenum=pr0021), it seems at first blush like views and st_sview() are what I want to use. However, the documentation for st_view() states

Do not use views as a substitute for scalars. If you are going to loop through the data an observation at a time, and if every usage you will make of X is in scalar calculations, use _st_data(). There is nothing faster for that problem.

This makes me think I should use _st_sdata() because my use case is exactly what that paragraph refers to.

3. However, _st_sdata() doesn't support a selectvar (i.e., touse), so I'm a bit stuck. The solution I can think of is to a) use st_varindex() to get the numeric indexes of the variables s1, s2, and `touse'; b) loop over all observations from 1 to st_nobs() and use _st_data() to get the value of `touse'; c) if it's 1, use _st_sdata() to get the values of the strings s1 and s2 and apply my function to them; and d) either store the result in a matrix, which I would write to a Stata variable after the loop completes, or write the result to that observation of n after each loop iteration using st_store().

I don't quite like my solution in #3, though, because it requires looping through every observation in the dataset, regardless of how many observations are specified in `touse'. If I call

Code:

myprogram s1 s2 in 1/10, gen(newvar)

on a dataset with 10 million observations, my solution in #3 would still require looping through every observation.

In short, I'm unsure how to reconcile a) being efficient and using _st_sdata() and _st_data() when looping through every observation, and b) avoiding looping through every observation when working with a subset of the data, specified by marksample and touse.

Thank you for the help,

Michael Anbar
Phil Schumm

Join Date: Mar 2014

Posts: 169
#2

27 Sep 2014, 10:48

You are correct to conclude that _st_sdata() (or st_sdata()) are to be preferred to st_sview() in this case. However, my sense is that you may be trying to optimize prematurely. Your approach, which is basically

Code:

s1 = st_varindex("s1")
s2 = st_varindex("s2")
newvar = st_addvar("float", "newvar")
touse = st_varindex("touse")
for (i=1; i<=st_nobs(); i++) {
    if (_st_data(i, touse)) {
        _st_store(i, newvar, <function of _st_sdata(i,s1) and _st_sdata(i,s2)>)
    }
}

is in general fine. In cases where you are only using a subset of the observations, my hunch (without testing it) is that

Code:

mydata = st_sdata(., (st_varindex("s1"), st_varindex("s2")), st_varindex("touse"))
result = J(rows(mydata), 1, .)
for (i=1; i<=rows(mydata); i++) {
    result[i] = <function of mydata[i,1] and mydata[i,2]>
}
st_store(., st_addvar("float", "newvar"), st_varindex("touse"), result)

might be slightly faster, since it reduces the number of iterations of the loop. Of course, the downside of this approach is that it involves storing copies of your two input variables (plus the result) in memory; however, that's unlikely to matter much (even with millions of observations) unless your string variables are very long.
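For what it's worth, here is a minimal sketch of how the second approach might be wired up to -marksample- in a complete program. The names pairloop() and pairfn() are hypothetical, and pairfn() is only a stand-in (sum of the two string lengths) for whatever your real function is. The point is that the name of the temporary variable in `touse' is passed into Mata as a string, rather than assuming a variable literally called "touse" exists.

```stata
program define myprogram
    version 13
    syntax varlist(min=2 max=2 string) [if] [in], GENerate(name)
    confirm new variable `generate'
    // strok lets marksample accept string variables; observations with
    // an empty string in either variable are excluded from the sample
    marksample touse, strok
    mata: pairloop("`varlist'", "`touse'", "`generate'")
end

mata:
// Hypothetical stand-in for the real pairwise function
real scalar pairfn(string scalar a, string scalar b)
{
    return(strlen(a) + strlen(b))
}

// Copy only the selected rows, compute, and store the results back
void pairloop(string scalar vars, string scalar touse, string scalar newvar)
{
    string matrix  mydata
    real colvector result
    real scalar    i

    mydata = st_sdata(., tokens(vars), touse)
    result = J(rows(mydata), 1, .)
    for (i = 1; i <= rows(mydata); i++) {
        result[i] = pairfn(mydata[i, 1], mydata[i, 2])
    }
    // the same selectvar guarantees result lines up with the selected rows
    st_store(., st_addvar("float", newvar), touse, result)
}
end
```

With this in place, a call such as myprogram s1 s2 in 1/10, gen(newvar) would loop over at most 10 rows in Mata, although Stata itself still evaluates `touse' for the whole dataset.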

Now, if you're concerned about performance in the case where you are working with a very small range of the data (e.g., 1/10), note that even with your original approach you should still see a noticeable increase in speed depending on the relative cost of evaluating the statement(s) inside the if block. Moreover, I would expect the speed improvement of the second approach to be even greater in that case. And if really necessary, you could handle those cases separately by passing the range directly into st_sdata(); however, I wouldn't do this unless you really need to (otherwise it just complicates the code unnecessarily).
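If you did want to pass the range in, a rough sketch might look like the following. It relies on the fact that st_sdata() and st_store() treat a 1 x 2 row vector in the observation argument as a range. The function pairloop_range() and its arguments first and last are hypothetical; how the calling program extracts them from the -in- qualifier is not shown here, and pairfn() is again just a stand-in for the real function.

```stata
mata:
// Hypothetical stand-in for the real pairwise function
real scalar pairfn(string scalar a, string scalar b)
{
    return(strlen(a) + strlen(b))
}

// Hypothetical variant: first and last are assumed to be supplied by the caller
void pairloop_range(string scalar vars, string scalar touse,
                    string scalar newvar, real scalar first, real scalar last)
{
    string matrix  mydata
    real colvector result
    real scalar    i

    // (first, last) is a 1 x 2 row vector, i.e. an observation range;
    // touse still filters observations within that range
    mydata = st_sdata((first, last), tokens(vars), touse)
    result = J(rows(mydata), 1, .)
    for (i = 1; i <= rows(mydata); i++) {
        result[i] = pairfn(mydata[i, 1], mydata[i, 2])
    }
    // store with the same range and selectvar so the rows line up
    st_store((first, last), st_addvar("float", newvar), touse, result)
}
end
```

Because the range is given explicitly, only observations first through last are examined on the Mata side, instead of the selectvar being checked across the whole dataset.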

It may be unnecessary to ask, but I presume that you have already determined that you can't get by with
gen newvar = <function of s1 and s2> [if] [in]
where the function can be expressed in terms of one or more of Stata's built-in functions and/or operators?

Last edited by Phil Schumm; 27 Sep 2014, 10:52.