Hi there, Stata brethren.
Recently I have been trying to use the new nonparametric regression command in Stata 16, npregress series, on different subsamples of my data, and I found it to be slow. After digging in, I think I've discovered some strange behavior: npregress becomes much slower when you increase the size of the dataset in memory, even though the size of the estimation sample does not change.
Consider the example below.
Toy example
With 100,000 observations in memory and the estimation restricted to the first 1,000 by an if qualifier, npregress takes my computer about 60 seconds to run. I then use the exact same sample, but drop the unused observations first.
That version takes about 2 seconds. I did not expect this, because when I run a similar experiment with regress instead of npregress, the two timings are roughly the same.
Can someone explain why this is happening? Is npregress somehow using the observations outside the estimation sample? I was hoping to run npregress repeatedly on subsamples of my data to construct nonparametric predictions, without needing to repeatedly shuffle the data in memory (which would also take a long time, given that I am using a moderately large dataset).
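For concreteness, here is a minimal sketch of the kind of loop I have in mind, written with Stata 16 frames so that each subsample is copied into its own small frame before estimation. I have not verified that this avoids the slowdown; it assumes the cost depends on the dataset that is current when npregress runs. The variable group and the frame name sub are hypothetical, introduced only for illustration.

```stata
* Simulated data, same design as my toy example
clear
set seed 12345
set obs 100000
gen x1 = runiform()
gen x2 = runiform()
gen y = cos(x1)*sin(x2) + x1^2 + 1/3*runiform()

* Hypothetical subsample identifier: blocks of 1,000 observations
gen group = ceil(_n/1000)

forvalues g = 1/3 {
    * Copy only this subsample into a separate frame named sub
    frame put y x1 x2 if group == `g', into(sub)
    * Estimate inside that frame, where only 1,000 observations exist
    frame sub: npregress series y x1 x2, polynomial
    * Discard the frame before the next iteration
    frame drop sub
}
```

The full dataset stays untouched in the default frame, so nothing has to be dropped and reloaded between subsamples.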
Best,
Rustin
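Code:
* First run: 100,000 observations in memory, estimation on the first 1,000
clear
set obs 100000
gen x1 = runiform()
gen x2 = runiform()
gen y = cos(x1)*sin(x2) + x1^2 + 1/3*runiform()
npregress series y x1 x2 if _n < 1001, polynomial
Code:
* Second run: same estimation sample, unused observations dropped first
drop if _n >= 1001
npregress series y x1 x2 if _n < 1001, polynomial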
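The timings I quote are rough, so here is a sketch of how the comparison could be measured with Stata's timer commands, including the regress runs I mention for reference. The seed and the timer numbering are my own choices for illustration, and the elapsed times will differ by machine.

```stata
clear
set seed 12345
set obs 100000
gen x1 = runiform()
gen x2 = runiform()
gen y = cos(x1)*sin(x2) + x1^2 + 1/3*runiform()

timer clear

* Timers 1 and 2: full dataset in memory, sample restricted with if
timer on 1
quietly npregress series y x1 x2 if _n < 1001, polynomial
timer off 1
timer on 2
quietly regress y x1 x2 if _n < 1001
timer off 2

* Timers 3 and 4: same 1,000 observations after dropping the rest
drop if _n >= 1001
timer on 3
quietly npregress series y x1 x2, polynomial
timer off 3
timer on 4
quietly regress y x1 x2
timer off 4

* Report elapsed seconds for each timer
timer list
```

Timers 1 and 3 correspond to the two npregress runs above; timers 2 and 4 are the regress comparison.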