Npregress slow with large data-sets, small samples

Rustin Partow

Join Date: Aug 2019

Posts: 1
#1

12 Aug 2019, 03:04

Hi there, Stata brethren.

Recently I have been trying to use the new nonparametric regression feature in Stata 16, npregress series, on different subsamples of my data. I found it to be slow. After digging in, I think I've discovered a strange behavior, where npregress becomes much slower when you increase the size of the data-set in memory, without changing the size of the sample in the estimation.

Consider the example below.
Toy example

Code:

clear
set obs 100000
gen x1 = runiform()
gen x2 = runiform()
gen y = cos(x1)*sin(x2) + x1^2 + 1/3*runiform()
npregress series y x1 x2 if _n < 1001, polynomial

This takes my computer about 60 seconds to run. Now I use the exact same sample, but drop the unused observations.

Code:

drop if _n >= 1001
npregress series y x1 x2 if _n < 1001, polynomial

This takes about 2 seconds. I did not expect this, because when I run a similar experiment with regress instead of npregress, the two run times are roughly the same.
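For anyone who wants to reproduce the comparison, here is a minimal sketch that times both runs with Stata's timer command. The seed and the timer numbers (1 and 2) are arbitrary choices added for illustration, and the timings will of course vary by machine.

```stata
* Sketch: time npregress on the same 1,000 observations,
* first with 100,000 observations in memory, then with only 1,000.
clear
set seed 12345                      // arbitrary seed, added for reproducibility
set obs 100000
gen x1 = runiform()
gen x2 = runiform()
gen y  = cos(x1)*sin(x2) + x1^2 + 1/3*runiform()

timer clear

* Run 1: full data set in memory, estimation sample restricted by -if-
timer on 1
quietly npregress series y x1 x2 if _n < 1001, polynomial
timer off 1

* Run 2: same estimation sample, unused observations dropped first
drop if _n >= 1001
timer on 2
quietly npregress series y x1 x2, polynomial
timer off 2

* Elapsed seconds are displayed and also left in r(t1) and r(t2)
timer list
```

If the behavior described above holds, timer 1 should come out much larger than timer 2 even though both estimate on the same 1,000 observations.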

Can someone explain why this is happening? Is npregress utilizing the unsampled data somehow? I was hoping to be able to repeatedly run npregress on subsamples of my data in order to construct non-parametric predictions without needing to repeatedly shuffle the data in memory (which will also take a long time, given that I am using a moderately large data-set).

Best,
Rustin
Phil Bromiley

Join Date: Apr 2014

Posts: 4348
#2

14 Aug 2019, 12:23

I hit something similar in xtreg a few months ago. Report it to tech support. It appears that npregress is doing some preliminary, computation-intensive work before dropping the observations.
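If npregress really is doing work on every observation in memory before applying the if condition, one possible workaround in Stata 16 is to copy each subsample into its own frame and estimate there, leaving the full data set untouched in the default frame. This is only a sketch: the frame name sub is hypothetical, the subsample condition is taken from the toy example, and I have not verified that this removes the slowdown. It simply mirrors the observation that estimation is fast once the unused observations are no longer in the active data.

```stata
* Sketch: estimate on a subsample held in a separate frame (Stata 16+)
clear
set obs 100000
gen x1 = runiform()
gen x2 = runiform()
gen y  = cos(x1)*sin(x2) + x1^2 + 1/3*runiform()

* Copy only the estimation sample into a new frame named "sub" (hypothetical name)
frame put if _n < 1001, into(sub)

* Run the estimator inside that frame; no -if- is needed there
frame sub: npregress series y x1 x2, polynomial

* Discard the temporary frame; the full data set is still in the default frame
frame drop sub
```

For repeated subsamples, the frame put / frame drop pair could go inside a loop with a different if condition each time. Whether copying into a frame is cheaper than preserve, keep, and restore on a moderately large data set is something to test on your own data.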