
  • Npregress slow with large data-sets, small samples

    Hi there, Stata brethren.

    Recently I have been trying to use the new nonparametric regression feature in Stata 16, npregress series, on different subsamples of my data. I found it to be slow. After digging in, I think I've discovered a strange behavior, where npregress becomes much slower when you increase the size of the data-set in memory, without changing the size of the sample in the estimation.

    Consider the below example.
    Toy example
    Code:
    clear
    set obs 100000
    gen x1 = runiform()
    gen x2 = runiform()
    gen y = cos(x1)*sin(x2) + x1^2 + 1/3*runiform()
    npregress series y x1 x2 if _n < 1001, polynomial
    This takes my computer about 60 seconds to run. Now I use the exact same sample, but drop the unused observations.

    Code:
    drop if _n >=1001
    npregress series y x1 x2 if _n < 1001, polynomial
    This takes about 2 seconds. This was not the behavior I expected: if I run a similar experiment with regress instead of npregress, the two run times are roughly the same.
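    For anyone who wants to reproduce the comparison with measured times rather than a stopwatch, here is a minimal sketch using Stata's built-in timer command. The seed value and the use of preserve/keep/restore for the second run are my own choices for illustration, not part of the original test, and the timings will differ by machine.

    ```stata
    * Build the same toy data as above (seed is an arbitrary choice)
    clear
    set seed 12345
    set obs 100000
    gen x1 = runiform()
    gen x2 = runiform()
    gen y = cos(x1)*sin(x2) + x1^2 + 1/3*runiform()

    timer clear

    * Run 1: estimation sample of 1,000 with all 100,000 observations in memory
    timer on 1
    quietly npregress series y x1 x2 if _n < 1001, polynomial
    timer off 1

    * Run 2: same 1,000 observations, unused observations dropped first
    preserve
    keep if _n < 1001
    timer on 2
    quietly npregress series y x1 x2, polynomial
    timer off 2
    restore

    * Report elapsed seconds for timers 1 and 2
    timer list
    ```

    If the behavior described above holds, timer 1 should come out much larger than timer 2.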

    Can someone explain why this is happening? Is npregress utilizing the unsampled data somehow? I was hoping to be able to repeatedly run npregress on subsamples of my data in order to construct non-parametric predictions without needing to repeatedly shuffle the data in memory (which will also take a long time, given that I am using a moderately large data-set).
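    One possible workaround, sketched below, is to copy each subsample into a separate frame (frames are also new in Stata 16) and run npregress there, so that the full data-set stays in memory and is never reloaded. This is an untested assumption on my part: it relies on the observation above that npregress is fast when only the estimation sample is in the data it sees. The frame name sub and the blocks of 1,000 observations are hypothetical stand-ins for whatever subsamples are actually needed.

    ```stata
    * Toy data, as in the example above
    clear
    set seed 12345
    set obs 100000
    gen x1 = runiform()
    gen x2 = runiform()
    gen y = cos(x1)*sin(x2) + x1^2 + 1/3*runiform()

    * Hypothetical subsamples: three consecutive blocks of 1,000 observations
    forvalues b = 1/3 {
        local lo = (`b' - 1)*1000 + 1
        local hi = `b'*1000

        * Copy only the needed variables and observations into a new frame
        frame put y x1 x2 in `lo'/`hi', into(sub)

        * Estimate inside the small frame; the default frame is untouched
        frame sub: quietly npregress series y x1 x2, polynomial
        display "block `b' estimated"

        * Remove the temporary frame before the next pass
        frame drop sub
    }
    ```

    Any predictions made inside the temporary frame would need to be saved or linked back to the main data before the frame is dropped.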

    Best,
    Rustin

  • #2
    I hit something similar with xtreg a few months ago. Report it to tech support. It appears that npregress is doing some preliminary, computationally intensive work before it drops the observations excluded from the estimation sample.
