  • Non-parametric regression estimations to understand relationship between 2 variables: npregress, lpoly vs lowess

    I am trying to understand the relationship between two variables with non-parametric regressions using commands npregress, lpoly, and lowess. Are they all considered to be kernel regressions?

    As far as I understand:

    (1) All of them fit local regressions at each point (i.e., at each observation) using a neighbourhood of points within the chosen bandwidth. The further a data point is from the observation in question, the less weight it contributes to that regression. Combining the local fits makes the resulting function smooth.

    (2) The main difference among -lpoly-, -lowess-, and -npregress- is that -lowess- and -npregress- fit local linear regressions or local means, while -lpoly- fits local polynomial regressions of any chosen degree. Therefore, -lpoly- seems to be the more general estimator.

    (3) Besides, there are some differences in the default bandwidth and in whether more than one explanatory variable can be included. Are there any other (important) differences I am missing? (The sketch just below shows how I am calling the three commands.)
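    For concreteness, here is a minimal sketch of the three calls, shown on Stata's auto data rather than my own (the options are only illustrative, and I am not certain the defaults are comparable across commands):

    Code:
    sysuse auto, clear

    * lowess: locally weighted regression; save the smooth with generate()
    lowess mpg weight, nograph generate(yh_lowess)

    * lpoly: local polynomial of a chosen degree (here local-linear),
    * evaluated at each observed value of weight
    lpoly mpg weight, degree(1) kernel(gaussian) at(weight) generate(yh_lpoly) nograph

    * npregress kernel: local-linear by default, if I recall the manual correctly;
    * predict with no options returns the estimated conditional mean
    npregress kernel mpg weight
    predict yh_npreg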

    I have been trying all three commands with the same regression specification. The -lpoly- regression has proven to be a lot faster with my data, which I do not understand, given that it seemed to be the most general estimator (see item (2) above). The run with -npregress- has been taking forever: it has been 30 hours and the command is still running (I have 14 million observations). I ran the same specification with -lpoly- and got the result in less than 2 hours. I have also been running the same specification with -lowess- for the past 5 hours and still have no result.

    Is there any way I could speed up the estimation with -npregress- and -lowess-? I am only interested in the prediction, not in standard errors.

    Many thanks
    Paula
    Last edited by Paula de Souza Leao Spinola; 09 Feb 2023, 14:13.

  • #2
    Hi Paula
    So, as you have just described, nonparametric estimators with two variables are really nothing but "glorified" weighted regressions, where the weights are what determine the "locality" of the estimator.
    I cannot say much about what happens with -lowess-, but as you say, -lpoly- and -npregress kernel- are the most similar in how they work.
    Now, the difference between -npregress- and -lpoly- is that -npregress- makes model predictions at ALL distinct points of your explanatory variable. So if your explanatory variable has, say, 1 million distinct values, -npregress- estimates 1 million local regressions every time it needs predictions, including out-of-sample predictions, and it repeats this while searching for the most appropriate bandwidth (the tuning parameter that determines how smooth or rough the function will be).
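    One generic workaround (my suggestion, not an -npregress- feature) is to run the estimation on a random subsample; with millions of observations the fitted curve should change very little. The variable names below are placeholders for your own:

    Code:
    set seed 2023
    preserve
    sample 1                        // keep a 1% random subsample
    npregress kernel lnwage age    // placeholder names for your y and x
    restore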

    -lpoly- is more practical because it doesn't estimate the nonparametric regression at ALL points, but rather just at a handful of points, enough that, to the untrained eye, the result looks smooth. Its bandwidth is obtained from a plug-in formula rather than a cross-validation search.
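    For example, you can lower the number of evaluation points with -lpoly-'s n() option (the default is the smaller of N and 50, if I remember the help file correctly) and save the grid and the fit with generate():

    Code:
    * evaluate the smooth at only 25 grid points and store the results
    lpoly lnwage age, degree(1) kernel(gaussian) n(25) generate(agegrid smooth) nograph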

    Unfortunately, I don't think you can make -lpoly-, -npregress-, or -lowess- themselves any faster. However, you could cut out the middleman and do the estimation yourself, with a bandwidth chosen by -lpoly- or one of your own.
    On the other hand, given how large your sample is, you probably don't need to worry much about the bandwidth choice.

    Here is a small script:

    Code:
    frause oaxaca, clear                    // load example dataset (user-written frause command)
    expand 1000                             // blow the data up to mimic a large sample
    replace lnwage = lnwage + rnormal()     // add noise so the duplicates differ
    replace age    = age + rnormal()
    lpoly lnwage age, degree(1) nodraw kernel(gaussian)
    local lbw = r(bwidth)                   // keep lpoly's plug-in bandwidth
    gen agec = .
    forvalues i = 15/65 {
        replace agec = age - `i'            // center age at the evaluation point
        qui: regress lnwage agec [iw=normalden(agec, 0, `lbw')]
        matrix bb = nullmat(bb) \ (_b[_cons], `i')   // intercept = prediction at age `i'
    }
    lpoly lnwage age, degree(1) kernel(gaussian) noscatter
    As you can see, the two bottlenecks of doing this manually are:
    1) Data management. Running a regression in Stata takes more time when the dataset in memory is large, even if the estimation sample is small: if your data has 1 million observations but you use only 100 of them in a regression, it will take much longer than if you kept only those 100 observations (see the first sketch below).
    2) The number of reference points. In my example I evaluate at ages 15 through 65, so I could shrink that grid to reduce the number of regressions to be estimated.
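    On point 1), for instance, each regression can be restricted to observations near the evaluation point, where the Gaussian weights are non-negligible (the 4-bandwidth cutoff is my own arbitrary choice, and with a Gaussian kernel the truncation is an approximation):

    Code:
    forvalues i = 15/65 {
        replace agec = age - `i'
        * weights beyond ~4 bandwidths are essentially zero, so skip those rows
        qui: regress lnwage agec [iw=normalden(agec, 0, `lbw')] if abs(agec) < 4*`lbw'
        matrix bb = nullmat(bb) \ (_b[_cons], `i')
    }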

    Bottom line, I would probably suggest using a combination of weighted regressions, matrix storage, and frames to speed up the handling of a dataset that size.
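    For example (Stata 16 or newer; a rough sketch reusing the names and the `lbw' bandwidth from the script above), you could copy just the two variables into a separate frame, run the loop there, and then turn the result matrix into a small dataset for plotting:

    Code:
    frame put lnwage age, into(npwork)      // copy only the variables the loop needs
    frame npwork {
        gen agec = .
        forvalues i = 15/65 {
            replace agec = age - `i'
            qui: regress lnwage agec [iw=normalden(agec, 0, `lbw')]
            matrix bb = nullmat(bb) \ (_b[_cons], `i')
        }
    }
    * bb holds the predictions; make it a tiny dataset of its own
    matrix colnames bb = yhat at_age
    frame create results
    frame results: svmat bb, names(col)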

    Let me know if you need more help
    Fernando
