  • Odd bias in lpoly regression line

    While toying with a simulation, I noticed that lpoly was systematically failing to recognize the true linear relationship between two variables. This is visually obvious when the lpoly line is compared to a scatterplot or to the lowess line, as I demonstrate in a simpler example below. The problem must stem from lpoly's default settings, because localp, which I understand to be a particular setup of lpoly, does not suffer from the same problem.

    Can anyone explain which default settings within lpoly are causing this systematic bias? I want to be sure that I avoid this problem when using lpoly in the future. Thank you!

    Code:
    clear
    set obs 10000
    set seed 12345
    
    gen larea = rnormal(-1.5, 1.2)       /* original variable */
    gen error = rnormal(0, .4)           /* error to be added around larea */
    
    gen larea_me_big = -.3 + .8*larea + error /* larea is tilted, and noise is added */
    gen larea_me_sm = -.15 + .9*larea + error /* smaller tilt, same noise */
    gen larea_me_rnd = larea + error          /* only noise is added, no tilt */
    
    /* In each of the plots below, lpoly is missing the correct and obvious linear
    relationship between the noisy variable and the original variable, larea */
    two (scatter larea_me_big larea, mcolor(gray*.5) msize(tiny)) ///
        (lowess larea_me_big larea, lcolor(blue)) ///
        (lpoly larea_me_big larea, lcolor(green)) ///
        (line larea larea, lcolor(black)), ///
        legend(order(1 "data" 2 "lowess fit" 3 "lpoly fit" 4 "45 degrees"))
    
    two (scatter larea_me_sm larea, mcolor(gray*.5) msize(tiny)) ///
        (lowess larea_me_sm larea, lcolor(blue)) ///
        (lpoly larea_me_sm larea, lcolor(green)) ///
        (line larea larea, lcolor(black)), ///
        legend(order(1 "data" 2 "lowess fit" 3 "lpoly fit" 4 "45 degrees"))
        
    two (scatter larea_me_rnd larea, mcolor(gray*.5) msize(tiny)) ///
        (lowess larea_me_rnd larea, lcolor(blue)) ///
        (lpoly larea_me_rnd larea, lcolor(green)) ///
        (line larea larea, lcolor(black)), ///
        legend(order(1 "data" 2 "lowess fit" 3 "lpoly fit" 4 "45 degrees"))
        
    /* yet localp DOES recognize the correct linear relationship */
    localp larea_me_rnd larea

  • #2
    Good news: nothing odd here, and all of it is documented. lpoly has default degree(0), so it is not even trying to fit linear trends locally. In contrast, localp (from SSC, as you are asked to explain) has default degree(1), so it really is.

    FWIW, I sometimes think the lpoly defaults were chosen to oblige users to think about what they want (a feature, really), as they often don't produce "nice" results for me. Hence localp, although that is not especially smart; it just formalizes some of the author's experiences.
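To illustrate the difference between the two defaults, here is a rough Python analogue (not lpoly or localp themselves; the kernel, bandwidth, helper name local_poly, and all variable names are my own choices), fitting a kernel-weighted polynomial of each degree at a single point in the sparse left tail of data simulated as in #1:

```python
import numpy as np

rng = np.random.default_rng(12345)
n = 10_000
x = rng.normal(-1.5, 1.2, n)       # mirrors: gen larea = rnormal(-1.5, 1.2)
y = x + rng.normal(0.0, 0.4, n)    # true relationship is y = x, plus noise

def local_poly(x0, x, y, h, degree):
    """Kernel-weighted least-squares polynomial fit at x0 (Gaussian kernel)."""
    w = np.exp(-0.5 * ((x - x0) / h) ** 2)
    X = np.vander(x - x0, degree + 1, increasing=True)  # columns (x - x0)^j
    sw = np.sqrt(w)
    beta, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    return beta[0]                  # intercept = fitted value at x0

x0, h = -3.5, 0.5                   # evaluation point in the sparse left tail
fit0 = local_poly(x0, x, y, h, degree=0)   # local mean: like lpoly's default
fit1 = local_poly(x0, x, y, h, degree=1)   # local linear: like localp's default

print(f"truth {x0:.3f}   degree 0: {fit0:.3f}   degree 1: {fit1:.3f}")
```

The degree-0 fit lands noticeably above the 45-degree line in the left tail, while the degree-1 fit tracks it, matching the pattern in the plots above.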



    • #3
      Thanks so much; I'm embarrassed that I didn't understand the default properly. I do know that degree zero gives "local-mean smoothing." I understood this, clearly incorrectly, to be the local (kernel-smoothed) mean of y over x. But that is what degree(1) seems to give. (I've now tried both.)

      Can you possibly point me towards a resource that explains minimization under degree(0)? I understand the lpoly minimization problem to be something along the lines of

      Sum_i^N K([x_i - x_0]/h) * [y_i - alpha - beta*(x_i - x_0)^d]^2

      So I see that d=1 gives local OLS, and d=0 would minimize

      Sum_i^N K([x_i - x_0]/h) * [y_i - alpha - beta]^2

      But I don't see why such a minimization would result in the over-estimation/under-estimation in the plot below...

      Code:
      clear
      set obs 10000
      set seed 12345
      
      gen larea = rnormal(-1.5, 1.2)       /* original variable */
      gen error = rnormal(0, .4)           /* error to be added around larea */
      gen larea_error = larea + error         /* add noise */
      
      gen larea_bins = round(larea, .1) /* bins for x-variable */
      bysort larea_bins: egen larea_error_mns=mean(larea_error)
      
      two (scatter larea_error larea, msize(small) mcolor(eltblue)) ///
          (scatter larea larea, msize(small) mcolor(gray)) ///
          (lpoly larea_error larea, deg(0) lcolor(orange)) ///
          (lpoly larea_error larea, deg(1) lcolor(pink)) ///
          (scatter larea_error_mns larea, msize(tiny) mcolor(yellow)), ///
          legend(order(1 "variable with error" 2 "original variable" ///
              3 "lpoly deg 0" 4 "lpoly deg 1" ///
              5 "variable with error meaned by bins"))
      Apologies for the follow-on, and thanks for your time!



      • #4
        No beta term in the second equation, I think. See [R] lpoly for a formal statement.
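To spell that out: with the beta term dropped, the degree-0 objective Sum_i K([x_i - x_0]/h) * (y_i - alpha)^2 has a closed-form minimizer, the kernel-weighted mean alpha = Sum_i K_i y_i / Sum_i K_i (the Nadaraya-Watson estimator). A quick numerical check of that claim, in Python rather than Stata (all names here are mine):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(-1.5, 1.2, 500)
y = x + rng.normal(0.0, 0.4, 500)

x0, h = -3.0, 0.5
w = np.exp(-0.5 * ((x - x0) / h) ** 2)      # Gaussian kernel weights K_i

# Closed-form minimiser of sum_i K_i * (y_i - a)^2: the weighted mean.
a_closed = np.sum(w * y) / np.sum(w)

# Brute-force confirmation: grid-search a over the range of y.
grid = np.linspace(y.min(), y.max(), 4001)
sse = (w[:, None] * (y[:, None] - grid[None, :]) ** 2).sum(axis=0)
a_grid = grid[np.argmin(sse)]

print(f"weighted mean {a_closed:.4f}   grid minimiser {a_grid:.4f}")
```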



        • #5
          Thanks, Nick. I'm sorry to add to this chain after such a long pause, but I'm still unclear on what is happening under degree(0). The minimizing equation looks like it should be taking the kernel-weighted mean of y within the bandwidth, and indeed the help for lpolyci says: "The default is degree(0), meaning local-mean smoothing." However, in the example I posted directly above, you can see that lpoly with degree(0) is NOT capturing the variable mean. Not in the least: the line it produces actually falls above all points on the left-hand side of the graph and below all points on the right-hand side. Can you explain what is going on here?
          Last edited by Leah Bevis; 16 Apr 2018, 21:12.
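One way to see what is happening: under degree(0) the fit at x_0 is the kernel-weighted mean of y near x_0, and the x values inside any window are not centred on x_0; they cluster toward the density peak of x (around -1.5 in this simulation). Because y tracks x, that mean is pulled toward the peak: upward in the left tail and downward in the right tail, which is exactly the above/below pattern described in #5. A rough numerical sketch of that pull, in Python rather than Stata (all names are mine, and this is an analogue rather than lpoly itself):

```python
import numpy as np

rng = np.random.default_rng(12345)
x = rng.normal(-1.5, 1.2, 10_000)
y = x + rng.normal(0.0, 0.4, 10_000)    # true curve is the 45-degree line

def local_mean(x0, h=0.5):
    """Degree-0 fit at x0: the Gaussian-kernel-weighted mean of y."""
    w = np.exp(-0.5 * ((x - x0) / h) ** 2)
    return np.sum(w * y) / np.sum(w)

left, right = -3.5, 0.5                 # points either side of the mode (-1.5)
print(f"left tail:  fit - truth = {local_mean(left) - left:+.3f}")   # positive
print(f"right tail: fit - truth = {local_mean(right) - right:+.3f}") # negative
```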
