
  • Cubic Spline Interpolation

    Hi, I'm using survey data for my dissertation and suspect that the dependent variable is not linear in some of the independent variables.

    From a paper I read, I learnt that I can use a cubic spline to address this problem.
    However, I'm not really sure how exactly to use it.

    Also, how can I know whether I should include any interaction terms between the independent variables? Is there a command for this?

    Feel free to suggest any other ways to model nonlinear relationships for survey data, as I'm really struggling right now.

    BTW, the data I'm using is pooled cross-sectional.
    Thanks
    Last edited by eric chen; 22 Jul 2015, 13:26.

  • #2
    Welcome to Statalist, Eric!

    Let's start from the beginning: 1) Have you svyset your data yet? If not, do so and show us the command and the result of svydes. 2) What is the goal of your analysis? 3) Show us a svy: regress command that includes only linear terms, together with the results. Then we can talk about non-linearity and interactions.
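
    For orientation, the general pattern looks something like this; the names psu_id, stratum_id, svy_weight, depvar, and x1-x3 are placeholders, and your survey documentation will tell you the actual PSU, strata, and weight variables:

    Code:
    * placeholder names -- substitute the variables from your own survey
    svyset psu_id [pweight = svy_weight], strata(stratum_id)
    svydes
    svy: regress depvar x1 x2 x3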

    Very important: Read the FAQ, especially Section 12, and follow its advice. It will tell you how to properly format commands and results in Statalist posts by means of CODE delimiters. Without this kind of formatting, listings are not easy to read.
    Steve Samuels
    Statistical Consulting
    [email protected]

    Stata 14.2



    • #3
      I apologise for the wrong format. It's my first time using Statalist and I'm still learning.

      I'm using Stata 13. Here are the commands and results.

      Code:
      svyset ID [pw = WTSSALL]
      pweight: WTSSALL
      VCE: linearized
      Single unit: missing
      Strata 1: <one>
      SU 1: ID
      FPC 1: <zero>


      My observations are individual respondents to the survey. WTSSALL is the weight provided by the survey.

      I'm trying to find the predictors of extramarital affairs, using the survey data.



      Regression Results
      Code:
      svy: logit INFI AGE m working middle upper black race_other high coll bac grad, nolog
      (running logit on estimation sample)

      Survey: Logistic regression

      Number of strata =       1                 Number of obs   =      18507
      Number of PSUs   =    3613                 Population size =  18673.461
                                                 Design df       =       3612
                                                 F(  11,   3602) =      27.57
                                                 Prob > F        =     0.0000

      ------------------------------------------------------------------------------
                   |             Linearized
              INFI |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
      -------------+----------------------------------------------------------------
               AGE |   .0017679   .0013201     1.34   0.181    -.0008203    .0043561
                 m |   .6189137   .0429376    14.41   0.000     .5347293    .7030981
           working |  -.1847295   .0943901    -1.96   0.050    -.3697927    .0003338
            middle |  -.3407727    .094229    -3.62   0.000    -.5255201   -.1560254
             upper |  -.0488208   .1447036    -0.34   0.736    -.3325297     .234888
             black |   .4762045   .0656899     7.25   0.000     .3474116    .6049974
        race_other |  -.1343427   .1006761    -1.33   0.182    -.3317304     .063045
              high |   .1535421   .0707681     2.17   0.030     .0147927    .2922914
              coll |   .0429336   .1027959     0.42   0.676    -.1586101    .2444773
               bac |  -.0438088   .0875743    -0.50   0.617    -.2155089    .1278913
              grad |   .1275921   .0974689     1.31   0.191    -.0635076    .3186917
             _cons |  -1.904147   .1323298   -14.39   0.000    -2.163595   -1.644698
      ------------------------------------------------------------------------------

      INFI is a binary indicator for having had an affair or not. working, middle, and upper indicate social class; indicators for race and highest degree achieved are also included. m is an indicator for gender, with m = 1 representing male. The independent variables used here are all binary except AGE, but data on level of education measured in number of years completed is also available.

      This is my base model, so I'm sure there are omitted variables. From what I know so far, the age effect could be nonlinear.

      Also, I'd like to know if there's any simple technique to determine if I should include any interaction terms.

      It's my first project and I'm really new to Stata, so sorry for so many questions.
      Last edited by eric chen; 23 Jul 2015, 08:37.



      • #4
        The simplest way of including some non-linearity for age is to add a quadratic term to your model:

        Code:
        svy: logit INFI c.AGE##c.AGE m working middle ...
        Personally, I am quite fond of linear splines (see help mkspline):

        Code:
        mkspline age1 30 age2 40 age3 = AGE
        svy: logit INFI age1 age2 age3 m working middle ...
        Interaction effects are something very different from non-linearity (a squared term can be seen as an interaction of a variable with itself, and I used that trick in the first example, but that is just trickery; substantively they are very different beasts). An interaction effect says that the effect of one variable differs across groups. One might expect that the effect of age is different for men compared to women: the argument would be that men tend to become more attractive as they age, while for women the opposite is true, and more attractive persons have more opportunities to have an extramarital affair. If you want to check that, you add an interaction effect between AGE and m. That is also the answer to your question of how to know whether you should include an interaction effect: think about whether there is a reasonable argument for why the effect might differ; if you can find such an argument you include the interaction, and if you cannot you leave it out.
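
        A minimal sketch of how such an interaction could be added to the model from #3, using factor-variable notation (c.AGE##i.m includes AGE, m, and their interaction; the testparm line is just one way to test the interaction term):

        Code:
        * sketch: interaction between AGE and gender, other covariates as in #3
        svy: logit INFI c.AGE##i.m working middle upper black race_other high coll bac grad
        testparm c.AGE#i.m    // tests whether the age effect differs by gender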
        ---------------------------------
        Maarten L. Buis
        University of Konstanz
        Department of history and sociology
        box 40
        78457 Konstanz
        Germany
        http://www.maartenbuis.nl
        ---------------------------------



        • #5
          Hi Maarten,

          Thanks for the reply.

          I have tried a restricted cubic spline with 5 knots, and it generated 4 variables for age.

          I originally included age and its squared term in the logit, and both were significant, with a positive age effect and a negative sign on squared age. This made sense, as you'd expect income to increase with age, which increases the resources available for engaging in an affair; but beyond a certain age other factors, such as appearance and health, dominate, which gives the negative sign on squared age.

          But when I included the 4 variables from mkspline, only agesp1, which is AGE itself, was significant.

          Here is what I did.


          Results

          Code:
          mkspline agesp = AGE, cubic nk(5) dis

                       |     knot1      knot2      knot3      knot4      knot5
          -------------+-------------------------------------------------------
                   AGE |        27         38         48         60         79


          Code:
          svy: logit  INFI m working middle upper black race_other high coll bac grad agesp*

          ------------------------------------------------------------------------------
                       |             Linearized
                  INFI |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
          -------------+----------------------------------------------------------------
                     m |   .6005033   .0431304    13.92   0.000     .5159408    .6850657
               working |  -.1914946    .095073    -2.01   0.044    -.3778968   -.0050924
                middle |  -.3354261   .0948848    -3.54   0.000    -.5214592   -.1493931
                 upper |  -.0615419    .145267    -0.42   0.672    -.3463554    .2232716
                 black |   .4449788   .0661402     6.73   0.000     .3153029    .5746547
            race_other |  -.1295719    .101416    -1.28   0.201    -.3284103    .0692665
                  high |   .0899251   .0716605     1.25   0.210    -.0505739    .2304242
                  coll |  -.0508791   .1042809    -0.49   0.626    -.2553344    .1535762
                   bac |  -.1299657   .0887746    -1.46   0.143     -.304019    .0440877
                  grad |   .0049822   .0990311     0.05   0.960    -.1891803    .1991447
                agesp1 |   .0275516   .0106086     2.60   0.009     .0067522    .0483509
                agesp2 |   .0077659   .0829591     0.09   0.925    -.1548854    .1704171
                agesp3 |  -.1418264   .2521903    -0.56   0.574     -.636276    .3526233
                agesp4 |   .1342192   .2744515     0.49   0.625    -.4038762    .6723145
                 _cons |  -2.827855    .353867    -7.99   0.000    -3.521654   -2.134056
          ------------------------------------------------------------------------------


          Do you know why?

          Thanks
          Last edited by eric chen; 23 Jul 2015, 09:19.



          • #6
            There is no contradiction here. The spline variables created by mkspline are highly correlated, and, except for the first, their coefficients cannot be easily interpreted. This phenomenon is easy to reproduce with the auto data (below).

            To get understandable coefficients, I recommend the flexcurv command in Roger Newson's bspline package (from SSC), described in Newson (2012). Its regression coefficients are equal to the model predictions at reference values of the independent variable. You will see, if you run the code below, that the predictions from the quadratic model, mkspline, and flexcurv are very similar. (In fact, flexcurv itself can fit the quadratic polynomial model (p. 19).)

            Code:
            . sysuse auto, clear
            (1978 Automobile Data)
            
            . reg mpg c.weight##c.weight
            
                  Source |       SS           df       MS      Number of obs   =        74
            -------------+----------------------------------   F(2, 71)        =     72.80
                   Model |  1642.52197         2  821.260986   Prob > F        =    0.0000
                Residual |  800.937487        71  11.2808097   R-squared       =    0.6722
            -------------+----------------------------------   Adj R-squared   =    0.6630
                   Total |  2443.45946        73  33.4720474   Root MSE        =    3.3587
            
            -----------------------------------------------------------------------------------
                          mpg |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
            ------------------+----------------------------------------------------------------
                       weight |  -.0141581   .0038835    -3.65   0.001    -.0219016   -.0064145
                              |
            c.weight#c.weight |   1.32e-06   6.26e-07     2.12   0.038     7.67e-08    2.57e-06
                              |
                        _cons |   51.18308   5.767884     8.87   0.000     39.68225    62.68392
            -----------------------------------------------------------------------------------
            
            . predict p1
            (option xb assumed; fitted values)
            
            . mkspline msp = weight, cubic nk(5) dis
            
                         |     knot1      knot2      knot3      knot4      knot5
            -------------+-------------------------------------------------------
                  weight |      1930    2336.25       3190    3518.75       4130
            
            . corr msp*
            (obs=74)
            
                         |     msp1     msp2     msp3     msp4
            -------------+------------------------------------
                    msp1 |   1.0000
                    msp2 |   0.8976   1.0000
                    msp3 |   0.8503   0.9946   1.0000
                    msp4 |   0.7076   0.9322   0.9630   1.0000
            
            . reg mpg msp1 msp2 msp3 msp4
            
                  Source |       SS           df       MS      Number of obs   =        74
            -------------+----------------------------------   F(4, 69)        =     36.38
                   Model |  1657.52708         4  414.381771   Prob > F        =    0.0000
                Residual |  785.932376        69  11.3903243   R-squared       =    0.6784
            -------------+----------------------------------   Adj R-squared   =    0.6597
                   Total |  2443.45946        73  33.4720474   Root MSE        =     3.375
            
            ------------------------------------------------------------------------------
                     mpg |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
            -------------+----------------------------------------------------------------
                    msp1 |  -.0123434   .0039503    -3.12   0.003     -.020224   -.0044627
                    msp2 |    .031321   .0294871     1.06   0.292    -.0275043    .0901462
                    msp3 |  -.0532444   .0583229    -0.91   0.364    -.1695953    .0631065
                    msp4 |   .0588327   .1006072     0.58   0.561     -.141873    .2595385
                   _cons |   53.08371   8.256304     6.43   0.000     36.61283    69.55458
            ------------------------------------------------------------------------------
            
            . predict p2
            (option xb assumed; fitted values)
            
            . flexcurv, xvar(weight) power(3) refpts(1500(900)5100) gen(cs_)
            
            . regress mpg cs_*, noconst nohead
            ------------------------------------------------------------------------------
                     mpg |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
            -------------+----------------------------------------------------------------
                    cs_1 |   33.86387   3.733922     9.07   0.000      26.4149    41.31284
                    cs_2 |    24.6141   .7811342    31.51   0.000     23.05578    26.17242
                    cs_3 |   18.79659   .6841035    27.48   0.000     17.43184    20.16134
                    cs_4 |   15.47252   1.113113    13.90   0.000     13.25191    17.69312
                    cs_5 |   10.05772   5.322653     1.89   0.063    -.5606797    20.67613
            ------------------------------------------------------------------------------
            . predict p3
            (option xb assumed; fitted values)
            
            . label var p1 "quadratic"
            . label var p2 "mkspline"
            . label var p3 "flexcurv"
            
            . sort weight mpg
            
            . graph twoway scatter mpg p1  p2 p3 weight, c( i l l l) ms( o p p p )
            Reference: Newson, Roger B. 2012. Sensible parameters for univariate and multivariate splines. Stata Journal 12, no. 3: 479.
            Available at: http://www.imperial.ac.uk/nhli/r.new...s/sensparm.pdf.
            Last edited by Steve Samuels; 25 Jul 2015, 09:25.
            Steve Samuels
            Statistical Consulting
            [email protected]

            Stata 14.2



            • #7
              I should have added that if you plot the model predictions from the quadratic model against those from one of the spline models, I would expect them to be very similar. The point of a spline model is not to test individual coefficients, but simply to get a good non-linear fit. It's quite possible that in your problem a quadratic is sufficient.
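
              A minimal sketch of that comparison for your data, assuming the quadratic and restricted-cubic-spline models from the earlier posts have been fit; p_quad and p_spline are hypothetical names for the predicted probabilities:

              Code:
              * quadratic-in-age model
              svy: logit INFI c.AGE##c.AGE m working middle upper black race_other high coll bac grad
              predict p_quad, pr       // predicted probability, quadratic model
              * restricted-cubic-spline model (agesp* from your mkspline call)
              svy: logit INFI agesp* m working middle upper black race_other high coll bac grad
              predict p_spline, pr     // predicted probability, spline model
              scatter p_quad p_spline  // points near the 45-degree line mean the two fits agree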
              Steve Samuels
              Statistical Consulting
              [email protected]

              Stata 14.2
