  • Issues selecting correct functional form for multiple regression

    Hi all,

    I have been asked to verify the effect of water and sanitation on the mortality of children under the age of 5 and quantify whether providing water services or sanitation services has a larger effect on child mortality.

    Using what I understand about changing functional form and looking at the scatter graphs, I concluded that only GDPPC would need to be transformed, as the other variables are measured in percentages or proportions.

    I ran regress and got a much smaller coefficient for WATER than I would have expected, with a P-value of 0.948, which leads me to believe something is wrong with the model.

    I then ran regress on all variations of the model I thought could be plausible based on the nature of the variables. However, they all produce coefficients whose sign differs from that seen in the scatter graphs, very large P-values, or both.

    The only explanation I can think of is the high correlation between the variables.

    I would be very grateful for any thoughts on the correct functional form to use or on why my results are so unexpected.

    Code:
    summarize
    
    gen LGDPPC=log(GDPPC)
    
    regress INFMORT LGDPPC WATER SANIT
    
    ovtest
    
    vif
    
    correlate
    . summarize

        Variable |        Obs        Mean    Std. Dev.       Min        Max
    -------------+---------------------------------------------------------
           CCODE |          0
           GDPPC |         40    14312.08    14534.66   1349.372   52926.54
         INFMORT |         40       32.31    25.03579        3.4       95.1
           WATER |         39    85.70895    17.52148   36.59633        100
           SANIT |         40    70.71659    28.64349   13.94848        100


    . regress INFMORT LGDPPC WATER SANIT

          Source |       SS           df       MS      Number of obs   =        39
    -------------+----------------------------------   F(3, 35)        =     35.75
           Model |  18397.8535         3  6132.61782   Prob > F        =    0.0000
        Residual |  6003.51577        35  171.529022   R-squared       =    0.7540
    -------------+----------------------------------   Adj R-squared   =    0.7329
           Total |  24401.3692        38  642.141296   Root MSE        =    13.097

    ------------------------------------------------------------------------------
         INFMORT |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
          LGDPPC |  -8.453043   3.386237    -2.50   0.017    -15.32747   -1.578615
           WATER |  -.0147222   .2241601    -0.07   0.948    -.4697913    .4403469
           SANIT |  -.5062109   .1450451    -3.49   0.001    -.8006682   -.2117536
           _cons |   146.1188   24.48058     5.97   0.000     96.42059    195.8171
    ------------------------------------------------------------------------------


    . correlate
    (CCODE ignored because string variable)
    (obs=39)

                 |    GDPPC  INFMORT    WATER    SANIT   LGDPPC
    -------------+---------------------------------------------
           GDPPC |   1.0000
         INFMORT |  -0.6592   1.0000
           WATER |   0.5542  -0.7316   1.0000
           SANIT |   0.6441  -0.8401   0.8262   1.0000
          LGDPPC |   0.9029  -0.7849   0.7342   0.7664   1.0000

  • #2
    You didn't get a quick response. You'll increase your chances of a useful response by following the FAQ on asking questions: provide Stata code in code delimiters, readable Stata output (fixed-spacing fonts help), and sample data using dataex.
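
    As a rough illustration of the dataex advice, a minimal sketch using the variable names from #1. It assumes the poster's dataset is in memory and that dataex is available (it is bundled with recent Stata releases; on older installations it can be installed from SSC).

    Code:
    * If -dataex- is not found: ssc install dataex
    * Prints the data as -input- statements that others can paste and run
    dataex CCODE GDPPC INFMORT WATER SANIT

    The output is then pasted into the post between code delimiters.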

    With only 39 observations and three regressors plus a constant, you don't have much power. Even modest collinearity can mess up your estimates. With the three regressors pairwise correlated above .7, it is hard to separate their effects with so few observations.
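
    One way to put numbers on the collinearity point is sketched below; it is an illustration, not a definitive analysis. The first part needs no data: it rebuilds the regressor correlation matrix from the correlate output in #1 (which uses the same 39 observations as the regression) and reads the variance inflation factors off the diagonal of its inverse. The second part assumes the poster's dataset is in memory with LGDPPC generated as in #1. The matrix names R and VIF are my own.

    Code:
    * Correlations among LGDPPC, WATER and SANIT, copied from -correlate- in #1
    matrix R = (1, .7342, .7664 \ .7342, 1, .8262 \ .7664, .8262, 1)
    matrix rownames R = LGDPPC WATER SANIT
    matrix colnames R = LGDPPC WATER SANIT

    * The diagonal of the inverse correlation matrix holds each regressor's VIF
    matrix VIF = vecdiag(invsym(R))
    matrix list VIF

    * From here on the poster's data must be in memory
    regress INFMORT LGDPPC WATER SANIT
    estat vif

    * Joint test that the WATER and SANIT coefficients are both zero
    test WATER SANIT

    * Estimated difference between the two coefficients, with its confidence interval
    lincom WATER - SANIT

    If WATER and SANIT are jointly significant while WATER alone is not, that is the usual signature of two correlated regressors sharing explanatory power. The width of the lincom interval shows how precisely the data can rank the two effects.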


    • #3
      In fact, James Hodkinson asked the question again at https://www.statalist.org/forums/for...form-selection, and further discussion followed there.
