  • What to do when a scatter plot shows no association with variables that are clearly associated?

    Hello,

    I have just started using Stata for my econometrics course. I am trying to build an OLS model of the determinants of income. As explanatory variables I chose education (recoded into 4 dummy variables by level of education attained), place of living (recoded into 3 dummy variables by size of place, i.e. 1 = countryside, 2 = town, 3 = big city), gender (0 = male, 1 = female), marital status (0 = not married, 1 = married) and age (restricted to age > 18). From the distribution of income (in thousands) I know it is worth considering the log of income, but that barely changed my analysis, if it changed anything at all. All of the chosen explanatory variables seem reasonable; however, when I plot, for instance, income against education, I obtain this result:
    [Attached image: graf.png]
    When I plot education against place of living, I obtain more or less the same graph. However, when I examine education or any other variable with the tab command, all the data look fine. I am overwhelmed by the task since I have just started using this software. I would be extremely grateful for any advice on how to fix this issue, and for your patience. I am very eager to learn, but I clearly have some trouble understanding what I am doing. I hope it is easier to solve than it seems.
    If needed, here is how I created my dummy variables:

    **THE PLACE OF LIVING**
    recode domicil 1=5 2=4 4=2 5=1
    label variable domicil "pl_living"
    label define pl_living 1 "Farm or home in countryside" 2 "Country village" 3 "Town or small city" 4 "Suburbs or outskirts of big city" 5 "big city"
    label values domicil pl_living
    codebook domicil
    replace domicil=1 if domicil>=1 & domicil<=2 //countryside
    replace domicil=2 if domicil>=3 & domicil<=4 //town
    replace domicil=3 if domicil==5 //big city
    tab domicil, gen(domicil)


    **THE LEVEL OF EDUCATION BASED ON POLISH SCHOOLING SYSTEM, NO EDUCATION DROPPED**
    drop if edlvgpl==1

    replace edlvgpl =1 if edlvgpl==2 | edlvgpl==3 | edlvgpl==4
    replace edlvgpl =2 if edlvgpl==6 | edlvgpl==5
    replace edlvgpl =3 if edlvgpl==7 | edlvgpl==8 | edlvgpl==9 | edlvgpl==10 | edlvgpl==11 | edlvgpl==12
    replace edlvgpl =4 if edlvgpl==13 | edlvgpl==14 | edlvgpl==15
    tab edlvgpl, gen(edlvgpl)

  • #2
    The problem with this graph is that you don't get an indication of how many data points there are for each pair of categories.

    You could try the equivalent of

    Code:
    scatter income education, xla(1/4) yla(1/10, ang(h)) jitter(3)
    but I recommend instead a two-way bar chart (search this forum for mentions of tabplot from the Stata Journal, for example).
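
    To make the two-way bar chart idea concrete, here is a minimal sketch rather than a definitive recipe. It assumes the variables are called income and education, as in the scatter command above; those names are placeholders, so substitute the names in your own dataset. It also assumes tabplot is installed from SSC, since it is a community-contributed command and not part of official Stata.

    ```stata
    * tabplot is community-contributed; install it once from SSC
    ssc install tabplot

    * bar heights show how many observations fall in each
    * income-education cell (placeholder variable names)
    tabplot income education, showval

    * percents within each education level make it easier to see
    * whether the income distribution shifts with education
    tabplot income education, percent(education) showval
    ```

    Unlike a plain scatter plot of two categorical variables, where every occupied cell is drawn as a single marker, the bars here carry the cell frequencies, which is the information missing from the original graph.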

    If I were working with your data, I wouldn't categorize income; I would work with the logarithm of income instead.
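
    As a rough sketch of what working with log income might look like, assuming you have a continuous income variable named income and a categorical education variable named education (both names are assumptions, not taken from your dataset):

    ```stata
    * keep income continuous and take logs;
    * zero or negative incomes are left as missing
    generate ln_income = ln(income) if income > 0

    * compare the distribution of log income across education levels
    graph box ln_income, over(education)
    ```

    A box plot per education level shows both the shift in the typical value and the spread within each level, which a scatter plot of categorized income cannot show.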
