What to do when a scatter plot shows no association with variables that are clearly associated?

Wiktoria Aleksandra

Join Date: Jan 2022

Posts: 2
#1

04 Jan 2022, 12:35

Hello,

I have just started using Stata for my econometrics course. I am trying to build an OLS model examining the determinants of income. As explanatory variables I chose education (recoded into 4 dummy variables according to the level of education obtained), place of living (recoded into 3 dummy variables according to the size of the place, i.e. 1 = countryside, 2 = town, 3 = big city), gender (0 = male, 1 = female), marital status (0 = not married, 1 = married) and age (I only consider age > 18). Also, from the distribution of income (in thousands) I know it is worth considering the log of income, but that barely changed my analysis, if it changed anything at all. All of the chosen explanatory variables seem reasonable; however, when I try to plot, for instance, income against education, I obtain this result:

[Image not preserved: scatter plot of income against education]

When I try to plot education against place of living, I obtain more or less the same graph. However, when I examine education or any other variable with the tab command, all the data look fine. I am overwhelmed by the task since I have just started using this software. I would be extremely grateful for any advice on how to fix this issue, and for your patience. I am very eager to learn, but I clearly have some trouble understanding what I am doing. I hope it is easier to solve than it seems.
If needed, here is how I created my dummy variables:

**THE PLACE OF LIVING**
recode domicil 1=5 2=4 4=2 5=1
label variable domicil "pl_living"
label define pl_living 1 "Farm or home in countryside" 2 "Country village" 3 "Town or small city" 4 "Suburbs or outskirts of big city" 5 "big city"
label values domicil pl_living
codebook domicil
replace domicil=1 if domicil>=1 & domicil<=2 //countryside
replace domicil=2 if domicil>=3 & domicil<=4 //town
replace domicil=3 if domicil==5 //big city
tab domicil, gen(domicil)

**THE LEVEL OF EDUCATION BASED ON POLISH SCHOOLING SYSTEM, NO EDUCATION DROPPED**
drop if edlvgpl==1

replace edlvgpl =1 if edlvgpl==2 | edlvgpl==3 | edlvgpl==4
replace edlvgpl =2 if edlvgpl==6 | edlvgpl==5
replace edlvgpl =3 if edlvgpl==7 | edlvgpl==8 | edlvgpl==9 | edlvgpl==10 | edlvgpl==11 | edlvgpl==12
replace edlvgpl =4 if edlvgpl==13 | edlvgpl==14 | edlvgpl==15
tab edlvgpl, gen(edlvgpl)
Nick Cox

Join Date: Mar 2014

Posts: 35698
#2

04 Jan 2022, 14:41

The problem with this graph is that you don't get an indication of how many data points there are for each pair of categories: observations with the same pair of values are plotted on top of each other.

You could try the equivalent of

Code:

scatter income education, xla(1/4) yla(1/10, ang(h)) jitter(3)

but I recommend instead a two-way bar chart (search this forum for mentions of tabplot from the Stata Journal, for example).
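A minimal sketch of the two-way bar chart suggestion, for readers who have not used tabplot before. It assumes the community-contributed tabplot command is installed (here from SSC, which may not be the most recent version) and that income and education are stand-ins for the actual variable names in the dataset, as in the scatter command above.

```stata
* Sketch only: income and education are assumed variable names
ssc install tabplot, replace       // one-time install of the community-contributed command

* One bar per pair of categories, with bar height showing the frequency
tabplot income education

* The same chart with the frequencies also printed as numbers
tabplot income education, showval
```

Unlike the plain scatter plot, the bar heights show how many observations fall into each combination of categories.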

If I were working with your data, I wouldn't categorize income; I would work with the logarithm of income.
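A rough sketch of working with the logarithm of income, under the assumption that a continuous, uncategorized income variable is available in the dataset. The names income, education and log_income are hypothetical placeholders, and the box plot is just one possible way to compare groups.

```stata
* Sketch only: assumes income is continuous; all variable names are placeholders
generate log_income = ln(income) if income > 0   // the log is not defined for zero or negative incomes

* Compare the distribution of log income across education levels
graph box log_income, over(education)
```

Observations with zero or negative income get a missing log_income and so drop out of the plot.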