Data cleaning

Sara Zakaryan

Join Date: Mar 2016

Posts: 30
#1

05 Mar 2016, 13:06

Hello everyone. I have an issue with data management in Stata. Can you please give some hints on how to judge whether a variable is useful for regression when you have only its values and no other information or description of the variable? The target variable is binary, so the model should be a logistic one.
Sara Zakaryan

Join Date: Mar 2016

Posts: 30
#2

05 Mar 2016, 13:34

There are a lot of missing values in each variable, so I want to know which variables have some explanatory power before I start imputing the missing values. I have more than 200 variables, so maybe there are some bar charts that can help to see trends with the dependent variable? I can't find anything that gives a visual signal. Any ideas? I don't think working on each variable individually is practical.
Friedrich Huebler

Join Date: Apr 2014

Posts: 1053
#3

05 Mar 2016, 20:05

Welcome to Statalist. I think you have to do some reading. Here is a start: http://www.ats.ucla.edu/stat/stata/w.../statareg1.htm

Please also review the FAQ for advice on how to ask questions on Statalist.
Sara Zakaryan

Join Date: Mar 2016

Posts: 30
#4

06 Mar 2016, 12:14

Dear Friedrich,
Thank you for your reply. I think I did not formulate my question clearly, and I am sorry for that.
I posted here after a week of reading and after taking my master's course in econometrics, and I am sure that the basics of Stata would not help with the issues I have now.
I am quite sure that multiple imputation, handling missing values, and working with big data blindly, without knowing what the variables mean, make the analysis very difficult. In the general forum, questions such as how to reshape data or how to code time variables were raised and were welcomed.
Again, sorry if I was not clear and my question seemed irrelevant.

Thank you for your help
Sara
Friedrich Huebler

Join Date: Apr 2014

Posts: 1053
#5

06 Mar 2016, 16:03

Was the web page I mentioned in post #3 not useful? Could you please confirm?

I am sorry, but I don't see a question in your last post. What you have written so far is far too vague for anyone who would like to help you.

Originally posted by Sara Zakaryan

Can you please give some hints on how to judge whether a variable is useful for regression when you have only its values and no other information or description of the variable?

Originally posted by Sara Zakaryan

so maybe there are some bar charts that can help to see trends with the dependent variable? I can't find anything that gives a visual signal. Any ideas?

Please read the FAQ and try to formulate a question that we can answer. The members of Statalist are very helpful but you have to give us something more concrete to work with.
Sara Zakaryan

Join Date: Mar 2016

Posts: 30
#6

07 Mar 2016, 00:34

I attached a document with a screenshot of part of my data. There are a lot of variables, and the dataset has more than 100,000 observations. I need to make predictions, probably with a logistic regression, because the dependent variable is the binary target.
Now I need to deal with missing values. They make up about 40% of my data. Deleting them does not seem to be a good option, and neither does replacing them with means or modes, because the number of missing values is huge; for example, variable 1 has about 40,000 missing values.
I am trying to find a way to understand whether a variable has some impact on the probability of the target being 0 or 1, and then to choose only those variables that have predictive power.
One option I could think of is to draw frequency plots that show how the frequency of 1 changes along the values of the first variable; if it is close to constant, the variable is probably not useful.
How can I graph something like the one in the attachment?
And is this a correct way to choose between variables?
The main reason for doing this is to have fewer variables and missing values to work on, because I don't have economic definitions of the variables to guide the choice.
I hope this was clear this time, and I will be very grateful for some help.
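
A plot of the kind described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the data are simulated, and the names target, var1, miss_var1 and var1_bin are hypothetical, not taken from the actual dataset. It first reports how much is missing, then checks whether missingness itself is related to the outcome, and finally plots the share of 1s within deciles of the predictor.

```stata
* Simulated stand-in data; target, var1, miss_var1, var1_bin are hypothetical names
clear
set seed 12345
set obs 1000
generate double var1 = rnormal()
generate byte target = runiform() < invlogit(0.8*var1)
replace var1 = . if runiform() < 0.4        // roughly 40% missing, as in the post

* How much is missing in each variable?
misstable summarize

* Is missingness itself related to the outcome?
generate byte miss_var1 = missing(var1)
tabulate miss_var1 target, row

* Share of target == 1 within deciles of var1 (missing values get no bin)
xtile var1_bin = var1, nquantiles(10)
graph bar (mean) target, over(var1_bin) ytitle("Share with target = 1")
```

A roughly flat set of bars would suggest that var1 carries little information about the outcome on its own, although one-variable screening of this kind can miss variables that matter only in combination with others.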

Thanks

Attached Files

data and graphs.docx (402.3 KB, 1 view)
Friedrich Huebler

Join Date: Apr 2014

Posts: 1053
#7

07 Mar 2016, 08:15

Sara, either you have not read the FAQ or you are reluctant to follow the advice given there. Here are some excerpts from section 12:

We can understand your dataset only to the extent that you explain it clearly. For example, it may help to show the results of describe to explain your variable names and types.

Stata graphs or other images should be posted as .png file attachments (start with the Clipboard icon).

In particular, please do not post screenshots. Many members will not be able to read them at all; they usually can't be read easily; and they do not allow copy and paste of data or code, which is highly desirable to allow experienced members to make precise suggestions for your questions.

In addition, many list members will not download attachments in Word format because of the risk of malware.

I am going to move on and will leave you with some excerpts from section 17 of the FAQ.

Why did my question not get answered?
We do not have the knowledge of your project needed to work out the best thing to do in your circumstances, and, in any case, it is really your call.

Whether what you are doing is “correct” is very difficult to discuss helpfully.

Your question is too unclear or too complicated to understand. For example, questions on very complicated data-management tasks or large chunks of code that are not working may ask too much.

Perhaps someone else can help you. If not, please read the FAQ and see how other list members describe their problems.

Last edited by Friedrich Huebler; 07 Mar 2016, 08:35.
Sara Zakaryan

Join Date: Mar 2016

Posts: 30
#8

08 Mar 2016, 02:59

Thanks, I have read everything in the FAQ. I will try to solve everything myself; if I am not successful, I will post again following all the rules.

Khalid Atinoaga Compaore

Join Date: Apr 2022
Posts: 16

19 Apr 2022, 04:58

Good day everyone,
I am using Stata 16.0.

Objective:

Provide background information on the data (by describing and pointing out all the necessary features of the data)
Pooled OLS (I need to run this; I know it is not a first choice in panel analyses, but it is for practice)

Then use the models below to check the model fit:

Fixed and random effects models (test for the appropriate model)
Panel IV estimation
Dynamic panel models

The countries are Austria, Belgium, Denmark, France, Germany, Greece, Ireland, Italy, Luxembourg, Netherlands, Portugal, Spain, Sweden and the United Kingdom.
(These countries do not appear by name in the data, only as numeric codes.)

1).

Code:

 des country hid hg015 hd001 year wave pid

              storage   display    value
variable name   type    format     label      variable label
----------------------------------------------------------------------------------
country         float   %8.0g                 
hid             long    %12.0g                
hg015           str4    %4s                   
hd001           byte    %8.0g                 
year            float   %9.0g                 
wave            byte    %8.0g                 
pid             double  %10.0g

2).

Code:

sum $xlist country hid hg015 hd001 year wave pid

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
     country |    269,423    10.24544     9.80832          1         55
         hid |    269,423    6.53e+07    2.08e+08        101   1.45e+09
       hg015 |          0
       hd001 |    225,667    3.371955    1.464971          1         16
        year |    269,423    1999.977    .8191261       1999       2001
-------------+---------------------------------------------------------
        wave |    269,423    6.977485    .8191261          6          8
         pid |    269,423    6.04e+08    2.07e+09       1102   1.45e+10
3).

Code:

list flag country_id in 1/10

     +-----------------+
     | flag   countr~d |
     |-----------------|
  1. |   11         11 |
  2. |   11         11 |
  3. |   12         12 |
  4. |   12         12 |
  5. |    3          3 |
     |-----------------|
  6. |    3          3 |
  7. |    3          3 |
  8. |   10         10 |
  9. |    3          3 |
 10. |    3          3 |
     +-----------------+

Dofile

Code:

use echp99_00_01_new.dta
preserve
* select variables of interest and drop the rest

keep lnhwage weekhours hid wave pid pg007 nchild0_2 nchild3_5 female school occup health age age2 country year

* xtset country year
*   -> error: repeated time values within panel
xtset country
*   -> panel variable:  country (unbalanced)

* generate group_id (to figure out the issue of "repeated time values within panel")

egen country_id=group(country)
egen flag=group(country_id)
list flag country_id in 1/10
sum $xlist


* Panel summary statistics: within and between variation for some variables

xtsum

* pooled OLS

quietly reg lnhwage occup weekhours school female age nchild0_2 pid year hid country
estimates store POLS

quietly reg lnhwage occup weekhours school female age nchild0_2 pid year hid country, robust
estimates store OLS_rob

* fixed effects
quietly xtreg lnhwage occup weekhours school female age nchild0_2 pid year hid country, fe
estimates store fix_eff

* random effects (xtreg, not reg, accepts the re option)
quietly xtreg lnhwage occup weekhours school female age nchild0_2 pid year hid country, re
estimates store rand_eff
estimates table OLS_rob fix_eff rand_eff, b p se stats(N r2)

* Testing the model with other instrumental variables

xtreg lnhwage occup weekhours school female age nchild0_2 pid year hid country (nchild3_5 age2), fe

xtreg lnhwage occup weekhours school female age nchild0_2 pid year hid country (nchild3_5 age2), re

In trying to achieve the above objectives, would this approach be the way to go? I am certain that on this platform I will get the needed advice and direction.
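
For reference, the message "repeated time values within panel" typically appears when the declared panel variable does not identify one observation per time period; here each country contains many persons in each year. A minimal sketch of a person-level setup is shown below. It runs on simulated data, not the ECHP file: z and u_i are hypothetical, the other names follow the post, and it assumes pid identifies each person uniquely (if pid repeats across countries, an identifier built with egen and group(country pid) would be needed instead). The last line shows the xtivreg form, in which the endogenous regressor and its instruments go inside the parentheses.

```stata
* Simulated stand-in panel; z and u_i are hypothetical, other names follow the post
clear
set seed 2022
set obs 500                                  // 500 simulated persons
generate long pid = _n
generate double u_i = rnormal()              // unobserved person effect
expand 3                                     // three waves per person
bysort pid: generate int year = 1998 + _n    // 1999, 2000, 2001
generate double z = rnormal()                // hypothetical instrument
generate double weekhours = 35 + 3*z + 2*u_i + rnormal()
generate double lnhwage = 1 + 0.01*weekhours + u_i + rnormal()

* Declare the person, not the country, as the panel unit
xtset pid year

* Fixed versus random effects, compared with a Hausman test
xtreg lnhwage weekhours, fe
estimates store fix_eff
xtreg lnhwage weekhours, re
estimates store rand_eff
hausman fix_eff rand_eff

* Panel IV: xtivreg depvar [exog] (endog = instruments), fe
xtivreg lnhwage (weekhours = z), fe
```

Which regressor is treated as endogenous, and which variables are valid instruments, is a substantive choice that the sketch does not settle.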

Thank you
Atinoaga
BR
