  • Data cleaning

    Hello everyone. I have an issue with data management in Stata. Could you please give me some hints on how to judge whether a variable is useful for a regression when you have only its values and no other information or description of the variable? The target variable is binary, so the model should be a logistic one.

  • #2
    There are a lot of missing values in each variable, so I want to know which variables explain something before I start imputing the missing values. I have more than 200 variables, so maybe there are some bar charts that can help to see trends with the dependent variable? I can't find anything that would give a visual signal. Any ideas? I don't think working on each variable one by one is meaningful.
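A first pass over 200+ variables does not have to be done by hand. As a minimal sketch (it makes no assumption about variable names and simply loops over everything in memory), the percentage of missing values per variable could be printed like this:

```stata
* Sketch only: print the percentage of missing values for every variable.
* missing() works for both numeric and string variables.
foreach v of varlist _all {
    quietly count if missing(`v')
    display "`v'" _col(35) %6.1f (100 * r(N) / _N) " % missing"
}
```

Variables that are almost entirely missing can then be set aside before any imputation is considered. The 35-column offset is just a layout choice for the printed table.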



    • #3
      Welcome to Statalist. I think you have to do some reading. Here is a start: http://www.ats.ucla.edu/stat/stata/w.../statareg1.htm

      Please also review the FAQ for advice on how to ask questions on Statalist.



      • #4
        Dear Friedrich,
        Thank you for your reply. I think I did not formulate my question clearly, and I am sorry for that.
        I posted here after a week of reading and after taking my master's course in econometrics, and I am sure that the basics of Stata won't help me with the issues I have now.
        I am quite sure that multiple imputation, missing-value management, and working with big data blindly, without knowing what the variables mean, make the analysis very difficult. In the general forum, questions such as how to reshape data or how to code time variables were raised and were well received.
        I will repeat: sorry if I was not clear and my question seemed irrelevant.

        Thank you for your help
        Sara



        • #5
          Was the web page I mentioned in post #3 not useful? Could you please confirm?

          I am sorry, but I don't see a question in your last post. What you have written so far is far too vague for anyone who would like to help you.

          Originally posted by Sara Zakaryan:
          can you please give some hints what you do to understand whether the variable is useful for regression or not if you have only values and no other information and description of the variable.
          Originally posted by Sara Zakaryan:
          so maybe there are some bar charts that can help to see trends with dependent variable? I can't find what can visually signal....any idea?
          Please read the FAQ and try to formulate a question that we can answer. The members of Statalist are very helpful but you have to give us something more concrete to work with.



          • #6
            I attached a document containing a screenshot of part of my data. There are a lot of variables, and the dataset has more than 100,000 observations. I need to make predictions, probably with a logistic regression, because the dependent variable is the binary target.
            Now I need to deal with the missing values. They make up about 40% of my data, so deleting them doesn't seem to be a good option, and neither does replacing them with means or modes, because the number of missing values is huge: in variable 1, for example, about 40,000 values are missing.
            I am trying to find a way to tell whether a variable has some impact on the probability of the outcome being 0 or 1, and then to keep only the variables that have predictive power.
            The one option I could think of is some kind of frequency plot that shows how the frequency of 1 changes along the values of the first variable; if it is roughly constant, then the variable is probably not useful.
            How can I graph something like the one in the attachment?
            And is this a correct way to choose between variables?
            The main reason for doing this is to have fewer variables and fewer missing values to work on, because I don't have economic definitions of what the variables mean and so cannot choose on that basis.
            I hope this was clear this time, and I would be very grateful for some help.
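One way to draw the kind of plot described above, offered only as a sketch: the names `target` (the 0/1 outcome) and `var1` (one of the unlabeled predictors) are hypothetical, and ten bins is an arbitrary choice. The idea is to cut the predictor into equal-frequency bins and plot the share of ones in each bin, with missing values of the predictor shown as a bar of their own.

```stata
* Sketch only. Hypothetical names: target (0/1 outcome), var1 (one predictor).

* Ten equal-frequency bins of var1; observations with missing var1 stay missing
xtile var1_bin = var1, nquantiles(10)

* The mean of a 0/1 variable is the share of ones, shown here per bin.
* The missing option adds a bar for observations where var1 is missing.
graph bar (mean) target, over(var1_bin) missing ///
    ytitle("Share of observations with target = 1")
```

Roughly flat bars would suggest little bivariate association with the outcome, but that alone does not settle whether the variable matters once other predictors are in the model. The `///` line continuation works in a do-file, not when typing in the Command window.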

            Thanks



            • #7
              Sara, either you have not read the FAQ or you are reluctant to follow the advice given there. Here are some excerpts from section 12:

              We can understand your dataset only to the extent that you explain it clearly. For example, it may help to show the results of describe to explain your variable names and types.

              Stata graphs or other images should be posted as .png file attachments (start with the Clipboard icon).

              In particular, please do not post screenshots. Many members will not be able to read them at all; they usually can't be read easily; and they do not allow copy and paste of data or code, which is highly desirable to allow experienced members to make precise suggestions for your questions.
              In addition, many list members will not download attachments in Word format because of the risk of malware.

              I am going to move on and will leave you with some excerpts from section 17 of the FAQ.

              Why did my question not get answered?
              • We do not have the knowledge of your project needed to work out the best thing to do in your circumstances, and, in any case, it is really your call.
              • Whether what you are doing is “correct” is very difficult to discuss helpfully.
              • Your question is too unclear or too complicated to understand. For example, questions on very complicated data-management tasks or large chunks of code that are not working may ask too much.
              Perhaps someone else can help you. If not, please read the FAQ and see how other list members describe their problems.
              Last edited by Friedrich Huebler; 07 Mar 2016, 08:35.



              • #8
                Thanks, I have read everything in the FAQ. I will try to solve it all myself; if I am not successful, I will post again, following all the rules.



                • #9
                  Good day everyone,
                  I am using Stata 16.0.

                  Objective:
                  • Provide background information on the data (by describing and pointing out all the necessary features of the data)
                  • Pooled OLS (I need to run this; I know it is not a first choice in panel analyses, but it is for practice)
                  Using the models below to check the model fit:
                  • Fixed and random effects models (test for the appropriate model)
                  • Panel IV estimation
                  • Dynamic panel models
                  The countries are Austria, Belgium, Denmark, France, Germany, Greece, Ireland, Italy, Luxembourg, Netherlands, Portugal, Spain, Sweden and the United Kingdom.
                  (These countries do not appear by name in the data, only as numeric codes.)

                  1).
                  Code:
                   des country hid hg015 hd001 year wave pid
                  
                                storage   display    value
                  variable name   type    format     label      variable label
                  ----------------------------------------------------------------------------------
                  country         float   %8.0g                 
                  hid             long    %12.0g                
                  hg015           str4    %4s                   
                  hd001           byte    %8.0g                 
                  year            float   %9.0g                 
                  wave            byte    %8.0g                 
                  pid             double  %10.0g
                  2).
                  Code:
                  sum $xlist country hid hg015 hd001 year wave pid
                  
                      Variable |        Obs        Mean    Std. Dev.       Min        Max
                  -------------+---------------------------------------------------------
                       country |    269,423    10.24544     9.80832          1         55
                           hid |    269,423    6.53e+07    2.08e+08        101   1.45e+09
                         hg015 |          0
                         hd001 |    225,667    3.371955    1.464971          1         16
                          year |    269,423    1999.977    .8191261       1999       2001
                  -------------+---------------------------------------------------------
                          wave |    269,423    6.977485    .8191261          6          8
                           pid |    269,423    6.04e+08    2.07e+09       1102   1.45e+10
                  3).
                  Code:
                  list flag country_id in 1/10
                  
                       +-----------------+
                       | flag   countr~d |
                       |-----------------|
                    1. |   11         11 |
                    2. |   11         11 |
                    3. |   12         12 |
                    4. |   12         12 |
                    5. |    3          3 |
                       |-----------------|
                    6. |    3          3 |
                    7. |    3          3 |
                    8. |   10         10 |
                    9. |    3          3 |
                   10. |    3          3 |
                       +-----------------+
                  Do-file
                  Code:
                  use echp99_00_01_new.dta
                  preserve

                  * select variables of interest and drop the rest
                  keep lnhwage weekhours hid wave pid pg007 nchild0_2 nchild3_5 female school occup health age age2 country year

                  xtset country year
                  * Stata's response: repeated time values within panel
                  xtset country
                  * Stata's response: panel variable: country (unbalanced)

                  * generate group_id (to figure out the issue of "repeated time values within panel")
                  egen country_id = group(country)
                  egen flag = group(country_id)
                  list flag country_id in 1/10
                  sum $xlist

                  * panel summary statistics: within and between variation for some variables
                  xtsum

                  * pooled OLS
                  quietly reg lnhwage occup weekhours school female age nchild0_2 pid year hid country
                  estimates store POLS
                  quietly reg lnhwage occup weekhours school female age nchild0_2 pid year hid country, robust
                  estimates store OLS_rob

                  * fixed and random effects
                  quietly xtreg lnhwage occup weekhours school female age nchild0_2 pid year hid country, fe
                  estimates store fix_eff
                  quietly xtreg lnhwage occup weekhours school female age nchild0_2 pid year hid country, re
                  estimates store rand_eff
                  estimates table OLS_rob fix_eff rand_eff, b p se stats(N r2)

                  * testing the model with other instrumental variables
                  xtreg lnhwage occup weekhours school female age nchild0_2 pid year hid country (nchild3_5 age2), fe
                  xtreg lnhwage occup weekhours school female age nchild0_2 pid year hid country (nchild3_5 age2), re
                  In trying to achieve the above objectives, is this approach the way to go? I am sure that on this platform I will get the advice and direction I need.
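For the "test for the appropriate model" step, a Hausman test is one common option. The lines below are only a sketch under stated assumptions: that `pid` identifies a person within a country, that `wave` is the time variable, and that each person appears at most once per wave (none of this is confirmed in the thread). The variable `person` is a hypothetical helper, and the identifier variables are left out of the regressor list here because ID numbers have no meaningful slope.

```stata
* Sketch only. Assumes one row per person and wave; person is a hypothetical id.
egen long person = group(country pid)
xtset person wave

* Fixed effects and random effects on the same regressors
quietly xtreg lnhwage occup weekhours school female age nchild0_2, fe
estimates store fix_eff
quietly xtreg lnhwage occup weekhours school female age nchild0_2, re
estimates store rand_eff

* Hausman test: the null hypothesis is that the random-effects estimator is consistent
hausman fix_eff rand_eff
```

If `xtset` still reports repeated time values, the assumed person-wave structure does not hold and the identifiers need a closer look first. For the panel IV step, `xtivreg` expects the instrumented part in the form `(endogenous_vars = instruments)`; which variables belong on each side is a modeling decision the data alone cannot make.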

                  Thank you
                  Atinoaga
                  BR




