Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Is there a way to replace observations by missings if the number of missings per country per year are greater than a certain threshold?

    Hello

    Sorry for this long title, I do not know if it's very clear.
    I am cleaning the World Value Survey database. It's a panel dataset, with a lot of variables. For some variables, there are a lot of missing values.
    I want to compute the average of these variables, to merge them with a macroeconomic database at the country year level.
    However, for my averages to be relevant, for each variables I need to have a certain threshold of non missing data per country per year (let's say less than 75% missings).
    Therefore I was thinking about replacing by missing values all observations in each country year if the number of missings exceeds 75%.

    I read and used this post, which is quite close to what I would like.
    https://www.statalist.org/forums/for...issing-in-data

    I tried a lot of things, but nothing works.
    For example I tried:

    * Generate some data
    input x y
    1 1
    1 1
    1 .
    . .
    end
    * Percent missing above which we'll replace observations by missings
    glo p=0.75
    * Loop over variables
    foreach var of varlist A* C* E* F* G* H* X* Y* {
    count if `var'==., by (Origin year)
    if (r(N)/_N) >= $p replace `var'==., by (Origin year)
    }

    But it does not work. Same for the other option proposed by Nick Cox in the post I linked.

    I hope my problem is clear, it's quite difficult to describe it by text.

    Best,
    Killian Foubert
    Last edited by Killian Foubert; 07 Jun 2018, 08:57.

  • #2
    We can't help you because "it does not work" is uninformative. You'll increase your chances of a useful answer by following the FAQ on asking questions - provide Stata code in code delimiters (yes, exactly what you ran so we can run it), readable Stata output, and sample data using dataex.

    The most obvious error in your code is"if (r(N)/_N) >= $p replace `var'==., by (Origin year)" If you want to restrict the operation to a subset of data, the if condition has to go right before the comma. I also don't see what the by(origin,year) is doing here. I also suspect your count statement is wrong. You want something like
    by origin year: egen nummis=count(`var'==.)
    by origin year: egen numobs=count(year)
    g ratio=nummis/numobs
    replace `var'=. if ratio >.75

    Comment

    Working...
    X