Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Converting missing values into 0 for some columns with conditions

    Dear community members,

    I would like to convert NA values (.) into 0 if a column (variable) contains no 0 value at all across whole observations for some columns. So, my logic is like this:

    1. I am going to use a for loop so that this task is repeated from “n”th column to “p”th column.
    2. For each column in range, I would like to use two pieces of information. The first is whether a column has at least one 0 value. If Yes, all missing values should remain as it is. Here is a second thing. If not, I would like to convert missing values to 0s in that specific column.
    3. I would like to repeat it for “p-n+1” times, meaning from column "p" to column "n".

    Here are my detailed questions:
    1. Is there any way to access “n”th column from dataset? For example, in R, we can access by entering dataset[ , n].
    2. As for replacement procedure, I would like to use the one like this: replace Varname = 0 if (count if Varname == 0) == 0 & Varnmae == . But, obviously it is not working. It seems that Stata is not supporting the feature to use a result of value, such as count if Varname == 0 as a condition. Would there be a way to handle this issue?

    To help you better understand my situation and for the users who is familiar with R and Stata, I attached my lines of codes for R.

    for (i in n:p) { * The for loop for each variable
    for (j in 1:nrow(dataset) { * The for loop for each observation in a specific variable
    if (sum(dataset[ , i] == 0) == 0 & dataset[j, i] == .) {
    data[j, i] = 0} else {
    data[j, i] = data[j, i]
    }
    }

    Any advice would be greatly appreciated.
    Last edited by John Charlse Doe; 20 Nov 2022, 11:00.

  • #2
    John:
    I'm not sure I got you right, but why not replacein the missing values with -999-, as in the following toy-example?;
    Code:
    . use "C:\Program Files\Stata17\ado\base\a\auto.dta"
    (1978 automobile data)
    
    . foreach var of varlist price-foreign {
      2. 
    .                 replace `var'=999 if `var'==.
      3. 
    .         }
    (0 real changes made)
    (0 real changes made)
    (5 real changes made)
    (0 real changes made)
    (0 real changes made)
    (0 real changes made)
    (0 real changes made)
    (0 real changes made)
    (0 real changes made)
    (0 real changes made)
    (0 real changes made)
    
    .
    Two further comments:
    1) converting missing values in zeros increase the chance of making mistakes in statistics, because an hypothetical -if- clause would rule out from calculation all the zeros, including those perfectly legal. Therefore, using a value less likely to be observed, such as -999- is probably safer;
    2) Stata adopts the listwise deletion. Hence, if you're concerned about the possible bearing of missing values on the results of a given statistical procedures, you run no risk at all in thist respect, because Stata deletes all the observations with at least one missing value in any variable.
    Kind regards,
    Carlo
    (Stata 19.0)

    Comment


    • #3
      It is unusual, to say the least, to refer to variables by their ordinal number in the data set. I suppose if you are working on a series of data sets which contain different variables and yet, through some mysterious process, the 5th through 18th variables are the relevant ones in each case, then coding in this way would be sensible. But barring some truly unusual circumstance like that, this would not be a particularly effective way to approach this problem in Stata. So I'm going to disregard your proposed approach (that might well be the best way to do this in R, I wouldn't know) and instead suggest a Stata-ish way to achieve the goal of replacing missing values by 0 in a range of variables when the variable has no actual 0 values.

      One issue that has to be confronted is the possibility that the range of variables will include some string variables. One either has to complicate the code considerably to check for this on each variable, or, better, exclude the string variables in the first place. So the code will be something like this:

      Code:
      ds first_variable-last_variable, has(type numeric) // FILTER OUT STRING VARIABLES
      foreach v of varlist `r(varlist)' {
          capture assert `v' != 0    // CHECK WHETHER VARIABLE IS EVER 0
          if c(rc) { // VARIABLE IS NEVER 0; REPLACE
              replace `v' = 0 if missing(`v')
          }
      }
      This code is untested because no example data was provided. The gist of it is correct; I cannot exclude the possibility of typos. Evidently, you need to replace first_variable and last_variable by the actual names of the first and last variables of interest. In the future, when requesting help with code, it is best to show example data. And that should be done using the -dataex- command. If you are running version 17, 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

      If you really are in the situation where, at programming time, you know the order of the variables in the data set but not their names, then you can modify the code as follows:
      Code:
      // CODE TO HANDLE THE 12th THROUGH 19th VARIABLES IN A DATA SET
      unab all_vars: _all
      local first_variable: word 12 of `all_vars'
      local last_variable: word 19 of `all_vars'
      ds `first_variable'-`last_variable', has(type numeric)
      foreach v of varlist `r(varlist)' {
          capture assert `v' != 0    // CHECK WHETHER VARIABLE IS EVER 0
          if c(rc) { // VARIABLE IS NEVER 0; REPLACE
              replace `v' = 0 if missing(`v')
          }
      }
      Note that everything from -foreach- on down is the same as before. The first four lines are just handling the task of referring to variables by order instead of by name. Ultimately all Stata commands refer to variables by name, but, as you see, you can find the name from the order when needed.

      By the way, if you really are in a situation where you know the orders but not the names at the time you write the code, I'm curious how that arose. I've been in this business a long time and never encountered it.

      Added: Crossed with #2. And I strongly agree with Carlo that it is usually not a good idea to replace missing values with magic number codes in Stata, unless those missing values are, themselves, miscoding of actual values. (E.g. one often sees yes/no variables coded as 1/., whereas 1/0 works far better in Stata.)
      Last edited by Clyde Schechter; 20 Nov 2022, 11:22.

      Comment


      • #4
        Referring to Stata variables by column number is rarely needed or helpful, but it can be done.

        findname from Stata Journal can do that, by virtue of calling Mata. Or you can just tokenize a variable list and loop over the names you want., but see my point 1 below.

        Your question raises various doubts, beyond your numbered point #2 to which the short answer is that Stata can do what you want there, but the Stata syntax is a long way from your guess.

        1. You need to trap string variables for which the goal here does not apply.

        2. Numeric missings in Stata can also be any of .a to .z.

        3. Your R code seems pretty inefficient for R. My understanding is that a double for loop is rarely needed and that you surely don't need to evaluate sum(dataset[ , i] == 0) for every value in each column. But you will get better advice on that from others.

        4. In your code data[] and dataset[] are presumably meant to be the same array.

        Here is a sketch of what you're asking for with an example.

        Code:
        sysuse auto, clear
        
        findname, columns(1/12) local(vars) type(numeric)
        
        quietly foreach v of local vars {
            count if `v' == 0
            if r(N) == 0 {
                count if `v' == .
                noisily if r(N) > 0 {
                    di "`v'"
                    replace `v' = 0 if `v' == .
                }
            }
        }
        EDIT: I agree with just about everything in posts 2 and 3.
        Last edited by Nick Cox; 20 Nov 2022, 11:33.

        Comment


        • #5
          Dear All, truly appreciate kind guidance and advice.

          Comment

          Working...
          X