Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • What is it called when you copy and paste the value of a variable across all records of another key variable?

    My vocabulary has failed me. I used to refer to this as "indexing" a variable across values of another variable, but I believe I have been misusing that term.

    I think the best way to explain is with an example.

    I have data on human subjects. Participants are seen three times. Records are uniquely identified by subject_id and visit_date. Participant gender is collected only at the first visit. I need to "X" gender so that it appears on each record for that participant. What is the correct word to describe what I mean by X?

    Here's my data, and here's my code to do X:

    clear all
    input subject_id str20 visit_date str2 sex
    1 "19mar23" "F"
    1 "30mar23" ""
    1 "10apr23" ""
    2 "20mar23" "M"
    2 "31mar23" ""
    2 "8apr23" ""
    3 "22mar23" "F"
    3 "1apr23" ""
    end


    And then I use this code to do X (what would you call X?)

    list
    bysort subject_id: replace sex = sex[_n-1] if sex[_n-1] !=""
    bysort subject_id: replace sex = sex[_n] if sex[_N] !="" /* Just in case someone accidentally has "gender" entered at their second visit */
    list


    I'm not sure how to even search on this concept, but this is a technique I find I use frequently (where a baseline value needs to be applied to all records of the same participant) and if I'm worried I'm not using the right term to discuss it.

  • #2
    There are a few ways of naming this process. It can be called "spreading" the value. In the case, as yours, where the source value appears at the beginning of the data it can be called "carrying forward." (Similarly, if the source value appears at the end of the data, it can be called "carrying back.") Another term sometimes seen is "backfilling" the data (and it is used without regard to the direction.)

    Terminology aside, your code isn't quite right. -replace sex = sex[_n] if sex[_N] != ""- does nothing. That's be cause sex and sex[_n] refer to the same thing, so you are just replacing the existing value of sex (which might be missing) with itself. The best way to do this is to first sort the data to guarantee that the reference value you want to spread appears in the first observation. This is easy to do: with a string variable we know that the empty string sorts first. (With a numeric variable it is the opposite: missing value sorts last.) You want to put the non-missing value in the first observation. So:
    Code:
    gsort subject_id -sex
    by subject_id: replace sex = sex[1] if missing(sex)
    Some things to consider: what if for some subject_id there is more than one observation containing a non-missing value of sex? The above code leaves it as it was. Which is fine assuming that all the non-missing values for the subject_id are the same. But what if you have a person reported as "M" in one observation and "F" in another? That is almost certainly a data error. So after doing the above, you will want to check for this problem:
    Code:
    by subject_id (sex), sort: assert sex[_N] == sex[1]
    If all is well, this code produces no output. If there are contradictory values of sex for a subject_id, the command will provide an error message and terminate execution.

    Added: As for wanting to be able to search on the term, I tend to doubt that using this terminology will help you. There have been many posts raising this type of question, but people rarely use the terminology in their question. Usually they describe or show what they want, just as you did. I think that pretty much everybody who is familiar with the terminology also knows how to do it!
    Last edited by Clyde Schechter; 25 Mar 2024, 13:46.

    Comment


    • #3
      Ah, you're right, good catch - that is indeed a typo! That first [_n] should be [_N].

      And I'd never considered the assert command - I had convoluted code to check consistency, but that looks easier!

      Thanks for this, very helpful. I had assumed that maybe there was a known, common term for this sort of thing (because I do it frequently for the data I work with), but this makes sense.
      Last edited by Matt Price; 25 Mar 2024, 14:39.

      Comment

      Working...
      X