
  • Unique values from lots of duplicate values

    Hello,

    I have several variables, each of which has duplicate values. For each variable, I want to generate a new variable that lists one example for each group of duplicates.

    For instance, I know that the following code:

    Code:
    duplicates examples assets
    will list one example for each group of duplicates of the assets variable. But how do I create a new variable with these unique values?

    Best,
    A

  • #2
    What do you mean by creating a new variable with these unique values? A variable can only have one value in any observation. Perhaps you should show us a sample of your data, and then show us some hand-worked results of what you want to get.


    • #3
      Thanks for your reply, Clyde.

      I mean that I want to pick out all the distinct values in a variable and put them in a new variable. For instance, in the example below, x has duplicate values; I want to pick out all the distinct values of x and store them in a new variable y. As you can see, the new variable y has only 5 numbers (all the distinct values of x). All its other entries are missing values.

      I have also attached the Excel file.


      year    x   year2    y
      2001    5    2001    5
      2001    5    2002    7
      2002    7    2003    3
      2002    7    2004   10
      2003    3    2005   40
      2003    3       .    .
      2003    3       .    .
      2003    3       .    .
      2004   10       .    .
      2004   10       .    .
      2005   40       .    .
      2005   40       .    .
      2005   40       .    .


      • #4
        You can do that but you lose the alignment of variables within observations. A more Stataish way would be this:

        Code:
         
        bysort x : gen x_tag = _n == 1 
        list x if x_tag
        Spreadsheet files: Just say no. (FAQ Advice Section 12 please)
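
        As a minimal sketch of the same tagging idea, the egen function tag() can stand in for the bysort line above. The toy data below are typed in by hand from the example in #3, and the name x_tag is simply reused from the code above. Note that tag() leaves observations with missing x untagged unless its missing option is specified.

        ```stata
        * Toy data copied by hand from the example in #3
        clear
        input year x
        2001 5
        2001 5
        2002 7
        2002 7
        2003 3
        2003 3
        2003 3
        2003 3
        2004 10
        2004 10
        2005 40
        2005 40
        2005 40
        end

        * Mark exactly one observation for each distinct value of x
        egen x_tag = tag(x)

        * Show the distinct values; no observations are dropped
        list year x if x_tag

        * Count the distinct values of x
        count if x_tag
        ```

        Because nothing is dropped or moved, year and x stay aligned within each observation, and the same if x_tag qualifier can be added to other commands.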


        • #5
          It appears that your data always have the same value of x associated with a given value of year. I will assume you are starting with a Stata data set that contains year, x, and perhaps other variables, but not year2 and y.

          Code:
          // VERIFY ONLY ONE DISTINCT VALUE OF X PER YEAR
          by year (x), sort: assert x[1] == x[_N]
          
          // SAVE DATA
          tempfile holding
          save `holding'
          
          // ELIMINATE DUPLICATES AND EXTRANEOUS VARIABLES
          keep year x
          duplicates drop
          sort year
          rename year year2
          rename x y
          
          
          // MERGE BACK ORIGINAL DATA
          merge 1:1 _n using `holding', assert(match using) nogenerate
          That said, this sounds like a really terrible thing to do, and I'm wondering why you want it. You are creating associations between observations of year and x with values of year2 and y that have no apparent relevance to each other. The resulting data, it seems to me, will lead to all sorts of difficulties if you try to analyze it. Perhaps if you explain where you are going, there is a better approach. This really strikes me as a recipe for chaos.


          • #6
            Hi Nick, apologies about the spreadsheet and thanks for your reply.

            The reason why I want to generate new variables is that I want to plot the distinct values of the variable.

            I have 40 variables (columns) and about 30,000 rows of data for each variable. Each variable has duplicate values. But I want to plot only the distinct values of each variable.

            So I want to remove duplicates column-wise from all the variables. The following code does not do what I want, because it drops every observation that duplicates an earlier value of x. That would be fine if x were the only variable I was working with, but I have many other variables whose distinct values I also want.

            Code:
            duplicates drop x, force


            • #7
              What you want is entirely possible using code such as I used. I gave you an example of listing only the distinct values but the same idea could be used with any graph command. (You don't say what kind of plot you want.)

              See also http://www.stata-journal.com/article...article=dm0042 for a review of distinct observations.
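
              A rough sketch of how the tagging idea might extend to several variables in a loop follows. Everything specific here is an assumption made for illustration: the auto data shipped with Stata stand in for the real data set, rep78 and foreign stand in for the 40 variables, and histogram is only a placeholder, since the kind of plot wanted has not been stated.

              ```stata
              * Stand-in data; rep78 and foreign are placeholders for the real variables
              sysuse auto, clear

              foreach v of varlist rep78 foreign {
                  * Tag one observation per distinct nonmissing value of this variable
                  egen tag_`v' = tag(`v')

                  * Plot only the tagged observations (placeholder graph command)
                  histogram `v' if tag_`v', discrete frequency name(g_`v', replace)
              }
              ```

              Each variable gets its own tag, so no observations are dropped and the distinct values of one variable do not interfere with those of another.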
