Unique values from lots of duplicate values

Ananya Kotia

Join Date: Feb 2015

Posts: 22
#1

30 Apr 2015, 17:49

Hello,

I have several variables, each of which has duplicate values. For each variable, I want to generate a new variable that lists one example for each group of duplicates.

For instance, I know that the following code:

Code:

duplicates examples assets

will list one example for each group of duplicates of the assets variable. But how do I create a new variable with these unique values?

Best,
A
Tags: duplicates
Clyde Schechter

Join Date: Apr 2014

Posts: 29812
#2

30 Apr 2015, 19:08

What do you mean by creating a new variable with these unique values? A variable can only have one value in any observation. Perhaps you should show us a sample of your data, and then show us some hand-worked results of what you want to get.
Ananya Kotia

Join Date: Feb 2015

Posts: 22
#3

30 Apr 2015, 20:06

Thanks for your reply, Clyde.

I mean that I want to pick out all the distinct values in a variable and put them in a new variable. For instance, in the example below, x has duplicate values. I want to pick out all the distinct values of x and store them in a new variable y. As you can see, the new variable y has only 5 numbers (all the distinct values of x). All its other entries are missing values.

I have also attached the Excel file.

year   x    year2   y
2001   5    2001    5
2001   5    2002    7
2002   7    2003    3
2002   7    2004    10
2003   3    2005    40
2003   3            .
2003   3            .
2003   3            .
2004   10           .
2004   10           .
2005   40           .
2005   40           .
2005   40           .

Attached Files

duplicates.xlsx (12.6 KB, 1 view)
Nick Cox

Join Date: Mar 2014

Posts: 35226
#4

30 Apr 2015, 20:19

You can do that but you lose the alignment of variables within observations. A more Stataish way would be this:

Code:

bysort x : gen x_tag = _n == 1
list x if x_tag
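For anyone who wants to try the tagging idea on the example data from #3, here is a minimal sketch. The `input` block simply retypes the posted values of year and x. It swaps in `egen`'s `tag()` function as another way to mark one observation per distinct value; unlike the `_n == 1` version, `tag()` does not tag missing values unless the `missing` option is given.

```stata
clear
input year x
2001 5
2001 5
2002 7
2002 7
2003 3
2003 3
2003 3
2003 3
2004 10
2004 10
2005 40
2005 40
2005 40
end

* mark one observation for each distinct nonmissing value of x
egen byte x_tag = tag(x)

* list the distinct values only
list year x if x_tag

* count the distinct values
count if x_tag
```

The tag variable stays aligned with year and x, so no observations are dropped or shuffled into unrelated rows.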

Spreadsheet files: Just say no. (FAQ Advice Section 12 please)
Clyde Schechter

Join Date: Apr 2014

Posts: 29812
#5

30 Apr 2015, 20:31

It appears that your data always have the same value of x associated with a given value of year. I will assume you are starting with a Stata data set that contains year, x, and perhaps other variables, but not year2 and y.

Code:

// VERIFY ONLY ONE DISTINCT VALUE OF X PER YEAR
by year (x), sort: assert x[1] == x[_N]

// SAVE DATA
tempfile holding
save `holding'

// ELIMINATE DUPLICATES AND EXTRANEOUS VARIABLES
keep year x
duplicates drop
sort year
rename year year2
rename x y

// MERGE BACK ORIGINAL DATA
merge 1:1 _n using `holding', assert(match using) nogenerate

That said, this sounds like a really terrible thing to do, and I'm wondering why you want it. You are creating associations between observations of year and x with values of year2 and y that have no apparent relevance to each other. The resulting data, it seems to me, will lead to all sorts of difficulties if you try to analyze it. Perhaps if you explain where you are going, there is a better approach. This really strikes me as a recipe for chaos.
Ananya Kotia

Join Date: Feb 2015

Posts: 22
#6

30 Apr 2015, 21:00

Hi Nick, apologies about the spreadsheet and thanks for your reply.

The reason why I want to generate new variables is that I want to plot the distinct values of the variable.

I have 40 variables (columns) and about 30,000 rows of data for each variable. Each variable has duplicate values. But I want to plot only the distinct values of each variable.

So I want to remove duplicates column by column, separately for each variable. The following code does not work for this, because it drops every observation whose value of x duplicates an earlier one. That would be fine if x were the only variable I was working with, but I have many other variables whose distinct values I also need.

Code:

duplicates drop x, force
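A toy illustration of the problem, using made-up data (the variables x and z below are hypothetical, not from the attached file): dropping duplicates on x also throws away distinct values of the other variable.

```stata
clear
input x z
1 10
1 20
2 30
2 40
end

* z has four distinct values before anything is dropped
egen byte z_tag = tag(z)
count if z_tag

* keep one observation per value of x; force is needed because
* the observations being dropped differ on z
duplicates drop x, force

* only two observations remain, so two distinct values of z are lost
count
```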
Nick Cox

Join Date: Mar 2014

Posts: 35226
#7

01 May 2015, 03:12

What you want is entirely possible using code such as I used. I gave you an example of listing only the distinct values but the same idea could be used with any graph command. (You don't say what kind of plot you want.)

See also http://www.stata-journal.com/article...article=dm0042 for a review of distinct observations.
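As a rough sketch of how the tagging idea might extend to many variables, the loop below tags each variable separately and restricts a graph to the tagged observations. Everything specific here is an assumption: Stata's shipped auto dataset and the variables mpg, rep78 and headroom stand in for the 40 real variables, and `dotplot` is only a placeholder because the kind of plot wanted was not stated.

```stata
sysuse auto, clear

foreach v of varlist mpg rep78 headroom {
    * tag one observation per distinct nonmissing value of this variable
    egen byte tag_`v' = tag(`v')

    * plot only the tagged observations (plot type is a placeholder)
    dotplot `v' if tag_`v', name(g_`v', replace)
}
```

Because each variable gets its own tag, no observations are dropped and each graph sees exactly the distinct values of its variable.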