Creating a variable with a distinct value from two other variables with same values

JungHwan Kim

Join Date: Jun 2020

Posts: 21
#1

23 Jun 2020, 15:36

Hello everyone.

I have a dataset where two (or more) variables have the same values. I want to create a variable that holds just the one distinct value. I tried to find an easy way but could not, other than doing it manually in Excel. Could you please advise me on how to do it? A sample of the data is shown below.

To explain, I want to create the variable "Major_1997" so that it holds the one distinct value from "Major_1997_1" and "Major_1997_2", since they both have the same value. I hope my explanation is clear. Thank you very much in advance for your help!

ID   Major_1997_1   Major_1997_2   Major_1997
 1              3              3            3
Bruce Weaver

Join Date: May 2014

Posts: 1133
#2

23 Jun 2020, 15:50

Hello JungHwan Kim. I see this is your first post. Welcome.

Does this code produce the result you want?

Code:

generate byte Major_1997 = .
replace Major_1997 = Major_1997_1 if (Major_1997_1==Major_1997_2)

PS- Please see the advice in the FAQ about providing a small data set (via dataex) to illustrate the problem, and about using code delimiters to show code and output in a more readable fashion.

--
Bruce Weaver
Email: [email protected]
Version: Stata/MP 19.5 (Windows)
JungHwan Kim

Join Date: Jun 2020

Posts: 21
#3

23 Jun 2020, 18:29

Dear Bruce,

Thank you very much for your reply. The code you provided works on the example I gave! I have one more question, though, about some other cases in this dataset. In some cases, some of the variables have a missing value (i.e., a dot) while the others have a value. In such cases, I would like the code to pick the value automatically from a variable that is not missing. An example would be the following:

ID   Major_1997_1   Major_1997_2   Major_1997_3   Major_1997_4   Major_1997
 1              3              .              3              .            3

In this case, is there a way to generate 'Major_1997' and set it to the single distinct value '3' without having to name a specific variable (e.g., Major_1997_3)? In other words, can Stata choose the distinct value without my specifying which variable to take it from?

Thank you very much for your help!

Best regards,

Jung Hwan Kim

Bruce Weaver

Join Date: May 2014

Posts: 1133
#4

24 Jun 2020, 08:41

The first thing that comes to mind for me is using egen with the rowsd() function to flag rows where all values are the same--they will have a row SD = 0.

Code:

* Create a small dataset to illustrate
clear
input byte(ID Major_1997_1 Major_1997_2 Major_1997_3 Major_1997_4)
1     3     .  3  .
2   2   3  3  3
3   3   2  .  3
4   2   2  .  2
5   .   .  .  .
end

* The following code assumes variables
* Major_1997_1 to Major_1997_4 are contiguous in
* the data file.  If they are not, list all 4 variables
* inside the parentheses.

egen double sd1997 = rowsd(Major_1997_1 - Major_1997_4)
egen Major_1997 = rowmin(Major_1997_1 - Major_1997_4) if sd1997==0
list, clean noobs
drop sd1997 // Assuming it is no longer needed

Here is the output from the -list- command:

Code:

. list, clean noobs

    ID   Major_~1   Major_~2   Major_~3   Major_~4      sd1997   Maj~1997  
     1          3          .          3          .           0          3  
     2          2          3          3          3          .5          .  
     3          3          2          .          3   .57735027          .  
     4          2          2          .          2           0          2  
     5          .          .          .          .           .          .
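
One caveat with the rowsd() approach, offered as a hedged aside: as far as I know, rowsd() returns missing when a row has only one nonmissing value, so such a row would get no value. If a lone value should count as "all the same", comparing the row minimum with the row maximum is one possible alternative. A minimal sketch, where lo1997 and hi1997 are hypothetical helper names and observation 6 is an invented row added only to show the single-value case:

```stata
* Rows 1-5 are the example data above; row 6 is an assumption added here
* to illustrate a row with a single nonmissing value.
clear
input byte(ID Major_1997_1 Major_1997_2 Major_1997_3 Major_1997_4)
1 3 . 3 .
2 2 3 3 3
3 3 2 . 3
4 2 2 . 2
5 . . . .
6 . 3 . .
end

* rowmin() and rowmax() ignore missing values and return missing
* only when every variable in the row is missing.
egen lo1997 = rowmin(Major_1997_1-Major_1997_4)
egen hi1997 = rowmax(Major_1997_1-Major_1997_4)

* All nonmissing values in a row agree exactly when min equals max.
* (hypothetical helper names: lo1997, hi1997)
generate byte Major_1997 = lo1997 if lo1997 == hi1997
drop lo1997 hi1997
list, clean noobs
```

Rows 1, 4, and 6 should receive a value under this sketch, while rows 2, 3, and 5 stay missing. Whether a single observed value should be trusted this way depends on the data.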

I hope this helps.



Mike Lacy

Join Date: Apr 2014

Posts: 2416
#5

24 Jun 2020, 09:41

I think I have a different interpretation than Bruce regarding what JungHwan wants. Perhaps JungHwan wants to replace missing values with the nearest previous good (non-missing) value:

Code:

gen int previous_good = .
foreach m of varlist Major_1997_1 Major_1997_2 Major_1997_3 Major_1997_4 Major_1997 {
    replace `m' = previous_good if missing(`m')
    replace previous_good = `m'
}
JungHwan Kim

Join Date: Jun 2020

Posts: 21
#6

25 Jun 2020, 07:45

Thank you very much for your help! I got an idea from your posts and did the following, and it seems to work! :D

Code:

gen Major_1997 = .
foreach x of varlist Major_1997_* {
    replace Major_1997 = `x' if `x'>=0 & `x' !=.
}

Bruce Weaver

Join Date: May 2014

Posts: 1133
#7

25 Jun 2020, 09:00

Hello JungHwan Kim. Using the small dataset I created in #4, my method and yours give different results for a couple of cases. Just want to make sure you've spotted that, and that you are getting the result you want.

The following has the dataset from #4, but with one additional observation added (ID 6), which I think illustrates a problem with your method.

Code:

* Create a small dataset to illustrate
clear
input byte(ID Major_1997_1 Major_1997_2 Major_1997_3 Major_1997_4)
1     3     .  3  .
2   2   3  3  3
3   3   2  .  3
4   2   2  .  2
5   .   .  .  .
6   3   3  3  2
end

* Bruce Weaver's method in #4
egen double sd1997 = rowsd(Major_1997_1 - Major_1997_4)
egen bw_1997 = rowmin(Major_1997_1 - Major_1997_4) if sd1997==0
drop sd1997 // Assuming it is no longer needed

* JungHwan Kim's method in #6
gen jk_1997 = .
foreach x of varlist Major_1997_* {
    replace jk_1997 = `x' if `x'>=0 & `x' !=.
}
list, clean noobs

Output from -list- command:

Code:

. list, clean noobs

    ID   Major_~1   Major_~2   Major_~3   Major_~4   bw_1997   jk_1997  
     1          3          .          3          .         3         3  
     2          2          3          3          3         .         3   <-- Methods disagree
     3          3          2          .          3         .         3   <-- Methods disagree
     4          2          2          .          2         2         2  
     5          .          .          .          .         .         .  
     6          3          3          3          2         .         2   <-- Methods disagree

My method gives system missing as the result on the three cases flagged above because there are 2 or more distinct values across the 4 variables. I thought you wanted to fill in a value only when all values are the same. And I doubt you want a value of 2 on that last observation. But perhaps I misunderstood what you want.

Cheers,
Bruce



JungHwan Kim

Join Date: Jun 2020

Posts: 21
#8

01 Jul 2020, 12:33

Hello Bruce Weaver. Thank you very much for your reply. I think I may have worded my question in a confusing way.
I wanted to create a variable that indicates each individual's major. In some cases the values change over the time horizon because individuals may have changed majors. I wanted the variable to indicate the final major that each individual had during their education.

Best,

Jung Hwan Kim
Bruce Weaver

Join Date: May 2014

Posts: 1133
#9

01 Jul 2020, 17:29

Ah, okay. Thank you for clarifying.
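
If the goal is the final major, one way to sketch it is to keep the last nonmissing value across the variables. This assumes Major_1997_1 to Major_1997_4 are stored in chronological order and are contiguous in the dataset, which the thread does not confirm; final_1997 is a hypothetical name. It follows the same idea as the loop in #6, but uses missing() so that extended missing values (.a, .b, ...) are skipped as well:

```stata
* Example data from #7.
clear
input byte(ID Major_1997_1 Major_1997_2 Major_1997_3 Major_1997_4)
1 3 . 3 .
2 2 3 3 3
3 3 2 . 3
4 2 2 . 2
5 . . . .
6 3 3 3 2
end

* final_1997 is a hypothetical variable name.
* Walk the variables in order (assumed chronological); each nonmissing
* value overwrites the previous one, so the last nonmissing value wins.
generate byte final_1997 = .
foreach x of varlist Major_1997_1-Major_1997_4 {
    replace final_1997 = `x' if !missing(`x')
}
list, clean noobs
```

On these six rows this gives 3, 3, 3, 2, ., 2, which matches the jk_1997 column in #7.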
