Identifying unique codes for similar names

Fimi Karimi

Join Date: Jun 2023

Posts: 71
#1

29 Sep 2024, 15:25

Hello everyone,

I am cleaning a string variable. I want to identify the company names that have only one unique code and tag the companies that have multiple codes. For example, for company "blackrock", whose only code is BZW, fill 1 in all cells, and for company "fidelity" fill 2. I would appreciate it if you could help me with this problem.

clear
input str69 Cleanname2 str4 mgmt_cd float Year_Control
"blackrock" "BZW" 2001
"blackrock" "BZW" 2001
"blackrock" "BZW" 2001
"fidelity" "FDI" .
"fidelity" "FDI" 2001
"fidelity" "FRS" .
"fidelity" "FDI" .
end

Thanks.
Andrew Musau

Join Date: Oct 2014

Posts: 10195
#2

29 Sep 2024, 16:02

Code:

bys Cleanname2 (mgmt_cd): gen unique= mgmt_cd[1]==mgmt_cd[_N]

See https://www.stata.com/support/faqs/d...ions-in-group/
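The original question asked for 2 on the "fidelity" rows, which reads like a count of distinct codes rather than a 0/1 flag. A minimal sketch of such a count, using first-occurrence tagging on the thread's example data, might look like this. The variable names first and n_codes are made up for illustration, and an empty mgmt_cd would count as a code of its own here.

```stata
clear
input str69 Cleanname2 str4 mgmt_cd float Year_Control
"blackrock" "BZW" 2001
"blackrock" "BZW" 2001
"blackrock" "BZW" 2001
"fidelity" "FDI" .
"fidelity" "FDI" 2001
"fidelity" "FRS" .
"fidelity" "FDI" .
end

* Tag the first row of each (company, code) pair
bysort Cleanname2 mgmt_cd: gen byte first = _n == 1

* Sum the tags within company: the number of distinct codes
by Cleanname2: egen n_codes = total(first)

* The 0/1 indicator from #2, for comparison
bysort Cleanname2 (mgmt_cd): gen unique = mgmt_cd[1] == mgmt_cd[_N]

* The two agree: unique is 1 exactly when there is one distinct code
assert unique == (n_codes == 1)
```

On this data n_codes is 1 for "blackrock" and 2 for "fidelity" (FDI and FRS).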
Fimi Karimi

Join Date: Jun 2023

Posts: 71
#3

29 Sep 2024, 16:45

Thank you, Andrew. It works perfectly.
Fimi Karimi

Join Date: Jun 2023

Posts: 71
#4

01 Oct 2024, 17:59

I have three variables: Cleanname2, mgmt_cd, and unique (which is 1 if a company has only one unique code and 0 if it has multiple codes). Some companies have missing codes. For companies with a unique code (unique == 1), I want Stata to replace the missing values with that code. mgmt_cd is a string variable. For example, if a company has the codes (., ., ., ABC, ., ., ABC, ., ABC), I want Stata to fill in ABC for the missing values if unique is 1.
I tried the code below, but it didn't work (0 real changes were made), although I am sure some cases need to be filled. Could you tell me what the problem is with my code?
bysort Cleanname2 (mgmt_cd): replace mgmt_cd = mgmt_cd[_n-1] if missing(mgmt_cd) & unique == 1
(0 real changes made)

Thanks.
Clyde Schechter

Join Date: Apr 2014

Posts: 30101
#5

01 Oct 2024, 21:08

The first problem is that when Stata starts a Cleanname2 group in executing this code, _n == 1. So _n-1 == 0, and mgmt_cd[0] is, by convention, missing, since there is no observation 0. So you're just replacing missing with missing there. The second problem is that mgmt_cd is a string variable. When you sort a string variable, the missing values (empty strings) sort first, which is the opposite of the way numeric variables sort. So whenever there are any missing values in a Cleanname2 group, they line up starting at _n = 1, and filling from the preceding observation does nothing to them. What you want is:

Code:

gsort Cleanname2 -mgmt_cd // THIS SORTS MISSING VALUES OF mgmt_cd TO THE END
by Cleanname2: replace mgmt_cd = mgmt_cd[_n-1] if _n > 1 & unique == 1 & missing(mgmt_cd) // NOTE: NO sort IN THIS COMMAND
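To see both points on a small case, here is a sketch with made-up data. The company names "acme" and "beta" and the helper variables first and n_codes are hypothetical, not from the thread. It assumes unique should be judged from the non-missing codes only; the mgmt_cd[1]==mgmt_cd[_N] comparison from #2 would return 0 for a group that mixes empty strings with a single code, because the empty string sorts first.

```stata
clear
input str10 Cleanname2 str4 mgmt_cd
"acme" ""
"acme" "ABC"
"acme" ""
"acme" "ABC"
"beta" "XYZ"
"beta" ""
"beta" "QRS"
end

* Assumption: unique == 1 when a company has exactly one non-missing code
bysort Cleanname2 mgmt_cd: gen byte first = _n == 1 & !missing(mgmt_cd)
by Cleanname2: egen n_codes = total(first)
gen byte unique = n_codes == 1

* Ascending sort puts the empty strings first in each group,
* so mgmt_cd[_n-1] never holds a code to copy (0 real changes)
bysort Cleanname2 (mgmt_cd): replace mgmt_cd = mgmt_cd[_n-1] if missing(mgmt_cd) & unique == 1

* Descending sort puts the empty strings last, after the code they should inherit
gsort Cleanname2 -mgmt_cd
by Cleanname2: replace mgmt_cd = mgmt_cd[_n-1] if _n > 1 & unique == 1 & missing(mgmt_cd)

list, sepby(Cleanname2)
```

Because replace works through the observations in order, each filled value is available to the next row, so a run of several missing values at the end of a group is filled in one pass. The "beta" rows are left alone because that company has two different codes.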
Fimi Karimi

Join Date: Jun 2023

Posts: 71
#6

02 Oct 2024, 13:31

That's completely right. Your command works perfectly. Thank you for your help and for the helpful explanations!