  • Identifying unique codes for similar names

    Hello everyone,

    I am cleaning a string variable. I want to identify the company names that have only one unique code and tag the companies that have multiple codes. For example, for the company "blackrock", whose only code is BZW, fill 1 in all cells, and for the company "fidelity", fill 2. I would appreciate it if you could help me with this problem.

    clear
    input str69 Cleanname2 str4 mgmt_cd float Year_Control
    "blackrock" "BZW" 2001
    "blackrock" "BZW" 2001
    "blackrock" "BZW" 2001
    "fidelity" "FDI" .
    "fidelity" "FDI" 2001
    "fidelity" "FRS" .
    "fidelity" "FDI" .
    end


    Thanks.

  • #2
    Code:
    bys Cleanname2 (mgmt_cd): gen unique= mgmt_cd[1]==mgmt_cd[_N]
    See https://www.stata.com/support/faqs/d...ions-in-group/
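
The one-liner in #2 gives a 0/1 indicator. The question also asks for the number of codes (2 for "fidelity"), and a minimal sketch of one way to get that count on the example data follows. The variable names first and n_codes are hypothetical, introduced only for this illustration, and an empty code ("") would be counted as a code of its own here.

```stata
clear
input str69 Cleanname2 str4 mgmt_cd float Year_Control
"blackrock" "BZW" 2001
"blackrock" "BZW" 2001
"blackrock" "BZW" 2001
"fidelity" "FDI" .
"fidelity" "FDI" 2001
"fidelity" "FRS" .
"fidelity" "FDI" .
end

* Tag the first observation of each (company, code) pair
bysort Cleanname2 mgmt_cd: gen byte first = (_n == 1)

* Sum the tags within each company: the number of distinct codes
egen n_codes = total(first), by(Cleanname2)
drop first

list Cleanname2 mgmt_cd n_codes, sepby(Cleanname2)
```

With this data, n_codes should be 1 for every "blackrock" row and 2 for every "fidelity" row.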



    • #3
      Thank you, Andrew. It works perfectly.



      • #4
        I have three variables: Cleanname2, mgmt_cd, and unique (which is 1 if a company has only one unique code and 0 if it has multiple codes). Some companies have missing codes. For companies with a unique code (unique == 1), I want Stata to fill in that code wherever mgmt_cd is missing. mgmt_cd is a string variable. For example, if a company has the codes (., ., ., ABC, ., ., ABC, ., ABC), I want Stata to replace the missing values with ABC if unique is 1.
        I tried this code, but it didn't work (0 real changes made), although I am sure some cases need to be filled. Could you tell me what is wrong with my code?
        bysort Cleanname2 (mgmt_cd): replace mgmt_cd = mgmt_cd[_n-1] if missing(mgmt_cd) & unique == 1
        (0 real changes made)

        Thanks.



        • #5
          The first problem is that when Stata starts a Cleanname2 group in executing this code, _n == 1. So _n-1 == 0, and mgmt_cd[0] is, by convention, missing, since there is no mgmt_cd[0]. So you're just replacing missing with missing there. Next, we have the problem that mgmt_cd is a string variable. So when you sort it, the missing values sort first, which is the opposite of the way numeric variables sort. So whenever there are any missing values in a Cleanname2 group, they line up starting at _n = 1, and so filling from the preceding element does nothing to them. What you want is:

          Code:
          gsort Cleanname2 -mgmt_cd // THIS SORTS MISSING VALUES OF mgmt_cd TO THE END
          by Cleanname2: replace mgmt_cd = mgmt_cd[_n-1] if _n > 1 & unique == 1 & missing(mgmt_cd) // NOTE: NO sort IN THIS COMMAND
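
A small self-contained sketch of the fill on toy data may help in checking the behaviour. The company names "acme" and "zeta" and the helper variables first and n_codes are hypothetical, made up for this illustration. It also assumes that unique is computed over nonmissing codes only: the comparison in #2, mgmt_cd[1]==mgmt_cd[_N], treats an empty string as a value of its own, so a company with one real code plus some missing codes would get unique == 0 under that definition.

```stata
clear
input str20 Cleanname2 str4 mgmt_cd
"acme" ""
"acme" "ABC"
"acme" ""
"acme" "ABC"
"zeta" "X1"
"zeta" ""
"zeta" "X2"
end

* Assumption: count distinct codes while ignoring missing ("") values
bysort Cleanname2 mgmt_cd: gen byte first = (_n == 1) & !missing(mgmt_cd)
egen n_codes = total(first), by(Cleanname2)
gen byte unique = (n_codes == 1)
drop first

* Descending sort on the string puts the empty values last within company
gsort Cleanname2 -mgmt_cd

* replace works down the observations in order, so each filled value
* is available to the next missing one
by Cleanname2: replace mgmt_cd = mgmt_cd[_n-1] if _n > 1 & unique == 1 & missing(mgmt_cd)

list, sepby(Cleanname2)
```

In this sketch every "acme" row should end up with ABC, while "zeta" (two different codes) keeps its missing value.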




          • #6
            That's completely right. Your command works perfectly. Thank you for your help and for helpful explanations!
