Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Check whether content of two variables is the same in different columns

    Hey guys!

    I am looking at the following dataset:

    Click image for larger version

Name:	Screen Shot 2018-07-18 at 10.01.04.png
Views:	1
Size:	69.0 KB
ID:	1453941


    What I essentially want is to check whether the owner of the company, classified as "GUO 50 or DUO 50", shares the last name with one of the directors. So see if shareholder_name of shareholder_type "GUO 50" or "DUO 50" is equal to ANY directors_name.
    So far I have managed to isolate the last names and put these next to each other, but if I now just check whether these are the same or not it gets difficult when the names are not aligned perfectly. This is the code I used for that:
    Code:
    gen director_name = word(directors_name, -1)
    gen shareholder_name1 = word(shareholder_name, -2)
    Unfortunately I do not know where to go from here. Any ideas?

  • #2
    Hi Tom,

    I think a good way to do this is using the merge command. You can have all the shareholders in one dataset, and all the directors in another. Then, you use the merge command to do a merge based on last name and specify you only want keep the variables if _merge == 3.

    If you like this approach, you can get started by googling for a Stata merge tutorial, as it can be a little tricky at first. Or otherwise consultant the Stata manual on merging by typing:
    Code:
    help merge
    If you don't find it out yourself, I'd be more than happy to help again!

    Comment


    • #3
      It is not a bad idea to make separate files as Jesse suggests in #2. This is true in particular because your current data layout is one that helps viewing it, but not so much one that will help with most analyses in Stata.

      The following below would also work with a data structure similar to yours. For future posts, please do include a data example with dataex, and not a screenshot. This mkaes it tons easier for people to understand your data and provide exact code for your problem. Read more on how and why in the FAQ: https://www.statalist.org/forums/help#stata
      Code:
      * Example generated by -dataex-. To install: ssc install dataex
      clear
      input str1 company str3 owner str2 shareholder
      "A" "X"   "Y" 
      "A" "Y"   "Y" 
      "A" "Z"   "Y" 
      "A" ""    "Y" 
      "A" ""    "Y" 
      "B" "XX"  "ZZ"
      "B" "YY"  "ZZ"
      "B" "ZZ"  "ZZ"
      "B" ""    "ZZ"
      "B" ""    "ZZ"
      "C" "XXX" "X" 
      "C" "YYY" "Y" 
      "C" "ZZZ" "Z" 
      "C" ""    "XX"
      "C" ""    "YY"
      end
      
      gen check=0
      levelsof company, local (companies) 
      foreach company in `companies'{
          levelsof owner if company=="`company'", local (owners)
          foreach owner in `owners'{
          replace check=1 if shareholder=="`owner'" & company=="`company'"
          }
      }

      Comment


      • #4
        Thanks to both of you for your help!

        I have split up the datasets in "shareholder" and "director" and tried to merge them.
        By now I came up with the following solution/code, which seems to work. However, I am only working with a very small subset of companies for now and it could also be due to luck that it works for them.
        I have a unique identifier between the two called "bvd_id" and have the shareholder dataset loaded.
        Code:
        joinby bvd_id using "directors.dta"
        gen shareholder_name = word(name, -2)
        gen director_name = word(full_name, -1)
        keep if type_of_relation == "GUO 50"
        gen ff_2 = shareholder_name == director_name
        drop if ff_2 == 0
        In the first step, I modify the names to only include the last name. Then drop all the observations except for "GUO 50" and generate a dummy variable that equals 1 if the last name of the GUO 50 is equal to the last name of the director.
        I am not 100% sure if the "joinby" function is the right way to do this - what are your thoughts on this?

        Comment


        • #5
          joinby is the best strategy here as you would have duplicates of names in both datasets, going by your first post. Do read the joinby help file and figure out what variant you need for the unmatched() option.
          I would also warn that matching people based on last name alone isn't a supersolid strategy. You have first names as well, so you might as well use those as well?

          Comment

          Working...
          X