  • Create group for company with fuzzy similar names

    Hello everyone,

    I have a very large dataset of thousands of companies. Unfortunately, after pulling the data from the database, I realized that the company names sometimes differ by only a few letters or an appended description. To give a short example, the entries for the company "Apple" would be listed like this:

    Code:
    CompanyName | Year | Sales
    Apple       | 2007 | xxxx$
    Apple Inc.  | 2008 | xxxx$
    I need the names to be identical in order to transpose the data correctly and to do the data analysis.

    I previously used -reclink- to merge one dataset into another when the identifiers were not 100% identical, so I thought it should also be possible to solve my current problem.

    The goal is for all entries of a company to share the same "CompanyName" value, or alternatively to have another identifier variable that assigns one value to the whole group.

    Maybe someone has experience with this. Any help is very welcome. Thanks in advance.

    Best

  • #2
    There is no way to fully automate this. Programs like -reclink- can get you most of the way there, but there will always be cases that cannot be resolved just by fine-tuning the parameters of the fuzzy match and that require human judgment (perhaps supported by some research into the details). FWIW, I have found Julio Raffo's -matchit- program, available from SSC, easier to use and more effective than -reclink-, so you might want to give it a try. But, in the end, you should expect to have to finish the job manually no matter what software you use.
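
    Before any fuzzy matching, a deterministic cleaning pass often removes the easy cases. The following is only a minimal sketch using built-in Stata functions (Stata 14 or later is assumed for -strtrim()- and -stritrim()-); it is not what -reclink- or -matchit- do internally. The toy rows, the suffix list, and the variable names name_clean and company_id are assumptions made for illustration.

    ```stata
    * Toy data standing in for the real file (rows are made up for illustration)
    clear
    input str20 CompanyName int Year
    "Apple"       2007
    "Apple Inc."  2008
    "APPLE INC"   2009
    "Microsoft"   2007
    end

    * Normalize: collapse internal blanks, trim the ends, lower-case
    generate name_clean = lower(strtrim(stritrim(CompanyName)))

    * Remove one trailing legal suffix; the suffix list is illustrative, not exhaustive
    replace name_clean = regexr(name_clean, "[ ,]+(inc|corp|ltd|llc|co)[.]?$", "")
    replace name_clean = strtrim(name_clean)

    * Rows with the same cleaned name get the same numeric group identifier
    egen long company_id = group(name_clean)

    list CompanyName name_clean company_id, sepby(company_id)
    ```

    Names that still differ after this step (misspellings, abbreviations) are the ones left for a fuzzy matcher and for manual review.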

    That said, it surprises me that you cannot find an alternative identifier in the data set that is more consistent. Although I do not work in the field myself, I have seen here on Statalist that in the finance and economics sectors, the large databases usually identify firms by coded identifiers that do not have the inconsistencies of names. I'm thinking of things like the CRSP PERMNO, the Compustat GVKEY, or stock ticker symbols. Are you sure there isn't such a variable in your data set? If there isn't, is there an alternative data set out there that has one?
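
    If such an identifier does turn up, harmonizing the names takes only a line or two. This is a sketch under assumptions: firm_id is a hypothetical stand-in for whatever coded identifier the database provides, name_std is an illustrative variable name, and the toy rows are made up.

    ```stata
    * Toy data; firm_id is a hypothetical stand-in for the real coded identifier
    clear
    input long firm_id str20 CompanyName int Year
    1 "Apple"      2007
    1 "Apple Inc." 2008
    2 "Microsoft"  2007
    end

    * Within each firm, copy the name from the earliest year to every row
    bysort firm_id (Year): generate name_std = CompanyName[1]

    * Confirm that firm and year jointly identify the rows before reshaping
    isid firm_id Year

    list firm_id CompanyName name_std Year, sepby(firm_id)
    ```

    Taking the earliest year's name is an arbitrary choice; any rule that picks one spelling per identifier works equally well for grouping.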


    • #3
      Thank You!

      Yes, you are quite right. I will look at the database again; now that you mention it, I also assume there must be a unique identifier for every company. Since I am new to working with data this extensively, I did not think of that at first.

      Thanks for your help again.
