  • Can we find out if a name in one dataset appears in another dataset without isolating first from last names?

    Hi all. I found two datasets with the 10,000 most popular Brazilian male first names and the 10,000 most popular Brazilian female first names. Each of these has the names’ frequencies in the population and a variable called “rank” (1 for the most popular name, 2 for second most popular, etc.). The first names in each dataset appear only once, no repetitions. That can be useful for some very basic text analysis with Brazilian names.

    Let’s say I have a list of candidates running for office in one dataset (with first and last names) and the list with the most popular Brazilian female first names in another dataset.

    Can I issue a command in Stata asking the program to attach “rank” from the using dataset whenever one of its names appears in the variable "name" in the master dataset, without isolating the first name from the last name(s)? Or must I first create a variable that contains only first names (for instance, first_female_name) and then run the following?
    Code:
    merge m:1 first_female_name using using_dataset.dta, keepusing(rank)
    Example:

    The master dataset has one observation with the name “CAMILA RODRIGUES”.

    What I want: a command that gives me the “rank” for “CAMILA”.


    If the question is not clear, please let me know. Thanks.

  • #2
    Well, you certainly can't do it with -merge-, because -merge- will only link observations that are exact matches on the key variable.

    I can see a way to combine these two data sets that would give the result you want without creating a firstname variable. But if your data set is large it would require a lot of memory, and it might not finish running during your lifetime (hyperbole).
    Code:
    cross using using_dataset // WILL EXPLODE MEMORY AND TAKE FOREVER WITH LARGE DATA SETS
    keep if word(name, 1) == first_female_name

    So I have to ask why you want to do that? What is your objection to creating a firstname variable? It's certainly easy to do:
    Code:
    gen first_female_name = word(name, 1)
    merge m:1 first_female_name using using_dataset, keep(master match)
    and then your -merge- command will work. This approach will not expand the size of the original master data set and will run quickly.
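
    A minimal sketch of the word() plus -merge- approach, run on made-up toy data. The names, the ranks, and the tempfile are illustrative assumptions, not the real rank lists. The upper() and strtrim() calls are an extra precaution that the original code does not include; they only matter if the master names differ in case or carry stray blanks.

    ```stata
    * Toy rank list: hypothetical names and ranks, for illustration only
    clear
    input str20 first_female_name int rank
    "MARIA"  1
    "ANA"    2
    "CAMILA" 3
    end
    tempfile ranks
    save `ranks'

    * Toy master data: full candidate names (also hypothetical)
    clear
    input str40 name
    "CAMILA RODRIGUES"
    "ANA PAULA SOUZA"
    "JOSE DA SILVA"
    end

    * Extract the first word; upper() and strtrim() guard against case and blank differences
    gen first_female_name = word(upper(strtrim(name)), 1)

    * Many candidates may share a first name; each name appears once in the rank list
    merge m:1 first_female_name using `ranks', keep(master match) keepusing(rank)

    list name first_female_name rank _merge
    ```

    With keep(master match), candidates whose first name is not in the rank list stay in the data with a missing rank, so the number of master observations does not change. Note that upper() only changes unaccented letters, so accented names may need ustrupper() instead.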


    • #3
      You'd probably need to break the names apart, or use some type of fuzzy match (I haven't tried those).

      Code:
      g firstname = substr(name, 1, strpos(name, " ") -1)
      Last edited by George Ford; 26 Feb 2024, 15:18. Reason: or what Clyde said. Didn't know the word trick.
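
      A small comparison of the two extraction methods on hypothetical toy names, as a sketch of where they can diverge. The substr()/strpos() line relies on the name containing a blank that is not its first character. With a leading blank, strpos() returns 1, so the requested length is 0 and the result is empty. With a single-word name, strpos() returns 0 and the requested length becomes -1, which I would expect to yield an empty string as well, while word() returns the name itself.

      ```stata
      * Hypothetical names chosen to show edge cases
      clear
      set obs 3
      gen str40 name = "CAMILA RODRIGUES" in 1
      replace name = "CAMILA" in 2        // single word, no blank
      replace name = " ANA SOUZA" in 3    // leading blank

      * Method from this post: everything before the first blank
      gen first_substr = substr(name, 1, strpos(name, " ") - 1)

      * Method from #2: first blank-delimited word
      gen first_word = word(name, 1)

      * Flag observations where the two methods disagree
      gen byte differ = first_substr != first_word
      list, clean
      ```

      Checking the differ flag on the real data before merging would show whether the choice of method matters for a given file.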


      • #4
        Thanks, Clyde and George, for the replies. No objection at all to creating a first name variable, Clyde. I had not considered the memory issue you raised: the datasets I work with, such as electoral data, can have more than 400,000 observations in municipal elections. So creating a first name variable will be my choice. Last but not least: this data can be used for an original paper assessing whether mental shortcuts such as first names lead voters to favor candidates with more popular first names on the ballot. There is a large literature in political science and social psychology studying how "heuristics", or mental shortcuts (names, physical appearance, etc.), can affect political behavior. My background, in college and grad school, is in political science. I'm a political scientist.

        George: I will try your code. When I have the result I will post again here.


        • #5
          Clyde's approach is better.
