Counting unique alphabets in word

Sawaeng Watcharathanakij

Join Date: Nov 2016

Posts: 13
#1

11 Mar 2022, 21:35

I tried to find a Stata module that counts the unique letters in a word, but couldn't find one. For example, the word "alberta" contains a, b, e, l, r, and t, so the new variable should display 6.
Nick Cox

Join Date: Mar 2014

Posts: 35698
#2

12 Mar 2022, 02:38

From your question I take it that you are indifferent to case -- otherwise alberta should be Alberta and A counts as different from a. I am also going to guess that spaces aren't interesting.

There is discussion of related problems in various places:

https://www.stata.com/support/faqs/d...tinct-strings/

https://www.stata-journal.com/articl...article=pr0046 Section 7.

Here I put it together to create one variable for each character and then get the community-contributed egen function rowsvals() from egenmore on SSC to do the harder work.

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str20 province
"alberta"
"new brunswick"
"prince edward island"
end

gen length = strlen(province)
su length

quietly forval j = 1/`r(max)' {
    gen char`j' = substr(province, `j', 1) if substr(province, `j', 1) != ""
}

egen wanted = rowsvals(char*)

l province wanted

     +-------------------------------+
     |             province   wanted |
     |-------------------------------|
  1. |              alberta        6 |
  2. |        new brunswick       11 |
  3. | prince edward island       12 |
     +-------------------------------+

drop char*

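If you do care about spaces, then don't ignore them. As written, the condition != "" only skips empty strings, so the counts of 11 and 12 above include the space as one distinct character; testing != " " instead would leave spaces out.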

Bjarte Aagnes

Join Date: Apr 2014
Posts: 783
#3

12 Mar 2022, 06:40

Code:

gen byte nchars = 0

foreach alpha in `c(alpha)' { // a-z
    replace nchars = nchars + 1 if strpos(strlower(province), "`alpha'")
}
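For comparison with #2, here is a minimal sketch that runs the loop above on the three example provinces from that post. The data block is copied from #2; the quietly prefix is added here only to suppress the replace messages. Because only a to z are tested, spaces are never counted.

```stata
* Sketch: the c(alpha) loop applied to the example data from #2
clear
input str20 province
"alberta"
"new brunswick"
"prince edward island"
end

gen byte nchars = 0

* c(alpha) holds the 26 lowercase letters a to z, separated by spaces
foreach alpha in `c(alpha)' {
    * strpos() returns 0 (false) when the letter does not occur
    quietly replace nchars = nchars + 1 if strpos(strlower(province), "`alpha'")
}

list province nchars
* nchars is 6, 10 and 11: one less than #2 for the two names containing a space
```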

Nick Cox

Join Date: Mar 2014

Posts: 35698
#4

12 Mar 2022, 07:17

#3 is good for 26 characters A to Z. Not good for Norwegian?

Bjarte Aagnes

Join Date: Apr 2014
Posts: 783
#5

12 Mar 2022, 08:11

#3 including Latin-1 supplement letters:

Code:

gen byte nchars = 0

forvalues codepoint = 65/255 { // parts of unicode blocks "Basic Latin" and "Latin-1 Supplement"
    if ( uisletter(uchar(`codepoint')) ) {
        replace nchars = nchars + 1 if ustrpos(ustrlower(province), uchar(`codepoint') )
    }
}

and #2 would need usubstr():

Code:

     gen char`j'  = usubstr(ustrlower(province), `j', 1) if usubstr(ustrlower(province), `j', 1) != ""
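A quick way to see why the u-prefixed functions matter: in UTF-8 a letter such as ö occupies two bytes, so the byte-based functions miscount it. A small sketch, assuming Stata 14 or later (where the Unicode string functions are available) and that the lines are run from a do-file so the // comments are accepted:

```stata
* Byte-based versus character-based string functions on a non-ASCII name
display strlen("Malmö")          // 6: counts bytes, and ö takes two
display ustrlen("Malmö")         // 5: counts Unicode characters
display usubstr("Malmö", 5, 1)   // ö: the fifth character, returned whole
```

With substr() the fifth and sixth positions would each hold half of ö, which is why the character-by-character approach in #2 needs usubstr() once the data go beyond plain ASCII.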

Leonardo Guizzetti

Join Date: Jul 2016
Posts: 2402
#6

12 Mar 2022, 08:54

Here's an alternative way that accepts any valid letter codepoint.

Code:

clear *
cls

input str30 input
"Ontario"
"Alberta"
"New Brunswick"
"Prince Edward Island"
"München"
"Malmö"
"L'Aquila"
"Emiglia-Romagna"
end


gen textonly = ustrregexra(input, "\P{L}", "", 1)
replace textonly = ustrlower(textonly)  // <-- comment line if you care about capitalization

gen textlen = ustrlen(textonly)
gen unique_letters = ""
gen next_letter = ""
gen remaining = textonly

summ textlen, meanonly
forval i = 1/`r(max)' {
  qui replace next_letter = usubstr(remaining, 1, 1)
  qui replace unique_letters = unique_letters + next_letter
  qui replace remaining = ustrregexra(remaining, next_letter, "", 0)
}
drop textonly next_letter remaining
gen n_unique = ustrlen(unique_letters)

list, sep(0) abbrev(20)

Result

Code:

     +------------------------------------------------------------+
     |                input   textlen   unique_letters   n_unique |
     |------------------------------------------------------------|
  1. |              Ontario         7           ontari          6 |
  2. |              Alberta         7           albert          6 |
  3. |        New Brunswick        12       newbrusick         10 |
  4. | Prince Edward Island        18      princedwasl         11 |
  5. |              München         7           münche          6 |
  6. |                Malmö         5             malö          4 |
  7. |             L'Aquila         7            laqui          5 |
  8. |      Emiglia-Romagna        14        emiglaron          9 |
     +------------------------------------------------------------+
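To see how the loop above builds unique_letters, the same steps can be traced for a single string using local macros. This is only an illustrative sketch: the macro names are made up for the example and are not part of the code above, and it is meant to be run from a do-file.

```stata
* Trace of the "take first letter, then strip all its copies" idea for one string
* Step 1: drop every non-letter (\P{L}) and lowercase what is left
local remaining = ustrlower(ustrregexra("L'Aquila", "\P{L}", ""))
local unique ""

while "`remaining'" != "" {
    * Step 2: the first remaining character is a letter not seen before
    local next = usubstr("`remaining'", 1, 1)
    local unique "`unique'`next'"
    * Step 3: remove every occurrence of that letter from what remains
    local remaining = ustrregexra("`remaining'", "`next'", "")
    display "next = `next'   unique = `unique'   remaining = `remaining'"
}

display "`unique'"            // laqui
display ustrlen("`unique'")   // 5, matching row 7 of the table
```

Each pass removes at least one character, so the loop needs at most as many passes as there are letters, which is why the variable-based version runs from 1 to the maximum of textlen.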

Sawaeng Watcharathanakij

Join Date: Nov 2016

Posts: 13
#7

12 Mar 2022, 18:39

Thank you very much for all the answers.