Help removing special characters from SIPRI country name string variable

Sam Volpe

Join Date: Jun 2021
Posts: 36


10 Apr 2024, 17:44

Hi, I am using a SIPRI dataset for a time-series analysis covering many countries (100+) and many years (~60). I am also using a few other datasets, and in order to merge them I need a consistent country code (numeric or alphabetic) or country name. SIPRI's dataset doesn't have a country code, but it does have country names. However, many of the entries contain extra characters, such as parentheses and asterisks. I've included three examples.

I'm familiar with the command "kountry", but even using kountry with the stuck option means missing out on many countries' observations. Does anyone know which naming standard SIPRI uses for its name variable, or whether they came up with their own? Otherwise, is there a command that can efficiently remove various special characters from a string variable across many observations?

Any help is greatly appreciated!

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str37 country_name int year
"ANC (South Africa)*" 1950
"ANC (South Africa)*" 1951
"ANC (South Africa)*" 1952
"ANC (South Africa)*" 1953
"ANC (South Africa)*" 1954
"ANC (South Africa)*" 1955
"ANC (South Africa)*" 1956
"ANC (South Africa)*" 1957
"ANC (South Africa)*" 1958
"ANC (South Africa)*" 1959
end

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str37 country_name int year
"African Union**" 1950
"African Union**" 1951
"African Union**" 1952
"African Union**" 1953
"African Union**" 1954
"African Union**" 1955
"African Union**" 1956
"African Union**" 1957
"African Union**" 1958
"African Union**" 1959
end

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str37 country_name int year
"Amal (Lebanon)*" 1950
"Amal (Lebanon)*" 1951
"Amal (Lebanon)*" 1952
"Amal (Lebanon)*" 1953
"Amal (Lebanon)*" 1954
"Amal (Lebanon)*" 1955
"Amal (Lebanon)*" 1956
"Amal (Lebanon)*" 1957
"Amal (Lebanon)*" 1958
"Amal (Lebanon)*" 1959
end


Andrew Musau

Join Date: Oct 2014
Posts: 9947

10 Apr 2024, 18:03

You can try regular expressions:

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str37 country_name int year
"ANC (South Africa)*" 1950
"ANC (South Africa)*" 1951
"ANC (South Africa)*" 1952
"ANC (South Africa)*" 1953
"ANC (South Africa)*" 1954
"ANC (South Africa)*" 1955
"ANC (South Africa)*" 1956
"ANC (South Africa)*" 1957
"ANC (South Africa)*" 1958
"ANC (South Africa)*" 1959
"African Union**" 1950
"African Union**" 1951
"African Union**" 1952
"African Union**" 1953
"African Union**" 1954
"African Union**" 1955
"African Union**" 1956
"African Union**" 1957
"African Union**" 1958
"African Union**" 1959
"Amal (Lebanon)*" 1950
"Amal (Lebanon)*" 1951
"Amal (Lebanon)*" 1952
"Amal (Lebanon)*" 1953
"Amal (Lebanon)*" 1954
"Amal (Lebanon)*" 1955
"Amal (Lebanon)*" 1956
"Amal (Lebanon)*" 1957
"Amal (Lebanon)*" 1958
"Amal (Lebanon)*" 1959
end

* replace every non-letter with a space, then collapse and trim the blanks
replace country_name = trim(itrim(ustrregexra(country_name, "[^a-zA-Z]", " ")))

Res.:

Code:

. l, sepby(country)

     +-------------------------+
     |     country_name   year |
     |-------------------------|
  1. | ANC South Africa   1950 |
  2. | ANC South Africa   1951 |
  3. | ANC South Africa   1952 |
  4. | ANC South Africa   1953 |
  5. | ANC South Africa   1954 |
  6. | ANC South Africa   1955 |
  7. | ANC South Africa   1956 |
  8. | ANC South Africa   1957 |
  9. | ANC South Africa   1958 |
 10. | ANC South Africa   1959 |
     |-------------------------|
 11. |    African Union   1950 |
 12. |    African Union   1951 |
 13. |    African Union   1952 |
 14. |    African Union   1953 |
 15. |    African Union   1954 |
 16. |    African Union   1955 |
 17. |    African Union   1956 |
 18. |    African Union   1957 |
 19. |    African Union   1958 |
 20. |    African Union   1959 |
     |-------------------------|
 21. |     Amal Lebanon   1950 |
 22. |     Amal Lebanon   1951 |
 23. |     Amal Lebanon   1952 |
 24. |     Amal Lebanon   1953 |
 25. |     Amal Lebanon   1954 |
 26. |     Amal Lebanon   1955 |
 27. |     Amal Lebanon   1956 |
 28. |     Amal Lebanon   1957 |
 29. |     Amal Lebanon   1958 |
 30. |     Amal Lebanon   1959 |
     +-------------------------+
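One caveat with the pattern above: "[^a-zA-Z]" treats accented letters as non-letters, so a name such as "Côte d'Ivoire" would lose its "ô". The sketch below compares it with the Unicode letter class \p{L}, which the ustrregex functions should accept. The two rows and the variable names ascii_only and any_letter are made up for illustration and are not taken from the SIPRI file.

```stata
* Illustrative data only: two hypothetical rows, not from the SIPRI file
clear
input str20 country_name
"Côte d'Ivoire*"
"African Union**"
end

* ASCII-only class: accented letters are replaced along with the punctuation
generate ascii_only = strtrim(stritrim(ustrregexra(country_name, "[^a-zA-Z]", " ")))

* Unicode letter class: accented letters are kept, other characters become spaces
generate any_letter = strtrim(stritrim(ustrregexra(country_name, "[^\p{L}]", " ")))

* compare the two cleaned versions side by side
list, noobs
```

Generating new variables instead of using replace keeps the original names available for checking before anything is overwritten.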


Clyde Schechter

Join Date: Apr 2014

Posts: 29796
#3

10 Apr 2024, 18:09

Well, it is easy enough to strip out the characters other than letters and spaces:

Code:

replace country_name = ustrregexra(country_name, "[^a-zA-Z\s]", "")

But you have other problems. African Union is not a country. I don't know for sure what the ANC in "ANC (South Africa)*" refers to, but ANC is the initialism of the African National Congress, which, I believe, is South Africa's ruling political party. Similarly, Amal is a political movement in Lebanon. The problem is that removing these "adornments" from the country names is not a matter of string manipulation: it requires substantive geopolitical knowledge and cannot be automated in ways that Stata is capable of. It would require human or artificial intelligence.

Added: Crossed with #2.
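One way to act on the point above is to record the human judgment in a small hand-built crosswalk and merge it onto the data, rather than editing the strings. The following is only a sketch: iso3 and is_state are hypothetical variables that would be filled in by hand for each distinct SIPRI name, and the "Lebanon" row and the three-row panel are stand-ins added purely for illustration.

```stata
* Hypothetical crosswalk: one row per distinct name, coded by hand
clear
input str37 country_name str3 iso3 byte is_state
"ANC (South Africa)*" ""    0
"African Union**"     ""    0
"Amal (Lebanon)*"     ""    0
"Lebanon"             "LBN" 1
end
tempfile xwalk
save `xwalk'

* Stand-in for the SIPRI panel (illustrative rows only)
clear
input str37 country_name int year
"Amal (Lebanon)*" 1950
"Lebanon"         1950
"African Union**" 1950
end

* attach the hand-made codes; crosswalk rows with no data are not kept
merge m:1 country_name using `xwalk', keep(master match)

* rows with _merge == 1 would be names the crosswalk does not cover yet
list country_name year iso3 is_state _merge, noobs

* keep only entries coded as states before merging with other country-year data
keep if is_state == 1
```

The crosswalk has to be built once per source, but after that the merge is mechanical and the coding decisions stay visible in one place.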
