string correction

oluyemi omale

Join Date: Mar 2024

Posts: 3
#1

10 Mar 2024, 11:12

Hi all,
I am currently working on a thesis project, part of which involves identifying individuals' degrees, grades, and so on. I am stuck on a problem with duplicates in strings.
I have a string that contains numerous words and digits separated by commas. I would like to remove all duplicate items from these comma-separated strings. For example,
100, 98, 89, 98, 100, 0, undergraduate, undergraduate, graduate, post graduate, 19-25, 25-29, ...... would become
100, 98, 89, 0, undergraduate, graduate, post graduate, 19-25, 25-29, ..
Notice that 25 is not counted as a duplicate, because in each case it is joined to either 19 or 29 by a "-".

I would prefer to use regular expressions if possible. I would appreciate any help in this matter!
Thank you so much in advance!
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#2

10 Mar 2024, 13:10

I have created a toy data set with just one observation based on the example you gave.

Code:

clear
set obs 1
gen str_var = "100, 98, 89, 98, 100, 0, undergraduate, undergraduate, graduate, post graduate, 19-25, 25-29"

// SOLUTION BEGINS HERE
gen `c(obs_t)' obs_no = _n
split str_var, parse(",") gen(token)
reshape long token, i(obs_no)
drop if missing(token)
by obs_no token (_j), sort: keep if _n == 1
by obs_no (_j), sort: replace _j = _n
reshape wide
egen edited_str_var = concat(token*), punct(", ")
drop token*

In the future, when showing data examples, please use the -dataex- command to do so. If you are running version 18, 17, 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.
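As a rough illustration of the advice above, a -dataex- call on the toy variable might look like the sketch below. The one-observation data set and the shortened string are made up for the illustration, and the -ssc install- line is commented out because it is only needed on older installations.

```stata
* ssc install dataex      // only needed if -dataex- is not already installed

* Hypothetical toy data: one observation, shortened example string
clear
set obs 1
gen str_var = "100, 98, 89, 98, 100, 0, undergraduate"

* Print a copy-and-paste-ready listing of str_var for a forum post
dataex str_var
```

The printed listing can be pasted into a post between code delimiters so that others can recreate the variable exactly, including its storage type.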
oluyemi omale

Join Date: Mar 2024

Posts: 3
#3

10 Mar 2024, 13:38

Dear Mr. Schechter,

Thank you for your reply. I will certainly read up on -dataex- and use it in the future! I have run the code you provided, but I get "100, 98, 89, 100, 0, undergraduate, graduate, post graduate, 19-25, 25-29". I apologize for not using -dataex- to display the result, as I don't yet know how to use it. In the result I am getting, 100 still appears twice, although everything else works perfectly. I am using Stata 18. I am not sure whether the problem is in my execution. I would appreciate any help. Thank you once again for your reply.
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#4

10 Mar 2024, 14:49

It's my mistake. In the -split- command, the -parse(",")- option should have been -parse(", ")-. Note the blank space following the comma. I left that out before. If you make that change, the duplicate 100 (or whatever else might come first) will be eliminated. Sorry about that error.
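To see why the blank matters, here is a minimal sketch on a shortened version of the example string. The stub names tok and tok_sp are made up for the illustration and are not part of the code in #2.

```stata
clear
set obs 1
gen str_var = "100, 98, 89, 98, 100"

// Parsing on "," alone leaves the blank attached to every token but the first
split str_var, parse(",") gen(tok)
assert tok1 == "100"
assert tok5 == " 100"       // leading blank, so it is not seen as a duplicate of tok1

// Parsing on ", " removes the blank together with the comma
split str_var, parse(", ") gen(tok_sp)
assert tok_sp1 == tok_sp5   // both are "100"
```

This also explains why only the first item was affected: every later token carried the same leading blank, so those tokens still matched one another. Trimming each token with strtrim() before dropping duplicates would presumably be an alternative if the spacing after the commas is not consistent in the real data.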
oluyemi omale

Join Date: Mar 2024

Posts: 3
#5

10 Mar 2024, 14:54

Dear Mr. Schechter,

Alright, got it. Thanks so much!