Extracting specific text from large string entry

Pranab Mukhopadhyay

Join Date: Apr 2019

Posts: 6
#1

Extracting specific text from large string entry

14 Oct 2019, 13:03

I am seeking help for extracting specific data from a large string entry (strL). I have a list of names and their institutional affiliation separated by semi-colons. I am seeking names of individuals from a particular institution so that I do not have manually seek those names. I am attaching a sample in excel. I have used the "parse" command but this separates the entries by identifiers and creates new columns. Sometimes this creates over 30 new columns and it is very hard to track it manually.
Would be very grateful for any help. Thanks in advance
Attached Files

SAMPLE file.csv (1.1 KB, 2 views)
Tags: None
Marcos Almeida

Join Date: Apr 2014

Posts: 4047
#2

14 Oct 2019, 13:07

Welcome to the Stata Forum\Statalist.

Please read the FAQ. There you'll find how to share data and provide a more informative post. This is the best approach to entice a helpful reply.

Best regards,

Marcos
Comment
Pranab Mukhopadhyay

Join Date: Apr 2019

Posts: 6
#3

14 Oct 2019, 23:41

Dear Marcos,
Thank for your reply and your suggestion. I have been using the Statalist for many problems I had encountered and had my answers even before the need to post a query. However, on this one I dont seem to have been able to get any ready disucssion on the list -- or my search has been inadequate.
In any case, I will rephrase my query as you suggest.
Thanks
pranab
Comment
Pranab Mukhopadhyay

Join Date: Apr 2019

Posts: 6
#4

15 Oct 2019, 00:04

Dear All,

I have a data set which list names and affiliations with location of a set of people in a single cell of a row item. This is a string variable (and sometimes a long entry taking the form strL). I have many such row entries in a file. I have a two line example of the what the data looks like for illustration at the end of this post. My interest is to find a way to extract names of people (like Smith, J.", which is the second name entry in the line below) if they belong to a particular institutions (like "XYZ of Science and Technology").

I have used the "parse" command using ";" as the separator. However, this created many new columns with each name and affiliation as a new column. In the next stage I used parse again with "," as separator and this split up the first name and surname and institution, city, country.

Some entries have the city information and some don't so there is lack of uniformity in the entry. While the name of the institution will remain constant in my search, the name of the person will change.

So in the two line entry below I am interested in extracting the names Smith, J. (line 1) and Pure, M. (line 2) as they belong to XYZ of Science and Technology, City 1.

I looked up the technique mentioned at link below but could not replicate it for my query:
https://stats.idre.ucla.edu/stata/fa...r-expressions/

I would be grateful could get any help please.

Thanks,
pranab

Example of data:

input string variable name_affiliation

Adams, J., ABC University, India; Smith, J., XYZ of Science and Technology, City1, India; Krun, K., English Center, University of MNO, City 2, India:
Jim, R., TRU University, India; Jason, T., DRI Technology, City1, India; Pure, M., XYZ of Science and Technology, City1, India:
Comment

David Benson

Join Date: Oct 2018
Posts: 489

15 Oct 2019, 01:27

Hi Pranab,

The joys of dealing with text fields! So you will probably need to end up parsing the field a few times as you mention. However, you could lighten your workload by deleting observations that don't contain institutions you care about (i.e. "XYZ of Science and Technology", "University of Delhi", and "Jawaharlal Nehru University") and then parsing the rest. I would use the strpos function to accomplish that.

I would take a look at some examples here, here, and here to help you get started (all are on Statalist).

If your data always follows Last Name, First Name, University, city (sometimes), India, then I would parse the data by ";" (as you have done), then find the location of that 2nd comma and extract everything before it (to extract the name).

It would be *really* helpful if you could provide another 20-30 observations of your data. (Normally I would also say, "using Stata's dataex command." But your string variables are too wide for dataex (see the code window below)).

Code:

* NOTE: Dataex won't work because the string variables are too long
* input str151 author1 str144 author2 str140 author3
* data width (300 chars) exceeds max linesize. Try specifying fewer variables
* (1000);

*** I used the data from your CSV file
split author_affil, parse(;) gen(author)

*** Check to see if ANY of the authors are affiliated with the relevant universities
gen has_affil = 0
foreach univ in "VIT University" "Science and Technology" "Delhi University" {
    replace has_affil = 1 if strpos(author_affil, "`univ'") > 0
    }

*** Checking for individual authors
gen univ1 = 0  // check to see if author1's university is on the list
gen univ2 = 0
gen univ3 = 0

*** NOTE: This code won't catch misspellings (i.e. "univeristy" rather than "university")
foreach univ in "VIT University" "Science and Technology" "Delhi University" {
    replace univ1=1 if strpos(author1, "`univ'") > 0
    replace univ2=1 if strpos(author2, "`univ'") > 0
    replace univ3=1 if strpos(author3, "`univ'") > 0
    }


list has_affil univ1 univ2 univ3, abbrev(12)

     +-----------------------------------+
     | has_affil   univ1   univ2   univ3 |
     |-----------------------------------|
  1. |         1       0       1       0 |
  2. |         1       1       1       0 |
  3. |         0       0       0       0 |
  4. |         1       1       1       1 |
     +-----------------------------------+

I tried to create a table for author1 and univ1 below, but couldn't get the Statalist table tools editor to allow borders. However, I have highlighted the matches that Stata found using the above code.
'

author1	univ1
Karthikeyan, J., Vellore Institute of Technology University, India	0
Michael Raj, J., Department of English and Foreign Languages, SRM Institute of Science and Technology, Chennai, India	1
Vaditya, V., Department of Social Exclusion Studies, School of Inter-Disciplinary Studies, The English & Foreign Languages University, Hyderabad, India	0
Rajkhowa, G., English and Foreign Languages University, Hyderabad, Centre for Studies in Social Sciences, Calcutta, Delhi University, India	1

'
Hope this helps!

Last edited by David Benson; 15 Oct 2019, 01:35.

Comment

Pranab Mukhopadhyay

Join Date: Apr 2019

Posts: 6
#6

16 Oct 2019, 13:27

Hi David,
Thanks for taking the time out to respond. This is an interesting way forward. But suppose I was interested in anything else except the author's name of one of the institutions, say Delhi University. So every time the entry Delhi university shows up in a row, I would like to extract the name of the author belonging to Delhi University. Hypothetically, lets say I am not interested in the other names or institutions. Is there a way you would suggest to do this extraction?
best,
pranab
Comment

David Benson

Join Date: Oct 2018
Posts: 489

17 Oct 2019, 17:05

Hi Pranab,

I'm not the best at these types of loops, so I will point you to some other links and hopefully someone else can come along and point to a more efficient way to do this. Also, as long as the name is the first thing listed, it will probably be easier to extract all names, then look to see if the university is one you are interested in:

Code:

* Example generated by -dataex-. To install: ssc install dataex
dataex
clear
input str151 author1 byte univ1
"Karthikeyan, J., Vellore Institute of Technology University, India"                                                                                      0
"Michael Raj, J., Department of English and Foreign Languages, SRM Institute of Science and Technology, Chennai, India"                                   1
"Vaditya, V., Department of Social Exclusion Studies, School of Inter-Disciplinary Studies, The English & Foreign Languages University, Hyderabad, India" 0
"Rajkhowa, G., English and Foreign Languages University, Hyderabad, Centre for Studies in Social Sciences, Calcutta, Delhi University, India"             1
end

*** Replace the 1st comma with a ";" (just because I don't know how to replace the 2nd comma)
replace author1 =  subinstr( author1, "," , ";", 1)

* Now the 2nd comma is really the first comma
gen auth1_comma = strpos(author1, ",")   // Find position of that new first comma
gen author1_name = substr(author1, 1, auth1_comma -1)  // name is everything before that new first comma


gen Delhi_univ = 0
replace Delhi_univ = 1 if strpos( author1, "Delhi Univ") > 1

*** Or, if you would like to do a different university id for each relevant university
gen univ_id = 1 if strpos( author1, "Delhi Univ") > 1
replace univ_id = 2 if strpos( author1, "XYZ of Science and Technology") > 1


*** You will probably want change the ";" back to a comma
. list author1_name Delhi_univ, noobs abbrev(12)

  +------------------------------+
  |    author1_name   Delhi_univ |
  |------------------------------|
  | Karthikeyan; J.            0 |
  | Michael Raj; J.            0 |
  |     Vaditya; V.            0 |
  |    Rajkhowa; G.            1 |
  +------------------------------+


*** You could also do it with moss (SSC)
ssc install moss
moss author1, match(",")  // finds the number of commas in author1 as well as their positions

. list author1_name auth1_comma _count _pos*, noobs abbrev(12)

  +----------------------------------------------------------------------------------------+
  |    author1_name   auth1_comma   _count   _pos1   _pos2   _pos3   _pos4   _pos5   _pos6 |
  |----------------------------------------------------------------------------------------|
  | Karthikeyan; J.            16        2      16      60       .       .       .       . |
  | Michael Raj; J.            16        4      16      61     102     111       .       . |
  |     Vaditya; V.            12        5      12      52      90     134     145       . |
  |    Rajkhowa; G.            13        6      13      55      66     105     115     133 |
  +----------------------------------------------------------------------------------------+

. list author1, noobs

  +-----------------------------------------------------------------------------------------------------------------------------------------+
  | author1                                                                                                                                 |
  |-----------------------------------------------------------------------------------------------------------------------------------------|
  | Karthikeyan; J., Vellore Institute of Technology University, India                                                                      |
  | Michael Raj; J., Department of English and Foreign Languages, SRM Institute of Science and Technology, Chennai, India                   |
  | Vaditya; V., Department of Social Exclusion Studies, School of Inter-Disciplinary Studies, The English & Foreign Languages University.. |
  | Rajkhowa; G., English and Foreign Languages University, Hyderabad, Centre for Studies in Social Sciences, Calcutta, Delhi University,.. |
  +-----------------------------------------------------------------------------------------------------------------------------------------+

You might also take a look at https://stackoverflow.com/questions/...-of-the-string
Nick Cox also has an explanation about extracting plaintiff and defendant from a single string variable here

Hope that helps!
--David

Last edited by David Benson; 17 Oct 2019, 17:12.

Comment

Pranab Mukhopadhyay

Join Date: Apr 2019

Posts: 6
#8

18 Oct 2019, 00:15

Hi David,

I think this would narrow the list down ....

split authorswithaffiliations, parse(
list authorswithaffiliations1 if strmatch(authorswithaffiliations1,"*XYZ University*")

The only thing is that I have to run this each time with the number linked to author changed. I tried using $ and * attached to the authorswithaffiliations -- like authorswithaffiliations$ or authorswithaffiliations* and I came up with an error

"uthorswithaffiliations* invalid name
r(198);"

Will look at other commands and see if I can automate this.
Comment
Marcos Almeida

Join Date: Apr 2014

Posts: 4047
#9

18 Oct 2019, 03:58

Beware “uthorswithaffiliations” may be a typo.

Best regards,

Marcos
Comment
David Benson

Join Date: Oct 2018

Posts: 489
#10

18 Oct 2019, 13:28

Originally posted by Pranab Mukhopadhyay View Post

The only thing is that I have to run this each time with the number linked to author changed. I tried using $ and * attached to the authorswithaffiliations -- like authorswithaffiliations$ or authorswithaffiliations* and I came up with an error

Actually, once you get the commands doing what you want, getting Stata to loop over numbers (i.e. for author1, author2, author3, etc) is quite easy.

Code:

*** Using my example from #7 forvalues i= 1/4 { replace author`i' = subinstr( author`i', "," , ";", 1) gen auth`i'_comma = strpos(author`i', ",") gen author`i'_name = substr(author`i', 1, auth`i'_comma -1) } *** Using your example in #8 // NOTE: Split authorswithaffiliations first forvalues i= 1/4 { list authorswithaffiliations`i' if strmatch(authorswithaffiliations`i',"*XYZ University*") }
Comment

Announcement