Create new variable based on text within a substring

Kevin Matthews

Join Date: Jan 2017

Posts: 1
#1

Create new variable based on text within a substring

13 Jan 2017, 13:13

I want to be able to search for a string within an observation and then write a new file based on the value of the string. I have a list of school names in a flat file. Each school has a type in its name (i.e., elementary, primary, charter, middle, junior, high, secondary, etc.) but the location of the type varies for each observation varies as does the length of string. For example, if I have a variable called "School Name" and the first observation of that variable is the text "Johnson Elementary School", I want to create a second variable called "School Type" whose first observation is the text "Elementary".
Tags: None
Roger Chu

Join Date: Feb 2016

Posts: 84
#2

13 Jan 2017, 13:28

Hello Kevin,

Try the regexm() and regexs() functions.

Code:

help f_regexm
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30358
#3

13 Jan 2017, 13:30

Well, if the words "elementary, primary, ..." etc actually occur in the school name variable, it is easy enough to capture that information, even if it isn't always in the same location in the string.

Code:

gen school_type = "" foreach x in elementary primary charter middle junior high secondary { replace school_type = "`x'" if strpos(lower(school_name)), "`x'") }

Notes:
1. I think it is safer to seek the keyword in a lower case transform of the school name, so that we don't have to deal with possible inconsistencies in capitalization in the original data.
2. The variable school_type created by this code is a string variable. Depending on what you are going to do with the information, you may find it more convenient to work with a value-labeled numeric variable. See -help encode-.

Comment on posting technique: you don't have a variable called "School Name" -- you can't possibly, because that is not a legal variable name in Stata. Embedded blanks are not permitted. When describing Stata data as part of a question, it is best to be exact and use correct terminology.

Added: Crossed with Roger Chu's response. While this can be done using the functions he suggests, I think that is using a sledge hammer to swat a fly. The regular expression functions are fairly complicated to use properly. Sometimes they are indispensible, but in this case a far simpler solution is available.

Last edited by Clyde Schechter; 13 Jan 2017, 13:32.
Comment
Rich Goldstein

Join Date: Mar 2014

Posts: 4549
#4

13 Jan 2017, 13:34

not entirely clear (e.g., I assume you mean you have a string variable that contains these descriptors) but I think that the following will do what you want:

Code:

gen stelem=strpos(School_Name,"Elementary") gen sthigh=strpos(School_Name,"High")

etc.
note that I have changed the illegitimate variable name "School Name" (spaces are not allowed) to something legitimate
this will give you variables with the number where the string starts (if it doesn't exist then the new variable (stelem) will have a value of 0
you can then do, e.g.,

Code:

gen str School_Type="Elementary" if stelem>0 replace School_Type="High School" if sthigh>0

etc.

added: crossed with above replies from Roger and Clyde (and Clyde's is fuller)
Comment

Dr. Iqra Yaseen

Join Date: Jun 2022
Posts: 89

30 Oct 2023, 23:54

Clyde Schechter Rich Goldstein Roger Chu Joao Santos Silva Nick Cox

Code:

 Example generated by -dataex-. To install: ssc install dataex
clear
input int year str9 iso3_d str253 ProductSector str33 iso3_o double tradeflow_wto_d

2015 "IND" " Product: 2603 Copper  concentrates"                "EU"    2 0
2015 "IND" " Product: 2603 Copper  concentrates"                "EU"    10
2015 "IND" " Product: 2603 Copper ores "                         "EU"    37
2015 "IND" " Product: 2603 Copper ores "                         "EU"    10
2015 "IND" " Product: 2603 Copper ores "                         "EU"    10
2015 "IND" " Product: 2603 Copper ores and concentrates"   AUS"    27
2015 "IND" " Product: 2603 Copper ores and concentrates"   AUS     10
2015 "IND" " Product: 2603 Copper ores and concentrates"   AUS     10
2015 "IND" " Product: 2604 Nickel ores and concentrates"    "EU"     0
2016 "IND" " Product: 2604 Nickel ores and concentrates"    EU"    11
2017 "IND" " Product: 2604 Nickel ores and concentrates"    EU"     0
2018 "IND" " Product: 2604 Nickel ores and concentrates"    EU"     2
2019 "IND" " Product: 2604 Nickel ores and concentrates"    EU"     2
2020 "IND" " Product: 2604 Nickel ores and concentrates"    "AUS"     0
2021 "IND" " Product: 2604 Nickel ores and concentrates"    AUS    0
2022 "IND" " Product: 2604 Nickel ores and concentrates"     AUS     0
2015 "IND" " Product: 2605 Cobalt ores and concentrates"   EU"    58
2016 "IND" " Product: 2605 Cobalt ores and concentrates"   EU"     0
2017 "IND" " Product: 2605 Cobalt ores and concentrates"   EU"     0
2018 "IND" " Product: 2605 Cobalt ores and concentrates"   AUS     3
2019 "IND" " Product: 2605 Cobalt ores and concentrates"   AUS    60
2020 "IND" " Product: 2605 Cobalt ores and concentrates"   AUS  123
2021 "IND" " Product: 2605 Cobalt ores and concentrates"   AUS    73

this my data example
i need two new variables namely minerals (string variable) and trade_minerals (numerical variable) whose value should be the sum of variable tradeflow_imf_d on the basis of string variable ProductSector. i want a data like this

Code:

year iso3_d  iso3_o minerals trade_minerals
2015 "IND"  "EU"      Copper    57
2015  "IND"  "EU"      Nickel      47

my dataset ranges from 2015 to 2022 .
in other words i need to aggregate trade value according to mineral type on year and country basis .

please help. my paper is stuck because of this problem.

Last edited by Dr. Iqra Yaseen; 31 Oct 2023, 00:51.

Comment

Clyde Schechter

Join Date: Apr 2014

Posts: 30358
#6

31 Oct 2023, 09:19

Well, based on your description of what you want it sounds like you would get it with:

Code:

collapse (sum) tradeflow_wto_d, by(iso3_d iso3_o year ProductSector)

But, I do not see how you arrived at a total of 47 for 2015 "IND" "EU" Nickel. The only observation I find that matches that year, origin, destination, and product has 0 tradeflow. Did you make a mistake, or am I misunderstanding what you are trying to do?
Comment
Dr. Iqra Yaseen

Join Date: Jun 2022

Posts: 89
#7

31 Oct 2023, 10:44

Clyde Schechter yes i have made a mistake here. but we have already discussed this query at https://www.statalist.org/forums/for...03#post1732203

thank you so much
Comment

Announcement