How to split string variables with multiple repeating phrases?

Julius Doyle

Join Date: Feb 2023

Posts: 10
#1

02 Feb 2023, 11:37

Hi, I have a data set that describes the different kinds of graffiti identified at various business locations. The graffiti types are "pen", "spray paint", "pencil", "paint", "etched", and "chalk".

In some data points, multiple types of graffiti may be listed, in any order, separated by a comma (" , "). Some data points are empty.

For example:
1 Etched
2 Etched
3 Pen
4 Etched , Pen
5 Etched
6 Chalk , Paint
7
8 Spray Paint

I figure that if I want to identify each of these in such a way that I could quantify the occurrence of each type of graffiti, I'd want to use the split command.

So I have been using the following link as a reference.

This resource is helpful to an extent, but it differs from what I need to do: in the example provided about court cases, the case variable is split on different variations of "versus". And whereas the phrases on either side of the "versus" expression vary by case, the phrases on either side of my commas differ but also repeat.

And in some cases, one term is distinct from another even though they share a word: 'Paint' is different from 'Spray Paint'.

I want to assign each type of graffiti a categorical number, where 1 = "Pen", 2 = "Spray Paint", 3 = "Pencil", and so on. Then I want to create a variable, or variables, that lets me measure each type of graffiti in order to quantify its occurrence.

But I'm just confused, and the example referenced in the link above doesn't quite help me do that.

So basically, I need to quantify the occurrence of each type of graffiti by splitting the graffiti types variable. But the example I have is doing something quite different from what I need, so it is not as helpful.

I'm not quite sure how to do this. Please help?

Last edited by Julius Doyle; 02 Feb 2023, 11:55.
Tags: categorical, split, string, syntax

William Lisowski

Join Date: Dec 2014
Posts: 10150

02 Feb 2023, 12:36

A more general reference for using the split command is the output of help split. This example may start you in a useful direction.

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input byte id str13 marks
1 "Etched"      
2 "Etched"      
3 "Pen"          
4 "Etched , Pen"
5 "Etched"      
6 "Chalk , Paint"
7 ""            
8 "Spray Paint"  
end

split marks, generate(marker) parse(",")
drop marks
reshape long marker, i(id) j(j)
replace marker = trim(marker)
replace marker = "NONE" if j==1 & marker==""
drop if missing(marker)
drop j
list, sepby(id)
tab marker

Code:

. list, sepby(id)

     +------------------+
     | id        marker |
     |------------------|
  1. |  1        Etched |
     |------------------|
  2. |  2        Etched |
     |------------------|
  3. |  3           Pen |
     |------------------|
  4. |  4        Etched |
  5. |  4           Pen |
     |------------------|
  6. |  5        Etched |
     |------------------|
  7. |  6         Chalk |
  8. |  6         Paint |
     |------------------|
  9. |  7          NONE |
     |------------------|
 10. |  8   Spray Paint |
     +------------------+

. tab marker

     marker |      Freq.     Percent        Cum.
------------+-----------------------------------
      Chalk |          1       10.00       10.00
     Etched |          4       40.00       50.00
       NONE |          1       10.00       60.00
      Paint |          1       10.00       70.00
        Pen |          2       20.00       90.00
Spray Paint |          1       10.00      100.00
------------+-----------------------------------
      Total |         10      100.00
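
One possible next step, sketched here rather than taken from the example above, is to turn the long-form marker variable into the numbered categories the original post asks for. The codes 1 to 3 follow the scheme in the question; the codes 4 to 6 and 0 for "NONE", the label name gtype, and the variable name type are assumptions made for illustration. Because encode matches whole strings, "Paint" and "Spray Paint" stay distinct.

```stata
* Self-contained sketch: long-form data like the listing above
clear
input byte id str13 marker
1 "Etched"
3 "Pen"
4 "Etched"
4 "Pen"
6 "Chalk"
6 "Paint"
7 "NONE"
8 "Spray Paint"
end

* Assumed coding scheme: 1-3 from the original post, 0 and 4-6 hypothetical
label define gtype 0 "NONE" 1 "Pen" 2 "Spray Paint" 3 "Pencil" 4 "Paint" 5 "Etched" 6 "Chalk"

* Use the predefined label; noextend makes encode stop with an error
* if marker holds a string that is not in gtype (for example a misspelling)
encode marker, generate(type) label(gtype) noextend

tabulate type
tabulate type, nolabel
```

Defining the label before calling encode fixes the numbering; without label(), encode would number the categories alphabetically.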

Last edited by William Lisowski; 02 Feb 2023, 12:44.


Julius Doyle

Join Date: Feb 2023
Posts: 10

13 Feb 2023, 11:25

What does the j mean in this command script?


William Lisowski

Join Date: Dec 2014

Posts: 10150
#4

13 Feb 2023, 13:56

The split command created variables marker1 and marker2, because there were at most 2 types in a single observation in your example data.

The reshape long command combined these into the single variable marker, with the variable j indicating whether the value originally came from marker1 or marker2. But I assumed that doesn't really matter to you, since you didn't indicate that it was important to know which one came first, so I dropped that variable.
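
To see what j records, it may help to list the data before and after the reshape. The following is a minimal sketch on three of the example observations; it keeps j instead of dropping it.

```stata
clear
input byte id str13 marks
4 "Etched , Pen"
6 "Chalk , Paint"
8 "Spray Paint"
end

* split creates marker1 and marker2 (wide form: one row per id)
split marks, generate(marker) parse(",")
list id marker1 marker2

* reshape long stacks marker1 and marker2 into marker;
* j holds the numeric suffix (1 or 2) of the variable each value came from
drop marks
reshape long marker, i(id) j(j)
replace marker = trim(marker)
list, sepby(id)
```

Observation 8 has only one type, so its j==2 row has an empty marker, which is why the earlier code drops rows with missing marker.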
Julius Doyle

Join Date: Feb 2023

Posts: 10
#5

28 Feb 2023, 16:15

I think I made a mistake in my original post: I put ID numbers in the first column of the example, and those shouldn't be there. Unfortunately, that may affect the code needed, because I followed your suggestion and got the following error:

HTML Code:

variable id does not uniquely identify the observations
    Your data are currently wide. You are performing a reshape long. You specified i(id) and j(j). In the current wide form, variable id should uniquely identify the observations. Remember this picture:
    Type reshape error for a list of problem observations.

I think I'm getting this error because I naively numbered my table above, giving you the impression that there was a variable you could call ID. So I'm following your suggestion, and it's giving me this error.
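
If the real data has no variable that uniquely identifies observations, one common approach is to create one from the observation number before reshaping. This is a sketch under that assumption; the variable names marks and obs_id are hypothetical, and obs_id is used instead of id in case a non-unique id already exists in the data.

```stata
* Assumed layout: only the string variable, no identifier
clear
input str13 marks
"Etched"
"Etched"
"Pen"
"Etched , Pen"
""
"Spray Paint"
end

* Create a unique identifier from the observation number
generate long obs_id = _n
* isid stops with an error unless obs_id uniquely identifies the observations
isid obs_id

split marks, generate(marker) parse(",")
drop marks
reshape long marker, i(obs_id) j(j)
replace marker = trim(marker)
replace marker = "NONE" if j==1 & marker==""
drop if missing(marker)
drop j
tabulate marker
```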