
  • How to split string variables with multiple repeating phrases?

    Hi, I have a data set that describes the kinds of graffiti identified at various business locations. The graffiti types are "pen", "spray paint", "pencil", "paint", "etched", and "chalk".

    In some observations, multiple types of graffiti may be listed, in any order, separated by a comma (" , "). Some observations have no data at all.

    For example:
    1 Etched
    2 Etched
    3 Pen
    4 Etched , Pen
    5 Etched
    6 Chalk , Paint
    7
    8 Spray Paint
    I figure that if I want to identify each of these in such a way that I could quantify the occurrence of each type of graffiti, I'd want to use the split command.

    So I have been using the following link as a reference.

    This resource is helpful to an extent, but it differs from what I need. In its court-case example, the case variable is split on different variations of "versus", and the phrases on either side of "versus" vary from case to case. In my data, the phrases on either side of the commas differ too, but they also repeat.

    And in some cases, one term is distinct from another even though they share a word: 'Paint' is different from 'Spray Paint'.

    I want to assign each type of graffiti a categorical number, where 1 = "Pen", 2 = "Spray Paint", 3 = "Pencil", and so on, and then create a variable, or variables, that lets me count how often each type of graffiti occurs.

    Basically, I need to quantify the occurrence of each type of graffiti by splitting the graffiti types variable, but the example in the link above does something quite different from what I need, and I'm not sure how to adapt it. Please help?
    Last edited by Julius Doyle; 02 Feb 2023, 12:55.

  • #2
    A more general reference for using the split command is the output of help split. This example may start you in a useful direction.
    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input byte id str13 marks
    1 "Etched"      
    2 "Etched"      
    3 "Pen"          
    4 "Etched , Pen"
    5 "Etched"      
    6 "Chalk , Paint"
    7 ""            
    8 "Spray Paint"  
    end
    
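    * Split on commas into marker1, marker2, ...
    split marks, generate(marker) parse(",")
    drop marks
    * One observation per id and graffiti type
    reshape long marker, i(id) j(j)
    * Remove the spaces left around each comma
    replace marker = trim(marker)
    * Flag ids with no graffiti recorded, then drop the empty leftovers
    replace marker = "NONE" if j==1 & marker==""
    drop if missing(marker)
    drop j
    list, sepby(id)
    tab marker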
    Code:
    . list, sepby(id)
    
         +------------------+
         | id        marker |
         |------------------|
      1. |  1        Etched |
         |------------------|
      2. |  2        Etched |
         |------------------|
      3. |  3           Pen |
         |------------------|
      4. |  4        Etched |
      5. |  4           Pen |
         |------------------|
      6. |  5        Etched |
         |------------------|
      7. |  6         Chalk |
      8. |  6         Paint |
         |------------------|
      9. |  7          NONE |
         |------------------|
     10. |  8   Spray Paint |
         +------------------+
    
    . tab marker
    
         marker |      Freq.     Percent        Cum.
    ------------+-----------------------------------
          Chalk |          1       10.00       10.00
         Etched |          4       40.00       50.00
           NONE |          1       10.00       60.00
          Paint |          1       10.00       70.00
            Pen |          2       20.00       90.00
    Spray Paint |          1       10.00      100.00
    ------------+-----------------------------------
          Total |         10      100.00
    Last edited by William Lisowski; 02 Feb 2023, 13:44.
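
The original question also asked for numeric category codes. As a rough sketch only, the long layout above could be encoded with a value label. Codes 1 to 3 follow the original post; codes 4 to 6, the label name gtype, and the variable name type are assumptions made for illustration. Matching is case-sensitive, so the spellings in real data would need to agree with the label text.

```stata
clear
input byte id str13 marks
1 "Etched"
2 "Etched"
3 "Pen"
4 "Etched , Pen"
5 "Etched"
6 "Chalk , Paint"
7 ""
8 "Spray Paint"
end

* Same split-and-reshape steps as above, dropping empty entries
split marks, generate(marker) parse(",")
drop marks
reshape long marker, i(id) j(j)
replace marker = trim(marker)
drop if missing(marker)
drop j

* Codes 1-3 come from the question; 4-6 are ASSUMED from the order
* in which the types were listed there
label define gtype 1 "Pen" 2 "Spray Paint" 3 "Pencil" 4 "Paint" 5 "Etched" 6 "Chalk"

* noextend makes encode stop with an error if a value of marker
* is not already in the label, instead of silently adding a new code
encode marker, generate(type) label(gtype) noextend
tab type
```

Using noextend is a design choice: a misspelled type would stop the run rather than quietly receive code 7.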



    • #3
      What does the j mean in this command script?



      • #4
        The split command created variables marker1 and marker2, because there were at most 2 types in a single observation in your example data.

        The reshape long command combined these into the single variable marker, with the variable j indicating whether the value came from marker1 or marker2. I assumed that doesn't really matter to you, since you didn't indicate that it was important to know which type was listed first, so I dropped that variable.
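
To see the role of j more concretely, here is a small sketch on two made-up rows taken from the example data. It lists the data after split (wide, with marker1 and marker2) and again after reshape long, this time keeping j so that the origin of each row stays visible.

```stata
clear
input byte id str13 marks
4 "Etched , Pen"
5 "Etched"
end

* Wide layout: one row per id, one variable per comma-separated piece
split marks, generate(marker) parse(",")
drop marks
list id marker1 marker2

* Long layout: one row per id and piece; j holds the numeric suffix
* (1 for marker1, 2 for marker2) that the value came from
reshape long marker, i(id) j(j)
replace marker = trim(marker)
list, sepby(id)
```

In the long listing, id 5 keeps a row with j equal to 2 and an empty marker, which is why the earlier code drops missing values of marker after reshaping.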



        • #5
          So, I think I made a mistake in my original post: I put ID numbers in the first column when those shouldn't be there. Unfortunately, that might affect the code needed, because when I follow your suggestion I get the following error:

          HTML Code:
          variable id does not uniquely identify the observations
          Your data are currently wide. You are performing a reshape long. You specified i(id) and j(j). In the current wide form, variable id should uniquely identify the observations.
          
          Remember this picture:
          
          Type reshape error for a list of problem observations.
          I think I'm getting this error because I naively numbered my table above, giving you the impression that there was some variable you could call ID. So I'm following your suggestion, and it's giving me this error.
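
One possible way around this, offered as a sketch rather than a confirmed fix: reshape long needs an i() variable that takes a distinct value in every observation, and one can be built from the observation number. The name obs_id is hypothetical; a new name is used because the error message suggests a variable called id already exists in the real data but repeats. The variable name marks is carried over from #2 and is also an assumption about the real data.

```stata
clear
input str13 marks
"Etched"
"Etched"
"Pen"
"Etched , Pen"
"Etched"
"Chalk , Paint"
""
"Spray Paint"
end

* Build a unique identifier from the observation number
generate long obs_id = _n
* isid exits with an error if obs_id is not unique
isid obs_id

split marks, generate(marker) parse(",")
drop marks
reshape long marker, i(obs_id) j(j)
replace marker = trim(marker)
replace marker = "NONE" if j==1 & marker==""
drop if missing(marker)
drop j
tab marker
```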
