Messy string data: how to do crosstabulations and descriptives

Olena Mazurenko

Join Date: May 2018

Posts: 18
#1

Messy string data: how to do crosstabulations and descriptives

28 Apr 2022, 19:04

Dear STATA community,

Please help!
I have messy data containing over thirty variations of different words (e.g., leadership) in four columns for over 800 observations. A screenshot of the data is attached. I could not use the dataex command due to the large size of the file.

How do I quickly calculate how many times each of the words appear among all four columns?

I also need to do crosstabulations between these two words at a time. How would I do it?

Do I need to recode each word into a numeric value?

I would appreciate your help!
Olena

Attached Files
Tags: None
Jared Greathouse

Join Date: Sep 2021

Posts: 2170
#2

28 Apr 2022, 19:20

Do

Code:

dataex id Theme1 Theme2

and report that. We literally cannot proceed without a dataex example.

Please give your example data using the dataex command. For us to provide meaningful feedback, you must provide your example data using the dataex command, the real data from an easily importable source (i.e., Github), or the equivalent of a toy example.
Otherwise, anything we say is simply a waste of time. Note, that I'm not trying to be mean in saying this, I'm saying this because if we can't see your dataset as you do with a minimal worked example, anything we suggest is just guesswork. The reason that I'm emphasizing this is because questions like this one likely have a relatively simple fix, but even simple fixes can be wildly overcomplicated without a minimal worked example of a dataset and code that you've tried to accomplish your task.

So please, provide us with your example data.
Comment

Tom Scott

Join Date: Apr 2019
Posts: 264

28 Apr 2022, 20:54

Code:

gen leadership = 1 if strpos(lower(Theme1), "leadership") ==   1 | strpos(lower(Theme2), "leadership") ==   1
replace leadership = 0 if leadership!=1
gen strain_burn = 1 if strpos(lower(Theme1), "strain") ==   1 | strpos(lower(Theme1), "burnout") ==   1 | strpos(lower(Theme2), "strain") ==   1 | strpos(lower(Theme2), "burnout") ==   1
replace strain_burn = 0 if strain_burn!=1
gen missingthemes = 1 if Theme1=="" & Theme2==""
replace leadership = . if missingthemes==1
replace strain_burn= . if missingthemes==1
tab leadership strainburn

Comment

Olena Mazurenko

Join Date: May 2018

Posts: 18
#4

29 Apr 2022, 08:01

Jared Greathouse and Tom Scott

Huge thanks for being willing to help me, I really appreciate it! An example of the data is below. I was able to get the overall frequency of the themes by reshaping the data in a long format.
However, I am still not sure how to answer the following information: How can I know which theme was mentioned most often with another theme.
For example, if leadership was the most commonly mentioned theme, what was another theme mentioned along with the leadership, was it overtime/extra work or something else? I was thinking of some sort of a matrix that will allow me to see how often each theme co-occurred in the interview. Any suggestions?
I am in a time crunch to complete a report! Any clues would be absolute lifesavers!
Thanks,
Olena
input int id str72 t1 str73 t2
49 "PH expertise not valued" "Politicization"
128 "burnout/overwhelmed" "harrassment/threats to PH agency/staff"
3059 "inequity in response roles/obligations " "staff shortagess"
308 "burnout/overwhelmed" "harrassment/threats to PH agency/staff"
416 "distrust/negative view of PH" "telework"
326 "frontline risk" " employee wellness"
340 "lack of emergency response protocols" "jaded"
483 "leadership issues" "politicization activities"
593 "leadership issues" "distrust/negative view of PH"
4252 "administrative preparedness/response" "strain on core responsibilities of PH"
4904 "administrative preparedness/response" "unappreciated"
140 "burnout/overwhelmed" "employee wellness"
150 "burnout/overwhelmed" "inequity in response roles/obligations"
255 "burnout/overwhelmed" "overtime/extra work"
522 "burnout/overwhelmed" "strain on core responsibilities of PH"
739 "burnout/overwhelmed" "strain on core responsibilities of PH"
848 "burnout/overwhelmed" "lack of emergency response protocols"
1274 "burnout/overwhelmed" "overtime/extra work"
1513 "burnout/overwhelmed" "leadership support"
1568 "burnout/overwhelmed" "overtime/extra work"
1689 "burnout/overwhelmed" "overtime/extra work"
1857 "burnout/overwhelmed" "overtime/extra work"
1920 "burnout/overwhelmed" "overtime/extra work"
2179 "burnout/overwhelmed" "strain on core responsibilities of PH"
Comment
Tom Scott

Join Date: Apr 2019

Posts: 264
#5

29 Apr 2022, 08:16

This would be one option

Code:

replace t1=strtrim(t1) replace t2=strtrim(t2) egen themepattern = concat( t1 t2), p(---) tab themepattern, sort

You would want to make sure that there aren't spelling variants or other issues (e.g., "burnout" vs. "burn out")
Comment
Olena Mazurenko

Join Date: May 2018

Posts: 18
#6

29 Apr 2022, 08:31

Tom Scott : Thanks for offering the solution! I truly appreciate your help!!!!! I am not sure it worked though..

See the screenshot (I hope it would be displayed correctly). These are the first few rows. Does this mean that pride in PH work/mission was not mentioned with any other themes?

pride in PH work/mission--- | 84 10.36 10.36
burnout/overwhelmed---

I also failed to correctly describe that I have four themes, so I need to get this matrix for all four themes. How would I need to modify the code?

See dataex output for complete data:
input int id str72 t1 str73 t2 str72 t3 str38 t4
49 "PH expertise not valued" "Politicization" "Turnover" ""
128 "burnout/overwhelmed" "harrassment/threats to PH agency/staff" "" ""
3059 "inequity in response roles/obligations" "staff shortagess" "retain temporary employees" ""
308 "burnout/overwhelmed" "harrassment/threats to PH agency/staff" "" ""
416 "distrust/negative view of PH" "telework" "" ""
326 "frontline risk" "employee wellness" "distrust/negative view of PH" ""
340 "lack of emergency response protocols" "jaded" "burnout/overwhelmed" ""
483 "leadership issues" "politicization activities" "" ""
593 "leadership issues" "distrust/negative view of PH" "" ""

Attached Files
Comment
Olena Mazurenko

Join Date: May 2018

Posts: 18
#7

29 Apr 2022, 08:39

Tom Scott I went ahead and tried applying your code to all four themes. I am not sure how to interpret it:
pride in PH work/mission--------- | 84 10.36 10.36
burnout/overwhelmed--------- | 21 2.59 12.95
teamwork--------- | 15 1.85 14.80
lack of appropriate training/educatio.. | 14 1.73 16.52
distrust/negative view of PH--------- | 13 1.60 18.13
Comment
Tom Scott

Join Date: Apr 2019

Posts: 264
#8

29 Apr 2022, 08:50

Don't hold back data like that in the future when sharing on this site because it wastes peoples time. See revised code below. No, it means that the most common grouping across themes was "pride in ..." as the first theme and a blank values in the other theme. The "sort" after tab means that it is sorting by frequency of occurrence not name. You could use tab without the sort option to see what other themes "pride in...." paired with when it was listed as the first theme. But it could show up in themes 2-4. You could also use the code I provided in my first post. Since you can't see the full patterns by using tabulate, I provided code to sort on frequency so you can easily find the pattern value in data editor.
replace t1=strtrim(t1) replace t2=strtrim(t2) replace t3=strtrim(t3) replace t4=strtrim(t4) egen themepattern = concat( t1 t2 t3 t4), p(---) tab themepattern, sort

Code:

egen themenumber = count(themepattern), by(themepattern) gsort -themenumber
Comment

Jared Greathouse

Join Date: Sep 2021
Posts: 2170

29 Apr 2022, 08:53

You don't need to post the results in normal text boxes, it's less readable like that. Just do

Code:

sysuse auto, clear

reg price weight

      Source |       SS           df       MS      Number of obs   =        74
-------------+----------------------------------   F(1, 72)        =     29.42
       Model |   184233937         1   184233937   Prob > F        =    0.0000
    Residual |   450831459        72  6261548.04   R-squared       =    0.2901
-------------+----------------------------------   Adj R-squared   =    0.2802
       Total |   635065396        73  8699525.97   Root MSE        =    2502.3

------------------------------------------------------------------------------
       price | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
      weight |   2.044063   .3768341     5.42   0.000     1.292857    2.795268
       _cons |  -6.707353    1174.43    -0.01   0.995     -2347.89    2334.475
------------------------------------------------------------------------------

and the output comes out much smoother like that. Either way, I'm still not sure I'm understanding the question.

What are the 30 words in question? It isn't obvious by your data examples and you only mentioned leadership. What are some of the others? The code so far seems to do what you want, right?

Last edited by Jared Greathouse; 29 Apr 2022, 08:56.

Comment

Tom Scott

Join Date: Apr 2019

Posts: 264
#10

29 Apr 2022, 08:57

Your results from Post #7 mean that in 84 rows/observations, individuals reported only 1 theme and that was "pride in PH....". Because you sorted on frequency of observations, the fact that it appears first means that it is the most common pattern of themes. The second most common is having 1 theme that is burnout, and that appeared in the data 21 times. etc.
Comment
Olena Mazurenko

Join Date: May 2018

Posts: 18
#11

29 Apr 2022, 18:20

Dear Tom Scott and Jared Greathouse,

First, let me apologize for asking for your input and not being clear on my request. I definitely did not mean to withhold my data or waste your time. I am ver sorry about this situation. Unfortunately, I am still struggling with creating an output that would tell me which combinations of the themes occurred most often. I created a bunch of dummy variables for each theme, hoping that I could use a table command, but it wasnt helpful. I just need to tell which themes were most commonly mentioned together. For example, if someone said that pride in PH work is important he/she also talked about burnout. My data is in a long format, so may be its contributing to the issue. See below. I know I already burned a ton of your time, but I would really appreciate your wisdom.

clear
input int id byte j str72 t float(pride burnout leadership_issues compensation lack_em_response_prt leadership_support staff_shortagess communication teamwork empl_wellness)
25 1 "collaboration" 0 0 0 0 0 0 0 0 0 0
25 2 "" 0 0 0 0 0 0 0 0 0 0
25 3 "" 0 0 0 0 0 0 0 0 0 0
25 4 "" 0 0 0 0 0 0 0 0 0 0
32 1 "overtime/extra work" 0 0 0 0 0 0 0 0 0 0
32 2 "Extra compensation" 0 0 0 0 0 0 0 0 0 0
32 3 "" 0 0 0 0 0 0 0 0 0 0
32 4 "" 0 0 0 0 0 0 0 0 0 0
33 1 "Inequity in response roles/obligations " 0 0 0 0 0 0 0 0 0 0
33 2 "staff shortagess" 0 0 0 0 0 0 1 0 0 0
Comment
Tom Scott

Join Date: Apr 2019

Posts: 264
#12

29 Apr 2022, 18:45

This is the last I can do for you. The code I gave you is doing what you want. I think you're just having a hard time understanding it. Or maybe I'm misunderstanding what you want and if that is the case please just be very very specific on what you want to be able to say. If you want to know the most common pattern for a certain response type, you can just show the results if that response type is included in one of the themes:

Code:

replace t1=strtrim(t1) replace t2=strtrim(t2) replace t3=strtrim(t3) replace t4=strtrim(t4) egen themepattern = concat( t1 t2 t3 t4), p(---) tab themepattern if (strpos(lower(t1), "pride in PH") == 1 | strpos(lower(t2), "pride in PH") == 1 | strpos(lower(t3), "pride in PH") == 1 | strpos(lower(t4), "pride in PH") == 1), sort list themepattern if (strpos(lower(t1), "pride in PH") == 1 | strpos(lower(t2), "pride in PH") == 1 | strpos(lower(t3), "pride in PH") == 1 | strpos(lower(t4), "pride in PH") == 1)

The code following "tab themepattern if" and "list themepattern if" is just telling Stata to only show the results for observations where one of the 4 themes had a value of "pride in PH ....". You could change "pride in PH" to other categories like "burnout" if you want to see results that are specific to that response type.
The tab code will show you the most common patterns in order from most frequent to least frequent, and the list command will show you the full categories.

Good luck
Comment

Announcement