How do you deal with multi-value cells in data cleaning?

Laila Voss

Join Date: Jul 2020

Posts: 8
#1

How do you deal with multi-value cells in data cleaning?

30 Jul 2020, 15:00

Hello! I have a question about a) data cleaning techniques/practices and b) how to implement those techniques and practices in Stata. What I am looking at is a survey that asks about countries. Some respondents have answered with one country, while others had answered with multiple, like below:

COUNTRY NAME BUDGET ENJOYED THE TRIP

Argentina Joe $5 YES

Spain Samantha $10 YES

Mexico, Guatemala Tony $10 NO

Nicaragua, Haiti, South Africa Lisa $5 YES

I want to encode each country to a number, so I'll need to separate them so there's only one country per cell (I think? correct me if this is a wrong assumption). I have seen some stuff online that suggests creating a new row for each country and copying the rest of the data for that row along with it; something like this:

COUNTRY NAME BUDGET ENJOYED THE TRIP

Argentina Joe $5 YES

Spain Samantha $10 YES

Mexico Tony $10 NO

Guatemala Tony $10 NO

Nicaragua Lisa $5 YES

Haiti Lisa $5 YES

South Africa Lisa $5 YES

However, my concern with this is that it seems like it'll mess up my descriptive and summary statistics; for example, the mean for budget should be $7.50 but if I calculate the mean on the "clean" version, it gives undue weight to Lisa and Tony and makes the mean $7.14. If I go by rows, I'll get that 5/7 entries enjoyed the trip when it really should be 3/4 people. I have a lot of columns like this, where the value will be the same if I were to split up the countries in a single cell. My problem is that I want to split up the multi-value cells (since I think that's the proper way to have tidy data?) but I still want the split rows to have the same weight in my summary statistics as it did when it was one row.

If this concern is not valid, why not and how can it be circumvented?

Another concern I have will be getting the frequencies of countries and some list 10+ so I'm not sure if it'd be better to make more country columns instead. Is there another or better way to deal with multi-value cells in which you still want the equal weight to be held according to some identifier, like a name? And what if I have several columns with multi-value cells — do I need to create a row for each combination of values within those columns?

Any advice/help would be very appreciated. Thank you so much!
Tags: data cleaning

1 like
Joro Kolev

Join Date: Aug 2018

Posts: 3050
#2

30 Jul 2020, 15:24

This

Code:

split country, gen(land) parse(,)

will put each country in new separate variable.
2 likes
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30101
#3

30 Jul 2020, 15:24

I don't think there is a single best way to handle this situation. I think it depends on the context. (And sometimes even in the same study, you may have to do it more than one way for particular analyses.)

From what you've described,it seems to me it would be a mistake to create new observations ("rows") when there are multiple countries, for precisely the computational reasons you outline in your post. Moreover, conceptually it would be wrong to attribute the budget for Lisa's trip to any one of the countries she visited it, let alone to all of them. There were probably some economies achieved by visiting more than one country in the same trip. (Or, maybe there were some excess expenses. The point is, that there is no clear way to allocate the budget among them.)

So, at least for the kinds of analysis you've described, I'd leave the data just the way you show it.

Now, there may be other purposes for which you need single country information, e.g. if you want to tally how many people visited each country. Then for that analysis, I'd do something like this:

Code:

preserve split COUNTRY, parse(",") gen(c) keep c* name reshape long c, i(name) j(_j) tab c restore

Note: Above code presumes there is only one observation for each name. If that's not true, some other variable or combination of variables that uniquely identify observations must be used in the i() option. If none exists, create one for the purpose.

The example data you provided was easy to read but is difficult to import into Stata. So the code is untested and may contain typos or other errors. In the future, to get code that has been tried out, show example data by using the -dataex- command. If you are running version 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

When asking for help with code, always show example data. When showing example data, always use -dataex-.

Added: crossed with #2
2 likes
Comment

Shadrack Muema

Join Date: Aug 2020
Posts: 5

12 Sep 2020, 05:38

Dear fellow STATA users.
I have some unclean AntiMicrobial Resistance(AMR) data that i would like to clean . The data is in this format

LabNo	Test	Results
210514	SENSITIVITY	clindamycin 3+,cephalexim 2+,Cyprofloxaxin3+.
210514	RESISTANCE	Tetracyclin,erythromycin,septrin,cloxacillin,Amoxy clav
203658	SENSITIVITY	NO PATHOGEN ISOLATED AT 37 0c for 48hrs
203819	Salmonella Typhi H	Positive
203918	Salmonella Typhi H	negative
204089	SENSITIVITY	levofloxacin 3+,ciprofloxacin 3+,genta 3+,Amoxicillin 3+,Norbactin 3+
204197	SENSITIVITY	LEVOFLOXACIN 3+ NORBACTIN 3+,GENTAMYCIN 3+,CIPROFLOXACIN 3+
204197	RESISTANCE	AMPICILLIN,AMOXYCILLIN

I want to split the RESULTS column based on the following

1.) where TEST column == "SENSITIVITY" or "RESISTANCE"
2.) the cell results for the resultant columns(from the act of splitting) should be populated with the value integer at the last part of the current RESULTS column value .i.e
3+ or +++ mean Sensitivity or “S”.
2+ or ++ and 1+ or + mean Intermediate or “I”.
Resistance or “R” is on its own in the data.

Expected table output: anybody with any idea on how to code this?

LabNo	Test	Results	clindamycin	cephalexim	Cyprofloxaxin	levofloxacin	NORBACTIN	GENTAMYCIN	AMPICILLIN	AMOXYCILLIN
210514	SENSITIVITY	clindamycin 3+,cephalexim 2+,Cyprofloxaxin3+.	S	I	S
210514	RESISTANCE	Tetracyclin,erythromycin,septrin,cloxacillin,Amoxy clav	R	R	R
203658	SENSITIVITY	NO PATHOGEN ISOLATED AT 37 0c for 48hrs
203819	Salmonella Typhi H	Positive
203918	Salmonella Typhi H	negative
204089	SENSITIVITY	levofloxacin 3+,ciprofloxacin 3+,genta 3+,Amoxicillin 3+,Norbactin 3+			S	S	S	S		S
204197	SENSITIVITY	LEVOFLOXACIN 3+ NORBACTIN 3+,GENTAMYCIN 3+,CIPROFLOXACIN 3+			S	S	S	S
204197	RESISTANCE	AMPICILLIN,AMOXYCILLIN							R	R

Last edited by Shadrack Muema; 12 Sep 2020, 06:19.

Comment

Clyde Schechter

Join Date: Apr 2014

Posts: 30101
#5

12 Sep 2020, 11:51

This is typical of data provided by clinical laboratories. The technical term for this is "hot mess."

I could write code that would deal with the example you showed. It would be fairly lengthy code to deal with all of the irregularities in the data. But I can almost guarantee that code will not work in your real data set. I say that because there are probably many other irregularities in the rest of the data, and it is impossible to anticipate what all of them would be. We could go back and forth as you stumble over all of them and end up with a thread that has 50+ posts in it.

So, instead, I'm going to just outline the approach that needs to be taken.

First, you need to split the data set into two pieces. One piece consists of the observations that contain results that will be split (like clindamycin 3+, cephalexim 2+, Cyprofloxaxin3+), and the second will contain results that will not (like "NO PATHOGEN ISOLATED AT 37 0c for 48 hrs" or "Positive").

In both of those you need to eliminate variation in upper vs lower case. Make everything upper case or make everything lower case. It doesn't matter which, but you need it to be consistently one or the other.

Next, correct all the spelling errors. There is no drug cephalexim, it's cephalexin. There is no septrin, it's septra. (These are just two--there are others.)

Standardize full names vs abbreviations: you don't want some observations with genta and others with gentamicin. Pick one and replace all the others.

Fix irregularities in punctuation. For smooth code, it is important that there always be commas between the drugs. So you need to put a comma after LEVOFLOXACIN 3+ in the 7th observation. Remove periods at the end of entries. Also there must always be a space between the end of the drug name and the number. So, fully corrected, Cyprofloxacixin3+ has to be ciprofloxacin 3+.

The code will not tolerate drug names with blanks. So change Amoxy clav to Amoxy_clav (or, better still, to amoxicillin_clavulanate.)

Those are just the types of irregularities I can see in the example. I'm sure there will be many more in the full data set. I've dealt with this kind of thing often in my own work, and I sympathize with your plight. This will be a long tedious slog. There is no way to automate this kind of process in Stata. This type of data cleaning requires at least AI if not human level intelligence.

Once you have done all of that cleaning you will have a data set that looks more like this:

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input long labno str18 test str74 results 210514 "SENSITIVITY" "clindamycin 3+,cephalexin 2+,ciprofloxacin 3+" 210514 "RESISTANCE" "tetracycline,erythromycin,septra,cloxacillin,amoxicillin_clavulanate" 203658 "SENSITIVITY" "no pathogen isolated at 37 0c for 48hrs" 203819 "Salmonella Typhi H" "positive" 203918 "Salmonella Typhi H" "negative" 204089 "SENSITIVITY" "levofloxacin 3+,ciprofloxacin 3+,gentamicin 3+,amoxicillin 3+,norbactin 3+" 204197 "SENSITIVITY" "levofloxacin 3+, norbactin 3+,gentamicin 3+,ciprofloxacin 3+" 204197 "RESISTANCE" "ampicillin,amoxicillin" end

and then you can transform it to your desired results with this:

Code:

gen long obs_no = _n tempfile leave_alone preserve keep if strpos(results, ",") == 0 save `leave_alone', replace restore drop if strpos(result, ",") == 0 split results, parse(,) gen(drug_result) reshape long drug_result, i(obs_no) drop if missing(drug_result) split drug_result, parse(" ") gen(dr) replace dr2 = "S" if inlist(dr2, "+++", "3+") replace dr2 = "I" if inlist(dr2, "++", "2+", "+", "1+") replace dr2 = "R" if test == "RESISTANCE" drop drug_result _j reshape wide dr2, i(obs_no) j(dr1) string rename dr2* * order labno test results, after(obs_no) append using `leave_alone' sort obs_no

In the future, when showing data examples, please use the -dataex- command to do so. If you are running version 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.
1 like
Comment
Shadrack Muema

Join Date: Aug 2020

Posts: 5
#6

26 Sep 2020, 02:18

Originally posted by Clyde Schechter View Post

This is typical of data provided by clinical laboratories. The technical term for this is "hot mess."

I could write code that would deal with the example you showed. It would be fairly lengthy code to deal with all of the irregularities in the data. But I can almost guarantee that code will not work in your real data set. I say that because there are probably many other irregularities in the rest of the data, and it is impossible to anticipate what all of them would be. We could go back and forth as you stumble over all of them and end up with a thread that has 50+ posts in it.

So, instead, I'm going to just outline the approach that needs to be taken.

First, you need to split the data set into two pieces. One piece consists of the observations that contain results that will be split (like clindamycin 3+, cephalexim 2+, Cyprofloxaxin3+), and the second will contain results that will not (like "NO PATHOGEN ISOLATED AT 37 0c for 48 hrs" or "Positive").

In both of those you need to eliminate variation in upper vs lower case. Make everything upper case or make everything lower case. It doesn't matter which, but you need it to be consistently one or the other.

Next, correct all the spelling errors. There is no drug cephalexim, it's cephalexin. There is no septrin, it's septra. (These are just two--there are others.)

Standardize full names vs abbreviations: you don't want some observations with genta and others with gentamicin. Pick one and replace all the others.

Fix irregularities in punctuation. For smooth code, it is important that there always be commas between the drugs. So you need to put a comma after LEVOFLOXACIN 3+ in the 7th observation. Remove periods at the end of entries. Also there must always be a space between the end of the drug name and the number. So, fully corrected, Cyprofloxacixin3+ has to be ciprofloxacin 3+.

The code will not tolerate drug names with blanks. So change Amoxy clav to Amoxy_clav (or, better still, to amoxicillin_clavulanate.)

Those are just the types of irregularities I can see in the example. I'm sure there will be many more in the full data set. I've dealt with this kind of thing often in my own work, and I sympathize with your plight. This will be a long tedious slog. There is no way to automate this kind of process in Stata. This type of data cleaning requires at least AI if not human level intelligence.

Once you have done all of that cleaning you will have a data set that looks more like this:

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input long labno str18 test str74 results 210514 "SENSITIVITY" "clindamycin 3+,cephalexin 2+,ciprofloxacin 3+" 210514 "RESISTANCE" "tetracycline,erythromycin,septra,cloxacillin,amoxicillin_clavulanate" 203658 "SENSITIVITY" "no pathogen isolated at 37 0c for 48hrs" 203819 "Salmonella Typhi H" "positive" 203918 "Salmonella Typhi H" "negative" 204089 "SENSITIVITY" "levofloxacin 3+,ciprofloxacin 3+,gentamicin 3+,amoxicillin 3+,norbactin 3+" 204197 "SENSITIVITY" "levofloxacin 3+, norbactin 3+,gentamicin 3+,ciprofloxacin 3+" 204197 "RESISTANCE" "ampicillin,amoxicillin" end

and then you can transform it to your desired results with this:

Code:

gen long obs_no = _n tempfile leave_alone preserve keep if strpos(results, ",") == 0 save `leave_alone', replace restore drop if strpos(result, ",") == 0 split results, parse(,) gen(drug_result) reshape long drug_result, i(obs_no) drop if missing(drug_result) split drug_result, parse(" ") gen(dr) replace dr2 = "S" if inlist(dr2, "+++", "3+") replace dr2 = "I" if inlist(dr2, "++", "2+", "+", "1+") replace dr2 = "R" if test == "RESISTANCE" drop drug_result _j reshape wide dr2, i(obs_no) j(dr1) string rename dr2* * order labno test results, after(obs_no) append using `leave_alone' sort obs_no

In the future, when showing data examples, please use the -dataex- command to do so. If you are running version 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

Thanks so much Clyde, much appreciated.
shall try it out.
Comment

COUNTRY	NAME	BUDGET	ENJOYED THE TRIP
Argentina	Joe	$5	YES
Spain	Samantha	$10	YES
Mexico, Guatemala	Tony	$10	NO
Nicaragua, Haiti, South Africa	Lisa	$5	YES

COUNTRY	NAME	BUDGET	ENJOYED THE TRIP
Argentina	Joe	$5	YES
Spain	Samantha	$10	YES
Mexico	Tony	$10	NO
Guatemala	Tony	$10	NO
Nicaragua	Lisa	$5	YES
Haiti	Lisa	$5	YES
South Africa	Lisa	$5	YES

Announcement