Hello! I have a question about a) data cleaning techniques/practices and b) how to implement those techniques and practices in Stata. What I am looking at is a survey that asks about countries. Some respondents have answered with one country, while others had answered with multiple, like below:
I want to encode each country to a number, so I'll need to separate them so there's only one country per cell (I think? correct me if this is a wrong assumption). I have seen some stuff online that suggests creating a new row for each country and copying the rest of the data for that row along with it; something like this:
However, my concern with this is that it seems like it'll mess up my descriptive and summary statistics; for example, the mean for budget should be $7.50 but if I calculate the mean on the "clean" version, it gives undue weight to Lisa and Tony and makes the mean $7.14. If I go by rows, I'll get that 5/7 entries enjoyed the trip when it really should be 3/4 people. I have a lot of columns like this, where the value will be the same if I were to split up the countries in a single cell. My problem is that I want to split up the multi-value cells (since I think that's the proper way to have tidy data?) but I still want the split rows to have the same weight in my summary statistics as it did when it was one row.
If this concern is not valid, why not and how can it be circumvented?
Another concern I have will be getting the frequencies of countries and some list 10+ so I'm not sure if it'd be better to make more country columns instead. Is there another or better way to deal with multi-value cells in which you still want the equal weight to be held according to some identifier, like a name? And what if I have several columns with multi-value cells — do I need to create a row for each combination of values within those columns?
Any advice/help would be very appreciated. Thank you so much!
COUNTRY | NAME | BUDGET | ENJOYED THE TRIP |
Argentina | Joe | $5 | YES |
Spain | Samantha | $10 | YES |
Mexico, Guatemala | Tony | $10 | NO |
Nicaragua, Haiti, South Africa | Lisa | $5 | YES |
I want to encode each country to a number, so I'll need to separate them so there's only one country per cell (I think? correct me if this is a wrong assumption). I have seen some stuff online that suggests creating a new row for each country and copying the rest of the data for that row along with it; something like this:
COUNTRY | NAME | BUDGET | ENJOYED THE TRIP |
Argentina | Joe | $5 | YES |
Spain | Samantha | $10 | YES |
Mexico | Tony | $10 | NO |
Guatemala | Tony | $10 | NO |
Nicaragua | Lisa | $5 | YES |
Haiti | Lisa | $5 | YES |
South Africa | Lisa | $5 | YES |
However, my concern with this is that it seems like it'll mess up my descriptive and summary statistics; for example, the mean for budget should be $7.50 but if I calculate the mean on the "clean" version, it gives undue weight to Lisa and Tony and makes the mean $7.14. If I go by rows, I'll get that 5/7 entries enjoyed the trip when it really should be 3/4 people. I have a lot of columns like this, where the value will be the same if I were to split up the countries in a single cell. My problem is that I want to split up the multi-value cells (since I think that's the proper way to have tidy data?) but I still want the split rows to have the same weight in my summary statistics as it did when it was one row.
If this concern is not valid, why not and how can it be circumvented?
Another concern I have will be getting the frequencies of countries and some list 10+ so I'm not sure if it'd be better to make more country columns instead. Is there another or better way to deal with multi-value cells in which you still want the equal weight to be held according to some identifier, like a name? And what if I have several columns with multi-value cells — do I need to create a row for each combination of values within those columns?
Any advice/help would be very appreciated. Thank you so much!
Comment