Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • How do you deal with multi-value cells in data cleaning?

    Hello! I have a question about a) data cleaning techniques/practices and b) how to implement those techniques and practices in Stata. What I am looking at is a survey that asks about countries. Some respondents have answered with one country, while others had answered with multiple, like below:

    COUNTRY NAME BUDGET ENJOYED THE TRIP
    Argentina Joe $5 YES
    Spain Samantha $10 YES
    Mexico, Guatemala Tony $10 NO
    Nicaragua, Haiti, South Africa Lisa $5 YES

    I want to encode each country to a number, so I'll need to separate them so there's only one country per cell (I think? correct me if this is a wrong assumption). I have seen some stuff online that suggests creating a new row for each country and copying the rest of the data for that row along with it; something like this:

    COUNTRY NAME BUDGET ENJOYED THE TRIP
    Argentina Joe $5 YES
    Spain Samantha $10 YES
    Mexico Tony $10 NO
    Guatemala Tony $10 NO
    Nicaragua Lisa $5 YES
    Haiti Lisa $5 YES
    South Africa Lisa $5 YES

    However, my concern with this is that it seems like it'll mess up my descriptive and summary statistics; for example, the mean for budget should be $7.50 but if I calculate the mean on the "clean" version, it gives undue weight to Lisa and Tony and makes the mean $7.14. If I go by rows, I'll get that 5/7 entries enjoyed the trip when it really should be 3/4 people. I have a lot of columns like this, where the value will be the same if I were to split up the countries in a single cell. My problem is that I want to split up the multi-value cells (since I think that's the proper way to have tidy data?) but I still want the split rows to have the same weight in my summary statistics as it did when it was one row.

    If this concern is not valid, why not and how can it be circumvented?

    Another concern I have will be getting the frequencies of countries and some list 10+ so I'm not sure if it'd be better to make more country columns instead. Is there another or better way to deal with multi-value cells in which you still want the equal weight to be held according to some identifier, like a name? And what if I have several columns with multi-value cells — do I need to create a row for each combination of values within those columns?

    Any advice/help would be very appreciated. Thank you so much!

  • #2
    This
    Code:
     split country, gen(land) parse(,)
    will put each country in new separate variable.

    Comment


    • #3
      I don't think there is a single best way to handle this situation. I think it depends on the context. (And sometimes even in the same study, you may have to do it more than one way for particular analyses.)

      From what you've described,it seems to me it would be a mistake to create new observations ("rows") when there are multiple countries, for precisely the computational reasons you outline in your post. Moreover, conceptually it would be wrong to attribute the budget for Lisa's trip to any one of the countries she visited it, let alone to all of them. There were probably some economies achieved by visiting more than one country in the same trip. (Or, maybe there were some excess expenses. The point is, that there is no clear way to allocate the budget among them.)

      So, at least for the kinds of analysis you've described, I'd leave the data just the way you show it.

      Now, there may be other purposes for which you need single country information, e.g. if you want to tally how many people visited each country. Then for that analysis, I'd do something like this:

      Code:
      preserve
      split COUNTRY, parse(",") gen(c)
      keep c* name
      reshape long c, i(name) j(_j)
      tab c
      restore
      Note: Above code presumes there is only one observation for each name. If that's not true, some other variable or combination of variables that uniquely identify observations must be used in the i() option. If none exists, create one for the purpose.

      The example data you provided was easy to read but is difficult to import into Stata. So the code is untested and may contain typos or other errors. In the future, to get code that has been tried out, show example data by using the -dataex- command. If you are running version 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

      When asking for help with code, always show example data. When showing example data, always use -dataex-.

      Added: crossed with #2

      Comment


      • #4
        Dear fellow STATA users.
        I have some unclean AntiMicrobial Resistance(AMR) data that i would like to clean . The data is in this format
        LabNo Test Results
        210514 SENSITIVITY clindamycin 3+,cephalexim 2+,Cyprofloxaxin3+.
        210514 RESISTANCE Tetracyclin,erythromycin,septrin,cloxacillin,Amoxy clav
        203658 SENSITIVITY NO PATHOGEN ISOLATED AT 37 0c for 48hrs
        203819 Salmonella Typhi H Positive
        203918 Salmonella Typhi H negative
        204089 SENSITIVITY levofloxacin 3+,ciprofloxacin 3+,genta 3+,Amoxicillin 3+,Norbactin 3+
        204197 SENSITIVITY LEVOFLOXACIN 3+ NORBACTIN 3+,GENTAMYCIN 3+,CIPROFLOXACIN 3+
        204197 RESISTANCE AMPICILLIN,AMOXYCILLIN

        I want to split the RESULTS column based on the following

        1.) where TEST column == "SENSITIVITY" or "RESISTANCE"
        2.) the cell results for the resultant columns(from the act of splitting) should be populated with the value integer at the last part of the current RESULTS column value .i.e
        3+ or +++ mean Sensitivity or “S”.
        2+ or ++ and 1+ or + mean Intermediate or “I”.
        Resistance or “R” is on its own in the data.

        Expected table output: anybody with any idea on how to code this?
        LabNo Test Results clindamycin cephalexim Cyprofloxaxin levofloxacin NORBACTIN GENTAMYCIN AMPICILLIN AMOXYCILLIN
        210514 SENSITIVITY clindamycin 3+,cephalexim 2+,Cyprofloxaxin3+. S I S
        210514 RESISTANCE Tetracyclin,erythromycin,septrin,cloxacillin,Amoxy clav R R R
        203658 SENSITIVITY NO PATHOGEN ISOLATED AT 37 0c for 48hrs
        203819 Salmonella Typhi H Positive
        203918 Salmonella Typhi H negative
        204089 SENSITIVITY levofloxacin 3+,ciprofloxacin 3+,genta 3+,Amoxicillin 3+,Norbactin 3+ S S S S S
        204197 SENSITIVITY LEVOFLOXACIN 3+ NORBACTIN 3+,GENTAMYCIN 3+,CIPROFLOXACIN 3+ S S S S
        204197 RESISTANCE AMPICILLIN,AMOXYCILLIN R R
        Last edited by Shadrack Muema; 12 Sep 2020, 06:19.

        Comment


        • #5
          This is typical of data provided by clinical laboratories. The technical term for this is "hot mess."

          I could write code that would deal with the example you showed. It would be fairly lengthy code to deal with all of the irregularities in the data. But I can almost guarantee that code will not work in your real data set. I say that because there are probably many other irregularities in the rest of the data, and it is impossible to anticipate what all of them would be. We could go back and forth as you stumble over all of them and end up with a thread that has 50+ posts in it.

          So, instead, I'm going to just outline the approach that needs to be taken.

          First, you need to split the data set into two pieces. One piece consists of the observations that contain results that will be split (like clindamycin 3+, cephalexim 2+, Cyprofloxaxin3+), and the second will contain results that will not (like "NO PATHOGEN ISOLATED AT 37 0c for 48 hrs" or "Positive").

          In both of those you need to eliminate variation in upper vs lower case. Make everything upper case or make everything lower case. It doesn't matter which, but you need it to be consistently one or the other.

          Next, correct all the spelling errors. There is no drug cephalexim, it's cephalexin. There is no septrin, it's septra. (These are just two--there are others.)

          Standardize full names vs abbreviations: you don't want some observations with genta and others with gentamicin. Pick one and replace all the others.

          Fix irregularities in punctuation. For smooth code, it is important that there always be commas between the drugs. So you need to put a comma after LEVOFLOXACIN 3+ in the 7th observation. Remove periods at the end of entries. Also there must always be a space between the end of the drug name and the number. So, fully corrected, Cyprofloxacixin3+ has to be ciprofloxacin 3+.

          The code will not tolerate drug names with blanks. So change Amoxy clav to Amoxy_clav (or, better still, to amoxicillin_clavulanate.)

          Those are just the types of irregularities I can see in the example. I'm sure there will be many more in the full data set. I've dealt with this kind of thing often in my own work, and I sympathize with your plight. This will be a long tedious slog. There is no way to automate this kind of process in Stata. This type of data cleaning requires at least AI if not human level intelligence.

          Once you have done all of that cleaning you will have a data set that looks more like this:
          Code:
          * Example generated by -dataex-. To install: ssc install dataex
          clear
          input long labno str18 test str74 results
          210514 "SENSITIVITY"        "clindamycin 3+,cephalexin 2+,ciprofloxacin 3+"                        
          210514 "RESISTANCE"         "tetracycline,erythromycin,septra,cloxacillin,amoxicillin_clavulanate"
          203658 "SENSITIVITY"        "no pathogen isolated at 37 0c for 48hrs"                              
          203819 "Salmonella Typhi H" "positive"                                                            
          203918 "Salmonella Typhi H" "negative"                                                            
          204089 "SENSITIVITY"        "levofloxacin 3+,ciprofloxacin 3+,gentamicin 3+,amoxicillin 3+,norbactin 3+"
          204197 "SENSITIVITY"        "levofloxacin 3+, norbactin 3+,gentamicin 3+,ciprofloxacin 3+"        
          204197 "RESISTANCE"         "ampicillin,amoxicillin"                                              
          end
          and then you can transform it to your desired results with this:
          Code:
          gen long obs_no = _n
          tempfile leave_alone
          preserve
          keep if strpos(results, ",") == 0
          save `leave_alone', replace
          
          restore
          drop if strpos(result, ",") == 0
          split results, parse(,) gen(drug_result)
          reshape long drug_result, i(obs_no)
          drop if missing(drug_result)
          split drug_result, parse(" ") gen(dr)
          
          replace dr2 = "S" if inlist(dr2, "+++", "3+")
          replace dr2 = "I" if inlist(dr2, "++", "2+", "+", "1+")
          replace dr2 = "R" if test == "RESISTANCE"
          
          drop drug_result _j
          reshape wide dr2, i(obs_no) j(dr1) string
          rename dr2* *
          order labno test results, after(obs_no)
          append using `leave_alone'
          sort obs_no
          In the future, when showing data examples, please use the -dataex- command to do so. If you are running version 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

          Comment


          • #6
            Originally posted by Clyde Schechter View Post
            This is typical of data provided by clinical laboratories. The technical term for this is "hot mess."

            I could write code that would deal with the example you showed. It would be fairly lengthy code to deal with all of the irregularities in the data. But I can almost guarantee that code will not work in your real data set. I say that because there are probably many other irregularities in the rest of the data, and it is impossible to anticipate what all of them would be. We could go back and forth as you stumble over all of them and end up with a thread that has 50+ posts in it.

            So, instead, I'm going to just outline the approach that needs to be taken.

            First, you need to split the data set into two pieces. One piece consists of the observations that contain results that will be split (like clindamycin 3+, cephalexim 2+, Cyprofloxaxin3+), and the second will contain results that will not (like "NO PATHOGEN ISOLATED AT 37 0c for 48 hrs" or "Positive").

            In both of those you need to eliminate variation in upper vs lower case. Make everything upper case or make everything lower case. It doesn't matter which, but you need it to be consistently one or the other.

            Next, correct all the spelling errors. There is no drug cephalexim, it's cephalexin. There is no septrin, it's septra. (These are just two--there are others.)

            Standardize full names vs abbreviations: you don't want some observations with genta and others with gentamicin. Pick one and replace all the others.

            Fix irregularities in punctuation. For smooth code, it is important that there always be commas between the drugs. So you need to put a comma after LEVOFLOXACIN 3+ in the 7th observation. Remove periods at the end of entries. Also there must always be a space between the end of the drug name and the number. So, fully corrected, Cyprofloxacixin3+ has to be ciprofloxacin 3+.

            The code will not tolerate drug names with blanks. So change Amoxy clav to Amoxy_clav (or, better still, to amoxicillin_clavulanate.)

            Those are just the types of irregularities I can see in the example. I'm sure there will be many more in the full data set. I've dealt with this kind of thing often in my own work, and I sympathize with your plight. This will be a long tedious slog. There is no way to automate this kind of process in Stata. This type of data cleaning requires at least AI if not human level intelligence.

            Once you have done all of that cleaning you will have a data set that looks more like this:
            Code:
            * Example generated by -dataex-. To install: ssc install dataex
            clear
            input long labno str18 test str74 results
            210514 "SENSITIVITY" "clindamycin 3+,cephalexin 2+,ciprofloxacin 3+"
            210514 "RESISTANCE" "tetracycline,erythromycin,septra,cloxacillin,amoxicillin_clavulanate"
            203658 "SENSITIVITY" "no pathogen isolated at 37 0c for 48hrs"
            203819 "Salmonella Typhi H" "positive"
            203918 "Salmonella Typhi H" "negative"
            204089 "SENSITIVITY" "levofloxacin 3+,ciprofloxacin 3+,gentamicin 3+,amoxicillin 3+,norbactin 3+"
            204197 "SENSITIVITY" "levofloxacin 3+, norbactin 3+,gentamicin 3+,ciprofloxacin 3+"
            204197 "RESISTANCE" "ampicillin,amoxicillin"
            end
            and then you can transform it to your desired results with this:
            Code:
            gen long obs_no = _n
            tempfile leave_alone
            preserve
            keep if strpos(results, ",") == 0
            save `leave_alone', replace
            
            restore
            drop if strpos(result, ",") == 0
            split results, parse(,) gen(drug_result)
            reshape long drug_result, i(obs_no)
            drop if missing(drug_result)
            split drug_result, parse(" ") gen(dr)
            
            replace dr2 = "S" if inlist(dr2, "+++", "3+")
            replace dr2 = "I" if inlist(dr2, "++", "2+", "+", "1+")
            replace dr2 = "R" if test == "RESISTANCE"
            
            drop drug_result _j
            reshape wide dr2, i(obs_no) j(dr1) string
            rename dr2* *
            order labno test results, after(obs_no)
            append using `leave_alone'
            sort obs_no
            In the future, when showing data examples, please use the -dataex- command to do so. If you are running version 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.
            Thanks so much Clyde, much appreciated.
            shall try it out.

            Comment

            Working...
            X