Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Help!! Data cleaning, creating new variables with observations based on complicated conditions

    Please help me with cleaning this data:


    below is my data from excel file and I have 1000s of rows more like this.
    • patientid and patientid2 are same patient id number. (a patient has many measurements of iron & calcium)
    • startdate is the date we started treatment-baseline date
    • testdate is the laboratory test date for iron or calcium
    • value is the measured value of calcium or iron concentration
    Now I need concentration values at baseline for each patient for calcium and iron which is the closest date after treatment or the same date as treatment started.
    & also calcium_final iron_final as the final measurements which is the latest tested measurement
    calcium_final_date and iron_final_date are basically the latest test dates.

    what could be the best possible approach to solve this mess?


    patientid startdate patientid2 testdate parameter value calcium_base calcium_final calcium_final_date iron_base iron_final iron_final_date
    1122 10/07/2016 1122 9/26/2018 iron 227
    1122 11/08/2017 iron 194
    1122 9/29/2017 iron 4520
    1122 12/29/2016 iron 416
    1122 9/30/2016 iron 3238
    1122 11/21/2017 calcium 4
    1122 9/28/2017 calcium 11
    1122 01/09/2017 calcium 22
    1122 9/29/2016 calcium 33
    1127 10/20/2016 1127 10/10/2018 iron 25
    1127 10/23/2017 iron 121
    1127 01/09/2017 iron 94
    1127 10/07/2016 iron 152
    1127 10/09/2017 calcium 24
    1127 01/09/2017 calcium 22
    1127 10/11/2016 calcium 222
    1199 4/17/2018 1199 3/28/2018 calcium 5000
    1199 1/22/2018 iron 669
    1289 08/11/2017 1289 6/27/2018 iron 219
    1289 01/08/2018 iron 221
    1289 9/21/2017 calcium 6000
    1289 6/27/2018 calcium 5
    1450 11/13/2015 1450 11/08/2021 iron 509
    1450 5/17/2021 iron 30
    1450 5/29/2020 iron 117
    1450 01/08/2020 calcium 3457
    1450 07/02/2019 calcium 502

  • #2
    copying and pasting the data above in the excel file should help to explore the possibilities.

    Comment


    • #3

      There is no point in writing Stata code for data that does not exist in Stata. If you want to do this in Excel (which I strongly recommend against unless this is just for amusement) then you should seek help from an Excel forum.

      If you want to do this in Stata, first import the data to Stata. Then use the -dataex- command to show an example of your Stata data. If you are running version 17, 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

      If you do that, I'm confident you will get a timely and helpful response.

      Comment


      • #4
        Thank you Clyde. I am new to stata and didn't know about this dataex command ( really so useful) . below is the data in stata


        Code:
        * Example generated by -dataex-. For more info, type help dataex
        clear
        input int patientid float startdate int patientid2 float testdate str7 parameter int value byte(calcium_base calcium_final calcium_final_date iron_base iron_final iron_final_date)
        1122 20645 1122 21453 "iron"     227 . . . . . .
           .     . 1122 21042 "iron"     194 . . . . . .
           .     . 1122 21091 "iron"    4520 . . . . . .
           .     . 1122 20817 "iron"     416 . . . . . .
           .     . 1122 20727 "iron"    3238 . . . . . .
           .     . 1122 21144 "calcium"    4 . . . . . .
           .     . 1122 21090 "calcium"   11 . . . . . .
           .     . 1122 21063 "calcium"   22 . . . . . .
           .     . 1122 20726 "calcium"   33 . . . . . .
        1127 20747 1127 21467 "iron"      25 . . . . . .
           .     . 1127 21115 "iron"     121 . . . . . .
           .     . 1127 21063 "iron"      94 . . . . . .
           .     . 1127 20645 "iron"     152 . . . . . .
           .     . 1127 21072 "calcium"   24 . . . . . .
           .     . 1127 21063 "calcium"   22 . . . . . .
           .     . 1127 20768 "calcium"  222 . . . . . .
        1199 21291 1199 21271 "calcium" 5000 . . . . . .
           .     . 1199 21206 "iron"     669 . . . . . .
        1289 21131 1289 21362 "iron"     219 . . . . . .
           .     . 1289 21397 "iron"     221 . . . . . .
           .     . 1289 21083 "calcium" 6000 . . . . . .
           .     . 1289 21362 "calcium"    5 . . . . . .
        1450 20405 1450 22503 "iron"     509 . . . . . .
           .     . 1450 22417 "iron"      30 . . . . . .
           .     . 1450 22064 "iron"     117 . . . . . .
           .     . 1450 22128 "calcium" 3457 . . . . . .
           .     . 1450 21587 "calcium"  502 . . . . . .
        end
        format %td startdate
        format %td testdate
        Last edited by Nishan Lamichhane; 14 Mar 2022, 16:54.

        Comment


        • #5
          copying and pasting the data above in the excel file should help to explore the possibilities.
          Not really, because it is possible that anomalies in the "1000s of rows more like this" that we don't see will cause Stata to import your full spreadsheet differently than it does these few rows. So then someone develops code that works on these 27 observations but it then fails when you apply it to the full dataset. And then someone has to figure out what needs to be done to prepare the full dataset to meet the needs of this analysis.

          One step at a time. The first step is to create the Stata data you need to work with. Then and only then do you invest the time in solving your calculation problem.

          Added in edit: This post crossed with your response in post #4. I note that your dates were imported as string variables rather than as Stata dates, as was the case when I imported your 27 observations in post #1. This suggests there are cells in the date columns of your spreadsheet that are neither empty nor have Excel dates in them. We don't know what specific import excel command you used to import your data, but one common cause of this sort of problem is that you have left it to Stata to infer the limits of the data on your Excel spreadsheet, and it has picked up some extra rows at the end. You should use Stata's Data Editor Browse window to look at your Stata data and scroll to the bottom and see if there are extra rows, and if so, rerun the import excel using the cellrange() option to select only those rows and columns that contain your data.

          If my guess isn't right, then you have some junk in your Excel date columns that might be easier for you to sort out in Excel rather than in Stata. Stata does a nice job of importing and formatting the dates in the 27 example observations, it would be a shame to have to waste the time to reproduce that by starting with string variables.
          Last edited by William Lisowski; 14 Mar 2022, 16:36.

          Comment


          • #6
            yes, I got your point William. The command is basically for each unique patientid2 we need a value with parameter calcium that has the difference difference between startdate and testdate zero or smallest

            Comment


            • #7
              I edited post #5 after post #6 went up, just posting this to be sure my edits are noticed.

              Comment


              • #8
                Crossed with #5-#7. The following code takes your -dataex- as being correctly imported. Wiliam Lisowski is right that there is something suspicious about the testdate variable being imported as a string. Be that as it may, the code below converts it to a Stata internal format date variable. If a better import of the data makes that unnecessary, you can just remove that part of the code.

                Code:
                assert patientid == patientid2 if !missing(patientid)
                drop patientid
                
                
                replace startdate = startdate[_n-1] if missing(startdate)
                
                by patientid2 testdate parameter, sort: assert _N == 1
                
                foreach v of varlist *date {
                    gen _`v' = daily(`v', "MDY")
                    assert !missing(_`v')
                    format _`v' %td
                    drop `v'
                    rename _`v' `v'
                }
                
                by patientid2 parameter (testdate), sort: egen base_date = ///
                    min(cond(testdate >= startdate, testdate, .))
                by patientid2 parameter (testdate): egen base_value = ///
                    max(cond(testdate == base_date, value, .))
                by patientid2 parameter (testdate): gen final_date = testdate[_N]
                by patientid2 parameter (testdate): gen final_value = value[_N]
                
                drop testdate value
                by patientid2 parameter: keep if _n == 1
                rename (base* final*) =_
                reshape wide base* final*, i(patientid2) j(parameter) string
                Notes: You have two variables patientid, and patientid2, that appear to be the same, except that patient id is missing in all but the first observation for each patient. That makes patient id superfluous, so the code verifies this assumption and then drops patientid.

                Patientid 1199 has no tests that precede the treatment start date, so baseline values are missing. You didn't say what to do about the final value if all values precede startdate, so I just let Stata pick up the last of the pre-startdate values as the "final" one.

                It is possible in real life, though perhaps not in your data, for a patient to have two different test results on the same date. In this case, both the baseline value and the final value for either analyte could be undefined. The code tests for this possibility and it will halt with an "assertion is false" error message if it finds violations. You will then have to explore those cases and fix the data.

                Comment


                • #9
                  Thank you Clyde . your solution made my work easier.

                  Comment

                  Working...
                  X