Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Missing entries when importing excel into stata

    Hi. My code below creates one single dataset by appending sheets from four excel workbooks where each has 31 sheets. When I check the number of entries (tab hospital below) in the final dataset, I notice some anomalies. The KAR workbook has 29,117 entries, but the final dataset has only 29,099 observations. How can I:

    1. identify what is happening here, i.e., why are some excel cells not carrying over as observations in stata?
    2. which excel cells are not carrying over as observations in stata?
    3. for the one observation that has no hospital listed against it, is there anyway for me to identify in which excel sheet this observation exists?

    Any help will be very much appreciated


    Code:
    
    tab hospital
    
       Hospital |      Freq.     Percent        Cum.
    ------------+-----------------------------------
                |          1        0.00        0.00
            BBI |      4,436        9.47        9.47
           GMRV |      4,282        9.14       18.61
            KAR |     29,099       62.11       80.72
            KVC |      9,035       19.28      100.00
    ------------+-----------------------------------
          Total |     46,853      100.00
    
    
    _____________code
    
    
    clear 
    tempfile building 
    save `building', emptyok  
    
     foreach f in "KAR" "GMRV" "MTC" "KVC" {
    
     
    
        import excel using "${db}\raw data\OC_Jan22_Jul24_`f'.xlsx", describe 
        local n_sheets `r(N_worksheet)'
        *return list local n_sheets `r(N_worksheet)' 
        forvalues j = 1/`n_sheets' { 
        local sheet`j' `r(worksheet_`j')' 
        }    
    
        forvalues j = 1/`n_sheets' { 
        import excel using "${db}\raw data\OC_Jan22_Jul24_`f'.xlsx", sheet(`"`sheet`j''"') firstrow case(lower) clear 
        gen sheet = `j' 
        append using `building' 
        save `"`building'"', replace 
        } 
    
        sort sheet
        *drop DoctorWorkup Workup Dilation PostDilation ReadyforConsultant
        *drop if mrno=="KAR"
        
        *generating an age and sex var from the combined var
        rename age Age
        split Age, gen(age) parse("/")
        replace age1 = "1" if ustrregexm(age1, "[0-9]{1,2} Mo.")
        destring age1, replace
        rename age1 age
        rename age2 sex
        *drop Age
    
    
        *splitting combined date-time var
        gen double appt_dttm = clock(appttime, "DMYhm")
        *assert missing(appt_dttm) == missing(appttime)
        format appt_dttm %tc
        gen appt_date = dofc(appt_dttm)
        format appt_date %td
        gen double appt_time_of_day = appt_dttm - cofd(appt_date)
        format appt_time_of_day %tcHH:MM
    
        *month and year from date
        gen year = year(appt_date)
        gen month = month(appt_date)
    
    
        *replace missing values with .
        rename district District
        rename state State 
        encode District, gen(district)
        encode State, gen(state)
        drop District State appt_dttm
    
        save "${db}\clean data\emr_`f'.dta", replace
        clear
     }
    
    
    
    
    ___________sample data
    
    input str4 hospital str16 mrno str55 location str4 department str9 age str1 sex float appt_date double appt_time_of_day float(year month) long(district state)
    "KAR"  "P1331797"     ""                "OCU" "22" "F" 22660 28800000 2022 1 366 21
    "KAR"  "P983268"      ""                "UVA" "44" "M" 22669 29700000 2022 1 459 32
    "KAR"  "P1354760"     ""                "PED" "6"  "M" 22672 35400000 2022 1 235 29
    "KAR"  "P1363931"     ""                "COR" "69" "M" 22648 47700000 2022 1 459 32
    "BBI"  "MTC-P324302"  "Madanpur"        "PED" "59" "F" 22665 52500000 2022 1 395 26
    "BBI"  "MTC-P306148"  "Ghantua"         "IVR" "3"  "F" 22670 43200000 2022 1  44 26
    "KVC"  "KVC-P117086"  "Modekurru"       "RET" "49" "F" 22662 35100000 2022 1 170  2
    "KVC"  "CHP-NP29556"  "Kunchanapalle"   "OCU" "58" "F" 22646 37800000 2022 1 206  2
    "KVC"  "KVC-P203284"  "Guntur"          "OCU" "45" "M" 22667 41400000 2022 1 206  2
    "GMRV" "GMRV-N44504"  "Ravikamatham"    "COR" "58" "F" 22674 36000000 2022 1 570  2
    "KVC"  "KVC-NP67821"  "Mudinepalli"     "RET" "68" "M" 22655 31500000 2022 1 321  2
    "BBI"  "MTC-N57088"   "Alatara"         "RET" "53" "M" 22669 33300000 2022 1 271 26
    "KAR"  "KVC-P202085"  ""                "COR" "52" "F" 22650 53100000 2022 1 321  2
    "KAR"  "N440996"      ""                "GLA" "9"  "M" 22657 29700000 2022 1 509 21
    "KAR"  "P1007960"     ""                "GLA" "75" "M" 22672 38700000 2022 1 321  2
    "KAR"  "P1294321"     ""                "GLA" "77" "M" 22657 34200000 2022 1 573 32
    "KAR"  "P1389977"     ""                "COR" "48" "F" 22649 38100000 2022 1   .  .
    "KAR"  "N336487"      ""                "RET" "23" "M" 22646 36900000 2022 1 285 32
    "KVC"  "KVC-P202297"  "Mangalagiri"     "COR" "60" "M" 22670 43200000 2022 1 206  2
    "KVC"  "KVC-NP67609"  "Krishnalanka"    "OCU" "54" "F" 22660 39600000 2022 1 321  2
    "KAR"  "PN1376368"    ""                "COR" "21" "F" 22656 34200000 2022 1 573 32
    "GMRV" "GMRV-P39106"  "Rajahmundry"     "PED" "21" "M" 22662 45600000 2022 1 170  2
    "BBI"  "MTC-P264925"  "Raipur"          "COR" "62" "F" 22662 55800000 2022 1  78 26
    "KAR"  "P1392029"     ""                "COR" "57" "M" 22666 35100000 2022 1 127 36
    "KAR"  "P1036367"     ""                "GLA" "52" "F" 22658 34200000 2022 1 170  2
    "KAR"  "P1305697"     ""                "PED" "8"  "F" 22649 55200000 2022 1 321  2
    "KAR"  "P1391661"     ""                "RET" "45" "F" 22655 41400000 2022 1   .  .
    "KAR"  "P1364004"     ""                "OCU" "16" "F" 22646 31500000 2022 1   .  .
    "KVC"  "KVC-N68172"   "rachakonda"      "COR" "51" "M" 22673 32100000 2022 1 436  2
    "KAR"  "P1272562"     ""                "GLA" "57" "F" 22650 33300000 2022 1 559 36
    "KVC"  "KND-P27626"   "Kanakanampadu"   "OCU" "59" "M" 22663 48600000 2022 1 436  2
    "KVC"  "KND-P14774"   "Gangapalem"      "RET" "67" "M" 22653 35100000 2022 1 436  2
    "KAR"  "N357867"      ""                "COR" "53" "M" 22665 51000000 2022 1 383 32
    "KAR"  "P1389916"     ""                "RET" "19" "M" 22663 35100000 2022 1  38 18
    "KAR"  "P1050999"     ""                "GLA" "80" "M" 22676 33300000 2022 1 226 32
    "KAR"  "P1227482"     ""                "COR" "50" "M" 22663 32400000 2022 1 231 20
    "BBI"  "MTC-P289397"  "Mancheswar"      "IVR" "3"  "M" 22674 42300000 2022 1 302 26
    "BBI"  "MTC-P323939"  "Kolkata"         "RET" "49" "F" 22651 42900000 2022 1   .  .
    "KAR"  "P1278372"     ""                "PED" "30" "M" 22672 33300000 2022 1 383 32
    "BBI"  "MTC-P318686"  "Kaligadu"        "OCU" "8"  "M" 22651 36000000 2022 1 276 26
    "KAR"  "P1392119"     ""                "COR" "71" "F" 22657 52200000 2022 1 226 32

  • #2
    Presuming I'm understanding you correctly, i.e. that the 29,117 represents what you think the total of the number valid rows is across all the sheets within the "KAR" file: It appears that simply tabulating the "sheet" variable for each hospital would be a helpful start.
    Code:
    tab hospital sheet
    You could then check by eye the "N" you obtain for each sheet in Stata to what is present in the relevant Excel file.

    Note what I presume is an error in your code. I don't think you intended this line to be commented out.
    Code:
    *return list local n_sheets `r(N_worksheet)'

    As for: "3. for the one observation that has no hospital listed against it, is there anyway for me to identify in which excel sheet this observation exists?"
    I don't understand what you mean here, so I can't help.

    Comment


    • #3
      1. identify what is happening here, i.e., why are some excel cells not carrying over as observations in stata?
      I'd guess you have rows in your excel file that are completely empty. Stata may ignore rows that are completely empty.

      2. which excel cells are not carrying over as observations in stata?
      You could give each row a unique identifier in excel. There are many ways to do this, but I would probably just use an integer that counts up the number of rows. Then after you load into Stata, try to figure out which rows are missing. Be sure to make sure the row numbers are unique across sheets.

      3. for the one observation that has no hospital listed against it, is there anyway for me to identify in which excel sheet this observation exists?
      In your second loop that goes through each sheet, include the following code:

      Code:
      forvalues j = 1/`n_sheets' {
          import excel using "${db}\raw data\OC_Jan22_Jul24_`f'.xlsx", sheet(`"`sheet`j''"') firstrow case(lower) clear
          qui count if missing(hospital)
          if r(N) {
              display "`=r(N)' observations with missing hospital detected in sheet `j'"
          }
          [...]
      }
      Edit: Crossed with #2

      Comment


      • #4
        tab hospital sheet from #2 is a really nice way to do this.

        Comment


        • #5
          Mike Lacy and Daniel Schaefer Thank you. This was really helpful. For queries 1 and 2, I used Daniel's suggestion in #3 of giving each row a unique identifier in excel.
          Code:
          tab hospital sheet
          unfortunately didn't work as the number of observations are too many.

          Daniel Schaefer I added the code in #3 to identify the observation with the missing hospital. Strangely, it doesn't throw up anything. Is there any other way to do this?

          Comment


          • #6
            Hi Scott,

            The code I gave you assumes the value is missing or the empty string, but it may also be a whitespace character (like a space or tab). Try this.

            Code:
            qui count if !inlist(hospital, "BBI", "GMRV", "KAR", "KVC")
            if r(N) {
                display "`=r(N)' observations with missing hospital detected in sheet `j'"
            }
            Edit: fixed hospital codes.
            Last edited by Daniel Schaefer; 06 Aug 2024, 10:01.

            Comment


            • #7
              Daniel Schaefer Thank you. That worked perfectly!

              Comment

              Working...
              X