
  • Extracting multiple dates from string with varying date formats

    Hello all: I am having trouble extracting multiple dates per observation when the formats vary, e.g. MM/DD/YYYY or M/DD/YY (basically, leading zeros for month and day are sometimes missing at random). Any help would be appreciated. I did not use dataex since this is a small use case; pardon that in advance. I tried moss, installed following a post by Nick, but my data look messier and I don't know how best to tweak the code.

    var_finaldiagnosistext
    Accession1. Node, biopsy (outside slides S21-334, 2/03/21): Hodgkin lymphoma, see microscopy
    Accession2. A. Node, biopsy (outside slides dated SMS-20-44, dated 03/3/2020): Diffuse large B-cell lymphoma, see microscopic description. B. Bone marrow, biopsy (outside slides BMS-19-44, 01/01/19) : No evidence of lymphoma.

    Desired outcome

    date1 date2
    02/03/2021
    03/03/2020 01/01/2019
    Last edited by Girish Venkataraman; 08 Jan 2022, 12:47.

  • #2
    Increase the number of iterations in the loop to match the maximum number of dates to be extracted.

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str225 var_finaldiagnosistext
    "Accession1. Node, biopsy (outside slides S21-334, 2/03/21): Hodgkin lymphoma, see microscopy"                                                                                                                                    
    "Accession2. A. Node, biopsy (outside slides dated SMS-20-44, dated 03/3/2020): Diffuse large B-cell lymphoma, see microscopic description. B. Bone marrow, biopsy (outside slides BMS-19-44, 01/01/19) : No evidence of lymphoma."
    end
    
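    * work on a copy so the original text is left untouched
    gen text=var_finaldiagnosistext
    forval i=1/3{
        * the leading .* is greedy, so each pass captures the LAST remaining date
        gen wanted`i'=daily(ustrregexra(text,".*\b([\d]+/[\d]+/[\d]+)\b.*", "$1"), "MDY", 2030)
        format wanted`i' %td
        * remove the date just captured so the next pass finds another one
        replace text= subinstr(text, ustrregexra(text,".*\b([\d]+/[\d]+/[\d]+)\b.*", "$1"), "", 1)
    }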
    Result:

    Code:
    . l wanted*
    
         +---------------------------------+
         |   wanted1     wanted2   wanted3 |
         |---------------------------------|
      1. | 03feb2021           .         . |
      2. | 01jan2019   03mar2020         . |
         +---------------------------------+



    • #3
      Thanks so much, Andrew. I tried it in Stata 11 and it gives me the error below. Is there some module I need to install, since this is an old version? I guess this function is available only from Stata 14 onward. I will install Stata 17 and try it there shortly.

      unknown function ustrregexra()
      r(133);
      Last edited by Girish Venkataraman; 08 Jan 2022, 14:28.



      • #4
        The Unicode regular expression functions were added in Stata 14, so you need that version or above.
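
        If upgrading is not an option, a rough sketch along the following lines might work in older versions, assuming the legacy regexm() and regexs() functions are available there. It is an adaptation of the loop in #2, not something tested on Stata 11. The helper variables d1-d3 and the width str20 are illustrative choices, and date() stands in for daily(). The legacy functions do not understand \d or \b, so the pattern uses [0-9] and no word boundaries.

        ```stata
        * sketch only: assumes var_finaldiagnosistext from #2 is in memory
        gen text = var_finaldiagnosistext
        forvalues i = 1/3 {
            * leftmost date-like token still present in text (hypothetical helper d`i')
            gen str20 d`i' = regexs(1) if regexm(text, "([0-9]+/[0-9]+/[0-9]+)")
            * two-digit years are resolved to 1931-2030, as in #2
            gen wanted`i' = date(d`i', "MDY", 2030)
            format wanted`i' %td
            * remove the token just captured so the next pass finds the following one
            replace text = subinstr(text, d`i', "", 1) if d`i' != ""
        }
        drop d1-d3 text
        ```

        Because regexm() matches from the left, this version should return the dates in order of appearance, whereas the greedy pattern in #2 returns the last date first. As in #2, increase the upper limit of the loop if an observation can hold more than three dates.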
