
  • Import text files: Delimiter and Transformation Issues

    Dear Statalisters,

    I am trying to import a text file that lacks a consistent delimiter. The file records financial transactions together with their times of arrival and processing. The format (see the shortened and anonymized sample below) looks tab-delimited, but the letters and numbers are actually separated by (multiple) spaces. Moreover, the file contains a lot of "noise" that I am not interested in. My initial idea for preparing an analysis-ready dataset was to import all the information into one column (a single variable) and then manually split that column into variables and observations. Each new observation is marked by its reference ("REFERENZ-NUMMER"), and the other variables of interest are each preceded by "ERFASSUNG/BEARBEITUNG". The challenge is then to disentangle the observations and variables from that single column.

    I now wonder what the optimal approach to cleaning this messy text file would be. Is there a smarter approach than using
    Code:
    import delimited "filename", delimiter("", collapse)
    and then manually retrieving the information I need from a single column?

    I've also tried to prepare the file in Excel, unfortunately without any improvement. Any comments or guidance are appreciated!

    Regards,
    Julian
    Attached Files

  • #2
    Like many people here, I don't like to open attachments, even ones named ".txt". I believe more people would be willing to take a look at your problem if you posted your text material within your message, between code delimiters.

    Beyond that, some suggestions:
    1) You say the file "seems to be tab-delimited [but actually spaces]." What makes you think that? -charlist- (available from SSC) or -hexdump- might be relevant if there is whitespace that may or may not be tabs or spaces, and they might bear on other issues as well (see the sketch after this list).
    2) -filefilter- might offer a way to regularize the delimiters.
    3) -fileread- and -filewrite-, along with some search/replace work, might help with cleaning up the file.
    4) Knowing something about the line structure of the file would matter.
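
    For instance, a minimal sketch of 1) and 2), assuming your file is named sample.txt: -hexdump, analyze- tabulates the characters actually present in the file (so you can see whether the whitespace is tabs, spaces, or both), and -filefilter- can then swap a confirmed delimiter for something easier to split on:

    Code:
    * Sketch (filename assumed): inspect the file's characters; then, if tabs
    * turn out to be present, convert them to a single visible delimiter.
    hexdump "sample.txt", analyze
    filefilter "sample.txt" "sample_clean.txt", from("\t") to(";") replace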



    • #3
      Thank you, Mike!

      Here is the text material:

      Code:
      REFERENZ-NUMMER: 191025022BB110025     VSS-VORGANGSTYP: MT100ES  
      Freig. Datum:          29.10.2019  
      Betrag Soll:                 0,00  
      Betrag Haben:                0,00  
      Sender BIC:           HTLxxxxF  
      Empfänger BIC:  
      Empfänger BLZ:           00000000  
      Kto-Nr. Soll:          9400446005  
      Kto-Nr. Haben:         9999999999  
      Fremde Referenz: 00103C89491  
      MT:                         MT103  
      
      
      ----> AUFTRAGSUEBERSICHT <----  
      STATUSINFORMATIONEN  
      AUFTRAGSDATEN  
      MT103-EINGANG  
      MT103-AUSGANG  
      Kennz. KOMPC = 0  
      EMBARGOPRUEFUNG  
      
      
      ----> STATUSINFO <----  
          
      MT103 VON SWIFT ERHALTEN                           25.10.2019 09:53  
      ERFASSUNG/BEARBEITUNG    00000000 MASCHINELL       25.10.2019 09:53  
      ERFASSUNG/BEARBEITUNG    S022K480 VVVN             25.10.2019 10:11  
      ERFASSUNG/BEARBEITUNG    S022K414 KKKKKKSSSSHHHHH  29.10.2019 15:43  
      FREIGABE ERF./BEARB.     S022K558 HUGHA            29.10.2019 15:59  
      MT103 WEITERZULEITEN  
      KURSERMITTLUNG ERFORDERLICH  
      AKTUALISIERUNG DER STATISTIK ERFORDERLICH  
      UEBERGABE AN BUCHUNGSSYSTEM ERFORDERLICH  
      ********* DER VORGANG IST GELOESCHT *********  
      
      
      
      REFERENZ-NUMMER: 1910255665500002     VSS-VORGANGSTYP: MT100ES  
      Freig. Datum:          29.10.2019  
      Betrag Soll:               644,49  
      Betrag Haben:              583,35  
      Sender BIC:           XXXXXLLLF  
      Empfänger BIC:  
      Empfänger BLZ:           00000000  
      Kto-Nr. Soll:          9036416007  
      Kto-Nr. Haben:            1174689  
      Fremde Referenz: 00101BkkC990  
      MT:                         MT103  
      
      
      ----> AUFTRAGSUEBERSICHT <----  
      STATUSINFORMATIONEN  
      AUFTRAGSDATEN  
      MT103-EINGANG  
      MT103-AUSGANG  
      Kennz. KOMPC = 0  
      EMBARGOPRUEFUNG  
      
      
      ----> STATUSINFORMATIONEN <----  
          
      MT103 VON SWIFT ERHALTEN                           28.10.2019 05:56  
      KEIN EMBARGO-VERDACHT    oder aus (im MT103-E)     28.10.2019 05:56  
      ERFASSUNG/BEARBEITUNG    00000000 MASCHINELL       28.10.2019 05:56  
      ERFASSUNG/BEARBEITUNG    S022K480 sSSSS            28.10.2019 08:55  
      ERFASSUNG/BEARBEITUNG    S022K480 XXXX             29.10.2019 08:24  
      FREIGABE ERF./BEARB.     S022XAKR KKKLLLLWWW       29.10.2019 10:58  
      KURSERMITTLUNG DURCHGEFUEHRT                      29.10.2019 13:41  
        
      ********* DER VORGANG IST BEENDET *********  
      
      
      REFERENZ-NUMMER: 1966580XXXX00008     VSS-VORGANGSTYP: MT100ES  
      Freig. Datum:          28.10.2019  
      Betrag Soll:            12.159,44  
      Betrag Haben:           12.159,44  
      Sender BIC:           TTTTTTAF  
      Empfänger BIC:  
      Empfänger BLZ:           00000000  
      Kto-Nr. Soll:          9555555500  
      Kto-Nr. Haben:             444445  
      Fremde Referenz: 00101BS5140  
      MT:                         MT103  
      
      
      ----> AUFTRAGSUEBERSICHT <----  
      STATUSINFORMATIONEN  
      AUFTRAGSDATEN  
      MT103-EINGANG  
      MT103-AUSGANG  
      Kennz. KOMPC = 0  
      EMBARGOPRUEFUNG  
      
      
      ----> STATUSINFORMATIONEN <----  
          
      MT103 VON SWIFT ERHALTEN                           28.10.2019 07:40  
      KEIN EMBARGO-VERDACHT    oder aus (im MT103-E)     28.10.2019 07:41  
      ERFASSUNG/BEARBEITUNG    00000000 MASCHINELL       28.10.2019 07:41  
      ERFASSUNG/BEARBEITUNG    S022K480 DDDD             28.10.2019 08:54  
      ********* DER VORGANG IST BEENDET *********
      1) Most numbers and words appear aligned/justified, but neither Excel nor Stata recognizes tab as the delimiter.
      4) The file is an automatic excerpt; I believe it is in line-sequential format (if I understand correctly).

      Last edited by Julian Scholz; 14 Apr 2021, 09:20.



      • #4
        In general, Stata *can* recognize tabs as delimiters; that is one of the options described in -help import delimited-.
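
        For example, a minimal illustration (the filename is assumed):

        Code:
        import delimited using "YourFile.txt", delimiters("\t")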

        Setting that aside: I'd think first about separating your data into "transactions," as I understand that to be the unit of analysis. If I can assume that every line beginning with "REFERENZ-NUMMER:" starts a new transaction, I'd begin as follows:

        Code:
        clear
        // Assume "zzzzz" never occurs so that each line is read as one string.
        import delimited using "YourFile.txt", delimiter("zzzzz", asstring)
        rename v1 s 
        // Put an ID and line number on each line that belongs to the same transaction.
        gen int ID = .
        quiet replace ID = cond(_n ==1, 1, ID[_n-1] + (strpos(s, "REFERENZ-NUMMER") > 0))
        bysort ID: gen int line = _n
        order ID line // shows the structure
        desc
        From this point, reshaping to a wide format, with one observation per transaction and each line stored in its own variable, *might* be useful, depending on how you want to break up the information within each transaction. If that's relevant, you could do:
        Code:
        reshape wide s, i(ID) j(line)
        This may be slow if you have a large file.
        It might also turn out that treating the whole transaction as one string would help, in order to process it with string functions:
        Code:
        egen wholething = concat(s*)
        drop s*
        However, the usefulness of these last two ideas would depend on how you want to define, separate, and process different parts of the transaction, which to me seems like the hardest part of your problem. Doing that will require some substantive knowledge of the content of transactions, as well as what characters might appear as delimiters.
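
        If you did treat the whole transaction as one string, here is one hypothetical extraction, sketched with Stata's Unicode regular-expression functions (the variable wholething comes from the code just above):

        Code:
        * Sketch: pull the reference number out of the full transaction string.
        gen referenz = ustrregexs(1) if ustrregexm(wholething, "REFERENZ-NUMMER:\s+(\S+)")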



        • #5
          It occurred to me that -bysort ID- is not guaranteed to leave the original order of lines within a transaction untouched, so to be safe, I'd recommend capturing that original order and including it in the -bysort-:

          Code:
          clear
          // Assume "zzzzz" never occurs so that each line is read as one string.
          import delimited using "YourFile.txt", delimiter("zzzzz", asstring)
          rename v1 s
          // Put an ID and line number on each line that belongs to the same transaction.
          gen int ID = .
          quiet replace ID = cond(_n ==1, 1, ID[_n-1] + (strpos(s, "REFERENZ-NUMMER") > 0))
          gen long origorder = _n // new
          bysort ID (origorder) : gen int line = _n  //new
          order ID line // shows the structure
          desc



          • #6
            Mike, thanks for your help! In order to "clean" the dataset, I used the following code:

            Code:
            clear
            **Assume "zzzzz" never occurs so that each line is read as one string.
            import delimited using "C:\Users\Scholz.ECFS-SERV\Desktop\DSGF\sample.txt", delimiter("zzzzz", asstring)
            rename v1 s
            // Put an ID and line number on each line that belongs to the same transaction.
            gen int ID = .
            quiet replace ID = cond(_n ==1, 1, ID[_n-1] + (strpos(s, "REFERENZ-NUMMER") > 0))
            gen long origorder = _n // new
            bysort ID (origorder) : gen int line = _n //new
            order ID line // shows the structure
            desc
            **drop noise
            keep if strpos(s,"REFERENZ")>0 | strpos(s, "ERFASSUNG") >0 | strpos(s, "FREIGABE")>0
            Let's assume that I am only interested in the lines of s kept by the -keep- command above. I would now like to structure the dataset so that each reference ("REFERENZ-NUMMER") identifies an observation, with the variables of interest being the dates and times in s that follow the prefix "ERFASSUNG/BEARBEITUNG" or "FREIGABE". Within s, the left-hand side essentially contains the variable names (e.g., "REFERENZ-NUMMER", "ERFASSUNG", etc.), while the right-hand side contains the actual values I am interested in: the dates and times at which the transaction was processed and approved, and the (here anonymized) employee who processed it (e.g., VVVN).

            Code:
            ERFASSUNG/BEARBEITUNG S022K480 VVVN 25.10.2019 10:11
            The line above therefore contains three variables: the name of the employee who processed the transaction ("VVVN"), the date (25.10.2019), and the time (10:11).

            Ideally, the dataset would look like this, with each reference as an identifier and the other variables containing processing date, time, and employee (if processing was not automated). For the analysis, I think the long format is the way to go.
            Reference          Automated_processing_date  Automated_processing_time  Processing_1_employee  Processing_1_date  Processing_1_time
            191025022BB110025  25.10.2019                 09:53                      VVVN                   25.10.2019         10:11
            I hope this conveys the desired structure of the dataset. Any suggestions on how to accomplish this?

            Regards,
            Julian
            Last edited by Julian Scholz; 15 Apr 2021, 03:00.



            • #7
              I don't completely understand all the details of what you want, and I have left out dealing with the time of day, but the following (to be run after your code in #6) should illustrate some relevant techniques:

              Code:
              replace s = ustrregexra(s, "  *", " ")
              split s, gen(s)
              gen referenz = s2 if strpos(s1, "REFERENZ")
              gen vorgangstyp = s4 if strpos(s3, "VORGANGSTYP")
              bysort ID (origorder): replace referenz = referenz[1] if missing(referenz)  // keep original order so line 1 is the REFERENZ line
              order referenz, before(s1)
              gen double erfdate  = daily(s4, "DMY")  if strpos(s1, "ERFASSUNG")
              gen double fredate = daily(s5, "DMY")  if strpos(s1, "FREIGABE")
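
              For the time of day that I left out, one possible extension along the same lines (a sketch; it reuses the split variables above, and clock() wants the date and time in a single string):

              Code:
              * Sketch: combine the date and time tokens and convert with clock();
              * the mask "DMYhm" matches strings like "25.10.2019 10:11".
              gen double erftime = clock(s4 + " " + s5, "DMYhm") if strpos(s1, "ERFASSUNG")
              gen double fretime = clock(s5 + " " + s6, "DMYhm") if strpos(s1, "FREIGABE")
              format erfdate fredate %td
              format erftime fretime %tc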



              • #8
                Thank you, Mike! The techniques you suggested helped me structure the data as desired.

