
  • Import text files: Delimiter and Transformation Issues

    Dear Statalisters,

    I am trying to import a text file that lacks a consistent delimiter. The file records financial transactions together with their times of arrival and processing. The format (see the shortened and anonymized sample below) looks tab-delimited, but the letters and numbers are actually separated by (multiple) spaces. Moreover, the file contains a lot of "noise" that I am not interested in. My initial idea for preparing an analysis-ready dataset was to import all the information into one column (a single variable) and then manually split that column into variables and observations. Each new observation is marked by its reference ("REFERENZ-NUMMER"), and the other variables of interest are each preceded by "ERFASSUNG/BEARBEITUNG". The challenge is then to disentangle the observations and variables from that single column.

    I now wonder what the optimal approach to cleaning this messy text file would be. Is there a smarter approach than using
    Code:
    import delimited "filename", delimiter("", collapse)
    and then manually retrieving the information I need from a single column?

    I've also tried to prepare the file in Excel, unfortunately without any improvement. Any comments or guidance are appreciated!

    Regards,
    Julian
    Attached Files

  • #2
    Like many people here, I don't like to open attachments, even ones named ".txt". I believe more people would be willing to take a look at your problem if you posted your text material within your message, between code delimiters.

    Beyond that, some suggestions:
    1) You say the file "seems to be tab-delimited [but actually spaces]." What makes you think that? -charlist- (available from SSC) or -hexdump- might be relevant if there is whitespace that may or may not be tabs or spaces, and they might bear on other issues as well (see the sketch after this list).
    2) -filefilter- might offer a way to regularize the delimiters.
    3) -fileread- and -filewrite-, along with some search/replace work, might help with cleaning up the file.
    4) Knowing something about the line structure of the file would matter.
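
    For instance, a minimal sketch of 1) and 2), assuming your file is named sample.txt: -hexdump, analyze- tabulates the characters actually present in the file (so you can see whether the whitespace is tabs, spaces, or both), and -filefilter- can then swap a confirmed delimiter for something easier to split on:

    Code:
    * Sketch (filename assumed): inspect the file's characters; then, if tabs
    * turn out to be present, convert them to a single visible delimiter.
    hexdump "sample.txt", analyze
    filefilter "sample.txt" "sample_clean.txt", from("\t") to(";") replace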



    • #3
      Thank you, Mike!

      Here is the text material:

      Code:
      REFERENZ-NUMMER: 191025022BB110025     VSS-VORGANGSTYP: MT100ES  
      Freig. Datum:          29.10.2019  
      Betrag Soll:                 0,00  
      Betrag Haben:                0,00  
      Sender BIC:           HTLxxxxF  
      Empfänger BIC:  
      Empfänger BLZ:           00000000  
      Kto-Nr. Soll:          9400446005  
      Kto-Nr. Haben:         9999999999  
      Fremde Referenz: 00103C89491  
      MT:                         MT103  
      
      
      ----> AUFTRAGSUEBERSICHT <----  
      STATUSINFORMATIONEN  
      AUFTRAGSDATEN  
      MT103-EINGANG  
      MT103-AUSGANG  
      Kennz. KOMPC = 0  
      EMBARGOPRUEFUNG  
      
      
      ----> STATUSINFO <----  
          
      MT103 VON SWIFT ERHALTEN                           25.10.2019 09:53  
      ERFASSUNG/BEARBEITUNG    00000000 MASCHINELL       25.10.2019 09:53  
      ERFASSUNG/BEARBEITUNG    S022K480 VVVN             25.10.2019 10:11  
      ERFASSUNG/BEARBEITUNG    S022K414 KKKKKKSSSSHHHHH  29.10.2019 15:43  
      FREIGABE ERF./BEARB.     S022K558 HUGHA            29.10.2019 15:59  
      MT103 WEITERZULEITEN  
      KURSERMITTLUNG ERFORDERLICH  
      AKTUALISIERUNG DER STATISTIK ERFORDERLICH  
      UEBERGABE AN BUCHUNGSSYSTEM ERFORDERLICH  
      ********* DER VORGANG IST GELOESCHT *********  
      
      
      
      REFERENZ-NUMMER: 1910255665500002     VSS-VORGANGSTYP: MT100ES  
      Freig. Datum:          29.10.2019  
      Betrag Soll:               644,49  
      Betrag Haben:              583,35  
      Sender BIC:           XXXXXLLLF  
      Empfänger BIC:  
      Empfänger BLZ:           00000000  
      Kto-Nr. Soll:          9036416007  
      Kto-Nr. Haben:            1174689  
      Fremde Referenz: 00101BkkC990  
      MT:                         MT103  
      
      
      ----> AUFTRAGSUEBERSICHT <----  
      STATUSINFORMATIONEN  
      AUFTRAGSDATEN  
      MT103-EINGANG  
      MT103-AUSGANG  
      Kennz. KOMPC = 0  
      EMBARGOPRUEFUNG  
      
      
      ----> STATUSINFORMATIONEN <----  
          
      MT103 VON SWIFT ERHALTEN                           28.10.2019 05:56  
      KEIN EMBARGO-VERDACHT    oder aus (im MT103-E)     28.10.2019 05:56  
      ERFASSUNG/BEARBEITUNG    00000000 MASCHINELL       28.10.2019 05:56  
      ERFASSUNG/BEARBEITUNG    S022K480 sSSSS            28.10.2019 08:55  
      ERFASSUNG/BEARBEITUNG    S022K480 XXXX             29.10.2019 08:24  
      FREIGABE ERF./BEARB.     S022XAKR KKKLLLLWWW       29.10.2019 10:58  
      KURSERMITTLUNG DURCHGEFUEHRT                      29.10.2019 13:41  
        
      ********* DER VORGANG IST BEENDET *********  
      
      
      REFERENZ-NUMMER: 1966580XXXX00008     VSS-VORGANGSTYP: MT100ES  
      Freig. Datum:          28.10.2019  
      Betrag Soll:            12.159,44  
      Betrag Haben:           12.159,44  
      Sender BIC:           TTTTTTAF  
      Empfänger BIC:  
      Empfänger BLZ:           00000000  
      Kto-Nr. Soll:          9555555500  
      Kto-Nr. Haben:             444445  
      Fremde Referenz: 00101BS5140  
      MT:                         MT103  
      
      
      ----> AUFTRAGSUEBERSICHT <----  
      STATUSINFORMATIONEN  
      AUFTRAGSDATEN  
      MT103-EINGANG  
      MT103-AUSGANG  
      Kennz. KOMPC = 0  
      EMBARGOPRUEFUNG  
      
      
      ----> STATUSINFORMATIONEN <----  
          
      MT103 VON SWIFT ERHALTEN                           28.10.2019 07:40  
      KEIN EMBARGO-VERDACHT    oder aus (im MT103-E)     28.10.2019 07:41  
      ERFASSUNG/BEARBEITUNG    00000000 MASCHINELL       28.10.2019 07:41  
      ERFASSUNG/BEARBEITUNG    S022K480 DDDD             28.10.2019 08:54  
      ********* DER VORGANG IST BEENDET *********
      1) Most numbers and words appear aligned/justified, but neither Excel nor Stata recognizes tab as the delimiter.
      4) The file is an automatic excerpt; I believe it is in line-sequential format (if I understand correctly).

      Last edited by Julian Scholz; 14 Apr 2021, 09:20.



      • #4
        In general, Stata *can* recognize tabs as delimiters; that is one of the options described in -help import delimited-.
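
        For example, a minimal illustration (the filename is assumed):

        Code:
        import delimited using "YourFile.txt", delimiters("\t")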

        Setting that aside: I'd think first about separating your data into "transactions," as I understand that to be the unit of analysis. If I can assume that every line beginning with "REFERENZ-NUMMER:" starts a new transaction, I'd begin as follows:

        Code:
        clear
        // Assume "zzzzz" never occurs so that each line is read as one string.
        import delimited using "YourFile.txt", delimiter("zzzzz", asstring)
        rename v1 s 
        // Put an ID and line number on each line that belongs to the same transaction.
        gen int ID = .
        quiet replace ID = cond(_n ==1, 1, ID[_n-1] + (strpos(s, "REFERENZ-NUMMER") > 0))
        bysort ID: gen int line = _n
        order ID line // shows the structure
        desc
        From this point, reshaping to a wide format, with one observation per transaction and each line stored in its own variable, *might* be useful, depending on how you want to break up the information within each transaction. If that's relevant, you could do:
        Code:
        reshape wide s, i(ID) j(line)
        This may be slow if you have a large file.
        It might also turn out that treating the whole transaction as one string would help, in order to process it with string functions:
        Code:
        egen wholething = concat(s*)
        drop s*
        However, the usefulness of these last two ideas would depend on how you want to define, separate, and process different parts of the transaction, which to me seems like the hardest part of your problem. Doing that will require some substantive knowledge of the content of transactions, as well as what characters might appear as delimiters.
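
        If you did treat the whole transaction as one string, here is one hypothetical extraction, sketched with Stata's Unicode regular-expression functions (the variable wholething comes from the code just above):

        Code:
        * Sketch: pull the reference number out of the full transaction string.
        gen referenz = ustrregexs(1) if ustrregexm(wholething, "REFERENZ-NUMMER:\s+(\S+)")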



        • #5
          It occurred to me that -bysort ID- is not guaranteed to leave the original order of lines within a transaction untouched, so to be safe, I'd recommend capturing that original order and including it in the -bysort-:

          Code:
          clear
          // Assume "zzzzz" never occurs so that each line is read as one string.
          import delimited using "YourFile.txt", delimiter("zzzzz", asstring)
          rename v1 s
          // Put an ID and line number on each line that belongs to the same transaction.
          gen int ID = .
          quiet replace ID = cond(_n ==1, 1, ID[_n-1] + (strpos(s, "REFERENZ-NUMMER") > 0))
          gen long origorder = _n // new
          bysort ID (origorder) : gen int line = _n  //new
          order ID line // shows the structure
          desc



          • #6
            Mike, thanks for your help! In order to "clean" the dataset, I used the following code:

            Code:
            clear
            **Assume "zzzzz" never occurs so that each line is read as one string.
            import delimited using "C:\Users\Scholz.ECFS-SERV\Desktop\DSGF\sample.txt", delimiter("zzzzz", asstring)
            rename v1 s
            // Put an ID and line number on each line that belongs to the same transaction.
            gen int ID = .
            quiet replace ID = cond(_n ==1, 1, ID[_n-1] + (strpos(s, "REFERENZ-NUMMER") > 0))
            gen long origorder = _n // new
            bysort ID (origorder) : gen int line = _n //new
            order ID line // shows the structure
            desc
            **drop noise
            keep if strpos(s,"REFERENZ")>0 | strpos(s, "ERFASSUNG") >0 | strpos(s, "FREIGABE")>0
            Let's assume that I am only interested in the lines of s kept by the -keep- command above. I would now like to structure the dataset so that each reference ("REFERENZ-NUMMER") identifies an observation, with the variables of interest being the dates and times in s that follow the prefix "ERFASSUNG/BEARBEITUNG" or "FREIGABE". Within s, the left-hand side essentially contains the variable names (e.g., "REFERENZ-NUMMER", "ERFASSUNG", etc.), while the right-hand side contains the actual values I am interested in: the dates and times at which the transaction was processed and approved, and the (here anonymized) employee who processed it (e.g., VVVN).

            Code:
            ERFASSUNG/BEARBEITUNG S022K480 VVVN 25.10.2019 10:11
            The line above therefore contains three variables: the name of the employee who processed the transaction ("VVVN"), the date (25.10.2019), and the time (10:11).

            Ideally, the dataset would look like this, with each reference as an identifier and the other variables containing processing date, time, and employee (if processing was not automated). For the analysis, I think the long format is the way to go.
            Reference          Automated_processing_date  Automated_processing_time  Processing_1_employee  Processing_1_date  Processing_1_time
            191025022BB110025  25.10.2019                 09:53                      VVVN                   25.10.2019         10:11
            I hope this conveys the desired structure of the dataset. Any suggestions on how to accomplish this?

            Regards,
            Julian
            Last edited by Julian Scholz; 15 Apr 2021, 03:00.



            • #7
              I don't completely understand all the details of what you want, and I have left out dealing with the time of day, but the following (to be run after your code in #6) should illustrate some relevant techniques:

              Code:
              replace s = ustrregexra(s, "  *", " ")
              split s, gen(s)
              gen referenz = s2 if strpos(s1, "REFERENZ")
              gen vorgangstyp = s4 if strpos(s3, "VORGANGSTYP")
              bysort ID (origorder): replace referenz = referenz[1] if missing(referenz)  // keep original order so line 1 is the REFERENZ line
              order referenz, before(s1)
              gen double erfdate  = daily(s4, "DMY")  if strpos(s1, "ERFASSUNG")
              gen double fredate = daily(s5, "DMY")  if strpos(s1, "FREIGABE")
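
              For the time of day that I left out, one possible extension along the same lines (a sketch; it reuses the split variables above, and clock() wants the date and time in a single string):

              Code:
              * Sketch: combine the date and time tokens and convert with clock();
              * the mask "DMYhm" matches strings like "25.10.2019 10:11".
              gen double erftime = clock(s4 + " " + s5, "DMYhm") if strpos(s1, "ERFASSUNG")
              gen double fretime = clock(s5 + " " + s6, "DMYhm") if strpos(s1, "FREIGABE")
              format erfdate fredate %td
              format erftime fretime %tc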



              • #8
                Thank you, Mike! The techniques you suggested helped me structure the data as desired.

