Import data from a txt file, with an unwanted line break

Thong Nguyen

Join Date: Oct 2015
Posts: 236

Import data from a txt file, with an unwanted line break

17 Dec 2024, 21:00

Hi all,

I need your help with my issues when importing a data file, in txt, to Stata.

I copied and pasted this data set from a website, and it gave me an unwanted line break for different variables in a same observation.
Original txt file is like this.

Code:

64421180 by userA on 2024-08-13 11:20:31 UTC
Board A (#21)
64388130 by userA on 2024-08-03 13:40:09 UTC
Board B (#22)
64374963 by userA on 2024-07-30 16:03:47 UTC
Board C (#23)
64371979 by userA on 2024-07-29 15:25:23 UTC
Board D (#24)
64371679 by userA on 2024-07-29 13:45:30 UTC
Board E (#25)

I used following code for import, and result is as follows

Code:

import delimited test.txt, delimiter(" ", collapse) clear

Code:

. list

     +----------------------------------------------------------+
     |       v1   v2      v3   v4           v5         v6    v7 |
     |----------------------------------------------------------|
  1. | 64421180   by   userA   on   2024-08-13   11:20:31   UTC |
  2. |    Board    A   (#21)                                    |
  3. | 64388130   by   userA   on   2024-08-03   13:40:09   UTC |
  4. |    Board    B   (#22)                                    |
  5. | 64374963   by   userA   on   2024-07-30   16:03:47   UTC |
     |----------------------------------------------------------|
  6. |    Board    C   (#23)                                    |
  7. | 64371979   by   userA   on   2024-07-29   15:25:23   UTC |
  8. |    Board    D   (#24)                                    |
  9. | 64371679   by   userA   on   2024-07-29   13:45:30   UTC |
 10. |    Board    E   (#25)                                    |
     +----------------------------------------------------------+

I would like to have Board A, Board B, Board C, Board D on a same row of each observation 1, 3, 5, 7, 9 and so on.

Data

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str8 v1 str2 v2 str5 v3 str2 v4 str10 v5 str8 v6 str3 v7
"64421180" "by" "userA" "on" "2024-08-13" "11:20:31" "UTC"
"Board"    "A"  "(#21)" ""   ""           ""         ""  
"64388130" "by" "userA" "on" "2024-08-03" "13:40:09" "UTC"
"Board"    "B"  "(#22)" ""   ""           ""         ""  
"64374963" "by" "userA" "on" "2024-07-30" "16:03:47" "UTC"
"Board"    "C"  "(#23)" ""   ""           ""         ""  
"64371979" "by" "userA" "on" "2024-07-29" "15:25:23" "UTC"
"Board"    "D"  "(#24)" ""   ""           ""         ""  
"64371679" "by" "userA" "on" "2024-07-29" "13:45:30" "UTC"
"Board"    "E"  "(#25)" ""   ""           ""         ""  
end

Thank you for your help.

P.S:
Additionally, if v3 in this case, is for a name, that has space in between, not only "userA" but something like "Alex A B C", how do I import that it as a single variable rather than split it to different variables, that's not what I want.

Last edited by Thong Nguyen; 17 Dec 2024, 21:10.

Tags: None

Daniel Schaefer

Join Date: Mar 2020
Posts: 814

17 Dec 2024, 22:21

I would fix the issue after importing. For instance:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str8 v1 str2 v2 str5 v3 str2 v4 str10 v5 str8 v6 str3 v7
"64421180" "by" "userA" "on" "2024-08-13" "11:20:31" "UTC"
"Board"    "A"  "(#21)" ""   ""           ""         ""  
"64388130" "by" "userA" "on" "2024-08-03" "13:40:09" "UTC"
"Board"    "B"  "(#22)" ""   ""           ""         ""  
"64374963" "by" "userA" "on" "2024-07-30" "16:03:47" "UTC"
"Board"    "C"  "(#23)" ""   ""           ""         ""  
"64371979" "by" "userA" "on" "2024-07-29" "15:25:23" "UTC"
"Board"    "D"  "(#24)" ""   ""           ""         ""  
"64371679" "by" "userA" "on" "2024-07-29" "13:45:30" "UTC"
"Board"    "E"  "(#25)" ""   ""           ""         ""  
end

gen v8 = v1[_n + 1]
gen v9 = v2[_n + 1]
gen v10 = v3[_n + 1]
drop if mod(_n, 2) == 0
list, clean noobs

Code:

. list, clean noobs

          v1   v2      v3   v4           v5         v6    v7      v8   v9     v10  
    64421180   by   userA   on   2024-08-13   11:20:31   UTC   Board    A   (#21)  
    64388130   by   userA   on   2024-08-03   13:40:09   UTC   Board    B   (#22)  
    64374963   by   userA   on   2024-07-30   16:03:47   UTC   Board    C   (#23)  
    64371979   by   userA   on   2024-07-29   15:25:23   UTC   Board    D   (#24)  
    64371679   by   userA   on   2024-07-29   13:45:30   UTC   Board    E   (#25)

I'm not sure I understand what you are describing in the post script. Is v3 always space separated or only sometimes space separated? If its the former you can just concatenate the strings, if it's the latter some lines will span more variables than others, and that complicates things somewhat.

Comment

Thong Nguyen

Join Date: Oct 2015
Posts: 236

17 Dec 2024, 23:08

Thank you, Daniel. Your solution looks good.

I mean something like this.

Code:

64421180 by userA on 2024-08-13 11:20:31 UTC
Board A (#21)
64388130 by user A on 2024-08-03 13:40:09 UTC
Board B (#22)
64374963 by user Mr A on 2024-07-30 16:03:47 UTC
Board C (#23)
64371979 by user Mr Michael A on 2024-07-29 15:25:23 UTC
Board D (#24)

And I'd like to have "userA", "user A", "user Mr A", "user Mr Michael A", for example for the variable v3, or we can call it as a username variable.

Code:

. list

     +-------------------------------------------------------------------------------------------------+
     |       v1   v2      v3   v4           v5           v6           v7           v8         v9   v10 |
     |-------------------------------------------------------------------------------------------------|
  1. | 64421180   by   userA   on   2024-08-13     11:20:31          UTC                               |
  2. |    Board    A   (#21)                                                                           |
  3. | 64388130   by    user    A           on   2024-08-03     13:40:09          UTC                  |
  4. |    Board    B   (#22)                                                                           |
  5. | 64374963   by    user   Mr            A           on   2024-07-30     16:03:47        UTC       |
     |-------------------------------------------------------------------------------------------------|
  6. |    Board    C   (#23)                                                                           |
  7. | 64371979   by    user   Mr      Michael            A           on   2024-07-29   15:25:23   UTC |
  8. |    Board    D   (#24)                                                                           |
     +-------------------------------------------------------------------------------------------------+

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str8 v1 str2 v2 str5 v3 str2 v4 str10(v5 v6 v7 v8) str8 v9 str3 v10
"64421180" "by" "userA" "on" "2024-08-13" "11:20:31"   "UTC"        ""           ""         ""   
"Board"    "A"  "(#21)" ""   ""           ""           ""           ""           ""         ""   
"64388130" "by" "user"  "A"  "on"         "2024-08-03" "13:40:09"   "UTC"        ""         ""   
"Board"    "B"  "(#22)" ""   ""           ""           ""           ""           ""         ""   
"64374963" "by" "user"  "Mr" "A"          "on"         "2024-07-30" "16:03:47"   "UTC"      ""   
"Board"    "C"  "(#23)" ""   ""           ""           ""           ""           ""         ""   
"64371979" "by" "user"  "Mr" "Michael"    "A"          "on"         "2024-07-29" "15:25:23" "UTC"
"Board"    "D"  "(#24)" ""   ""           ""           ""           ""           ""         ""   
end

How to fix this issue? Thank you.

Comment

Andrew Musau

Join Date: Oct 2014
Posts: 10190

18 Dec 2024, 01:35

For standardized text, such as that found on webpages, regular expressions work quite well for extracting textual content. This is the approach I advocate in https://journals.sagepub.com/doi/10....6867X241233678.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str8 v1 str2 v2 str5 v3 str2 v4 str10(v5 v6 v7 v8) str8 v9 str3 v10
"64421180" "by" "userA" "on" "2024-08-13" "11:20:31"   "UTC"        ""           ""         ""  
"Board"    "A"  "(#21)" ""   ""           ""           ""           ""           ""         ""  
"64388130" "by" "user"  "A"  "on"         "2024-08-03" "13:40:09"   "UTC"        ""         ""  
"Board"    "B"  "(#22)" ""   ""           ""           ""           ""           ""         ""  
"64374963" "by" "user"  "Mr" "A"          "on"         "2024-07-30" "16:03:47"   "UTC"      ""  
"Board"    "C"  "(#23)" ""   ""           ""           ""           ""           ""         ""  
"64371979" "by" "user"  "Mr" "Michael"    "A"          "on"         "2024-07-29" "15:25:23" "UTC"
"Board"    "D"  "(#24)" ""   ""           ""           ""           ""           ""         ""  
end

egen text= concat(v*), punct(" ")
replace text = text + " " +text[_n+1] if regexm(text, "^\d")
drop if !regexm(text, "^\d")
gen userid=ustrregexra(text, "(\d+)\s.*", "$1")
gen username=ustrregexra(text, ".*by(.*)on.*", "$1")
gen date= ustrregexra(text,".*(\d{4}-\d{2}-\d{2}).*","$1")
gen time= ustrregexra(text,".*(\d{2}:\d{2}:\d{2}).*","$1")
gen timezone= ustrregexra(text,".*\d{2}:\d{2}:\d{2}\s+([A-Z]{3})\s.*","$1")
gen board= ustrregexra(text,".*(Board .*)$","$1")
drop v*

Res.:

Code:

. l text

     +------------------------------------------------------------------------+
     |                                                                   text |
     |------------------------------------------------------------------------|
  1. |             64421180 by userA on 2024-08-13 11:20:31 UTC Board A (#21) |
  2. |            64388130 by user A on 2024-08-03 13:40:09 UTC Board B (#22) |
  3. |         64374963 by user Mr A on 2024-07-30 16:03:47 UTC Board C (#23) |
  4. | 64371979 by user Mr Michael A on 2024-07-29 15:25:23 UTC Board D (#24) |
     +------------------------------------------------------------------------+

. l userid- board

     +-----------------------------------------------------------------------------------+
     |   userid              username         date       time   timezone           board |
     |-----------------------------------------------------------------------------------|
  1. | 64421180                userA    2024-08-13   11:20:31        UTC   Board A (#21) |
  2. | 64388130               user A    2024-08-03   13:40:09        UTC   Board B (#22) |
  3. | 64374963            user Mr A    2024-07-30   16:03:47        UTC   Board C (#23) |
  4. | 64371979    user Mr Michael A    2024-07-29   15:25:23        UTC   Board D (#24) |
     +-----------------------------------------------------------------------------------+

Last edited by Andrew Musau; 18 Dec 2024, 01:38.

Comment

Thong Nguyen

Join Date: Oct 2015

Posts: 236
#5

18 Dec 2024, 02:04

Hi Andrew,

Your code looks perfect but I can not use the function ustrregexra, as I used Stata 13. Any other way out, please.
Comment

Andrew Musau

Join Date: Oct 2014
Posts: 10190

18 Dec 2024, 02:56

There are other string functions as well as the regexm command that you can use in version 13. Here is how you'd extract the username, which was your question in #3.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str8 v1 str2 v2 str5 v3 str2 v4 str10(v5 v6 v7 v8) str8 v9 str3 v10
"64421180" "by" "userA" "on" "2024-08-13" "11:20:31"   "UTC"        ""           ""         ""  
"Board"    "A"  "(#21)" ""   ""           ""           ""           ""           ""         ""  
"64388130" "by" "user"  "A"  "on"         "2024-08-03" "13:40:09"   "UTC"        ""         ""  
"Board"    "B"  "(#22)" ""   ""           ""           ""           ""           ""         ""  
"64374963" "by" "user"  "Mr" "A"          "on"         "2024-07-30" "16:03:47"   "UTC"      ""  
"Board"    "C"  "(#23)" ""   ""           ""           ""           ""           ""         ""  
"64371979" "by" "user"  "Mr" "Michael"    "A"          "on"         "2024-07-29" "15:25:23" "UTC"
"Board"    "D"  "(#24)" ""   ""           ""           ""           ""           ""         ""  
end

egen text= concat(v*), punct(" ")
replace text = text + " " +text[_n+1] if regexm(text, "^\d")
drop if !regexm(text, "^\d")
drop v*
gen username =substr(text, strpos(text, "by")+2, strpos(text, "on")-strpos(text, "by")-2)

Res.:

Code:

. gen username =substr(text, strpos(text, "by")+2, strpos(text, "on")-strpos(text, "by")-2)

. l

     +----------------------------------------------------------------------------------------------+
     |                                                                   text              username |
     |----------------------------------------------------------------------------------------------|
  1. |             64421180 by userA on 2024-08-13 11:20:31 UTC Board A (#21)                userA  |
  2. |            64388130 by user A on 2024-08-03 13:40:09 UTC Board B (#22)               user A  |
  3. |         64374963 by user Mr A on 2024-07-30 16:03:47 UTC Board C (#23)            user Mr A  |
  4. | 64371979 by user Mr Michael A on 2024-07-29 15:25:23 UTC Board D (#24)    user Mr Michael A  |
     +----------------------------------------------------------------------------------------------+

.

Comment

Andrew Musau

Join Date: Oct 2014

Posts: 10190
#7

18 Dec 2024, 03:02

Alongside the documentation, this should be useful reading: https://journals.sagepub.com/doi/10....6867X221141068
Comment

Daniel Schaefer

Join Date: Mar 2020
Posts: 814

18 Dec 2024, 10:26

At the risk of robbing you of a valuable opportunity to learn more about regular expressions, it looks like you can use the words 'by', 'on', and 'utc' as separators. Building on Andrew's solution, something like this should work:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str8 v1 str2 v2 str5 v3 str2 v4 str10(v5 v6 v7 v8) str8 v9 str3 v10
"64421180" "by" "userA" "on" "2024-08-13" "11:20:31"   "UTC"        ""           ""         ""  
"Board"    "A"  "(#21)" ""   ""           ""           ""           ""           ""         ""  
"64388130" "by" "user"  "A"  "on"         "2024-08-03" "13:40:09"   "UTC"        ""         ""  
"Board"    "B"  "(#22)" ""   ""           ""           ""           ""           ""         ""  
"64374963" "by" "user"  "Mr" "A"          "on"         "2024-07-30" "16:03:47"   "UTC"      ""  
"Board"    "C"  "(#23)" ""   ""           ""           ""           ""           ""         ""  
"64371979" "by" "user"  "Mr" "Michael"    "A"          "on"         "2024-07-29" "15:25:23" "UTC"
"Board"    "D"  "(#24)" ""   ""           ""           ""           ""           ""         ""  
end

egen text= concat(v*), punct(" ")
replace text = text + " " +text[_n+1] if regexm(text, "^\d")
drop if !regexm(text, "^\d")
drop v*

split text, parse("by")
split text2, parse("on")
split text22, parse("UTC")
split text222, parse("(#")
replace text2222 = subinstr(text2222, ")", "", .)

rename text1 id
rename text21 user
rename text221 datetime
rename text2221 board
rename text2222 boardnum

drop text*

It's not pretty, but it's very easy to understand what's happening above. Alternatively, you could just read in each line without a separator (possibly by giving -import delimited- a separator character you know doesn't appear in the data) to yield something like this:

Code:

clear
input str64 v1
"64421180 by userA on 2024-08-13 11:20:31 UTC"
"Board A (#21)"
"64388130 by user A on 2024-08-03 13:40:09 UTC"
"Board B (#22)"
"64374963 by user Mr A on 2024-07-30 16:03:47 UTC"
"Board C (#23)"
"64371979 by user Mr Michael A on 2024-07-29 15:25:23 UTC"
"Board D (#24)"
end

Then you could proceed without any regular expressions at all, the advantage being code that is very easy to understand, even for those unfamiliar with regular expressions.

Code:

rename v1 text
replace text = text + " " +text[_n+1]
drop if mod(_n,2) == 0

split text, parse("by")
split text2, parse("on")
split text22, parse("UTC")
split text222, parse("(#")
replace text2222 = subinstr(text2222, ")", "", .)

rename text1 id
rename text21 user
rename text221 datetime
rename text2221 board
rename text2222 boardnum

drop text*

Of course, regular expressions are powerful tools for working with strings that are well worth learning. I will usually download data from a webpage as a long html string, then I parse the html directly with regular expressions to extract the data. It's also a very good idea to keep an eye out for the html <table> tag, since data organized in html tables are particularly easy to work with. Wikipedia tables generally work like that and it is very convenient.

Last edited by Daniel Schaefer; 18 Dec 2024, 10:33.

Comment

Daniel Schaefer

Join Date: Mar 2020

Posts: 814
#9

18 Dec 2024, 10:30

Andrew Musau Would you be willing to forward the article in #4 as a PDF either here or in a DM? I would like to read but don't appear to have institutional access.
Comment
Andrew Musau

Join Date: Oct 2014

Posts: 10190
#10

18 Dec 2024, 10:57

Originally posted by Daniel Schaefer View Post

Andrew Musau Would you be willing to forward the article in #4 as a PDF either here or in a DM? I would like to read but don't appear to have institutional access.

Sure, I can email it to you.
1 like
Comment

Thong Nguyen

Join Date: Oct 2015
Posts: 236

#11

18 Dec 2024, 17:54

Hi Andrew and Daniel,

Could you check it again please.

I use your code and it doesn't work

Code:

. clear

. input str8 v1 str2 v2 str5 v3 str2 v4 str10(v5 v6 v7 v8) str8 v9 str3 v10

            v1         v2         v3         v4          v5          v6          v7          v8         v9        v10
  1. "64421180" "by" "userA" "on" "2024-08-13" "11:20:31"   "UTC"        ""           ""         ""  
  2. "Board"    "A"  "(#21)" ""   ""           ""           ""           ""           ""         ""  
  3. "64388130" "by" "user"  "A"  "on"         "2024-08-03" "13:40:09"   "UTC"        ""         ""  
  4. "Board"    "B"  "(#22)" ""   ""           ""           ""           ""           ""         ""  
  5. "64374963" "by" "user"  "Mr" "A"          "on"         "2024-07-30" "16:03:47"   "UTC"      ""  
  6. "Board"    "C"  "(#23)" ""   ""           ""           ""           ""           ""         ""  
  7. "64371979" "by" "user"  "Mr" "Michael"    "A"          "on"         "2024-07-29" "15:25:23" "UTC"
  8. "Board"    "D"  "(#24)" ""   ""           ""           ""           ""           ""         ""  
  9. end

. 
. egen text= concat(v*), punct(" ")

. replace text = text + " " +text[_n+1] if regexm(text, "^\d")
(0 real changes made)

. drop if !regexm(text, "^\d")
(8 observations deleted)

. drop v*

Issues can come from these lines

Code:

. replace text = text + " " +text[_n+1] if regexm(text, "^\d")
. drop if !regexm(text, "^\d")

By the way, what is it for

Code:

regexm(text, "^\d")

I am going to learn about regexm with this source https://stats.oarc.ucla.edu/stata/fa...r-expressions/ but firstly, if possible I would like to get your help again.
Thanks.

Comment

Thong Nguyen

Join Date: Oct 2015
Posts: 236

#12

18 Dec 2024, 18:16

I tried and it seems this one works.

Code:

egen text= concat(v*), punct(" ")
replace text = text + " " + text[_n+1]
drop if mod(_n,2) == 0

split text, parse("by")
split text2, parse("on")
split text22, parse("UTC")
split text222, parse("(#")
replace text2222 = subinstr(text2222, ")", "", .)

rename text1 id
rename text21 user
rename text221 datetime
rename text2221 board
rename text2222 boardnum

drop text*

Thank you.

Comment

Andrew Musau

Join Date: Oct 2014
Posts: 10190

#13

19 Dec 2024, 10:23

Originally posted by Thong Nguyen View Post

Hi Andrew and Daniel,

Could you check it again please.

Oh sorry, I'm still using updated syntax for regexm that was not available in version 13, specifically, "\d" to represent numerical characters. I need "[0-9]" instead. Try this:

Code:

version 13
* Example generated by -dataex-. To install: ssc install dataex
clear
input str8 v1 str2 v2 str5 v3 str2 v4 str10(v5 v6 v7 v8) str8 v9 str3 v10
"64421180" "by" "userA" "on" "2024-08-13" "11:20:31"   "UTC"        ""           ""         ""  
"Board"    "A"  "(#21)" ""   ""           ""           ""           ""           ""         ""  
"64388130" "by" "user"  "A"  "on"         "2024-08-03" "13:40:09"   "UTC"        ""         ""  
"Board"    "B"  "(#22)" ""   ""           ""           ""           ""           ""         ""  
"64374963" "by" "user"  "Mr" "A"          "on"         "2024-07-30" "16:03:47"   "UTC"      ""  
"Board"    "C"  "(#23)" ""   ""           ""           ""           ""           ""         ""  
"64371979" "by" "user"  "Mr" "Michael"    "A"          "on"         "2024-07-29" "15:25:23" "UTC"
"Board"    "D"  "(#24)" ""   ""           ""           ""           ""           ""         ""  
end

egen text= concat(v*), punct(" ")
replace text = text + " " +text[_n+1] if regexm(text, "^[0-9]")
drop if !regexm(text, "^[0-9]")
drop v*
gen username =substr(text, strpos(text, "by")+2, strpos(text, "on")-strpos(text, "by")-2)

Res.:

Code:

. l

     +----------------------------------------------------------------------------------------------+
     |                                                                   text              username |
     |----------------------------------------------------------------------------------------------|
  1. |             64421180 by userA on 2024-08-13 11:20:31 UTC Board A (#21)                userA  |
  2. |            64388130 by user A on 2024-08-03 13:40:09 UTC Board B (#22)               user A  |
  3. |         64374963 by user Mr A on 2024-07-30 16:03:47 UTC Board C (#23)            user Mr A  |
  4. | 64371979 by user Mr Michael A on 2024-07-29 15:25:23 UTC Board D (#24)    user Mr Michael A  |
     +----------------------------------------------------------------------------------------------+

.

Comment

Thong Nguyen

Join Date: Oct 2015

Posts: 236
#14

21 Dec 2024, 03:52

Originally posted by Andrew Musau View Post

Thank you Andrew, it works perfectly now.
Comment

Announcement

Import data from a txt file, with an unwanted line break

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment