Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Import data from a txt file, with an unwanted line break

    Hi all,

    I need your help with my issues when importing a data file, in txt, to Stata.

    I copied and pasted this data set from a website, and it gave me an unwanted line break for different variables in a same observation.
    Original txt file is like this.
    Code:
    64421180 by userA on 2024-08-13 11:20:31 UTC
    Board A (#21)
    64388130 by userA on 2024-08-03 13:40:09 UTC
    Board B (#22)
    64374963 by userA on 2024-07-30 16:03:47 UTC
    Board C (#23)
    64371979 by userA on 2024-07-29 15:25:23 UTC
    Board D (#24)
    64371679 by userA on 2024-07-29 13:45:30 UTC
    Board E (#25)
    I used following code for import, and result is as follows
    Code:
    import delimited test.txt, delimiter(" ", collapse) clear
    Code:
    . list
    
         +----------------------------------------------------------+
         |       v1   v2      v3   v4           v5         v6    v7 |
         |----------------------------------------------------------|
      1. | 64421180   by   userA   on   2024-08-13   11:20:31   UTC |
      2. |    Board    A   (#21)                                    |
      3. | 64388130   by   userA   on   2024-08-03   13:40:09   UTC |
      4. |    Board    B   (#22)                                    |
      5. | 64374963   by   userA   on   2024-07-30   16:03:47   UTC |
         |----------------------------------------------------------|
      6. |    Board    C   (#23)                                    |
      7. | 64371979   by   userA   on   2024-07-29   15:25:23   UTC |
      8. |    Board    D   (#24)                                    |
      9. | 64371679   by   userA   on   2024-07-29   13:45:30   UTC |
     10. |    Board    E   (#25)                                    |
         +----------------------------------------------------------+
    I would like to have Board A, Board B, Board C, Board D on a same row of each observation 1, 3, 5, 7, 9 and so on.

    Data
    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str8 v1 str2 v2 str5 v3 str2 v4 str10 v5 str8 v6 str3 v7
    "64421180" "by" "userA" "on" "2024-08-13" "11:20:31" "UTC"
    "Board"    "A"  "(#21)" ""   ""           ""         ""  
    "64388130" "by" "userA" "on" "2024-08-03" "13:40:09" "UTC"
    "Board"    "B"  "(#22)" ""   ""           ""         ""  
    "64374963" "by" "userA" "on" "2024-07-30" "16:03:47" "UTC"
    "Board"    "C"  "(#23)" ""   ""           ""         ""  
    "64371979" "by" "userA" "on" "2024-07-29" "15:25:23" "UTC"
    "Board"    "D"  "(#24)" ""   ""           ""         ""  
    "64371679" "by" "userA" "on" "2024-07-29" "13:45:30" "UTC"
    "Board"    "E"  "(#25)" ""   ""           ""         ""  
    end
    Thank you for your help.

    P.S:
    Additionally, if v3 in this case, is for a name, that has space in between, not only "userA" but something like "Alex A B C", how do I import that it as a single variable rather than split it to different variables, that's not what I want.
    Last edited by Thong Nguyen; 17 Dec 2024, 22:10.

  • #2
    I would fix the issue after importing. For instance:

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str8 v1 str2 v2 str5 v3 str2 v4 str10 v5 str8 v6 str3 v7
    "64421180" "by" "userA" "on" "2024-08-13" "11:20:31" "UTC"
    "Board"    "A"  "(#21)" ""   ""           ""         ""  
    "64388130" "by" "userA" "on" "2024-08-03" "13:40:09" "UTC"
    "Board"    "B"  "(#22)" ""   ""           ""         ""  
    "64374963" "by" "userA" "on" "2024-07-30" "16:03:47" "UTC"
    "Board"    "C"  "(#23)" ""   ""           ""         ""  
    "64371979" "by" "userA" "on" "2024-07-29" "15:25:23" "UTC"
    "Board"    "D"  "(#24)" ""   ""           ""         ""  
    "64371679" "by" "userA" "on" "2024-07-29" "13:45:30" "UTC"
    "Board"    "E"  "(#25)" ""   ""           ""         ""  
    end
    
    gen v8 = v1[_n + 1]
    gen v9 = v2[_n + 1]
    gen v10 = v3[_n + 1]
    drop if mod(_n, 2) == 0
    list, clean noobs
    Code:
    . list, clean noobs
    
              v1   v2      v3   v4           v5         v6    v7      v8   v9     v10  
        64421180   by   userA   on   2024-08-13   11:20:31   UTC   Board    A   (#21)  
        64388130   by   userA   on   2024-08-03   13:40:09   UTC   Board    B   (#22)  
        64374963   by   userA   on   2024-07-30   16:03:47   UTC   Board    C   (#23)  
        64371979   by   userA   on   2024-07-29   15:25:23   UTC   Board    D   (#24)  
        64371679   by   userA   on   2024-07-29   13:45:30   UTC   Board    E   (#25)
    I'm not sure I understand what you are describing in the post script. Is v3 always space separated or only sometimes space separated? If its the former you can just concatenate the strings, if it's the latter some lines will span more variables than others, and that complicates things somewhat.

    Comment


    • #3
      Thank you, Daniel. Your solution looks good.

      I'm not sure I understand what you are describing in the post script. Is v3 always space separated or only sometimes space separated? If its the former you can just concatenate the strings, if it's the latter some lines will span more variables than others, and that complicates things somewhat.
      I mean something like this.
      Code:
      64421180 by userA on 2024-08-13 11:20:31 UTC
      Board A (#21)
      64388130 by user A on 2024-08-03 13:40:09 UTC
      Board B (#22)
      64374963 by user Mr A on 2024-07-30 16:03:47 UTC
      Board C (#23)
      64371979 by user Mr Michael A on 2024-07-29 15:25:23 UTC
      Board D (#24)
      And I'd like to have "userA", "user A", "user Mr A", "user Mr Michael A", for example for the variable v3, or we can call it as a username variable.
      Code:
      . list
      
           +-------------------------------------------------------------------------------------------------+
           |       v1   v2      v3   v4           v5           v6           v7           v8         v9   v10 |
           |-------------------------------------------------------------------------------------------------|
        1. | 64421180   by   userA   on   2024-08-13     11:20:31          UTC                               |
        2. |    Board    A   (#21)                                                                           |
        3. | 64388130   by    user    A           on   2024-08-03     13:40:09          UTC                  |
        4. |    Board    B   (#22)                                                                           |
        5. | 64374963   by    user   Mr            A           on   2024-07-30     16:03:47        UTC       |
           |-------------------------------------------------------------------------------------------------|
        6. |    Board    C   (#23)                                                                           |
        7. | 64371979   by    user   Mr      Michael            A           on   2024-07-29   15:25:23   UTC |
        8. |    Board    D   (#24)                                                                           |
           +-------------------------------------------------------------------------------------------------+
      Code:
      * Example generated by -dataex-. To install: ssc install dataex
      clear
      input str8 v1 str2 v2 str5 v3 str2 v4 str10(v5 v6 v7 v8) str8 v9 str3 v10
      "64421180" "by" "userA" "on" "2024-08-13" "11:20:31"   "UTC"        ""           ""         ""   
      "Board"    "A"  "(#21)" ""   ""           ""           ""           ""           ""         ""   
      "64388130" "by" "user"  "A"  "on"         "2024-08-03" "13:40:09"   "UTC"        ""         ""   
      "Board"    "B"  "(#22)" ""   ""           ""           ""           ""           ""         ""   
      "64374963" "by" "user"  "Mr" "A"          "on"         "2024-07-30" "16:03:47"   "UTC"      ""   
      "Board"    "C"  "(#23)" ""   ""           ""           ""           ""           ""         ""   
      "64371979" "by" "user"  "Mr" "Michael"    "A"          "on"         "2024-07-29" "15:25:23" "UTC"
      "Board"    "D"  "(#24)" ""   ""           ""           ""           ""           ""         ""   
      end
      How to fix this issue? Thank you.

      Comment


      • #4
        For standardized text, such as that found on webpages, regular expressions work quite well for extracting textual content. This is the approach I advocate in https://journals.sagepub.com/doi/10....6867X241233678.

        Code:
        * Example generated by -dataex-. To install: ssc install dataex
        clear
        input str8 v1 str2 v2 str5 v3 str2 v4 str10(v5 v6 v7 v8) str8 v9 str3 v10
        "64421180" "by" "userA" "on" "2024-08-13" "11:20:31"   "UTC"        ""           ""         ""  
        "Board"    "A"  "(#21)" ""   ""           ""           ""           ""           ""         ""  
        "64388130" "by" "user"  "A"  "on"         "2024-08-03" "13:40:09"   "UTC"        ""         ""  
        "Board"    "B"  "(#22)" ""   ""           ""           ""           ""           ""         ""  
        "64374963" "by" "user"  "Mr" "A"          "on"         "2024-07-30" "16:03:47"   "UTC"      ""  
        "Board"    "C"  "(#23)" ""   ""           ""           ""           ""           ""         ""  
        "64371979" "by" "user"  "Mr" "Michael"    "A"          "on"         "2024-07-29" "15:25:23" "UTC"
        "Board"    "D"  "(#24)" ""   ""           ""           ""           ""           ""         ""  
        end
        
        egen text= concat(v*), punct(" ")
        replace text = text + " " +text[_n+1] if regexm(text, "^\d")
        drop if !regexm(text, "^\d")
        gen userid=ustrregexra(text, "(\d+)\s.*", "$1")
        gen username=ustrregexra(text, ".*by(.*)on.*", "$1")
        gen date= ustrregexra(text,".*(\d{4}-\d{2}-\d{2}).*","$1")
        gen time= ustrregexra(text,".*(\d{2}:\d{2}:\d{2}).*","$1")
        gen timezone= ustrregexra(text,".*\d{2}:\d{2}:\d{2}\s+([A-Z]{3})\s.*","$1")
        gen board= ustrregexra(text,".*(Board .*)$","$1")
        drop v*
        Res.:

        Code:
        . l text
        
             +------------------------------------------------------------------------+
             |                                                                   text |
             |------------------------------------------------------------------------|
          1. |             64421180 by userA on 2024-08-13 11:20:31 UTC Board A (#21) |
          2. |            64388130 by user A on 2024-08-03 13:40:09 UTC Board B (#22) |
          3. |         64374963 by user Mr A on 2024-07-30 16:03:47 UTC Board C (#23) |
          4. | 64371979 by user Mr Michael A on 2024-07-29 15:25:23 UTC Board D (#24) |
             +------------------------------------------------------------------------+
        
        . l userid- board
        
             +-----------------------------------------------------------------------------------+
             |   userid              username         date       time   timezone           board |
             |-----------------------------------------------------------------------------------|
          1. | 64421180                userA    2024-08-13   11:20:31        UTC   Board A (#21) |
          2. | 64388130               user A    2024-08-03   13:40:09        UTC   Board B (#22) |
          3. | 64374963            user Mr A    2024-07-30   16:03:47        UTC   Board C (#23) |
          4. | 64371979    user Mr Michael A    2024-07-29   15:25:23        UTC   Board D (#24) |
             +-----------------------------------------------------------------------------------+
        Last edited by Andrew Musau; 18 Dec 2024, 02:38.

        Comment


        • #5
          Hi Andrew,

          Your code looks perfect but I can not use the function ustrregexra, as I used Stata 13. Any other way out, please.

          Comment


          • #6
            There are other string functions as well as the regexm command that you can use in version 13. Here is how you'd extract the username, which was your question in #3.

            Code:
            * Example generated by -dataex-. To install: ssc install dataex
            clear
            input str8 v1 str2 v2 str5 v3 str2 v4 str10(v5 v6 v7 v8) str8 v9 str3 v10
            "64421180" "by" "userA" "on" "2024-08-13" "11:20:31"   "UTC"        ""           ""         ""  
            "Board"    "A"  "(#21)" ""   ""           ""           ""           ""           ""         ""  
            "64388130" "by" "user"  "A"  "on"         "2024-08-03" "13:40:09"   "UTC"        ""         ""  
            "Board"    "B"  "(#22)" ""   ""           ""           ""           ""           ""         ""  
            "64374963" "by" "user"  "Mr" "A"          "on"         "2024-07-30" "16:03:47"   "UTC"      ""  
            "Board"    "C"  "(#23)" ""   ""           ""           ""           ""           ""         ""  
            "64371979" "by" "user"  "Mr" "Michael"    "A"          "on"         "2024-07-29" "15:25:23" "UTC"
            "Board"    "D"  "(#24)" ""   ""           ""           ""           ""           ""         ""  
            end
            
            egen text= concat(v*), punct(" ")
            replace text = text + " " +text[_n+1] if regexm(text, "^\d")
            drop if !regexm(text, "^\d")
            drop v*
            gen username =substr(text, strpos(text, "by")+2, strpos(text, "on")-strpos(text, "by")-2)
            Res.:

            Code:
            . gen username =substr(text, strpos(text, "by")+2, strpos(text, "on")-strpos(text, "by")-2)
            
            . l
            
                 +----------------------------------------------------------------------------------------------+
                 |                                                                   text              username |
                 |----------------------------------------------------------------------------------------------|
              1. |             64421180 by userA on 2024-08-13 11:20:31 UTC Board A (#21)                userA  |
              2. |            64388130 by user A on 2024-08-03 13:40:09 UTC Board B (#22)               user A  |
              3. |         64374963 by user Mr A on 2024-07-30 16:03:47 UTC Board C (#23)            user Mr A  |
              4. | 64371979 by user Mr Michael A on 2024-07-29 15:25:23 UTC Board D (#24)    user Mr Michael A  |
                 +----------------------------------------------------------------------------------------------+
            
            .

            Comment


            • #7
              Alongside the documentation, this should be useful reading: https://journals.sagepub.com/doi/10....6867X221141068

              Comment


              • #8
                At the risk of robbing you of a valuable opportunity to learn more about regular expressions, it looks like you can use the words 'by', 'on', and 'utc' as separators. Building on Andrew's solution, something like this should work:

                Code:
                * Example generated by -dataex-. To install: ssc install dataex
                clear
                input str8 v1 str2 v2 str5 v3 str2 v4 str10(v5 v6 v7 v8) str8 v9 str3 v10
                "64421180" "by" "userA" "on" "2024-08-13" "11:20:31"   "UTC"        ""           ""         ""  
                "Board"    "A"  "(#21)" ""   ""           ""           ""           ""           ""         ""  
                "64388130" "by" "user"  "A"  "on"         "2024-08-03" "13:40:09"   "UTC"        ""         ""  
                "Board"    "B"  "(#22)" ""   ""           ""           ""           ""           ""         ""  
                "64374963" "by" "user"  "Mr" "A"          "on"         "2024-07-30" "16:03:47"   "UTC"      ""  
                "Board"    "C"  "(#23)" ""   ""           ""           ""           ""           ""         ""  
                "64371979" "by" "user"  "Mr" "Michael"    "A"          "on"         "2024-07-29" "15:25:23" "UTC"
                "Board"    "D"  "(#24)" ""   ""           ""           ""           ""           ""         ""  
                end
                
                egen text= concat(v*), punct(" ")
                replace text = text + " " +text[_n+1] if regexm(text, "^\d")
                drop if !regexm(text, "^\d")
                drop v*
                
                split text, parse("by")
                split text2, parse("on")
                split text22, parse("UTC")
                split text222, parse("(#")
                replace text2222 = subinstr(text2222, ")", "", .)
                
                rename text1 id
                rename text21 user
                rename text221 datetime
                rename text2221 board
                rename text2222 boardnum
                
                drop text*
                It's not pretty, but it's very easy to understand what's happening above. Alternatively, you could just read in each line without a separator (possibly by giving -import delimited- a separator character you know doesn't appear in the data) to yield something like this:

                Code:
                clear
                input str64 v1
                "64421180 by userA on 2024-08-13 11:20:31 UTC"
                "Board A (#21)"
                "64388130 by user A on 2024-08-03 13:40:09 UTC"
                "Board B (#22)"
                "64374963 by user Mr A on 2024-07-30 16:03:47 UTC"
                "Board C (#23)"
                "64371979 by user Mr Michael A on 2024-07-29 15:25:23 UTC"
                "Board D (#24)"
                end
                Then you could proceed without any regular expressions at all, the advantage being code that is very easy to understand, even for those unfamiliar with regular expressions.

                Code:
                rename v1 text
                replace text = text + " " +text[_n+1]
                drop if mod(_n,2) == 0
                
                split text, parse("by")
                split text2, parse("on")
                split text22, parse("UTC")
                split text222, parse("(#")
                replace text2222 = subinstr(text2222, ")", "", .)
                
                rename text1 id
                rename text21 user
                rename text221 datetime
                rename text2221 board
                rename text2222 boardnum
                
                drop text*
                Of course, regular expressions are powerful tools for working with strings that are well worth learning. I will usually download data from a webpage as a long html string, then I parse the html directly with regular expressions to extract the data. It's also a very good idea to keep an eye out for the html <table> tag, since data organized in html tables are particularly easy to work with. Wikipedia tables generally work like that and it is very convenient.
                Last edited by Daniel Schaefer; 18 Dec 2024, 11:33.

                Comment


                • #9
                  Andrew Musau Would you be willing to forward the article in #4 as a PDF either here or in a DM? I would like to read but don't appear to have institutional access.

                  Comment


                  • #10
                    Originally posted by Daniel Schaefer View Post
                    Andrew Musau Would you be willing to forward the article in #4 as a PDF either here or in a DM? I would like to read but don't appear to have institutional access.
                    Sure, I can email it to you.

                    Comment


                    • #11
                      Hi Andrew and Daniel,

                      Could you check it again please.

                      I use your code and it doesn't work
                      Code:
                      . clear
                      
                      . input str8 v1 str2 v2 str5 v3 str2 v4 str10(v5 v6 v7 v8) str8 v9 str3 v10
                      
                                  v1         v2         v3         v4          v5          v6          v7          v8         v9        v10
                        1. "64421180" "by" "userA" "on" "2024-08-13" "11:20:31"   "UTC"        ""           ""         ""  
                        2. "Board"    "A"  "(#21)" ""   ""           ""           ""           ""           ""         ""  
                        3. "64388130" "by" "user"  "A"  "on"         "2024-08-03" "13:40:09"   "UTC"        ""         ""  
                        4. "Board"    "B"  "(#22)" ""   ""           ""           ""           ""           ""         ""  
                        5. "64374963" "by" "user"  "Mr" "A"          "on"         "2024-07-30" "16:03:47"   "UTC"      ""  
                        6. "Board"    "C"  "(#23)" ""   ""           ""           ""           ""           ""         ""  
                        7. "64371979" "by" "user"  "Mr" "Michael"    "A"          "on"         "2024-07-29" "15:25:23" "UTC"
                        8. "Board"    "D"  "(#24)" ""   ""           ""           ""           ""           ""         ""  
                        9. end
                      
                      . 
                      . egen text= concat(v*), punct(" ")
                      
                      . replace text = text + " " +text[_n+1] if regexm(text, "^\d")
                      (0 real changes made)
                      
                      . drop if !regexm(text, "^\d")
                      (8 observations deleted)
                      
                      . drop v*
                      Issues can come from these lines
                      Code:
                      . replace text = text + " " +text[_n+1] if regexm(text, "^\d")
                      . drop if !regexm(text, "^\d")
                      By the way, what is it for
                      Code:
                      regexm(text, "^\d")
                      I am going to learn about regexm with this source https://stats.oarc.ucla.edu/stata/fa...r-expressions/ but firstly, if possible I would like to get your help again.
                      Thanks.

                      Comment


                      • #12
                        I tried and it seems this one works.
                        Code:
                        egen text= concat(v*), punct(" ")
                        replace text = text + " " + text[_n+1]
                        drop if mod(_n,2) == 0
                        
                        split text, parse("by")
                        split text2, parse("on")
                        split text22, parse("UTC")
                        split text222, parse("(#")
                        replace text2222 = subinstr(text2222, ")", "", .)
                        
                        rename text1 id
                        rename text21 user
                        rename text221 datetime
                        rename text2221 board
                        rename text2222 boardnum
                        
                        drop text*
                        Thank you.

                        Comment


                        • #13
                          Originally posted by Thong Nguyen View Post
                          Hi Andrew and Daniel,

                          Could you check it again please.
                          Oh sorry, I'm still using updated syntax for regexm that was not available in version 13, specifically, "\d" to represent numerical characters. I need "[0-9]" instead. Try this:

                          Code:
                          version 13
                          * Example generated by -dataex-. To install: ssc install dataex
                          clear
                          input str8 v1 str2 v2 str5 v3 str2 v4 str10(v5 v6 v7 v8) str8 v9 str3 v10
                          "64421180" "by" "userA" "on" "2024-08-13" "11:20:31"   "UTC"        ""           ""         ""  
                          "Board"    "A"  "(#21)" ""   ""           ""           ""           ""           ""         ""  
                          "64388130" "by" "user"  "A"  "on"         "2024-08-03" "13:40:09"   "UTC"        ""         ""  
                          "Board"    "B"  "(#22)" ""   ""           ""           ""           ""           ""         ""  
                          "64374963" "by" "user"  "Mr" "A"          "on"         "2024-07-30" "16:03:47"   "UTC"      ""  
                          "Board"    "C"  "(#23)" ""   ""           ""           ""           ""           ""         ""  
                          "64371979" "by" "user"  "Mr" "Michael"    "A"          "on"         "2024-07-29" "15:25:23" "UTC"
                          "Board"    "D"  "(#24)" ""   ""           ""           ""           ""           ""         ""  
                          end
                          
                          egen text= concat(v*), punct(" ")
                          replace text = text + " " +text[_n+1] if regexm(text, "^[0-9]")
                          drop if !regexm(text, "^[0-9]")
                          drop v*
                          gen username =substr(text, strpos(text, "by")+2, strpos(text, "on")-strpos(text, "by")-2)
                          Res.:

                          Code:
                          . l
                          
                               +----------------------------------------------------------------------------------------------+
                               |                                                                   text              username |
                               |----------------------------------------------------------------------------------------------|
                            1. |             64421180 by userA on 2024-08-13 11:20:31 UTC Board A (#21)                userA  |
                            2. |            64388130 by user A on 2024-08-03 13:40:09 UTC Board B (#22)               user A  |
                            3. |         64374963 by user Mr A on 2024-07-30 16:03:47 UTC Board C (#23)            user Mr A  |
                            4. | 64371979 by user Mr Michael A on 2024-07-29 15:25:23 UTC Board D (#24)    user Mr Michael A  |
                               +----------------------------------------------------------------------------------------------+
                          
                          .

                          Comment


                          • #14
                            Originally posted by Andrew Musau View Post
                            Thank you Andrew, it works perfectly now.

                            Comment

                            Working...
                            X