Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Importing multiple text files to Stata

    Dear all,
    I am trying to import ASCII files to Stata. The problem is that there are two ASCII files describing panel data; File1 containing information regarding more than 4000 companies and their activity. They are two files. Another file (File2) provides the time period for which the variables in the first file have been requested.
    I have different doubts because this is the first time I deal with such kind of data:
    In the first term, I transformed to .csv file and imported File1, following the procedure:
    Code:
    import delimited "file1", encoding(UTF-8)
    I have seen that one recommendation is to create a data dictionary or automate the creation of the variables. However, the result is a table with unnamed variables and not fitting with the variables to describe. Can it be the solution to this problem?
    In the second term, do I have to import first the two files to create a panel structure, or do you recommend creating separately and merging in a second step?
    Finally, if you can recommend me some lecture or reference about this procedure, I will be pleased.
    Thank you very much for your attention.

  • #2
    The solution will almost certainly require importing each file into Stata, saving it as a .dta file, and then using -merge- to combine the two files. Giving you advice further than that, however, is difficult without seeing any example data. If you would post a short example excerpt of each file (using StataList's code delimiter # feature for visual clarity), that would greatly increase the chances of you getting a quick and helpful answer. If you have successfully imported each file into Stata, post the example using the -dataex- command, as described in the StataList FAQ. If you are having problems importing, post some of the raw ASCII data for each file along with an explanation of what you know about the data layout.

    Comment


    • #3
      Dear Mike,
      Thank you very much for your answer, and of course, it is difficult to advise with no data visualization, my apologies.
      As a first step, what I’ve done is to convert the .dat files to .csv and then import files to Stata using:
      import delimited "file1", encoding(UTF-8)
      with this result for file 1:
      Code:
      * Example generated by -dataex-. For more info, type help dataex
      clear
      input int v1 byte(v2 v3 v4 v5) int v6 byte(v7 v8 v9 v10 v11 v12 v13 v14 v15 v16 v17 v18 v19 v20)
      100 . . . .    . . . . .  .  .  .  .  .  .  .  . . .
      101 2 2 2 2 1965 1 4 4 3 10 10 10 10 20 21 32 35 0 0
      102 . . . .    . . . . .  .  .  .  .  .  .  .  . . .
      103 . . . .    . . . . .  .  .  .  .  .  .  .  . . .
      104 . . . .    . . . . .  .  .  .  .  .  .  .  . . .
      105 . . . .    . . . . .  .  .  .  .  .  .  .  . . .
      106 . . . .    . . . . .  .  .  .  .  .  .  .  . . .
      107 . . . .    . . . . .  .  .  .  .  .  .  .  . . .
      108 2 2 2 2 1980 1 1 1 1 10 10 10 10 12 15 13 19 0 0
      109 . . . .    . . . . .  .  .  .  .  .  .  .  . . .
      110 . . . .    . . . . .  .  .  .  .  .  .  .  . . .
      111 . . . .    . . . . .  .  .  .  .  .  .  .  . . .
      112 . . . .    . . . . .  .  .  .  .  .  .  .  . . .
      113 . . . .    . . . . .  .  .  .  .  .  .  .  . . .
      114 2 2 2 2 1973 1 1 1 1 10 10 10 10 26 26 26 21 0 0
      115 . . . .    . . . . .  .  .  .  .  .  .  .  . . .
      116 . . . .    . . . . .  .  .  .  .  .  .  .  . . .
      117 . . . .    . . . . .  .  .  .  .  .  .  .  . . .
      118 . . . .    . . . . .  .  .  .  .  .  .  .  . . .
      119 . . . .    . . . . .  .  .  .  .  .  .  .  . . .
      120 . . . .    . . . . .  .  .  .  .  .  .  .  . . .
      121 . . . .    . . . . .  .  .  .  .  .  .  .  . . .
      122 . . . .    . . . . .  .  .  .  .  .  .  .  . . .
      123 . . . .    . . . . .  .  .  .  .  .  .  .  . . .
      124 2 2 2 2 1989 1 1 1 1 10 10 10 10 23 20 30 20 0 0
      125 . . . .    . . . . .  .  .  .  .  .  .  .  . . .
      126 . . . .    . . . . .  .  .  .  .  .  .  .  . . .
      127 . . . .    . . . . .  .  .  .  .  .  .  .  . . .
      128 . . . .    . . . . .  .  .  .  .  .  .  .  . . .
      129 . . . .    . . . . .  .  .  .  .  .  .  .  . . .
      130 . . . .    . . . . .  .  .  .  .  .  .  .  . . .
      end
      And this is the result for file 2:
      Code:
      * Example generated by -dataex-.           For         more     info,       type      help       dataex
      clear
      input int v1 str1(v2 v3 v4)
      100 "0" "0" "0"
      101 "1" "1" "1"
      102 "0" "0" "0"
      103 "0" "0" "0"
      104 "0" "0" "0"
      105 "0" "0" "0"
      106 "0" "0" "0"
      107 "2" "0" "0"
      108 "1" "1" "1"
      109 "0" "0" "0"
      110 "0" "0" "0"
      111 "0" "0" "0"
      112 "0" "0" "0"
      113 "0" "0" "0"
      114 "1" "1" "1"
      115 "0" "0" "0"
      116 "0" "0" "0"
      117 "0" "0" "0"
      118 "0" "0" "0"
      119 "0" "0" "0"
      120 "0" "0" "0"
      121 "0" "0" "0"
      122 "0" "0" "0"
      123 "0" "0" "0"
      124 "1" "1" "1"
      125 "0" "0" "0"
      126 "0" "0" "0"
      127 "0" "0" "0"
      128 "0" "0" "0"
      129 "0" "0" "0"
      130 "0" "0" "0"
      end
      I see that the result is not correct because this same database I have a previous version, and it should look like the following:
      Code:
      * Example generated by -dataex-. For more info, type help dataex
      clear
      input int(id year) float(growth size exp colc)
      5 2012           .         . . .
      6 1990           . 15.306912 . .
      6 1991   .03807773  15.34499 1 .
      6 1992  -.12022754 15.224762 0 .
      6 1993 -.071513176  15.15325 0 .
      6 1994  .020949407 15.174198 0 .
      6 1995   .18092208  15.35512 0 .
      6 1996  -.06973294 15.285388 0 .
      6 1997   .29177457 15.577163 1 .
      6 1998    .2185872  15.79575 1 0
      6 1999  -.05033931  15.74541 0 0
      6 2000   -.4005544 15.344856 0 0
      6 2001    .1177724 15.462628 0 0
      6 2002  .030672895   15.4933 0 0
      6 2003   .11669356 15.609995 0 0
      6 2004 -.034847923 15.575147 0 0
      6 2005  .020050725 15.595198 0 0
      6 2006    .3060077 15.901205 0 0
      6 2007    .2287926 16.129997 0 0
      6 2008   -.2309416 15.899056 0 0
      6 2009   -.7451451  15.15391 0 0
      6 2010  .036229804  15.19014 0 0
      6 2011 -.067294516 15.122847 1 0
      6 2012  -.26465213 14.858194 0 0
      7 1990   1.7560264 16.614222 . .
      7 1991  -.03786857 16.576351 0 .
      7 1992   -.2128487 16.363503 0 .
      7 1993    .1209208 16.484425 0 .
      7 1994   .12471449 16.609138 1 .
      7 1995           .         . . .
      7 1996           .         . . .
      7 1997           .         . . .
      7 1998           .         . . .
      7 1999           .         . . .
      7 2000           .         . . .
      7 2001           .         . . .
      7 2002           .         . . .
      7 2003           .         . . .
      7 2004           .         . . .
      7 2005           .         . . .
      7 2006           .         . . .
      7 2007           .         . . .
      7 2008           .         . . .
      7 2009           .         . . .
      7 2010           .         . . .
      7 2011           .         . . .
      7 2012           .         . . .
      8 1990           . 16.086248 . .
      8 1991  -.06271114 16.023537 0 .
      8 1992 -.024684256 15.998853 0 .
      8 1993  -.26220426 15.736649 0 .
      end
      I have to use the command infile because I saw that I have the data in .dat format and separately the documentation for the file(s).
      That is why I created a dictionary for file1; however, I am not sure of the code I am working with as I am dealing with panel data. I attach just a part of this dictionary code because the data is compound by more than 40 variables:
      Code:
      infix dictionary using file1.dat {
         NOMIN 1-10
         AAII2013 12-17
         AAII2014 19-24
         AAII2015 26-31
         AAII2016 33-38
         ARIP 40-45
         ADI2013 47-52
         ADI2014 54-59
         ADI2015 61-66
         ADI2016 68-73
         }
      In this sense, I have a doubt; I don’t know if I can reshape later data into panel data or if the specification must be introduced in the dictionary. In the way I have done, I separated in different variables per each year, meaning that I will have a high number of missing data per each variable? I don’t feel confident with this solution, but I have no alternative idea of how to solve it by now.
      Thank you in advance for your help. Cordially,

      Comment


      • #4
        I'd suggest to not worry about reshaping or other restructuring *now*. I think we need to first get good imports of your data files. Restructuring is not something I have ever had occasion to do in the context of importing a file.

        Some thoughts/questions in that direction:

        1. You say "what I’ve done is to convert the .dat files to .csv."
        To me, that might mean that you somehow substituted commas at all relevant positions in some text file that *happened* to have the sufix ".dat." I suspect you did something else. Showing a sample of that file 1 and explaining what you did would be helpful. What you did may well be perfectly fine, but I'm not sure.

        2. What you show for file 2 makes me think there was a problem with the import. Can you show a sample of that text file before you imported it and obtained all those strings?

        3. "I have to use the command infile because I saw that I have the data in .dat format."
        I'd have to say "no" to that. There is no definitive definition of a "dat" format. It's just a popular name for a text file (usually plain ASCII) that contains raw data. The reason for using the -infile- command would be if you have a raw text file that does not have a delimiter (comma, tab, space, etc.) between variables, but instead has a fixed format, which appears to be why you used it. I have not used -infile- for several years, but I looked at the Data Management manual to refresh my mind about using -infile-, and it appears (I might be wrong) and the example I see do *not* show Stata's infile using the ColumnStart-ColumnStop system as you have done. I'd look more closely at the examples there that involve % format descriptors. (Maybe what you are doing could work, but I didn't find it easily.)

        Comment


        • #5
          Dear Mike,

          Thank you for your kind answer. Going point by point:

          1. Yes, you are right; indeed, I replaced the spaces of the original text with commas, replacing the original format separated by spaces to commas.

          2. This is a sample of the raw text file:
          Code:
          0000000005      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .            .            .            .            .            .            .            .            .            .            .            .            .            .            .            .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .            .            .            .            .      .      .      .      .      .      .      .      .      .            .            .            .            .            .            .            .            .            .            .            .            .      .      .      .      .      .      .      .      . 
          0000000006      2      2      .      .   1973      1      1      .      .     16     16      .      .     31     23      .      .     10     10      .      .     92    100      .      .      .      .      .      .      2      2      .      .      2      2      .      .      2      2      .      .      2      2      .      .      3      3      .      .      0      .      .      1     .0     .0      .      .      1      1      .      .      2      2      .      .            0            0            .           .0           .0            .            .           .0           .0            .            .       152368       237951            .            .      0      0      .      .      1      1      .      .      1      1      .      .      1      1      .      .      2      2      .      .      2      2      .      .      2      2      .      .  -42.1  -24.9      .      .     13     13      .      .      0      0      .      .      1      1      .      .      1      1      .      .      0      0      .      .      0      0      .      .      0      0      .      .      1     38     30      .      .    6.7      .      .  100.0  100.0      .      .     .0     .0      .      .         23.6         33.2            .            .      3     38     32      .      .     .0     .0      .      .       966658      1061477            .            .      2289312      2156045            .            .            0          272            .            .    100    100      .      .    100      0      0      0 
          0000000007      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .            .            .            .            .            .            .            .            .            .            .            .            .            .            .            .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .            .            .            .            .      .      .      .      .      .      .      .      .      .            .            .            .            .            .            .            .            .            .            .            .            .      .      .      .      .      .      .      .      . 
          0000000008      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .            .            .            .            .            .            .            .            .            .            .            .            .            .            .            .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .            .            .            .            .      .      .      .      .      .      .      .      .      .            .            .            .            .            .            .            .            .            .            .            .            .      .      .      .      .      .      .      .      . 
          0000000009      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .            .            .            .            .            .            .            .            .            .            .            .            .            .            .            .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .      .
          3. Thank you for the explanation; by your text, what I deduced is that infile command is not the proper command to import, and I changed to the import delimited command, in this sense, the result is fitting better; the sample has 5840 observations and the Stata file resulting is also containing the 5840 observations. The problem is that following this command, the result in Stata shows just one variable in a single column, while the data is compound by 52 variables. I have tried to show an example; however, using dataex command Stata issued the following error:

          [CODE] dataex v1 in 1/10

          ----------------------- copy starting from the next line -----------------------
          Code:
          * Example generated by -dataex-. For more info, type help dataex
          clear
          input str1520 v1
          data width (1522 chars) exceeds max linesize. Try specifying fewer variables
          r(1000);
          Thank you in advance for your help. Cordially,

          Comment


          • #6
            Dear Mike,

            Problem solved, I used import delimited by space, and everything worked perfectly.

            Thank you for your advice.

            Comment

            Working...
            X