  • Append command changes the content of observations of the appended file

    I have an issue with the append command.

    I have several Stata files that each contain three variables: File, Location and Document.
    I created these variables by working with Wordstat in Stata. I want to analyse several text files at once for keyword frequencies (with egen noccur). Wordstat can create Stata files that hold the whole text of a text file as one observation in the variable Document.
    The variable File contains the name of the file, the variable Location contains the name of the folder, and the variable Document contains the text of that particular file.

    I have stored the text files in 20 different Stata files because converting them with Wordstat took 1 hour per Stata file, so it is not practical to load all text files at once with Wordstat. I knew the append command and expected it to work.
    However, when I open one of these Stata files, which contains the three variables, and run append using with one of the other Stata files, I receive

    (label FILE_lst already defined)
    (label LOCATION_lst already defined)

    In addition, observations are added for the text files from the appended Stata file, but only the Document variable contains the correct information, namely the text of each file. The Location value in the appended Stata file is different, but after appending it shows the same name as in the first Stata file. The same happens with the File variable: a number or another wrong file name is shown as the file name. Thus, File and Location are incorrect for the newly appended observations.

    The variables File and Location have the same type and value label in both Stata files: type long, value label FILE_lst or LOCATION_lst. DOCUMENT has type strL and no value label.

    I use the File variable as an identifier to merge the Stata file afterwards with a master Stata file that contains panel data. So the File and Document variables must be correct for each observation. The Location variable is not that important.


    How can I solve the issue?

  • #2
    It seems to me that, in the 20 different datasets you want to append, you - or Wordstat, or something else - used Stata's encode command to change your variables FILE and LOCATION from string variables to numeric variables with value labels.

    The problem is, having done that in 20 separate datasets, in each of the 20 datasets the encoding starts over at 1 for each of the two variables being encoded, and the value labels will be different in each dataset. FILE==1 in your first dataset is something entirely different than FILE==1 in your second dataset.
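
    A minimal sketch of how the mismatch could be inspected, assuming two of the datasets are saved as file1.dta and file2.dta (hypothetical names) and the value label is called FILE_lst, as in the messages quoted in #1:

    ```stata
    * List the code-to-name mapping stored in each dataset.
    * file1 and file2 are placeholder names for two of the 20 datasets.
    use file1, clear
    label list FILE_lst
    use file2, clear
    label list FILE_lst
    ```

    If the two listings attach different file names to the same codes, that would explain the symptom: when a value label with the same name exists in both datasets, append keeps the definition from the data in memory and reports "(label FILE_lst already defined)", so the appended codes are displayed with the first dataset's names.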

    To append your files, you should have left FILE and LOCATION as string variables until after the append, and then, if you truly needed these to be numeric variables, used encode on the full set of data.

    The best thing to do would be to go back to the point in your processing where you applied encode to those variables, remove those commands, and then continue forward from there. If that is not possible - perhaps it was the fault of Wordstat, for example - then you need to re-create string variables from the numeric variables using decode. The following untested code may point you in a useful direction.
    Code:
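    * First file: turn the encoded numeric variables back into strings
    use file1
    rename (LOCATION FILE) (Ln Fn)
    decode Ln, generate(LOCATION)
    decode Fn, generate(FILE)
    drop Ln Fn
    save appended, replace
    * Second file: decode in the same way, then append what has been collected so far
    use file2
    rename (LOCATION FILE) (Ln Fn)
    decode Ln, generate(LOCATION)
    decode Fn, generate(FILE)
    drop Ln Fn
    append using appended
    save appended, replace
    * ... and so on for the remaining files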
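
    The repeated steps above could also be written as a loop. This is an untested sketch, not part of the original answer. It assumes the 20 datasets are named file1.dta through file20.dta (hypothetical names) and that FILE and LOCATION are encoded in every one of them. Because it uses a temporary file, it needs to be run in one go from a do-file.

    ```stata
    * Hypothetical file names file1.dta ... file20.dta; adjust to the real names.
    clear
    tempfile combined
    * Start from an empty dataset so the first append has something to use.
    save `combined', emptyok
    forvalues i = 1/20 {
        use file`i', clear
        * Recover the original strings from the encoded variables.
        rename (LOCATION FILE) (Ln Fn)
        decode Ln, generate(LOCATION)
        decode Fn, generate(FILE)
        drop Ln Fn
        * Add everything collected so far and store the running result.
        append using `combined'
        save `combined', replace
    }
    * The data in memory now hold all files; keep a permanent copy.
    save appended, replace
    ```

    Since FILE stays a string throughout, it can serve directly as the key for a later merge, provided the master dataset stores the file name as a string as well.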



    • #3
      Dear William,

      thank you a lot. I used the code and it seems to work!
      After running append using, Stata says "(note: variable FILE was str55, now str79 to accommodate using data's values)".
      For me this appears to be a workable solution. As I understand the note, the longest file name in the data in memory was 55 characters long, and the type changed to str79 because one observation in the appended Stata file has a file name that is 79 characters long.

      Best regards,
      Robert
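
      The reading of the note in #3 could be checked with a short sketch like the following. It assumes the combined data were saved as appended.dta, as in the code in #2; name_len is a hypothetical helper variable.

      ```stata
      * Assumes appended.dta exists; name_len is an illustrative variable name.
      use appended, clear
      * Show the storage type of FILE (for example str79).
      describe FILE
      * Length of each stored file name; strlen() counts bytes, as str# does.
      generate int name_len = strlen(FILE)
      summarize name_len
      ```

      The maximum reported by summarize is the length of the longest file name actually stored, which cannot exceed the width in the storage type shown by describe.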
