  • Question regarding the append command in Stata.

    Hi, I have a question about the force option of the append command in Stata. According to dappend.pdf, "force allows string variables to be appended to numeric variables and vice versa, resulting in missing values from the using dataset. If omitted, append issues an error message; if specified, append issues a warning message".
    I recently ran append with the "force" option, and it did not issue a warning message. I then ran the same command without the force option, but Stata suggested that I use the "force" option. Why is this the case? I had assumed that if "force" is specified and Stata does not issue a warning message, then "force" is not required.
    Any insights on this would be helpful. Thank you.

  • #2
    I have never observed the behavior you describe, and have not been able to replicate it on my system. I always get a warning message with -force- specified.
    The warning message looks like
    (note: variable varname was byte in the using data, but will be str# now)
    (or the appropriate data storage types as the case may be).

    Maybe you don't think of this as a warning message. Admittedly its tone doesn't feel like "caution here, something may be amiss."

    If you are really getting no message at all, that would be a bug. In that case, you should contact Stata Technical Support with an example data set and code that reproduces the problem so they can put it on their "to fix" list.
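
    A minimal sketch of the two cases, using made-up data (the variable name x, its values, and the tempfile name are assumptions for illustration). The first -append- should stop with an error, and the second should go through, leaving x missing in the appended observations:

    ```stata
    * Using dataset: x is a string (made-up example data)
    clear
    input str3 x
    "a"
    "b"
    end
    tempfile using_data
    save `using_data'

    * Master dataset: x is numeric
    clear
    input byte x
    1
    2
    end

    * Without force: the string vs numeric mismatch is an error
    capture noisily append using `using_data'
    display _rc

    * With force: append proceeds, and x is missing for the using observations
    append using `using_data', force
    list
    ```

    This is easiest to run from a do-file, so that the tempfile stays available for both -append- commands.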


    • #3
      In reply to Clyde Schechter (#2):
      Thank you for your reply. I have a follow-up question. If I did not use the "force" option and Stata did not give me a message suggesting that I use it, can I confirm that I should not get a message even when I do use the force option?


      • #4
        If, without -force-, you got no message about a variable being string in one of the data sets and numeric in the other, then no, you should not get that message when you do use -force-.

        Now, -append- does issue other messages. For example, it will tell you about changes in the storage size of variables. So a string variable that is, say, str9 in the master data set but is str12 in the using data set will provoke a message saying that that variable is now changed to str12 in the combined data set. Similarly an int in master and a float in using will lead to a message informing you that in the combined data set the type has changed from int to float. These changes are very different from what happens when -force- allows you to ignore a string vs numeric difference. These changes do not lose any information. (To be honest, I'm not sure why Stata even bothers to tell you about them.) By contrast, when -force- permits you to append a string and numeric variable together, the values from the using data set are lost.
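
        A small sketch of the harmless kind of type change, again with made-up data (the variable y and the tempfile name are assumptions). No -force- is needed here, and -describe- should show y stored as float in the combined data, with all values intact:

        ```stata
        * Using dataset: y is a float (made-up example data)
        clear
        input float y
        1.5
        2.5
        end
        tempfile using_data
        save `using_data'

        * Master dataset: y is an int
        clear
        input int y
        1
        2
        end

        * Both variables are numeric, so append needs no force option
        append using `using_data'

        * Inspect the storage type and the values in the combined data
        describe y
        list
        ```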

        All of that said, let me say what I usually say in discussions about -force- options and am surprised I didn't say in #2. The best way to handle -force- options is never to use them. The commands that have -force- options are commands that can result in loss of data. Generally we want to avoid loss of data. Of course, sometimes you want to get rid of data that you don't need to trim down the data set. And sometimes we are working with data sets that contain extraneous variables that are unnecessary for our purposes, so we don't care if those variables get messed up in the course of data management. That is what -force- options allow you to do. But I think a better approach to these situations is to explicitly remove the unwanted variables using -drop- or -keep- commands that weed them out. This makes the code more transparent: a -drop- command makes it obvious that certain variables are about to disappear. A -force- option in a command tells you that some data will be lost, but it doesn't tell you what data. This is a particularly serious problem with -append, force- because it does not actually remove a variable: it appears to retain the variable while dumping some of its values! The resulting data set contains an incorrect version of the original variable. That's just a mistake waiting to happen when somebody, unaware of the problem, subsequently tries to use that variable in calculations--and everything from that point on is just plain wrong.
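
        To illustrate the explicit approach, here is a sketch that drops the conflicting variable from both data sets before appending. The variables id and remark, their values, and the tempfile name are all made up for the example:

        ```stata
        * Using dataset: remark is a string here (made-up example data)
        clear
        input int id str10 remark
        3 "late"
        4 "ok"
        end
        * Explicitly discard the variable we do not need
        drop remark
        tempfile using_clean
        save `using_clean'

        * Master dataset: remark is numeric here
        clear
        input int id byte remark
        1 0
        2 1
        end
        drop remark

        * Nothing conflicts any more, so no force option is needed
        append using `using_clean'
        list
        ```

        Anyone reading this code can see exactly which variable is discarded, which is not true of the same -append- with -force-.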

        With respect to the -append- and -merge- commands, I recommend that unless you are combining two data sets that you are very familiar with, it is best to use Mark Chatfield's -precombine- command, available from SSC, that will alert you to problems that may arise from combining two data sets due to incompatibilities like the ones we are discussing here (and others, such as conflicted labeling). If -precombine- points up a string vs numeric incompatibility, then before going ahead, you can review the data sets and decide whether to remove that variable from both data sets, or use an appropriate Stata command to convert between string and numeric in one of the data sets before combining.
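
        As a sketch of the conversion route, assuming the incompatible variable is an identifier stored as a string of digits in the using data (the variable id, its values, and the tempfile name are made up), -destring- can bring it into line before combining:

        ```stata
        * One-time install of precombine from SSC (needs internet access);
        * see -help precombine- for its syntax, which is not shown here
        * ssc install precombine

        * Using dataset: id is stored as a string of digits (made-up example data)
        clear
        input str4 id
        "1001"
        "1002"
        end
        * Convert the string to numeric so that it matches the master data
        destring id, replace
        tempfile using_fixed
        save `using_fixed'

        * Master dataset: id is numeric
        clear
        input int id
        2001
        2002
        end

        * Types now agree, so no force option is needed and no values are lost
        append using `using_fixed'
        list
        ```

        If the string values were not purely numeric, -destring- would leave the variable unchanged unless told otherwise, so that case needs a decision about how to recode it first.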


        • #5
          In reply to Clyde Schechter (#4):
          Thank you for your detailed response! It makes sense.
