Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Converting string variable to double or vice-versa

    Hello!

    I'm trying to append two datasets with same variables but in different years. I have household level data and each household has a "code" that it is identified with. However, when I want to append both datasets, Stata tells me that "variable hhcode is str14 in master but double in using data. You could specify append's force option to ignore this string/numeric mismatch. The using variable would then be treated as if it contained ""."

    So, I tried using the force option but then Stata replaces part of the code by missing values "." and I need to keep the household code. How can I convert a string variable to a double or vice-versa?

    Thank you all for your help!


  • #2
    Candice:
    you should be better off with -destring-ing -hhcode- in master dataset before -append-ing.
    By default, -destring- use -double- type (see -help destring-).
    Kind regards,
    Carlo
    (Stata 19.0)

    Comment


    • #3
      Carlo:
      Thank you very much for your response. Destring wouldn't work with stata as it said that the variable contained non numeric values (which it didn't) but I managed to de-string the variable using encode.

      Comment


      • #4
        Candace:

        You said (slight editing)


        destring wouldn't work ... as it said that the variable contained non numeric values (which it didn't)

        If true, you've found a bug. If false, as I suspect, you are misunderstanding something. Very possibly with identifiers that are large integers, you are not specifying an appropriate format().

        Either way, evidence please.

        This is more serious than a small misunderstanding. The fact that encode works (meaning, your syntax runs without an error message) is absolutely no guarantee that it produces correct results. encode by itself will code "111" and "22" and "3" to 1 and 2 and 3 and they aren't even in the right order. What's more, the results will look right because you will see value labels based on the original strings that look fine, but underneath you have arbitrary garbage that is no use for matching with purely numeric cousins.
        Last edited by Nick Cox; 09 Mar 2017, 10:16.

        Comment


        • #5
          Nick:

          Thank you for your message. As a beginner with Stata, I am easily confused by this powerful tool. But, you are right, I need to be more rigorous and it is highly probable that I misunderstood something.

          The variable hhcode contains 14 integers per observation (each household is identified by a 14 number code and there are approximately 20,000 households per dataset). Would that be the reason why destring doesn't work?

          What I did was:
          "encode hhcode, gen(test)
          drop hhcode
          rename test hhcode"

          I was able to append my datasets like this without any error message from Stata. Did I make a mess of things?
          Last edited by Candice Yandam; 09 Mar 2017, 10:24.

          Comment


          • #6
            None of what you mention yet explains whydestring wouldn't oblige. But I didn't explain fully what would be evidence.

            First, you need to give the destring command you tried and the exact error message. Note that

            Code:
            destring mystrvar, format(%14.0f) 
            should suffice for 14 digit identifiers. As said, leaving out the format option might well cause a problem. I am betting on this one.

            Second, for any string variable at which destring stalls, you need to see which values could not be converted, say by

            Code:
            tab mystrvar if missing(real(mystrvar))


            Comment


            • #7
              Nick:

              What I coded and Stata's response was the following (hhcode being my string variable):

              Code:
              destring hhcode, replace
              hhcode: contains nonnumeric characters; no replace
              Now, I have tried your suggested code, namely:

              Code:
              destring hhcode, format(%14.0f)
              and Stata gives me the following error message:

              Code:
              option format() not allowed
               r(198)
              I hope that I am more thorough in my explanation, thanks for bearing with me!

              Comment


              • #8
                Sorry, this time it's me to blame, in essence in #6 confusing my own children (meaning commands I originally wrote) very badly. The format() option is entirely to do with tostring and that's irrelevant here.

                My second suggestion still stands.

                Comment


                • #9
                  Nick:

                  Those are pretty impressive children!

                  I tried your second suggestion:

                  Code:
                   tab hhcode if missing(real(hhcode))
                  and here is the output (with the help of another one of your children if I'm not mistaken, dataex):

                  Code:
                  * Example generated by -dataex-. To install: ssc install dataex
                  clear
                  input str14 hhcode
                  "1101013405401N"
                  "1101013405401N"
                  "1101013405401N"
                  "1402020203601N"
                  "1402020203601N"
                  "1402020203601N"
                  "1402020303701N"
                  "1402020303701N"
                  "1402020505101N"
                  "1402020505101N"
                  "1402020804101N"
                  "1402020804101N"
                  "1402020804101N"
                  "1402020806601N"
                  "1402020806601N"
                  "1402020806601N"
                  "1402021001601N"
                  "1402021001601N
                  end
                  I am not sure I fully understand the result of this line of code. There is a "N" at the end of each household code which isn't there if I check out the variable in the data browser for instance. As an example:

                  Code:
                  * Example generated by -dataex-. To install: ssc install dataex
                  clear
                  input str14 hhcode float person
                  "1101010206801" 1
                  "1101010206801" 2
                  "1101010206801" 3
                  "1101010208301" 1
                  "1101010209401" 1
                  "1101010209401" 2
                  "1101010209401" 3
                  "1101010209701" 1
                  "1101010209701" 2
                  "1101010209701" 3
                  "1101010209701" 4
                  "1101010209701" 5
                  "1101010209701" 6
                  "1101010210601" 1
                  end
                  I don't really know what I should do...

                  Thanks for the help!

                  Comment


                  • #10
                    thank you for the clear example; it appears that sometimes the hhcode which is a string ends in a letter (N in your examples); if the part of the hhcode in the string variable matches the numeric hhcode in the other data set, then you can use -destring- with the "ignore" option - see the help for -destring-

                    Comment


                    • #11
                      This type of problem occurs frequently when appending datasets. The best solution is to deal with the problem when the data is imported into Stata. The import excel command has an allstring option. The import delimited command has a stringcols(numlist|_all) option. Use these to force Stata to store all data to be appended using string variables. Once the data is appended, do a wholesale destring.

                      With respect to identifiers, they should remain string; there's no expectation of doing math with their value.

                      Comment


                      • #12
                        Thank you for your answers!

                        Rich:

                        The households don't match in both datasets. I am appending both datasets (2007 and 2013) with same informations about different households. The problem is that hhcode is string in one dataset and double in the other. Stata won't let me append them because of this difference. Is there a way to get around this?

                        Robert:

                        I got my data in .dta version directly, there is no excel file. Is there any way I could still use
                        Code:
                        destring
                        ?

                        Comment


                        • #13
                          Even if the households are not to be matched, you are still stuck with identifiers of different types in the datasets you need to append. Clearly, one dataset has some instances of characters in the identifier which means that hhid must remain string to avoid loosing information. The solution is to convert the one that's numeric to string. This is consistent with my comment that since you do not need to do arithmetics with identifiers, they should be stored as strings. You can use the tostring command to do this:

                          Code:
                          * Example generated by -dataex-. To install: ssc install dataex
                          clear
                          input double hhcode float person
                          1101010206801 1
                          1101010206801 2
                          1101010206801 3
                          1101010208301 1
                          1101010209401 1
                          1101010209401 2
                          end
                          
                          tostring hhcode, replace format("%14.0f")

                          Comment


                          • #14
                            Robert:

                            Now it seems that Stata won't let me string this variable. It says:
                            Code:
                             tostring hhcode, replace format("%14.0f")
                            hhcode has value label; no replace
                            Should I force it? Can I force it?

                            Thanks!!

                            Comment


                            • #15
                              It's a bit funny that an identifier variable would have value labels, actually. Could you paste a dataex example of the dataset that has this problem?

                              Comment

                              Working...
                              X