Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • the format of number was transformed to string wrongly when using import excel, how to solve this?

    The varaiable id is number in excel, after importing to stata, it becomes string variable, how to correct this?
    Here is the data sample of excel.
    https://xexx-my.sharepoint.com/:x:/g...xxGpw?e=WtnIVB

    Thanks so much!

  • #2
    We need to see a data example in Stata. Please follow 12.2 at https://www.statalist.org/forums/help#stata

    Comment


    • #3
      Originally posted by Nick Cox View Post
      We need to see a data example in Stata. Please follow 12.2 at https://www.statalist.org/forums/help#stata
      Thanks, Nick. I try to using dataex of Stata, however, the imported data is wrong shown as below. So I have to share the raw data.

      The link is shared uisng onedrive, you can see the data online after clicking it.
      Code:
      * Example generated by -dataex-. For more info, type help dataex
      clear
      input str18 id
      "1.10108195110e+17" 
      "1.00231199008e+17" 
      "1.30821196606e+17" 
      "4.22130197106e+17" 
      "15232419811015005X"
      "1.10108197310e+17" 
      "3.20211196510e+17" 
      "3.30124198903e+17" 
      "3.30622197612e+17" 
      "1.10223197701e+17" 
      "1.10102196704e+17" 
      "4.11322199207e+17" 
      "1.10111198708e+17" 
      "1.20104197607e+17" 
      "5.20102197301e+17" 
      "3.50582197406e+17" 
      "1.10101197202e+17" 
      "3.70203196610e+17" 
      "1.20106197709e+17" 
      "1.50102196406e+17" 
      "1.10112198808e+17" 
      "4.10105197201e+17" 
      "3.40322198309e+17" 
      "2.10106198111e+17" 
      "2.30103195009e+17" 
      "1.30825198902e+17" 
      "1.10108198606e+17" 
      "1.10108196208e+17" 
      "3.62232197301e+17" 
      "3.60724198202e+17" 
      "4.20624196203e+17" 
      "1.10105197608e+17" 
      "6.10112197907e+17" 
      "13243019450418144X"
      "3.71522198860e+17" 
      "6.32122196911e+17" 
      "6.10111197110e+17" 
      "11022319630831059X"
      "3.20103196802e+17" 
      "6.22825197207e+17" 
      "2.11382196805e+17" 
      "3.42426197609e+17" 
      "11010419630918165X"
      "2.30402198604e+17" 
      "1.10225195704e+17" 
      "4.20106196911e+17" 
      "1.10108196510e+17" 
      "4.20802196409e+17" 
      "3.40102196704e+17" 
      "4.10928198906e+17" 
      "5.13330196801e+17" 
      "3.40802197509e+17" 
      "1.10102198111e+17" 
      "1.10102196002e+17" 
      "3.10105196306e+17" 
      "1.52601198210e+17" 
      "1.30804198510e+17" 
      "1.10107196710e+17" 
      "6.12701198109e+17" 
      "1.20105198306e+17" 
      "4.12723197609e+17" 
      "4.10823198005e+17" 
      "4.30321198104e+17" 
      "3.70620197309e+17" 
      "1.32801197408e+17" 
      "3.30724196210e+17" 
      end

      Comment


      • #4
        destring is the tool of choice, but not here. First off, your identifier is 18 characters long. One kind of problem is clear: you have some identifiers like

        Code:
         15232419811015005X
        that can't be interpreted as string, but Stata got the others wrong and tried to think of them as numbers.

        I fear you've lost some final digits that can't be retrieved and need to import again, but this time put a prefix such as A in front of your identifiers. Then Stata will be less likely to mangle the identifiers.

        But, but, but: do you need the idenftifiers any way? Why not just create fresh identifiers 1 to whatever?

        Comment


        • #5
          Originally posted by Nick Cox View Post
          destring is the tool of choice, but not here. First off, your identifier is 18 characters long. One kind of problem is clear: you have some identifiers like

          Code:
          15232419811015005X
          that can't be interpreted as string, but Stata got the others wrong and tried to think of them as numbers.

          I fear you've lost some final digits that can't be retrieved and need to import again, but this time put a prefix such as A in front of your identifiers. Then Stata will be less likely to mangle the identifiers.

          But, but, but: do you need the idenftifiers any way? Why not just create fresh identifiers 1 to whatever?
          Thanks, Nick. I really need the identifiers because this is the ID card number of each individual , I need to use the identifier to match other information.

          I am wondering whether there existing some ways of not changing the original data information

          Comment


          • #6
            I don't advocate changing anything. A prefix is an attempt to fool the import. You can remove it if and after it works.

            Comment


            • #7
              Originally posted by Nick Cox View Post
              I don't advocate changing anything. A prefix is an attempt to fool the import. You can remove it if and after it works.
              Thanks, Nick! I am sorry I didn't understand how to use the prefix options when import.
              Code:
              . import excel "data.xlsx", sheet("Sheet1") firstrow prefix(A) clear
              option prefix() not allowed
              r(198);

              Comment


              • #8
                You have to do that in Excel before you read the data in Stata. I don't use Excel nearly enough to know how but i trust it's easy.

                Comment


                • #9
                  rather than playing with the data, use the "allstring" option in your Stata command; see
                  Code:
                  help import excel
                  although you will need to destring some variables after importing, I find this much the safest way to import data from excel

                  Comment


                  • #10
                    If you wish to edit in Excel beforehand, just select the column; right-click the mouse; select format cells; finally, choose the category you want to select.
                    Best regards,

                    Marcos

                    Comment


                    • #11
                      Originally posted by Rich Goldstein View Post
                      rather than playing with the data, use the "allstring" option in your Stata command; see
                      Code:
                      help import excel
                      although you will need to destring some variables after importing, I find this much the safest way to import data from excel
                      The allstring option cannot solve the problem.

                      Comment


                      • #12
                        Originally posted by Marcos Almeida View Post
                        If you wish to edit in Excel beforehand, just select the column; right-click the mouse; select format cells; finally, choose the category you want to select.
                        The format setting cannot solve the problem. Thanks!

                        Comment


                        • #13
                          I've often opted myself to read in data from spreadsheets as string and then use Stata's own commands and functions to sort out a mess, so on a pessimism Venn diagram I overlap mightily with Rich Goldstein.

                          The interesting detail here is that import is smart enough to note that terminal X makes string input advisable, but it still messes up those identifiers that are all numeric characters.

                          EDIT In a limited experiment I copied and pasted the data from Fred's source into MS Excel and most of the identifiers immediately displayed using E+17 although underneath they remained large integers, or appearing to be. So, where blame attaches I can't say.
                          Last edited by Nick Cox; 02 Jun 2021, 07:16.

                          Comment


                          • #14
                            Fred Lee

                            You have a serious problem in your Excel data.

                            Only numbers with 15 digits or fewer can be assured of being stored precisely as a 8-byte floating point number, which is the standard in Excel, and the best thatncan be done in Stata (with a storage type of double) as well.

                            Your id column contains a mixture of 18-character strings and 18-digit numeric values. Here are two of them, the first numeric and the second a string.
                            Code:
                            110108195110171000
                            15232419811015005X
                            In the example data, all of the numeric values are 18 digits long and end in "000". Note that the string however ends "05X" which suggests that perhaps some of the numeric values may also have had digits other than 0 in characters 16-18.

                            It appears that when your data was imported into Excel, Excel was not instructed to import the id as a string (which should always be used for "numbers" that are not meant for arithmetic operations) and instead, when Excel detected a number, did the best it could, which apparently was to import the leftmost 15 digits correctly and then (effectively) multiply it by 103 to get it to the right order of magnitude.

                            I expect if you review the original data that was imported into Excel, you will see that the ids that end in three digits do not all end in three zeros. If you do not correct this data, either your id will not match other information, or you will have problems with ids that are not distinct, or both.

                            Comment


                            • #15
                              Originally posted by William Lisowski View Post
                              Fred Lee

                              You have a serious problem in your Excel data.

                              Only numbers with 15 digits or fewer can be assured of being stored precisely as a 8-byte floating point number, which is the standard in Excel, and the best thatncan be done in Stata (with a storage type of double) as well.

                              Your id column contains a mixture of 18-character strings and 18-digit numeric values. Here are two of them, the first numeric and the second a string.
                              Code:
                              110108195110171000
                              15232419811015005X
                              In the example data, all of the numeric values are 18 digits long and end in "000". Note that the string however ends "05X" which suggests that perhaps some of the numeric values may also have had digits other than 0 in characters 16-18.

                              It appears that when your data was imported into Excel, Excel was not instructed to import the id as a string (which should always be used for "numbers" that are not meant for arithmetic operations) and instead, when Excel detected a number, did the best it could, which apparently was to import the leftmost 15 digits correctly and then (effectively) multiply it by 103 to get it to the right order of magnitude.

                              I expect if you review the original data that was imported into Excel, you will see that the ids that end in three digits do not all end in three zeros. If you do not correct this data, either your id will not match other information, or you will have problems with ids that are not distinct, or both.
                              Thanks William! Let me be more specific about the data. The id actually is the Chinese ID number, which is 18-digit long, ends with "X" or number. Usually, the last three digits shouldn't be zero. However, what I can get the most original is the data I have uploaded. It doesn't affect the matching information since the first-15 digit is always distinct. If the Excel can treat the ID as string, that will be perfect. However, It treat the 18-digit number as number, while 18-digit with "X" as string. Therefore, It cause such importing problem for Stata.
                              Last edited by Fred Lee; 02 Jun 2021, 11:02.

                              Comment

                              Working...
                              X