  • Fixing scientific-notation numeric variables

    Hello;

    I have a lengthy (10 to 12-digit) numeric identifier in my data.

    I've formatted it so that it no longer appears in scientific notation in the Data Editor, but when I run "tab variable", it reverts to scientific notation and collapses different values together (i.e., anything that would be shown as 1.13e+09 is now lumped together under that label, so I can't tell them apart).

    Any advice? I really need to see counts by specific ID number, and I really don't want to export to Excel and do a pivot table to figure it out!

    (By the way, the dataset is about 150,000 records, so "list" isn't a good option for me.)

    Thanks for your time!

  • #2
    Depending on what statistics you need reported, you could use tabstat, which has number-formatting options. For more, see http://www.stata.com/manuals13/rtabstat.pdf
    Code:
    . set obs 1
    obs was 0, now 1
    
    . 
    . generate var1 = 13900000000 in 1
    
    . 
    . tabstat var1
    
        variable |      mean
    -------------+----------
            var1 |  1.39e+10
    ------------------------
    
    . 
    . tabstat var1, format(%20.2f)
    
        variable |      mean
    -------------+----------
            var1 |      13900000256.00
    ------------------------
    
    .



    • #3
      You can use Ben Jann's fre command. You can install it by typing ssc install fre in Stata.
      ---------------------------------
      Maarten L. Buis
      University of Konstanz
      Department of history and sociology
      box 40
      78457 Konstanz
      Germany
      http://www.maartenbuis.nl
      ---------------------------------



      • #4
        I guess Jorrit's solution, although helpful for the problem it solves, has it the wrong way round: your identifiers define rows of the table, they are not being summarized.

        You don't give any exact detail, in terms of concrete data examples, exact code and exact results, contrary to explicit advice at http://www.statalist.org/forums/help#stata -- which is no doubt why Jorrit is decoding your problem one way and I am decoding another way.

        My guess is that the problem is like this. And I give two solutions.

        Code:
        . clear
        
        . input double longid  
        
                 longid
          1. 123456789012
          2. 987654321987
          3. 987654321987
          4. end
        
        . format longid %12.0f
        
        . tab longid
        
             longid |      Freq.     Percent        Cum.
        ------------+-----------------------------------
           1.23e+11 |          1       33.33       33.33
           9.88e+11 |          2       66.67      100.00
        ------------+-----------------------------------
              Total |          3      100.00
        
        . groups longid
        
          +-----------------------------------------+
          |       longid   Freq.   Percent     Cum. |
          |-----------------------------------------|
          | 123456789012       1     33.33    33.33 |
          | 987654321987       2     66.67   100.00 |
          +-----------------------------------------+
        
        . tostring longid, gen(slongid) usedisplayformat
        slongid generated as str12
        
        . tab slongid
        
              longid |      Freq.     Percent        Cum.
        -------------+-----------------------------------
        123456789012 |          1       33.33       33.33
        987654321987 |          2       66.67      100.00
        -------------+-----------------------------------
               Total |          3      100.00
        Solution 1. Use groups from SSC. You must install first, with

        Code:
        ssc inst groups
        help groups
        Solution 2. Use a string identifier.
        Last edited by Nick Cox; 22 Jun 2016, 01:59.



        • #5
          Nick's advice is correct in full. I write only to add that I think solution 2 is far, far better than solution 1. You have made it clear that this number is used only as an identifier, so it is not going to be used for computations. If you keep it numeric, it is likely only a matter of time before you carelessly copy the variable with some command and end up storing it as a float. A float holds only about 7 significant digits, not 12, so the low-order digits will be lost and your ids will now be incorrect, perhaps with different original ids having been rounded to the same value. There is no such danger with strings.
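
          A minimal sketch of that failure, assuming two hypothetical 12-digit ids that differ only in the last digit (the values and the name badid are illustrative, not from the original data). It relies on generate defaulting to float when no storage type is given, which holds unless set type has been changed:

          ```stata
          clear
          input double longid
          123456789012
          123456789013
          end

          * no storage type given, so generate stores the copy as a float
          generate badid = longid

          * show every digit of both variables
          format longid badid %15.0f
          list

          * the doubles are still distinct ...
          assert longid[1] != longid[2]
          * ... but the float copies have been rounded to one shared value
          assert badid[1] == badid[2]
          ```

          Both asserts should pass: the two ids stay distinct as doubles but become a single value after the float copy, which is exactly the silent collision a string identifier avoids.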



          • #6
            Thank you very much, everyone! This was very helpful, and problem solved.

            I'm sorry for not giving exact code or an example. I'll do better next time.



            • #7
              Hi, I need help with scientific notation as well.

              What code should I use? I did not understand the explanation. How can I get the full numbers to display? Thank you for your help.

              Here is my output

              . svy: tab riagendr sddsrvyr , missing count cellwidth(2) format(%8.0g)
              (running tabulate on estimation sample)

              Number of strata = 60 Number of obs = 40617
              Number of PSUs = 124 Population size = 1216874712
              Design df = 64

              -------------------------------------------------------
                        |           Data Release Number
                 Gender |       5       6       7       8   Total
              ----------+--------------------------------------------
                      1 | 1.5e+08 1.5e+08 1.5e+08 1.5e+08 6.0e+08
                      2 | 1.5e+08 1.5e+08 1.6e+08 1.6e+08 6.2e+08
                        |
                  Total | 3.0e+08 3.0e+08 3.1e+08 3.1e+08 1.2e+09
              -------------------------------------------------------
              Key: weighted counts

              Pearson:
              Uncorrected chi2(3) = 0.1013
              Design-based F(2.87, 183.89) = 0.0252 P = 0.9934



              • #8
                So I'm having the same problem. However, I can't convert the identifier to string format because the program I'm using, "geonear", gives me an error when I do that. Are there any other ways to preserve the full value?

                What's super weird is that I have a field "abi" that's fine. Then I try to create a copy with "gen abi2 = abi" and it truncates it. Even though it truncates it visually, it doesn't show them as duplicate values. However, the output from geonear does. This is all to say, I'm not positive the problem isn't with geonear, but it handles my original field fine, so I feel like if I can just successfully mirror it, I should be good to go.



                • #9
                  Look at
                  Code:
                  help precision
                  help format



                  • #10
                    Thanks for the tip. I read through the precision help but none of these is fixing the problem. Regardless of which format I use, "517996450" gets generated as "5.180e+08".

                    I've tried copying it as a string variable, but the minute I destring it or generate a new variable based on the string (in any format), it kicks back to scientific notation. I'm just confused because I had none of these problems when I imported my data in the first place. It seems like duplicating the variable should just preserve whatever formatting led to my primary variable being fine.
                    Last edited by Tim Jaquet; 20 Feb 2019, 13:03. Reason: typo



                    • #11
                      Tim,

                      When you write -gen abi2 = abi-, you are not creating an exact copy of abi. Your original abi is a double, but -gen-, by default, creates a float, and so you lose information. To get an exact copy you have to write -gen double abi2 = abi-, or, without the effort of even thinking about it, you could do -clonevar abi2 = abi-.
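
                      The difference between the three ways of copying can be sketched as follows, assuming a small made-up dataset (517996450 is the value quoted in #10; the second value and the names abi_float, abi_dbl, and abi_clone are hypothetical):

                      ```stata
                      clear
                      input double abi
                      517996450
                      517996451
                      end

                      * default storage type is float: the copy is rounded
                      generate abi_float = abi

                      * explicit double: the values are copied exactly
                      generate double abi_dbl = abi

                      * clonevar copies the values along with storage type and display format
                      clonevar abi_clone = abi

                      * compare storage types and display formats
                      describe abi abi_float abi_dbl abi_clone

                      * the double copy and the clone match the original ...
                      assert abi_dbl == abi
                      assert abi_clone == abi
                      * ... while the float copy does not
                      assert abi_float != abi
                      ```

                      A 9-digit integer is already beyond what a float can hold exactly, so the float copy differs from the original in both rows even though nothing looks wrong until the values are compared or matched.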



                      • #12
                        I should have tried clonevar. I did use the "gen double abi2" but it still gave me the scientific notation.

                        But I did find the answer was actually in Rich's post. The help format file is really confusing (to me) because it talks about justification and whatnot but "format abi2 %18.10g" did the trick.
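
                        For anyone landing here with the same confusion, a rough sketch of the point, assuming a one-observation example built from the value quoted in #10: the display format only controls how a stored number is shown, so it can be changed freely without touching the data. The %9.0g line is included only for contrast, and %12.0f is the fixed format used for longid in #4.

                        ```stata
                        clear
                        input double abi2
                        517996450
                        end

                        * a narrow general format, for contrast
                        format abi2 %9.0g
                        list

                        * the format reported to work in this post
                        format abi2 %18.10g
                        list

                        * a fixed format with no decimals, as used for longid in #4
                        format abi2 %12.0f
                        list

                        * changing the display format never changes the stored value
                        assert abi2 == 517996450
                        ```

                        A format can only reveal digits that are actually stored, which is why the variable also needs to be a double (or a clone of one) as described in #11.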



                        • #13
                          Thank you so much, Nick. I had the same issue, and this command solved the problem:
                          . tostring longid, gen(slongid) usedisplayformat
