Fixing scientific-notation numeric variables

Shannon Campbell

Join Date: Feb 2015

Posts: 26
#1

21 Jun 2016, 19:52

Hello;

I have a lengthy (10 to 12-digit) numeric identifier in my data.

I've formatted it so that it no longer appears in scientific notation in the Data Editor, but when I run "tab variable", the values are shown in scientific notation again and different values are collapsed together (i.e., anything that would be written as 1.13e+09 is shown as that one value, so I can't tell them apart).

Any advice? I really need to see counts by specific ID number, and I really don't want to export to Excel and do a pivot table to figure it out!

(By the way, the dataset is about 150,000 records, so "list" isn't a good option for me.)

Thanks for your time!

Jorrit Gosens

Join Date: Jan 2015
Posts: 1019
#2

22 Jun 2016, 00:55

Depending on which statistics you need reported, you could use tabstat, which has number formatting options. For more, see http://www.stata.com/manuals13/rtabstat.pdf

Code:

. set obs 1
obs was 0, now 1

. 
. generate var1 = 13900000000 in 1

. 
. tabstat var1

    variable |      mean
-------------+----------
        var1 |  1.39e+10
------------------------

. 
. tabstat var1, format(%20.2f)

    variable |      mean
-------------+----------
        var1 |      13900000256.00
------------------------

.


Maarten Buis

Join Date: Mar 2014

Posts: 3426
#3

22 Jun 2016, 01:17

You can use Ben Jann's fre command. You can install it by typing ssc install fre in Stata.

---------------------------------
Maarten L. Buis
University of Konstanz
Department of history and sociology
box 40
78457 Konstanz
Germany
http://www.maartenbuis.nl
---------------------------------

Nick Cox

Join Date: Mar 2014
Posts: 35429
#4

22 Jun 2016, 01:20

I guess Jorrit's solution, although helpful for the problem it solves, has it the wrong way round: your identifiers define rows of the table, they are not being summarized.

You don't give any exact detail, in terms of concrete data examples, exact code and exact results, contrary to explicit advice at http://www.statalist.org/forums/help#stata -- which is no doubt why Jorrit is decoding your problem one way and I am decoding another way.

My guess is that the problem is like this. And I give two solutions.

Code:

. clear

. input double longid  

         longid
  1. 123456789012
  2. 987654321987
  3. 987654321987
  4. end

. format longid %12.0f

. tab longid

     longid |      Freq.     Percent        Cum.
------------+-----------------------------------
   1.23e+11 |          1       33.33       33.33
   9.88e+11 |          2       66.67      100.00
------------+-----------------------------------
      Total |          3      100.00

. groups longid

  +-----------------------------------------+
  |       longid   Freq.   Percent     Cum. |
  |-----------------------------------------|
  | 123456789012       1     33.33    33.33 |
  | 987654321987       2     66.67   100.00 |
  +-----------------------------------------+

. tostring longid, gen(slongid) usedisplayformat
slongid generated as str12

. tab slongid

      longid |      Freq.     Percent        Cum.
-------------+-----------------------------------
123456789012 |          1       33.33       33.33
987654321987 |          2       66.67      100.00
-------------+-----------------------------------
       Total |          3      100.00

Solution 1. Use groups from SSC. You must install first, with

Code:

ssc inst groups
help groups

Solution 2. Use a string identifier.
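Solution 2 can also be put in a do-file. The sketch below reuses the example values from the log above; the final check with real() is an added assumption about how one might confirm that no digits were lost, not part of the original advice. Note that usedisplayformat relies on the display format being wide enough, so the format statement comes first.

```stata
clear
input double longid
123456789012
987654321987
987654321987
end

* Set a fixed format wide enough for every digit before converting
format longid %12.0f

* Build the string identifier from the displayed value
tostring longid, generate(slongid) usedisplayformat

* Frequencies by full identifier
tabulate slongid

* Sanity check: converting the string back recovers the original number
assert real(slongid) == longid
```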

Last edited by Nick Cox; 22 Jun 2016, 01:59.


Clyde Schechter

Join Date: Apr 2014

Posts: 29953
#5

22 Jun 2016, 05:35

Nick's full advice is correct. I write only to add that I think that solution 2 is far, far better than solution 1. You have made it clear that this number is used only as an identifier. So it is not going to be used for computations. If you keep it numeric, it is likely only a matter of time before you copy the variable carelessly with some command and end up truncating it to a float. That float will not have enough bytes to store 12 digits, so the low order digits will be lost and your ids will now be incorrect, perhaps with different original ids having been truncated to the same value. There is no such danger with strings.
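To make the risk concrete, here is a minimal sketch on made-up identifiers (the three values and the variable names id and id_float are illustrative assumptions, not taken from the original data). It copies a double identifier into a float and checks whether the copy still identifies observations uniquely.

```stata
* Toy data: three distinct 12-digit identifiers (hypothetical values)
clear
input double id
123456789012
123456789013
123456789014
end
format id %12.0f

* Deliberately store a copy as float to mimic a careless copy
generate float id_float = id
format id_float %12.0f

* The double original is still a unique identifier
isid id

* The float copy has lost the low-order digits, so distinct ids now coincide
list id id_float
duplicates report id_float
```

At this magnitude neighbouring float values are several thousand units apart, so all three identifiers should end up stored as the same number.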
Shannon Campbell

Join Date: Feb 2015

Posts: 26
#6

22 Jun 2016, 16:05

Thank you very much, everyone! This was very helpful, and problem solved.

I'm sorry for not giving exact code or an example. I'll do better next time.
Jenny Dias

Join Date: May 2017

Posts: 8
#7

06 Jul 2017, 12:56

Hi, I need help with scientific notation as well.

What code should I use? I did not understand the explanation. How can I get the full numbers displayed? Thank you for your help.

Here is my output

. svy: tab riagendr sddsrvyr, missing count cellwidth(2) format(%8.0g)
(running tabulate on estimation sample)

Number of strata =  60                Number of obs   =      40617
Number of PSUs   = 124                Population size = 1216874712
                                      Design df       =         64

-------------------------------------------------------
          |          Data Release Number
   Gender |       5        6        7        8    Total
----------+--------------------------------------------
        1 | 1.5e+08  1.5e+08  1.5e+08  1.5e+08  6.0e+08
        2 | 1.5e+08  1.5e+08  1.6e+08  1.6e+08  6.2e+08
          |
    Total | 3.0e+08  3.0e+08  3.1e+08  3.1e+08  1.2e+09
-------------------------------------------------------
  Key: weighted counts

  Pearson:
    Uncorrected   chi2(3)          =    0.1013
    Design-based  F(2.87, 183.89)  =    0.0252     P = 0.9934
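
One possible fix, sketched on Stata's nhanes2b example dataset rather than on the NHANES file used in this post: format(%8.0g) leaves room for only a few significant digits, so a wider fixed format should show the full weighted counts. The dataset, the variables race and diabetes, and the widths chosen are assumptions for illustration, and webuse needs internet access.

```stata
* Example survey data from Stata's documentation (already svyset)
webuse nhanes2b, clear

* Fixed format with commas, plus cells wide enough to hold it
svy: tabulate race diabetes, count format(%15.0fc) cellwidth(15)
```

For the command in this post, the analogous change would be to replace cellwidth(2) format(%8.0g) with something like cellwidth(15) format(%15.0fc).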
Tim Jaquet

Join Date: Feb 2019

Posts: 5
#8

20 Feb 2019, 12:33

So I'm having the same problem. However, I can't convert the identifier to string format because the program I'm using, geonear, gives me an error when I do that. Are there any other ways to preserve the full value?

What's super weird is that I have a field, abi, that's fine. Then I try to create a copy with "gen abi2 = abi" and it truncates it. Even though abi2 looks truncated, it doesn't show up as having duplicate values. However, the output from geonear does. This is all to say, I'm not positive the problem isn't with geonear, but it handles my original field fine, so I feel like if I can just successfully mirror it, I should be good to go.
Rich Goldstein

Join Date: Mar 2014

Posts: 4438
#9

20 Feb 2019, 12:38

Look at

Code:

help precision
help format
Tim Jaquet

Join Date: Feb 2019

Posts: 5
#10

20 Feb 2019, 13:00

Thanks for the tip. I read through the precision help, but none of it is fixing the problem. Regardless of which format I use, "517996450" gets generated as "5.180e+08".

I've tried copying it as a string variable, but the minute I destring it or generate a new variable based on the string (in any format), it kicks back to scientific notation. I'm just confused because I had none of these problems when I imported my data in the first place. It seems like duplicating the variable should just preserve whatever formatting led to my primary variable being fine.

Last edited by Tim Jaquet; 20 Feb 2019, 13:03. Reason: typo
Clyde Schechter

Join Date: Apr 2014

Posts: 29953
#11

20 Feb 2019, 13:58

Tim,

When you write -gen abi2 = abi-, you are not creating an exact copy of abi. Your original abi is a double, but -gen-, by default, creates a float, and so you lose information. To get an exact copy you have to write -gen double abi2 = abi-, or, without the effort of even thinking about it, you could do -clonevar abi2 = abi-.
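
A minimal sketch of the difference, assuming Stata's default of set type float. The first value is the one quoted in #10; the second value and the variable names abi_float, abi_double and abi_clone are made up for illustration.

```stata
clear
input double abi
517996450
517996451
end

* Default storage type is float unless -set type double- has been issued
generate abi_float = abi

* Two ways to get an exact copy
generate double abi_double = abi
clonevar abi_clone = abi

* Compare storage types and display formats of the copies
describe abi*

* The double copies match the original exactly
assert abi_double == abi
assert abi_clone == abi

* The float copy does not
count if abi_float != abi
```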
Tim Jaquet

Join Date: Feb 2019

Posts: 5
#12

20 Feb 2019, 15:54

I should have tried clonevar. I did use the "gen double abi2" but it still gave me the scientific notation.

But I did find the answer was actually in Rich's post. The help format file is really confusing (to me) because it talks about justification and whatnot but "format abi2 %18.10g" did the trick.
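
For reference, a small sketch of what the format change does. It assumes a one-observation toy dataset built from the value quoted in #10. The display format only affects how the number is shown, not what is stored, so switching formats back and forth is harmless.

```stata
clear
set obs 1
generate double abi2 = 517996450

* A narrow general format may fall back to scientific notation
format abi2 %9.0g
list abi2

* A wider general format with more significant digits
format abi2 %18.10g
list abi2

* A fixed format with no decimals is another option for integer ids
format abi2 %12.0f
list abi2

* The stored value is unchanged throughout
assert abi2 == 517996450
```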
Ramadhani Abdul

Join Date: May 2015

Posts: 6
#13

02 Sep 2019, 03:08

Thank you so much, Nick. I had the same issue, and this command solved the problem:
. tostring longid, gen(slongid) usedisplayformat