How to do calculations between different datatypes (double, long, float)?

Marco Kuehne

Join Date: Feb 2019
Posts: 32

16 Jul 2019, 09:25

When I add two numerics/numbers, the result is always rounded to some nonsense.

In my real data the storage type of pid/hid is "long". In the MWE below, -input- creates them as float, but it illustrates the same problem.

Code:

clear
input   pid hid
        513 60313    
        513 60313
        513 94    
        513 94    
        514 167    
        514 175
        515 175    
        516 175            
end

I would like to multiply the first identifier by the maximum of the second and add this in order to create a unique new identifier. But I cannot figure out what data format I should use.

Code:

gen pid_float = float(pid)
gen pid2 = pid_float*1000000
format pid2 %15.0f

gen hid_float = float(hid)
gen pid_hid = pid2+hid_float
format pid_hid %15.0f
list pid hid pid_float pid2 hid_float pid_hid

     +-----------------------------------------------------------+
     | pid     hid   pid_fl~t        pid2   hid_fl~t     pid_hid |
     |-----------------------------------------------------------|
  1. | 513   60313        513   513000000      60313   513060320 |
  2. | 513   60313        513   513000000      60313   513060320 |
  3. | 513      94        513   513000000         94   513000096 |
  4. | 513      94        513   513000000         94   513000096 |
  5. | 514     167        514   514000000        167   514000160 |
     |-----------------------------------------------------------|
  6. | 514     175        514   514000000        175   514000160 |
  7. | 515     175        515   515000000        175   515000160 |
  8. | 516     175        516   516000000        175   516000160 |
     +-----------------------------------------------------------+

How can I handle those numbers as normal numbers? Why is hid 60313 rounded to 60320? Thank you very much.

Last edited by Marco Kuehne; 16 Jul 2019, 09:55.

Tags: data types

Clyde Schechter

Join Date: Apr 2014

Posts: 29956
#2

16 Jul 2019, 10:26

A float data type does not have sufficiently many bits to store the required number of digits.

Code:

gen long new_id = 1000000*pid + hid

will do it as long as nothing is longer than 9 digits. If in your real data some of these new id's would have more than 9 digits, then use -double- instead of -long-.

See -help data_types- for more information about the number of digits that can be represented in each storage type.
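
A minimal sketch of the digit limit, using only -display- and the float() function. A float has a 24-bit significand, so integers are exact only up to 16,777,216; near 513 million, adjacent floats are 32 apart. The literal values are taken from the example in #1.

```stata
* Integers are exact in a float only up to 2^24 = 16,777,216
display %15.0f float(16777216)
display %15.0f float(16777217)

* Near 513 million the nearest floats are multiples of 32,
* which reproduces the rounded ids shown in the listing in #1
display %15.0f float(513060313)
display %15.0f float(513000094)

* The same arithmetic done in double (Stata's working precision) is exact
display %15.0f 1000000*513 + 60313
```

A long holds integers exactly up to 2,147,483,620, which is why the 9-digit ids fit and longer ones need -double-.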
Nick Cox

Join Date: Mar 2014

Posts: 35433
#3

16 Jul 2019, 10:31

You want exactly correct results from calculations with large integers. That's fine, but nothing in your code will ensure that.

* You nowhere specify that the variables created will be long or double. This is the key. If you have a big document, you need a big envelope.

* Note that the display format is irrelevant to how numbers are stored. All changing it does is affect how numbers are displayed. Format is not, in Stata, anything to do with variable or storage type. You don't say that you're imagining that it is the solution, but I mention it because I often encounter this idea.

* float() either leaves results unchanged or coarsens them. I think it's irrelevant to your example.
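
A small sketch of the points above, assuming a one-observation dataset with hypothetical variables x and y. Changing the display format of x leaves its storage type (float) and its stored value untouched; only declaring the type at creation keeps the digits.

```stata
clear
set obs 1

* No type given, so x is created as a float and the value is rounded on storage
gen x = 513060313
format x %15.0f

* describe reports storage type and display format separately
describe x

* Declaring the type up front stores the integer exactly
gen double y = 513060313
format y %15.0f
list x y

* The two stored values differ even though both use the same display format
assert x != y
```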

On your comment

I would like to multiply the first identifier by the maximum of the second and add this in order to create a unique new identifier. But I cannot figure out what data format I should use.

I note that the maximum of hid is 60313 and you certainly don't multiply by that. So I don't understand that.

But what you want is easy to get once you spell out the variable type you want:

Code:

clear
input pid hid
513 60313
513 60313
513 94
513 94
514 167
514 175
515 175
516 175
end

gen long pid2 = pid*1000000
gen long pid_hid = pid2 + hid
format pid_hid %15.0f
list , sepby(pid)

     +-------------------------------------+
     | pid     hid        pid2     pid_hid |
     |-------------------------------------|
  1. | 513   60313   513000000   513060313 |
  2. | 513   60313   513000000   513060313 |
  3. | 513      94   513000000   513000094 |
  4. | 513      94   513000000   513000094 |
     |-------------------------------------|
  5. | 514     167   514000000   514000167 |
  6. | 514     175   514000000   514000175 |
     |-------------------------------------|
  7. | 515     175   515000000   515000175 |
     |-------------------------------------|
  8. | 516     175   516000000   516000175 |
     +-------------------------------------+
Marco Kuehne

Join Date: Feb 2019

Posts: 32
#4

16 Jul 2019, 10:36

Thank you very much, Mr. Schechter. In the help for float() I found: "Although you may store your numeric variables as byte, int, long, float, or double, Stata converts all numbers to double before performing any calculations." So I couldn't understand why it was not working by default. I should have set the data type before doing the calculation. For my data it's

Code:

gen double pid_hid = 1000000*pid + hid

*Edit: Thank you as well, Mr. Cox. Now the format issues are clearer to me.

Last edited by Marco Kuehne; 16 Jul 2019, 10:40.
Clyde Schechter

Join Date: Apr 2014

Posts: 29956
#5

16 Jul 2019, 10:47

Yes, Stata converts all numbers to double before doing calculations, but when it stores the result in a new variable, it stores it as a float unless you specify otherwise. For most purposes this works well. But for creating many-digit ID variables, you need to specify a -long- or -double- storage type.
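
To make the distinction concrete, here is a sketch with hypothetical variables a, b and c built from the same expression. The calculation itself is done in double each time; only the storage type of the result differs.

```stata
clear
set obs 1

* Default storage type is float: the exact double result is rounded when stored
gen a = 1000000*513 + 60313

* Explicit storage types keep every digit
gen long   b = 1000000*513 + 60313
gen double c = 1000000*513 + 60313

format a b c %12.0f
list a b c

* a holds 513060320 (as in #1); b and c hold 513060313
assert a != b
assert b == c

* Optional: -set type double- changes the default for later -generate- calls
```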
Nick Cox

Join Date: Mar 2014

Posts: 35433
#6

16 Jul 2019, 11:06

There are many alternatives to very long integers as composite identifiers, including

string identifiers: usually you won't save much if anything on storage but e.g. "513 60313" is really easy to work with

using two or more identifiers jointly

using egen group(), label
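
The first two alternatives can be sketched as follows, using the example data from #1. The variable names sid and n_in_pair are hypothetical, and str20 is an assumed width that is ample for these values.

```stata
clear
input long pid long hid
513 60313
513 60313
513 94
513 94
514 167
514 175
515 175
516 175
end

* String identifier: no rounding issue, and easy to read
gen str20 sid = string(pid) + " " + string(hid)

* Using the two identifiers jointly, without building a composite at all
bysort pid hid: gen n_in_pair = _N

list, sepby(pid)
```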
Marco Kuehne

Join Date: Feb 2019

Posts: 32
#7

16 Jul 2019, 11:26

Mr. Cox, I'm very interested in converting these big identifiers into something easier, since I need, for example, to include them as factor variables in my regression models. I guess they are not recognized as such ("factor variables may not contain noninteger values") because of their data type.

So how exactly can I relabel my identifiers with egen? They look roughly like this:

Code:

clear
input var
2030603131
2030603131
9010000941
9010000941
16020001671
17010001750
17040001750
23010002300
23020658890
end

Nick Cox

Join Date: Mar 2014
Posts: 35433
#8

16 Jul 2019, 11:39

See my hint already in #6.

Code:

. egen wanted = group(hid pid), label

. l

     +-------------------------------------------------+
     | pid     hid        pid2     pid_hid      wanted |
     |-------------------------------------------------|
  1. | 513   60313   513000000   513060313   60313 513 |
  2. | 513   60313   513000000   513060313   60313 513 |
  3. | 513      94   513000000   513000094      94 513 |
  4. | 513      94   513000000   513000094      94 513 |
  5. | 514     167   514000000   514000167     167 514 |
     |-------------------------------------------------|
  6. | 514     175   514000000   514000175     175 514 |
  7. | 515     175   515000000   515000175     175 515 |
  8. | 516     175   516000000   516000175     175 516 |
     +-------------------------------------------------+

How many distinct factor values do you expect to be working with? What kind of models do you expect to be working with? Do they respect the structure of households and persons?

PS I'd just use the names people give themselves in replying. Formal titles aren't customary here. If you call males Mr what are you going to call females?


Clyde Schechter

Join Date: Apr 2014

Posts: 29956
#9

16 Jul 2019, 11:50

While your attempt to emulate -dataex- output to show data is well-meant, it is not an adequate substitute for the real thing. If you run the code you show in #7, -format var %11.0f- and then -list- or -browse- the data you will see that the numbers in the data set do not match what you are trying to -input-. It's the same problem all over again: -input- creates var as a float, which does not have the capacity to store 10 and 11 digit numbers. So you need to use the real -dataex- here.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input double var
2030603131
2030603131
9010000941
9010000941
16020001671
17010001750
17040001750
23010002300
23020658890
end

works properly. The code in #7 does not.

If you are running version 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

Turning now to the question you asked, to use these as factor variables you can -encode- them, provided there are at most 65,536 distinct values. But, if this is what you ultimately want to do, then you do not even need to create this intermediate variable in the first place. Just go back to your original pid and hid, and create -egen new_id = group(pid hid), label- and you will have a variable new_id, whose values are consecutive numbers from 1 to how many different combinations of pid and hid there are, all ready to use as a factor variable. The variable will be labeled with the values of pid and hid.
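
A sketch of that approach on the example data from #1. The outcome y is a made-up variable (random noise with an arbitrary seed) added only so that a model with i.new_id can run; it is not part of the original question.

```stata
clear
input long pid long hid
513 60313
513 60313
513 94
513 94
514 167
514 175
515 175
516 175
end

* Consecutive integer codes 1, 2, ... labeled with the pid and hid values
egen new_id = group(pid hid), label
tabulate new_id

* Hypothetical outcome, for illustration only
set seed 12345
gen y = rnormal()

* new_id is a small positive integer, so factor-variable notation accepts it
regress y i.new_id
```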

Added: Crossed with #8.
