  • How to do calculations between different datatypes (double, long, float)?

    When I add two numbers, the result is always rounded to some nonsense value.

    In my real data, pid and hid are stored as "long". The minimal working example below creates them as float, but it illustrates the same problem.

    Code:
    clear
    input   pid hid
            513 60313    
            513 60313
            513 94    
            513 94    
            514 167    
            514 175
            515 175    
            516 175            
    end
    I would like to multiply the first identifier by the maximum of the second and add this in order to create a unique new identifier. But I cannot figure out what data format I should use.

    Code:
    gen pid_float = float(pid)
    gen pid2 = pid_float*1000000
    format pid2 %15.0f
    
    gen hid_float = float(hid)
    gen pid_hid = pid2+hid_float
    format pid_hid %15.0f
    list pid hid pid_float pid2 hid_float pid_hid
    
         +-----------------------------------------------------------+
         | pid     hid   pid_fl~t        pid2   hid_fl~t     pid_hid |
         |-----------------------------------------------------------|
      1. | 513   60313        513   513000000      60313   513060320 |
      2. | 513   60313        513   513000000      60313   513060320 |
      3. | 513      94        513   513000000         94   513000096 |
      4. | 513      94        513   513000000         94   513000096 |
      5. | 514     167        514   514000000        167   514000160 |
         |-----------------------------------------------------------|
      6. | 514     175        514   514000000        175   514000160 |
      7. | 515     175        515   515000000        175   515000160 |
      8. | 516     175        516   516000000        175   516000160 |
         +-----------------------------------------------------------+

    How can I handle those numbers as normal numbers? Why is hid 60313 rounded to 60320? Thank you very much.
    Last edited by Marco Kuehne; 16 Jul 2019, 09:55.

  • #2
    A float data type does not have sufficiently many bits to store the required number of digits.

    Code:
    gen long new_id = 1000000*pid + hid
    will do it as long as no new ID is longer than 9 digits. If in your real data some of these new IDs would have more than 9 digits, then use -double- instead of -long-.

    See -help data_types- for more information about the number of digits that can be represented in each storage type.
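
    The rounding shown in #1 can be reproduced directly with float(). This is only a minimal sketch using the built-in float() function; the variable names id_long and id_double are made up for illustration. A float has a 24-bit significand, so it holds every integer exactly only up to 16,777,216, and between 2^28 and 2^29 the values it can hold are 32 apart. That is why 513060313 lands on 513060320.

    ```stata
    * float() rounds its argument to the nearest value a float can hold
    * this displays 513060320, as in the listing in #1
    display %15.0f float(513060313)
    * 2^24 is stored exactly
    display %15.0f float(16777216)
    * 2^24 + 1 is not; it is rounded to 16777216
    display %15.0f float(16777217)

    * long and double keep all nine digits
    clear
    set obs 1
    generate long   id_long   = 1000000*513 + 60313
    generate double id_double = 1000000*513 + 60313
    list
    ```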


    • #3
      You want exactly correct results from calculations with large integers. That's fine, but nothing in your code will ensure that.

      * You nowhere specify that the variables created will be long or double. This is the key. If you have a big document, you need a big envelope.

      * Note that the display format is irrelevant to how numbers are stored. All changing it does is affect how numbers are displayed. Format is not, in Stata, anything to do with variable or storage type. You don't say that you're imagining that it is the solution, but I mention it because I often encounter this idea.

      * float() either leaves results unchanged or coarsens them. I think it's irrelevant to your example.
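
      The difference between display format and storage type can be checked on a single value. This is a sketch that assumes the default storage type is still float (that is, -set type- has not been changed); x and y are illustrative names.

      ```stata
      clear
      set obs 1
      * no storage type given, so x is created as float
      generate x = 513060313
      * the format only changes how the stored value is shown
      format x %15.0f
      * describe reports storage type and display format in separate columns
      describe x
      * recasting afterwards cannot bring back digits that were never stored
      recast double x
      list x
      * asking for long when the variable is created keeps the value exact
      generate long y = 513060313
      format y %15.0f
      list x y
      ```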

      On your comment

      "I would like to multiply the first identifier by the maximum of the second and add this in order to create a unique new identifier. But I cannot figure out what data format I should use."

      I note that the maximum of hid is 60313 and you certainly don't multiply by that, so I don't understand that.

      But what you want is easy to get once you spell out the variable type you want:

      Code:
      clear
      input   pid hid
              513 60313    
              513 60313
              513 94    
              513 94    
              514 167    
              514 175
              515 175    
              516 175            
      end
      
      gen long pid2 = pid*1000000
      gen long pid_hid = pid2 + hid
      
      format pid_hid %15.0f
      
      list , sepby(pid)
      
           +-------------------------------------+
           | pid     hid        pid2     pid_hid |
           |-------------------------------------|
        1. | 513   60313   513000000   513060313 |
        2. | 513   60313   513000000   513060313 |
        3. | 513      94   513000000   513000094 |
        4. | 513      94   513000000   513000094 |
           |-------------------------------------|
        5. | 514     167   514000000   514000167 |
        6. | 514     175   514000000   514000175 |
           |-------------------------------------|
        7. | 515     175   515000000   515000175 |
           |-------------------------------------|
        8. | 516     175   516000000   516000175 |
           +-------------------------------------+


      • #4
        Thank you very much, Mr. Schechter. In the help for float() I found: "Although you may store your numeric variables as byte, int, long, float, or double, Stata converts all numbers to double before performing any calculations." So I couldn't understand why it wasn't working by default. I should have set the storage type before doing the calculation. For my data it's

        Code:
        gen double pid_hid = 1000000*pid + hid
        *Edit: Thank you as well, Mr. Cox. Now format issues are more clear to me.
        Last edited by Marco Kuehne; 16 Jul 2019, 10:40.


        • #5
          Yes, Stata converts all numbers to double before doing calculations, but when it stores the result in a new variable, it stores it as a float unless you specify otherwise. For most purposes this works well. But for creating many-digit ID variables, you need to specify a -long- or -double- storage type.
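
          The default can be inspected with c(type) and changed with -set type-. The following is a sketch, assuming the default has not already been changed in your session; the variable names are made up for illustration.

          ```stata
          clear
          input long(pid hid)
          513 60313
          514   167
          end

          * current default storage type for new variables (float unless changed)
          display c(type)

          * calculated in double, then rounded to float when stored
          generate id_default = 1000000*pid + hid
          * calculated in double and stored as long
          generate long id_long = 1000000*pid + hid

          * optionally make double the default for new variables
          set type double
          generate id_double = 1000000*pid + hid
          * restore the usual default
          set type float

          format id_default id_long id_double %15.0f
          list
          ```

          Changing the default with -set type double- costs storage for every new variable, so naming the type in the one -generate- that needs it is usually the lighter choice.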


          • #6
            There are many alternatives to very long integers as composite identifiers, including:

            * string identifiers: usually you won't save much, if anything, on storage, but e.g. "513 60313" is really easy to work with

            * using two or more identifiers jointly

            * using egen group(), label
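
            The first two alternatives might look like this. It is only a sketch on a shortened copy of the example data; id_str and first are illustrative names. Note that string() with no format argument uses a general format, so for identifiers with many digits an explicit format such as string(hid, "%12.0f") may be needed.

            ```stata
            clear
            input long(pid hid)
            513 60313
            513 60313
            513    94
            514   167
            514   175
            515   175
            end

            * string identifier: no precision limit and easy to read
            generate id_str = string(pid) + " " + string(hid)

            * the two identifiers used jointly: flag the first row of each pid-hid pair
            bysort pid hid: generate byte first = _n == 1
            * number of distinct pid-hid pairs
            count if first

            list, sepby(pid)
            ```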


            • #7
              Mr. Cox, I'm very interested in converting these big identifiers into something easier to handle, since I need, for example, to include them as factor variables in my regression models. I guess they are not recognized as such ("factor variables may not contain noninteger values") because of their data type.

              So how exactly can I relabel my identifiers with egen? They look roughly like this:

              Code:
              clear
              input    var
                      2030603131
                      2030603131
                      9010000941
                      9010000941
                      16020001671
                      17010001750
                      17040001750
                      23010002300
                      23020658890
              end


              • #8


                See my hint already in #6.

                Code:
                . egen wanted = group(hid pid), label
                
                . l
                
                     +-------------------------------------------------+
                     | pid     hid        pid2     pid_hid      wanted |
                     |-------------------------------------------------|
                  1. | 513   60313   513000000   513060313   60313 513 |
                  2. | 513   60313   513000000   513060313   60313 513 |
                  3. | 513      94   513000000   513000094      94 513 |
                  4. | 513      94   513000000   513000094      94 513 |
                  5. | 514     167   514000000   514000167     167 514 |
                     |-------------------------------------------------|
                  6. | 514     175   514000000   514000175     175 514 |
                  7. | 515     175   515000000   515000175     175 515 |
                  8. | 516     175   516000000   516000175     175 516 |
                     +-------------------------------------------------+
                How many distinct factor values do you expect to be working with? What kind of models do you expect to be working with? Do they respect the structure of households and persons?


                PS I'd just use the names people give themselves in replying. Formal titles aren't customary here. If you call males Mr what are you going to call females?


                • #9
                  While your attempt to emulate -dataex- output to show data is well-meant, it is not an adequate substitute for the real thing. If you run the code you show in #7, -format var %11.0f- and then -list- or -browse- the data you will see that the numbers in the data set do not match what you are trying to -input-. It's the same problem all over again: -input- creates var as a float, which does not have the capacity to store 10 and 11 digit numbers. So you need to use the real -dataex- here.

                  Code:
                  * Example generated by -dataex-. To install: ssc install dataex
                  clear
                  input double var
                   2030603131
                   2030603131
                   9010000941
                   9010000941
                  16020001671
                  17010001750
                  17040001750
                  23010002300
                  23020658890
                  end
                  works properly. The code in #7 does not.

                  If you are running version 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

                  Turning now to the question you asked, to use these as factor variables you can -encode- them, provided there are at most 65,536 distinct values. But, if this is what you ultimately want to do, then you do not even need to create this intermediate variable in the first place. Just go back to your original pid and hid, and create -egen new_id = group(pid hid), label- and you will have a variable new_id, whose values are consecutive numbers from 1 to how many different combinations of pid and hid there are, all ready to use as a factor variable. The variable will be labeled with the values of pid and hid.
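
                  As a sketch of that last step on the example data from #1: the outcome y below is hypothetical random noise, generated only so that the factor-variable syntax can be run, and is not part of the original question.

                  ```stata
                  clear
                  input long(pid hid)
                  513 60313
                  513 60313
                  513    94
                  513    94
                  514   167
                  514   175
                  515   175
                  516   175
                  end

                  * consecutive integer codes, labelled with the values of pid and hid
                  egen new_id = group(pid hid), label

                  * the codes are whole numbers, which is what factor variables require
                  assert new_id == floor(new_id)
                  summarize new_id

                  * hypothetical outcome, for illustration only
                  set seed 12345
                  generate y = rnormal()
                  regress y i.new_id
                  ```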

                  Added: Crossed with #8.
