  • Why would all values of all variables for a subset of observations be in scientific notation?

    Hi,

    I am working with survey data, and for a subset of observations the values of all variables are in scientific notation. These values should in fact be numbers of at most 3 digits, because they are codes for districts and other administrative units. Responses that should correspond to option numbers such as 1 or 2 are also in scientific notation. Is there an explanation other than "something funky must have happened when the data files were generated from the tablets that were used to collect the data"? Or perhaps something weird happened during the cleaning process?

    Thanks,
    Sampada

  • #2
    Welcome to Statalist, Sampada.

    You say that option numbers like 1 and 2 are displayed in scientific notation. The question is, what values are displayed? From what you've written, it's unclear whether the problem you describe is with correct values unexpectedly formatted in scientific notation, or with incorrect values being displayed.


    • #3
      Thanks William. First post, indeed!

      Thanks for phrasing my question much more clearly; I do want to know whether correct values are underneath those displayed in scientific notation, or whether incorrect values were entered to begin with. The display format of these problem variables is currently %12.0g. I have tried suggestions posted here on Statalist to get rid of scientific notation, but it looks like those only work when all of the values of the variable being formatted are in scientific notation (is that correct?). In my case, only a subset (roughly 4000 of 996588 observations) are.

      e.g. from the data below:

              dist         vcode        vdcmun          ward            EA     howner_sn    howner_num   own_res_sme        gender
      -4.3356e+307   1.9421e+212  -8.5299e-120  -1.26386e+75   1.8211e+199   1.4889e-232   1.47659e+62   1.2080e-197  -1.1068e+161
      -4.3375e+306   1.7251e-282  -6.42869e+41   1.1459e-276   1.2012e-291  -6.76261e+25   1.26115e+27   6.10353e+65  -1.7566e-107
      -2.9794e+306   1.6830e+130  -4.2295e+198   4.74271e+92   4.39579e+13  -5.51910e+26   1.8520e+126   2.7157e-250  -.0201969381
      -1.2680e+306  -7.55579e+66   7.9701e-136  -3.1848e-144   6.8994e-257  -1.11314e+88  -1.16378e-30  -4.61510e-99  -2.9358e-283
      -9.2791e+305  -1.9828e+262  -7.50810e+99  -4.6964e+207  -4.1908e+165   4.6949e-110   5.9342e+167   7.1971e+217   1.2025e-124
      -6.2335e+305  -2.0887e+245    8464139040   8.4725e+128  -5.4301e+130   3.8211e-140   5.3071e-168   9.9177e+287  -2.1799e-216
      -3.6119e+305   3.2762e+236   6.5516e-211   1.1306e+121  -2.3055e-229  -4.15166e+99  -3.6460e+168  -9.0282e+234  -1.1786e+271


      • #4
        Sorry about the format of the example above. Below is the same thing using dataex:


        Code:
        * Example generated by -dataex-. To install: ssc install dataex
        clear
        input double(dist howner_sn relation gender)
         -4.335575041195846e+307  1.488933705462832e-232  -2.5065141646214288e-21 -1.1068471647448357e+161
         -4.337527757790623e+306  -6.762605509927541e+25  -2.2475060438067758e-39 -1.7565806675978806e-107
         -2.979406064316877e+306 -5.5191021726804144e+26  -3.525658199704998e+277     -.020196938075742816
        -1.2680293741202633e+306 -1.1131419406119997e+88  -4.795558608898875e-292 -2.9358002898858985e-283
         -9.279108398573142e+305 4.6948849559082683e-110 -1.4805937565807182e+156   1.202509865812938e-124
        end
        label values dist dist
        label values relation relation
        label values gender gender


        • #5
          Something looks very wrong there. For example, gender values look nonsensical. There is unlikely to be a fix short of revisiting your data source(s).


          • #6
            Thanks Nick. I am afraid that is the case here. Thanks.


            • #7
              Thank you for presenting your sample data using dataex in post #4; it does help.

              "I do want to know whether correct values are underneath those displayed in scientific notation, or incorrect values were entered to begin with."
              Following up on Nick's response in post #5, if you are unfamiliar with reading scientific notation, the value of gender from your first sample observation
              Code:
              -1.1068471647448357e+161
              is read as
              Code:
              -1.1068471647448357 times 10^161
              Since gender is typically coded as a categorical variable taking a small number of non-negative values, this provides two indications (negative value, large number) that something is wrong with the data.
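
              The same two indications can be checked across variables rather than by eye. What follows is only a minimal sketch, run on the dataex example from post #4. It assumes that valid codes are non-negative integers below 1000; that upper bound is an assumption based on the "at most 3 digits" remark in post #1, not something confirmed for every variable, so adjust it to your codebook.

              ```stata
              * Sketch only. Assumption: valid codes are non-negative integers below 1000.
              * Counts, per variable, the nonmissing values that are negative,
              * too large, or not whole numbers.
              foreach v of varlist dist howner_sn relation gender {
                  quietly count if !missing(`v') & (`v' < 0 | `v' > 999 | `v' != floor(`v'))
                  display "`v': " r(N) " implausible values"
              }
              ```

              Because the test is applied to the stored values, the result does not depend on the display format; changing the format with the format command only changes how a value is shown, not what is stored.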

              If you were given a single dataset with 996588 observations, you have no recourse but to return to the source of the data. If instead you were given multiple datasets that you combined, you might first try to see if the problem is isolated to a single dataset.
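
              If the data did arrive as several files, one way to look for an isolated problem is to count implausible values file by file. This is a sketch under stated assumptions: the file names part1, part2 and part3 are hypothetical placeholders, and the rule for a valid gender code (a non-negative integer below 1000) is an assumption to be replaced by your codebook.

              ```stata
              * Sketch only. The file names are hypothetical placeholders.
              * Assumption: a valid gender code is a non-negative integer below 1000.
              foreach f in part1 part2 part3 {
                  use "`f'.dta", clear
                  quietly count if !missing(gender) & (gender < 0 | gender > 999 | gender != floor(gender))
                  display "`f': " r(N) " of " _N " observations look implausible"
              }
              ```

              A file that accounts for all of the implausible observations would be the one to take back to the data source.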


              • #8
                Thanks William. I have been lucky: my data source, which usually doesn't respond, has just replied to me with a data file containing sensible values. All sorted.
