
  • How to generate random values between 0 and 10 with mean 4.5

    Hi there,

    I need to generate a variable that has values between 0 and 10, with a mean of 4.5 and SD = 2.

    I tried using rnormal, but it also generates negative values.

    Code:
    g a = rnormal(4.5, 2)
    I then tried rpoisson but the range is from 0 to 12.

    Code:
    g a = rpoisson(4.5)
    Lastly, I tried runiformint; it keeps the range from 0 to 10, but the mean is 5 with SD = 3.

    Code:
    g a = runiformint(0, 10)
    Can anyone help?

  • #2
    It's odd to specify that you want a random number with specific moments and not specify the distribution it should come from. Likely you will want the Normal, but you would need to accept truncation beyond [0, 10] or else specify what should happen in the tails (censored values perhaps?). Either way, your final distribution will not have the same SD and may not have exactly the same mean. What exactly are you trying to do?
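
As a rough illustration of that last point, the sketch below censors draws from rnormal(4.5, 2) at 0 and 10 and summarizes the result. The sample size and seed are arbitrary choices, not from the thread. Expect the reported mean and SD to be close to, but not exactly, 4.5 and 2.

```stata
clear
set obs 10000
set seed 2803

* draw from N(4.5, 2), then censor: values below 0 become 0, values above 10 become 10
gen double a = min(max(rnormal(4.5, 2), 0), 10)

summarize a
```

Censoring piles the out-of-range draws up at 0 and 10, whereas truncation would discard or redraw them; either choice changes the moments.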


    • #3
      Leonardo Guizzetti makes excellent points. But it’s possible that a beta distribution might help. That said, if your distribution must be discrete that should be the first desideratum. It's hard to work out what the criteria are if rnormal(), rpoisson() and runiformint() are being tried, as they are qualitatively as well as quantitatively different.


      • #4
        Thanks both. I am trying to simulate/create a variable that captures a Likert scale with values from 0 to 10 which is also normally distributed.

        I am trying to recreate values observed in the real dataset for simulation. Can you advise how this could be achieved?


        • #5
          The criteria are contradictory.

          The query is like asking for a cat that is also a dog, in this sense: Any distribution for an integer scale such as you specify is inherently discrete and bounded. It can't also be a normal distribution, as a normal distribution is unbounded and continuous. What is possible, but much looser, is some idea of being as you state and also approximately symmetric, specifically approximately bell-shaped.

          If the possible values are 0(1)10 then a mean of 4.5 already implies slight skewness. A binomial with range 0 to 10 and mean close to 5 would be close to normal in shape but not normal in any strict sense. It would have an SD of

          Code:
          . di sqrt(10 * 0.45 * 0.55)
          1.5732133
          which is distinctly less than 2.

          Here is a token simulation:

          Code:
          . clear 
          
          . set obs 10000 
          Number of observations (_N) was 0, now 10,000.
          
          . 
          . set seed 2803 
          
          . 
          . gen wanted = rbinomial(10, 0.45)
          
          . 
          . su wanted 
          
              Variable |        Obs        Mean    Std. dev.       Min        Max
          -------------+---------------------------------------------------------
                wanted |     10,000      4.4963    1.552942          0         10
          
          . 
          . tab wanted 
          
               wanted |      Freq.     Percent        Cum.
          ------------+-----------------------------------
                    0 |         24        0.24        0.24
                    1 |        202        2.02        2.26
                    2 |        740        7.40        9.66
                    3 |      1,660       16.60       26.26
                    4 |      2,420       24.20       50.46
                    5 |      2,396       23.96       74.42
                    6 |      1,553       15.53       89.95
                    7 |        756        7.56       97.51
                    8 |        214        2.14       99.65
                    9 |         31        0.31       99.96
                   10 |          4        0.04      100.00
          ------------+-----------------------------------
                Total |     10,000      100.00

          To get closer with a simulation, you need something more complicated. I don't have simple suggestions on how to get it, but there could well be smarter ideas from someone else.

          But what is the purpose here? It sounds as if you have data already, which is where your mean and SD come from, and want to get a handle on variability, in which case bootstrapping might be a much better answer.


          • #6
            Hi Nick Cox, thanks for your reply and the worked examples. I plan to run simulations to estimate the sample size for a study where this Likert scale is a predictor in the model; I wanted its values to be close to the ones observed in the pilot.


            • #7
              That helps; thanks. Mata has an rdiscrete() function for fully specified distributions. Here is a toy example.

              Code:
              . clear
              
              . set obs 1000
              Number of observations (_N) was 0, now 1,000.
              
              . gen wanted = .
              (1,000 missing values generated)
              
              . mata : st_store(., "wanted", rdiscrete(1000, 1, (0.1, 0.2, 0.3, 0.2, 0.1,
              >  0.1)))
              
              . tab wanted
              
                   wanted |      Freq.     Percent        Cum.
              ------------+-----------------------------------
                        1 |         93        9.30        9.30
                        2 |        187       18.70       28.00
                        3 |        303       30.30       58.30
                        4 |        198       19.80       78.10
                        5 |        116       11.60       89.70
                        6 |        103       10.30      100.00
              ------------+-----------------------------------
                    Total |      1,000      100.00
              
              . replace wanted = wanted - 1
              (1,000 real changes made)
              
              . tab wanted
              
                   wanted |      Freq.     Percent        Cum.
              ------------+-----------------------------------
                        0 |         93        9.30        9.30
                        1 |        187       18.70       28.00
                        2 |        303       30.30       58.30
                        3 |        198       19.80       78.10
                        4 |        116       11.60       89.70
                        5 |        103       10.30      100.00
              ------------+-----------------------------------
                    Total |      1,000      100.00
              In your case, you'd need to specify 11 probabilities, not 6, as I guess will be clear.
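
For the 0 to 10 scale, one way to get the 11 probabilities is sketched below. It assumes, purely for illustration, that the probabilities are proportional to a normal density with mean 4.5 and SD 2 evaluated at each integer; in practice the pilot proportions would be the more natural input.

```stata
clear
set obs 1000
set seed 2803
gen wanted = .

mata:
// assumption: probabilities proportional to a normal(4.5, 2) density at 0, 1, ..., 10
p = normalden(0..10, 4.5, 2)
p = p :/ sum(p)                      // rescale so the 11 probabilities sum to 1
// rdiscrete() returns categories 1, ..., 11, so subtract 1 to get 0, ..., 10
st_store(., "wanted", rdiscrete(st_nobs(), 1, p) :- 1)
end

summarize wanted
tab wanted
```

Because the scale is cut off at 0 and 10, the resulting mean and SD should be near 4.5 and 2 but will not match them exactly.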


              • #8
                That's great, thanks Nick Cox

                Would this approach also work for age? I am only interested in adults aged 18 to 65


                • #9
                  You'd need to specify 48 probabilities, but it could be done. How well it would work I can't predict. Depends in part on how far you need exactly the same distribution. Real distributions are lumpy, and age data can suffer from heaping, so that's another area of difficulty.


                  • #10
                    Nick's given some excellent general advice. I would find it very unwieldy to have to specify (and justify) discrete probabilities for each age category. The point of simulating data is to capture the essential elements of the population you are trying to describe, and it's easy to get bogged down in the finer details. The activity you have described in #8 does not connect to your initial question. I'll assume it's the ages that you are most interested in.

                    It would be a sensible starting point that your age distribution is described as a (censored) normal distribution with whatever mean and SD you observe with your dataset. Any data outside of 18-65 years can be ignored and replaced with a valid age inside the range. This would be reasonable if you were sampling from a single population of those adults. Some finer points that may be worth considering follow.
                    • age may be rounded to the nearest integer. This might reflect how data are originally recorded. You can investigate if this matters for your purposes.
                    • "lumpiness", as Nick has described, can be an issue. This can be described as a mixture population, where you randomly sample from one of two or more different distributions. Again, you'll need to investigate if that's relevant for your needs.


                    • #11
                      Thinking about it more: I would recommend taking the empirical probabilities and then smoothing them, and if necessary rescaling to sum to 1.
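
A minimal sketch of that idea in Mata follows. The counts are made up for illustration (they are not from the pilot), and the 1/4, 1/2, 1/4 moving average is just one simple smoother among many.

```stata
mata:
// hypothetical pilot frequencies for scores 0, 1, ..., 10
counts = (2, 5, 9, 14, 20, 19, 13, 9, 5, 3, 1)'
p = counts :/ sum(counts)            // empirical probabilities

// 3-point moving average; at either end the edge value is reused
k = rows(p)
s = J(k, 1, 0)
for (i = 1; i <= k; i++) {
    lo = max((i - 1, 1))
    hi = min((i + 1, k))
    s[i] = 0.25 * p[lo] + 0.5 * p[i] + 0.25 * p[hi]
}
s = s :/ sum(s)                      // rescale to sum to 1

(p, s)                               // raw and smoothed probabilities side by side
end
```

The smoothed vector s could then be passed to rdiscrete() in the same way as the probabilities in #7.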


                      • #12
                        The full context is emerging only slowly, but the implication seems to be that there are several predictors -- in which case I would underline that matching each marginal distribution for each predictor won't reproduce their joint distribution unless -- as seems unlikely -- the predictors are independent.


                        • #13
                          Thanks both for the considerations.

                          Leonardo Guizzetti - I like your suggestion that "any data outside of 18-65 years can be ignored and replaced with a valid age inside the range". How can this be achieved in Stata; would I need to identify generated values outside the range, set them to missing and sample again?


                          • #14
                            Originally posted by Jen Ward
                            I plan to run simulations to estimate the sample size for a study where this Likert scale is a predictor in the model; I wanted its values to be close to the ones observed in the pilot.
                            Unless your pilot study is really tiny so that not all of the available scores appear in the dataset, then for this purpose I'd go with Nick's suggestion in #5 of randomly sampling the data in-hand, that is, use the empirical distribution of the questionnaire item's ordered-categorical response.

                            And if you're including other respondent characteristics as predictors, e.g., respondent's age, then I'd sample the predictors rowwise for the reason Nick implied in #12.
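
Rowwise sampling from the data in hand could look something like the sketch below. The ten pilot rows, the variable names likert and age, and the simulated sample size of 200 are all hypothetical stand-ins for the real pilot dataset.

```stata
clear
set seed 2803

* hypothetical pilot data: one row per respondent
input likert age
3 24
4 31
5 45
6 52
4 38
2 19
7 60
5 41
4 29
5 35
end

local nsim = 200                     // hypothetical size of one simulated sample
local copies = ceil(`nsim' / _N)

* bsample cannot draw more rows than the data hold, so replicate the rows first;
* every pilot row remains equally likely to be drawn
expand `copies'
bsample `nsim'

summarize likert age
```

Each drawn row keeps its own likert and age values together, so whatever association the pilot shows between the predictors carries over to the simulated sample.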


                            • #15
                              Originally posted by Jen Ward
                              Thanks both for the considerations.

                              Leonardo Guizzetti - I like your suggestion that "any data outside of 18-65 years can be ignored and replaced with a valid age inside the range". How can this be achieved in Stata; would I need to identify generated values outside the range, set them to missing and sample again?
                              Yes, that's one way. Another way is to generate two variables with the same distribution, one that you keep, and the other that you take from in the event that values in the first are out of range. Then drop the second variable.

                              In light of this though, go with Nick's suggestion first.
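
For reference, the first way (identify out-of-range values and sample again) might be sketched as follows. The mean of 40 and SD of 12 are hypothetical placeholders for the pilot values, and rounding to whole years is an assumption about how age is recorded.

```stata
clear
set obs 1000
set seed 2803

* hypothetical mean and SD; substitute the values observed in the pilot
local m = 40
local s = 12

gen age = round(rnormal(`m', `s'))

* redraw only the out-of-range values, repeating until none remain
quietly count if !inrange(age, 18, 65)
while r(N) > 0 {
    quietly replace age = round(rnormal(`m', `s')) if !inrange(age, 18, 65)
    quietly count if !inrange(age, 18, 65)
}

summarize age
```

As noted in #2, the mean and SD after restricting the range will not be exactly the ones fed to rnormal().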
