Dealing with zeros when calculating quantiles

Anna Johnston

Join Date: Jun 2018

Posts: 6
#1

11 Jul 2018, 03:27

Hello,
I am creating a new variable for continuous distance data. I want to categorise the data into 4 quantiles.

(variable name: dist_bt_fi, new variable name: dist_bt_fi_4)

My variable has 133 zeros and ranges from 0 to 9300, with most values in the 1000s.
When I run the following:

sort dist_bt_fi
xtile dist_bt_fi_4 = dist_bt_fi, nq(4)
tab dist_bt_fi_4
tabstat dist_bt_fi, stat(n mean min max sd p50) by(dist_bt_fi_4)

The minimum and maximum values reported for each quantile are incorrect (by a long shot! The largest value reported is 54, although the variable goes up to 9300).

dist_bt_fi_4 |  min   max
-------------+-----------
           1 |    1     2
           2 |    3     3
           3 |    4    23
           4 |   24    54
-------------+-----------

I cannot tell where these numbers are being generated from, but I've heard it might be to do with having lots of zeros in the dataset?

Any thoughts would be greatly appreciated,
Anna

Last edited by Anna Johnston; 11 Jul 2018, 03:45.
Nick Cox

Join Date: Mar 2014

Posts: 35405
#2

11 Jul 2018, 03:38

You seek quantile-based bins (why?). Calling those bins quantiles is common, but that is not the historical meaning of quantiles.

That said, values that are equal to each other must go in the same bin. That necessarily implies that all your zeros belong together.
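The point about ties can be checked on a toy example. This is a minimal sketch on made-up data (the variable names dist and dist_4 and the values are assumptions, not the poster's dataset): when more than a quarter of the observations share one value, xtile cannot produce four equally sized bins, and some bin numbers may go unused.

```stata
* Toy data, not the poster's: 10 observations, 6 of them zero
clear
input dist
0
0
0
0
0
0
100
200
300
400
end

* Ask for 4 quantile-based bins
xtile dist_4 = dist, nq(4)

* All zeros share a single bin, so the bin sizes are unequal
tabulate dist_4
tabstat dist, statistics(n min max) by(dist_4)
```

Running tabulate on the result is a quick way to see how uneven the bins are before relying on them.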

I can't tell you why you see a maximum of 54, not 9300, as you don't give us anything reproducible here. Why not show us the results of

Code:

contract dist_bt_fi
dataex

See also FAQ Advice #12.

That said, your variable doesn't look like a good candidate for this treatment. Perhaps the zeros and the positives mean something quite different, but even then the zeros shouldn't be mixed up with any other values. It may be that some transformation such as square root, cube root or neglog will help.
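The transformations mentioned above could be sketched as follows. This assumes dist_bt_fi is already a numeric, non-negative variable; the new variable names are hypothetical, and neglog is taken here to mean sign(x) * ln(1 + |x|), which is defined at zero and reduces to ln(1 + x) for non-negative values.

```stata
* Assumes dist_bt_fi is numeric and non-negative; new names are illustrative

* Square root: defined at zero, pulls in the long right tail
generate dist_sqrt = sqrt(dist_bt_fi)

* Cube root: the sign()/abs() form also works for negative values
generate dist_cuberoot = sign(dist_bt_fi) * abs(dist_bt_fi)^(1/3)

* neglog: sign(x) * ln(1 + |x|), which maps 0 to 0
generate dist_neglog = sign(dist_bt_fi) * ln(1 + abs(dist_bt_fi))

* Compare the distributions before choosing one
summarize dist_bt_fi dist_sqrt dist_cuberoot dist_neglog, detail
```

None of these removes the spike at zero; they only compress the positive values, so whether any of them helps depends on what the zeros mean.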

By an interesting coincidence I came across this reference earlier today: https://bmcmedresmethodol.biomedcent...471-2288-12-21
Anna Johnston

Join Date: Jun 2018

Posts: 6
#3

11 Jul 2018, 05:07

I apologise,
I am new to Stata and new to this forum!

Interestingly, I have just discovered why it was not working!

After converting the distance variable from string to numeric with real() rather than encode, the code now runs correctly and produces the correct output.

I am basing this categorisation on the methodology of the only other paper looking at this specific association, which may not be ideal but was the most appropriate method I could find to use!

Thank you for the link to the paper,
I will give it a read!
William Lisowski

Join Date: Dec 2014

Posts: 10150
#4

11 Jul 2018, 08:18

Anna Johnston, welcome to Statalist, and to Stata.

I'm glad you solved your problem using the real() function. However, you apparently didn't find the destring command, which is designed for what you attempted to do with the encode command. Perhaps that's because the output of help encode lists real() first and destring second as alternatives.
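The difference between the three approaches can be seen in a minimal sketch on made-up strings (the variable names and values are assumptions, not the poster's data). encode assigns integer codes 1, 2, ... to the distinct strings in sorted order and attaches the original text as value labels, so the stored numbers are not the distances. A maximum of 54 in the original output would be consistent with encode having found 54 distinct strings, though the thread does not confirm that.

```stata
* Toy string variable holding numbers as text (made-up values)
clear
input str4 dist_str
"0"
"9300"
"54"
"1000"
end

* encode: stores codes 1..4 with value labels, not the numeric values
encode dist_str, generate(dist_encoded)

* real(): converts each string to the number it represents
generate dist_real = real(dist_str)

* destring: the command intended for this conversion
destring dist_str, generate(dist_num)

* nolabel shows the underlying codes stored by encode
list, nolabel
```

Listing with the nolabel option is a quick check whenever a converted variable shows a suspiciously small range.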

Because you're new to Stata, let me offer the following advice. When I began using Stata in a serious way, I started - as others here did - by reading my way through the Getting Started with Stata manual relevant to my setup. Chapter 18 then gives suggested further reading, much of which is in the Stata User's Guide, and I worked my way through much of that reading as well. All of these manuals are included as PDFs in the Stata installation (since version 11) and are accessible from within Stata - for example, through Stata's Help menu. The objective in doing this was not so much to master Stata as to be sure I'd become familiar with a wide variety of important basic techniques, so that when the time came that I needed them, I might recall their existence, if not the full syntax, and know how to find out more about them in the help files and manual.

Stata supplies exceptionally good documentation that amply repays the time spent studying it - there's just a lot of it. The path I followed surfaces the things you need to know to get started in a hurry and to work effectively.

And because you're new to Statalist, let me offer this advice. Please take a few moments to review the Statalist FAQ linked to from the top of the page, as well as from the Advice on Posting link on the page you used to create your post. Note especially sections 9-12 on how to best pose your question. You will have noticed that the output from tabstat is not particularly well formatted in your post. The FAQ will advise on formatting using CODE blocks, and on other matters as well that will ensure that future posts are as helpful as possible.

The more you help others understand your problem, the more likely others are to be able to help you solve your problem.

Good luck with your work!