Generating a ratio variable, when observations contains several counts of interest.

Erik Mikael Johnson

Join Date: Mar 2022

Posts: 22
#1

Generating a ratio variable, when observations contains several counts of interest.

18 Mar 2022, 07:49

Hi,

I'm currently building my data base for my master thesis.

My problem concerns the generation of the dependent variable, which I want to indicate number of reported crime by week.

Because of how we in Sweden report crime, the reported crime variable ranges from 1 to 135, meaning one observation can contain 135 counts of a specific crime type.
The "week" variable I simply generated by the "week()" command. As such, it contains frequencies of observations, e.g., week 1 = 5000.
As you can imagine I cannot simply divide the variables of interest by eachother to acquire the desired variable.

I've searched in this forum and in general but I don't seem to be abel to find a solution to my issue, i.e., how to generate the ratio-variable number of crimes by week.
Tags: None
Jared Greathouse

Join Date: Sep 2021

Posts: 2170
#2

18 Mar 2022, 07:52

Hey Erik. Can you show me what your data look like by using Stata's dataex command?
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35219
#3

18 Mar 2022, 07:57

week() is a function, not a command. More importantly, it sounds as if you have mapped a daily date to Stata weeks, which is not a good idea unless your weeks obey Stata rules, for example, that there can never be 53 weeks in a year. It's better to use daily dates and declare delta(7) to tsset or xtset.

That said, I can't easily follow your question, except for guessing that egen and its row*() functions may be what you need.

What you could helpfully do is give a hypothetical data example based on just a few variables to show the flavour of your problem.
Comment
Erik Mikael Johnson

Join Date: Mar 2022

Posts: 22
#4

18 Mar 2022, 11:28

Hi guys and thanx for responding so fast.

Here is an example using dataex.
It does not show in the example that "reported crime" can have a value of 135, you just gonna have to take my word for it as my data set has about 300 000 observations.

The "date of report" variabel is a string which is why the numbers corresponding do not provide a date. However, when I used the functions year, month and day on it, those variables corresponded perfectly to the levels of the "date of report" variable and it looked like the week function did the same (for instance it has 52 levels), ofcourse I might have missed something. Anyhow, it goes from 1jan2018 to 23feb2022. Is there anyway I can check it dosen't violate the rules of stata, Nick?

----------------------- copy starting from the next line -----------------------

clear
long "Reported Crime" float("week" "rc_week") int "date of report" float("year" "day")
1 13 .07692308 21635 2019 27
1 20 .05 21683 2019 14
1 4 .25 21574 2019 25
1 49 .020408163 21886 2019 3
1 6 .16666667 21222 2018 7
1 20 .05 21683 2019 14
1 51 .019607844 22636 2021 22
2 32 .0625 22503 2021 11
1 9 .11111111 21607 2019 27
1 21 .04761905 22424 2021 24
1 25 .04 21358 2018 23
1 35 .02857143 22154 2020 27
1 48 .020833334 21520 2018 2
1 23 .04347826 21707 2019 7
1 5 .2 21584 2019 4
1 32 .03125 21773 2019 12
1 20 .05 21319 2018 15
1 1 1 21550 2019 1
1 27 .037037037 21732 2019 2
1 32 .03125 21773 2019 12
end
format %td Inskrivningsdatum
------------------ copy up to and including the previous line ------------------

The "rc_week" is my first crude attempt at creating the ratio-varaible ( third colum), and as you can see it's a function of the values preceding, which when I think about it make sense....maybe I just had a bit of a brain-fart?

Basically, what I'm after is this, my independet variable is dichotomous and the first condition is the intervention period of security guards patroling an area and the second is the baseline period for the same area. With a correctly coded DV as outlined this independet t-test would compare the conditions by crime per week, in a sense treating the weeks as observations in every condition.
e.g., week 1 = 5.2 in condition 1 and 8.6 in condition 2

Am I going about this the right way ?
Comment
Erik Mikael Johnson

Join Date: Mar 2022

Posts: 22
#5

18 Mar 2022, 13:09

sorry about the formatting of the numbers I do not now how to get it "clean"
Comment
William Lisowski

Join Date: Dec 2014

Posts: 10150
#6

18 Mar 2022, 13:16

It is worse than the formatting of the numbers, you have edited the input command and you have removed the code delimiters [CODE] and [/CODE] that serve to make it clean. The result is something that is neither clean nor able to be copied-and-pasted into a do-file.

Your desire to provide meaningful variable names in English is good, but the way to go about that is to make a copy of your dataset and use the rename command (or Data > Variables Manager) to rename the variables before running the dataex command. And once you run it, follow the instructions and copy everything you are told to copy.

By default dataex will output the first 100 observations. The output of dataex will look something like the following.

Code:

----------------------- copy starting from the next line ----------------------- [CODE] * Example generated by -dataex-. To install: ssc install dataex clear input int(x1 x2 x3) float x4 int x5 byte x6 4195 24 1 2 10 0 10371 16 3 3.5 17 0 4647 28 3 2 11 0 ... 5079 24 4 2.5 8 1 8129 21 4 2.5 8 1 4296 21 3 2.5 16 1 end label values x6 yesno label def yesno 0 "No", modify label def yesno 1 "Yes", modify [/CODE] ------------------ copy up to and including the previous line ------------------

In your dataex output you will select the lines between, but not including, "copy starting from the next line" and "copy up to and including the previous line" and then paste that into your reply. The result presented in your post will look something like the following.

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input int(x1 x2 x3) float x4 int x5 byte x6 4195 24 1 2 10 0 10371 16 3 3.5 17 0 4647 28 3 2 11 0 ... 5079 24 4 2.5 8 1 8129 21 4 2.5 8 1 4296 21 3 2.5 16 1 end label values x6 yesno label def yesno 0 "No", modify label def yesno 1 "Yes", modify
2 likes
Comment

Erik Mikael Johnson

Join Date: Mar 2022
Posts: 22

19 Mar 2022, 04:18

Hi William,

Many thanks for the tips and trix with dataex.

Here is the same data example as outlined.

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input long reported_crime float(week rc_week) int date_of_report float(year day)
1  5         .2 21216 2018 1
1 48 .020833334 22615 2021 1
1 27 .037037037 22097 2020 1
1 13  .07692308 21640 2019 1
1 22  .04545455 21701 2019 1
1 27 .037037037 22097 2020 1
1  5         .2 21946 2020 1
1  9  .11111111 22340 2021 1
1 31 .032258064 21397 2018 1
1 22  .04545455 22432 2021 1
1  9  .11111111 22340 2021 1
1 48 .020833334 21519 2018 1
1 18  .05555556 22401 2021 1
1 35  .02857143 21428 2018 1
1 18  .05555556 21305 2018 1
1 18  .05555556 22036 2020 1
1  1          1 21185 2018 1
1 48 .020833334 22615 2021 1
1 18  .05555556 22036 2020 1
1 13  .07692308 22371 2021 1
end
format %td date_of_report

Comment

William Lisowski

Join Date: Dec 2014

Posts: 10150
#8

19 Mar 2022, 08:20

Thank you for the effective data example.

In thinking about what I want to write, I have a question.

Code:

list date_of_report week in 1/10, clean date_of~t week 1. 01feb2018 5 2. 01dec2021 48 3. 01jul2020 27 4. 01apr2019 13 5. 01jun2019 22 6. 01jul2020 27 7. 01feb2020 5 8. 01mar2021 9 9. 01aug2018 31 10. 01jun2021 22

Each of the "date_of_report" values is a Stata daily date for the first day of a given month. Your "week" values are, as you said, created by using

Code:

generate week = week(date_of_report)

This leads me to think that your data consists of monthly reports of crimes, not data that is intrinsically weekly. In particular if it were weekly data I'd expect the weeks to all start on the same day of the week - e.g. Sunday - which these clearly do not.

Code:

. generate weekday = dow(date_of_report) . list date_of_report weekday in 1/10, clean date_of~t weekday 1. 01feb2018 4 2. 01dec2021 3 3. 01jul2020 3 4. 01apr2019 1 5. 01jun2019 6 6. 01jul2020 3 7. 01feb2020 6 8. 01mar2021 1 9. 01aug2018 3 10. 01jun2021 2

So to address your question effectively, we need to better understand what your data represents. If it is indeed monthly data, then you can analyze monthly crime rates, perhaps including indicator variables for the months of the year to capture seasonality and as a side effect to capture differences in the lengths of each month. Or you can re-express monthly crime rates as weekly crime rates by dividing by the number of days in each month to get average daily crime rates, and then multiply by 7 to get average weekly crime rates.

With that said, let me add the following advice. Stata's "date and time" variables are complicated and there is a lot to learn. If you have not already read the very detailed Chapter 24 (Working with dates and times) of the Stata User's Guide PDF, do so now. If you have, it's time for a refresher. After that, the help datetime documentation will usually be enough to point the way. You can't remember everything; even the most experienced users end up referring to the help datetime documentation or back to the manual for details. But at least you will get a good understanding of the basics and the underlying principles. An investment of time that will be amply repaid.

All Stata manuals are included as PDFs in the Stata installation and are accessible from within Stata - for example, through the PDF Documentation section of Stata's Help menu.

Reading that, you will learn that it is important to distinguish between how a value is stored and how it is displayed. You will understand that date_of_report is not a string variable, it is a numeric variable containing a "Stata daily date" value, with the %td format causing Stata to display that value as something meaningful to the reader. The values of date_of_report do indeed provide a date, in a format usable by Stata for analytic purposes.

Code:

* Example generated by -dataex-. For more info, type help dataex clear input float(day1 day2) 22704 22707 end format %td day1 format %td day2 generate diff = day2 - day1 list, clean

Code:

. list, clean day1 day2 diff 1. 28feb2022 03mar2022 3

Note how simple it is to compute the number of days between two dates using Stata daily dates. That is the point to Stata's internal representation of dates - they work well with simple arithmetic. .

So again, we need to better understand your data to address your analytical needs. Or perhaps with a better understanding of Stata dates, you'll see the way forward yourself.

I will say that while weekly crime rates are a useful concept, the value returned by week() is not generally useful for analysis, because years do not have 1 or two days more than 52 7-day weeks - hence the 8- or 9-day 52nd week of the year.

Last edited by William Lisowski; 19 Mar 2022, 08:48.
1 like
Comment
Erik Mikael Johnson

Join Date: Mar 2022

Posts: 22
#9

20 Mar 2022, 07:58

Thank you very much William for the very pedagogical and informative answer. I will take your advice and do as you suggest. If I do not solve it my self I will post here agian.
Comment
Erik Mikael Johnson

Join Date: Mar 2022

Posts: 22
#10

21 Mar 2022, 06:50

Okay, so the data I have is daily crime data. I provided a bad example above as I sorted the data by date_of_report beforehand, sorry about that.

I like your suggestion " then you can analyze monthly crime rates, perhaps including indicator variables for the months of the year to capture seasonality and as a side effect to capture differences in the lengths of each month. Or you can re-express monthly crime rates as weekly crime rates by dividing by the number of days in each month to get average daily crime rates, and then multiply by 7 to get average weekly crime rates."

How would you express this in Stata syntax?
Comment

William Lisowski

Join Date: Dec 2014
Posts: 10150

#11

21 Mar 2022, 14:41

Here is some example code on example data that demonstrates using the collapse and reshape commands with data like yours. My understanding is that each observation in your original data represents a single crime of the type indicated by reported_crime on the reported_date.

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input float(reported_crime date_of_report)
2 21963
2 21969
3 21974
3 21935
3 21925
1 21953
1 21930
3 21958
3 21924
3 21933
3 21967
1 21935
3 21941
3 21916
2 21935
2 21974
1 21918
3 21954
2 21954
1 21972
3 21939
2 21937
3 21959
2 21965
3 21968
3 21969
1 21920
2 21955
1 21919
2 21937
3 21921
1 21949
1 21966
2 21922
1 21932
3 21942
1 21966
1 21952
3 21922
2 21962
3 21931
1 21966
2 21957
1 21930
1 21955
2 21949
3 21961
3 21955
3 21934
2 21917
2 21964
3 21965
3 21940
2 21928
3 21956
2 21954
1 21956
2 21970
3 21935
2 21965
3 21971
2 21952
1 21925
1 21928
2 21963
3 21946
1 21965
2 21937
1 21932
3 21938
1 21941
2 21962
2 21956
3 21916
3 21972
3 21934
1 21973
1 21927
2 21916
3 21956
3 21943
1 21958
3 21926
2 21919
2 21931
2 21921
3 21948
1 21930
3 21963
3 21945
2 21945
1 21945
1 21960
2 21924
3 21974
1 21939
2 21934
1 21970
1 21963
3 21962
end
format %td date_of_report

// display first 5 of 100 observations
list in 1/5, abbreviate(20)

// collapse daily observations to monthly totals
generate crimes = 1 
generate month_of_report = mofd(date_of_report)
format %tm month_of_report
collapse (sum) crimes, by(reported_crime month_of_report)
list, sepby(reported_crime) abbreviate(20)

// reshape wide if what you want is 152 variables
tostring reported_crime, replace format(%03.0f)
rename crimes crime_
reshape wide crime_, i(month_of_report) j(reported_crime) string
list, abbreviate(20)

Code:

. // display first 5 of 100 observations
. list in 1/5, abbreviate(20)

     +---------------------------------+
     | reported_crime   date_of_report |
     |---------------------------------|
  1. |              2        18feb2020 |
  2. |              2        24feb2020 |
  3. |              3        29feb2020 |
  4. |              3        21jan2020 |
  5. |              3        11jan2020 |
     +---------------------------------+

. 
. // collapse daily observations to monthly totals
. generate crimes = 1 

. generate month_of_report = mofd(date_of_report)

. format %tm month_of_report

. collapse (sum) crimes, by(reported_crime month_of_report)

. list, sepby(reported_crime) abbreviate(20)

     +-------------------------------------------+
     | reported_crime   month_of_report   crimes |
     |-------------------------------------------|
  1. |              1            2020m1       15 |
  2. |              1            2020m2       15 |
     |-------------------------------------------|
  3. |              2            2020m1       14 |
  4. |              2            2020m2       17 |
     |-------------------------------------------|
  5. |              3            2020m1       20 |
  6. |              3            2020m2       19 |
     +-------------------------------------------+

. 
. // reshape wide if what you want is 152 variables
. tostring reported_crime, replace format(%03.0f)
reported_crime was float now str3

. rename crimes crime_

. reshape wide crime_, i(month_of_report) j(reported_crime) string
(j = 001 002 003)

Data                               Long   ->   Wide
-----------------------------------------------------------------------------
Number of observations                6   ->   2           
Number of variables                   3   ->   4           
j variable (3 values)    reported_crime   ->   (dropped)
xij variables:
                                 crime_   ->   crime_001 crime_002 crime_003
-----------------------------------------------------------------------------

. list, abbreviate(20)

     +-----------------------------------------------------+
     | month_of_report   crime_001   crime_002   crime_003 |
     |-----------------------------------------------------|
  1. |          2020m1          15          14          20 |
  2. |          2020m2          15          17          19 |
     +-----------------------------------------------------+

.

Comment

Erik Mikael Johnson

Join Date: Mar 2022

Posts: 22
#12

22 Mar 2022, 08:54

Thanks for the detailed example and code, much appreciated.

What I do not understand, and sorry if I'm being a bit slow here, is how I can code a continuous ratio-variable from this which I can use as a dependent variable? Say for instance, in a t-test comparing two areas.
Comment
Jared Greathouse

Join Date: Sep 2021

Posts: 2170
#13

22 Mar 2022, 09:01

If your outcome variable is already a number (let's say 40.34 per 100,000), you've got your continuous variable right there.
1 like
Comment
Erik Mikael Johnson

Join Date: Mar 2022

Posts: 22
#14

22 Mar 2022, 09:10

Okay, perfect! Thank you guys for everything.
Comment

Announcement

Generating a ratio variable, when observations contains several counts of interest.

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment