
  • Need Help with a Calculation


    Hi, I would greatly appreciate it if someone could help me with this calculation.

    I want to count the number of dbids based on the report_time column. When report_time is non-missing, I want to count, among all dbids lower in value than the current dbid, those whose resolution_timestamp is greater than the report_time of the current dbid.

    Please see the attached picture. I want to calculate the wanted column as shown in the attached picture. The explanation column shows the reasoning behind the values in the wanted column.
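    To state the rule unambiguously, here is the same calculation sketched in Python on a small made-up example (the numbers are illustrative only, not from my data):

```python
# Toy illustration (made-up values) of the counting rule described above.
# For each row with a non-missing report_time, "wanted" counts the rows
# with a strictly lower dbid whose resolution_timestamp exceeds that
# row's report_time.

rows = [
    # (dbid, report_time, resolution_timestamp); None = missing
    (1, None, 50),
    (2, 40,   90),   # rows with dbid < 2 and resolution > 40: dbid 1 -> 1
    (3, None, 30),
    (4, 45,   95),   # dbids 1, 2, 3 with resolution > 45: dbids 1 and 2 -> 2
]

wanted = {}
for dbid, report, _ in rows:
    if report is None:
        continue  # wanted is only defined where report_time is non-missing
    wanted[dbid] = sum(
        1 for other_dbid, _, resolution in rows
        if other_dbid < dbid and resolution > report
    )

print(wanted)  # {2: 1, 4: 2}
```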

    Here is the sample data.


    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input long dbid double(report_time resolution_timestamp)
    412735             . 1856208378000
    412754 1853220847616 1886187028480
    412756             . 1855512505000
    412759             . 1854635297000
    412772 1853223206912 1886259642368
    412781             . 1856191395000
    412870             . 1861464751000
    412892             . 1853431024000
    412896             . 1856198193000
    412917             . 1886262361000
    412928             . 1862069843000
    413075             . 1856209389000
    413077 1853310238720 1856180715520
    413115 1853324394496 1861570789376
    413172             . 1858451505000
    413176             . 1858451707000
    413184             . 1978148563000
    413185             . 1858452263000
    413187             . 1976419524000
    418248 1854181998592 1886187028480
    418262             . 1855797664000
    418267 1854186192896 1855405162496
    418271 1854186979328 1855835996160
    418295             . 1858452544000
    418308 1854190780416 1857043431424
    418325             . 1855241649000
    459739             . 1865615570000
    460316             . 1886263053000
    478251             . 1874878119000
    478315             . 1867087645000
    478901             . 1899747510000
    479573             . 1911748991000
    482340             . 1891113781000
    482517             . 1886263515000
    482535             . 1867088573000
    482645             . 1874874223000
    482708             . 1891113724000
    487347             . 1886263708000
    503988 1867073585152 1909333426176
    504122 1867099013120 1867321442304
    508726             . 1868228231000
    512306             . 1.8702798e+12
    519519             . 1870120862000
    529233             . 1904244701000
    529238             . 1919015047000
    530874             . 1874259850000
    543775 1871351382016 1874863194112
    548212             . 1885756576000
    593655             . 1876769490000
    633598             . 1878327631000
    633976             . 1923158508000
    633983             . 1916345555000
    634037             . 1911749010000
    636643             . 1878311041000
    636669             . 1881257594000
    667057             . 1880679878000
    691804             . 1883923852000
    692108             . 1886261391000
    692116 1883769929728 1886109827072
    707034             . 1916345738000
    707052             . 1916346058000
    707760             . 1909770916000
    710368 1886223073280 1886246797312
    710574             . 1887295151000
    710878             . 1886990655000
    715486             . 1886879398000
    715489             . 1886879519000
    715581             . 1886879479000
    716216             . 1891016872000
    716218             . 1887557182000
    716219             . 1887557225000
    721189             . 1945379232000
    721202             . 1916346990000
    722145 1887540871168 1889787314176
    722281 1887541657600 1890298626048
    722292             . 1891186350000
    722301 1887544410112 1906138415104
    722372             . 1903288101000
    723370             . 1906152358000
    724878             . 1909277630000
    725716             . 1908223670000
    725722             . 1908223636000
    726586             . 1897157195000
    726876             . 1888175564000
    729645             . 1906152379000
    729648             . 1906152455000
    729649             . 1906152487000
    729650             . 1906152571000
    729652             . 1906152331000
    729656             . 1906152034000
    729657             . 1906152044000
    729669             . 1906151946000
    730557             . 1899759508000
    751816             . 1899759344000
    761671             . 1896720469000
    762844             . 1893698642000
    770513 1894123044864 1897081995264
    770548 1894129467392 1894711033856
    770626             . 1896718924000
    770746             . 1913046325000
    end
    format %tc report_time
    format %tc resolution_timestamp
    Last edited by Ali Ahmed Awan; 15 Nov 2024, 00:17.

  • #2
    The following code assumes, and verifies, that the dbid uniquely identifies observations in the data set and that the data are sorted in increasing order of dbid. (The code can be modified if this is not true.)

    Code:
    //    VERIFY DBID VALUES ARE ALL DISTINCT AND GIVEN IN ASCENDING ORDER
    assert dbid[_n+1] > dbid
    
    tempfile copy
    save `copy'
    
    rangejoin resolution_timestamp report_time . using `copy'
    by dbid, sort: egen wanted = total(resolution_timestamp_U > report_time ///
        & dbid_U < dbid)
    by dbid: keep if _n == 1
    drop *_U
    -rangejoin- is written by Robert Picard, and is available from SSC. To use it, you must also install -rangestat-, by Robert Picard, Nick Cox, and Roberto Ferrer, also available from SSC.
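    For readers unfamiliar with -rangejoin-, the join-then-total step can be sketched outside Stata. This is a rough Python analogue on toy values, not the actual implementation: pair every observation with all observations whose resolution_timestamp lies above its report_time, then total the pairs that have a lower dbid.

```python
# Hedged sketch (toy values) of the -rangejoin- + egen total() logic:
# a range join keeps pairs whose key falls in (report_time, +inf),
# and the total() counts the matched pairs with dbid_U < dbid.

rows = [  # (dbid, report_time, resolution_timestamp); None = missing
    (1, None, 50),
    (2, 40,   90),
    (3, None, 30),
    (4, 45,   95),
]

# The range join: every (master, using) pair whose using-side
# resolution_timestamp exceeds the master-side report_time.
pairs = [
    (m, u) for m in rows for u in rows
    if m[1] is not None and u[2] > m[1]
]

# egen total() over the joined pairs, grouped by the master dbid.
wanted = {}
for (dbid, _, _), (dbid_u, _, _) in pairs:
    wanted[dbid] = wanted.get(dbid, 0) + (dbid_u < dbid)

print(wanted)  # {2: 1, 4: 2}
```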

    Note: My results disagree with yours for dbid 418248, where you show 17 but my code finds 18. I have rechecked that result by hand, and I believe you miscounted.



    • #3
      Thank you! Let me try.



      • #4
        Hi Clyde,
        The code serves the intended purpose.

        Now, I am trying to extend this to multiple groups of dbid but am unable to do it. I have a variable comp_id. Each comp_id contains multiple dbids, but each dbid belongs to exactly one comp_id.

        I have almost 177k observations, and I think -rangejoin- may not work at that scale.

        Here is my sample data:


        Code:
        * Example generated by -dataex-. For more info, type help dataex
        clear
        input str42 comp_id long dbid double(report_time resolution_timestamp)
        "yahoo"    438             . 1700601254000
        "yahoo"    439             . 1700601254000
        "yahoo"    440             . 1700601254000
        "yahoo"    441             . 1700601254000
        "yahoo"    442             . 1700601254000
        "yahoo"    443             . 1700601254000
        "yahoo"    444             . 1700601254000
        "yahoo"    445             . 1700601254000
        "yahoo"    446             . 1700601254000
        "yahoo"    447             . 1700601254000
        "yahoo"    448             . 1700601254000
        "yahoo"    449             . 1700601254000
        "yahoo"    450             . 1700601254000
        "security" 454 1700842831872 1717703147520
        "security" 477 1701412995072 1701552586752
        "security" 487 1701460705280 1701699125248
        "ruby"     499 1700679647232 1700679647232
        "ibb"      500 1699383607296 1699383607296
        "ibb"      501 1701767938048 1731243409408
        "ibb"      523 1701453627392 1702490406912
        "security" 546 1704190935040 1704756248576
        "security" 547 1704349401088 1704781152256
        "security" 575 1704955478016 1705453420544
        "yahoo"    579             . 1705263441000
        "yahoo"    581             . 1705263442000
        "yahoo"    582             . 1705263442000
        "yahoo"    583             . 1705263442000
        "yahoo"    584             . 1705263442000
        "yahoo"    585             . 1705263442000
        "yahoo"    586             . 1705263442000
        "yahoo"    587             . 1705263442000
        "yahoo"    588             . 1705263443000
        "yahoo"    589             . 1705263744000
        "yahoo"    590             . 1705263745000
        "yahoo"    591             . 1705263745000
        "yahoo"    592             . 1705263745000
        "yahoo"    593             . 1705263745000
        "yahoo"    594             . 1705263745000
        "yahoo"    595             . 1708267815000
        "yahoo"    596             . 1705263745000
        "yahoo"    597             . 1705263745000
        "yahoo"    598             . 1705263745000
        "yahoo"    599             . 1705263745000
        "yahoo"    600             . 1705263745000
        "yahoo"    601             . 1705263745000
        "yahoo"    602             . 1705263745000
        "yahoo"    603             . 1705263745000
        "yahoo"    604             . 1705263745000
        "yahoo"    605             . 1705263745000
        "yahoo"    606             . 1705263745000
        "yahoo"    607             . 1705263746000
        end
        format %tc report_time
        format %tc resolution_timestamp



        • #5
          FWIW, here's another approach. This one uses less memory--which might be an issue if the full original data set is very large.

          Code:
          // VERIFY DBID VALUES ARE ALL DISTINCT AND GIVEN IN ASCENDING ORDER
          assert dbid[_n+1] > dbid
          tempfile copy
          save `copy'
          
          capture program drop one_report
          program define one_report
              list
              count if resolution_timestamp > report_time[_N] & _n < _N
              gen wanted = r(N)
              keep in L
              exit
          end
          
          rangerun one_report, interval(dbid . 0) verbose
          keep if !missing(report_time)
          merge 1:1 dbid using `copy', nogenerate
          sort dbid
          -rangerun- is also by Robert Picard and, wait for it.... is available from SSC!
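    As a rough illustration (toy values, not the real data) of what the interval(dbid . 0) window gives the program: for each observation, -rangerun- hands it all rows up to and including the current dbid, and the program counts the earlier rows whose resolution_timestamp exceeds the last row's report_time.

```python
# Sketch of the windowed computation: for each observation, take the
# window of all rows with dbid <= current dbid (the interval(dbid . 0)
# analogue); count rows before the last one whose resolution_timestamp
# exceeds the last row's report_time.

rows = [  # (dbid, report_time, resolution_timestamp), sorted by dbid
    (1, None, 50),
    (2, 40,   90),
    (3, None, 30),
    (4, 45,   95),
]

def wanted_for(window):
    *earlier, (dbid, report, _) = window
    if report is None:
        return None  # no report_time, nothing to count against
    return sum(1 for _, _, res in earlier if res > report)

results = {w[-1][0]: wanted_for(w)
           for w in (rows[:i + 1] for i in range(len(rows)))}
print(results)  # {1: None, 2: 1, 3: None, 4: 2}
```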
          Last edited by Clyde Schechter; 15 Nov 2024, 13:44.



          • #6
            The first code (-rangejoin-) is not working on the full data due to the memory limitation.

            The -rangerun- code is working well. It takes almost 6 minutes to finish the job, which I think is still very good.

            However, I am trying to calculate the wanted variable within each group.

            Each dbid belongs to one company, which is identified by comp_id in the data. I tried to create a group identifier using the -egen group()- command, but the output is incorrect.

            Here is the code I used:

            Code:
            egen new_id = group(comp_id dbid)
            
            assert new_id[_n+1] > new_id
            tempfile copy
            save `copy'
            
            capture program drop one_report
            program define one_report
                //list
                count if resolution_timestamp > report_time[_N] & _n < _N
                gen wanted = r(N)
                keep in L
                exit
            end
            
            rangerun one_report, interval(new_id . 0) verbose
            keep if !missing(report_time)
            merge 1:1 new_id using `copy', nogenerate
            sort new_id

            Sample data:

            Code:
            * Example generated by -dataex-. For more info, type help dataex
            clear
            input str42 comp_id long dbid double(report_time resolution_timestamp) float new_id
            "slack"    2815             . 1709516627000  96646
            "relateiq" 2831             . 1709834661000  90441
            "relateiq" 2833             . 1711059222000  90442
            "slack"    2839             . 1709491449000  96647
            "yahoo"    2916             . 1712084118000 118858
            "slack"    2926 1709473923072 1709474316288  96648
            "relateiq" 2929             . 1709580624000  90443
            "relateiq" 2931             . 1711058889000  90444
            "yahoo"    2952             . 1715208857000 118859
            "relateiq" 2958             . 1711059296000  90445
            "slack"    2969             . 1709691664000  96649
            "slack"    2970             . 1710197327000  96650
            "slack"    2971             . 1709505003000  96651
            "slack"    2983             . 1709510712000  96652
            "slack"    2990             . 1729206862000  96653
            "slack"    2992             . 1709510633000  96654
            "slack"    2995             . 1736369823000  96655
            "slack"    3023             . 1709954110000  96656
            "slack"    3028             . 1709576164000  96657
            "yahoo"    3032             . 1713470578000 118860
            "yahoo"    3039 1709509181440 1710446739456 118861
            "relateiq" 3063             . 1711474334000  90446
            "relateiq" 3066             . 1711504107000  90447
            "yahoo"    3083             . 1728404747000 118862
            "relateiq" 3094             . 1711660234000  90448
            "slack"    3108             . 1712255457000  96658
            "slack"    3162             . 1709575470000  96659
            "slack"    3167             . 1709575529000  96660
            "relateiq" 3172             . 1710138096000  90449
            "slack"    3173             . 1709598952000  96661
            "relateiq" 3186             . 1711473872000  90450
            "relateiq" 3215             . 1711059888000  90451
            "relateiq" 3216             . 1709853990000  90452
            "security" 3227 1709570785280 1710119583744  92496
            "relateiq" 3232             . 1710184374000  90453
            "slack"    3233             . 1709598918000  96662
            "relateiq" 3235             . 1710125575000  90454
            "yahoo"    3237             . 1720558162000 118863
            "relateiq" 3241             . 1709704184000  90455
            "slack"    3243             . 1710027187000  96663
            "relateiq" 3247             . 1709702974000  90456
            "relateiq" 3278             . 1714429811000  90457
            "slack"    3280             . 1709673807000  96664
            "slack"    3282             . 1710287528000  96665
            "yahoo"    3313             . 1713477883000 118864
            "yahoo"    3319             . 1711572011000 118865
            "yahoo"    3320             . 1714595493000 118866
            "yahoo"    3322             . 1728574525000 118867
            "yahoo"    3323             . 1712330224000 118868
            "yahoo"    3325             . 1715194212000 118869
            "slack"    3326             . 1710287256000  96666
            end
            format %tc report_time
            format %tc resolution_timestamp



            • #7
              See the help for rangerun and consider whether adding a by() option to the original code is enough to do what you want.



              • #8
                Yes, I'm very confident that Nick has nailed it. You just need to add -by(comp_id)- to the end of the -rangerun- command. By the way, you can and should also remove the -verbose- option from that command. I put that in while I was testing the program to see what it was doing, but now that we know it does the intended job, it is no longer necessary and just generates a lot of unhelpful output.

                In addition, the verification of the proper order condition should be:
                Code:
                sort comp_id, stable
                by comp_id: assert dbid[_n+1] > dbid
                Added:

                Actually, given the large size of your data set, some additional tweaking of the code to speed-up execution is in order:
                Code:
                sort comp_id, stable
                by comp_id: assert dbid[_n+1] > dbid
                tempfile copy
                save `copy'
                
                capture program drop one_report
                program define one_report
                    count if resolution_timestamp > report_time[_N] & _n < _N
                    gen wanted = r(N)
                    keep in L
                    exit
                end
                
                gen low = cond(!missing(report_time), ., 1)
                gen high = cond(!missing(report_time), dbid, 0)
                rangerun one_report, interval(dbid low high) by(comp_id)
                merge 1:1 dbid using `copy', nogenerate
                sort comp_id dbid
                Note also the removal of the -list- command from the program (which you already seized upon by commenting out), the -verbose- option from -rangerun-, and the -keep if !missing(report_time)- command after -rangerun-.

                The little dance with the high and low variables makes -rangerun- skip over the observations where report_time has a missing value, thereby saving considerable execution time.
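                As a rough Python analogue (toy values, not the real data), the grouped calculation with the skip amounts to: count within each comp_id only, and ignore rows with a missing report_time entirely, which is what the empty (low, high) interval achieves in -rangerun-.

```python
# Grouped version of the counting rule: the count is done separately
# within each comp_id, and rows with a missing report_time are skipped
# (the analogue of the empty low/high interval in -rangerun-).

from collections import defaultdict

rows = [  # (comp_id, dbid, report_time, resolution_timestamp)
    ("a", 1, None, 50),
    ("a", 2, 60,   90),
    ("b", 1, None, 70),
    ("b", 2, 60,   95),
    ("b", 3, 65,   99),
]

groups = defaultdict(list)
for comp, dbid, report, res in rows:
    groups[comp].append((dbid, report, res))

wanted = {}
for comp, members in groups.items():
    for dbid, report, _ in members:
        if report is None:
            continue  # skipped, like the empty (low, high) interval
        wanted[(comp, dbid)] = sum(
            1 for d, _, res in members if d < dbid and res > report
        )

# Note dbid 2 in "a" counts only rows from "a", never from "b".
print(wanted)  # {('a', 2): 0, ('b', 2): 1, ('b', 3): 2}
```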
                Last edited by Clyde Schechter; 16 Nov 2024, 11:20.
