
  • Need Help with a Calculation


    Hi, I would greatly appreciate it if someone could help me with this calculation.

    I want to count the number of dbids based on the report_time column. When report_time is non-missing, I want to count, among all dbids lower in value than the current dbid, those whose resolution_timestamp is greater than the report_time of the current dbid.

    Please see the attached picture. I want to calculate the wanted column as shown in the attached picture. The explanation column shows the reasoning behind the values in the wanted column.
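    To state the rule unambiguously, here is the same calculation sketched in Python on a small made-up example (the numbers are illustrative only, not from my data):

```python
# Toy illustration (made-up values) of the counting rule described above.
# For each row with a non-missing report_time, "wanted" counts the rows
# with a strictly lower dbid whose resolution_timestamp exceeds that
# row's report_time.

rows = [
    # (dbid, report_time, resolution_timestamp); None = missing
    (1, None, 50),
    (2, 40,   90),   # rows with dbid < 2 and resolution > 40: dbid 1 -> 1
    (3, None, 30),
    (4, 45,   95),   # dbids 1, 2, 3 with resolution > 45: dbids 1 and 2 -> 2
]

wanted = {}
for dbid, report, _ in rows:
    if report is None:
        continue  # wanted is only defined where report_time is non-missing
    wanted[dbid] = sum(
        1 for other_dbid, _, resolution in rows
        if other_dbid < dbid and resolution > report
    )

print(wanted)  # {2: 1, 4: 2}
```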

    Here is the sample data.


    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input long dbid double(report_time resolution_timestamp)
    412735             . 1856208378000
    412754 1853220847616 1886187028480
    412756             . 1855512505000
    412759             . 1854635297000
    412772 1853223206912 1886259642368
    412781             . 1856191395000
    412870             . 1861464751000
    412892             . 1853431024000
    412896             . 1856198193000
    412917             . 1886262361000
    412928             . 1862069843000
    413075             . 1856209389000
    413077 1853310238720 1856180715520
    413115 1853324394496 1861570789376
    413172             . 1858451505000
    413176             . 1858451707000
    413184             . 1978148563000
    413185             . 1858452263000
    413187             . 1976419524000
    418248 1854181998592 1886187028480
    418262             . 1855797664000
    418267 1854186192896 1855405162496
    418271 1854186979328 1855835996160
    418295             . 1858452544000
    418308 1854190780416 1857043431424
    418325             . 1855241649000
    459739             . 1865615570000
    460316             . 1886263053000
    478251             . 1874878119000
    478315             . 1867087645000
    478901             . 1899747510000
    479573             . 1911748991000
    482340             . 1891113781000
    482517             . 1886263515000
    482535             . 1867088573000
    482645             . 1874874223000
    482708             . 1891113724000
    487347             . 1886263708000
    503988 1867073585152 1909333426176
    504122 1867099013120 1867321442304
    508726             . 1868228231000
    512306             . 1.8702798e+12
    519519             . 1870120862000
    529233             . 1904244701000
    529238             . 1919015047000
    530874             . 1874259850000
    543775 1871351382016 1874863194112
    548212             . 1885756576000
    593655             . 1876769490000
    633598             . 1878327631000
    633976             . 1923158508000
    633983             . 1916345555000
    634037             . 1911749010000
    636643             . 1878311041000
    636669             . 1881257594000
    667057             . 1880679878000
    691804             . 1883923852000
    692108             . 1886261391000
    692116 1883769929728 1886109827072
    707034             . 1916345738000
    707052             . 1916346058000
    707760             . 1909770916000
    710368 1886223073280 1886246797312
    710574             . 1887295151000
    710878             . 1886990655000
    715486             . 1886879398000
    715489             . 1886879519000
    715581             . 1886879479000
    716216             . 1891016872000
    716218             . 1887557182000
    716219             . 1887557225000
    721189             . 1945379232000
    721202             . 1916346990000
    722145 1887540871168 1889787314176
    722281 1887541657600 1890298626048
    722292             . 1891186350000
    722301 1887544410112 1906138415104
    722372             . 1903288101000
    723370             . 1906152358000
    724878             . 1909277630000
    725716             . 1908223670000
    725722             . 1908223636000
    726586             . 1897157195000
    726876             . 1888175564000
    729645             . 1906152379000
    729648             . 1906152455000
    729649             . 1906152487000
    729650             . 1906152571000
    729652             . 1906152331000
    729656             . 1906152034000
    729657             . 1906152044000
    729669             . 1906151946000
    730557             . 1899759508000
    751816             . 1899759344000
    761671             . 1896720469000
    762844             . 1893698642000
    770513 1894123044864 1897081995264
    770548 1894129467392 1894711033856
    770626             . 1896718924000
    770746             . 1913046325000
    end
    format %tc report_time
    format %tc resolution_timestamp
    Last edited by Ali Ahmed Awan; 15 Nov 2024, 00:17.

  • #2
    The following code assumes, and verifies, that the dbid uniquely identifies observations in the data set and that the data are sorted in increasing order of dbid. (The code can be modified if this is not true.)

    Code:
    //    VERIFY DBID VALUES ARE ALL DISTINCT AND GIVEN IN ASCENDING ORDER
    assert dbid[_n+1] > dbid
    
    tempfile copy
    save `copy'
    
    rangejoin resolution_timestamp report_time . using `copy'
    by dbid, sort: egen wanted = total(resolution_timestamp_U > report_time ///
        & dbid_U < dbid)
    by dbid: keep if _n == 1
    drop *_U
    -rangejoin- is written by Robert Picard, and is available from SSC. To use it, you must also install -rangestat-, by Robert Picard, Nick Cox, and Roberto Ferrer, also available from SSC.
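    For readers unfamiliar with -rangejoin-, the join-then-total step can be sketched outside Stata. This is a rough Python analogue on toy values, not the actual implementation: pair every observation with all observations whose resolution_timestamp lies above its report_time, then total the pairs that have a lower dbid.

```python
# Hedged sketch (toy values) of the -rangejoin- + egen total() logic:
# a range join keeps pairs whose key falls in (report_time, +inf),
# and the total() counts the matched pairs with dbid_U < dbid.

rows = [  # (dbid, report_time, resolution_timestamp); None = missing
    (1, None, 50),
    (2, 40,   90),
    (3, None, 30),
    (4, 45,   95),
]

# The range join: every (master, using) pair whose using-side
# resolution_timestamp exceeds the master-side report_time.
pairs = [
    (m, u) for m in rows for u in rows
    if m[1] is not None and u[2] > m[1]
]

# egen total() over the joined pairs, grouped by the master dbid.
wanted = {}
for (dbid, _, _), (dbid_u, _, _) in pairs:
    wanted[dbid] = wanted.get(dbid, 0) + (dbid_u < dbid)

print(wanted)  # {2: 1, 4: 2}
```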

    Note: My results disagree with yours for dbid 418248, where you show 17 but my code finds 18. I have rechecked that result by hand, and I believe you miscounted.



    • #3
      Thank you! Let me try.



      • #4
        Hi Clyde,
        The code serves the intended purpose.

        Now, I am trying to extend this to multiple groups of dbid but am unable to do it. I have a variable comp_id. Each comp_id contains multiple dbids, but each dbid belongs to exactly one comp_id.

        I have almost 177k observations, and I think -rangejoin- may not work at that scale.

        Here is my sample data:


        Code:
        * Example generated by -dataex-. For more info, type help dataex
        clear
        input str42 comp_id long dbid double(report_time resolution_timestamp)
        "yahoo"    438             . 1700601254000
        "yahoo"    439             . 1700601254000
        "yahoo"    440             . 1700601254000
        "yahoo"    441             . 1700601254000
        "yahoo"    442             . 1700601254000
        "yahoo"    443             . 1700601254000
        "yahoo"    444             . 1700601254000
        "yahoo"    445             . 1700601254000
        "yahoo"    446             . 1700601254000
        "yahoo"    447             . 1700601254000
        "yahoo"    448             . 1700601254000
        "yahoo"    449             . 1700601254000
        "yahoo"    450             . 1700601254000
        "security" 454 1700842831872 1717703147520
        "security" 477 1701412995072 1701552586752
        "security" 487 1701460705280 1701699125248
        "ruby"     499 1700679647232 1700679647232
        "ibb"      500 1699383607296 1699383607296
        "ibb"      501 1701767938048 1731243409408
        "ibb"      523 1701453627392 1702490406912
        "security" 546 1704190935040 1704756248576
        "security" 547 1704349401088 1704781152256
        "security" 575 1704955478016 1705453420544
        "yahoo"    579             . 1705263441000
        "yahoo"    581             . 1705263442000
        "yahoo"    582             . 1705263442000
        "yahoo"    583             . 1705263442000
        "yahoo"    584             . 1705263442000
        "yahoo"    585             . 1705263442000
        "yahoo"    586             . 1705263442000
        "yahoo"    587             . 1705263442000
        "yahoo"    588             . 1705263443000
        "yahoo"    589             . 1705263744000
        "yahoo"    590             . 1705263745000
        "yahoo"    591             . 1705263745000
        "yahoo"    592             . 1705263745000
        "yahoo"    593             . 1705263745000
        "yahoo"    594             . 1705263745000
        "yahoo"    595             . 1708267815000
        "yahoo"    596             . 1705263745000
        "yahoo"    597             . 1705263745000
        "yahoo"    598             . 1705263745000
        "yahoo"    599             . 1705263745000
        "yahoo"    600             . 1705263745000
        "yahoo"    601             . 1705263745000
        "yahoo"    602             . 1705263745000
        "yahoo"    603             . 1705263745000
        "yahoo"    604             . 1705263745000
        "yahoo"    605             . 1705263745000
        "yahoo"    606             . 1705263745000
        "yahoo"    607             . 1705263746000
        end
        format %tc report_time
        format %tc resolution_timestamp



        • #5
          FWIW, here's another approach. This one uses less memory--which might be an issue if the full original data set is very large.

          Code:
          // VERIFY DBID VALUES ARE ALL DISTINCT AND GIVEN IN ASCENDING ORDER
          assert dbid[_n+1] > dbid
          tempfile copy
          save `copy'
          
          capture program drop one_report
          program define one_report
              list
              count if resolution_timestamp > report_time[_N] & _n < _N
              gen wanted = r(N)
              keep in L
              exit
          end
          
          rangerun one_report, interval(dbid . 0) verbose
          keep if !missing(report_time)
          merge 1:1 dbid using `copy', nogenerate
          sort dbid
          -rangerun- is also by Robert Picard and, wait for it.... is available from SSC!
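    As a rough illustration (toy values, not the real data) of what the interval(dbid . 0) window gives the program: for each observation, -rangerun- hands it all rows up to and including the current dbid, and the program counts the earlier rows whose resolution_timestamp exceeds the last row's report_time.

```python
# Sketch of the windowed computation: for each observation, take the
# window of all rows with dbid <= current dbid (the interval(dbid . 0)
# analogue); count rows before the last one whose resolution_timestamp
# exceeds the last row's report_time.

rows = [  # (dbid, report_time, resolution_timestamp), sorted by dbid
    (1, None, 50),
    (2, 40,   90),
    (3, None, 30),
    (4, 45,   95),
]

def wanted_for(window):
    *earlier, (dbid, report, _) = window
    if report is None:
        return None  # no report_time, nothing to count against
    return sum(1 for _, _, res in earlier if res > report)

results = {w[-1][0]: wanted_for(w)
           for w in (rows[:i + 1] for i in range(len(rows)))}
print(results)  # {1: None, 2: 1, 3: None, 4: 2}
```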
          Last edited by Clyde Schechter; 15 Nov 2024, 13:44.



          • #6
            The first code (-rangejoin-) is not working on the full data due to the memory limitation.

            The -rangerun- code is working well. It takes almost 6 minutes to finish the job, which I think is still very good.

            However, I am trying to calculate the wanted variable within each group.

            Each dbid belongs to one company, which is identified by comp_id in the data. I tried to create a group identifier using the -egen group()- command, but the output is incorrect.

            Here is the code I used:

            Code:
            egen new_id = group(comp_id dbid)
            
            assert new_id[_n+1] > new_id
            tempfile copy
            save `copy'
            
            capture program drop one_report
            program define one_report
                //list
                count if resolution_timestamp > report_time[_N] & _n < _N
                gen wanted = r(N)
                keep in L
                exit
            end
            
            rangerun one_report, interval(new_id . 0) verbose
            keep if !missing(report_time)
            merge 1:1 new_id using `copy', nogenerate
            sort new_id

            Sample data:

            Code:
            * Example generated by -dataex-. For more info, type help dataex
            clear
            input str42 comp_id long dbid double(report_time resolution_timestamp) float new_id
            "slack"    2815             . 1709516627000  96646
            "relateiq" 2831             . 1709834661000  90441
            "relateiq" 2833             . 1711059222000  90442
            "slack"    2839             . 1709491449000  96647
            "yahoo"    2916             . 1712084118000 118858
            "slack"    2926 1709473923072 1709474316288  96648
            "relateiq" 2929             . 1709580624000  90443
            "relateiq" 2931             . 1711058889000  90444
            "yahoo"    2952             . 1715208857000 118859
            "relateiq" 2958             . 1711059296000  90445
            "slack"    2969             . 1709691664000  96649
            "slack"    2970             . 1710197327000  96650
            "slack"    2971             . 1709505003000  96651
            "slack"    2983             . 1709510712000  96652
            "slack"    2990             . 1729206862000  96653
            "slack"    2992             . 1709510633000  96654
            "slack"    2995             . 1736369823000  96655
            "slack"    3023             . 1709954110000  96656
            "slack"    3028             . 1709576164000  96657
            "yahoo"    3032             . 1713470578000 118860
            "yahoo"    3039 1709509181440 1710446739456 118861
            "relateiq" 3063             . 1711474334000  90446
            "relateiq" 3066             . 1711504107000  90447
            "yahoo"    3083             . 1728404747000 118862
            "relateiq" 3094             . 1711660234000  90448
            "slack"    3108             . 1712255457000  96658
            "slack"    3162             . 1709575470000  96659
            "slack"    3167             . 1709575529000  96660
            "relateiq" 3172             . 1710138096000  90449
            "slack"    3173             . 1709598952000  96661
            "relateiq" 3186             . 1711473872000  90450
            "relateiq" 3215             . 1711059888000  90451
            "relateiq" 3216             . 1709853990000  90452
            "security" 3227 1709570785280 1710119583744  92496
            "relateiq" 3232             . 1710184374000  90453
            "slack"    3233             . 1709598918000  96662
            "relateiq" 3235             . 1710125575000  90454
            "yahoo"    3237             . 1720558162000 118863
            "relateiq" 3241             . 1709704184000  90455
            "slack"    3243             . 1710027187000  96663
            "relateiq" 3247             . 1709702974000  90456
            "relateiq" 3278             . 1714429811000  90457
            "slack"    3280             . 1709673807000  96664
            "slack"    3282             . 1710287528000  96665
            "yahoo"    3313             . 1713477883000 118864
            "yahoo"    3319             . 1711572011000 118865
            "yahoo"    3320             . 1714595493000 118866
            "yahoo"    3322             . 1728574525000 118867
            "yahoo"    3323             . 1712330224000 118868
            "yahoo"    3325             . 1715194212000 118869
            "slack"    3326             . 1710287256000  96666
            end
            format %tc report_time
            format %tc resolution_timestamp



            • #7
              See the help for rangerun and consider whether adding a by() option to the original code is enough to do what you want.



              • #8
                Yes, I'm very confident that Nick has nailed it. You just need to add -by(comp_id)- to the end of the -rangerun- command. By the way, you can and should also remove the -verbose- option from that command. I put that in while I was testing the program to see what it was doing, but now that we know it does the intended job, it is no longer necessary and just generates a lot of unhelpful output.

                In addition, the verification of the proper order condition should be:
                Code:
                sort comp_id, stable
                by comp_id: assert dbid[_n+1] > dbid
                Added:

                Actually, given the large size of your data set, some additional tweaking of the code to speed-up execution is in order:
                Code:
                sort comp_id, stable
                by comp_id: assert dbid[_n+1] > dbid
                tempfile copy
                save `copy'
                
                capture program drop one_report
                program define one_report
                    count if resolution_timestamp > report_time[_N] & _n < _N
                    gen wanted = r(N)
                    keep in L
                    exit
                end
                
                gen low = cond(!missing(report_time), ., 1)
                gen high = cond(!missing(report_time), dbid, 0)
                rangerun one_report, interval(dbid low high) by(comp_id)
                merge 1:1 dbid using `copy', nogenerate
                sort comp_id dbid
                Note also the removal of the -list- command from the program (which you already seized upon by commenting out), the -verbose- option from -rangerun-, and the -keep if !missing(report_time)- command after -rangerun-.

                The little dance with the high and low variables makes -rangerun- skip over the observations where report_time has a missing value, thereby saving considerable execution time.
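                As a rough Python analogue (toy values, not the real data), the grouped calculation with the skip amounts to: count within each comp_id only, and ignore rows with a missing report_time entirely, which is what the empty (low, high) interval achieves in -rangerun-.

```python
# Grouped version of the counting rule: the count is done separately
# within each comp_id, and rows with a missing report_time are skipped
# (the analogue of the empty low/high interval in -rangerun-).

from collections import defaultdict

rows = [  # (comp_id, dbid, report_time, resolution_timestamp)
    ("a", 1, None, 50),
    ("a", 2, 60,   90),
    ("b", 1, None, 70),
    ("b", 2, 60,   95),
    ("b", 3, 65,   99),
]

groups = defaultdict(list)
for comp, dbid, report, res in rows:
    groups[comp].append((dbid, report, res))

wanted = {}
for comp, members in groups.items():
    for dbid, report, _ in members:
        if report is None:
            continue  # skipped, like the empty (low, high) interval
        wanted[(comp, dbid)] = sum(
            1 for d, _, res in members if d < dbid and res > report
        )

# Note dbid 2 in "a" counts only rows from "a", never from "b".
print(wanted)  # {('a', 2): 0, ('b', 2): 1, ('b', 3): 2}
```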
                Last edited by Clyde Schechter; 16 Nov 2024, 11:20.
