Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Calculating total numbers before a specif date for a specif person

    Dear members,

    I'm new to the forum and posted an inquiry yesterday. Apologies for any confusion; it seems I didn't follow the forum rules, and there was a mistake in posting my data.
    To simplify things, I've attached the correct data below.
    I am looking for someone to help me with the following coding problem:

    At the beginning of each month in a given team:

    1st, I would like to calculate for each player the total number of goals they scored within 1 year preceding the 1st of every month.

    2nd, I would like to determine the maximum goal against a single goalkeeper in the same time window i.e., within 1 year preceding the 1st of every month.
    For example, in a team AR010, and for player BKX9M before 1st of Mar 2020, this would be 5 - the scores he had against a single goalkeeper - Jesen. Notice his score on 1st of Mar 2020 is not counted as I want to consider only within one year before every 1st month.

    3rd, I would like to compute, the share of max goal against a single player counts in a total scores for each player as (2nd/1st).

    Note that for a given month, if a player has less than 2 GameDates within 1 year before every month, I would leave it missing. For example, player GHBEO has just two games preceding 12th of May 2020 in a team AR01. And I will leave it missing.


    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str5(team player) str7 ScoredAgainst int GameDate
    "AR010" "BKX9M" "Jesen"   21808
    "AR010" "BKX9M" "Mark"    21865
    "AR010" "BKX9M" "Jesen"   21897
    "AR010" "BKX9M" "Jesen"   21915
    "AR010" "BKX9M" "Jesen"   21926
    "AR010" "BKX9M" "Jesen"   21947
    "AR010" "BKX9M" "Jesen"   21975
    "AR010" "BKX9M" "Mark"    21986
    "AR010" "BKX9M" "Parker"  22047
    "AR010" "KCH6W" "Jesen"   21808
    "AR010" "KCH6W" "Parker"  21992
    "AR010" "KCH6W" "Mark"    22035
    "AR010" "KCH6W" "Parker"  22047
    "AR010" "KCH6W" "Parker"  22067
    "AR010" "KCH6W" "Parker"  22081
    "AR010" "KCH6W" "Parker"  22108
    "AR010" "GHBEO" "Parker"  22002
    "AR010" "GHBEO" "Parker"  22017
    "AR010" "GHBEO" "Parker"  22047
    "MAN78" "NMD8B" "Mark"    21806
    "MAN78" "NMD8B" "Mark"    21870
    "MAN78" "NMD8B" "Mark"    21884
    "MAN78" "NMD8B" "Parker"  21897
    "MAN78" "NMD8B" "William" 21926
    "MAN78" "NMD8B" "Mark"    21942
    "MAN78" "NMD8B" "Mark"    22048
    "MAN78" "NMD8B" "Mark"    22150
    "MAN78" "NMD8B" "Mark"    22159
    "MAN78" "NMD8B" "Mark"    22173
    "MAN78" "MYD8B" "Mark"    21942
    "MAN78" "MYD8B" "Mark"    22048
    "MAN78" "MYD8B" "Mark"    22150
    "MAN78" "MYD8B" "Mark"    22159
    "MAN78" "MYD8B" "Mark"    22173
    end
    format %tddd-Mon-YY GameDate
    Any guidance or support would be greatly appreciated.

  • #2
    Well, there are still a few things I'm unsure about, but I think the code below will at least point you in the right direction if it isn't quite what you had in mind.

    You refer to the first of every month as a reference point. So I'm creating these three variables in series, with a new set of variables corresponding to every month from September 2019 through September 2020. Those particular months are not "hard wired" into the code. Rather they are the first and last months observed in the GameDate variable, and in your full data set will be modified accordingly without you having to do anything to the code.

    There is also nothing in the data set that indicates which observations correspond to a goal being scored. So I assume that each observation is a goal, not a game.

    You don't explain the coding system for players. In your data set, any given player code is always associated with the same team. I don't know if that holds throughout your data. But if the same player code can occur with different team codes, I don't know if that's the same player having switched teams, or if it could be a different player who is assigned the same code, with the team code serving to distinguish them. My code assumes the latter.

    Code:
    summ GameDate, meanonly
    local begin = (mofd(`r(min)'))
    local end = (mofd(`r(max)'))
    
    forvalues m = `begin'/`end' {
        local mm: display %tmmonYY `m'
        by team player (ScoredAgainst GameDate), sort: egen total_goals_`mm' = ///
            total(ceil(datediff_frac(GameDate, dofm(`m'), "y")) == 1) // WANTED 1
        replace total_goals_`mm' = . if total_goals_`mm' < 2
        by team player ScoredAgainst (GameDate): egen against_this_keeper_`mm' = ///
            total(ceil(datediff_frac(GameDate, dofm(`m'), "y")) == 1)
        by team player: egen max_against_single_keeper_`mm' = ///
            max(against_this_keeper_`mm') // WANTED 2
        gen share_`mm' = max_against_single_keeper_`mm'/total_goals_`mm' // WANTED 3
    }

    Comment


    • #3
      Thank you very much Clyde. The code seems to work with a minor concern. For example, a player with code GHBEO in a team AR010, played three games on 28Mar2020, 12Apr2020, and 12May2020. I can see that the code calculate the share even on June and beyond for this player even though he didn't play.

      As I want to calculate the average of all this shares by team and month (I assume using: bysort team month: egen team_average_share=rowmean(Share_sep19, share_oct19, etc), instances like that may invalidate the average share value for a team.

      By the way, as I want to compute the share based on within 1 year time frame before a particular month, would it possible to implement rangestat? more in general, is there a way to improve the code to account for the mentioned problem?

      To clarify a bit on the variables: a goal being scored is captured by the variable "ScoredAgainst", containing the goalkeeper name. So if for example a player scored two times on the same GameDay, there would be one more row and "ScoredAgainst" will have the same goalkeeper name in two rows. You're right, there is a potential of players switching to another team specially overtime, he is going to take on a different ID in his new team. All his scores and information will count towards his new team.

      I cannot thank you enough Clyde.

      Best Wishes,

      Comment


      • #4
        OK. In addition to the problem of months that occur after the player's last game, there is a similar problem on the other end: months that occur before the player's first game, right? So both of those are easily fixed--see the code below.

        I am a big fan of -rangestat-, and it could be used in this situation. But it would not simplify the code, and I have a general principle of sticking to official Stata commands when they can do the job. This is because there may come a day (I hope not, but it could happen) when a new version of Stata no longer supports -rangestat-, and that would break a -rangestat- based solution to this problem. But that cannot happen with the actual code I am showing here, because Stata has a commitment to maintaining the function of old code with respect to official Stata commands. So I nearly always program with official Stata commands where possible--I only make exceptions where using a user-written command will greatly simplify the code. -rangestat- often greatly simplifies problems like this--but in this particular case, it does not.

        Code:
        //    IDENTIFY FIRST AND LAST MONTHS OF PLAY FOR EACH PLAYER
        by team player, sort: egen first_month = min(GameDate)
        replace first_month = mofd(first_month)
        format first_month %tm
        by team player: egen last_month = max(GameDate)
        replace last_month = mofd(last_month)
        format last_month %tm
        
        summ GameDate, meanonly
        local begin = (mofd(`r(min)'))
        local end = (mofd(`r(max)'))
        
        forvalues m = `begin'/`end' {
            local mm: display %tmmonYY `m'
            by team player (ScoredAgainst GameDate), sort: egen total_goals_`mm' = ///
                total(ceil(datediff_frac(GameDate, dofm(`m'), "y")) == 1) // WANTED 1
            replace total_goals_`mm' = . if total_goals_`mm' < 2
            replace total_goals_`mm' = . if !inrange(`m', first_month, last_month)
            by team player ScoredAgainst (GameDate): egen against_this_keeper_`mm' = ///
                total(ceil(datediff_frac(GameDate, dofm(`m'), "y")) == 1)
            replace against_this_keeper_`mm' = . if missing(total_goals_`mm')
            by team player: egen max_against_single_keeper_`mm' = ///
                max(against_this_keeper_`mm')
            replace max_against_single_keeper_`mm' = . if missing(total_goals_`mm') // WANTED 2
            gen share_`mm' = max_against_single_keeper_`mm'/total_goals_`mm' // WANTED 3
        }
        Changes to original code shown in bold face.

        Comment


        • #5
          Thank you Clyde. It works very well.

          If I change the time window requirement from one to two years: total(ceil(datediff_frac(GameDate, dofm(`m'), "y")) == 2). By the way, I didn't fully understand the syntax logic, but it is fine I will read.

          One more question: when averaging the shares for a team in a given month does this code right? bysort team month= egen average_share=rowmean(share_*) ??

          I would appreciate if you could suggest me a book or any resource to learn Stata. I am new and think that I will use Stata most of the time in my career.

          Best,
          Ami

          Comment


          • #6
            Actually
            total(ceil(datediff_frac(GameDate, dofm(`m'), "y")) == 2)
            won't get you what you want. It will get you a window that starts two years back but ends at one year and one day back. It's not a two year window. For a two-year window you need:
            Code:
            total(ceil(datediff_frac(GameDate, dofm(`m'), "y")) <= 2)
            when averaging the shares for a team in a given month does this code right? bysort team month= egen average_share=rowmean(share_*) ??
            No. That will get you a syntax error message. The first syntax error is the = after month, which should be a :. The second syntax error is that -rowmean()- cannot be used with -by-.

            If, for each month, you want the average share for all teams in that month you should add the following command at the end of the loop in the code you already have.
            Code:
            by team: egen team_avg_share_`mm' = mean(share_`mm')
            For each player's average share across all months:
            Code:
            egen player_avg_share == rowmean(share_*) // N.B.  AFTER, NOT INSIDE, THE LOOP
            For each team's average share, well, this is undefined because different teams have different numbers of players, and different ways of putting together a team average will yield different results. If you average all the players in a month and then average those monthly averages across months you will get one thing. If you average all the months for each player and then average those player averages across players, you will get something different. And if you just average all of the player-month numbers over all combinations of player and month you will get yet another result. All of these are possibly useful statistics, but you need to decide which one is the useful one for your purposes.ientat

            Added: As for materials to learn Stata, the very first step is to read the User Guide [U] and Getting Started [GS] sections of the PDF documentation that is part of your Stata installation. That will cover the stuff that everyone needs in order to do anything useful at all. It is a long read and you won't remember all the details, but you will at least have some familiarity with the "bread and butter" parts of Stata. After that, when I was a beginner I found the StataCorp net courses very helpful. I took the first three. There are also a number of books in the StataCorp bookstore whose purpose is to teach beginners to become proficient in Stata use. There are many such books, with different emphases, and targeting people in different disciplines. You can browse those and see if one of them looks right for you. Those resources can help you get off the ground. Once you have the basics down, o really become proficient, I believe there is no substitute for actually doing real-world analysis on real-world data sets that you will deliver to real-world clients who will rely on them for real-world application. That's what applied statistics is all about! And the more of it you do, the better you will get at it.



            Last edited by Clyde Schechter; 19 Jan 2024, 16:49.

            Comment


            • #7
              I just realized that there is an error in my response in #6. -total(ceil(datediff_frac(GameDate, dofm(`m'), "y")) <= 2) does not do what is needed, because it will also include the date of the game in the current observation, which #1 clearly states is to be excluded. To get that right, the expression should be:
              Code:
              total(inlist(ceil(datediff_frac(GameDate, dofm(`m'), "y")), 1, 2))
              Explanation: if a game occurs on the first of the month date, GameDate will equal dofm(`m'), so datediff_frac(...) will be 0. ceil(0) = 0, and that is <= 2. By instead requiring that the ceil(...) expression be either 1 or 2, we exclude the GameDate and achieve a two year window preceding the first of month date.

              Comment


              • #8
                Thanks, Clyde!

                Everything is working well, and I've gained a good understanding. By the way, can we stack all team values of team_avg_share_`mm'' for each month into a single variable and column named 'Average_share'? Ultimately, I am interested in panel data by team and month.

                Appreciate your assistance!

                Best,

                Comment


                • #9
                  By the way, can we stack all team values of team_avg_share_`mm'' for each month into a single variable and column named 'Average_share'?
                  I don't understand what you want here. Can you show an example of what this would look like?

                  Comment


                  • #10
                    For example: If I want the my data to look like below, running the previous code. I think my problem is reshaping the data so that for each "team" I have one variable "Month" ruining from 1-12 and another variable "average share", containing values of share in a particular month for each team.
                    then merging this to the original data by team ID possibly.

                    * Example generated by -dataex-. For more info, type help dataex
                    clear
                    input str5 team byte month double average_share
                    "AR010" 1 .6
                    "AR010" 2 .5
                    "AR010" 3 .4
                    "AR010" 4 .6
                    "AR010" 5 .6
                    "AR010" 6 .7
                    "AR010" 7 .8
                    "AR010" 8 .01
                    "AR010" 9 .6
                    "AR010" 10 .6
                    "AR010" 11 .7
                    "AR010" 12 .9
                    "MAN78" 1 .6
                    "MAN78" 2 .5
                    "MAN78" 3 .4
                    "MAN78" 4 .6
                    "MAN78" 5 .6
                    "MAN78" 6 .7
                    "MAN78" 7 .8
                    "MAN78" 8 .01
                    "MAN78" 9 .6
                    "MAN78" 10 .6
                    "MAN78" 11 .7
                    "MAN78" 12 .9
                    "" . .
                    "" . .
                    "" . .
                    "" . .
                    end
                    [/CODE]

                    Comment


                    • #11
                      So, starting from the very beginning, the whole sequence is:
                      Code:
                      //    IDENTIFY FIRST AND LAST MONTHS OF PLAY FOR EACH PLAYER
                      by team player, sort: egen first_month = min(GameDate)
                      replace first_month = mofd(first_month)
                      format first_month %tm
                      by team player: egen last_month = max(GameDate)
                      replace last_month = mofd(last_month)
                      format last_month %tm
                      
                      summ GameDate, meanonly
                      local begin = (mofd(`r(min)'))
                      local end = (mofd(`r(max)'))
                      
                      forvalues m = `begin'/`end' {
                          local mm: display %tmmonYY `m'
                          by team player (ScoredAgainst GameDate), sort: egen total_goals_`mm' = ///
                              total(ceil(datediff_frac(GameDate, dofm(`m'), "y")) == 1) // WANTED 1
                          replace total_goals_`mm' = . if total_goals_`mm' < 2
                          replace total_goals_`mm' = . if !inrange(`m', first_month, last_month)
                          by team player ScoredAgainst (GameDate): egen against_this_keeper_`mm' = ///
                              total(ceil(datediff_frac(GameDate, dofm(`m'), "y")) == 1)
                          replace against_this_keeper_`mm' = . if missing(total_goals_`mm')
                          by team player: egen max_against_single_keeper_`mm' = ///
                              max(against_this_keeper_`mm')
                          replace max_against_single_keeper_`mm' = . if missing(total_goals_`mm') // WANTED 2
                          gen share_`mm' = max_against_single_keeper_`mm'/total_goals_`mm' // WANTED 3
                      }
                      
                      collapse (mean) share_*, by(team)
                      reshape long share_, i(team) j(_month) string
                      gen month = monthly(_month, "M20Y"), after(team)
                      format month %tm
                      drop _month
                      rename share_ average_share

                      Comment


                      • #12
                        Many many thanks Clyde. That is exactly what I want.
                        I started by duplicates drop team and then reshape long, that didn't yield what I wanted.

                        Thanks very much,
                        Ami

                        Comment

                        Working...
                        X