This tutorial demonstrates how to create an animated visualization of the NBA’s play-by-play data. It heavily builds upon the tutorials written by Rajiv Shah available here:. I will use several functions available at Rajiv’s GitHub page to get the data and process it. I will then get the data into the format required to animate using gganimate.

Depending upon your machine, please allow a little time for the example animation to load at the bottom of the page. If you continue to have difficulties please refresh or try here for the RPubs version.

Load these packages

library(RCurl)
library(jsonlite)
library(dplyr)
library(sp)
library(ggplot2)
library(data.table)
library(gganimate)

Source Some Functions

First source several functions that we need to extract and plot data. These written by Rajiv and adapted by me. The R file is available here.

Get the Data

Although the NBA’s play by play data is no longer made publicly available, a selection of these data have been stored. We will use the data in this repository. I downloaded this game-log if you want to follow along. I stored it as “0021500431.json” following Rajiv’s example. It is a game between the San Antonio Spurs and Minnesota Timberwolves from December 2015.

Use the ‘sportvu_convert_json’ function to convert the raw json data to a R dataframe.

all.movements <- sportvu_convert_json("0021500431.json")

The loaded data should look like this:

str(all.movements)
## 'data.frame':    2646562 obs. of  13 variables:
##  $ player_id : chr  "2225" "2225" "-1" "-1" ...
##  $ lastname  : chr  "Parker" "Parker" "ball" "ball" ...
##  $ firstname : chr  "Tony" "Tony" NA NA ...
##  $ jersey    : chr  "9" "9" NA NA ...
##  $ position  : chr  "G" "G" NA NA ...
##  $ team_id   : num  1.61e+09 1.61e+09 NA NA 1.61e+09 ...
##  $ x_loc     : num  51.7 51.7 52.9 52.9 60.4 ...
##  $ y_loc     : num  40.3 40.3 39.9 39.9 31.8 ...
##  $ radius    : num  0 0 2.5 2.5 0 ...
##  $ game_clock: num  716 716 716 716 716 ...
##  $ shot_clock: num  13.3 13.3 13.3 13.3 13.3 ...
##  $ quarter   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ event.id  : num  2 1 1 2 2 1 2 1 2 1 ...
head(all.movements)
##   player_id lastname firstname jersey position    team_id    x_loc
## 1      2225   Parker      Tony      9        G 1610612759 51.73003
## 2      2225   Parker      Tony      9        G 1610612759 51.73003
## 3        -1     ball      <NA>   <NA>     <NA>         NA 52.88766
## 4        -1     ball      <NA>   <NA>     <NA>         NA 52.88766
## 5    201937    Rubio     Ricky      9        G 1610612750 60.40236
## 6    201937    Rubio     Ricky      9        G 1610612750 60.40236
##      y_loc  radius game_clock shot_clock quarter event.id
## 1 40.31428 0.00000     715.78      13.27       1        2
## 2 40.31428 0.00000     715.78      13.27       1        1
## 3 39.89349 2.49642     715.78      13.27       1        1
## 4 39.89349 2.49642     715.78      13.27       1        2
## 5 31.81802 0.00000     715.78      13.27       1        2
## 6 31.81802 0.00000     715.78      13.27       1        1

You can see that the raw data contains infomation about players and the ball. Every row represents the location of a ball or player at a particular time. Each player is given an ID, a jersey number and position and team-id. They also have an x and y coordinate for each time. There is additionally radius information related to the ball’s movement.

Next, we need to get some metadata for this particular game. The get_pbp function grabs this information from the stats.nba.com website.

gameid = "0021500431"
pbp <- get_pbp(gameid)
str(pbp)
## 'data.frame':    414 obs. of  33 variables:
##  $ GAME_ID                  : Factor w/ 1 level "0021500431": 1 1 1 1 1 1 1 1 1 1 ...
##  $ EVENTNUM                 : Factor w/ 414 levels "0","1","10","100",..: 1 2 99 192 288 368 378 387 404 394 ...
##  $ EVENTMSGTYPE             : Factor w/ 12 levels "1","10","12",..: 3 2 8 5 7 8 5 7 10 1 ...
##  $ EVENTMSGACTIONTYPE       : Factor w/ 36 levels "0","1","10","108",..: 1 1 15 18 1 15 2 1 7 28 ...
##  $ PERIOD                   : Factor w/ 4 levels "1","2","3","4": 1 1 1 1 1 1 1 1 1 1 ...
##  $ WCTIMESTRING             : Factor w/ 94 levels "10:01 PM","10:02 PM",..: 14 14 14 15 15 15 15 15 16 16 ...
##  $ PCTIMESTRING             : Factor w/ 220 levels "0:00","0:01",..: 69 69 66 63 62 61 55 54 47 47 ...
##  $ HOMEDESCRIPTION          : Factor w/ 182 levels "Bjelica BLOCK (1 BLK)",..: NA 39 NA 91 NA NA 78 NA 171 NA ...
##  $ NEUTRALDESCRIPTION       : Factor w/ 0 levels: NA NA NA NA NA NA NA NA NA NA ...
##  $ VISITORDESCRIPTION       : Factor w/ 209 levels "Aldridge 1' Alley Oop Layup (4 PTS) (Duncan 4 AST)",..: NA NA 162 90 94 92 NA 164 NA 160 ...
##  $ SCORE                    : Factor w/ 104 levels "100 - 77","101 - 77",..: NA NA NA NA NA NA NA NA NA 16 ...
##  $ SCOREMARGIN              : Factor w/ 24 levels "-10","-11","-12",..: NA NA NA NA NA NA NA NA NA 11 ...
##  $ PERSON1TYPE              : Factor w/ 6 levels "0","1","2","3",..: 1 5 6 5 6 6 5 6 5 6 ...
##  $ PLAYER1_ID               : Factor w/ 26 levels "0","1495","1610612750",..: 1 6 22 21 14 14 23 22 6 22 ...
##  $ PLAYER1_NAME             : Factor w/ 23 levels "Adreian Payne",..: NA 9 21 2 10 10 19 21 9 21 ...
##  $ PLAYER1_TEAM_ID          : Factor w/ 2 levels "1610612750","1610612759": NA 1 2 1 2 2 1 2 1 2 ...
##  $ PLAYER1_TEAM_CITY        : Factor w/ 2 levels "Minnesota","San Antonio": NA 1 2 1 2 2 1 2 1 2 ...
##  $ PLAYER1_TEAM_NICKNAME    : Factor w/ 2 levels "Spurs","Timberwolves": NA 2 1 2 1 1 2 1 2 1 ...
##  $ PLAYER1_TEAM_ABBREVIATION: Factor w/ 2 levels "MIN","SAS": NA 1 2 1 2 2 1 2 1 2 ...
##  $ PERSON2TYPE              : Factor w/ 3 levels "0","4","5": 1 3 1 1 1 1 1 1 1 1 ...
##  $ PLAYER2_ID               : Factor w/ 25 levels "0","1495","1626145",..: 1 2 1 1 1 1 1 1 1 1 ...
##  $ PLAYER2_NAME             : Factor w/ 24 levels "Adreian Payne",..: NA 21 NA NA NA NA NA NA NA NA ...
##  $ PLAYER2_TEAM_ID          : Factor w/ 2 levels "1610612750","1610612759": NA 2 NA NA NA NA NA NA NA NA ...
##  $ PLAYER2_TEAM_CITY        : Factor w/ 2 levels "Minnesota","San Antonio": NA 2 NA NA NA NA NA NA NA NA ...
##  $ PLAYER2_TEAM_NICKNAME    : Factor w/ 2 levels "Spurs","Timberwolves": NA 1 NA NA NA NA NA NA NA NA ...
##  $ PLAYER2_TEAM_ABBREVIATION: Factor w/ 2 levels "MIN","SAS": NA 2 NA NA NA NA NA NA NA NA ...
##  $ PERSON3TYPE              : Factor w/ 4 levels "0","1","4","5": 1 4 1 4 1 1 1 1 2 1 ...
##  $ PLAYER3_ID               : Factor w/ 9 levels "0","1495","200746",..: 1 4 1 6 1 1 1 1 1 1 ...
##  $ PLAYER3_NAME             : Factor w/ 8 levels "Danny Green",..: NA 1 NA 4 NA NA NA NA NA NA ...
##  $ PLAYER3_TEAM_ID          : Factor w/ 2 levels "1610612750","1610612759": NA 2 NA 2 NA NA NA NA NA NA ...
##  $ PLAYER3_TEAM_CITY        : Factor w/ 2 levels "Minnesota","San Antonio": NA 2 NA 2 NA NA NA NA NA NA ...
##  $ PLAYER3_TEAM_NICKNAME    : Factor w/ 2 levels "Spurs","Timberwolves": NA 1 NA 1 NA NA NA NA NA NA ...
##  $ PLAYER3_TEAM_ABBREVIATION: Factor w/ 2 levels "MIN","SAS": NA 2 NA 2 NA NA NA NA NA NA ...

As you can see, this contains loads of stuff that we don’t need. Let’s just keep what we want:

pbp <- pbp[-1,] #first row is NAs
colnames(pbp)[2] <- c('event.id') #will use this to merge on all.movements df
pbp0 <- pbp %>% select (event.id,EVENTMSGTYPE,EVENTMSGACTIONTYPE,SCORE)
pbp0$event.id <- as.numeric(levels(pbp0$event.id))[pbp0$event.id]
head(pbp0)
##   event.id EVENTMSGTYPE EVENTMSGACTIONTYPE SCORE
## 2        1           10                  0  <NA>
## 3        2            5                 45  <NA>
## 4        3            2                  5  <NA>
## 5        4            4                  0  <NA>
## 6        5            5                 45  <NA>
## 7        6            2                  1  <NA>
tail(pbp0)
##     event.id EVENTMSGTYPE EVENTMSGACTIONTYPE    SCORE
## 409      477            6                  1     <NA>
## 410      478            3                 11 107 - 83
## 411      480            3                 12 108 - 83
## 412      482            2                  1     <NA>
## 413      483            4                  0     <NA>
## 414      484           13                  0 108 - 83

Each event.id refers to a different play in the game. Now we can merge the two files:

all.movements <- merge(x = all.movements, y = pbp0, by = "event.id", all.x = TRUE)
dim(all.movements)
## [1] 2646562      16

This is a lot of information! Over 2.6 millions rows. Let’s pick one event id to visualize. Here I will use the same one as Rajiv’s example - eventid=303.

id303 <- all.movements[which(all.movements$event.id == 303),]
dim(id303)
## [1] 10890    16
head(id303)
##         event.id player_id lastname firstname jersey position    team_id
## 1644741      303        -1     ball      <NA>   <NA>     <NA>         NA
## 1644742      303    203937 Anderson      Kyle      1        F 1610612759
## 1644743      303    201937    Rubio     Ricky      9        G 1610612750
## 1644744      303    201988    Mills     Patty      8        G 1610612759
## 1644745      303    203952  Wiggins    Andrew     22      G-F 1610612750
## 1644746      303    203937 Anderson      Kyle      1        F 1610612759
##            x_loc    y_loc   radius game_clock shot_clock quarter
## 1644741  5.43835 24.73073 10.63683     359.75       5.49       3
## 1644742 65.31054 22.12468  0.00000     346.42      19.03       3
## 1644743 46.60167 20.00475  0.00000     376.60      22.70       3
## 1644744 38.77574 21.41917  0.00000     359.40      23.71       3
## 1644745 11.18441 34.04307  0.00000     359.40      23.69       3
## 1644746  8.62043  2.05544  0.00000     364.39       7.11       3
##         EVENTMSGTYPE EVENTMSGACTIONTYPE   SCORE
## 1644741            1                 98 67 - 48
## 1644742            1                 98 67 - 48
## 1644743            1                 98 67 - 48
## 1644744            1                 98 67 - 48
## 1644745            1                 98 67 - 48
## 1644746            1                 98 67 - 48

Wow! So much information - this event along as over 10 thousand rows !

Making a static plot

The first thing to notice is that there are 850 different timepoints in this particular

length(table(id303$game_clock))
## [1] 850

We will extract data to make several dataframes. We will use time 361.15. A different function is used to get the data for the ball, players and the convex hull.

Players

This shows the x and y locations of the 10 players on the court at this time.

playerdf <- player_position1(df=id303, eventid=303,gameclock=361.15) 
playerdf
##            ID        X        Y jersey
## 1  1610612750  6.56514 24.39048     32
## 2  1610612750 13.67987 34.34244     22
## 3  1610612759 16.08250 25.92709     30
## 4  1610612759 20.31601 46.23162     33
## 5  1610612750 15.89886 20.33154      9
## 6  1610612759 24.47131 10.80858      8
## 7  1610612750 11.03182  7.93721      8
## 8  1610612759  7.59973 16.73085     20
## 9  1610612750 20.11542 29.46991      5
## 10 1610612759 14.81859  2.42889      1

Convex Hull Plot

This function grabs the points required for plotting the convex hull of the player’s location. See Rajiv’s website for more info.

chulldf <- chull_plot(df=id303, eventid=303, gameclock=361.15)
chulldf
##    ID        X        Y
## 1   1 11.03182  7.93721
## 2   1  6.56514 24.39048
## 3   1 13.67987 34.34244
## 4   1 20.11542 29.46991
## 5   1 11.03182  7.93721
## 6   2 24.47131 10.80858
## 7   2 14.81859  2.42889
## 8   2  7.59973 16.73085
## 9   2 20.31601 46.23162
## 10  2 24.47131 10.80858

Ball Info

This simply gets the ball information from the id303 dataframe. Notice this dataframe also includes ID and jersey variables (NA). This is so we can bind together with the player location dataframe later.

ballposdf <- ball_position1(df=id303, eventid=303, gameclock=361.15)
ballposdf
##   ID       X        Y jersey
## 1 NA 7.80244 17.77459   <NA>

Plotting

Next these pieces of information are plotted using ggplot by overlaying it on a geom made by Ed Kupfer.

fullcourt() + 
  geom_point(data=playerdf,aes(x=X,y=Y,group=ID,color=factor(ID)),size=6) +       #players
  geom_text(data=playerdf,aes(x=X,y=Y,group=ID,label=jersey),color='black') +     #jersey number
  geom_polygon(data=chulldf,aes(x=X,y=Y,group=ID,fill=factor(ID)),alpha = 0.2) +  #convex hull
  geom_point(data=ballposdf,aes(x=X,y=Y),color='darkorange',size=3) +             #ball
  scale_color_manual(values=c("lightsteelblue2","orangered2")) +
  scale_fill_manual(values=c("lightsteelblue2","orangered2")) +
  theme(legend.position="none")

Making a dynamic plot

Because the above plot was made using ggplot, it is relatively trivial to make an animated plot by creating the same plot for every time unit and compiling together. This could be done in several ways - I will use gganimate for ease.

First, I will get the 3 dataframes (player position, convex hull info, ball position) for every time unit of play 303. I will then put everything into one big dataframe with an extra variable indicating which time unit (timebin) those observations belong to. To do this, we need to make sure that all 3 dataframes (player position, convex hull info, ball position) need to have the same variables. All that is required is to add a jersey number = NA to the convex hull dataframe. Lastly, we will add a variable that indicates what dataframe the observations originally come from - this is so we can only plot those data for each step of the ggplot.

Step 1. Store all the clocktimes for this event in ascending order. We will grab the player and ball positions for every unique clock time:

clocktimes= rev(sort(unique(id303$game_clock)))

Step 2. Now use a for loop to grab player positions, convex hull info and ball position for each time unit.

 fulldf=list()

 for(i in seq_along(clocktimes)){
  
  dplayer <- player_position1(df=id303, 303,clocktimes[i]) #Gets positions of players
  dchull <- chull_plot(df=id303, 303,clocktimes[i])       #Gets area of convex hull
  ballpos <- ball_position1(df=id303, 303,clocktimes[i])  #Gets position of ball
  dchull$jersey = "NA"
  dplayer$valx = 'player'
  dchull$valx = 'hull'
  ballpos$valx  = 'ball'
  fulldf[[i]] = rbind(dplayer,dchull,ballpos)
}

 length(fulldf)  #850 elements
## [1] 850
 fulldf = Map(cbind,fulldf,timebin=1:length(fulldf))  #add time unit 

A few things to notice… The number of rows of this dataframe are not all the same.

table(lapply(fulldf,nrow) %>% unlist)
## 
##   20   21   22   23  216 1397 
##   45  320  334  149    1    1

This is primarily because the convex plot sometimes has a different number of rows. However, two timebins have ridiculously large number of observations…

which(lapply(fulldf,nrow) %>% unlist   > 23)
## [1]   1 464

The timeunit = 1 is when the game clock is stopped as the ball is entering play. The players are running all over the place. I will not plot this here. 464 actually refers to the end of this particular play and similarly the ball and tracking data picks up on a lot of different info. Therefore, I will actually just plot the events 2-463.

playdf = data.table::rbindlist(fulldf)
playdf2 = playdf %>% filter(timebin!=1) %>% filter(timebin<464)

Now we can use the gganimate syntax to create a ggplot object. We add a “frame=” argument, telling it that we want to compile separate images over the timebin variable. We store this as “p”.

p = fullcourt() + 
  geom_point(data=playdf2 %>% filter(valx=="player"),aes(x=X,y=Y,group=ID,color=factor(ID),frame=timebin),size=6) +
  geom_text(data=playdf2 %>% filter(valx=="player"),aes(x=X,y=Y,group=ID,frame=timebin,label=jersey),color='black') +
  geom_polygon(data=playdf2 %>% filter(valx=="hull"),aes(x=X,y=Y,group=ID,fill=factor(ID),frame=timebin),alpha = 0.2) + 
  geom_point(data=playdf2 %>% filter(valx=="ball"),aes(x=X,y=Y,frame=timebin),color='darkorange',size=3) +
  scale_color_manual(values=c("lightsteelblue2","orangered2")) +
  scale_fill_manual(values=c("lightsteelblue2","orangered2")) +
  theme(legend.position="none")

Next, we use gganimate to save the file as a gif to our working directory. You could tweak the interval parameter to improve the flow. This might will take some time depending on your machine’s umph:

gg_animate(p, "nbaplot.gif", title_frame =F, ani.width = 600, ani.height = 450, interval=0.1)