
Data Science MD August Recap: Sports Analytics Meetup

For August's meetup, Data Science MD hosted a discussion on one of the most popular fields of analytics, the wide world of sports. From Moneyball to the MIT Sloan Sports Analytics Conference, researchers, team owners, and athletes have shown strong interest in sports analytics. Enthusiastic fans eagerly pore over recent statistics and crunch the numbers to see just how well their favorite team will do this season.

One issue that sports teams must deal with is the reselling of tickets to different events. Joshua Brickman, the Director of Ticket Analytics for , led off the night by discussing how the Washington Wizards are addressing secondary markets such as StubHub. One of the major initiatives under way is a joint venture between Ticketmaster and the NBA to create a unified ticket exchange for all teams. Tickets, like most items, trade on a free market, where customers are free to purchase from whomever they choose.

Joshua went on to explain that teams could either try to beat the secondary markets by limiting printing, changing fee structures, and offering guarantees, or they could instead take advantage of the transaction data received each week from the league across secondary markets.


Joshua outlined the problems with the league data: it covered only the Wizards, it arrived only weekly, and it did not account for dynamic pricing changes. So instead the team built its own models and queries to create heat maps. The first heat map shows the inventory sold. In this particular example, the Wizards game was sold out.


Possibly more important was the heat map showing the premium at which tickets sold on the secondary market. In certain cases, the prices were actually lower than face value.
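To make the idea of a resale premium concrete, here is a minimal Python sketch that averages the premium over face value for each section. It is not the Wizards' actual model or query. The section numbers, prices, and the `premium_by_section` helper are all hypothetical, chosen only to show how a section can come out above or below face value.

```python
from collections import defaultdict

# Hypothetical resale records: (section, face_value, resale_price) in dollars.
sales = [
    ("101", 100.0, 150.0),
    ("101", 100.0, 130.0),
    ("415", 40.0, 30.0),
    ("415", 40.0, 34.0),
]

def premium_by_section(records):
    """Return the average resale premium per section as a fraction of face value."""
    totals = defaultdict(lambda: [0.0, 0])  # section -> [sum of premiums, count]
    for section, face, resale in records:
        totals[section][0] += (resale - face) / face
        totals[section][1] += 1
    return {section: total / count for section, (total, count) in totals.items()}

premiums = premium_by_section(sales)
for section, premium in sorted(premiums.items()):
    # A negative premium means tickets resold below face value.
    print(f"Section {section}: {premium:+.0%}")
```

Each section's average could then be mapped to a color on an arena diagram to produce a heat map like the one described.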


As with most data science products, the visualization of the results is extremely important. Joshua explained that the graphical heat maps make the data easily digestible for sales staff and directors, and supplement their numerical tracking. Their current process involves combining SQL queries with hand-drawn shape files. Joshua also explained how they can track secondary markets and calculate current dynamic prices to spot discrepancies.


Joshua ended by describing how future work could incorporate historical data and current secondary market prices to adjust pricing so that it more closely reflects current conditions.


Our next speaker for the night was , the Player Information Analyst for our very own Baltimore Orioles. Tom began by describing how PITCHf/x is installed in every major league stadium and provides teams with information on the location, velocity, and movement of every pitch. Using heat maps, Tom was able to show how the strike zone changed between 2009 and 2013.


Tom then described the R code necessary to generate the heat maps.
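The R code from the talk is not reproduced here. As a rough illustration of the underlying idea only, the Python sketch below bins called pitches by location and computes the share called strikes in each cell. The field names `px` and `pz`, the grid bounds, the bin size, and the sample pitches are all assumptions made for illustration.

```python
# Hypothetical called pitches: (px, pz, called_strike), with px and pz in feet.
pitches = [
    (0.1, 2.5, True),
    (0.2, 2.6, True),
    (0.3, 2.7, False),
    (-1.2, 2.5, False),
    (0.0, 4.4, False),
]

def strike_rate_grid(pitches, x_range=(-2.0, 2.0), z_range=(0.0, 5.0), bin_size=0.5):
    """Bin pitches into a grid and return the called-strike rate of each cell.

    Cells with no pitches are None so they can be drawn as blank.
    """
    n_cols = int((x_range[1] - x_range[0]) / bin_size)
    n_rows = int((z_range[1] - z_range[0]) / bin_size)
    strikes = [[0] * n_cols for _ in range(n_rows)]
    counts = [[0] * n_cols for _ in range(n_rows)]
    for px, pz, called_strike in pitches:
        col = int((px - x_range[0]) // bin_size)
        row = int((pz - z_range[0]) // bin_size)
        # Ignore pitches that fall outside the plotted area.
        if 0 <= col < n_cols and 0 <= row < n_rows:
            counts[row][col] += 1
            strikes[row][col] += int(called_strike)
    return [
        [s / c if c else None for s, c in zip(s_row, c_row)]
        for s_row, c_row in zip(strikes, counts)
    ]

grid = strike_rate_grid(pitches)
```

Coloring each cell by its rate gives a heat map of the called strike zone. With real data, a smoother such as a kernel density estimate would likely replace the fixed bins.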


Since different batters have different strike zones, pitch locations needed to be rescaled relative to each batter's zone boundaries. For instance, Jose Altuve, who is 5'5", has a relative Z location that is shifted slightly higher.
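One simple way to do this kind of rescaling is a linear map from each batter's own zone onto a shared reference zone. The sketch below is an assumption about how it might look, not the method shown in the talk. The parameter names `sz_bot` and `sz_top`, the reference heights of 1.5 and 3.5 feet, and the batter measurements in the example are all hypothetical.

```python
def rescale_height(pz, sz_bot, sz_top, ref_bot=1.5, ref_top=3.5):
    """Map a pitch height onto a common reference strike zone.

    pz, sz_bot, and sz_top are in feet; ref_bot and ref_top are assumed
    reference values for the bottom and top of the shared zone.
    """
    if sz_top <= sz_bot:
        raise ValueError("sz_top must be greater than sz_bot")
    relative = (pz - sz_bot) / (sz_top - sz_bot)  # 0 at zone bottom, 1 at zone top
    return ref_bot + relative * (ref_top - ref_bot)

# The same 3.0 ft pitch against two hypothetical batters.
shorter_batter = rescale_height(3.0, sz_bot=1.4, sz_top=3.0)
taller_batter = rescale_height(3.0, sz_bot=1.7, sz_top=3.7)
print(f"shorter batter: {shorter_batter:.2f} ft, taller batter: {taller_batter:.2f} ft")
```

In this sketch the same pitch lands at the top of the reference zone for the shorter batter and near the middle for the taller one, which is the kind of upward shift described above.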


Tom then went on to describe the impact that home plate umpires have on the game. On average, 155 pitches are called per game, with 15 being within one inch of the strike zone and 31 being within two inches. With a game sometimes being determined by a single pitch, the decisions that a home plate umpire makes are very important. A given pitch is worth approximately 0.13 runs.
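A quick back-of-the-envelope calculation puts these figures in proportion. The counts and the 0.13 runs value come from the talk. Multiplying that value by the number of borderline pitches is only an illustrative assumption, since it treats every close call as worth the full amount.

```python
called_pitches = 155      # average called pitches per game (from the talk)
within_one_inch = 15      # called pitches within one inch of the strike zone
within_two_inches = 31    # called pitches within two inches of the strike zone
runs_per_pitch = 0.13     # approximate value of a single pitch (from the talk)

share_one_inch = within_one_inch / called_pitches
share_two_inches = within_two_inches / called_pitches

# Assumption: every borderline call carries the full 0.13 runs.
runs_at_stake = within_two_inches * runs_per_pitch

print(f"{share_one_inch:.1%} of called pitches are within one inch")
print(f"{share_two_inches:.1%} of called pitches are within two inches")
print(f"Roughly {runs_at_stake:.2f} runs per game ride on the two-inch calls")
```

Under that assumption, about one called pitch in five is a close call, and together those calls account for roughly four runs per game.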


Next, Tom showed various heat map comparisons that highlighted differences between umpires, batters, and counts. One of the most surprising results was the difference between a batter facing an 0-2 count and one facing a 3-0 count. I suggest readers look at all the slides to see the other interesting results.


While the heat maps provide a lot of useful information, it is sometimes worth examining individual pitches of interest. By linking to video clips, Tom demonstrated how an interactive strike scatter plot could be created. Please view the video to see this demonstration.


Tom concluded by saying that PITCHf/x is very powerful, and yes, umpires have a very difficult job!

Slides can be found here for Joshua and here for Tom.

The video for the event is below:




Weekly Round-Up: Sports Analytics, LinkedIn, Kaggle Connect, and Drug Side Effects

Welcome back to the round-up, an overview of the most interesting data science, statistics, and analytics articles of the past week. This week, we have four fascinating articles ranging in topic from sports analytics to drug side effects. In this week's round-up:

  • How Numbers Can Reveal Hidden Truths About Sports
  • How and Why LinkedIn is Becoming an Engineering Powerhouse
  • Introducing Kaggle Connect: Data Science Consulting via Kaggle
  • Unreported Drug Side Effects Found Using Internet Search Data

How Numbers Can Reveal Hidden Truths About Sports

This is an article about a sports analytics study done at MIT examining which factors matter in the success of field goals. The study analyzes 11,896 NFL field goal attempts from 2000 through 2011 and debunks some common misconceptions, such as the belief that calling a time-out before the kick to put extra pressure on the kicker will increase the likelihood of a missed field goal. The article also gives a brief history of sports analytics and mentions another interesting study about the value of flexibility in baseball roster construction.

How and Why LinkedIn is Becoming an Engineering Powerhouse

This GigaOM article follows the changes in LinkedIn's data infrastructure over the last five years. This includes setting up the company's Hadoop infrastructure, their Voldemort distributed database, a scheduler for batch processes called Azkaban, a message broker system named Kafka, and the company's new Espresso database. The architecture combines online, offline, and nearline systems that each perform the necessary functions as efficiently as possible and allow the company to continue to scale effectively.

Introducing Kaggle Connect: Data Science Consulting via Kaggle

This is an interesting post from the Kaggle blog introducing the company's new offering, called Kaggle Connect. Connect is a consulting platform that helps match top competitors in Kaggle competitions with companies that need machine learning and predictive analytics projects completed. The post says the intent behind the platform is to create a McKinsey in the Cloud, not a Mechanical Turk for Data Science. The post goes on to describe the platform in more detail and includes a map of where on the planet the Connect participants reside.

Unreported Drug Side Effects Found Using Internet Search Data

This is an interesting NY Times article about how a group of scientists from Microsoft, Stanford, and Columbia were able to detect evidence of unreported prescription drug side effects before the warning system used by the Food and Drug Administration did. They were able to do this by mining data from the search engines of Google, Microsoft, and Yahoo. The article goes on to mention the drugs that were analyzed in the study and provides more details about some of the group's findings.

That's it for this week. Make sure to come back next week when we’ll have some more interesting articles! If there's something we missed, feel free to let us know in the comments below.
