Data Science MD August Recap: Sports Analytics Meetup

For August's meetup, Data Science MD hosted a discussion on one of the most popular fields of analytics: the wide world of sports. From Moneyball to the MIT Sloan Sports Analytics Conference, researchers, team owners, and athletes have shown great interest in sports analytics. Enthusiastic fans eagerly pore over recent statistics and crunch the numbers to see just how well their favorite team will do this season.

One issue that sports teams must deal with is the reselling of tickets to different events. Joshua Brickman, Director of Ticket Analytics, led off the night by discussing how the Washington Wizards are addressing secondary markets such as StubHub. One of the major initiatives taking place is a joint venture between Ticketmaster and the NBA to create a unified ticket exchange for all teams. Tickets, like most items, operate on a free market, where customers are free to purchase from whomever they choose.

brickman-slide-1

Joshua went on to explain that teams could either try to beat the secondary markets by limiting printing, changing fee structures, and offering guarantees, or they could instead take advantage of the transaction data received each week from the league across secondary markets.

brickman-slide-2

Joshua outlined the problems with this data: it covered only the Wizards, it arrived only weekly, and it did not account for dynamic pricing changes. So instead they built their own models and queries to create heat maps. The first heat map shows the inventory sold. For this particular example, the Wizards had a sold-out game.

brickman-slide-3

Possibly of more importance was the heat map showing the premium at which tickets sold on the secondary market. In certain cases, the prices were actually lower than face value.

brickman-slide-4
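
The premium calculation behind such a heat map can be sketched in a few lines of Python. This is only an illustration, not the Wizards' actual queries: the section numbers, prices, and the `premium_by_section` helper are all hypothetical.

```python
from collections import defaultdict

# Hypothetical secondary-market transactions: (section, face_value, resale_price).
transactions = [
    ("101", 100.0, 150.0),
    ("101", 100.0, 130.0),
    ("215", 60.0, 45.0),
    ("215", 60.0, 57.0),
    ("402", 25.0, 25.0),
]


def premium_by_section(rows):
    """Return the average resale premium over face value for each section.

    A premium of 0.40 means tickets resold for 40% above face value;
    a negative value means they resold below face value.
    """
    totals = defaultdict(lambda: [0.0, 0])  # section -> [sum of premiums, count]
    for section, face, resale in rows:
        totals[section][0] += (resale - face) / face
        totals[section][1] += 1
    return {section: total / count for section, (total, count) in totals.items()}


premiums = premium_by_section(transactions)
for section, premium in sorted(premiums.items()):
    print(f"Section {section}: {premium:+.0%}")
```

Each section's average premium would then drive the color of that section on the arena map, with negative values marking seats that resold below face value.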

As with most data science products, the visualization of the results is extremely important. Joshua explained that the graphical heat maps make the data easily digestible for sales staff and directors, and supplement their numerical tracking. Their current process combines SQL queries with hand-drawn shape files. Joshua also explained how they can track secondary markets and calculate current dynamic prices to spot discrepancies.

brickman-slide-5

Joshua ended by describing how future work could incorporate historical data and current secondary market prices to adjust pricing so that it more closely reflects current conditions.

brickman-slide-6

Our next speaker for the night was Tom Duncan, the Player Information Analyst for our very own Baltimore Orioles. Tom began by describing how PITCHf/x is installed in every major league stadium and provides teams with information on the location, velocity, and movement of every pitch. Using heat maps, Tom showed how the strike zone changed between 2009 and 2013.

duncan-slide-1

Tom then described the R code necessary to generate the heat maps.

duncan-slide-2
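
Tom's R code is not reproduced in this recap. As a rough illustration of the idea, here is a Python sketch (not R, and not Tom's code) that bins called pitches into a grid and computes the fraction called strikes in each cell, which is the quantity a strike zone heat map colors. The pitch coordinates, the 0.5 ft cell size, and the `strike_rate_grid` helper are assumptions made for the example.

```python
import math
from collections import defaultdict

# Hypothetical called pitches: (horizontal ft, vertical ft, called_strike).
pitches = [
    (-0.2, 2.5, True),
    (0.1, 2.6, True),
    (0.9, 2.4, False),
    (0.8, 2.3, True),
    (-1.4, 1.2, False),
    (0.2, 3.9, False),
]


def strike_rate_grid(rows, cell=0.5):
    """Bin pitches into square cells and return the called-strike rate per cell."""
    counts = defaultdict(lambda: [0, 0])  # cell index -> [strikes, total]
    for x, z, is_strike in rows:
        key = (math.floor(x / cell), math.floor(z / cell))
        counts[key][0] += int(is_strike)
        counts[key][1] += 1
    return {key: strikes / total for key, (strikes, total) in counts.items()}


grid = strike_rate_grid(pitches)
```

With a full season of pitches, cells near the middle of the zone approach a rate of 1, cells far outside approach 0, and the interesting region is the band in between.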

Since different batters have different strike zones, locations needed to be rescaled to define new boundaries. For instance, Jose Altuve, who is 5'5", has a relative Z location that is shifted slightly higher.

duncan-slide-3
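
One simple way to do this kind of rescaling is a linear map from each batter's own zone onto a common reference zone. The sketch below assumes that approach; the exact method Tom used is not stated here, and the reference zone of 1.5 to 3.5 ft, the batter zone limits, and the `rescale_height` helper are hypothetical.

```python
def rescale_height(pz, sz_bot, sz_top, ref_bot=1.5, ref_top=3.5):
    """Map a pitch height from one batter's strike zone onto a reference zone.

    The pitch keeps its relative position: a pitch at the top of the
    batter's own zone lands at the top of the reference zone.
    """
    if sz_top <= sz_bot:
        raise ValueError("sz_top must be greater than sz_bot")
    relative = (pz - sz_bot) / (sz_top - sz_bot)  # 0 = bottom edge, 1 = top edge
    return ref_bot + relative * (ref_top - ref_bot)


# Hypothetical zones: the same 2.0 ft pitch against a shorter and a taller batter.
short_batter = rescale_height(2.0, sz_bot=1.4, sz_top=3.0)
tall_batter = rescale_height(2.0, sz_bot=1.7, sz_top=3.7)
```

Under these made-up zone limits, the same physical pitch sits relatively higher in the shorter batter's zone than in the taller batter's, which is the effect described for Altuve.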

Tom then went on to describe the impact that home plate umpires have on the game. On average, 155 pitches are called per game, with 15 being within one inch of the strike zone, and 31 being within two inches. With a game sometimes being determined by a single pitch, the decisions that a home plate umpire makes are very important. A given pitch is worth approximately 0.13 runs.

duncan-slide-4
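
To put those numbers in perspective, the short calculation below works out what share of called pitches are borderline and how many runs ride on them. It assumes, as a simplification not claimed in the talk, that every borderline pitch carries the full 0.13-run value, so the run totals are rough upper bounds.

```python
# Figures quoted in the talk.
CALLED_PITCHES_PER_GAME = 155
WITHIN_ONE_INCH = 15
WITHIN_TWO_INCHES = 31
RUNS_PER_PITCH = 0.13

# Share of all called pitches that fall near the edge of the zone.
share_one_inch = WITHIN_ONE_INCH / CALLED_PITCHES_PER_GAME
share_two_inches = WITHIN_TWO_INCHES / CALLED_PITCHES_PER_GAME

# Assumption: every borderline call is worth the full per-pitch run value.
runs_one_inch = WITHIN_ONE_INCH * RUNS_PER_PITCH
runs_two_inches = WITHIN_TWO_INCHES * RUNS_PER_PITCH

print(f"Within one inch: {share_one_inch:.1%} of calls, {runs_one_inch:.2f} runs")
print(f"Within two inches: {share_two_inches:.1%} of calls, {runs_two_inches:.2f} runs")
```

Roughly one called pitch in ten is within an inch of the zone and one in five is within two inches, which under this assumption puts about 1.95 and 4.03 runs per game in the umpire's hands.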

Next, Tom showed various heat map comparisons that highlighted differences between umpires, batters, and counts. One of the most surprising results was the difference between the batter facing an 0-2 count and a 3-0 count. I suggest readers look at all the slides to see the other interesting results.

duncan-slide-5

While the heat maps provide a lot of useful information, it is sometimes interesting to look at certain pitches of interest. By linking to video clips, Tom demonstrated how an interactive strike scatter plot could be created. Please view the video to see this demonstration.

duncan-slide-6

Tom concluded by saying that PITCHf/x is very powerful, and yes, umpires have a very difficult job!

Slides can be found here for Joshua and here for Tom.

The video for the event is below:

[youtube=http://www.youtube.com/playlist?list=PLgqwinaq-u-NLZhSml9VHXVgHEh6l7wQH]

Data Science MD July Recap: Python and R Meetup

For July's meetup, Data Science MD was honored to have Jonathan Street of NIH and Brian Godsey of RedOwl Analytics come discuss using Python and R for data analysis.

Jonathan started off by describing the growing ecosystem of Python data analysis tools, including NumPy, Matplotlib, and pandas.

He next walked through a brief example demonstrating NumPy, pandas, and Matplotlib, which he made available through the IPython notebook viewer.

The second half of Jonathan's talk focused on the problem of using clustering to identify scientific articles of interest. He needed to (a) convert PDF to text, (b) extract sections of the document, (c) cluster the articles, and (d) retrieve new material.

Jonathan used the PyPDF library for PDF conversion and then used the NLTK library for text processing. For a thorough discussion of NLTK, please see Data Community DC's multi-part series written by Ben Bengfort.

Clustering was done using scikit-learn, which identified seven groups of articles. From these, Jonathan was then able to retrieve other similar articles to read.
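
The retrieval step can be illustrated without any third-party packages. The sketch below is not Jonathan's code and does not use scikit-learn or perform clustering: it swaps in plain TF-IDF vectors and cosine similarity from the standard library to find the article closest to a given one. The article snippets and helper names are hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical article snippets standing in for extracted PDF text.
articles = {
    "a1": "protein folding simulation with molecular dynamics",
    "a2": "molecular dynamics simulation of protein binding",
    "a3": "deep learning for image classification",
    "a4": "convolutional networks improve image classification",
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(docs):
    """Return {doc_id: {term: tf-idf weight}} for a dict of raw texts."""
    tokens = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    doc_freq = Counter(term for words in tokens.values() for term in set(words))
    n_docs = len(docs)
    vectors = {}
    for doc_id, words in tokens.items():
        counts = Counter(words)
        vectors[doc_id] = {
            term: (count / len(words)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse term-weight dictionaries."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def most_similar(doc_id, vectors):
    """Return the id of the other document closest to doc_id."""
    others = (other for other in vectors if other != doc_id)
    return max(others, key=lambda other: cosine(vectors[doc_id], vectors[other]))


vectors = tfidf_vectors(articles)
print(most_similar("a1", vectors))
```

In a fuller pipeline, the same vectors would be the input to a clustering algorithm, and new articles would be compared against the clusters rather than against a single document.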

Overall, by combining several Python packages to handle text conversion, text processing, and clustering, Jonathan was able to create an automated personalized scientific recommendation system. Please see the Data Community DC posts on Python data analysis tutorials and Python data tools for more information.

Next to speak was Brian Godsey of RedOwl Analytics, who presented their work on social network analysis. He first described the problem of identifying misbehavior in a financial firm. Their goals are to detect patterns in employee behavior, measure differences between various types of people, and ultimately find anomalies in behavior.

In order to find these anomalies, they model behavior based on patterns in communications and estimate model parameters from a data set and set of effects.
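
The recap does not describe RedOwl's model in detail, so the sketch below swaps in a much simpler stand-in to show what finding an anomaly in communication patterns can mean: it flags weeks in which one employee's message count sits far from that employee's own average. The counts, the threshold of 2 standard deviations, and the `flag_anomalies` helper are all hypothetical.

```python
import statistics

# Hypothetical weekly counts of messages sent by one employee.
weekly_messages = [42, 38, 45, 40, 41, 39, 44, 95]


def flag_anomalies(counts, threshold=2.0):
    """Return indices of weeks whose count is far from the employee's own mean."""
    mean = statistics.mean(counts)
    spread = statistics.pstdev(counts)
    if spread == 0:
        return []  # perfectly constant behavior has nothing to flag
    return [i for i, c in enumerate(counts) if abs(c - mean) / spread > threshold]


anomalous_weeks = flag_anomalies(weekly_messages)
print(anomalous_weeks)
```

A real behavioral model would condition on who is communicating with whom and on other covariates, rather than on a single count per week.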

Brian then revealed that while implementing their solution they developed an R package called rRevelation that allows a user to import data sets, create covariates, specify a model's behavioral parameters, and estimate the parameter values.

To conclude his presentation, Brian demonstrated the package on the well-known Enron data set and discussed how larger data sets require other technologies such as MapReduce.

  http://www.youtube.com/watch?v=1rFlqD3prKE&list=PLgqwinaq-u-Piu9tTz58e7-5IwYX_e6gh

Slides can be found here for Jonathan and here for Brian.

Data Science MD Discusses Health IT at Shady Grove

For its June meetup, Data Science MD explored a new venue, The Universities at Shady Grove. And what better way to venture into Montgomery County than to spend an evening discussing one of its leading sectors. That's right, an event all about healthcare. And we tackled it from two different sides.

http://www.youtube.com/playlist?v=PLgqwinaq-u-Ob2qS9Rt8uCXeKtT830A7e&w=640&h=385

The night started with a presentation from Gavin O'Brien of NIST's National Cybersecurity Center of Excellence. He spoke about creating a secure mobile Health IT platform that would allow doctors and nurses to share relevant pieces of information securely while following all guidelines and policies governing how health data must be handled. Gavin's presentation focused on securing 802.11 links rather than cellular or other types of wireless links, as this is a good first step and is immediately practical when deployed within a single building such as a hospital. Gavin discussed the technological challenges, from encrypting data during transmission, so that it is not sent in the clear where it can be intercepted, to creating Access Control Lists so that only the correct people see a patient's data.

As his talk progressed, one thought was constantly in the back of my mind: how can this architecture provide the protection that the policies stipulate while still allowing the data to be distributed so that analytics can be run on it? For instance, a hospital should be interested in trends among the patients in its care: whether patients developed complications after receiving the same family of drugs or a specific drug (perhaps from the same batch), when patients have the most problems and therefore require the most attention, and when a bacterium or virus may be loose in the hospital, further complicating patients' ailments. The architecture may allow these types of analytics, but they were not specifically discussed during Gavin's presentation. If you have any ideas about how a compliant architecture can support these analytics, or about potential problems with running them, please leave a comment on this post.

The final speaker of the night was Uma Ahluwalia, the Director of Health and Human Services for Montgomery County. Uma spoke about the various avenues that county residents have for reporting problems, and noted that their needs often cross many different segments of health and human services, which usually requires their stories to be retold each time. According to her vision, a resident or patient could report a problem to any one of six segments, and all of the segments could then see the information without the patient having to repeat the story. One big problem with this solution is that data would be shared across many groups, giving county workers access to more information than health regulations allow. However, Montgomery County sees each segment as part of one organization, and therefore the data can be shared internally among all employees within that organization. While this should reduce the time patients spend retelling their stories, it still does not provide an open platform for data scientists. However, Uma also had a potential solution to that problem: volunteers. Volunteers can sign non-disclosure agreements that give them access to patient data to help create useful analytics, thereby opening the problem space to many more minds in the hope of creating truly revolutionary analytics. Perhaps you will be the next great mind that unlocks the meaning behind a current social issue.

Finally, Data Science MD needs to acknowledge a few key people and groups that contributed to this meetup. Mary Lowe and Melissa Marquez from the Universities at Shady Grove were instrumental in making this happen, helping to secure the room and providing the food and A/V setup. Dan Hoffman, the Chief Innovation Officer for Montgomery County, also provided a great deal of support. Finally, John Rumble, a DSMD member, took the lead in getting DSMD beyond the Baltimore/Columbia corridor. Thanks so much to all of these key people.

If you want to catch up on previous meetups, please check out our YouTube channel.

Please check out our July meetup, where we discuss analysis techniques in Python and R at Betamore.

Data Science MD Unveils YouTube Channel

[youtube http://www.youtube.com/watch?v=videoseries?list=PLgqwinaq-u-OjuL89qqV4lto2PG3QwjbM&w=600&h=360]

Data Science MD, in an effort to provide additional value to its members, has started a YouTube channel, DataScienceMD, to host videos of talks presented at Meetup events. Now, when members can't attend an event due to a scheduling conflict or being out of town, they can view the videos after the fact to stay in the loop. However, we know the more likely scenario: seeing the talks in person will not be enough and you will want to see them again and again. (It's OK, we won't tell the presenters how often you are watching them.)

The presentations are available in two formats: individual video entries that each cover one specific presentation, and playlists that group all presentations from an event into one package, making it easy to relive the event in its entirety. The default view when first visiting the channel shows the most recent activity. Clicking the Videos link just below the channel title shows individual presentations. To see the playlists, simply change the Uploads box to Playlists.

The playlist above is from our May Meetup, which featured Cloudera consultants Joey Echeverria and Sean Busbey discussing an infrastructure option that can make analyzing Twitter data quick and simple, as well as an introduction to one of the many features of Apache Mahout. These were not just static presentations; they also included live demonstrations and queries against data stored within the infrastructure, and it was all captured in the videos. Check them out!

Data Community DC Video Series Kicks Off: Dr. Jesse English Talks NLP and Text Processing

We are excited to announce the first in a new series of posts and a brand new initiative: Data Community DC Videos! We are going to film and publish online videos (and separate audio, resources permitting) of as many talks from Data Community DC meetups as possible. Yes, we want you to experience the events in person, but we realize that not everyone who wants to be a part of our community can attend every single event.

To kick this off, we have a fantastic video of Dr. Jesse English passionately discussing a brand new, open source framework, WIMs (Weakly Inferred Meanings), a novel approach to creating structured meaning representations for semantic analyses. Whereas a TMR (text meaning representation) requires a large, domain-specific knowledge base and significant computation time, WIMs cover a limited scope of possible relationships. The limitation is intentional and allows for better performance, but WIMs still carry enough relationships for most applications. Additionally, the creation of a bespoke knowledge base and microtheory is not required; the novel pattern matching technique means that available ontologies like WordNet provide enough coverage. WIMs are open source and available now, and are truly a breakthrough in semantic processing.

Dr. Jesse English holds a PhD in computer science from UMBC and has specialized in natural language processing, machine learning, and machine reading. As the Chief Science Officer at Unbound Concepts, Jesse has focused on the automatic extraction of semantically rich meaning from literature and the application of that knowledge to the company's big-data-driven machine learning algorithm. Before his work at Unbound Concepts, Jesse worked as a research associate at UMBC, focusing on automatically bridging the knowledge acquisition bottleneck through machine reading, as well as developing agent-based conversation systems.

[vimeo 61955058]