Data Science MD

2013 September DSMD Event: Teaching Data Science to the Masses

For Data Science MD's September meetup, we were very fortunate to have the talented and passionate Dr. Jeff Leek speak about his experiences teaching data science through the online learning platform Coursera. This was also a unique event for DSMD itself because it was the first meetup to feature only one speaker. Having one speaker talk for a whole hour can be a disaster if the speaker is unable to hold the audience's attention. However, Dr. Leek is a very dynamic and engaging speaker and had no problem keeping the attention of everyone in the room, including a couple of middle school students.

For those of you who are not familiar with Dr. Leek, he is a biostatistician at Johns Hopkins University as well as an instructor in the JHU biostatistics program. His biostatistics work typically entails analyzing human genome sequencing data to provide insights to doctors and patients in the form of raw data and advanced visualizations. However, when he is not revolutionizing the medical world or teaching the great biostatisticians of tomorrow at JHU, you can find him teaching his course on Coursera or adding new content to his blog, Simply Statistics.

Now, on to the talk. Johns Hopkins, and specifically Dr. Leek, got involved in teaching a Coursera course because they have constantly been looking for ways to improve learning for their students. They had been "flipping the classroom" by posting lectures to YouTube so that students could review the lecture material before class and then use the classroom time to dig deeper into specific topics. Because online videos are such a vital component of Massive Open Online Courses (MOOCs), it is no surprise that they took the next leap.

But just in case you think that JHU and Dr. Leek are new to this whole data science "thing," check out their Kaggle team's results for the Heritage Health Prize.


Even though their team fell a few places when run on the private data, they still had a very impressive showing considering that 1358 teams entered and there were over 20,000 entries. But what exactly does data science mean to Dr. Leek? Check out his expanded components of data science chart, which differs from similar charts by other data scientists in that it also shows the root disciplines of each component.


But what does the course look like?


He covers topics such as types of analyses, how to organize a data analysis, and data munging, as well as others like:



One of the interesting things to note though is that he also shows examples of poor data analysis attempts. There is a core problem with the statistics example from above (pointed out by high school students). Below is an example of another:


This course, along with two other courses taught by other JHU faculty, Computing for Data Analysis and Mathematical Biostatistics Bootcamp, has had a very positive response.


But how do you teach that many people effectively? That is where the power of Coursera comes in; JHU could have chosen other providers like edX or Udacity but decided to go with Coursera. The videos make it easy to convey knowledge, and message boards provide a mechanism to ask questions. Dr. Leek even had students answering questions for other students so that all he had to do was validate the response. But he also pointed out that his class's message board was just like all other message boards and followed the 1/98/1 rule, where 1% of people respond in a mean way and are unhelpful, 1% of people are very nice and very helpful, and the other 98% don't care and don't really respond.

One of the most distinctive aspects of Coursera is that it scales to tens of thousands of students by using peer grading. Each person grades four different assignments, so everyone submits one answer and grades four others. The final score for each student is the median of the four scores from the other students. The rubric used in Dr. Leek's class is below.


The result of this grading policy, based on Dr. Leek's analysis, is that good students received good grades, poor students received poor grades, and middle students' grades fluctuated a fair amount. So it seems like the policy mostly works, but there is still room for improvement.
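The median-of-four scoring rule described above can be sketched in a few lines of Python. This is only an illustration: the function name, the student labels, and the scores are hypothetical and are not taken from Dr. Leek's course or from Coursera's implementation.

```python
from statistics import median


def peer_grade(scores_by_student):
    """Return each student's final score: the median of the peer scores received."""
    return {
        student: median(scores)
        for student, scores in scores_by_student.items()
    }


# Hypothetical peer scores (four graders per submission), for illustration only.
peer_scores = {
    "student_a": [8, 9, 9, 10],
    "student_b": [2, 7, 8, 8],   # one harsh grader
    "student_c": [5, 5, 6, 10],  # one generous grader
}

final_scores = peer_grade(peer_scores)
print(final_scores)
```

With four graders the median is the average of the two middle scores, so a single unusually harsh or generous grader moves the result much less than it would move a mean.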

But why do Johns Hopkins and Dr. Leek even support this model of learning? They do have full-time jobs that involve teaching, after all. Well, besides being huge supporters of open source technology and open learning, they also see many other reasons for supporting this paradigm.


Check out the video for the many other reasons why JHU further supports this paradigm. And, while you are at it, see if you can figure out if the x and y axes are related in some way. This was our data science/statistics problem for the evening. The answer can also be found in the video.


We also got a sneak peek at a new tool/component that integrates directly into R - swirl. Look for a meetup or blog post about this tool in the future.


Our next meetup is on October 9th in Baltimore, beginning at 6:30 PM. We will have Don Miner speak about using Hadoop for Data Science. If you can make it, come out and join us.

Data Science MD August Recap: Sports Analytics Meetup

For August's meetup, Data Science MD hosted a discussion on one of the most popular fields of analytics, the wide world of sports. From Moneyball to the MIT Sloan Sports Analytics Conference, there has been much interest among researchers, team owners, and athletes in the area of sports analytics. Enthusiastic fans eagerly pore over recent statistics and crunch the numbers to see just how well their favorite sports team will do this season.

One issue that sports teams must deal with is the reselling of tickets to different events. Joshua Brickman, the Director of Ticket Analytics for , led off the night by discussing how the Washington Wizards are addressing secondary markets such as StubHub. One of the major initiatives taking place is a joint venture between Ticketmaster and the NBA to create a unified ticket exchange for all teams. Tickets, like most items, operate on a free market, where customers are free to purchase from whomever they choose.

Joshua went on to explain that teams could either try to beat the secondary markets by limiting printing, changing fee structures, and offering guarantees, or they could instead take advantage of the transaction data received each week from the league across secondary markets.


Joshua outlined that the problem with the data was that it was only for the Wizards, it was only received weekly, and it did not take dynamic pricing changes into consideration. So instead they built their own models and queries to create heat maps. The first heat map shows the inventory sold. For this particular example, the Wizards had a sold-out game.


Possibly of more importance was the heat map showing the premium at which tickets were sold on the secondary market. In certain cases, the prices were actually lower than face value.
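As a rough illustration of the quantity such a heat map would color by, the sketch below computes the average resale premium per seating section as a fraction of face value. It is not the Wizards' process (which the talk described in terms of SQL queries and shape files); the function name, section labels, and prices are all made up for the example.

```python
from collections import defaultdict


def premium_by_section(sales):
    """Average resale premium per section, as a fraction of face value.

    `sales` is an iterable of (section, face_value, resale_price) tuples.
    A negative result means tickets resold below face value.
    """
    totals = defaultdict(lambda: [0.0, 0])  # section -> [sum of premiums, count]
    for section, face_value, resale_price in sales:
        totals[section][0] += (resale_price - face_value) / face_value
        totals[section][1] += 1
    return {section: total / count for section, (total, count) in totals.items()}


# Hypothetical transactions, for illustration only.
sales = [
    ("101", 100.0, 150.0),
    ("101", 100.0, 130.0),
    ("415", 40.0, 30.0),
]

premiums = premium_by_section(sales)
print(premiums)
```

In this made-up data, section 101 resells above face value while section 415 resells below it, which is the kind of discrepancy the premium heat map is meant to surface.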


As with most data science products, the visualization of the results is extremely important. Joshua explained that the graphical heat maps make the data easily digestible for sales staff and directors, and supplement their numerical tracking. Their current process involves combining SQL queries with hand-drawn shape files. Joshua also explained how they can track secondary markets and calculate current dynamic prices to see discrepancies.


Joshua ended by describing how future work could involve incorporating historical data and current secondary market prices to modify pricing to more closely reflect current conditions.


Our next speaker for the night was , the Player Information Analyst for our very own Baltimore Orioles. Tom began by describing how PITCHf/x is installed in every major league stadium and provides teams with information on the location, velocity, and movement of every pitch. Using heat maps, Tom was able to show how the strike zone has changed between 2009 and 2013.


Tom then described the R code necessary to generate the heat maps.
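Tom's R code is not reproduced here. As a loose stand-in, the following Python sketch shows the binning step that sits underneath a strike-zone heat map: called pitches are placed on a grid and the fraction called strikes is computed per cell. The function name, the grid edges, and the sample pitches are assumptions made for illustration.

```python
from bisect import bisect_right


def strike_rate_grid(pitches, x_edges, z_edges):
    """Bin called pitches on a grid and return the called-strike rate per cell.

    `pitches` is an iterable of (px, pz, is_strike) tuples, where px and pz are
    the horizontal and vertical locations in feet. Bins are half-open
    [edge, next_edge); pitches outside the grid are ignored. Cells with no
    pitches are reported as None.
    """
    n_cols, n_rows = len(x_edges) - 1, len(z_edges) - 1
    strikes = [[0] * n_cols for _ in range(n_rows)]
    totals = [[0] * n_cols for _ in range(n_rows)]
    for px, pz, is_strike in pitches:
        col = bisect_right(x_edges, px) - 1
        row = bisect_right(z_edges, pz) - 1
        if 0 <= col < n_cols and 0 <= row < n_rows:
            totals[row][col] += 1
            strikes[row][col] += int(is_strike)
    return [
        [s / t if t else None for s, t in zip(strike_row, total_row)]
        for strike_row, total_row in zip(strikes, totals)
    ]


# Hypothetical called pitches, for illustration only.
pitches = [
    (-0.5, 2.0, True),
    (-0.4, 2.1, False),
    (0.5, 3.0, True),
    (2.0, 2.0, False),  # outside the grid, ignored
]
grid = strike_rate_grid(pitches, x_edges=[-1.0, 0.0, 1.0], z_edges=[1.5, 2.5, 3.5])
print(grid)
```

A plotting library would then map each cell's rate to a color; a real analysis would use a much finer grid or a smoothed estimate.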


Since different batters have different strike zones, locations needed to be rescaled to define new boundaries. For instance, Jose Altuve, who is 5'5", has a relative Z location that is shifted slightly higher.
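One simple way to do this kind of rescaling is to express each pitch's height as a fraction of the batter's own strike zone and then map that fraction onto a common reference zone. The sketch below is an assumption about how this could be done, not Tom's method; the function name, the reference zone limits, and the two example batters' zone limits are hypothetical.

```python
def rescale_height(pz, sz_bot, sz_top, ref_bot=1.5, ref_top=3.5):
    """Map a pitch height (feet) from a batter's own zone onto a reference zone.

    sz_bot and sz_top are the bottom and top of this batter's strike zone;
    ref_bot and ref_top are assumed limits of a common reference zone.
    """
    fraction = (pz - sz_bot) / (sz_top - sz_bot)  # 0 at zone bottom, 1 at zone top
    return ref_bot + fraction * (ref_top - ref_bot)


# The same 3.0 ft pitch against two hypothetical batters.
shorter_batter = rescale_height(3.0, sz_bot=1.4, sz_top=3.0)
taller_batter = rescale_height(3.0, sz_bot=1.7, sz_top=3.7)
print(shorter_batter, taller_batter)
```

The same absolute height lands at the top of the reference zone for the shorter batter but well inside it for the taller one, which matches the idea that a shorter batter's relative Z location is shifted higher.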


Tom then went on to describe the impact that home plate umpires have on the game. On average, 155 pitches are called per game, with 15 being within one inch of the strike zone and 31 being within two inches. With a game sometimes being determined by a single pitch, the decisions that a home plate umpire makes are very important. A given pitch is worth approximately 0.13 runs.


Next, Tom showed various heat map comparisons that highlighted differences between umpires, batters, and counts. One of the most surprising results was the difference between when the batter faced an 0-2 count and when he faced a 3-0 count. I suggest readers look at all the slides to see the other interesting results.


While the heat maps provide a lot of useful information, it is sometimes interesting to look at certain pitches of interest. By linking to video clips, Tom demonstrated how an interactive strike scatter plot could be created. Please view the video to see this demonstration.


Tom concluded by saying that PITCHf/x is very powerful, and yes, umpires have a very difficult job!

Slides can be found here for Joshua and here for Tom.

The video for the event is below:




Data Science MD July Recap: Python and R Meetup

For July's meetup, Data Science MD was honored to have Jonathan Street of NIH and Brian Godsey of RedOwl Analytics come to discuss using Python and R for data analysis.

Jonathan started off by describing the growing ecosystem of Python data analysis tools including NumPy, Matplotlib, and pandas.

He next walked through a brief example demonstrating NumPy, pandas, and Matplotlib that he made available with the IPython notebook viewer.

The second half of Jonathan's talk focused on the problem of using clustering to identify scientific articles of interest. He needed to (a) convert PDF to text, (b) extract sections of the document, (c) cluster the articles, and (d) retrieve new material.

Jonathan used the PyPDF library for PDF conversion and then used the NLTK library for text processing. For a thorough discussion of NLTK, please see Data Community DC's multi-part series written by Ben Bengfort.

Clustering was done using scikit-learn, which identified seven groups of articles. From these, Jonathan was then able to retrieve other similar articles to read.
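To give a feel for the retrieval step, the sketch below ranks candidate articles by how similar their word counts are to a group of articles already judged interesting. It is a deliberately simplified stand-in written with the Python standard library only; it is not Jonathan's code and does not use scikit-learn or NLTK. The function names and the sample texts are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two word-count vectors (Counters)."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def rank_candidates(cluster_texts, candidate_texts):
    """Rank candidates by similarity to the pooled word counts of a cluster."""
    centroid = Counter()
    for text in cluster_texts:
        centroid.update(tokenize(text))
    scored = [(cosine(centroid, Counter(tokenize(t))), t) for t in candidate_texts]
    return sorted(scored, reverse=True)


# Hypothetical article snippets, for illustration only.
cluster = ["gene expression in tumor cells", "tumor gene sequencing"]
candidates = ["baseball pitch tracking", "sequencing of tumor gene variants"]

ranked = rank_candidates(cluster, candidates)
print(ranked)
```

A fuller pipeline would weight words (for example with TF-IDF), drop stop words, and compare against cluster centers produced by an actual clustering algorithm.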

Overall, by combining several Python packages to handle text conversion, text processing, and clustering, Jonathan was able to create an automated personalized scientific recommendation system. Please see the Data Community DC posts on Python data analysis tutorials and Python data tools for more information.

Next to speak was Brian Godsey of RedOwl Analytics who was presenting on their social network analysis. He first described the problem of identifying misbehavior in a financial firm. Their goals are to detect patterns in employee behavior, measure differences between various types of people, and ultimately find anomalies in behavior.

In order to find these anomalies, they model behavior based on patterns in communications and estimate model parameters from a data set and set of effects.

Brian then revealed that while implementing their solution they have developed an R package called rRevelation that allows a user to import data sets, create covariates, specify a model's behavioral parameters, and estimate the parameter values.

To conclude his presentation, Brian demonstrated using the package against the well-known Enron data set and discussed how larger data sets require using other technologies such as MapReduce.

Slides can be found here for Jonathan and here for Brian.

Data Science MD Unveils YouTube Channel

[youtube] Data Science MD, in an effort to provide additional value to its members, has started a YouTube channel, DataScienceMD, to host videos of talks presented at Meetup events. Now, when a member can't attend an event due to a scheduling conflict or being out of town, they can view the videos after the fact to stay in the loop. However, we know the more likely scenario: seeing the talks in person will not be enough and you will want to see them again and again. (It's OK, we won't tell the presenters how often you are watching them.)

The presentations are available in two formats: individual video entries that cover one specific presentation and playlists which group all presentations from an event together in one package making it easy to relive it in its entirety. The default view when first visiting the channel is to see the most recent activity. By clicking on the Videos link just below the channel title, you will see individual presentations. To see the playlists, simply change the Uploads box to Playlists.

The playlist above is from our May Meetup, which featured Cloudera consultants Joey Echeverria and Sean Busbey discussing an infrastructure option that can make analyzing Twitter data quick and simple, as well as an introduction to one of the many features of Apache Mahout. These were not just static presentations; they also included live demonstrations and queries against data stored within the infrastructure, and it was all captured in the videos. Check them out!