Data Science MD

Hadoop for Data Science: A Data Science MD Recap

On October 9th, Data Science MD welcomed Dr. Donald Miner as its speaker to talk about doing data science work and how the Hadoop framework can help. To start the presentation, Don was very clear about one thing: Hadoop is bad at a lot of things. It is not meant to be a panacea for every problem a data scientist will face.

With that in mind, Don spoke about the benefits that Hadoop offers data scientists. Hadoop is a great tool for data exploration. It can easily handle filtering, sampling, and anti-filtering (summarization) tasks. When speaking about these concepts, Don expressed the benefits of each and included some anecdotes that helped to show real-world value. He also spoke about data cleanliness in a very Baz Luhrmann "Wear Sunscreen" sort of way, offering that as his biggest piece of advice.

Don then transitioned to the more traditional data science problems of classification (including NLP) and recommender systems.

The talk was very well received by DSMD members. If you missed it, check out the video:

http://www.youtube.com/playlist?list=PLgqwinaq-u-Mj5keXlUOrH-GKTR-LDMv4

Our next event will be November 20th, 2013 at Loyola University Maryland Graduate Center starting at 6:30PM. We will be digging deeper into the daily lives of 3 data scientists. We hope you will join us!

2013 September DSMD Event: Teaching Data Science to the Masses

For Data Science MD's September meetup, we were very fortunate to have the very talented and very passionate Dr. Jeff Leek speak about his experiences teaching Data Science through the online learning platform Coursera. This was also a unique event for DSMD itself because it was the first meetup that featured only one speaker. Having one speaker talk for a whole hour can be a disaster if the speaker is unable to keep the attention of those in the audience. However, Dr. Leek is a very dynamic and engaging speaker and had no problem keeping the attention of everyone in the room, including a couple of middle school students.

For those of you who are not familiar with Dr. Leek, he is a biostatistician at Johns Hopkins University as well as an instructor in the JHU biostatistics program. His biostatistics work typically entails analyzing sequenced human genome data to provide insights to doctors and patients in the form of raw data and advanced visualizations. However, when he is not revolutionizing the medical world or teaching the great biostatisticians of tomorrow at JHU, you may find him teaching his course on Coursera or providing new content to his blog, Simply Statistics.

Now, on to the talk. Johns Hopkins, and specifically Dr. Leek, got involved in teaching a Coursera course because they have constantly been looking for ways to improve learning for their students. They had been "flipping the classroom" by taking lectures and posting them to YouTube so that students could review the lecture material before class and then use the classroom time to dig deeper into specific topics. Because online videos are such a vital component of Massive Open Online Courses (MOOCs), it is no surprise that they took the next leap.

But just in case you think that JHU and Dr. Leek are new to this whole data science "thing," check out their Kaggle team's results for the Heritage Health Prize.

jhu-kaggle-team

Even though their team fell a few places when scored on the private data, they still had a very impressive showing, considering that 1,358 teams entered and there were over 20,000 entries. But what exactly does data science mean to Dr. Leek? Check out his expanded components-of-data-science chart, which differs from similar charts by other data scientists by also showing the root disciplines of each component.

expanded-fusion-of-data-science

But what does the course look like?

course-setup

He covers topics such as types of analyses, how to organize a data analysis, and data munging, as well as others like:

concepts-machine-learning

statistics-concept

One of the interesting things to note though is that he also shows examples of poor data analysis attempts. There is a core problem with the statistics example from above (pointed out by high school students). Below is an example of another:

concepts-confounding

And this course, in addition to two other courses taught by other JHU faculty, Computing for Data Analysis and Mathematical Biostatistics Bootcamp, has had a very positive response.

jhu-course-enrollment

But how do you teach that many people effectively? That is where the power of Coursera comes in; JHU could have chosen other providers like edX or Udacity but decided to go with Coursera. The videos make it easy to convey knowledge, and message boards provide a mechanism for asking questions. Dr. Leek even had students answering questions for other students, so that all he had to do was validate the responses. But he also pointed out that his class's message board was just like all other message boards and followed the 1/98/1 rule: 1% of people respond in a mean way and are unhelpful, 1% are very nice and very helpful, and the other 98% don't care and don't really respond.

One of the most unique aspects of Coursera is that it helps to scale to tens of thousands of students by using peer/student grading. Each person grades 4 different assignments so that everyone is submitting one answer and grading 4 others. The final score for each student is the median of the four scores from the other students. The rubric used in Dr. Leek's class is below.

grading-rubric

The result of this grading policy, based on Dr. Leek's analysis, is that good students received good grades, poor students received poor grades, and middle students' grades fluctuated a fair amount. So it seems like the policy mostly works, but there is still room for improvement.
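
As a quick illustration of how a median-of-four peer-grading rule behaves (a minimal sketch with made-up scores, not data from the course):

```python
from statistics import median

# Each assignment receives four peer scores; the student's grade is the median.
peer_scores = {
    "strong_submission":   [88, 90, 92, 95],  # graders agree -> stable high grade
    "weak_submission":     [35, 40, 45, 50],  # graders agree -> stable low grade
    "middling_submission": [55, 60, 85, 90],  # graders disagree -> the median swings
}

for name, scores in peer_scores.items():
    print(f"{name}: median grade = {median(scores)}")
```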

But why do Johns Hopkins and Dr. Leek even support this model of learning? They do have full-time jobs that involve teaching, after all. Well, besides being huge supporters of open source technology and open learning, they also see many other reasons for supporting this paradigm.

partial-motivation

Check out the video for the many other reasons why JHU further supports this paradigm. And, while you are at it, see if you can figure out if the x and y axes are related in some way. This was our data science/statistics problem for the evening. The answer can also be found in the video.

stat-challenge

http://www.youtube.com/playlist?list=PLgqwinaq-u-NNCAp3nGLf93dh44TVJ0vZ

We also got a sneak peek at a new tool that integrates directly into R: swirl. Look for a meetup or blog post about this tool in the future.

swirl

Our next meetup is on October 9th at Advertising.com in Baltimore beginning at 6:30PM. We will have Don Miner speak about using Hadoop for Data Science. If you can make it, come out and join us.

Data Science MD August Recap: Sports Analytics Meetup

For August's meetup, Data Science MD hosted a discussion on one of the most popular fields of analytics, the wide world of sports. From Moneyball to the MIT Sloan Sports Analytics Conference, there has been much interest among researchers, team owners, and athletes in the area of sports analytics. Enthusiastic fans eagerly pore over recent statistics and crunch the numbers to see just how well their favorite sports team will do this season.

One issue that sports teams must deal with is the reselling of tickets to different events. Joshua Brickman, the Director of Ticket Analytics, led off the night by discussing how the Washington Wizards are addressing secondary markets such as StubHub. One of the major initiatives taking place is a joint venture between Ticketmaster and the NBA to create a unified ticket exchange for all teams. Tickets, like most items, operate on a free market, where customers are free to purchase from whomever they choose.

brickman-slide-1

Joshua went on to explain that teams could either try to beat the secondary markets by limiting printing, changing fee structures, and offering guarantees, or they could instead take advantage of the transaction data received each week from the league across secondary markets.

brickman-slide-2

Josh outlined that the problem with the data was that it covered only the Wizards, it was received only weekly, and it didn't take dynamic pricing changes into consideration. So instead they built their own models and queries to create heat maps. The first heat map shows the inventory sold. For this particular example, the Wizards had a sold-out game.

brickman-slide-3

Possibly of more importance was the heat map showing at what premium tickets were sold on the secondary market. In certain cases, the prices were actually lower than face value.

brickman-slide-4

As with most data science products, the visualization of the results is extremely important. Joshua explained that the graphical heat maps make the data easily digestible for sales staff and directors, and supplement their numerical tracking. Their current process involves combining SQL queries with hand-drawn shape files. Joshua also explained how they can track secondary markets and calculate current dynamic prices to see discrepancies.

brickman-slide-5

Joshua ended by describing how future work could involve incorporating historical data and current secondary market prices to modify pricing to more closely reflect current conditions.

brickman-slide-6

Our next speaker for the night was Tom Duncan, the Player Information Analyst for our very own Baltimore Orioles. Tom began by describing how PITCHf/x is installed in every major league stadium and provides teams with information on the location, velocity, and movement of every pitch. Using heat maps, Tom was able to show how the strike zone has changed between 2009 and 2013.

duncan-slide-1

Tom then described the R code necessary to generate the heat maps.

duncan-slide-2
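
Tom's examples were in R; as a rough illustration of the same idea in Python (a sketch with simulated pitch locations, not the Orioles' actual code or data), a 2D histogram of plate locations produces this kind of heat map:

```python
import numpy as np
import matplotlib.pyplot as plt

# Simulated pitch locations in feet: px is the horizontal offset from the center
# of the plate, pz is the height above the ground. Real values would come from PITCHf/x.
rng = np.random.default_rng(42)
px = rng.normal(0.0, 0.9, 5000)
pz = rng.normal(2.5, 0.8, 5000)

# Bin the locations into a grid and plot the counts as a heat map.
plt.hist2d(px, pz, bins=40, range=[[-2, 2], [0.5, 4.5]], cmap="hot")
plt.colorbar(label="pitches")
plt.xlabel("horizontal location (ft)")
plt.ylabel("vertical location (ft)")
plt.title("Pitch location heat map (simulated data)")
plt.show()
```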

Since different batters have different strike zones, locations needed to be rescaled to define new boundaries. For instance, Jose Altuve, who is 5'5", has a relative Z location that is shifted slightly higher.

duncan-slide-3
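
One common way to do this kind of rescaling (a hedged sketch of the general idea, not necessarily the exact transformation Tom used) is to normalize each pitch height against that batter's own strike-zone top and bottom and then map it onto a reference zone:

```python
def rescale_pitch_height(pz, sz_bot, sz_top, ref_bot=1.5, ref_top=3.5):
    """Map a pitch height into a reference strike zone.

    pz     -- pitch height at the plate (feet)
    sz_bot -- bottom of this batter's strike zone (feet)
    sz_top -- top of this batter's strike zone (feet)
    ref_*  -- boundaries of the reference zone; illustrative values, not league constants
    """
    # Position within the batter's own zone: 0.0 = bottom edge, 1.0 = top edge.
    relative = (pz - sz_bot) / (sz_top - sz_bot)
    # Re-express that relative position inside the reference zone.
    return ref_bot + relative * (ref_top - ref_bot)

# The same raw pitch height sits relatively higher in a shorter batter's zone.
print(rescale_pitch_height(pz=2.0, sz_bot=1.3, sz_top=2.9))  # shorter batter
print(rescale_pitch_height(pz=2.0, sz_bot=1.6, sz_top=3.4))  # average batter
```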

Tom then went on to describe the impact that home plate umpires have on the game. On average, 155 pitches are called per game, with 15 being within one inch of the strike zone and 31 being within two inches. With a game sometimes being determined by a single pitch, the decisions that a home plate umpire makes are very important. A given pitch is worth approximately 0.13 runs.

duncan-slide-4
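
Putting those numbers together gives a rough sense of the stakes (simple arithmetic on the figures above; multiplying by the average run value is our illustration, not Tom's exact accounting):

```python
called_per_game = 155       # called pitches per game, on average
within_two_inches = 31      # of those, within two inches of the strike zone edge
run_value_per_pitch = 0.13  # approximate run value of a single called pitch

borderline_share = within_two_inches / called_per_game
potential_swing = within_two_inches * run_value_per_pitch

print(f"{borderline_share:.0%} of called pitches are borderline")         # 20%
print(f"roughly {potential_swing:.1f} runs of potential swing per game")  # ~4.0
```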

Next Tom showed various heat map comparisons that highlighted differences between umpires, batters, and counts. One of the most surprising results was the difference between when the batter faced an 0-2 count and when he faced a 3-0 count. I suggest readers look at all the slides to see the other interesting results.

duncan-slide-5

While the heat maps provide a lot of useful information, it is sometimes worthwhile to look at particular pitches of interest. By linking to video clips, Tom demonstrated how an interactive strike scatter plot could be created. Please view the video to see this demonstration.

duncan-slide-6

Tom concluded by saying that PITCHf/x is very powerful, and yes, umpires have a very difficult job!

Slides can be found here for Joshua and here for Tom.

The video for the event is below:

[youtube=http://www.youtube.com/playlist?list=PLgqwinaq-u-NLZhSml9VHXVgHEh6l7wQH]

Data Science MD July Recap: Python and R Meetup

For July's meetup, Data Science MD was honored to have Jonathan Street of NIH and Brian Godsey of RedOwl Analytics come discuss using Python and R for data analysis.

Jonathan started off by describing the growing ecosystem of Python data analysis tools, including NumPy, Matplotlib, and Pandas.

He next walked through a brief example demonstrating Numpy, Pandas, and Matplotlib that he made available with the IPython notebook viewer.

The second half of Jonathan's talk focused on the problem of using clustering to identify scientific articles of interest. He needed to a) convert PDFs to text, b) extract sections of the documents, c) cluster them, and d) retrieve new material.

Jonathan used the PyPDF library for PDF conversion and then used the NLTK library for text processing. For a thorough discussion of NLTK, please see Data Community DC's multi-part series written by Ben Bengfort.

Clustering was done using scikit-learn, which identified seven groups of articles. From these, Jonathan was then able to retrieve other similar articles to read.

Overall, by combining several Python packages to handle text conversion, text processing, and clustering, Jonathan was able to create an automated personalized scientific recommendation system. Please see the Data Community DC posts on Python data analysis tutorials and Python data tools for more information.
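
A minimal sketch of that kind of pipeline using the libraries mentioned above (the directory name, vectorizer settings, and the choice of seven clusters are illustrative assumptions, not Jonathan's actual code):

```python
from pathlib import Path

from pypdf import PdfReader                      # modern successor to the PyPDF library
from nltk.corpus import stopwords                # requires: nltk.download("stopwords")
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

def pdf_to_text(path):
    """Extract raw text from every page of a PDF."""
    reader = PdfReader(path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)

# 1) Convert a directory of article PDFs to plain text.
paths = sorted(Path("articles").glob("*.pdf"))
texts = [pdf_to_text(p) for p in paths]

# 2) Turn each document into a TF-IDF feature vector, dropping common English words.
vectorizer = TfidfVectorizer(stop_words=stopwords.words("english"), max_features=5000)
features = vectorizer.fit_transform(texts)

# 3) Cluster the articles; seven groups matches the number Jonathan reported.
labels = KMeans(n_clusters=7, n_init=10, random_state=0).fit_predict(features)

for path, label in zip(paths, labels):
    print(label, path.name)
```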

Next to speak was Brian Godsey of RedOwl Analytics, who presented their social network analysis work. He first described the problem of identifying misbehavior in a financial firm. Their goals are to detect patterns in employee behavior, measure differences between various types of people, and ultimately find anomalies in behavior.

In order to find these anomalies, they model behavior based on patterns in communications and estimate model parameters from a data set and set of effects.

Brian then revealed that while implementing their solution they have developed an R package called rRevelation that allows a user to import data sets, create covariates, specify a model's behavioral parameters, and estimate the parameter values.

To conclude his presentation, Brian demonstrated using the package against the well-known Enron data set and discussed how larger data sets require using other technologies such as MapReduce.

http://www.youtube.com/watch?v=1rFlqD3prKE&list=PLgqwinaq-u-Piu9tTz58e7-5IwYX_e6gh

Slides can be found here for Jonathan and here for Brian.

What is a Hackathon?

Most people have probably heard the term hackathon in today's technology landscape. But just because you have heard of it doesn't necessarily mean you know what it is. Wikipedia defines a hackathon as an event in which computer programmers and others involved in software development collaborate intensively on software projects. Hackathons can last for just a few hours or go as long as a week, and they come in all shapes and sizes. There are national hackathons, state-level hackathons, internal corporate hackathons, open-data-focused hackathons, and hackathons meant to flesh out APIs.

Hackathons can also operate at the local level. A couple of examples of local-level hackathons are the Baltimore Neighborhood Indicators Alliance-Jacob France Institute (BNIA-JFI) hackathon, which just completed its event in mid-July, and the Jersey Shore Comeback-a-thon, hosted this past April by Marathon Data Systems.

The BNIA-JFI hackathon was paired with a Big Data Day event and lasted for eight hours one Friday. The goal of this hackathon was to come up with solutions that help two local Baltimore neighborhoods: Old Goucher and Pigtown. Each neighborhood had a very clear goal established at the beginning of the hackathon, but the overall aim was simple: help improve the local community. The data was provided to participants by the hackathon sponsors.

The Jersey Shore Comeback-a-thon lasted 24 hours straight, taking some participants back to their college days of pulling all-nighters. It also differed from the Baltimore hackathon in that it did not appear to provide data; instead, it focused on the application technology itself and how it could be used to alert locals and tourists that the Jersey Shore was open for business.

Both of these events are great examples of local hackathons that look to give back to the community. However, if you want to learn more about federal level hackathons, open government data, or state level hackathons, please join Data Science MD at their next event titled Open Data Breakfast this Friday, August 9th, 8AM at Unallocated, a local Maryland hackerspace.

Data Science MD Discusses Health IT at Shady Grove

For its June meetup, Data Science MD explored a new venue, The Universities at Shady Grove. And what better way to venture into Montgomery County than to spend an evening discussing one of its leading sectors. That's right, an event all about healthcare. And we tackled it from two different sides.

http://www.youtube.com/playlist?list=PLgqwinaq-u-Ob2qS9Rt8uCXeKtT830A7e

The night started with a presentation from Gavin O'Brien of NIST's National Cybersecurity Center of Excellence. He spoke about creating a secure mobile Health IT platform that would allow doctors and nurses to share relevant pieces of information in a manner that is secure and follows all guidelines and policies governing how health data must be handled. Gavin's presentation focused on securing 802.11 links as opposed to cellular links or other types of wireless links, as this is a good first step and is immediately practical when deployed within one building such as a hospital. Gavin discussed the technological challenges, from encrypting data during transmission, rather than sending it in the clear where it can be intercepted, to creating Access Control Lists so that only the correct people see a patient's data.

As his talk progressed, one thought was constantly in the back of my mind: how can this architecture provide the protection for the data that the policies stipulate while still allowing the data to be distributed so that analytics can be run on it? For instance, a hospital should be interested in trends among the patients in its care: whether patients all had complications after receiving the same drug or family of drugs (perhaps from the same batch), when patients have the most problems and therefore require the most attention, and when a bacterium or virus may be loose in the hospital, further complicating patients' ailments. The architecture may allow these types of analytics, but they were not specifically discussed during Gavin's presentation. If you have any ideas about how a compliant architecture can support these analytics, or about potential problems with running analytics, please leave a comment on this post.

The final speaker of the night was Uma Ahluwalia, the Director of Health and Human Services for Montgomery County. Uma spoke about the various avenues county residents have for reporting problems, and how their needs often cross many different segments of health and human services, usually requiring their stories to be retold each time. According to her vision, a resident could report their problem to any one of six segments, and then all of the segments could see the information without the patient having to reiterate their story over and over again. One big problem with this solution is that data would be shared across many groups, giving county workers access to more information than they should have under health regulations. However, Montgomery County sees each segment as part of one organization, and therefore the data can be shared internally among all employees within that organization. While this should help reduce the amount of time patients need to retell their story, it still does not provide an open platform for data scientists. However, Uma also had a potential solution to that problem: volunteers. Volunteers can sign non-disclosure agreements allowing them to see patient data and help create useful analytics, thereby opening the problem space to many more minds in the hope of creating truly revolutionary analytics. Perhaps you will be the next great mind that unlocks the meaning behind a current social issue.

Finally, Data Science MD needs to acknowledge a few key people and groups that contributed to this meetup. Mary Lowe and Melissa Marquez from the Universities at Shady Grove were instrumental in making this happen, helping to secure the room and providing the food and A/V setup. Dan Hoffman, the Chief Innovation Officer for Montgomery County, also provided a great deal of support. And John Rumble, a DSMD member, took the lead in getting DSMD beyond the Baltimore/Columbia corridor. Thanks so much to all of these key people.

If you want to catch up on previous meetups, please check out our YouTube channel.

Please check out our July meetup, where we will discuss analysis techniques in Python and R at Betamore.

Data Science MD Unveils YouTube Channel

[youtube http://www.youtube.com/watch?v=videoseries?list=PLgqwinaq-u-OjuL89qqV4lto2PG3QwjbM&w=600&h=360] Data Science MD, in an effort to provide additional value to its members, has started a YouTube channel, DataScienceMD, to host videos of talks presented at Meetup events. Now, when a member can't attend an event due to a scheduling conflict or being out of town, they can view the videos after the fact to stay in the loop. However, we know the more likely scenario: seeing the talks in person will not be enough and you will want to see it again and again. (It's OK, we won't tell the presenters how often you are watching them.)

The presentations are available in two formats: individual video entries that cover one specific presentation, and playlists that group all presentations from an event together in one package, making it easy to relive the event in its entirety. The default view when first visiting the channel shows the most recent activity. By clicking on the Videos link just below the channel title, you will see individual presentations. To see the playlists, simply change the Uploads box to Playlists.

The playlist above is from our May Meetup, which featured Cloudera consultants Joey Echeverria and Sean Busbey discussing an infrastructure option that can make analyzing Twitter data quick and simple, as well as an introduction to one of the many features of Apache Mahout. These were not just static presentations; they also included live demonstrations and queries against data stored within the infrastructure, and it was all captured in the videos. Check them out!

Data Community DC Video Series Kicks Off: Dr. Jesse English Talks NLP and Text Processing

We are excited to announce the first in a new series of posts and a brand new initiative: Data Community DC Videos! We are going to film and publish online videos (and separate audio, resources permitting) of as many talks from Data Community DC meetups as possible. Yes, we want you to experience the events in person, but we realize that not everyone who wants to be a part of our community can attend every single event. To kick this off, we have a fantastic video of Dr. Jesse English passionately discussing a brand new, open source framework, WIMs (Weakly Inferred Meanings), a novel approach to creating structured meaning representations for semantic analyses. Whereas a TMR (text meaning representation) requires a large, domain-specific knowledge base and significant computation time, WIMs cover a limited scope of possible relationships. The limitation is intentional and allows for better performance, while still covering enough relationships for most applications. Additionally, no bespoke knowledge base or microtheory needs to be created; the novel pattern-matching technique means that available ontologies like WordNet provide enough coverage. WIMs are open source and available now, and are truly a breakthrough in semantic processing.

Dr. Jesse English holds a PhD in computer science from UMBC and has specialized in natural language processing, machine learning, and machine reading. As the Chief Science Officer at Unbound Concepts, Jesse focuses on automatically extracting semantically rich meaning from literature and applying that knowledge to the company's big-data-driven machine learning algorithm. Before his work at Unbound Concepts, Jesse worked as a research associate at UMBC, focusing on automatically bridging the knowledge acquisition bottleneck through machine reading, as well as developing agent-based conversation systems.

[vimeo 61955058]

Event Review: Analyzing Twitter: An End-to-End Data Pipeline Recap

Data Science MD once again returned to the wonderful Ad.com in Baltimore to discuss Analyzing Twitter: An End-to-End Data Pipeline with two experts from Cloudera.

Starting off the night, Joey Echeverria, a Principal Solutions Architect, first discussed a big data architecture and how key components of a relational data management system can be replaced with current big data technologies. With Twitter being increasingly popular with marketing teams, analyzing Twitter data becomes a perfect use case for demonstrating a complete big data pipeline.

Walking through each component, Joey described what functionality each technology provides to the solution. Flume is able to pull data from a source and store it in a sink. With a custom Flume source interfacing with the Twitter API, this allows the automatic retrieval of tweets and their storage in HDFS in JSON format.

For query and reporting functionality, Hive (an open source project under the Apache Software Foundation) provides a SQL-like interface that creates MapReduce jobs to access the data. Hive uses schema-on-read, supports scalar and complex types, and allows custom serializers and deserializers. However, as Joey warned, it is not the same as accessing a relational database. There is no transaction support, and queries can take several minutes to hours.

Complex queries can be written to select, group, and perform other calculations on the Twitter data. With a JSON deserializer, the JSON Twitter data stored by Flume can be queried by Hive.
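
As an illustration of the kind of grouping involved (expressed here in Python with pandas rather than HiveQL; the file path is made up, and the field names follow the Twitter JSON format):

```python
import json
import pandas as pd

# Load Flume-collected tweets, one JSON object per line.
with open("tweets.json") as f:
    tweets = [json.loads(line) for line in f]

df = pd.DataFrame({
    "screen_name":   [t["user"]["screen_name"] for t in tweets],
    "retweet_count": [t.get("retweet_count", 0) for t in tweets],
})

# Roughly: SELECT screen_name, COUNT(*), SUM(retweet_count) ... GROUP BY screen_name
summary = (df.groupby("screen_name")
             .agg(tweets=("retweet_count", "size"),
                  retweets=("retweet_count", "sum"))
             .sort_values("retweets", ascending=False))
print(summary.head(10))
```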

While Hive is powerful, Joey explained it can be slow. A tool like Impala can perform queries up to 100 times faster than Hive. Impala circumvents MapReduce to directly access the data through a specialized distributed query engine that is very similar to those found in commercial parallel RDBMSs.

Lastly, in order to have everything run automatically and repeatedly, Joey introduced Oozie, which allows workflows to be created and managed. By combining these open source tools from the Hadoop ecosystem, a complete Twitter analysis pipeline can be created to provide efficient retrieval, storage, and querying of tweets.

In conclusion, Joey highlighted the three-part series that covers the Twitter pipeline in further detail.

Following Joey, Sean Busbey, a Solutions Architect at Cloudera, discussed working with Mahout, a scalable machine learning library for Hadoop. Sean first introduced the three C's of machine learning: classification, clustering, and collaborative filtering. With classification, learning from a training set is supervised, and new examples can then be categorized. Clustering allows examples with common features to be grouped together, while collaborative filtering allows new candidates to be suggested.

For the night's presentation, Sean chose clustering for his demonstration.

For clustering to work, each item must be represented as a vector of features that the algorithm can operate on.

With two features in the vector, it is possible to visualize how the clustering of items could occur.

Sean then discussed how tweets could be weighted for learning using basic counts, inverse document frequencies, and n-grams. To get the data into a format for Mahout to use, a combination of Hive, Hadoop streaming, and the Mahout seq2sparse tool can be used.
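
A small sketch of those weighting options in Python (scikit-learn stands in for Mahout's seq2sparse here, and the sample tweets are made up):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

tweets = [
    "hadoop makes big data easy",
    "big data needs big clusters",
    "mahout brings machine learning to hadoop",
]

# Basic term counts.
counts = CountVectorizer().fit_transform(tweets)

# Inverse-document-frequency weighting: terms common to many tweets get down-weighted.
tfidf = TfidfVectorizer().fit_transform(tweets)

# Unigrams and bigrams together, so phrases like "big data" become features too.
ngrams = TfidfVectorizer(ngram_range=(1, 2)).fit_transform(tweets)

print(counts.shape, tfidf.shape, ngrams.shape)
```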

For clustering to work, a notion of measurable similarity is needed. Sean discussed how Euclidean distance, cosine similarity, and Jaccard distance can be used.
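
For reference, the three distance and similarity notions Sean mentioned, computed on toy vectors (a minimal sketch, not Mahout's implementation):

```python
import numpy as np

a = np.array([1.0, 0.0, 2.0, 1.0])
b = np.array([0.0, 1.0, 2.0, 1.0])

# Euclidean distance: straight-line distance between the two vectors.
euclidean = np.linalg.norm(a - b)

# Cosine similarity: the angle between the vectors, ignoring their magnitudes.
cosine = a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Jaccard distance on the sets of non-zero features (overlap of terms used).
sa, sb = set(np.nonzero(a)[0]), set(np.nonzero(b)[0])
jaccard = 1 - len(sa & sb) / len(sa | sb)

print(euclidean, cosine, jaccard)
```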

K-Means clustering is a method of cluster analysis supported by Mahout. Several parameters including number of clusters, distance metric, max iterations, and stopping threshold must be specified when running k-means.
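
The same knobs appear whether you run Mahout's k-means driver or, as a quick stand-in illustration, scikit-learn's implementation (this is not the Mahout command Sean demonstrated):

```python
import numpy as np
from sklearn.cluster import KMeans

# Pretend these rows are TF-IDF vectors for a couple of hundred tweets.
rng = np.random.default_rng(0)
vectors = rng.random((200, 50))

km = KMeans(
    n_clusters=10,   # number of clusters
    max_iter=20,     # maximum iterations
    tol=1e-3,        # stopping threshold (convergence delta)
    n_init=5,
    random_state=0,
)
labels = km.fit_predict(vectors)

# Note: scikit-learn's KMeans always uses Euclidean distance, whereas Mahout lets
# you choose the distance measure as a separate parameter.
print(np.bincount(labels))
```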

The results of the clustering were not ideal, so Sean next described using canopy clustering.

Even then, though, the clusters were not the best, from which Sean concluded that short, casual speech is hard to analyze and that a custom analyzer that does further smoothing and removes certain stop words should be used.

If you would like to read the complete slides, Joey's can be found here, and Sean's are here.