cloudera

Data Science MD Unveils YouTube Channel

[youtube http://www.youtube.com/watch?v=videoseries?list=PLgqwinaq-u-OjuL89qqV4lto2PG3QwjbM&w=600&h=360]

Data Science MD, in an effort to provide additional value to its members, has started a YouTube channel, DataScienceMD, to host videos of talks presented at Meetup events. Now, when members can't attend an event due to a scheduling conflict or being out of town, they can view the videos after the fact to stay in the loop. However, we know the more likely scenario: seeing the talks in person will not be enough and you will want to see them again and again. (It's OK, we won't tell the presenters how often you are watching them.)

The presentations are available in two formats: individual video entries that each cover one specific presentation, and playlists, which group all presentations from an event together in one package, making it easy to relive the event in its entirety. The default view when first visiting the channel shows the most recent activity. By clicking on the Videos link just below the channel title, you will see individual presentations. To see the playlists, simply change the Uploads box to Playlists.

The playlist above is from our May Meetup, which featured Cloudera consultants Joey Echeverria and Sean Busbey discussing an infrastructure option that can make analyzing Twitter data quick and simple, as well as an introduction to one of the many features of Apache Mahout. These were not just static presentations; they also included live demonstrations and queries against data stored within the infrastructure, and it was all captured in the videos. Check them out!

Event Review: Analyzing Twitter: An End-to-End Data Pipeline Recap

Data Science MD once again returned to the wonderful Ad.com in Baltimore to discuss Analyzing Twitter: An End-to-End Data Pipeline with two experts from Cloudera.

Starting off the night, Joey Echeverria, a Principal Solutions Architect, first discussed a big data architecture and how key components of a relational data management system can be replaced with current big data technologies. With Twitter being increasingly popular with marketing teams, analyzing Twitter data becomes a perfect use case to demonstrate a complete big data pipeline.

Walking through each component, Joey described what functionality each technology provided to the solution. Flume is able to pull data from a source and store it into a sink. A custom Flume source interfacing with the Twitter API allows tweets to be retrieved automatically and stored in HDFS in JSON format.

For query and reporting functionality, Hive (an open source project under the Apache Software Foundation) provides a SQL-like interface that creates MapReduce jobs to access the data. Hive applies its schema on read, supports scalar and complex types, and allows custom serializers and deserializers. However, as Joey warned, it is not the same as accessing a relational database. There is no transaction support, and queries can take anywhere from several minutes to hours.

Complex queries can be written to select, group, and perform other calculations on the Twitter data. With a JSON deserializer, the JSON Twitter data stored by Flume can be queried by Hive.
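
To make the select-and-group idea concrete, here is a small plain-Python sketch of the kind of aggregation such a query expresses. It is not Hive or HiveQL, and it is not taken from Joey's slides; the sample records, the field names `user` and `text`, and the function name `count_tweets_per_user` are all assumptions made for illustration. Each JSON line is parsed only when it is read, which loosely mirrors the schema-on-read approach.

```python
import json
from collections import Counter

# Hypothetical sample data: one JSON document per line, roughly as a
# pipeline might store raw tweets. Field names are assumptions.
SAMPLE_LINES = [
    '{"user": "alice", "text": "Learning about Hadoop"}',
    '{"user": "bob", "text": "Impala looks fast"}',
    '{"user": "alice", "text": "Hive queries on tweets"}',
]


def count_tweets_per_user(lines):
    """Parse each JSON line on read and count tweets grouped by user."""
    counts = Counter()
    for line in lines:
        record = json.loads(line)    # structure is applied at read time
        counts[record["user"]] += 1  # the equivalent of GROUP BY user, COUNT(*)
    return counts


per_user = count_tweets_per_user(SAMPLE_LINES)
```

In a real deployment the grouping would run as a distributed job over files in HDFS rather than over an in-memory list.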

While Hive is powerful, Joey explained it can be slow. A tool like Impala can perform queries up to 100 times faster than Hive. Impala circumvents MapReduce to directly access the data through a specialized distributed query engine that is very similar to those found in commercial parallel RDBMSs.

Lastly, in order to have everything run automatically and repeatedly, Joey introduced Oozie, which allows workflows to be created and managed. By combining these open source tools of the Hadoop ecosystem, a complete Twitter analysis pipeline can be created to provide efficient retrieval, storage, and querying of tweets.

In conclusion, Joey highlighted the three-part series that covers the Twitter pipeline in further detail.

Following Joey, Sean Busbey, a Solutions Architect at Cloudera, discussed working with Mahout, a scalable machine learning library for Hadoop. Sean first introduced the three C's of machine learning: classification, clustering, and collaborative filtering. With classification, learning from a training set is supervised, and new examples can then be categorized. Clustering allows examples with common features to be grouped together, while collaborative filtering allows new candidates to be suggested.

For tonight's presentation, Sean used clustering for the demonstration.

For clustering to work, the algorithm needs each item to be represented as a vector of features.

With two features in the vector, it is possible to visualize how the clustering of items could occur.

Sean then discussed how tweets could be weighted for learning using basic count, inverse document frequencies, and n-grams. To get the data into a format for Mahout to use, a combination of Hive, Hadoop streaming, and the Mahout seq2sparse tool can be used.
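
As a rough illustration of these weighting schemes, the sketch below computes raw term counts, a textbook inverse document frequency, and n-grams in plain Python. It is not Mahout's seq2sparse and does not reproduce its exact formulas; the idf definition log(N / df), the function names, and the two toy documents are assumptions chosen for clarity.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return the n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def inverse_document_frequency(docs):
    """Textbook idf: log(N / df), where df counts documents containing the term."""
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each term once per document
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


def tfidf(tokens, idf):
    """Weight each term's raw count in one document by its idf."""
    counts = Counter(tokens)  # the basic count weighting
    return {term: count * idf[term] for term, count in counts.items()}


# Toy corpus (an assumption, not data from the talk).
DOCS = [
    "big data is big".split(),
    "data pipeline for tweets".split(),
]
IDF = inverse_document_frequency(DOCS)
WEIGHTS = tfidf(DOCS[0], IDF)
```

A term that appears in every document, such as "data" here, gets an idf of zero, which is why this weighting pushes common words out of the feature vector.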

For clustering to work, a notion of measurable similarity is needed. Sean discussed how Euclidean distance, cosine similarity, and Jaccard distance can be used.
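
For readers who want to see the three measures side by side, here is a minimal Python sketch using their standard definitions. The function names and the zero-vector and empty-set conventions are assumptions; Mahout ships its own distance measure classes, which this does not attempt to mirror.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


def jaccard_distance(a, b):
    """One minus the ratio of shared items to all items in two collections."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 0.0  # convention chosen here for two empty sets
    return 1.0 - len(set_a & set_b) / len(union)
```

Euclidean distance is sensitive to vector length, cosine similarity only to direction, and Jaccard distance only to which terms are present, so the choice matters for short texts such as tweets.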

K-Means clustering is a method of cluster analysis supported by Mahout. Several parameters including number of clusters, distance metric, max iterations, and stopping threshold must be specified when running k-means.
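
The role of each parameter is easier to see in a small, single-machine sketch. The code below is a minimal k-means in plain Python, not Mahout's distributed implementation; the function names, the default values, and the choice to seed the centroids with the first k points (instead of a random pick) are assumptions made to keep the example short and deterministic.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean(points):
    """Component-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def assign(points, centroids, distance):
    """Index of the nearest centroid for every point."""
    return [
        min(range(len(centroids)), key=lambda i: distance(p, centroids[i]))
        for p in points
    ]


def kmeans(points, k, distance=euclidean, max_iterations=10, threshold=1e-4):
    """Return (centroids, assignments) for k clusters."""
    centroids = [tuple(p) for p in points[:k]]  # simplistic seeding (assumption)
    for _ in range(max_iterations):
        assignments = assign(points, centroids, distance)
        new_centroids = []
        for i, old in enumerate(centroids):
            members = [p for p, a in zip(points, assignments) if a == i]
            new_centroids.append(mean(members) if members else old)
        # Stop once no centroid moved more than the stopping threshold.
        shift = max(distance(o, n) for o, n in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift < threshold:
            break
    return centroids, assign(points, centroids, distance)


POINTS = [(0, 0), (0, 1), (10, 10), (10, 11)]
CENTROIDS, LABELS = kmeans(POINTS, k=2)
```

Here `max_iterations` bounds the work when the centroids keep moving, while `threshold` ends the loop early once they have effectively settled.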

The results of the clustering were not ideal, so Sean next described using canopy clustering.
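
For context, canopy clustering groups points using two distance thresholds, a loose one (T1) and a tight one (T2), and is often used to pick starting centers for k-means. The following is a minimal sketch of the general technique in plain Python, not Mahout's implementation and not Sean's code; the function names, the strict and non-strict comparisons, and the sample points are assumptions.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def canopy_clustering(points, t1, t2, distance=euclidean):
    """Return a list of (center, members) canopies, with t1 > t2.

    Every point within t1 of a center joins that canopy; points within
    t2 of the center can no longer become centers themselves.
    """
    if t1 <= t2:
        raise ValueError("t1 must be greater than t2")
    remaining = list(points)
    canopies = []
    while remaining:
        center = remaining.pop(0)  # take the next candidate center
        members = [p for p in points if distance(center, p) < t1]
        # Drop candidates that are tightly bound to this center.
        remaining = [p for p in remaining if distance(center, p) >= t2]
        canopies.append((center, members))
    return canopies


CANOPIES = canopy_clustering([(0, 0), (1, 0), (5, 0), (6, 0)], t1=3, t2=2)
```

Unlike k-means, the number of groups is not fixed in advance; it falls out of the chosen thresholds.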

Even then, the clusters were not ideal, which led Sean to conclude that short, casual speech is hard to analyze and that a custom analyzer that does further smoothing and removes certain stop words should be used.

If you would like to read the complete slides, Joey's can be found here, and Sean's are here.

Mid Maryland Data Science Kickoff Event Review

On Tuesday, January 29th, nearly 90 academics, professionals, and data science enthusiasts gathered at JHU APL for the kick-off meetup of the new Mid-Maryland Data Science group. With samosas on their plates and sodas in hand, members filled the air with conversations about their work and interests. After their meal, members were ushered into the main auditorium and the presenters took their place at the front.

Greetings and Mission

by Jason Barbour & Matt Motyka

Jason and Matt kicked off the talks with an introduction of the group. Motivated by both the growth of data science and the vast opportunities being made available by powerful free tools and open access to data, they described their interest in creating a local group that helps grow the Maryland data science community. Being software developers with analytic experience, Jason and Matt next described their seven keys to a successful analytic, including infrastructure, people, data, model, and presentation. Lastly, metrics about the interests and experience of the members were presented.

The Rise of Data Products

by Sean Murphy

With excitement and passion, Sean took the stage to show how now is the Gold Rush for data products. Laying out the definition of a data product, and cycling through several well-known examples, Sean explained how these products are able to bring social, financial, or environmental value through the combination of data and algorithms. Consumers want data, and the tools and infrastructure needed to supply this demand are available either freely or at extremely low cost. Data scientists are now able to harness this stack of tools to provide the data products that consumers crave. As Sean succinctly stated, it is a great time to work with data.

The article version of the talk can be found here.

The Variety of Data Scientists

by Harlan Harris

Being a full-fledged data scientist, Harlan followed up Sean by presenting his research into what the name “data scientist” really means. Using the results of a data scientist survey, Harlan listed several skill groupings that provide a shorthand for the variety of skills that data scientists possess: programming, stats, math, business, and machine learning/big data. Next, Harlan discussed how the diverse backgrounds of data scientists can be more accurately categorized into four types: data businessperson, data creative, data researcher, and data engineer. With this breakdown, Harlan demonstrated that the data scientist community is actually composed of individuals with a variety of interests and skills.

Cloudera Impala - Closing the near real time gap in BIGDATA

by Wayne Wheeles

A true cyber security evangelist, Wayne Wheeles presented how Cloudera’s Impala was able to make near real time security analysis a reality. With his years of experience in the field of cyber security, and his prior work utilizing big data technologies, Wayne was given unique access to Cloudera’s latest tool. Through his testing and analysis, he concluded that Impala offered a significant improvement in performance and could become a vital tool in cyber security.

After the last presentation, more than a dozen members joined us at nearby Looney’s Pub to end the night with a few beers and snacks. To everyone's surprise, Donald Miner of EMC Greenplum offered to pick up the tab! You can follow him on Twitter or LinkedIn from this page.

If you missed this first event, don't worry as the next one is coming up on March 14th in Baltimore. Check it out here.