
The Evolution of Big Data Platforms and People

This is a guest post by Paco Nathan. Paco is an O’Reilly author, an Apache Spark open source evangelist with Databricks, and an advisor for Zettacap, Amplify Partners, and The Data Guild. Google lives in his family’s backyard. Paco spoke at Data Science DC in 2012.

A kind of “middleware” for Big Data has been evolving since the mid–2000s. Abstraction layers help make it simpler to write apps in frameworks such as Hadoop. Beyond the relatively simple issue of programming convenience, there are much more complex factors in play. Several open source frameworks have emerged that build on the notion of workflow, exemplifying highly sophisticated features. My recent talk, Data Workflows for Machine Learning, considers several OSS frameworks in that context, developing a kind of “scorecard” to help assess best-of-breed features. Hopefully it can help you decide which frameworks suit your use case.

By definition, a workflow encompasses both the automation that we’re leveraging (e.g., machine learning apps running on clusters) and the people and process around it. In terms of automation, some larger players have departed from “conventional wisdom” for their clusters and ML apps. For example, while the rest of the industry embraced virtualization, Google avoided that path by using cgroups for isolation. Twitter sponsored a similar open source approach, Apache Mesos, which has been credited with helping resolve their “Fail Whale” issues prior to their IPO. As other large firms adopt this strategy, the implication is that VMs may have run out of steam. Certainly, single-digit utilization rates at data centers (the current industry norm) will not scale to handle IoT data rates: energy companies could not handle that surge, let alone the enormous cap-ex implied. I'll be presenting on Datacenter Computing with Apache Mesos next Tuesday at the Big Data DC Meetup, held at AddThis. We’ll discuss the Mesos approach of mixed workloads for better elasticity, higher utilization rates, and lower latency.

On the people side, a very different set of issues looms ahead. Industry is retooling on a massive scale. It’s not about buying a whole new set of expensive tools for Big Data. Rather, it’s about retooling how people in general think about computable problems. One vital component may well be the lack of advanced math in the hands of business leaders. Seriously, we still frame requirements for college math in Cold War terms: years of calculus were intended to filter out the best Mechanical Engineering candidates, who could then help build the best ICBMs. However, in business today the leadership needs to understand how to contend with enormous data rates and meanwhile deploy high-ROI apps at scale: how and when to leverage graph queries, sparse matrices, convex optimization, Bayesian statistics – topics that are generally obscured beyond the “killing fields” of calculus.

A new book by Allen Day and me in development at O’Reilly called “Just Enough Math” introduces advanced math for business people, especially to learn how to leverage open source frameworks for Big Data – much of which comes from organizations that leverage sophisticated math, e.g., Twitter. Each morsel of math is considered in the context of concrete business use cases, lots of illustrations, and historical background – along with brief code examples in Python that one can cut & paste.

This next week in the DC area I will be teaching a full-day workshop that includes material from all of the above:

Machine Learning for Managers Tue, Apr 15, 8:30am–4:30pm (Eastern) MicroTek, 1101 Vermont Ave NW #700, Washington, DC 20005

That workshop provides an introduction to ML – something quite different than popular MOOCs or vendor training – with emphasis placed as much on the “soft skills” as on the math and coding. We’ll also have a drinkup in the area, to gather informally and discuss related topics in more detail:

Drinks and Data Science Wed, Apr 16, 6:30–9:00pm (Eastern) location TBD

Looking forward to meeting you there!

Hadoop for Data Science: A Data Science MD Recap

On October 9th, Data Science MD welcomed Dr. Donald Miner as its speaker to talk about doing data science work and how the Hadoop framework can help. To start the presentation, Don was very clear about one thing: Hadoop is bad at a lot of things. It is not meant to be a panacea for every problem a data scientist will face.

With that in mind, Don spoke about the benefits that hadoop offers data scientists. Hadoop is a great tool for data exploration. It can easily handle filtering, sampling and anti-filtering (summarization) tasks. When speaking about these concepts, Don expressed the benefits of each and included some anecdotes that helped to show real world value. He also spoke about data cleanliness in a very Baz Luhrmann Wear Sunscreen sort of way, offering that as his biggest piece of advice.

Don then transitioned to the more traditional data science problems of classification (including NLP) and recommender systems.

The talk was very well received by DSMD members. If you missed it, check out the video:

http://www.youtube.com/playlist?list=PLgqwinaq-u-Mj5keXlUOrH-GKTR-LDMv4

Our next event will be November 20th, 2013 at Loyola University Maryland Graduate Center starting at 6:30PM. We will be digging deeper into the daily lives of 3 data scientists. We hope you will join us!

Getting Ready to Teach the Elephant to Read: A Strata + Hadoop World 2013 Tutorial

We (Ben Bengfort and Sean Murphy) are very excited to be holding the Teaching the Elephant to Read tutorial at the sold-out Strata + Hadoop World 2013 on Monday, the 28th of October. We will be discussing and using numerous software packages that can be time-consuming to install on various operating systems and laptops. If you are taking our tutorial, we strongly encourage you to set aside an hour or two this weekend to follow the instructions below to install and configure the virtual machine needed for the class. The instructions have been tested and debugged, so you shouldn't have too many problems (famous last words ;).

Important Notes

Please note that

  1. you will need a 64-bit machine and operating system for this tutorial. The virtual machine/image that we will be building and using has been tested on Mac OS X (up through Mavericks) and 64-bit Windows.
  2. this process could take an hour or longer depending on the bandwidth of your connection as you will need to download approximately 1 GB of software.

1) Install and Configure your Virtual Machine

First, you will need to install Virtual Box, free software from Oracle. Go here to download the 64-bit version appropriate for your machine.

Download Virtual Box

Once Virtual Box is installed, you will need to grab an Ubuntu x64 Server 12.04 LTS image; you can download it directly from Ubuntu here.

Ubuntu Image

There are numerous tutorials online for creating a virtual machine from this image with Virtual Box. We recommend that you configure your virtual machine with at least 1GB of RAM and a 12 GB hard drive.

2) Setup Linux

First, let's create a user account with admin privileges with username "hadoop" and the very creative password "password."

username: hadoop
password: password

Honestly, you don't have to do this. If you have a user account that can already sudo, you are good to go and can skip ahead to installing the software. If not, use the following commands.

~$ sudo adduser hadoop
~$ sudo usermod -a -G sudo hadoop
~$ sudo adduser hadoop sudo

Log out and log back in as "hadoop."

Now you need to install some software.

~$ sudo apt-get update && sudo apt-get upgrade
~$ sudo apt-get install build-essential ssh avahi-daemon
~$ sudo apt-get install vim lzop git
~$ sudo apt-get install python-dev python-setuptools libyaml-dev
~$ sudo easy_install pip

The above installs may take some time.

At this point you should probably generate some ssh keys (for Hadoop, and so you can ssh into the machine rather than working in the VM's console).

~$ ssh-keygen
Generating public/private rsa key pair.
Enter file in which to save the key (/home/hadoop/.ssh/id_rsa):
Created directory '/home/hadoop/.ssh'.
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /home/hadoop/.ssh/id_rsa.
Your public key has been saved in /home/hadoop/.ssh/id_rsa.pub.
[… snip …]

Make sure that you leave the passphrase blank; Hadoop will need the keys if you're setting up a cluster for more than one user. Also note that it is good practice to keep the administrator separate from the hadoop user - but since this is a development cluster, we're just taking a shortcut and leaving them the same.

One final step: copy the public key into authorized_keys so that it's authorized for ssh.

~$ cp .ssh/id_rsa.pub .ssh/authorized_keys

You can download this key and use it to ssh into your virtual environment if needed.

3) Install and Configure Hadoop

Hadoop requires Java - and since we're using Ubuntu, we're going to use OpenJDK rather than Sun because Ubuntu doesn't provide a .deb package for Oracle Java. Hadoop supports OpenJDK with a few minor caveats: java versions on hadoop. If you'd like to install a different version, see installing java on hadoop.

~$ sudo apt-get install openjdk-7-jdk

Do a quick check to make sure you have the right version of Java installed:

~$ java -version
java version "1.7.0_25"
OpenJDK Runtime Environment (IcedTea 2.3.10) (7u25-2.3.10-1ubuntu0.12.04.2)
OpenJDK 64-Bit Server VM (build 23.7-b01, mixed mode)

Now we need to disable IPv6 on Ubuntu - there is an issue where Hadoop, when binding to 0.0.0.0, also binds to the IPv6 address. This isn't too hard to fix: simply edit the /etc/sysctl.conf file (with the editor of your choice; I prefer vim) using the following command

sudo vim /etc/sysctl.conf

and add the following lines to the end of the file:

# disable ipv6
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1

Unfortunately you'll have to reboot your machine for this change to take effect. You can then check the status with the following command (0 is enabled, 1 is disabled):

~$ cat /proc/sys/net/ipv6/conf/all/disable_ipv6

And now we're ready to download Hadoop from the Apache Download Mirrors. Hadoop versions are a bit goofy (see "An Update on Apache Hadoop 1.0" in the references): as of October 15, 2013, release 2.2.0 is available, but the stable version is still listed as 1.2.1.

Go ahead and unpack Hadoop in a location of your choice. We've debated internally what directory to place Hadoop and other distributed services like Cassandra or Titan in, but we've landed on /srv thanks to this post. Unpack the file, change the permissions to the hadoop user, and then create a symlink from the versioned directory to a local hadoop link. This will let you point the hadoop link at whichever version you like without worrying about losing versioning.

/srv$ sudo tar -xzf hadoop-1.2.1.tar.gz
/srv$ sudo chown -R hadoop:hadoop hadoop-1.2.1
/srv$ sudo ln -s $(pwd)/hadoop-1.2.1 $(pwd)/hadoop

Now we have to configure some environment variables so that everything executes correctly, and while we're at it we'll create a couple of aliases in our bash profile to make our lives a bit easier. Edit the ~/.profile file in your home directory and add the following to the end:

# Set the Hadoop Related Environment variables
export HADOOP_PREFIX=/srv/hadoop

# Set the JAVA_HOME
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64

# Add Hadoop bin/ directory to PATH
export PATH=$PATH:$HADOOP_PREFIX/bin

# Some helpful aliases

unalias fs &> /dev/null
alias fs="hadoop fs"
unalias hls &> /dev/null
alias hls="fs -ls"
alias ..="cd .."
alias ...="cd ../.."

lzohead() {
    hadoop fs -cat $1 | lzop -dc | head -1000 | less
}

We'll continue configuring the Hadoop environment. Edit the following files in /srv/hadoop/conf/:

hadoop-env.sh

# The java implementation to use. Required.
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64

core-site.xml

<configuration>
    <property>
        <name>fs.default.name</name>
        <value>hdfs://localhost:9000</value>
    </property>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/app/hadoop/tmp</value>
    </property>
</configuration>

hdfs-site.xml

<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>

mapred-site.xml

<configuration>
    <property>
        <name>mapred.job.tracker</name>
        <value>localhost:9001</value>
    </property>
</configuration>

That's it, configuration over! But before we get going we have to format the distributed filesystem in order to use it. We'll store our file system in the /app/hadoop/tmp directory, as per Michael Noll and as we set in the core-site.xml configuration. We'll have to set up this directory and then format the NameNode.

/srv$ sudo mkdir -p /app/hadoop/tmp
/srv$ sudo chown -R hadoop:hadoop /app/hadoop
/srv$ sudo chmod -R 750 /app/hadoop
/srv$ hadoop namenode -format
[… snip …]

You should now be able to run Hadoop's start-all.sh command to start all the relevant daemons:

/srv$ hadoop-1.2.1/bin/start-all.sh
starting namenode, logging to /srv/hadoop-1.2.1/libexec/../logs/hadoop-hadoop-namenode-ubuntu.out
localhost: starting datanode, logging to /srv/hadoop-1.2.1/libexec/../logs/hadoop-hadoop-datanode-ubuntu.out
localhost: starting secondarynamenode, logging to /srv/hadoop-1.2.1/libexec/../logs/hadoop-hadoop-secondarynamenode-ubuntu.out
starting jobtracker, logging to /srv/hadoop-1.2.1/libexec/../logs/hadoop-hadoop-jobtracker-ubuntu.out
localhost: starting tasktracker, logging to /srv/hadoop-1.2.1/libexec/../logs/hadoop-hadoop-tasktracker-ubuntu.out

And you can use the jps command to see what's running:

/srv$ jps
1321 NameNode
1443 DataNode
1898 Jps
1660 JobTracker
1784 TaskTracker
1573 SecondaryNameNode

Furthermore, you can access the various Hadoop web interfaces (by default, the NameNode status page is served on port 50070 and the JobTracker on port 50030).

To stop Hadoop simply run the stop-all.sh command.

4) Install Python Packages and the Code for the Class

To run the code in this section, you'll need to install some Python packages as dependencies, in particular the NLTK library. The simplest way to install these packages is with the requirements.txt file that comes with the code in our repository. We'll clone it into a directory called tutorial.

~$ git clone https://github.com/bbengfort/strata-teaching-the-elephant-to-read.git tutorial
~$ cd tutorial/code
~$ sudo pip install -U -r requirements.txt
[… snip …]

However, if you simply want to install the dependencies yourself, here are the contents of the requirements.txt file:

# requirements.txt
PyYAML==3.10
dumbo==0.21.36
language-selector==0.1
nltk==2.0.4
numpy==1.7.1
typedbytes==0.3.8
ufw==0.31.1-1

You'll also have to download the NLTK data packages which will install to /usr/share/nltk_data unless you set an environment variable called NLTK_DATA. The best way to install all this data is as follows:

~$ sudo python -m nltk.downloader -d /usr/share/nltk_data all

At this point, the only step left is loading data into Hadoop.

References

  1. Hadoop/Java Versions
  2. Installing Java on Ubuntu
  3. An Update on Apache Hadoop 1.0
  4. Running Hadoop on Ubuntu Linux Single Node Cluster
  5. Apache: Single Node Setup

BinaryPig: Scalable Static Binary Analysis Over Hadoop from Endgame

Below is a white paper previously published by Endgame, reproduced with permission. "Endgame uses data science to bring clarity to the cyber domain, allowing you to sense, discover, and act in real time."

Authors: Zachary Hanif, Telvis Calhoun, and Jason Trost {zach, tcalhoun, jtrost}@endgame.com

I. Abstract

Malware collection and analysis are critical to the modern Internet security industry. The increasing tide of ‘unique’ malware samples, and the difficulty of storing and analyzing these samples in a scalable fashion, has been frequently decried in academic forums and industry conferences. Due to the increasing volume of samples being discovered in the wild, the need for an easily deployed, scalable architecture is becoming increasingly pronounced. Over the past 2.5 years Endgame has received 20M samples of malware, equating to roughly 9.5 TB of binary data. In this, we’re not alone. McAfee reports that it currently receives roughly 100,000 malware samples per day and received roughly 10M samples in the last quarter of 2012 [1]. Its total corpus is estimated to be about 100M samples. VirusTotal receives between 300k and 600k unique files per day, and of those, roughly one-third to half are positively identified as malware [2]. Effectively combating and triaging malware is now a Big Data problem. Most malware analysis scripts are unprepared for processing collections of malware once they surpass the 100GB scale, let alone terabytes. This paper describes the architecture of a system the authors built to overcome these challenges, using Hadoop [13] and Apache Pig [14] to enable analysts to leverage pre-existing tools to analyze terabytes of malware at scale. We demonstrate the system by analyzing ~20M malware samples weighing in at 9.5 TB of binary data.

II. Introduction

Malware authors utilize multiple methods of obfuscating their executables to evade detection by automated anti-virus systems. Many malware authors utilize packing, compression, and encryption tools in an effort to defeat automated static analysis; authors continually pack and repack their samples and check them against automated scanning engines. These samples are shared between interested parties, thereby inflating the number of samples that require analysis. These samples are relatively low value: many of them are never released into the wild, and moreover, as their core functionality has not been altered, analysis of these samples represents a needless redundancy. For samples that are in the wild, autonomously spreading malware often employs algorithms that alter key elements of its executable payloads before transmission and execution on a newly compromised system, in an effort to evade detection by network security sensors and other simpler protections. Honeypot systems such as Amun [3] and Dionaea [4] therefore collect huge numbers of samples that appear unique when compared using cryptographic file checksums, yet represent functionally identical software. Further complicating the issue are samples collected by these systems that represent false positives: samples that have mistakenly triggered a heuristic, or benign samples that were submitted to an online scanning engine.

BinaryPig hopes to provide a solution to the scalability problem represented by this influx of malware. It promotes rapid iteration and exploration through processing non-trivial amounts of malware at scale using easily developed scripts. It was also built to take advantage of pre-existing malware analysis scripts.

III. Prior Work

While there is no lack of published descriptions of malware analytic models and feature sets, to our knowledge there are no salient papers that discuss the architecture behind feature and data extraction from a static perspective [7], [8], [9]. Distributed data processing has engineering concerns that are generally unseen in centralized processing infrastructure, yet the majority of papers we explored while researching and building this system focus on the results of derived analytics. We assume that the systems which coordinated the extraction, the storage of results and samples, and other distributed processing concerns were considered either too simplistic, too specialized, or too ungainly for publication. Given what has been published, the authors believe that while systems similar in function to BinaryPig exist, our offering represents something novel in the form of an extensible, free, open-source framework [10].

The most prominent papers surrounding data extraction infrastructure come from a tangential area of research: malware dynamic analysis. There have been a number of excellent discussions of various methods of dynamic analysis at relatively large scales [8], [9], as well as work that lends itself to simply adding new analytical methods to an existing framework [18]. However, none of this work takes advantage of architecture that is likely to already exist within an organization that deals with large sums of data; again, the extraction and storage infrastructure that drives many recently published papers appears to be unique to each of the research institutions.

IV. Design

We needed a system to address scalability, reliability, and maintainability concerns of large-scale malware processing. We determined that Apache Hadoop met many of these requirements, but it had some shortcomings that we had to address.

Developing native MapReduce code can be time consuming even for highly experienced users. An early version of this framework was developed using native MapReduce code, and while functional, was difficult to use and difficult to extend. As a result, we opted to use the Apache Pig DSL to provide a more accessible user and developer experience.

During the development of BinaryPig, Hadoop’s issues with storing and processing large numbers of small files became apparent: attempting to store and iterate over malware samples directly resulted in memory issues on the NameNode and unacceptable performance decreases due to task startup overhead dominating job execution times. To resolve these issues, we created an ingest tool for compressing, packaging, and uploading malware binaries into Hadoop. The ingest tool combines thousands of small binary files into one or more large Hadoop SequenceFiles (each file typically ranging from 10GB to 100GB in size), an internal data storage format that enables Hadoop to take advantage of sequential disk read speeds as well as split these files for parallel processing [15]. The SequenceFiles created by BinaryPig are a collection of key/value pairs where the key is the ID of the binary file (typically its MD5 checksum) and the value is the contents of the file as raw bytes. To ensure that the malware data was properly distributed across computing nodes in the cluster and resilient to failures, the Hadoop Distributed File System (HDFS) was utilized [19]; HDFS provides many nice features for managing files and data, including automatic data replication, failover, and optimizations for batch processing.
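
To make the key/value scheme concrete, here is a minimal pure-Python sketch of what ingest produces conceptually: one (MD5 checksum, raw bytes) pair per sample. The real ingest tool packs these pairs into Hadoop SequenceFiles via the Java API; the staging directory and function name below are hypothetical.

    #!/usr/bin/env python
    # Conceptual sketch only: each malware binary becomes a (key, value) pair
    # where the key is its MD5 checksum and the value is the raw file contents.
    # The actual BinaryPig ingest tool writes these pairs into SequenceFiles.
    import os
    import hashlib

    def binary_records(sample_dir):
        """Walk a directory of samples and yield (md5_hex, raw_bytes) pairs."""
        for root, _, files in os.walk(sample_dir):
            for name in files:
                path = os.path.join(root, name)
                with open(path, 'rb') as f:
                    data = f.read()
                yield hashlib.md5(data).hexdigest(), data

    if __name__ == '__main__':
        # '/data/malware/incoming' is a hypothetical staging directory.
        for key, value in binary_records('/data/malware/incoming'):
            print '%s\t%d bytes' % (key, len(value))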

Within Pig, analysis scripts and their supporting libraries are uploaded to Hadoop’s DistributedCache [16] before job execution and removed from the processing nodes afterwards. This functionality lets users quickly publish analytical scripts and libraries to the processing nodes without the usual work of verifying a deployment, or the time associated with deploying manually.

We developed a series of Pig loader functions that read SequenceFiles, copy binaries to the local filesystem, and enable scripts to process them. These load functions were designed so that analysis scripts can read the files dropped onto the local filesystem, which allows BinaryPig to leverage preexisting malware analysis scripts and tools. In essence, we are using HDFS to distribute our malware binaries across the cluster and MapReduce to push the processing to the node where the data is stored. This prevents needless data copying, as seen in many legacy cluster computing environments. The framework expects analysis scripts to write their output to stdout and any error messages to stderr (as most legacy malware analysis scripts already do). Ideally the data written is structured (JSON, XML, etc.), but the system gladly handles unstructured data too. Data emitted from the system is stored in HDFS for additional batch processing as needed. We also typically load this data into ElasticSearch indices [17]. This distributed index is used for manual exploration by individual users, as well as automated extraction by analytical processes. Through use of the Wonderdog library [12], we have been able to load data from the MapReduce jobs into ElasticSearch in near real time while continuing to serve queries.
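
As an illustration of the script contract just described (a local file path in, structured JSON on stdout, errors on stderr), here is a minimal sketch of an analysis script. It is not part of BinaryPig itself, and it only computes a few hashes.

    #!/usr/bin/env python
    # Minimal example of the expected script contract: read the binary that the
    # loader dropped onto the local filesystem, print JSON results to stdout,
    # and write any error messages to stderr.
    import sys
    import json
    import hashlib

    def analyze(path):
        with open(path, 'rb') as f:
            data = f.read()
        return {
            'md5':    hashlib.md5(data).hexdigest(),
            'sha256': hashlib.sha256(data).hexdigest(),
            'size':   len(data),
        }

    if __name__ == '__main__':
        path = sys.argv[1]
        try:
            print json.dumps(analyze(path))
        except Exception as e:
            sys.stderr.write('error processing %s: %s\n' % (path, e))
            sys.exit(1)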

end_game_fig1

We implemented optimizations to the backend processing system, including using the /dev/shm virtual device for writing malware binaries. This device functions as a filesystem backed by memory managed by the OS, which allowed us to avoid a great deal of disk I/O when writing files and creating processes that could not accept serialized files through stdin. Another optimization was to create long-running analytics daemons in lieu of scripts. Scripts are great for rapid iteration, but we found that interpreter startup times were dominating task execution times, so this change drastically increased performance for some important processing tasks. The long-running daemons listen on a socket and read local file paths over that connection. They then analyze the respective files and return the results through the socket connection.
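
Here is a rough sketch of what such a long-running daemon could look like; the port number and the trivial MD5 "analysis" are placeholders, and the production daemons are certainly more involved.

    #!/usr/bin/env python
    # Sketch of a long-running analysis daemon: load expensive resources once,
    # listen on a local socket, read file paths line by line, and return one
    # JSON result per path over the same connection.
    import json
    import hashlib
    import SocketServer

    class AnalysisHandler(SocketServer.StreamRequestHandler):
        def handle(self):
            for line in self.rfile:
                path = line.strip()
                if not path:
                    continue
                with open(path, 'rb') as f:
                    data = f.read()
                result = {'path': path, 'md5': hashlib.md5(data).hexdigest()}
                self.wfile.write(json.dumps(result) + '\n')

    if __name__ == '__main__':
        # 9999 is an arbitrary placeholder port.
        server = SocketServer.TCPServer(('127.0.0.1', 9999), AnalysisHandler)
        server.serve_forever()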

end_game_fig2

V. Preliminary Results

The initial testing and development of BinaryPig was performed on a 16-node Hadoop cluster. We loaded 20 million malware samples (~9.5 TB of data) onto this cluster. Both development and execution of new analytics are demonstrably faster, and executing them across historical sample sets, something that was not feasible before, is now possible. The ability to develop new analytics and run them against historical sample sets for comparison and learning is exceptionally valuable. A full pass over the 20 million sample historical set was clocked at 5 hours when calculating the MD5, SHA1, SHA256, and SHA512 hashes for each sample. Operation speed is primarily a function of the number of computational cores that can be dedicated to the process. Additionally, the ability to store, operate on, and analyze samples on a multi-use cluster is a marked improvement over requiring a dedicated cluster and storage environment.

VI. Ongoing Work

BinaryPig continues to be developed. Currently there is an effort to migrate existing analytics to the new architecture in order to complete a malware census for use in developing predictive models. These results are continually developed and are intended for publication after validation. Additional input and output functionality between the Pig and analytics layers is being explored, as there is a need to retrieve emitted binary data, without an intermediate encoding and serialization step, for icon, image, and other resource extraction. Additionally, we feel that error emission over the stderr interface is not optimal for production job management and subtle error catching, and we are looking to expand the published examples with scripts that demonstrate such practices.

VII. References

[1] Higgins. “McAfee: Close To 100K New Malware Samples Per Day In Q2.” http://www.darkreading.com/attacks-breaches/mcafee-close-to-100k-new-malware-samples/240006702
[2] “Virus Total File Statistics.” https://www.virustotal.com/en/statistics/ as of 4/9/2013
[3] Amun. http://amunhoney.sourceforge.net/
[4] Dionaea. http://dionaea.carnivore.it/
[5] Chau, Duen Horng, et al. "Polonium: Tera-scale graph mining for malware detection." Proceedings of the Second Workshop on Large-scale Data Mining: Theory and Applications (LDMTA 2010), Washington, DC. Vol. 25. 2010.
[6] Perdisci, Roberto, Andrea Lanzi, and Wenke Lee. "Classification of packed executables for accurate computer virus detection." Pattern Recognition Letters 29.14 (2008): 1941-1946.
[7] Neugschwandtner, Matthias, et al. "Forecast: skimming off the malware cream." Proceedings of the 27th Annual Computer Security Applications Conference. ACM, 2011.
[8] Rieck, Konrad, et al. "Automatic analysis of malware behavior using machine learning." Journal of Computer Security 19.4 (2011): 639-668.
[9] MTRACE. http://media.blackhat.com/bh-eu-12/Royal/bh-eu-12-Royal-Entrapment-WP.pdf
[10] Firdausi, Ivan, et al. "Analysis of machine learning techniques used in behavior-based malware detection." Advances in Computing, Control and Telecommunication Technologies (ACT), 2010 Second International Conference on. IEEE, 2010.
[11] Jang, Jiyong, David Brumley, and Shobha Venkataraman. "Bitshred: feature hashing malware for scalable triage and semantic analysis." Proceedings of the 18th ACM Conference on Computer and Communications Security. ACM, 2011.
[12] Wonderdog. Infochimps. https://github.com/infochimps-labs/wonderdog
[13] Apache Hadoop. Apache Foundation. http://hadoop.apache.org/
[14] Apache Pig. Apache Foundation. http://pig.apache.org/
[15] SequenceFile Format. Hadoop Wiki. http://wiki.apache.org/hadoop/SequenceFile
[16] DistributedCache. Apache Foundation. http://hadoop.apache.org/docs/stable/mapred_tutorial.html#DistributedCache
[17] ElasticSearch. http://www.elasticsearch.org/
[18] MASTIFF. http://sourceforge.net/projects/mastiff/
[19] Borthakur. HDFS Architecture Guide. http://hadoop.apache.org/docs/stable/hdfs_design.pdf

Weekly Round-Up: Hadoop, Big Data vs. Analytics, Process Management, and Palantir

Welcome back to the round-up, an overview of the most interesting data science, statistics, and analytics articles of the past week. This week, we have 4 fascinating articles ranging in topics from Hadoop to business process management. In this week's round-up:

  • To Hadoop or Not to Hadoop?
  • What’s the Difference Between Big Data and Business Analytics?
  • What Big Data Means to BPM
  • How A Deviant Philosopher Built Palantir

To Hadoop or Not to Hadoop?

Our first piece this week is an interesting blog post about what sorts of data operations Hadoop is and isn't good for. The post can serve as a useful guide when trying to figure out whether or not you should use Hadoop to do what you're thinking of doing with your data. It is organized into 5 categories of things you should consider and contains a series of questions you can ask yourself for each of the categories to help with your decision-making.

What’s the Difference Between Big Data and Business Analytics?

This is an excellent post on Cathy O'Neil's Mathbabe blog about how she distinguishes big data from business analytics. Cathy argues that what most people consider big data is really business analytics (on arguably large data sets) and that big data, in her opinion, consists of automated intelligent systems that algorithmically know what to do and need very little human interference. She goes into more detail about the differences between the two, including some examples to drive home her point.

What Big Data Means to BPM

Continuing on the subject of intelligent systems performing business processes, our third piece this week is a Data Informed article about big data's effect on business process management. The article is an interview with Nathaniel Palmer, a BPM veteran practitioner and author. In the interview, Palmer answers questions about what kinds of trends are emerging in business process management, how big data is affecting its practices, and what changes are being brought about because of it.

How A Deviant Philosopher Built Palantir

Our last piece this week is a Forbes article about Palantir, an analytics software company that works with federal intelligence agencies and is funded by In-Q-Tel - the CIA's investment fund. The article describes the company's CEO, what the company does and who it does it for, and delves into some of Palantir's history. Overall, the article provides an interesting look at a very interesting company.

That's it for this week. Make sure to come back next week when we’ll have some more interesting articles! If there's something we missed, feel free to let us know in the comments below.

Read Our Other Round-Ups

Hadoop as a Data Warehouse Archive

Recently, companies have seen huge growth in data volume, both from existing structured data and from new, multi-structured data. Transaction data from online shopping and mobile devices, along with clickstream and social data, is creating more data in one year than was ever created before. How is a company supposed to keep track of and store all of this data effectively? Traditional data warehouses would have to be constantly expanding to keep up with this constant stream of data, making storage increasingly expensive and time-consuming. Businesses have found some relief using Hadoop to extract and load the data into the data warehouse, but as the warehouse becomes full, businesses have had to expand the data warehouse’s storage capabilities.

Instead, businesses should consider moving the data back into Hadoop, turning Hadoop into a data warehouse archive. There are several advantages to using Hadoop as an archive in conjunction with a traditional data warehouse. Here’s a look at a few.

Improved Resilience and Performance

Many of the platforms designed around Hadoop have focused on making Hadoop more user friendly and have adjusted or added features to help protect data. MapR, for example, removes single points of failure in Hadoop that made it easy for data to be destroyed or lost. Platforms will often offer data mirroring across clusters to help support failover and disaster recovery as well.

With a good level of data protection and recovery abilities, Hadoop platforms become a viable option for the long-term storage of Big Data and other data that has been archived in a data warehouse.

Hadoop also keeps historical data online and accessible, which makes it easier to revisit data when new questions come up, and is dramatically faster and easier than going through traditional magnetic tapes.

Handle More Data for Less Cost

Hadoop’s file system can capture tens of terabytes of data in a day, and it does so at the lowest possible cost thanks to open source economics and commodity hardware. Hadoop can also easily handle more data by adding more nodes to the cluster, continuing to process data at speed. This is much less expensive than the continuous upgrades that would be required for a traditional warehouse to keep up with the extreme volume of data. On top of that, tape archives found in traditional data stores can become costly because the data is difficult to retrieve: not only is the data stored offline, requiring significant time to restore, but the cartridges are prone to degrading over time, resulting in costly losses of data.

High Availability

Traditional data warehouses often made it difficult for global businesses to maintain all of their data in one place with employees working and logging in from various locations around the world. Hadoop platforms will generally allow direct access to remote clients that want to mount the cluster to read or write data flows. This means that clients and employees will be working directly on the Hadoop cluster rather than first uploading data to a local or network storage system. In a global business where ETL processing may need to happen several times within the day, high availability is very important.

Reduce Tools Needed

Coupled with increased availability, the ability to access the cluster directly dramatically reduces the number of tools needed to capture data. For example, it reduces the need for log collection tools that may require agents on every application server. It also eliminates the need to keep up with changing tape formats every couple of years, or risk being unable to restore data because it is stored on obsolete tapes.

Author Bio

Rick Delgado, Freelance Tech Journalist

I've been blessed to have a successful career and have recently taken a step back to pursue my passion for writing. I've started doing freelance writing, and I love to write about new technologies and how they can help us and our planet.

Weekly Round-Up: WibiData, Big Data Trends, Analytics Processes, and Human Trafficking

Welcome back to the round-up, an overview of the most interesting data science, statistics, and analytics articles of the past week. This week, we have 4 fascinating articles ranging in topics from Big Data trends to using data to fight human trafficking. In this week's round-up:

  • WibiData Gets $15M to Help It Become the Hadoop Application Company
  • 7 Big Data Trends That Will Impact Your Business
  • Want Better Analytics? Fix Your Processes
  • How Big Data is Being Used to Target Human Trafficking

WibiData Gets $15M to Help It Become the Hadoop Application Company

It was announced this week that Cloudera co-founder Christophe Bisciglia's new company, WibiData, has raised $15 million in a Series B round of financing. WibiData is looking to become a dominant player in the market by selling software that lets companies build consumer-facing applications on Hadoop. This article has additional details about the company and what they are trying to do.

7 Big Data Trends That Will Impact Your Business

We're all interested in seeing what the future of data science and Big Data have in store, and this article identifies 7 trends that the author thinks will continue to develop in the years ahead. Some general themes of the trends listed include predictions about platforms, structure, and programming languages.

Want Better Analytics? Fix Your Processes

In order to succeed in running a data-driven organization, you must have the proper analytical business processes in place so that any insights derived from your efforts can be applied to improving operations. In this article, the author proposes 5 principles to ensure analytics are used correctly and deliver the results the organization wants.

How Big Data is Being Used to Target Human Trafficking

Our last article this week is a piece about how Google announced recently that it will be partnering with other organizations in an effort to leverage data analytics in helping to fight human trafficking. Part of the effort will include aggregation of previously dispersed data and another part will consist of developing algorithms to identify patterns and better predict trafficking trends. This article lists additional details about the project.

That's it for this week. Make sure to come back next week when we’ll have some more interesting articles! If there's something we missed, feel free to let us know in the comments below.

Read Our Other Round-Ups

Beyond Preprocessing - Weakly Inferred Meanings - Part 5

Congrats! This is the final post in our 6 part series! Just in case you have missed any parts, click through to the introduction, part 1, part 2, part 3, and part 4.

NLP of Big Data using NLTK and Hadoop31

After you have treebanks, then what? The answer is that syntactic guessing is not the final frontier of NLP; we must go beyond it to something more semantic. The idea is to determine the meaning of text in a machine-tractable way by creating a TMR, a text-meaning representation (or thematic meaning representation). This, however, is not a trivial task, and now you’re at the frontier of the science.

 

NLP of Big Data using NLTK and Hadoop32

Text-meaning representations are language-independent representations of a language unit, and can be thought of as a series of connected frames that represent knowledge. TMRs allow us to do extremely deep querying of natural language, including the creation of knowledge bases and question-and-answer systems, and they even allow for conversational agents. Unfortunately, they require extremely deep ontologies and knowledge to be constructed, and they can be extremely process intensive, particularly in resolving ambiguity.

NLP of Big Data using NLTK and Hadoop33

We have created a system called WIMs -- Weakly Inferred Meanings -- that attempts to stand in the middle ground between no semantic computation at all and the extremely labor-intensive TMR. WIMs reduce the search space by using a limited but extremely important set of relations. These relations can be created using available knowledge -- WordNet has proved to be a valuable ontology for creating WIMs -- and they are extremely lightweight.

Even better, they’re open source!
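
To give a flavor of how lightweight such relations can be, here is a small sketch that pulls two relation types (hypernyms and member holonyms) out of WordNet through NLTK's corpus reader. This is only an illustration of the idea, not the WIM implementation; it uses the NLTK 2.x attribute-style Synset.name (NLTK 3 makes it a method).

    # Illustration only: a couple of lightweight WordNet relations for a word,
    # the kind of limited relation set described above.
    from nltk.corpus import wordnet as wn

    def relations(word):
        """Return (relation, source, target) triples for the word's first sense."""
        triples = []
        for synset in wn.synsets(word)[:1]:
            for hyper in synset.hypernyms():
                triples.append(('is-a', synset.name, hyper.name))
            for holo in synset.member_holonyms():
                triples.append(('member-of', synset.name, holo.name))
        return triples

    if __name__ == '__main__':
        for triple in relations('dog'):
            print triple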

NLP of Big Data using NLTK and Hadoop34

Both TMRs and WIMs are graph representations of content, and therefore any semantic computation involving these techniques will involve graph traversal. Although there are graph databases created on top of HDFS (particularly Titan on HBase), graph computation is not the strong point of MapReduce. Hadoop, unfortunately, can only get us so far.

NLP of Big Data using NLTK and Hadoop35

Hadoop for Preprocessing Language - Part 4

We are glad that you have stuck around for this long and, just in case you have missed any parts, click through to the introduction, part 1, part 2, and part 3.

NLP of Big Data using NLTK and Hadoop21

You might ask me, doesn’t Hadoop do text processing extremely well? After all, the first Hadoop jobs we learn are word count and inverted index!

NLP of Big Data using NLTK and Hadoop22

The answer is that NLP preprocessing techniques are more complicated than splitting on whitespace and punctuation, and different tasks require different kinds of tokenization (also called segmentation or chunking). Consider the following sentence:

“You're not going to the U.S.A. in that super-zeppelin, Dr. Stoddard?”

How do you split this as a stand-alone sentence? If you simply used punctuation, this would segment (sentence tokenization) into five sentences (“You’re not going to the U.”, “S.”, “A.”, “in that super-zeppelin, Dr.”, “Stoddard?”). Also, is “You’re” one token or two? What about punctuation? Is “Dr. Stoddard” one token or more? How about “super-zeppelin”? N-gram analysis and other syntactic tokenization will also probably require different token lengths that go beyond whitespace.
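
To see the difference in practice, here is a short sketch comparing naive punctuation splitting with NLTK's trained tokenizers (this assumes the NLTK punkt model data has been downloaded):

    # Naive splitting chops the abbreviations apart; the trained Punkt model
    # behind sent_tokenize generally keeps the sentence whole.
    from nltk.tokenize import sent_tokenize, wordpunct_tokenize

    text = ("You're not going to the U.S.A. in that super-zeppelin, "
            "Dr. Stoddard?")

    print text.split('.')           # naive: 'U', 'S', 'A', and 'Dr' get split off
    print sent_tokenize(text)       # Punkt-based sentence tokenization
    print wordpunct_tokenize(text)  # one possible word/punctuation tokenization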

NLP of Big Data using NLTK and Hadoop23

So we require some more formal NLP mechanisms, even for simple tokenization. However, I propose that Hadoop might be perfect for language preprocessing. A Hadoop job creates output in the file system, so each job can be considered an NLP preprocessing task. Moreover, Hadoop is used this way in many other Big Data analytics: last-mile computations usually occur within 100GB of memory, and MapReduce jobs are used to perform calculations designed to transform data into something that is computable in that memory space. We will do the same thing with NLP, and transform our raw text as follows:

Raw Text → Tokenized Text → Tagged Text → Parsed Text → Treebanks

Namely, after we have tokenized our text depending on our requirements, splitting it into sentences, chunks, and tokens as required, we then want to understand the syntactic class of each token and tag it as such. Tagged text can then be structured into parses - structured representations of the sentence. The final output, used for training our stochastic mechanisms and for going beyond to more semantic analyses, is a set of treebanks. Each of these tasks can be one or more MapReduce jobs.
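
Here is a small, non-Hadoop sketch of the first stages of that pipeline (assuming the relevant NLTK tokenizer and tagger data have been downloaded); in the MapReduce version, each stage would be a job whose output in HDFS feeds the next stage.

    # Raw text -> sentences -> tokens -> (token, tag) pairs.
    from nltk import pos_tag
    from nltk.tokenize import sent_tokenize, wordpunct_tokenize

    raw = "The elephant read the corpus. Then it tagged every token."

    for sentence in sent_tokenize(raw):        # sentence segmentation
        tokens = wordpunct_tokenize(sentence)  # tokenization
        print pos_tag(tokens)                  # part-of-speech tagging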

NLP of Big Data using NLTK and Hadoop27

NLTK comes with a few notable built-ins that make integrating your preprocessing with Hadoop easier (you’ll note all these methods are stochastic):

  • Punkt word and sentence tokenizer - uses an unsupervised training set to learn sentence boundaries and non-terminating punctuation such as abbreviations. It doesn’t require hand-segmented sentences to perform tokenization.

  • Brill Tagger - a transformational rule-based tagger that does a first pass of tagging, then applies correction rules learned from a tagged training data set.

  • Viterbi Parser - a dynamic programming parser that uses a weighted grammar to fill in a most-likely-constituent table and very quickly come up with the most likely parse.

NLP of Big Data using NLTK and Hadoop28

The end result after a series of MapReduce jobs (we had six) was a treebank -- a machine-tractable syntactic representation of language; it’s very important.
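
If you want to see what a treebank looks like, NLTK ships a small sample of the Penn Treebank with its corpus data (assuming the treebank data set has been downloaded):

    # Peek at the Penn Treebank sample bundled with NLTK's corpus data.
    from nltk.corpus import treebank

    tree = treebank.parsed_sents()[0]  # first parsed sentence in the sample
    print tree                         # the bracketed syntactic structure
    print tree.pos()                   # (token, tag) pairs recovered from the tree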

Python's Natural Language Toolkit (NLTK) and Hadoop - Part 3

Welcome back to part 3 of Ben's talk about Big Data and Natural Language Processing. (Click through to see the intro, part 1, and part 2).

NLP of Big Data using NLTK and Hadoop12

We chose NLTK (Natural Language Toolkit) particularly because it’s not Stanford. Stanford is kind of a magic black box, and it costs money to get a commercial license. NLTK is open source and it’s Python. But more importantly, NLTK is completely built around stochastic analysis techniques and comes with data sets and training mechanisms built in. Particularly because the magic and foo of Big Data with NLP requires using your own domain knowledge and data set, NLTK is extremely valuable from a leadership perspective! And anyway, it does come with out-of-the-box NLP - use the Viterbi parser with a trained PCFG (Probabilistic Context-Free Grammar, also called a weighted grammar) from the Penn Treebank, and you’ll get excellent parses immediately.
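
As a toy illustration of that parser, here is a sketch with a tiny hand-written PCFG rather than one trained on the Penn Treebank. It is written against the NLTK 3 API; NLTK 2.x built the grammar with nltk.parse_pcfg() and returned the best tree from parse() directly.

    # Viterbi parsing with a toy weighted grammar; a real setup would induce
    # the PCFG from the Penn Treebank instead of writing it by hand.
    from nltk import PCFG
    from nltk.parse import ViterbiParser

    grammar = PCFG.fromstring("""
        S   -> NP VP        [1.0]
        NP  -> Det N        [0.6]
        NP  -> 'elephants'  [0.4]
        VP  -> V NP         [1.0]
        Det -> 'the'        [1.0]
        N   -> 'data'       [1.0]
        V   -> 'read'       [1.0]
    """)

    parser = ViterbiParser(grammar)
    for tree in parser.parse('the data read elephants'.split()):
        print(tree)  # the most likely parse, annotated with its probability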

NLP of Big Data using NLTK and Hadoop13

Choosing Hadoop might seem obvious given that we’re talking about Data Science and particularly Big Data. But I do want to point out that the NLP tasks we’re going to talk about right off the bat are embarrassingly parallel - meaning that they are extremely well suited to the MapReduce paradigm. If you consider the sentence to be the unit of natural language, then each sentence (at least to begin with) can be analyzed on its own, with little knowledge required about the processing of surrounding sentences.

NLP of Big Data using NLTK and Hadoop14

Combine that with the many flavors of Hadoop and the fact that you can get a cluster going in your closet for cheap-- it’s the right price for a startup!

NLP of Big Data using NLTK and Hadoop15

The glue that makes NLTK (Python) and Hadoop (Java) play nice is Hadoop Streaming. Hadoop Streaming allows you to create a mapper and a reducer with any executable, and expects that the executable will receive key-value pairs via stdin and output them via stdout. Just keep in mind that all the other Hadoopy-ness still exists (e.g., the FileInputFormat, HDFS, and job scheduling); all you get to replace is the mapper and reducer. But that is enough to include NLTK, so you’re off and running!

Here’s an example of a Mapper and Reducer to get you started doing token counts with NLTK (note that these aren’t word counts -- to computational linguists, words are language elements that have senses and therefore convey meaning. Instead, you’ll be counting tokens, the syntactic base for words in this example, and you might be surprised to find out what tokens are-- trust me, it isn’t as simple as splitting on whitespace and punctuation!).

mapper.py


    #!/usr/bin/env python

    import sys
    from nltk.tokenize import wordpunct_tokenize

    def read_input(file):
        for line in file:
            # split the line into tokens
            yield wordpunct_tokenize(line)

    def main(separator='\t'):
        # input comes from STDIN (standard input)
        data = read_input(sys.stdin)
        for tokens in data:
            # write the results to STDOUT (standard output);
            # what we output here will be the input for the
            # Reduce step, i.e. the input for reducer.py
            #
            # tab-delimited; the trivial token count is 1
            for token in tokens:
                print '%s%s%d' % (token, separator, 1)

    if __name__ == "__main__":
        main()

reducer.py


    #!/usr/bin/env python

    from itertools import groupby
    from operator import itemgetter
    import sys

    def read_mapper_output(file, separator='\t'):
        for line in file:
            yield line.rstrip().split(separator, 1)

    def main(separator='\t'):
        # input comes from STDIN (standard input)
        data = read_mapper_output(sys.stdin, separator=separator)
        # groupby groups multiple word-count pairs by word,
        # and creates an iterator that returns consecutive keys and their group:
        #   current_word - string containing a word (the key)
        #   group - iterator yielding all ["<current_word>", "<count>"] items
        for current_word, group in groupby(data, itemgetter(0)):
            try:
                total_count = sum(int(count) for current_word, count in group)
                print "%s%s%d" % (current_word, separator, total_count)
            except ValueError:
                # count was not a number, so silently discard this item
                pass

    if __name__ == "__main__":
        main()

Running the Job


    hduser@ubuntu:/usr/local/hadoop$ bin/hadoop jar contrib/streaming/hadoop-*streaming*.jar \
        -file /home/hduser/mapper.py    -mapper /home/hduser/mapper.py \
        -file /home/hduser/reducer.py   -reducer /home/hduser/reducer.py \
        -input /user/hduser/gutenberg/* -output /user/hduser/gutenberg-output

Some interesting notes about using Hadoop Streaming relate to memory usage. NLP can sometimes be a memory-intensive task, as you have to load up training data to compute various aspects of your processing, and loading these things up can take minutes at the beginning of a job. However, with Hadoop Streaming, only one interpreter per job is loaded, thus saving you from repeating that loading process. Similarly, you can use generators and other Python iteration techniques to carve through mountains of data very easily. There are some Python libraries, including dumbo, mrjob, and hadoopy, that can make all of this a bit easier.
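
If you do reach for one of those libraries, the token count above collapses to a few lines. Here is a sketch using mrjob (assuming mrjob and NLTK are installed on the cluster nodes); saved as, say, mr_token_count.py, it can run locally or against the cluster with the -r hadoop runner.

    # Token count with mrjob: mrjob generates the Hadoop Streaming plumbing,
    # so the mapper and reducer are just methods that yield key/value pairs.
    from mrjob.job import MRJob
    from nltk.tokenize import wordpunct_tokenize

    class MRTokenCount(MRJob):

        def mapper(self, _, line):
            # emit (token, 1) for every token in the input line
            for token in wordpunct_tokenize(line):
                yield token, 1

        def reducer(self, token, counts):
            # sum the counts emitted for each token
            yield token, sum(counts)

    if __name__ == '__main__':
        MRTokenCount.run()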