
Fast Data Applications with Spark & Python Workshop on November 8th

Data Community DC and District Data Labs are excited to be hosting a Fast Data Applications with Spark & Python workshop on November 8th. For more info and to sign up, go to http://bit.ly/Zhj0y1. There’s even an early bird discount if you register before October 17th!

Hadoop has made the world of Big Data possible by providing a framework for distributed computing on economical, commercial off-the-shelf hardware. Hadoop 2.0 implements a distributed file system, HDFS, and a computing framework, YARN, that allows distributed applications to easily harness the power of clustered computing on extremely large data sets. Over the past decade, the primary application framework has been MapReduce - a functional programming paradigm that lends itself extremely well to designing distributed applications, but carries with it a lot of computational overhead.
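To make the MapReduce idea concrete, here is a hedged sketch (not from the original post) of the classic word count written as a Hadoop Streaming job in Python; the file names are assumptions, and the streaming jar path varies by Hadoop version.

#!/usr/bin/env python
# mapper.py -- emit a (word, 1) pair for every word read from stdin, tab-separated.
import sys

for line in sys.stdin:
    for word in line.strip().lower().split():
        print("%s\t%d" % (word, 1))

#!/usr/bin/env python
# reducer.py -- sum the counts for each word; Hadoop sorts mapper output by key first.
import sys

current, total = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word != current and current is not None:
        print("%s\t%d" % (current, total))
        total = 0
    current = word
    total += int(count)
if current is not None:
    print("%s\t%d" % (current, total))

The two scripts communicate only through sorted key/value text streams, which is what makes the pattern easy to distribute, and also where much of the framework's overhead comes from.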

Simulation and Predictive Analytics

This is a guest post by Lawrence Leemis, a professor in the Department of Mathematics at The College of William & Mary. A front-page article over the weekend in the Wall Street Journal indicated that the number one profession of interest to tech firms is the data scientist: someone whose analytic, computing, and domain skills allow them to detect signals in data and use them to advantage. Although the terms are squishy, the push today is for "big data" and "predictive analytics" skills, which allow firms to leverage the deluge of data that is now accessible.

I attended the Joint Statistical Meetings last week in Boston and I was impressed by the number of talks that referred to big data sets and also the number that used the R language. Over half of the technical talks that I attended included a simulation study of one type or another.

The two traditional aspects of the scientific method, namely theory and experimentation, have been enhanced with computation being added as a third leg. Sitting at the center of computation is simulation, which is the topic of this post. Simulation is a useful tool when analytic methods fail because of mathematical intractability.

The questions that I will address here are how Monte Carlo simulation and discrete-event simulation differ and how they fit into the general framework of predictive analytics.

First, how do Monte Carlo and discrete-event simulation differ? Monte Carlo simulation is appropriate when the passage of time does not play a significant role. Probability calculations involving playing cards, dice, and coins, for example, can be solved by Monte Carlo.
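As a quick, hedged illustration (mine, not the author's), here is a tiny Monte Carlo estimate in Python of the probability that two dice sum to seven; the exact answer is 1/6, so the estimate should hover near 0.1667.

# Monte Carlo estimate of P(two dice sum to 7); note that the passage of time plays no role.
import random

trials = 100000
hits = sum(1 for _ in range(trials)
           if random.randint(1, 6) + random.randint(1, 6) == 7)
print("Estimated probability: %.4f (exact: %.4f)" % (hits / float(trials), 1 / 6.0))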

Discrete-event simulation, on the other hand, has the passage of time as an integral part of the model. The classic application areas for discrete-event simulation are queuing, inventory, and reliability. As an illustration, a mathematical model for a queue with a single server might consist of (a) a probability distribution for the time between arrivals to the queue, (b) a probability distribution for the service time at the queue, and (c) an algorithm for placing entities in the queue (first-come-first-served is the usual default). Discrete-event simulation can be coded in any algorithmic language, although the coding is tedious. Because of the complexities of coding a discrete-event simulation, dozens of languages have been developed to ease the implementation of a model.
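To make that contrast concrete, here is a minimal single-server queue sketch in Python (my own simplification): exponential interarrival and service times stand in for the distributions in (a) and (b), and first-come-first-served order covers (c).

# Minimal discrete-event simulation of a single-server FIFO queue.
import random

random.seed(42)
arrival_rate, service_rate = 1.0, 1.25      # assumed rates for the sketch
n_customers = 10000

clock, last_departure, total_wait = 0.0, 0.0, 0.0
for _ in range(n_customers):
    clock += random.expovariate(arrival_rate)      # (a) time between arrivals
    service = random.expovariate(service_rate)     # (b) service time
    start = max(clock, last_departure)             # (c) FIFO: wait for the server to free up
    total_wait += start - clock
    last_departure = start + service

print("Average wait in queue: %.3f" % (total_wait / n_customers))

Unlike the Monte Carlo example above, here the simulation clock itself drives the model.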

The field of predictive analytics leans heavily on the tools of data mining in order to identify patterns and trends in a data set. Once an appropriate question has been posed, these patterns and trends in explanatory variables (often called covariates) are used to predict future behavior of variables of interest. There is both an art and a science to predictive analytics. The science side includes the standard tools associated with mathematics, computation, probability, and statistics. The art side consists mainly of making appropriate assumptions about the mathematical model constructed for predicting future outcomes. Simulation is used primarily for verification and validation of the mathematical models associated with a predictive analytics model. It can be used to determine whether the probabilistic models are reasonable and appropriate for a particular problem.

Two sources for further training in simulation are a workshop in Catonsville, Maryland on September 12-13, taught by Barry Lawson (University of Richmond) and me, and the Winter Simulation Conference (December 7-10, 2014) in Savannah.

Event Recap: DC Energy and Data Summit

This is a guest post by Majid al-Dosari, a master’s student in Computational Science at George Mason University. I recently attended the first DC Energy and Data Summit organized by Potential Energy DC and co-hosted by the American Association for the Advancement of Science’s Fellowship Big Data Affinity Group. I was excited to be at a conference where two important issues of modern society meet: energy and (big) data!

There was a keynote and plenary panel. In addition, there were three breakout sessions where participants brainstormed improvements to building energy efficiency, the grid, and transportation. Many of the issues raised at the conference could be either big data or energy issues (separately). However, I’m only going to highlight points raised that deal with both energy and data.

In the keynote, Joel Gurin (NYU Governance Lab, Director of OpenData500) emphasized the benefits of open government data (which can include unexpected use cases). In the energy field, this includes data about electric power consumption, solar irradiance, and public transport. He mentioned that the private sector also has a role in publishing and adding value to existing data.

Then, in the plenary panel, Lucy Nowell (Department of Energy) brought up the costs associated with the management, transport, and analysis of big data. These costs can be measured in terms of time and energy. You can ask this question: at what point does it “cost” less to transport some amount of data physically (via a SneakerNet) than to send it over a computer network?
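As a back-of-the-envelope illustration of that question (the numbers below are assumptions, not from the talk), a few lines of Python compare shipping a data set over a network link with driving the drives across town:

# Rough crossover between network transfer and "SneakerNet" for a large data set.
data_tb = 100.0      # assumed data set size, in terabytes
link_gbps = 1.0      # assumed sustained network throughput, in gigabits per second
transfer_hours = (data_tb * 8e12 / (link_gbps * 1e9)) / 3600.0
courier_hours = 4.0  # assumed round trip for a courier carrying the drives
print("Network: %.0f hours, courier: %.0f hours" % (transfer_hours, courier_hours))

At these assumed numbers the courier wins by a wide margin (roughly 222 hours versus 4); the crossover shifts with link speed, data volume, and the energy cost of each option.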

After the panel, I attended the breakout session dealing with the energy efficiency of homes and businesses. The former is the domain of Opower, represented by Asher Burns-Burg, while the latter is the domain of Aquicore, represented by Logan Soya. It is interesting to compare the two companies’ general strategies: Opower uses psychological methods to encourage households to reduce consumption, while Aquicore uses business metrics to show building managers how they can save money. Both are data-enabled.

Asher claims that Opower is just scratching the surface with what is possible with the use of data. He also talked about how personalization can be used to deliver more effective messages to consumers. Meanwhile, Aquicore has challenges associated with working with existing (old) metering technology in order to obtain more fine-grained data on building energy use.

In the concluding remarks, I became aware of discussions at the other breakout sessions. The most notable to me was a concern raised by the transportation session: The rebound effect can offset any gain in efficiency by an increase in consumption. Also, the grid breakout session suggested that there should be a centralized “data mart” and a way to be able to easily navigate the regulations of the energy industry.

While DC is not Houston, its unique mix of policy, entrepreneurship, and analytical talent gives DC the potential to innovate in this area. Credit goes to Potential Energy DC for creating a supportive environment.

A Rush of Ideas: Kalev Leetaru at Data Science DC

This review of the April Data Science DC Meetup was written by Ross Mohan. Ross is a solutions architect for Five 9 Group.

Perhaps you’ve heard the phrase lately that “software is eating the world.” Well, to be successful at that, it’s going to have to do at least as good a job of eating the world’s data as do the systems of Kalev Leetaru, Georgetown/Yahoo! fellow.

Kalev Leetaru, lead investigator on GDELT and other tools, defines “world class” work — certainly in the sense of the size and scope of the data. The goal of GDELT and related systems is to stream global news and social media in as near realtime as possible through multiple processing steps. The overall goal is to arrive at reliable tone (sentiment) mining and differential conflict detection, and to do so globally. It is a grand goal.

Kalev Leetaru’s talk covered several broad areas: the history of data and communication, data quality and “gotcha” issues in data sourcing and curation, the geography of Twitter, processing architecture, toolkits and considerations, and data formatting observations. In each he offered a fresh perspective or a novel idea, born of the requirement to handle enormous quantities of ‘noisy’ or ‘dirty’ data.

Perspectives

Leetaru observed that “the map is not the territory” in the sense that actual voting, resource, or policy boundaries as measured by various data sources may not match assigned national boundaries. He flagged this as a question of “spatial error bars” for maps.

Distinguishing global data science from well-established HPC-like pursuits (such as computational chemistry), Kalev Leetaru observed that we make our own bespoke toolkits and that there is no single “magic toolkit” for Big Data, so we should be prepared and willing to spend time putting our toolchain together.

After talking a bit about the historical evolution and growth of data, Kalev Leetaru asked a few perspective-changing questions (some clearly relevant to intelligence agency needs): How do you find all protests? How do you locate all law books? Some of the more interesting data curation tools and resources Kalev Leetaru mentioned — and a lot more — can be found by the interested reader in The Oxford Guide to Library Research by Thomas Mann.

GDELT (covered further below) labels parse trees with error rates, and reaches beyond the “what” of simple news media to tell us “why” and “how reliable.” One GDELT output product among many is the Daily Global Conflict Report, which covers world leader emotional state and differential change in conflict, not absolute markers.

One recurring theme was finding ways to define and support “truth.” Kalev Leetaru decried one current trend in Big Data, the so-called “Apple Effect”: making luscious pictures from data, with more focus on appearance than on actual ground truth. One example he cited was a conclusion from a recent report on Syria which, blithely based on geotagged English-language tweets and Facebook postings, cast a skewed light on Syria’s rebels (Bzzzzzt!).

Twitter

Leetaru provided one answer to “how to ground-truth data” by asking “how accurate are geotagged tweets?” Such tweets are, after all, only 3% of the total. But he used those tweets reliably. How? By correlating tweet location with electric power availability (r = .89). He also talked about how to handle emoticons, irony, sarcasm, and other affective language, cautioning analysts to think beyond blindly plugging data into pictures.

Kalev Leetaru talked engagingly about the geography of Twitter, encouraging us to do more RTFD (D = data) than RTFM. Cut your own way through the forest: the valid maps have not been made yet, so be prepared to make your own. Among the challenges he cited were how to break up typical #hashtagswithnowhitespace and put them back into sentences, and how to build — and maintain — sentiment/tone dictionaries; expect, therefore, to spend the vast majority of time on innovative projects tuning the algorithms by hand, understanding the data, and then iterating the machine. Refreshingly “hands on.”
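As one hedged sketch of the hashtag problem he described (the tiny vocabulary and the greedy longest-prefix strategy are my own simplifications), a crude segmenter might look like this in Python:

# Crude hashtag segmentation against a small, assumed vocabulary.
VOCAB = {"hash", "hashtags", "with", "no", "whitespace", "tag", "tags"}

def segment(tag, vocab=VOCAB):
    """Return one segmentation of a hashtag body, or None if no split is found."""
    if not tag:
        return []
    for i in range(len(tag), 0, -1):        # try the longest prefix first
        head, rest = tag[:i], tag[i:]
        if head in vocab:
            tail = segment(rest, vocab)
            if tail is not None:
                return [head] + tail
    return None

print(segment("hashtagswithnowhitespace"))  # ['hashtags', 'with', 'no', 'whitespace']

A production version would back the vocabulary with word frequencies and handle numbers and misspellings, which is exactly the kind of hand-tuning Leetaru was describing.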

Scale and Tech Architecture

Kalev Leetaru turned to discuss the scale of the data, which is now easily in the petabytes-per-day range. There is no longer any question that automation must be used and that serious machinery will be involved. Our job is to get that automation machinery doing the right thing, and if we do so, we can measure the ‘heartbeat of society.’

For a book images project (60 million images spanning hundreds of years) he mentioned a number of tools and file systems (though neither Gluster nor Ceph, disappointingly to this reviewer!) and delved deeply and masterfully into the question of how to clean and manage the very dirty data of the “closed captioning” found in news reports. To full-text geocode and analyze half a million hours of news (from the Internet Archive), we need fast language detection and captioning-error assessment. What makes this task horrifically difficult is that POS tagging “fails catastrophically on closed captioning” and that closed-captioned text is far worse in quality than optical character recognition output. The standard Stanford NL Understanding toolkit is very “fragile” in this domain, one reason being that news media has an extremely high density of location references, forcing the analyst to use context to disambiguate.

He covered his GDELT (Global Database of Events, Language, and Tone), which tracks human and societal behavior and beliefs at scale around the world. A system of half a billion plus georeferenced rows, 58 columns wide, drawing on 100,000 sources such as broadcast, print, and online media back to 1979, it relies on both human translation and Google Translate, and will soon be extended across languages and back to the 1800s. Further, he is incorporating 21 billion words of academic literature into this model (a first!) and expects availability in Summer 2014 (sources include JSTOR, DTIC, CIA, CVORE, CiteSeerX, IA).

GDELT’s architecture, which relies heavily on the Google Cloud and BigQuery, can stream at 100,000 input observations/second. This reviewer wanted to ask him about update and delete needs and speeds, but the stream is designed to optimize ingest and query. GDELT tools were myriad, but Perl was frequently mentioned (for text processing).

Kalev Leetaru shared some post-GDELT-construction takeaways: “it’s not all English,” and watch out for full Unicode compliance in your toolset, lest your lovely data processing stack SEGFAULT halfway through a load. Store data in whatever is easy to maintain and fast. Modularity is good, but performance can be an issue; watch out for XML, which bogs down processing on highly nested data, and use it for interchange more than anything else. Sharing seems “nice,” but “you can’t share a graph,” and “RAM disk is your friend,” even more so than SSD, FusionIO, or fast SANs.

The talk, like this blog post, ran over allotted space and time, but the talk was well worth the effort spent understanding it.

The Evolution of Big Data Platforms and People

This is a guest post by Paco Nathan. Paco is an O’Reilly author, Apache Spark open source evangelist with Databricks, and an advisor for Zettacap, Amplify Partners, and The Data Guild. Google lives in his family’s backyard. Paco spoke at Data Science DC in 2012.

A kind of “middleware” for Big Data has been evolving since the mid-2000s. Abstraction layers help make it simpler to write apps in frameworks such as Hadoop. Beyond the relatively simple issue of programming convenience, there are much more complex factors in play. Several open source frameworks have emerged that build on the notion of workflow, exemplifying highly sophisticated features. My recent talk, Data Workflows for Machine Learning, considers several OSS frameworks in that context, developing a kind of “scorecard” to help assess best-of-breed features. Hopefully it can help your decisions about which frameworks suit your use case needs.

By definition, a workflow encompasses both the automation that we’re leveraging (e.g., machine learning apps running on clusters) as well as people and process. In terms of automation, some larger players have departed from “conventional wisdom” for their clusters and ML apps. For example, while the rest of the industry embraced virtualization, Google avoided that path by using cgroups for isolation. Twitter sponsored a similar open source approach, Apache Mesos, which was credited with helping resolve their “Fail Whale” issues prior to their IPO. As other large firms adopt this strategy, the implication is that VMs may have run out of steam. Certainly, single-digit utilization rates at data centers (the current industry norm) will not scale to handle IoT data rates: energy companies could not handle that surge, let alone the enormous cap-ex implied. I'll be presenting on Datacenter Computing with Apache Mesos next Tuesday at the Big Data DC Meetup, held at AddThis. We’ll discuss the Mesos approach of mixed workloads for better elasticity, higher utilization rates, and lower latency.

On the people side, a very different set of issues looms ahead. Industry is retooling on a massive scale. It’s not about buying a whole new set of expensive tools for Big Data; rather, it’s about retooling how people in general think about computable problems. One vital missing component may well be advanced math in the hands of business leaders. Seriously, we still frame requirements for college math in Cold War terms: years of calculus were intended to filter out the best Mechanical Engineering candidates, who could then help build the best ICBMs. However, in business today the leadership needs to understand how to contend with enormous data rates and meanwhile deploy high-ROI apps at scale: how and when to leverage graph queries, sparse matrices, convex optimization, Bayesian statistics – topics that are generally obscured beyond the “killing fields” of calculus.

A new book by Allen Day and me, in development at O’Reilly and called “Just Enough Math,” introduces advanced math for business people, especially those who want to learn how to leverage open source frameworks for Big Data – much of which comes from organizations that rely on sophisticated math, e.g., Twitter. Each morsel of math is considered in the context of concrete business use cases, with lots of illustrations and historical background – along with brief code examples in Python that one can cut & paste.
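In the spirit of those morsels (this snippet is mine, not from the book), one such topic, sparse matrices, comes down to a few lines with SciPy: store only the nonzero entries and still do linear algebra on the result.

# A tiny sparse-matrix "morsel": most entries are zero, so store only the nonzeros.
import numpy as np
from scipy.sparse import csr_matrix

dense = np.array([[0, 0, 3],
                  [4, 0, 0],
                  [0, 0, 0]])
sparse = csr_matrix(dense)
print("%d nonzero entries out of %d" % (sparse.nnz, dense.size))
print((sparse * sparse.T).toarray())    # sparse matrix product, back to dense for display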

This next week in the DC area I will be teaching a full-day workshop that includes material from all of the above:

Machine Learning for Managers Tue, Apr 15, 8:30am–4:30pm (Eastern) MicroTek, 1101 Vermont Ave NW #700, Washington, DC 20005

That workshop provides an introduction to ML – something quite different than popular MOOCs or vendor training – with emphasis placed as much on the “soft skills” as on the math and coding. We’ll also have a drinkup in the area, to gather informally and discuss related topics in more detail:

Drinks and Data Science Wed, Apr 16, 6:30–9:00pm (Eastern) location TBD

Looking forward to meeting you there!

Hadoop for Data Science: A Data Science MD Recap

On October 9th, Data Science MD welcomed Dr. Donald Miner as its speaker to talk about doing data science work and how the Hadoop framework can help. To start the presentation, Don was very clear about one thing: Hadoop is bad at a lot of things. It is not meant to be a panacea for every problem a data scientist will face.

With that in mind, Don spoke about the benefits that Hadoop offers data scientists. Hadoop is a great tool for data exploration: it can easily handle filtering, sampling, and anti-filtering (summarization) tasks. When speaking about these concepts, Don explained the benefits of each and included anecdotes that showed their real-world value. He also spoke about data cleanliness in a very Baz Luhrmann “Wear Sunscreen” sort of way, offering that as his biggest piece of advice.
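As a hedged sketch of the filtering and sampling tasks Don described (my example, not his), a Hadoop Streaming mapper can pass through a random one-percent sample of the lines that match a filter term; the keyword and sampling rate are assumptions.

#!/usr/bin/env python
# sample_mapper.py -- keep roughly 1% of the input lines that contain a keyword.
import random
import sys

SAMPLE_RATE = 0.01
KEYWORD = "error"    # assumed filter term for the sketch

for line in sys.stdin:
    if KEYWORD in line and random.random() < SAMPLE_RATE:
        sys.stdout.write(line)

Run as a map-only job (zero reducers), this turns a cluster-sized data set into something small enough to explore on a laptop.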

Don then transitioned to the more traditional data science problems of classification (including NLP) and recommender systems.

The talk was very well received by DSMD members. If you missed it, check out the video:

http://www.youtube.com/playlist?list=PLgqwinaq-u-Mj5keXlUOrH-GKTR-LDMv4

Our next event will be November 20th, 2013 at Loyola University Maryland Graduate Center starting at 6:30PM. We will be digging deeper into the daily lives of 3 data scientists. We hope you will join us!

Getting Ready to Teach the Elephant to Read: A Strata + Hadoop World 2013 Tutorial

We (Ben Bengfort and Sean Murphy) are very excited to be holding the Teaching the Elephant to Read tutorial at the sold out Strata + Hadoop World 2013 on Monday, the 28th of October. We will be discussing and using numerous software packages that can be time consuming to install on various operating systems and laptops. If you are taking our tutorial, we strongly encourage you to set aside an hour or two this weekend to follow the instructions below to install and configure the virtual machine needed for the class. The instructions have been tested and debugged so you shouldn't have too many problems (famous last words ;).

Important Notes

Please note that

  1. you will need a 64-bit machine and operating system for this tutorial. The virtual machine/image that we will be building and using has been tested on Mac OS X (up through Mavericks) and 64-bit Windows.
  2. this process could take an hour or longer depending on the bandwidth of your connection as you will need to download approximately 1 GB of software.

1) Install and Configure your Virtual Machine

First, you will need to install Virtual Box, free software from Oracle. Go here to download the 64-bit version appropriate for your machine.

Download Virtual Box

Once Virtual Box is installed, you will need to grab a Ubuntu x64 Server 12.04 LTS image and you can do that directly from Ubuntu here.

Ubuntu Image

There are numerous tutorials online for creating a virtual machine from this image with Virtual Box. We recommend that you configure your virtual machine with at least 1GB of RAM and a 12 GB hard drive.

2) Setup Linux

First, let's create a user account with admin privileges with username "hadoop" and the very creative password "password."

username: hadoop
password: password

Honestly, you don't have to do this. If you have a user account that can already sudo, you are good to go and can skip ahead to installing the software. If not, use the following commands.

~$ sudo adduser hadoop
~$ sudo usermod -a -G sudo hadoop
~$ sudo adduser hadoop sudo

Log out and log back in as "hadoop."

Now you need to install some software.

~$ sudo apt-get update && sudo apt-get upgrade
~$ sudo apt-get install build-essential ssh avahi-daemon
~$ sudo apt-get install vim lzop git
~$ sudo apt-get install python-dev python-setuptools libyaml-dev
~$ sudo easy_install pip

The above installs may take some time.

At this point you should probably generate some ssh keys (for Hadoop, and so you can ssh into the VM rather than working in its console).

~$ ssh-keygen
Generating public/private rsa key pair.
Enter file in which to save the key (/home/hadoop/.ssh/id_rsa):
Created directory '/home/hadoop/.ssh'.
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /home/hadoop/.ssh/id_rsa.
Your public key has been saved in /home/hadoop/.ssh/id_rsa.pub.
[… snip …]

Make sure that you leave the passphrase blank; Hadoop will need the keys if you're setting up a cluster for more than one user. Also note that it is good practice to keep the administrator separate from the hadoop user, but since this is a development cluster, we're just taking a shortcut and leaving them the same.

One final step: copy the public key into the authorized keys file so it can be used for ssh.

~$ cp .ssh/id_rsa.pub .ssh/authorized_keys

You can download this key and use it to ssh into your virtual environment if needed.

3) Install and Configure Hadoop

Hadoop requires Java, and since we're using Ubuntu, we're going to use OpenJDK rather than Sun's JDK because Ubuntu doesn't provide a .deb package for Oracle Java. Hadoop supports OpenJDK with a few minor caveats (see the Hadoop/Java versions reference below). If you'd like to install a different version, see the reference on installing Java for Hadoop.

~$ sudo apt-get install openjdk-7-jdk

Do a quick check to make sure you have the right version of Java installed:

~$ java -version
java version "1.7.0_25"
OpenJDK Runtime Environment (IcedTea 2.3.10) (7u25-2.3.10-1ubuntu0.12.04.2)
OpenJDK 64-Bit Server VM (build 23.7-b01, mixed mode)

Now we need to disable IPv6 on Ubuntu: when Hadoop binds to 0.0.0.0, it also binds to the IPv6 address, which causes problems. This isn't too hard to fix: simply edit the /etc/sysctl.conf file (with the editor of your choice; I prefer vim) using the following command

sudo vim /etc/sysctl.conf

and add the following lines to the end of the file:

# disable ipv6
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1

Unfortunately, you'll have to reboot your machine for this change to take effect. You can then check the status with the following command (0 means IPv6 is enabled, 1 means it is disabled):

~$ cat /proc/sys/net/ipv6/conf/all/disable_ipv6

And now we're ready to download Hadoop from the Apache download mirrors. Hadoop versions are a bit goofy (see “An Update on Apache Hadoop 1.0” in the references): as of October 15, 2013, release 2.2.0 is available, but the stable version is still listed as 1.2.1, which is what we'll use.

Go ahead and unpack it in a location of your choice. We've debated internally which directory to place Hadoop and other distributed services like Cassandra or Titan in, but we've landed on /srv thanks to this post. Unpack the file, change the ownership to the hadoop user, and then create a symlink from the versioned directory to a local hadoop link. This lets you point the link at whatever version you like without worrying about losing track of versions.

/srv$ sudo tar -xzf hadoop-1.2.1.tar.gz
/srv$ sudo chown -R hadoop:hadoop hadoop-1.2.1
/srv$ sudo ln -s $(pwd)/hadoop-1.2.1 $(pwd)/hadoop

Now we have to configure some environment variables so that everything executes correctly, and while we're at it we'll create a couple of aliases in our bash profile to make our lives a bit easier. Edit the ~/.profile file in your home directory and add the following to the end:

# Set the Hadoop Related Environment variables
export HADOOP_PREFIX=/srv/hadoop

# Set the JAVA_HOME
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64

# Add Hadoop bin/ directory to PATH
export PATH=$PATH:$HADOOP_PREFIX/bin

# Some helpful aliases

unalias fs &> /dev/null
alias fs="hadoop fs"
unalias hls &> /dev/null
alias hls="fs -ls"
alias ..="cd .."
alias ...="cd ../.."

lzohead() {
    hadoop fs -cat $1 | lzop -dc | head -1000 | less
}

We'll continue configuring the Hadoop environment. Edit the following files in /srv/hadoop/conf/:

hadoop-env.sh

# The java implementation to use. Required.
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64

core-site.xml

<configuration>
    <property>
        <name>fs.default.name</name>
        <value>hdfs://localhost:9000</value>
    </property>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/app/hadoop/tmp</value>
    </property>
</configuration>

hdfs-site.xml

<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>

mapred-site.xml

<configuration>
    <property>
        <name>mapred.job.tracker</name>
        <value>localhost:9001</value>
    </property>
</configuration>

That's it, configuration over! But before we get going we have to format the distributed filesystem in order to use it. We'll store our file system in the /app/hadoop/tmp directory, as per Michael Noll and as we set in the core-site.xml configuration. We'll have to set up this directory and then format the name node.

/srv$ sudo mkdir -p /app/hadoop/tmp
/srv$ sudo chown -R hadoop:hadoop /app/hadoop
/srv$ sudo chmod -R 750 /app/hadoop
/srv$ hadoop namenode -format
[… snip …]

You should now be able to run Hadoop's start-all.sh command to start all the relevant daemons:

/srv$ hadoop-1.2.1/bin/start-all.sh
starting namenode, logging to /srv/hadoop-1.2.1/libexec/../logs/hadoop-hadoop-namenode-ubuntu.out
localhost: starting datanode, logging to /srv/hadoop-1.2.1/libexec/../logs/hadoop-hadoop-datanode-ubuntu.out
localhost: starting secondarynamenode, logging to /srv/hadoop-1.2.1/libexec/../logs/hadoop-hadoop-secondarynamenode-ubuntu.out
starting jobtracker, logging to /srv/hadoop-1.2.1/libexec/../logs/hadoop-hadoop-jobtracker-ubuntu.out
localhost: starting tasktracker, logging to /srv/hadoop-1.2.1/libexec/../logs/hadoop-hadoop-tasktracker-ubuntu.out

And you can use the jps command to see what's running:

/srv$ jps
1321 NameNode
1443 DataNode
1898 Jps
1660 JobTracker
1784 TaskTracker
1573 SecondaryNameNode

Furthermore, you can access the various Hadoop web interfaces in a browser; by default, the NameNode status page is at http://localhost:50070 and the JobTracker page is at http://localhost:50030.

To stop Hadoop simply run the stop-all.sh command.

4) Install Python Packages and the Code for the Class

To run the code for the class, you'll need to install some Python packages as dependencies, in particular the NLTK library. The simplest way to install these packages is with the requirements.txt file that comes with the code in our repository. We'll clone it into a directory called tutorial.

~$ git clone https://github.com/bbengfort/strata-teaching-the-elephant-to-read.git tutorial
~$ cd tutorial/code
~$ sudo pip install -U -r requirements.txt
[… snip …]

However, if you simply want to install the dependencies yourself, here are the contents of the requirements.txt file:

# requirements.txt
PyYAML==3.10
dumbo==0.21.36
language-selector==0.1
nltk==2.0.4
numpy==1.7.1
typedbytes==0.3.8
ufw==0.31.1-1

You'll also have to download the NLTK data packages which will install to /usr/share/nltk_data unless you set an environment variable called NLTK_DATA. The best way to install all this data is as follows:

~$ sudo python -m nltk.downloader -d /usr/share/nltk_data all
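To sanity-check the installation, a quick tokenize-and-tag in the Python interpreter should work once the download finishes (a suggested check, assuming the relevant models were included in the "all" collection above):

# Quick check that NLTK and its data packages are usable.
import nltk

tokens = nltk.word_tokenize("The elephant is learning to read.")
print(nltk.pos_tag(tokens))   # e.g. [('The', 'DT'), ('elephant', 'NN'), ...]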

At this point, all that's left is to load data into Hadoop.

References

  1. Hadoop/Java Versions
  2. Installing Java on Ubuntu
  3. An Update on Apache Hadoop 1.0
  4. Running Hadoop on Ubuntu Linux Single Node Cluster
  5. Apache: Single Node Setup

September 2013 Data Science DC Event Review: Data Mining for Patterns That Aren’t There

This is a guest post by Eunice Choi, a Health IT Consultant who is very interested in Data Science.

When a fortuitous event takes place, it is a very human inclination to be intrigued—and when such an event happens again in seemingly quick succession, we start to look for patterns. It is widely known that, in Data Mining, this ability to notice patterns is of great consequence—and in Big Data, this ability plays itself out in the data analysis process, in both intuitive and counterintuitive ways.

In addition to addressing what Jules Berman, in his book, Principles of Big Data: Preparing, Sharing, and Analyzing Complex Information, calls statistical method bias and ambiguity bias, the speakers at the Data Science DC Meetup illuminated an important issue for Data Miners that goes beyond ‘correlation is not causation’—the issue of whether the correlation itself is real or the result of repeated, massive exploration and modeling of the data.  In addition, by citing examples from the fields of Statistics, Predictive Analytics, Epidemiology, Biomedical Research, Marketing, and the Business and Government worlds, the speakers addressed a commonly seen issue stemming from overconfidence: viewing data validation as proof of causality. The talks provided a broad sense of context around the phenomenon of how repetitive computer intensive modeling can lead to overfitting and model underperformance. The speakers provided a deeper understanding of the process by which to determine which models and methods would offer the highest level of confidence, with a focus on best practices and remedies.

Attendee ratings, showing mostly positive experiences, and a desire for more technical content.

A compelling use case mentioned was: Do we see random events for their randomness, or do we see winning a $1 million lottery twice in a day as something beyond chance?

Jules Berman noted that, in order “To get the greatest value from Big Data resources, it is important to understand when a problem in one field has equivalence to a problem in another field.” In a set of talks that spoke to such equivalence, Peter Bruce, President of The Institute for Statistics Education at Statistics.com, and Gerhard Pilcher, VP and Senior Scientist at Elder Research, Inc. (ERI), who leads the Washington, DC office and all its federal civil work, presented on the topic of ‘Data Mining for Patterns That Aren’t There’.

During the first portion of the Meetup, networking took place outdoors over empanadas, and the atmosphere was collegial and friendly. Once people filtered in, the audience appreciated Jonathan Street’s data visualization of when new members RSVP’d to the Meetup event; you can see the momentum building in the 5 days directly preceding the Meetup.


Peter Bruce spoke on the topic by drawing the audience into probability examples and discussing the ‘lack of replication’ problem in scientific research. He then observed that humans are unwilling to think that chance is responsible for patterns in datasets and expanded upon this further with an example of the human capacity to be fooled by randomness in which commodity traders were shown charts and were asked to comment on them. Charts such as the one below were produced by random chance, yet the commodity traders viewed the charts as being representative of specific, observed phenomena, and continued to do so even after being told that the actual series were random:

"Commodity Price"

He then spoke on how numerators and denominators figure into the question: Did you see the interesting event and then conclude it was interesting? In that case the numerator could be huge, which would mean that “interestingness” decreases drastically.

To guide the audience to re-examine the significance of “statistically significant” correlations, Bruce cited epidemiology studies and other examples from health and science. For instance, in epidemiological studies on Bisphenol A (BPA), 1,000 people were involved and the models looked at 275 chemicals, 32 possible health outcomes, and 10 demographic variables. The high dimensionality and high volume of the data created computational challenges: there were 9,000,000 possible models when accounting for all possible covariate inclusion/exclusion options. This example demonstrated the idea that if you try enough models with enough covariates, you’ll get a correlation—but it also demonstrated that this is not the optimal approach to data mining. Bruce asserted that in data mining the proper use of a validation sample protects to some extent, but that information about the validation data may leak into the modeling process through repeated model tuning using the validation data, or via information gained during the exploration/preparation phase.
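A quick simulation (mine, not from the talk) makes the point concrete: generate enough pure-noise predictors and the best of them will show a seemingly impressive correlation with a pure-noise outcome.

# With enough random predictors, the best spurious correlation looks "significant".
import numpy as np

rng = np.random.RandomState(0)
n_obs, n_predictors = 100, 1000
y = rng.randn(n_obs)                  # a pure-noise outcome
X = rng.randn(n_obs, n_predictors)    # pure-noise "covariates"

corrs = [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(n_predictors)]
print("Best absolute correlation among noise predictors: %.2f" % max(corrs))

With these settings the winning predictor typically shows |r| above 0.3, which would look striking if you did not know how many candidates were searched.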

Gerhard Pilcher discussed the ‘Vast Search Effect’ (i.e., what statisticians call the ‘multiple testing problem’ or ‘data snooping’), which he defined as ‘trying to find something interesting, whether that finding is real or the effect of random chance.’ He focused his talk on points of inquiry around the main question: Are orange cars really least likely to be bad buys?

From an initial bar graph on the proportion of bad buys by car color, it would appear that orange cars were least likely to be bad buys:

What Color Car Would You Buy?

However, Pilcher pointed out that the hypothesis was developed after seeing the data and the data was not partitioned (the hypothesis was tested on the same data), which led to an instantiation of the ‘Vast Search Effect.’ To set the stage for why the Vast Search Effect is important, one compelling fact that Pilcher mentioned was that Bayer Laboratories confirmed that they could not replicate 67% of positive findings claimed in medical journals.

Pilcher also showed a great example of a financial model built on two variables that gave great results in terms of the numbers. However, when the model was plotted, the response surface showing the return (in red) revealed that the stability of the model was very low, so the model could not continue to be used.

A Financial Example

To avoid the Vast Search Effect, Pilcher offered the following solutions:

  • Partitioning – breaking out the dataset into training, validation, and/or test data sets (making sure to avoid using the testing set to revise the training set)

  • Statistical Inference – deduce and test a new hypothesis

  • Simulation – sampling without replacement (e.g., target-shuffling and checking for proportions)
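As a minimal sketch of the target-shuffling idea just mentioned (my own illustration, on made-up data), shuffle the target, re-measure the apparent effect, and see how often chance alone does as well:

# Target shuffling: permuting the target breaks any real relationship,
# so the shuffled correlations show what "vast search" luck looks like.
import numpy as np

rng = np.random.RandomState(1)
x = rng.randn(200)
y = 0.1 * x + rng.randn(200)          # weak real signal, for illustration

observed = abs(np.corrcoef(x, y)[0, 1])
shuffled = [abs(np.corrcoef(x, rng.permutation(y))[0, 1]) for _ in range(1000)]
p_value = np.mean([s >= observed for s in shuffled])
print("Observed |r| = %.3f, shuffle-based p-value = %.3f" % (observed, p_value))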

Key takeaways from Pilcher’s talk included the following: Hypothesis tests work when:

  1. The hypothesis comes first, the analysis second;
  2. The data is partitioned into training and testing datasets; and
  3. The logic incorporates practical significance in addition to statistical significance.

Pilcher emphasized the importance of the human element in determining what makes sense in a computer’s output in data mining. In addition, he compared modern machine learning algorithms, which learn by induction, to linear regression, and made the point that when learning by induction one is inferring what the data is trying to tell us, thereby creating nonlinear surfaces; in that situation, it is likely one will overfit the model. Therefore, he used random shuffling to test different algorithms. He emphasized that one ought to test the algorithm against the data and ask oneself: how much is the algorithm trying to overfit the random data?

By having us consider these questions, the speakers balanced their cautionary word on overfitting models with their assertion that data validation also depends on meaningful results—and on the best ways to arrive at the hypotheses and processes that lead to such results. If the modeling process could be likened to an expansive cube, the effect of pondering these considerations was, for me, like walking around that cube and examining all of its contours, in addition to peering inside it to understand its properties.

For more on the presentations, see the following resources: