Round-Ups

Weekly Round-Up: Data Analysis Tools, M2M, Machine Learning, and Naming Babies

Welcome back to the round-up, an overview of the most interesting data science, statistics, and analytics articles of the past week. This week, we have 4 fascinating articles ranging in topics from data analysis tools to naming babies. In this week's round-up:

  • Data Analysis Tools Target Non-experts
  • How M2M Data Will Dominate the Big Data Era
  • What Hackers Should Know About Machine Learning
  • Knowledge Engineering Applied to Baby Names

Data Analysis Tools Target Non-experts

Our first piece this week is an O'Reilly Strata article about some of the data analysis tools that are coming to market and are aimed at providing business users with the analytics they need to make decisions. The article highlights several tools from a variety of companies and categorizes them into three different categories according to what they help you do. The article also includes links to all the companies' websites so that, if you're anything like me, you can check out every single one of them.

How M2M Data Will Dominate the Big Data Era

The Internet of Things is getting a lot of attention these days, partly due to the amount of data that gets produced when one connected device communicates with another connected device. This is known as Machine-to-Machine data (M2M), and this Smart Data Collective article describes where a lot of this data may come from and how much data can potentially be generated.

What Hackers Should Know About Machine Learning

Our third piece is a Fast Company interview with Drew Conway, the author of the must-own book Machine Learning for Hackers. In the interview, Drew answers questions about why developers should learn machine learning, the biggest knowledge gaps they need to overcome, and the differences between a machine learning project and a development project. (Editor's Note: the image to the left links to Amazon, where, if you buy the book, we get a small cut of the proceeds. Buy enough books through this link, and we retire to an island.)

Knowledge Engineering Applied to Baby Names

Our final piece this week is a blog post about a company called Nameling that is in the midst of holding a contest to improve the algorithms behind their baby name recommendation engine. Coming up with a good name for their baby is very important to parents, as choosing a bad one almost certainly results in ridicule and tears. It should be interesting to see the results of the contest as well as what kinds of names the recommendation engine spits out.

That's it for this week. Make sure to come back next week when we’ll have some more interesting articles! If there's something we missed, feel free to let us know in the comments below.

Read Our Other Round-Ups

Weekly Round-Up: Hadoop, Big Data vs. Analytics, Process Management, and Palantir

Welcome back to the round-up, an overview of the most interesting data science, statistics, and analytics articles of the past week. This week, we have 4 fascinating articles ranging in topics from Hadoop to business process management. In this week's round-up:

  • To Hadoop or Not to Hadoop?
  • What’s the Difference Between Big Data and Business Analytics?
  • What Big Data Means to BPM
  • How A Deviant Philosopher Built Palantir

To Hadoop or Not to Hadoop?

Our first piece this week is an interesting blog post about what sorts of data operations Hadoop is and isn't good for. The post can serve as a useful guide when trying to figure out whether or not you should use Hadoop to do what you're thinking of doing with your data. It is organized into 5 categories of things you should consider and contains a series of questions you can ask yourself for each of the categories to help with your decision-making.

What’s the Difference Between Big Data and Business Analytics?

This is an excellent post on Cathy O'Neil's Mathbabe blog about how she distinguishes big data from business analytics. Cathy argues that what most people consider big data is really business analytics (on arguably large data sets) and that big data, in her opinion, consists of automated intelligent systems that algorithmically know what to do and need very little human interference. She goes into more detail about the differences between the two, including some examples to drive home her point.

What Big Data Means to BPM

Continuing on the subject of intelligent systems performing business processes, our third piece this week is a Data Informed article about big data's effect on business process management. The article is an interview with Nathaniel Palmer, a veteran BPM practitioner and author. In the interview, Palmer answers questions about what kinds of trends are emerging in business process management, how big data is affecting its practices, and what changes are being brought about because of it.

How A Deviant Philosopher Built Palantir

Our last piece this week is a Forbes article about Palantir, an analytics software company that works with federal intelligence agencies and is funded by In-Q-Tel - the CIA's investment fund. The article describes the company's CEO, what the company does, and who it does it for, and delves into some of Palantir's history. Overall, the article provides an interesting look at a very interesting company.

That's it for this week. Make sure to come back next week when we’ll have some more interesting articles! If there's something we missed, feel free to let us know in the comments below.

Weekly Round-Up: Machine Learning, DIY Data Scientists, Games, and Helping Couples Conceive

Welcome back to the round-up, an overview of the most interesting data science, statistics, and analytics articles of the past week. This week, we have 4 fascinating articles ranging in topics from machine learning to helping couples conceive. In this week's round-up:

  • Jeff Hawkins: Where Open Source and Machine Learning Meet Big Data
  • The Rise Of The DIY Data Scientist
  • Why Games Matter to Artificial Intelligence
  • Three Questions for Max Levchin About His New Startup

Jeff Hawkins: Where Open Source and Machine Learning Meet Big Data

Our first piece this week is an InfoWorld article about Jeff Hawkins, the machine learning work that he and his company have been doing, and the open source project they've recently released on GitHub. The project's name is the Numenta Platform for Intelligent Computing (NuPIC) and its goal is to allow others to embed machine intelligence into their own systems. The article has a short interview with Jeff and a link to the GitHub page where the project resides.

The Rise Of The DIY Data Scientist

This is an interesting Fast Company article about how Kaggle competition winners tend to be self-taught. The author of the article interviews Kaggle's chief scientist Jeremy Howard about this phenomenon and other interesting findings about data scientists derived from Kaggle's competitions. Some of the questions inquire about where the winners are from, how they learned data science, and what machine learning algorithms they use.

Why Games Matter to Artificial Intelligence

This blog post on the IBM Research blog is an interview with Dr. Gerald Tesauro about the significance of games in the Artificial Intelligence field. Dr. Tesauro was the IBM research scientist who taught Watson how to play Jeopardy. In the interview, he explains how games tend to be an ideal training ground for machines because they tend to simplify real life. He goes on to answer questions about how that prepares the machines for transitioning to other real-world problems, what he's currently working on, what Watson is doing these days, and where else machine learning can be used.

Three Questions for Max Levchin About His New Startup

Our final piece this week is an MIT Technology Review article about PayPal co-founder Max Levchin's new startup called Glow. A lot of people are having children later in life these days and one downside of this is that many couples have trouble trying to conceive. Levchin has developed an iPhone app that uses data to help couples identify the optimal time for conception. In this brief interview, Levchin talks about what they are doing, why, and the degree of accuracy they hope to achieve.

That's it for this week. Make sure to come back next week when we’ll have some more interesting articles! If there's something we missed, feel free to let us know in the comments below.

Weekly Round-Up: Statisticians, Build Smart DC, Kirk Borne, and Treating Parkinson's

Welcome back to the round-up, an overview of the most interesting data science, statistics, and analytics articles of the past week. This week, we have 4 fascinating articles ranging in topics from collecting building data to treating Parkinson's. In this week's round-up:

  • Statisticians: An Endangered Species?
  • Washington DC Launches Real-time Building Energy Data Project
  • Time Spent with Kirk Borne
  • Michael J. Fox Foundation Points Big Data At Parkinson's

Statisticians: An Endangered Species?

Our first piece this week is an interesting blog post on the Revolution Analytics blog about how statisticians are perceived and how that relates to data science. The post was inspired by an American Statistical Association Magazine article that portrayed statisticians as being left in the dust of the big data movement. The author goes on to talk about how he was surprised at how little mention there was of R in the article and how contributing to the statistical programming language may be a good way for statisticians to continue to play an important role in data science.

Washington DC Launches Real-time Building Energy Data Project

Our next piece is a GigaOM article about a project that launched last week called Build Smart DC. The project monitors energy data from city-owned buildings at 15-minute intervals to give management a much more granular view of energy use in the properties than ever before. This will allow them to monitor trends and make data-driven decisions that will lead to more efficient energy consumption. The article also goes on to talk about the startup that is driving this program and some other cities that have similar projects in place.

Time Spent with Kirk Borne

Our third piece is an interesting short interview with Kirk Borne. Kirk is a Professor of Astrophysics and Computational Science at George Mason University and has been one of the most influential Big Data advocates on Twitter in recent years. He talks to the interviewer about astrophysics, big data, and data science education.

Michael J. Fox Foundation Points Big Data At Parkinson's

Our final article this week is an InformationWeek piece about how the Michael J. Fox Foundation put on a Kaggle competition to see if data scientists could help identify patients who had Parkinson's and track increases and decreases in symptoms among patients who had the disease. The article highlights the winning team in the competition, some of the methods they used to generate their predictive models, and how they were able to acquire the domain knowledge that ultimately helped them win the competition.

That's it for this week. Make sure to come back next week when we’ll have some more interesting articles! If there's something we missed, feel free to let us know in the comments below.

Weekly Round-Up: Data Science Roles, Technology Stacks, Predictive Analytics, and Michael Jordan

Welcome back to the round-up, an overview of the most interesting data science, statistics, and analytics articles of the past week. This week, we have 4 fascinating articles ranging in topics from data science technology stacks to Michael Jordan. In this week's round-up:

  • Five Roles You Need on Your Big Data Team
  • Choosing a Data Science Technology Stack
  • 12 Predictive Analytics Screw-ups
  • What Michael Jordan Can Teach Us About Big Data, Strategy And Innovation

Five Roles You Need on Your Big Data Team

Our first piece this week is an HBR article about the different roles you need when building a data science team. Data science is a very broad field and because of this, it's difficult to find someone who has all the skills that fall under its umbrella. This article attempts to break down the skill sets into more specific roles that can work together to really create value for an organization. The article lists the different roles, describes them, and also talks about the kind of culture you need to develop in order to get everyone in the organization on board and on the same page.

Choosing a Data Science Technology Stack

This is an interesting blog post about different data science technology stacks and how we as data scientists go about choosing one that works best for us. The author points out that there are several layers to a data science stack - sourcing the data, storing it, exploring it, modeling it, etc. - and there are several technological options available for each layer. The post examines these different options and even has a survey where you can enter the technologies you use for each layer. When the survey is complete, those who participated will be emailed the results.

12 Predictive Analytics Screw-ups

This is a ComputerWorld article about some of the pitfalls you would do well to avoid when performing predictive analytics. To come up with this list, the author interviewed experts at 3 data science consulting firms - Elder Research, Abbott Analytics, and Prediction Impact - about the different mistakes they encounter. Take a look through them and see how many you've encountered yourself!

What Michael Jordan Can Teach Us About Big Data, Strategy And Innovation

Our final piece this week is a Forbes article that uses Michael Jordan and other sports examples to drive home points about big data and how we use it in business. The author starts out by drawing a parallel between the types of decisions managers need to make these days about new technologies, opportunities, and employees and the task of evaluating Michael in his early days, when his athletic potential wasn't as obvious. He continues through the rest of the article writing about the processes we go through, the data we look at in our attempts to evaluate a situation and make appropriate decisions, and how big data and advances in technology improve our abilities to do all these things over time.

That's it for this week. Make sure to come back next week when we’ll have some more interesting articles! If there's something we missed, feel free to let us know in the comments below.

Weekly Round-Up: Big Data Projects, OpenGeo, Coca-Cola, and Crime-Fighting

Welcome back to the round-up, an overview of the most interesting data science, statistics, and analytics articles of the past week. This week, we have 4 fascinating articles ranging in topics from big data projects to Coca-Cola. In this week's round-up:

  • 5 Big Data Projects That Could Impact Your Life
  • CIA Invests in Geodata Expert OpenGeo
  • How Coca-Cola Takes a Refreshing Approach to Big Data
  • Fighting Crime with Big Data

5 Big Data Projects That Could Impact Your Life

Our first piece this week is a Mashable article listing 5 interesting data projects. The projects range from one that projects transit times in NYC to one that tracks homicides in DC to one that illustrates the prevalence of HIV in the United States. All are great examples of people doing interesting things with data that is becoming increasingly available.

CIA Invests in Geodata Expert OpenGeo

A while back, the CIA spun off a strategic investment arm called In-Q-Tel to make investments in data and technologies that could benefit the intelligence community. This week, it was announced that they have invested in geo-data startup OpenGeo. This GigaOM article provides a little detail about the company and what they do and also lists some of the other companies In-Q-Tel has invested in thus far.

How Coca-Cola Takes a Refreshing Approach to Big Data

This is an interesting Smart Data Collective article about Coca-Cola and how they use data to drive their decisions and maintain a competitive advantage. The article describes multiple ways the company uses big data and analytics, from interacting with their Facebook followers to the formulas for their soft drinks.

Fighting Crime with Big Data

Our final piece this week is an article about how analytics platform provider Palantir helps investigators find patterns to uncover white collar crime, which is usually hidden using data. The article contains multiple quotes from Palantir's legal counsel Ryan Taylor about how they work with crime-fighting agencies and what methods they employ to bring these criminals to justice.

That's it for this week. Make sure to come back next week when we’ll have some more interesting articles! If there's something we missed, feel free to let us know in the comments below.

Weekly Round-Up: Data Science Metro Map, Big Data Workers, Prescriptive Analytics, and Knewton

Welcome back to the round-up, an overview of the most interesting data science, statistics, and analytics articles of the past week. This week, we have 4 fascinating articles ranging in topics from big data workers to educational recommendation algorithms. In this week's round-up:

  • Becoming a Data Scientist – Curriculum via Metromap
  • The Growing Need for Big Data Workers: Meeting the Challenge With Training
  • How Prescriptive Analytics Could Harness Big Data to See the Future
  • Q&A With Knewton’s David Kuntz, Maker of Algorithms

Becoming a Data Scientist – Curriculum via Metromap

For those of you who want to get started learning data science but don't know where to begin, this blog post literally maps it out for you. The author has taken the broad subject of data science and created a train map similar to those found in all major cities with public transportation. The different tracks of data science are depicted as different color train lines in the map and the subjects within those tracks are depicted as stops along those lines. Very interesting and definitely worth a look!

The Growing Need for Big Data Workers: Meeting the Challenge With Training

This is a Wired article about how the need for big data workers is growing as there is more and more data that needs to be collected, organized, analyzed, and acted upon. The article talks about the challenges of educating people and highlights the efforts of a few companies such as IBM, Big Data University, and DeveloperWorks.

Speaking of data science education, Data Community DC is hosting a Natural Language Processing Basics workshop on July 27th and there are still a few seats left. You can view details and sign up here.

How Prescriptive Analytics Could Harness Big Data to See the Future

Our third piece this week is about prescriptive analytics and how organizations can use it to help them make data-driven improvements in their operations. The article defines prescriptive analytics, contrasts it with the more commonly used descriptive and predictive analytics, and provides some examples as to how it can be useful.

Q&A With Knewton’s David Kuntz, Maker of Algorithms

Our final piece this week is an article about a company called Knewton and the interesting work they do. Knewton designs recommendation systems for educational products, which help customize the learning experience and tailor it to the individual student. In this article, the author interviews David Kuntz, Knewton's Vice President of Research, about how their technology works, what kinds of things it can do, and what this means for education in the future.

That's it for this week. Make sure to come back next week when we’ll have some more interesting articles! If there's something we missed, feel free to let us know in the comments below.

Weekly Round-Up: Data Scientist Types, Data Protection, Travel, and Jay-Z

Welcome back to the round-up, an overview of the most interesting data science, statistics, and analytics articles of the past week. This week, we have 4 fascinating articles ranging in topics from data scientist types to data-collecting music apps. In this week's round-up:

  • What Kind Of Data Scientist Are You?
  • Evernote’s Three Laws of Data Protection
  • Big Data Analysis Drives Revolution In Travel
  • Samsung and Jay-Z Accused of Using New Album to Mine Customer Data

What Kind Of Data Scientist Are You?

Our first article this week is a Fast Company piece about the new ebook our very own Harlan Harris, Marck Vaisman, and Sean Murphy authored. The ebook is about how there are actually multiple types of data scientists and the different combinations of skills and experience each type tends to have. The article provides some overview, some excerpts and graphics, and a link to the ebook as well.

Evernote’s Three Laws of Data Protection

This is a Smart Data Collective article about Evernote's stance on data protection and how it differs from other companies. Evernote is one of the most popular note-taking apps on the market, essentially letting you keep a copy of your brain out in the cloud where you can access it from anywhere and remember things your real brain may have forgotten. That being the case, the privacy of their users' data is of great importance to them.

Big Data Analysis Drives Revolution In Travel

Our third piece this week is an InformationWeek article about how data is revolutionizing the travel industry. We've all had to endure the frustrations that often come along with getting from point A to point B. This article highlights several companies and explains how they are using data to operate more efficiently and improve customer experiences.

Samsung and Jay-Z Accused of Using New Album to Mine Customer Data

Our final piece this week is a Time article about how Samsung and rapper Jay-Z offered early access to Jay's new album Magna Carta Holy Grail through an app on select Samsung mobile devices. The intent seemed to be for them to be able to collect some data about the types of customers that would want access to the album before the official release date. This article describes some of the data the app requested and talks about how this has raised some eyebrows about why they would need to collect the type of data they are collecting.

That's it for this week. Make sure to come back next week when we’ll have some more interesting articles! If there's something we missed, feel free to let us know in the comments below.

The State of Recommender Technology

Reblogged with permission from Cobrain.

[Image: social network graph]

So let’s start with the big idea that is the reason that we are all here: recommendation engines. If you are reading this, you have probably already overcome the mental hurdle of the massive design and implementation challenge that recommendation engines represent; otherwise I can’t imagine why you would have signed up! Or perhaps you don’t know what a massive design and implementation challenge recommendation engines represent. Either way, you’re in the right place: this post is an introduction to the state of the technology of recommendation systems.

Well, sort of. Here is a working state of the technology: Academia has created a series of novel machine learning and predictive algorithms that would allow scarily accurate trend analysis, recommendations, and predictions given the right, unbiased supervised training sets of sufficient magnitude. Commercial applications in very specific domains have leveraged these insights and extremely large data sets to create interesting results in the release phase of applications but have found that over time the quality of these predictions decreases rapidly. Companies with even larger data sets that have tackled other algorithmic challenges involving supervised training sets (Google) have avoided current recommender systems because of their domain specificity, and have yet to find a generic enough application.

To sum up:

Recommendation Engines are really really hard, and you need a whole heckuva lot of data to make them work.

Now go build one.

Don’t despair though! If it wasn’t hard, everyone would be doing it! We’re here precisely because we want to leverage existing techniques on interesting and novel data sets, but also to continue to push forward the state of the technology. In the process we will probably learn a lot and hopefully also provide a meaningful experience for our users. But before we get into that, let’s talk more generically about the current generation of recommender systems.

Who Does it Well?

The current big boys in the recommendation space are Amazon, Netflix, Hunch (now owned by eBay), Pandora, and Goodreads. I strongly encourage you to understand how these guys operate and what they do to create domain-specific recommendations. For example, the domains of Goodreads, Netflix, and Pandora are books, movies, and music, respectively. Recommending inside a particular domain allows you to leverage external knowledge resources that either solve scarcity issues or allow ontological reasoning that can add a more accurate layer on top of the pure graph analyses that typically happen with recommenders.

Amazon and Hunch seem to be more generic, but in fact they also have domain qualification. Amazon has the data set of all SKU-level transactions through its massive eCommerce site. Even so, Amazon has spent 10 years and a lot of money perfecting how to rank various member behaviors. Because it is Amazon-specific, Amazon can leverage Amazon-only trends and purchasing behaviors, and they are still working on perfecting it. Hunch doesn’t have an item-specific domain, but rather a system-specific domain, using social and taste-making graphs to propose recommendations inside the context of social networks.

Speaking of Amazon’s decade-long effort to create a decent recommender with tons of data, I hope you’ve heard of the Netflix Prize. Netflix was so desperate for a better recommendation algorithm that they instituted an X-Prize-like contest for a unique algorithm for recommending movies in particular. In fact, the test methodology for the Netflix Prize has become a standard for movie recommendations, and since 2009 (when the prize was awarded) other algorithm sets have actually achieved better results, most notably Filmaster.
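For context on that test methodology: Netflix Prize entries were scored by root mean squared error (RMSE) between predicted ratings and held-out actual ratings, with lower being better. Here is a minimal sketch of the metric in Python; the function name and the sample ratings are made up for illustration and are not taken from the contest data.

```python
import math

def rmse(predicted, actual):
    """Root mean squared error between two equal-length lists of ratings."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty lists of the same length")
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical star ratings (1 to 5) for four user/movie pairs.
predicted_ratings = [3.5, 4.0, 2.0, 5.0]
actual_ratings = [4, 4, 1, 5]

print(rmse(predicted_ratings, actual_ratings))
```

A recommender that predicts every held-out rating exactly would score 0; the further the predictions drift from what people actually rated, the higher the number climbs.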

Given what these companies have tried to do, we can speak more generically of the state of the technology as follows: an “adequate” recommender system comprises the following items:

  1. An unbiased, non-scarce data set of sufficient size
  2. A suite of machine learning and predictive algorithms that traverse that data set
  3. Knowledge resources to apply transformations on the results of those algorithms

Pandora is a great example of this. They have undertaken an intensive project to detail a “music genome”, or an ontological breakdown of a sample of music. The genome itself is the knowledge resource. The analysis of the genomics of a piece of music, aggregated across a large number of pieces, is the unbiased, non-scarce data set of sufficient size. Finally, the suite of recommendation algorithms that Pandora applies to these two sets generates ranked recommendations that are interesting.
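To make the three pieces concrete, here is a toy sketch in Python. It is emphatically not Pandora's actual method: the trait names, the scores, and the choice of cosine similarity are all my own assumptions, picked only to show how a trait vocabulary (knowledge resource), per-song trait scores (data set), and a ranking function (algorithm) could fit together.

```python
import math

# Hypothetical "genome": each song scored from 0 to 1 on a few made-up traits.
genome = {
    "song_a": {"tempo": 0.8, "acoustic": 0.1, "vocals": 0.9},
    "song_b": {"tempo": 0.7, "acoustic": 0.2, "vocals": 0.8},
    "song_c": {"tempo": 0.1, "acoustic": 0.9, "vocals": 0.2},
}

def cosine(u, v):
    """Cosine similarity between two sparse trait vectors stored as dicts."""
    keys = set(u) | set(v)
    dot = sum(u.get(k, 0.0) * v.get(k, 0.0) for k in keys)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def similar_songs(seed, genome):
    """Rank every other song by trait similarity to the seed song."""
    scores = [
        (other, cosine(genome[seed], traits))
        for other, traits in genome.items()
        if other != seed
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

for song, score in similar_songs("song_a", genome):
    print(song, round(score, 3))
```

With these made-up numbers, song_b ranks well above song_c as a follow-up to song_a, because its trait profile points in nearly the same direction.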

Types of Recommenders

Without getting into a formal description of recommenders, I do want to list a few of the common types of recommendation systems that exist within domain specific contexts. To do this, I need to describe the two basic classes of algorithms that power these systems:

  1. Collaborative Filtering: recommendations based on shared behavior with other people or things. E.g. if you and I bought a widget, and I also bought a sprocket, it is likely that you would also like a sprocket.
  2. Expert Adaptive or Generative Systems: recommendations based on shared traits of people or things or rules about how things interact with each other in a non-behavioral way. E.g. if you play football and live in Michigan, this particular pair of cleats is great in the snow.
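The two examples above can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the purchase histories, the overlap-count scoring, and the single hand-written rule are all hypothetical, and real systems use far richer similarity measures and rule bases.

```python
from collections import Counter

# Hypothetical purchase histories: the behavioral data collaborative filtering needs.
purchases = {
    "me": {"widget", "sprocket"},
    "you": {"widget"},
    "someone_else": {"gear"},
}

def collaborative_recommend(user, purchases):
    """Suggest items bought by people who share at least one purchase with `user`."""
    owned = purchases[user]
    scores = Counter()
    for other, items in purchases.items():
        if other == user:
            continue
        overlap = len(owned & items)
        if overlap == 0:
            continue  # no shared behavior, so this person tells us nothing
        for item in items - owned:
            scores[item] += overlap
    return [item for item, _ in scores.most_common()]

# Hypothetical hand-written rules: the trait-based knowledge an expert system needs.
rules = [
    (lambda p: p.get("sport") == "football" and p.get("state") == "Michigan",
     "snow cleats"),
]

def expert_recommend(profile, rules):
    """Return every item whose rule matches the person's traits."""
    return [item for condition, item in rules if condition(profile)]

print(collaborative_recommend("you", purchases))
print(expert_recommend({"sport": "football", "state": "Michigan"}, rules))
```

Notice that the first function never looks at who you are, only at what you did, while the second never looks at what anyone did, only at who you are. That is the whole distinction between the two classes.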

In the world of recommenders, we are trying to create a semantic relationship between people and things. We can therefore discuss person-centric and item-centric approaches in each of these classes of algorithms, and that gives us four main types of recommenders!

  1. Personalized Recommendations: A person-centric, expert adaptive model based on the person’s previous behavior or traits.
  2. Social/Collaborative Recommendations: A person-centric collaborative filtering model based on the past behavior of people similar to you, either because of shared traits or shared behavior. Note that the clustering of similar people can fall into either algorithm set, but the recommendations come from collaborative filtering.
  3. Ontological Reasoned Recommendations: An item-centric expert adaptive system that uses rules and knowledge mined with machine learning approaches to determine an inter-item relational model.
  4. Basket Recommendations: An item-centric collaborative filtering algorithm that uses inter-item relationships like “purchased together” to create recommendations.
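Basket recommendations are probably the easiest of the four to sketch. Here is a minimal Python version that simply counts how often two items show up in the same order; the baskets and function names are hypothetical, and raw co-occurrence counting is just one simple way to build a “purchased together” relationship.

```python
from collections import Counter
from itertools import combinations

# Hypothetical order baskets; each set is one checkout.
baskets = [
    {"widget", "sprocket"},
    {"widget", "sprocket", "gear"},
    {"widget", "gear"},
    {"sprocket", "bolt"},
]

def co_purchase_counts(baskets):
    """Count how often each ordered pair of items appears in the same basket."""
    counts = Counter()
    for basket in baskets:
        for a, b in combinations(sorted(basket), 2):
            counts[(a, b)] += 1
            counts[(b, a)] += 1  # store both directions for easy lookup
    return counts

def purchased_together(item, counts, top_n=3):
    """Return up to `top_n` items most often bought alongside `item`."""
    related = Counter({b: n for (a, b), n in counts.items() if a == item})
    return related.most_common(top_n)

counts = co_purchase_counts(baskets)
print(purchased_together("widget", counts))
```

No information about the shopper is used at all here, which is what makes this approach item-centric.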

Keep in mind, however, that these types of recommenders and classes are very loose and there is a lot of overlap!

Conclusion

Now that large-scale search has been dramatically improved and artificial intelligence knowledge bases are being constructed with a reasonable degree of accuracy, it is generally considered that the next step in true AI will be effective trend and prediction analysis. Methodologies to deal with Big Data have evolved to make this possible, and many large companies are rushing towards predictive systems with widely varying degrees of success. Recent approaches have revealed that near-time, large data, domain-specific efforts yield interesting, if not truly predictive, results.

The overwhelming challenge is not just in engineering architectures that traverse graphs extremely well (see the picture at the top of this post), but also in finding a unique combination of data, algorithms, and knowledge that will give our applications a chance to provide truly scary, inspiring results to our users. Even though this might be a challenge, there are four very promising approaches that we can leverage within our own categories.

Stay tuned for more on this topic soon!

Weekly Round-Up: Data Scientists, Startups, Big Data Leaders, and Einstein

Welcome back to the round-up, an overview of the most interesting data science, statistics, and analytics articles of the past week. This week, we have 4 fascinating articles ranging in topics from data scientists job descriptions to contrasting big data and genius. In this week's round-up:

  • It's a Bird! It's a Plane! No, It's Just a Data Scientist.
  • Meet the Startups Making Machine Learning an Elementary Affair
  • What the Companies Winning at Big Data Do Differently
  • What Would Big Data Think of Einstein?

It's a Bird! It's a Plane! No, It's Just a Data Scientist.

This week, we start off with a Smart Data Collective article about how typical data scientist job descriptions tend to be composed of an unrealistic wishlist of things the hiring organization thinks a data scientist is. The article mentions how the term data scientist is very unclear in nature and how it is made up of at least two roles - data management and data analytics - both of which take up a substantial amount of a person's time.

Meet the Startups Making Machine Learning an Elementary Affair

Next up, we have a GigaOM article about startups that are trying to make machine learning tools that business users can use. The article lists 5 startups and talks a little about what each one does and what they're trying to produce.

What the Companies Winning at Big Data Do Differently

This Bloomberg article examines a survey done by Tata Consultancy Services on large companies with substantial investments in big data technologies and explains the differences between companies that are getting a high return on these investments and companies that are not.

What Would Big Data Think of Einstein?

Our final article this week is a BBC piece that asks what happens to genius and big ideas in a world where big data gets so much attention. The author says that coming up with answers becomes relatively easy once you have the data and you know what you want to measure. The problem with this is that it focuses on looking backward rather than on the creativity and imagination it takes to look toward the future.

That's it for this week. Make sure to come back next week when we’ll have some more interesting articles! If there's something we missed, feel free to let us know in the comments below.
