Simulation and Predictive Analytics

This is a guest post by Lawrence Leemis, a professor in the Department of Mathematics at The College of William & Mary.

A front-page article over the weekend in the Wall Street Journal indicated that the number one profession of interest to tech firms is the data scientist, someone whose analytic, computing, and domain skills allow them to detect signals in data and use them to advantage. Although the terms are squishy, the push today is for "big data" skills and "predictive analytics" skills, which allow firms to leverage the deluge of data that is now accessible.

I attended the Joint Statistical Meetings last week in Boston and I was impressed by the number of talks that referred to big data sets and also the number that used the R language. Over half of the technical talks that I attended included a simulation study of one type or another.

The two traditional legs of the scientific method, theory and experimentation, have been joined by computation as a third. Sitting at the center of computation is simulation, which is the topic of this post. Simulation is a useful tool when analytic methods fail because of mathematical intractability.

The questions that I will address here are how Monte Carlo simulation and discrete-event simulation differ and how they fit into the general framework of predictive analytics.

First, how do Monte Carlo and discrete-event simulation differ? Monte Carlo simulation is appropriate when the passage of time does not play a significant role. Probability calculations involving playing cards, dice, and coins, for example, can be carried out by Monte Carlo.
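As a small illustration of the dice case, here is a minimal Python sketch that estimates the probability that two fair dice sum to seven. The function name, the seed, and the number of trials are my own choices for illustration; the exact answer, 1/6, is known here, so the simulation is only a check on the idea.

```python
import random

def estimate_probability_sum_is_seven(num_trials, seed=0):
    """Monte Carlo estimate of P(two fair dice sum to 7)."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    hits = 0
    for _ in range(num_trials):
        # Each trial is independent; no notion of time is involved.
        if rng.randint(1, 6) + rng.randint(1, 6) == 7:
            hits += 1
    return hits / num_trials

estimate = estimate_probability_sum_is_seven(200_000)
print(f"estimated P(sum = 7) = {estimate:.4f}; exact value = {1 / 6:.4f}")
```

The estimate is just the fraction of trials in which the event occurred, and its precision improves roughly with the square root of the number of trials.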

Discrete-event simulation, on the other hand, has the passage of time as an integral part of the model. The classic application areas of discrete-event simulation are queuing, inventory, and reliability. As an illustration, a mathematical model for a queue with a single server might consist of (a) a probability distribution for the time between arrivals to the queue, (b) a probability distribution for the service time at the queue, and (c) an algorithm for placing entities in the queue (first-come, first-served is the usual default). A discrete-event simulation can be coded in any algorithmic language, although the coding is tedious. Because of the complexities of coding a discrete-event simulation, dozens of languages have been developed to ease implementation of a model.
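To make the single-server example concrete, here is a minimal next-event sketch in Python. It assumes exponential interarrival and service times and first-come, first-served order; those distributions, the function name, and the parameter values are my assumptions for illustration, not part of any particular simulation language.

```python
import random
from collections import deque

def simulate_single_server_queue(arrival_rate, service_rate, num_customers, seed=0):
    """Next-event simulation of a single-server FIFO queue.

    Returns the average delay in queue (time between arrival and start
    of service) over the first num_customers customers to begin service.
    """
    rng = random.Random(seed)
    clock = 0.0
    next_arrival = rng.expovariate(arrival_rate)
    next_departure = float("inf")  # no customer in service yet
    waiting = deque()              # arrival times of customers in the queue
    server_busy = False
    started = 0                    # customers that have begun service
    total_delay = 0.0

    while started < num_customers:
        if next_arrival <= next_departure:
            # Arrival event: advance the clock to the arrival time.
            clock = next_arrival
            if server_busy:
                waiting.append(clock)
            else:
                server_busy = True
                started += 1       # begins service at once, zero delay
                next_departure = clock + rng.expovariate(service_rate)
            next_arrival = clock + rng.expovariate(arrival_rate)
        else:
            # Departure event: advance the clock to the departure time.
            clock = next_departure
            if waiting:
                arrived_at = waiting.popleft()  # first come, first served
                total_delay += clock - arrived_at
                started += 1
                next_departure = clock + rng.expovariate(service_rate)
            else:
                server_busy = False
                next_departure = float("inf")

    return total_delay / started

mean_delay = simulate_single_server_queue(0.5, 1.0, 100_000)
print(f"estimated average delay in queue: {mean_delay:.3f}")
```

The simulation clock jumps from one event to the next rather than ticking in fixed steps, which is what distinguishes this from the Monte Carlo case. With these particular rates the model is an M/M/1 queue, whose steady-state average delay is 1.0 in theory, so the estimate should land near that value.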

The field of predictive analytics leans heavily on tools from data mining in order to identify patterns and trends in a data set. Once an appropriate question has been posed, these patterns and trends in explanatory variables (often called covariates) are used to predict future behavior of variables of interest. There is both an art and a science in predictive analytics. The science side includes the standard tools associated with mathematics, computation, probability, and statistics. The art side consists mainly of making appropriate assumptions about the mathematical model constructed for predicting future outcomes. Simulation is used primarily for verification and validation of the mathematical models associated with a predictive analytics model. It can be used to determine whether the probabilistic models are reasonable and appropriate for a particular problem.
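One way to picture the validation role is a simulated goodness-of-fit check, sketched below in Python. It asks whether an exponential model is plausible for a data set by comparing the observed coefficient of variation (which equals 1 for an exponential distribution) with its values in data sets simulated from the fitted model. The choice of statistic, the function names, and the "observed" data (synthetic uniform values made up purely for the demonstration) are all my assumptions.

```python
import math
import random

def coefficient_of_variation(sample):
    """Sample standard deviation divided by the sample mean."""
    n = len(sample)
    mean = sum(sample) / n
    variance = sum((x - mean) ** 2 for x in sample) / (n - 1)
    return math.sqrt(variance) / mean

def monte_carlo_p_value(observed, num_replications=2000, seed=0):
    """Simulated p-value for the hypothesis that the data are exponential."""
    rng = random.Random(seed)
    n = len(observed)
    fitted_rate = n / sum(observed)  # maximum likelihood estimate of the rate
    observed_gap = abs(coefficient_of_variation(observed) - 1.0)
    extreme = 0
    for _ in range(num_replications):
        simulated = [rng.expovariate(fitted_rate) for _ in range(n)]
        if abs(coefficient_of_variation(simulated) - 1.0) >= observed_gap:
            extreme += 1
    # Add-one correction keeps the simulated p-value strictly positive.
    return (extreme + 1) / (num_replications + 1)

# Synthetic stand-in for observed data: uniform values, not exponential.
data_rng = random.Random(1)
observed_times = [data_rng.uniform(0.0, 2.0) for _ in range(200)]
print(f"simulated p-value: {monte_carlo_p_value(observed_times):.4f}")
```

A small p-value suggests the assumed model is not reasonable for the data; a large one means only that this particular statistic gives no evidence against it.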

Two sources for further training in simulation are a workshop in Catonsville, Maryland on September 12-13 by Barry Lawson (University of Richmond) and me, and the Winter Simulation Conference (December 7-10, 2014) in Savannah.

Weekly Round-Up: Machine Learning, DIY Data Scientists, Games, and Helping Couples Conceive

Welcome back to the round-up, an overview of the most interesting data science, statistics, and analytics articles of the past week. This week, we have 4 fascinating articles ranging in topics from machine learning to helping couples conceive. In this week's round-up:

  • Jeff Hawkins: Where Open Source and Machine Learning Meet Big Data
  • The Rise Of The DIY Data Scientist
  • Why Games Matter to Artificial Intelligence
  • Three Questions for Max Levchin About His New Startup

Jeff Hawkins: Where Open Source and Machine Learning Meet Big Data

Our first piece this week is an InfoWorld article about Jeff Hawkins, the machine learning work that he and his company have been doing, and the open source project they've recently released on GitHub. The project's name is the Numenta Platform for Intelligent Computing (NuPIC) and its goal is to allow others to embed machine intelligence into their own systems. The article has a short interview with Jeff and a link to the GitHub page where the project resides.

The Rise Of The DIY Data Scientist

This is an interesting Fast Company article about how Kaggle competition winners tend to be self-taught. The author of the article interviews Kaggle's chief scientist Jeremy Howard about this phenomenon and other interesting findings about data scientists derived from Kaggle's competitions. Some of the questions ask where the winners are from, how they learned data science, and what machine learning algorithms they use.

Why Games Matter to Artificial Intelligence

This post on the IBM Research blog is an interview with Dr. Gerald Tesauro about the significance of games in the field of artificial intelligence. Dr. Tesauro was the IBM research scientist who taught Watson how to play Jeopardy. In the interview, he explains how games tend to be an ideal training ground for machines because they tend to simplify real life. He goes on to answer questions about how that prepares the machines for transitioning to other real-world problems, what he's currently working on, what Watson is doing these days, and where else machine learning can be used.

Three Questions for Max Levchin About His New Startup

Our final piece this week is an MIT Technology Review article about PayPal co-founder Max Levchin's new startup called Glow. A lot of people are having children later in life these days and one downside of this is that many couples have trouble trying to conceive. Levchin has developed an iPhone app that uses data to help couples identify the optimal time for conception. In this brief interview, Levchin talks about what they are doing, why, and the degree of accuracy they hope to achieve.

That's it for this week. Make sure to come back next week when we’ll have some more interesting articles! If there's something we missed, feel free to let us know in the comments below.

Read Our Other Round-Ups

Weekly Round-Up: Data Science Metro Map, Big Data Workers, Prescriptive Analytics, and Knewton

Welcome back to the round-up, an overview of the most interesting data science, statistics, and analytics articles of the past week. This week, we have 4 fascinating articles ranging in topics from big data workers to educational recommendation algorithms. In this week's round-up:

  • Becoming a Data Scientist – Curriculum via Metromap
  • The Growing Need for Big Data Workers: Meeting the Challenge With Training
  • How Prescriptive Analytics Could Harness Big Data to See the Future
  • Q&A With Knewton’s David Kuntz, Maker of Algorithms

Becoming a Data Scientist – Curriculum via Metromap

For those of you who want to get started learning data science but don't know where to begin, this blog post literally maps it out for you. The author has taken the broad subject of data science and created a train map similar to those found in major cities with public transportation. The different tracks of data science are depicted as different colored train lines on the map and the subjects within those tracks are depicted as stops along those lines. Very interesting and definitely worth a look!

The Growing Need for Big Data Workers: Meeting the Challenge With Training

This is a Wired article about how the need for big data workers is growing as there is more and more data that needs to be collected, organized, analyzed, and acted upon. The article talks about the challenges of educating people and highlights the efforts of a few companies such as IBM, Big Data University, and DeveloperWorks.

Speaking of data science education, Data Community DC is hosting a Natural Language Processing Basics workshop on July 27th and there are still a few seats left. You can view details and sign up here.

How Prescriptive Analytics Could Harness Big Data to See the Future

Our third piece this week is about prescriptive analytics and how organizations can use it to help them make data-driven improvements in their operations. The article defines prescriptive analytics, contrasts it with the more commonly used descriptive and predictive analytics, and provides some examples as to how it can be useful.

Q&A With Knewton’s David Kuntz, Maker of Algorithms

Our final piece this week is an article about a company called Knewton and the interesting work they do. Knewton designs recommendation systems for educational products, which help customize the learning experience and tailor it to the individual student. In this article the author interviews David Kuntz, who is Knewton's Vice President of Research, about how their technology works, what kinds of things it can do, and what this means for education in the future.

That's it for this week. Make sure to come back next week when we’ll have some more interesting articles! If there's something we missed, feel free to let us know in the comments below.

Weekly Round-Up: Data Scientist Types, Data Protection, Travel, and Jay-Z

Welcome back to the round-up, an overview of the most interesting data science, statistics, and analytics articles of the past week. This week, we have 4 fascinating articles ranging in topics from data scientist types to data-collecting music apps. In this week's round-up:

  • What Kind Of Data Scientist Are You?
  • Evernote’s Three Laws of Data Protection
  • Big Data Analysis Drives Revolution In Travel
  • Samsung and Jay-Z Accused of Using New Album to Mine Customer Data

What Kind Of Data Scientist Are You?

Our first article this week is a Fast Company piece about the new ebook our very own Harlan Harris, Marck Vaisman, and Sean Murphy authored. The ebook is about how there are actually multiple types of data scientists and the different combinations of skills and experience each type tends to have. The article provides some overview, some excerpts and graphics, and a link to the ebook as well.

Evernote’s Three Laws of Data Protection

This is a Smart Data Collective article about Evernote's stance on data protection and how it differs from other companies. Evernote is one of the most popular note-taking apps on the market, essentially letting you keep a copy of your brain out in the cloud where you can access it from anywhere and remember things your real brain may have forgotten. That being the case, the privacy of their users' data is of great importance to them.

Big Data Analysis Drives Revolution In Travel

Our third piece this week is an InformationWeek article about how data is revolutionizing the travel industry. We've all had to endure the frustrations that often come along with getting from point A to point B. This article highlights several companies and explains how they are using data to operate more efficiently and improve customer experiences.

Samsung and Jay-Z Accused of Using New Album to Mine Customer Data

Our final piece this week is a Time article about how Samsung and rapper Jay-Z offered early access to Jay-Z's new album Magna Carta Holy Grail through an app on select Samsung mobile devices. The intent seemed to be to collect data about the types of customers who would want access to the album before the official release date. This article describes some of the data the app requested and explains why the scope of that data collection has raised some eyebrows.

That's it for this week. Make sure to come back next week when we’ll have some more interesting articles! If there's something we missed, feel free to let us know in the comments below.
