# Thoughts on the INFORMS Business Analytics Conference

This post, from DC2 President Harlan Harris, was originally published on his blog. Harlan was on the board of WINFORMS, the local chapter of the Operations Research professional society, from 2012 until this summer. Earlier this year, I attended the INFORMS Conference on Business Analytics & Operations Research, in Boston. I was asked beforehand if I wanted to be a conference blogger, and for some reason I said I would. This meant I was able to publish posts on the conference's WordPress web site, and was also obliged to do so!

Here are the five posts that I wrote, along with an excerpt from each. Please click through to read the full pieces:

## Operations Research, from the point of view of Data Science

• more insight, less action — deliverables tend towards predictions and storytelling, versus formal optimization
• more openness, less big iron — open source software leads to a low-cost, highly flexible approach
• more scruffy, less neat — data science technologies often come from black-box statistical models, vs. domain-based theory
• more velocity, smaller projects — a hundred $10K projects beats one $1M project
• more science, less engineering — both practitioners and methods have different backgrounds
• more hipsters, less suits — stronger connections to the tech industry than to the boardroom
• more rockstars, less teams — one person can now (roughly) do everything, in simple cases, for better or worse

## What is a “Data Product”?

DJ Patil says “a data product is a product that facilitates an end goal through the use of data.” So, it’s not just an analysis, or a recommendation to executives, or an insight that leads to an improvement to a business process. It’s a visible component of a system. LinkedIn’s People You May Know is viewed by many millions of customers, and it’s based on the complex interactions of the customers themselves.

## Healthcare (and not Education) at INFORMS Analytics

[A]s a DC resident, we often hear of “Healthcare and Education” as a linked pair of industries. Both are systems focused on social good, with intertwined government, nonprofit, and for-profit entities, highly distributed management, and (reportedly) huge opportunities for improvement. Aside from MIT Leaders for Global Operations winning the Smith Prize (and a number of shoutouts to academic partners and mentors), there was not a peep from the education sector at tonight’s awards ceremony. Is education, and particularly K-12 and postsecondary education, not amenable to OR techniques or solutions?

## What’s Changed at the Practice/Analytics Conference?

In 2011, almost every talk seemed to me to be from a Fortune 500 company, or a large nonprofit, or a consulting firm advising a Fortune 500 company or a large nonprofit. Entrepreneurship around analytics was barely to be seen. This year, there are at least a few talks about Hadoop and iPhone apps and more. Has the cost of deploying advanced analytics substantially dropped?

## Why OR/Analytics People Need to Know About Database Technology

It’s worthwhile learning a bit about databases, even if you have no decision-making authority in your organization, and don’t feel like becoming a database administrator (good call). But by getting involved early in the data-collection process, when IT folks are sitting around a table arguing about platform questions, you can get a word in occasionally about the things that matter for analytics — collecting all the data, storing it in a way friendly to later analytics, and so forth.

All in all, I enjoyed blogging the conference, and recommend the practice to others! It's a great way to organize your thoughts and to summarize and synthesize your experiences.

# Elements of an Analytics "Education"

This is a guest post by Wen Phan, who will be completing a Master of Science in Business at George Washington University (GWU) School of Business. Wen is the recipient of the GWU Business Analytics Award for Excellence and Chair of the Business Analytics Symposium, a full-day symposium on business analytics on Friday, May 30th -- all are invited to attend. Follow Wen on Twitter @wenphan.

We have read the infamous McKinsey report. There is the estimated 140,000- to 190,000-person shortage of deep analytic talent by 2018, and an even bigger need - 1.5 million professionals - for those who can manage and consume analytical content. Justin Timberlake brought sexy back in 2006, but it’ll be the data scientist that will bring sexy to the 21st century. While data scientists are arguably the poster child of this most recent data hype, savvy data professionals are really required across many levels and functions of an organization. Consequently, a number of new and specialized advanced degree programs in data and analytics have emerged over the past several years – many of which are not housed in the traditional analytical departments, such as statistics, computer science or math. These programs are becoming increasingly competitive and graduates of these programs are skilled and in demand. For many just completing their undergraduate degrees or with just a few years of experience, these data degrees have become a viable option in developing skills and connections for a burgeoning industry. For others with several years of experience in adjacent fields, such as myself, such educational opportunities provide a way to help with career transitions and advancement.

I came back to school after having worked for a little over a decade. My undergraduate degree is in electrical engineering and at one point in my career, I worked on some of the most advanced microchips in the world. But I also have experience in operations, software engineering, product management, and marketing. Through it all, I have learned about the art and science of designing and delivering technology and products from ground zero - both from technical and business perspectives. My decision to leave a comfortable, well-paid job to return to school was made in order to leverage my technical and business experience in new ways and gain new skills and experiences to increase my ability to make an impact in organizations.

There are many opinions regarding what is important in an analytics education and just as many options for pursuing it, each with its own merits. Given that, I do believe there are a few competencies that should be developed no matter what educational path one takes, whether it is graduate school, MOOCs, or self-learning. What I offer here are some personal thoughts on these considerations based on my own background, previous professional experiences, and recent educational endeavor with analytics and, more broadly, using technology and problem solving to advance organizational goals.

# Not just stats.

For many, analytics is about statistics and a data degree is just slightly different from a statistics one. There is no doubt that statistics plays a major role in analytics, but it is still just one of the technical skills. If you are a serious direct handler of data of any kind, it will be obvious that programming chops are almost a must. For more customized and sophisticated processing, even substantial computer science knowledge – data structures, algorithms, and design patterns – will be required. Of course, even this idea has been pretty mainstream and is nicely captured by Drew Conway’s Data Science Venn Diagram. Other areas, less obviously part of data competency, are data storage theory and implementation (e.g. relational databases and data warehouses), operations research, and decision analysis. The computer science and statistics portions really focus on the sexy predictive modeling aspects of data. That said, knowing how to effectively collect and store data upstream is tremendously valuable. After all, it is often the case that data extends beyond just one analysis or model. Data begets more data (e.g. data gravity). Many of the underlying statistical methods, such as maximum likelihood estimation (MLE), neural networks and support vector machines, all rely on principles and techniques of operations research. Further, operations research, with optimization at its core, offers a prescriptive perspective on analytics. Last, it is obvious that analytics can help identify trends, understand customers, and forecast the future. However, in and of themselves those activities do not add any value; it is the decisions and resulting actions taken on those activities that deliver value. But, often, these decisions must be made in the face of substantial uncertainty and risk - hence the importance of critical decision analysis.

The level of expertise required in various technical domains must align with your professional goals, but a basic knowledge of the above should allow you adequate fluency across analytics activities.
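
To make the link between estimation and optimization concrete, here is a minimal sketch of maximum likelihood estimation treated as a one-dimensional optimization problem. The exponential model, the sample values, and the helper names are all assumptions chosen for illustration; the search routine is a plain golden-section search, and the closed-form answer is included only as a cross-check.

```python
import math

def neg_log_likelihood(rate, samples):
    """Negative log-likelihood of an exponential model with the given rate."""
    return -(len(samples) * math.log(rate) - rate * sum(samples))

def golden_section_minimize(f, lo, hi, tol=1e-8):
    """Minimize a unimodal function f on [lo, hi] by golden-section search."""
    inv_phi = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    while abs(b - a) > tol:
        if f(c) < f(d):
            b = d          # the minimum lies in [a, d]
        else:
            a = c          # the minimum lies in [c, b]
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
    return (a + b) / 2

# Hypothetical waiting times (illustrative data, not from any real study).
samples = [0.8, 1.9, 0.4, 2.7, 1.2]

# Numerical optimum of the likelihood, found by search.
rate_numeric = golden_section_minimize(
    lambda rate: neg_log_likelihood(rate, samples), 1e-6, 10.0
)

# For the exponential model the maximizer is also known in closed form.
rate_closed_form = len(samples) / sum(samples)
```

The two estimates should agree closely. For models without a closed form, the search step is the only route, which is where optimization techniques earn their place in a statistics toolkit.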

# Applied.

I consider analytics an applied degree similar to how engineering is an applied degree. Engineering applies math and science to solve problems. Analytics is similar this way. What matters about applied fields is that they are where the rubber of theory needs to meet the road of reality. Data is not always normally distributed. In fact data is not always valid or even consistent. Formal education offers rigor in developing strong foundational knowledge and skills. However, just as important are the skills to deal with reality. It is no myth that 80% of analytics is just about pre-processing the data; I call it dealing with reality. It is important to understand the theory behind the models, and frankly, it’s pretty fun to indulge in the intricacies of machine learning and convex optimization. In the end though, those things have been made relatively straightforward to implement with computers. What hasn’t (yet) been nicely encapsulated in computer software is the judgment and skill required to handle the ugliness of real-world data. You know what else is reality? Teammates, communication, and project management constraints. All this is to say that so much of an analytics education includes other areas that are not the theory, and I would argue that the success of many analytics endeavors is limited not by theoretical knowledge, but rather by the practicalities of implementation, whether with data, machines, or people. My personal recommendation to aspiring or budding data geeks is to cut your teeth as much as possible in dealing with reality. Do projects. As many of them as possible. With real data. And real stakeholders. And, for those of you manager types, give it a try; it’ll give you the empathy and perspective to effectively work with the hardcore data scientists and manage the analytics process.

# Working with complexity and ambiguity.

The funny thing about data is that you have problems both when you have too little and too much of it. With too little data, you are often making inferences and assessing the confidence of those inferences. With too much data, you are trying not to get confused. In the best case scenarios, your objectives in mining the data are straightforward and crystal clear. However, that is often not the case and exploration is required. Navigating this process of exploration and value discovery can be complex and ambiguous. There are the questions of “where do I start?” and “how far do I go?” This really speaks to the art of working with data. You pick up best practices along the way and develop some of your own. Initial exploration tactics may be as simple as profiling all attributes and computing correlations among a few of them, seeing if anything looks promising or sticks. This difficulty is further exacerbated with “big data”, where computational time is non-negligible and lengthens the feedback loop during any kind of exploratory data analysis.
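
As a rough illustration of that first exploration pass, here is a minimal sketch that profiles every attribute and then computes pairwise Pearson correlations. The tiny data set, the column names, and the helper functions are hypothetical; in practice a data-frame library would do the same job with less code.

```python
import math
import statistics
from itertools import combinations

# Hypothetical customer records; None marks a missing value.
rows = [
    {"age": 34, "visits": 1, "spend": 12.0},
    {"age": None, "visits": 2, "spend": 18.0},
    {"age": 51, "visits": 3, "spend": 33.0},
    {"age": 27, "visits": 4, "spend": 41.0},
    {"age": 45, "visits": 5, "spend": 48.0},
]

def profile(rows):
    """Summarize each attribute: counts, missing values, range, mean, spread."""
    summary = {}
    for column in rows[0].keys():
        values = [row[column] for row in rows if row[column] is not None]
        summary[column] = {
            "count": len(values),
            "missing": len(rows) - len(values),
            "min": min(values),
            "max": max(values),
            "mean": statistics.mean(values),
            "stdev": statistics.stdev(values),
        }
    return summary

def pearson(xs, ys):
    """Pearson correlation; returns NaN if either column is constant."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        return float("nan")
    return covariance / (spread_x * spread_y)

def pairwise_correlations(rows, columns):
    """Correlate each pair of columns over rows where both values are present."""
    result = {}
    for a, b in combinations(columns, 2):
        pairs = [(r[a], r[b]) for r in rows if r[a] is not None and r[b] is not None]
        xs, ys = zip(*pairs)
        result[(a, b)] = pearson(xs, ys)
    return result

summary = profile(rows)
correlations = pairwise_correlations(rows, ["age", "visits", "spend"])
```

A strong correlation in output like this is only a prompt for a closer look, not a finding; the judgment about what "looks promising" still sits with the analyst.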

You can search the web for all kinds of advice on skills to develop for a data career. The few tidbits I include above are just my perspectives on some of the higher order bits in developing solid data skills. Advanced degree programs offer compelling environments to build these skills and gain exposure in an efficient way, including a professional network, resources, and opportunities. However, they are not the only way. As with all professional endeavors, one needs to assess his or her goals, background, and situation to ultimately determine the educational path that makes sense.

# Weekly Round-Up: Hadoop, Big Data vs. Analytics, Process Management, and Palantir

Welcome back to the round-up, an overview of the most interesting data science, statistics, and analytics articles of the past week. This week, we have 4 fascinating articles ranging in topics from Hadoop to business process management. In this week's round-up:

• What’s the Difference Between Big Data and Business Analytics?
• What Big Data Means to BPM
• How A Deviant Philosopher Built Palantir

Our first piece this week is an interesting blog post about what sorts of data operations Hadoop is and isn't good for. The post can serve as a useful guide when trying to figure out whether or not you should use Hadoop to do what you're thinking of doing with your data. It is organized into 5 categories of things you should consider and contains a series of questions you can ask yourself for each of the categories to help with your decision-making.

# What’s the Difference Between Big Data and Business Analytics?

This is an excellent post on Cathy O'Neil's Mathbabe blog about how she distinguishes big data from business analytics. Cathy argues that what most people consider big data is really business analytics (on arguably large data sets) and that big data, in her opinion, consists of automated intelligent systems that algorithmically know what to do and need very little human interference. She goes into more detail about the differences between the two, including some examples to drive home her point.

# What Big Data Means to BPM

Continuing on the subject of intelligent systems performing business processes, our third piece this week is a Data Informed article about big data's effect on business process management. The article is an interview with Nathaniel Palmer, a BPM veteran practitioner and author. In the interview, Palmer answers questions about what kinds of trends are emerging in business process management, how big data is affecting its practices, and what changes are being brought about because of it.

# How A Deviant Philosopher Built Palantir

Our last piece this week is a Forbes article about Palantir, an analytics software company that works with federal intelligence agencies and is funded by In-Q-Tel - the CIA's investment fund. The article describes the company's CEO, what the company does, who it does it for, and delves into some of Palantir's history. Overall, the article provides an interesting look at a very interesting company.

That's it for this week. Make sure to come back next week when we’ll have some more interesting articles! If there's something we missed, feel free to let us know in the comments below.

# Weekly Round-Up: WibiData, Big Data Trends, Analytics Processes, and Human Trafficking

Welcome back to the round-up, an overview of the most interesting data science, statistics, and analytics articles of the past week. This week, we have 4 fascinating articles ranging in topics from Big Data trends to using data to fight human trafficking. In this week's round-up:

• WibiData Gets $15M to Help It Become the Hadoop Application Company
• 7 Big Data Trends That Will Impact Your Business
• Want Better Analytics? Fix Your Processes
• How Big Data is Being Used to Target Human Trafficking

# WibiData Gets $15M to Help It Become the Hadoop Application Company

It was announced this week that Cloudera co-founder Christophe Bisciglia's new company, WibiData, has raised $15 million in a Series B round of financing. WibiData is looking to become a dominant player in the market by selling software that lets companies build consumer-facing applications on Hadoop. This article has additional details about the company and what they are trying to do.

# 7 Big Data Trends That Will Impact Your Business

We're all interested in seeing what the future of data science and Big Data has in store, and this article identifies 7 trends that the author thinks will continue to develop in the years ahead. Some general themes of the trends listed include predictions about platforms, structure, and programming languages.

# Want Better Analytics? Fix Your Processes

In order to succeed in running a data-driven organization, you must have the proper analytical business processes in place so that any insights derived from your efforts can be applied to improving operations. In this article, the author proposes 5 principles to ensure analytics are used correctly and deliver the results the organization wants.

# How Big Data is Being Used to Target Human Trafficking

Our last article this week is a piece about how Google announced recently that it will be partnering with other organizations in an effort to leverage data analytics in helping to fight human trafficking. Part of the effort will include aggregation of previously dispersed data and another part will consist of developing algorithms to identify patterns and better predict trafficking trends. This article lists additional details about the project.

That's it for this week. Make sure to come back next week when we’ll have some more interesting articles! If there's something we missed, feel free to let us know in the comments below.

# Weekly Round-Up: Big Data Value, Education, Social Data Analysis, and Saving the Planet

Welcome back to the round-up, an overview of the most interesting data science, statistics, and analytics articles of the past week. This week, we have 4 fascinating articles ranging in topics from Big Data's impact on education to using data to reduce global violence. In this week's round-up:

• The Value of Big Data Isn't the Data
• Big Data Will Revolutionize Learning
• Data Analysis Should Be a Social Event
• Using Big Data to Save the Planet

# The Value of Big Data Isn't the Data

This is a Harvard Business Review blog post by Kris Hammond, CTO of Narrative Science and Northwestern faculty member, about where he believes the value is in Big Data. Hammond proposes that the value is in getting machines to conduct the data analysis we need conducted and communicating their findings in an intuitive way. In the post, he describes in more detail why he believes this is so valuable and provides explanations and diagrams outlining the steps that can be taken in order to put these processes in place.

# Big Data Will Revolutionize Learning

This interesting Smart Data Collective article is about how technology now allows us to capture information about virtually everything that happens in education and what this means for the future of education. Some of these things include customizing content for individual students, reducing drop-out rates, and enhancing the overall learning experience - all resulting in improved student outcomes. The article talks a little about each of these and describes how they are, and will continue to be, implemented.

# Data Analysis Should Be a Social Event

This is another interesting HBR article advocating a more social approach to solving data analysis problems.
The authors urge us to use an approach familiar to those who have attended data-dives or hackathons before: get a group of people with different perspectives together to brainstorm and come up with ideas about how best to solve the problem you're trying to solve. The article points out that this approach doesn't just work well at hackathons; it has also been implemented with great success at companies.

# Using Big Data to Save the Planet

Our final article this week is a Slashdot piece about how the U.S. State Department is partnering with groups from around the world and using data analytics to help reduce violence in countries where it is a major problem. According to the article, they are using an analytics tool named Senturion to track data that can be obtained from social networks, economic data, and other sources to provide output that can help determine what types of resources are necessary on the ground in those troubled countries. The article mentions some of the countries where this analytics system is helping to identify conflict trends and also provides some examples of specific initiatives it is providing assistance with.

That's it for this week. Make sure to come back next week when we’ll have some more interesting articles! If there's something we missed, feel free to let us know in the comments below.

# Guys Vs Girls - Check out the New Data Blog from Local Startup Hinge

Today we are reblogging (with permission of course) a new local data blog, Hingesights (insights.hinge.co), from local dating startup Hinge. In their own words:

Hinge is a better way to meet dates. Simply rate your interest in your friends' Facebook friends, then Hinge lets you know when you're both interested. It’s a fun way to see who’s out there, and to connect with the friends of friends you may never have met otherwise. Matches always share a social connection, so you’re always meeting through friends.

This week we look at the difference in how men and women use Hinge, and how it contributes to the rarely understood mating rituals of one of the world’s most mysterious and interesting specimens: the Single Washingtonian.

## Hingesights: Guys vs. Girls!

Let the battle of the sexes begin! Our nerds have been at it again, and it turns out guys and girls do not play Hinge the same way. Ladder theorists and evolutionary biologists step aside - Hinge is here to shed a little light on the great mysteries of courtship.

Who makes the cut?

It’s common knowledge that girls are pickier than guys. If you want proof, we suggest taking a girl and a guy to a restaurant, and see which one asks to completely restructure their salad. The big question is, how much pickier is the fairer sex? All in all, girls favorite only 16% of their daily potential matches. Remember, ladies-- you don’t have to hoard your favorites. They’re unlimited. It doesn’t mean the guy is your soulmate, just that you’d be open to starting a conversation.

On the other side, guys favorite a solid 34% of their daily potentials. Chivalry, perhaps? Equal opportunity daters? Maybe it’s video game tendencies, and their thumbs just instinctively favorite girls because of the relative location of the buttons on an X-Box controller. Whatever the reason, our data confirms that girls are pretty darn choosy with their potential dates, and guys are a bit more “open-minded.” Or whatever word you’d like to use there.

The Clooney Effect

Another trend we noticed is that as age increases, the likelihood of favoriting potentials slightly increases for women, but actually decreases for men. Are men just losing their motivation, or are they suddenly slammed with dating options as they enter their Clooney years? Either way, we're certainly glad to see women pursuing a solid dating life, regardless of age. Get it, girls.

Our closing takeaways? Both sides need to keep saving favorites!
It’s good for you, regardless of where you are on the ladder of life. And ladies of Hinge: live a little! You never know-- your next spontaneous favorite could be your next great date.

### Permalink to this blog post: http://blog.hinge.co/2013/04/04/hingesights-guys-vs-girls/

# Big Data Week 2013

There's a lot more to the data, statistics, and analytics community than "Big Data." I've argued that focusing on the scale of certain modern data sets can distract from key innovations in statistical and machine learning techniques, visualization and exploration tools, and predictive applications that have revolutionized all of our work in recent years. But there's no question that the ability to collect, manage, explore, and productize terabyte and petabyte-scale data sets has been revolutionary for those industries that actually have that quantity of information, and has driven broad interest in the value of data and statistical modeling.

So, we at Data Community DC are very pleased to be the local organizing partners for Big Data Week 2013. Big Data Week is a global, loosely-coordinated festival of events, April 22nd-28th, organized by folks in London, and with participants in at least 18 cities, including Washington, DC.

Locally, we at DC2 decided to use this opportunity as a great excuse to work more closely with other planners of data-related events in the region. In addition to events run by DC2-affiliated Meetups (Data Business DC, Data Science DC, Data Science MD, Data Visualization DC, and R Users DC), we're very pleased to be coordinating with Big Data DC, Open Analytics DC, WINFORMS, and INFORMS MD. Together, we'll be bringing you about ten events over eleven days, all themed in some way around big data! Here's what you need to know:

# The Weird Dynamics of Viral Marketing in a Growing Market

This is the third part of a four part series of blog posts on viral marketing. In part 1, I discuss the faulty assumptions in the current models of viral marketing.
In part 2, I present a better mathematical model of viral marketing. In part 4, I’ll discuss the effects of returning customers.

If the market is static, strong viral sharing can lead to rapid early growth, but once a peak is reached, the number of customers falls to zero unless your product has 100% retention. So how can a business use viral marketing to grow their customer base for the long term? Which factors (i.e. sharing rate, churn, market size, market growth) matter? In this blog post, I’ll adapt the mathematical model of viral marketing from part 2 of this series to examine how changing market size affects viral growth.

What is “the Market”:

What comprises “the market” depends on the nature of the product. If it is an iPhone game, then the market is people who own iPhones and play games on them. If it is a YouTube video of a pug climbing the stairs like a boss, then the market is people with internet connected devices who find that kind of video humorous. For the first example, entering the market could mean that you’ve bought your first iPhone, or started playing games on it. Leaving the market could mean that you’ve stopped playing games or swapped your iPhone for a different type of device. Note that leaving the market is different from becoming a former customer. In the case of an iPhone game, becoming a former customer means that you’ve stopped using the game and perhaps removed it from your phone -- you are still part of the market for iPhone games.

The Model:

If new potential customers can be added to the market and members of any subpopulation can leave the market, the parameters are now:

• $\beta$ - The infection rate (sharing rate)
• $\gamma$ - The recovery rate (churn rate)
• $\alpha$ - Birth rate (rate that potential customers are entering the market)
• $\mu$ - Death rate (rate that people are leaving the market)

In part 2, I described how the population transitions from potential customers to current customers to former customers.
Here, I’ll add terms to the differential equations to model how people are entering and leaving the market. Note that the total market size, $N = S + I + R$, is no longer constant, but will grow or shrink with time. Assume that the population entering the market joins as part of the ‘potential customer’ population. The number of new potential customers joining the market per unit time is $\alpha (S + I + R)$. The populations leaving the market can come from any of the three subpopulations. The numbers of people leaving the potential customer, current customer, and former customer groups per unit time are $\mu S$, $\mu I$, and $\mu R$, respectively. The equations become:

• $dS/dt = -\beta SI + \alpha (S + I + R) - \mu S$
• $dI/dt = \beta SI - \gamma I - \mu I$
• $dR/dt = \gamma I - \mu R$

Examining the Equations:

Since $N = S + I + R$, adding the three equations gives $dN/dt = (\alpha - \mu )N$, which is an equation that we can solve. Thus $N(t) = N(0) e^{(\alpha - \mu ) t}$. The market grows exponentially if $\alpha > \mu$ and shrinks exponentially if $\alpha < \mu$. If $\alpha = \mu$, the total market size stays the same with people entering and leaving the market at equal rates.

We can learn some more about the dynamics by examining where $dS/dt$ and $dI/dt$ are zero (that is, where the potential customer base and current customer base don't change):

• $dI/dt = 0$ if $S = \frac{\gamma + \mu}{\beta}$ or $I = 0$
• $dS/dt = 0$ if $S = \frac{\alpha N(t)}{\beta I + \mu}$

Plotting these lines in the $S$ vs $I$ plane divides it into up to four regions, depending on the relative values of the parameters:

The blue lines represent where $dI/dt = 0$. The green line represents where $dS/dt = 0$. Note that since the value of $S$ on the $dS/dt = 0$ line is proportional to $N(t)$, the green line will rise or lower with time depending on whether the market is growing or shrinking. The red arrows indicate the general direction the $S-I$ trajectory will be moving in while it is in each region.
This suggests that if $S(0) > \frac{\gamma + \mu}{\beta}$ then the number of customers grows, initially. In real terms, that means that initial growth depends on having a large enough sharing rate and number of potential customers compared to the churn and "death" rates. If $\frac{\alpha N(0)}{\mu} > \frac{\gamma + \mu}{\beta}$, that is, the market size, growth rate, and sharing rate are large enough, then the number of customers may fluctuate, cyclically. In either case, the number of customers asymptotically approaches the point where both $dI/dt$ and $dS/dt$ are zero, $\frac{\alpha}{\gamma + \mu}N(t)$ or $\frac{\alpha}{\gamma + \mu}N(0) e^{(\alpha - \mu) t}$. Notice that in this case, the long term behavior does not depend on the viral sharing rate, $\beta$! Instead, how the number of customers grows or shrinks long term depends entirely on whether the market is growing or shrinking.

If $S(0) < \frac{\gamma + \mu}{\beta}$ then the number of customers will only grow if the number of potential customers grows larger than $\frac{\gamma + \mu}{\beta}$ before the number of current customers drops to zero.

In less mathematical terms, what this all means is that, if your customer base doesn't die out in the beginning, your customer base will grow exponentially, as long as you have a growing market. How fast your customer base grows in the long term depends on the growth of the market and the churn rate, but not on the viral sharing rate!

Examples:

As in Part 2, numerically integrating the differential equations allows for visualizing the effect each parameter has on viral growth. First, compare different values of the sharing rate, $\beta$. With the parameters:

• $N(0)$ = 1 million people in the market
• $\gamma$ = 70% of customers lost per day
• $I(0)$ = 10 current customers
• $\alpha$ = 14 new people in the market, per 1000 people
• $\mu$ = 8 people lost from the market, per 1000 people

compare $\beta N(0)$ = 5, 1, 0.8, 0.5 invites per customer per day.
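
The numerical integration mentioned above can be sketched in a few lines of plain Python. This is a minimal illustration, assuming a fixed-step fourth-order Runge-Kutta scheme; the post does not say which integrator was actually used. The function names and step size are hypothetical, and because the post gives $\alpha$ and $\mu$ "per 1000 people" without a time unit, the sketch assumes they are yearly rates and converts them to per-day rates.

```python
import math

def derivatives(state, beta, gamma, alpha, mu):
    """Right-hand side of the S, I, R equations with market entry and exit."""
    S, I, R = state
    N = S + I + R
    dS = -beta * S * I + alpha * N - mu * S
    dI = beta * S * I - gamma * I - mu * I
    dR = gamma * I - mu * R
    return (dS, dI, dR)

def rk4_step(state, dt, params):
    """Advance the state by one step of classical fourth-order Runge-Kutta."""
    def shifted(k, scale):
        return tuple(s + scale * dt * ki for s, ki in zip(state, k))
    k1 = derivatives(state, *params)
    k2 = derivatives(shifted(k1, 0.5), *params)
    k3 = derivatives(shifted(k2, 0.5), *params)
    k4 = derivatives(shifted(k3, 1.0), *params)
    return tuple(
        s + dt / 6.0 * (a + 2.0 * b + 2.0 * c + d)
        for s, a, b, c, d in zip(state, k1, k2, k3, k4)
    )

def simulate(beta, gamma, alpha, mu, S0, I0, R0=0.0, days=365, steps_per_day=20):
    """Return one (S, I, R) tuple per day, starting with the initial state."""
    dt = 1.0 / steps_per_day
    params = (beta, gamma, alpha, mu)
    state = (float(S0), float(I0), float(R0))
    history = [state]
    for _ in range(days):
        for _ in range(steps_per_day):
            state = rk4_step(state, dt, params)
        history.append(state)
    return history

def growth_threshold(beta, gamma, mu):
    """Number of potential customers S above which dI/dt is positive."""
    return (gamma + mu) / beta

# Parameter values taken from the first example in the post.
N0 = 1_000_000
I0 = 10
gamma = 0.70                 # fraction of customers lost per day
# ASSUMPTION: alpha and mu are yearly rates, converted here to per-day rates.
alpha = 14 / 1000 / 365
mu = 8 / 1000 / 365
beta = 0.8 / N0              # so that beta * N(0) = 0.8 invites per customer per day

history = simulate(beta, gamma, alpha, mu, S0=N0 - I0, I0=I0)
threshold = growth_threshold(beta, gamma, mu)
final_market_size = sum(history[-1])
```

One useful sanity check is that the simulated market size `final_market_size` should match the closed-form $N(0) e^{(\alpha - \mu) t}$ regardless of the sharing and churn rates. Plotting the second component of each entry in `history` against the day index gives the customer curve for one parameter set; repeating the call with different values of `beta` reproduces the kind of comparison described here.
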
The condition for initial growth is:

| $\beta N(0)$ | $\frac{\gamma + \mu}{\beta}$ |
|---|---|
| 5 | 140,000 |
| 1 | 700,000 |
| 0.8 | 875,000 |
| 0.5 | 1,400,000 |

For higher values of the sharing rate, $\beta$, a higher initial peak in the number of customers is reached and the ups and downs in the number of current customers end sooner, but the values for which the growth condition is met all follow the same pattern of growth after several months. That is, they all asymptotically approach $\frac{\alpha}{\gamma + \mu}N(0) e^{(\alpha - \mu) t}$.

Now consider the effects of varying churn:

• $N(0)$ = 1 million people in the market
• $\beta N(0)$ = 0.8 invites per customer per day
• $I(0)$ = 10 current customers
• $\alpha$ = 14 new people in the market, per 1000 people
• $\mu$ = 8 people lost from the market, per 1000 people

compare $\gamma$ = 90%, 70%, 40%, or 10% of customers lost per day. The condition for growth is:

| $\gamma$ | $\frac{\gamma + \mu}{\beta}$ |
|---|---|
| 0.9 | 1,125,000 |
| 0.7 | 875,000 |
| 0.4 | 500,000 |
| 0.1 | 125,000 |

Notice how the change in churn rate, $\gamma$, affects both the height of the initial peak and the long term growth. Lower levels of churn also lead to less of a roller coaster ride. For the case of 90% churn, the condition for growth, $S(t) > \frac{\gamma + \mu}{\beta}$, is not met at $t=0$, but because of the growth of the market, $S(t)$ crosses the threshold before $I(t)$ goes to zero and viral growth is still achieved.

Finally, consider the effects of varying the growth rate of the market:

• $N(0)$ = 1 million people in the market
• $\beta N(0)$ = 2 invites per customer per day
• $I(0)$ = 10 current customers
• $\gamma$ = 50% of customers lost per day
• $\mu$ = 8 people lost from the market, per 1000 people

compare $\alpha$ = 4, 8, 14, 16 new people in the market, per 1000 people.

The size of the initial peak is barely affected by the market growth rate, $\alpha$, but long term behavior is affected greatly. For a “birth” rate that is lower than the “death” rate, even with strong viral sharing, the customer base drops to zero within about a month.
For $α = μ$, the customer base reaches a steady value within a year. Growing markets lead to growing customer bases, with small changes in market growth rates having a large effect in the long term.

Conclusion:

All this fancy math and book learning brings us to a conclusion that is nothing new: a large and growing market is still one of the most important factors in growing a customer base for the long term. Keeping more of the customers you have also has a strong effect on long term viral growth. However, the effects of a large viral sharing rate are only seen in the short term. Strong viral growth is important if your goal is to reach a large number of people quickly, as with a viral advertising campaign. But if your goal is to grow a customer base for a product for the long term, any amount of viral sharing can lead to long term growth as long as the market is growing and the churn in the customer base is low enough. In Part 4, I'll look at the effect of returning customers.

TLDR: Differences in viral sharing rates only matter in the short term. For long term growth, the most important factors are low churn and a growing market.

Image Credit.

# Mid Maryland Data Science Kickoff Event Review

On Tuesday, January 29th, nearly 90 academics, professionals, and data science enthusiasts gathered at JHU APL for the kick-off meetup of the new Mid-Maryland Data Science group. With samosas on their plates and sodas in hand, members filled the air with conversations about their work and interests. After their meal, members were ushered into the main auditorium and the presenters took their place at the front.

#### Greetings and Mission by Jason Barbour & Matt Motyka

Jason and Matt kicked off the talks with an introduction of the group. Motivated by both the growth of data science and the vast opportunities being made available by powerful free tools and open access to data, they described their interest in creating a local group that helps grow the Maryland data science community.
Being software developers with analytic experience, Jason and Matt next described their seven keys to a successful analytic: infrastructure, people, data, model, and presentation. Lastly, metrics about the interests and experience of the members were presented.

#### The Rise of Data Products

With excitement and passion, Sean took the stage to show how now is the Gold Rush for data products. Laying out the definition of a data product, and cycling through several well known examples, Sean explained how these products are able to bring social, financial, or environmental value through the combination of data and algorithms. Consumers want data, and the tools and infrastructure needed to supply this demand are available either freely or at extremely low cost. Data scientists are now able to harness this stack of tools to provide the data products that consumers crave. As Sean succinctly stated, it is a great time to work with data. The article version of the talk can be found here.

#### The Variety of Data Scientists

Being a full-fledged data scientist, Harlan followed up Sean by presenting his research into what the name “data scientists” really means. Using the results of a data scientist survey, Harlan listed several skill groupings that provide a shorthand for the variety of skills that data scientists possess: programming, stats, math, business, and machine learning/big data. Next, Harlan explained that the diverse backgrounds of data scientists can be more accurately categorized into four types: data businessperson, data creative, data researcher, and data engineer. With this breakdown, Harlan demonstrated that the data scientist community is actually composed of individuals with a variety of interests and skills.

#### Cloudera Impala - Closing the near real time gap in BIGDATA

A true cyber security evangelist, Wayne Wheeles presented how Cloudera’s Impala was able to make near real time security analysis a reality.
With his years of experience in the field of cyber security, and his prior work utilizing big data technologies, Wayne was given unique access to Cloudera’s latest tool. Through his testing and analysis, he concluded that the Impala tool offered a significant improvement in performance and could become a vital tool in cyber security.

After the last presentation, more than a dozen members joined us at nearby Looney’s Pub to end the night with a few beers and snacks. To everyone's surprise, Donald Miner of EMC Greenplum offered to pick up the tab! You can follow him on Twitter or LinkedIn from this page. If you missed this first event, don't worry as the next one is coming up on March 14th in Baltimore. Check it out here.

# Weekly Round-Up: Long Data, Data Storytelling, Operational Intelligence, and Visualizing Inaugural Speeches

Welcome back to the round-up, an overview of the most interesting data science, statistics, and analytics articles of the past week. This week, we have 4 fascinating articles ranging in topics from data storytelling to visualizing presidential inauguration speeches.

In this week's round-up:

• Forget Big Data, Think Long Data
• Telling a Story with Data
• Is Operational Intelligence the End Game for Big Data?
• US Presidents Inaugural Speech Text Network Analysis

# Forget Big Data, Think Long Data

While Big Data is all the rage these days, this Wired article trumpets the merits of Long Data - data sets that have massive historical sweeps. It points out that Big Data is usually from the present or from the recent past and so you often don't get the same perspective as you do from data sets that span very long timelines. These data sets let you observe how events unfold over time, which can provide valuable insights. The article goes on to describe more differences between Big and Long Data and cites examples of some of the ways Long Data is used today.
# Telling a Story with Data

Deloitte University Press published an interesting post this week about how to tell a story with data. The post argues that unless decision-makers understand the data and its implications, they may not change their behavior and adopt analytical approaches while making decisions. This is where data storytelling - the art of communicating the insights that can be drawn from the data - comes in. The post goes on to describe some good and bad examples of this and also provides some useful guidelines for it.

# Is Operational Intelligence the End Game for Big Data?

This is a post on the Inside Analysis blog that talks about how Business Intelligence is beginning to be taken to the next level, and how that level is Operational Intelligence. With the advancement of data science and big data technologies, organizations are starting to be able to take a deeper look into their data, draw insights that weren't visible previously, and start using predictive analytics to forecast more accurately. The post goes on to talk about Operational Intelligence and how these new insights can be transferred to a user or system that can make the appropriate business decisions and take the required actions.

# US Presidents Inaugural Speech Text Network Analysis

This is a post from Nodus Labs showing off some of the interesting work they've done creating network visualizations out of US presidential inaugural addresses. The post describes their methodology, includes a video explaining the networks, and they even embedded some examples that you can play around with and explore!

That's it for this week. Make sure to come back next week when we’ll have some more interesting articles! If there's something we missed, feel free to let us know in the comments below.

##### Read Our Other Round-Ups

# A Better Mathematical Model of Viral Marketing

This is the second part of a four part series of blog posts on viral marketing.
In part 1, I discuss the faulty assumptions in the current models of viral marketing. In part 3, I show the weird dynamics of viral marketing in a growing market. In part 4, I'll discuss the effects of returning customers.

Current models of viral marketing for the business community rely on faulty assumptions. As a result, these models fail to reflect real world examples. So how can the business community build a more realistic model of viral marketing? How do you know which factors (e.g. viral coefficient, time scale, churn, market size) are most important?

Fortunately, there is a rich history of literature on mathematical models for viral growth (and decline), dating all the way back to 1927. These models rigorously treat viral spread, churn, market size, and even the change in the market size and the possibility that former customers return. Obviously, nobody was thinking of making a YouTube video or an iPhone app go viral back when phones didn't even have rotary dials. These models are of the viral spread of … viruses!

The Model:

The classic SIR model of the spread of disease is by Kermack and McKendrick. (Sorry I couldn’t link to the original paper. You can buy it for $54.50 here -- blame the academic publishing industry). I’ve applied this model to viral marketing by drawing analogies between a disease and a product. The desired outcomes are very different, but the math is the same.

Kermack and McKendrick divide the total population of the market, $N$, into three subpopulations.

• $S$ - The number of people susceptible to the disease (potential customers)
• $I$ - The number of people who are infected with the disease (current customers)
• $R$ - The number of people who have recovered from the disease (former customers).

These three subpopulations change in number over time. Potential customers become current customers as a result of successful invitations. Current customers become former customers if they decide to stop using the product. For simplicity, I’ll treat the total market size, $N = S + I + R$, as static and former customers as immune (for now). The parameters that govern spread of disease are:

• $β$ - The infection rate (sharing rate)
• $γ$ - The recovery rate (churn rate)

Assume that current customers, $I$, and potential customers, $S$, communicate with each other at an average rate that is proportional to their numbers (as governed by the Law of Mass Action). This gives $βSI$ as the number of new customers, per unit time, due to word of mouth or online sharing. As the number of new customers grows by $βSI$, the number of potential customers shrinks by the same number. The rate $β$ plays the same role that the “viral coefficient” does in Skok’s model, but accounts for the fact that conversion rates on sharing slow down when the fraction of people who have already tried the product gets large. It also does away with the concept of "cycle time". Instead, it accounts for the average time it takes to share something and the average frequency at which people share by putting a unit of time into the denominator of $β$. Thus, $β$ represents the number of successful invitations per current customer per potential customer per unit time (e.g. hour, day, week). I propose that this is a more robust definition of viral coefficient than the one used by Ries and Skok because modeling viral sharing as an average rate accounts for the following realities:

• Customers do not share in synchronous batches.
• Each user has a different timeframe for trying a product, learning to love it, and sharing it with friends. Rather than assuming that they all have the same cycle time, $β$ represents an average rate of sharing.
• Users might invite others when first trying a product or after they’ve used it for quite a while.

In this model, current customers become former customers at a rate defined by the parameter $γ$. That is, $γ$ is the fraction of current customers who become former customers in a unit of time. It has the dimensions of inverse time $(1/t)$, and $1/γ$ represents the average time a user remains a user. So, if $γ = 1\%$ of users lost per day, then the average length of time a user remains active is 100 days.
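
To see where the $1/γ$ comes from, here is a short worked calculation. It rests on a simplifying assumption that is mine, not the post's: follow a single cohort of $I(0)$ customers who only churn, with no new sign-ups.

$dI/dt = -γI \quad\Rightarrow\quad I(t) = I(0)\,e^{-γt}$

The fraction of the cohort leaving between $t$ and $t + dt$ is $γe^{-γt}\,dt$, so the average time spent as a customer is

$\int_0^\infty t\,γe^{-γt}\,dt = 1/γ$

With $γ = 0.01$ per day this gives $1/0.01 = 100$ days, matching the figure above.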

The differential equations governing viral spread are:

• $dS/dt = -βSI$
• $dI/dt = βSI - γI$
• $dR/dt = γI$
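
As a rough sketch (not the author's code), the three equations can be written as one Python function that returns the rates of change. The function name `sir_rates` and the sample values are illustrative assumptions. Every customer gained by one group is lost by another, so the three rates always sum to zero, which is why $N$ stays constant in this version of the model.

```python
def sir_rates(s, i, r, beta, gamma):
    """Return (dS/dt, dI/dt, dR/dt) for the SIR equations listed above.

    r is accepted for completeness; with immune former customers it does
    not influence any of the rates.
    """
    new_customers = beta * s * i   # successful invitations per unit time
    churned = gamma * i            # current customers lost per unit time
    ds_dt = -new_customers
    di_dt = new_customers - churned
    dr_dt = churned
    return ds_dt, di_dt, dr_dt


# Illustrative values only: 1,000 potential, 10 current, 0 former customers.
ds, di, dr = sir_rates(s=1000.0, i=10.0, r=0.0, beta=0.001, gamma=0.5)
print(ds, di, dr)
```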

Examining the Equations:

These are non-linear differential equations that cannot be solved to produce convenient, insight-yielding formulas for $S(t)$, $I(t)$, and $R(t)$. What they lack in convenient formulas, they make up for with more interesting dynamics (especially when considering changing market sizes and returning customers). You can still learn a lot by examining them and integrating them numerically. Let’s assume that $t=0$ represents the launch of a new product. Initially, at least the founding team uses the product and represents the initial customer base, $I(0)$. The initial number of former customers, $R(0)$, is zero and the rest of the people in the market are potential customers, $S(0)$.

The first thing to note is that there will be a growing customer base $(dI/dt > 0)$ as long as:

$βS/γ > 1$

That is, viral growth will occur as long as the addressable market size, $S(0)$, and sharing rate, $β$, are sufficiently large compared to the churn, $γ$. This model shows that with a big enough market, you can go viral even with a small $β$ as long as your churn is also small enough (consistent with the Pinterest example described in part 1). This model also shows that the effects of churn cannot be ignored, even in very early viral growth.

If at $t=0$, $S$ is very close to $N$, then $βS/γ$ is approximately $βN/γ$. Thus, if $βN/γ > 1$, initial growth will occur and if $βN/γ < 1$, the customer base will not grow. This is sometimes called the “basic reproductive number” in epidemiology literature. It is essentially what Eric Ries calls the “viral coefficient” although it depends on market size and churn as well as the viral sharing rate. It is approximately the average number of new customers each early customer will invite during the entire time that they remain a customer, which is $1/γ$. However, in the case that viral growth does occur, $βN/γ$ rapidly ceases to represent the number of customers that each customer invites.
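
A quick way to check the growth condition is to compute $βN/γ$ directly. This is a minimal sketch; the helper name `basic_reproductive_number` is hypothetical, and the inputs are the example values used in this post ($N$ = 1 million, $βN = 10$ invites per customer per day, $γ = 50\%$ per day).

```python
def basic_reproductive_number(beta, n, gamma):
    """Approximate number of new customers each early customer brings in
    over their whole time as a customer: beta * N / gamma."""
    return beta * n / gamma


n = 1_000_000          # market size
beta = 10 / n          # chosen so that beta * N = 10 per day
gamma = 0.5            # 50% of customers lost per day

r0 = basic_reproductive_number(beta, n, gamma)
grows_initially = r0 > 1   # the condition beta * N / gamma > 1 from the text
print(r0, grows_initially)
```

With these inputs the ratio is about 20, well above the threshold of 1, so the customer base grows at launch.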

Another thing you can see by examining the equations is that if you ignore the change in the market size (an approximation that makes sense for short lived virality, such as with a YouTube video), the customer base always goes to zero at long times unless you have zero churn. Once the number of current customers reaches a peak where $dI/dt = 0$, which happens when $S = γ/β$ (that is, at $I = N - γ/β - R$), the rate of change in the number of current customers becomes negative and the number of customers eventually reaches zero. This is consistent with the data provided in the Mashable post on the half-lives of Twitter vs. YouTube content. Again, note the key role that churn has in determining the peak number of customers.
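
The height of that peak can be worked out without integrating in time. The following is the standard SIR phase-plane calculation, sketched here for reference rather than taken from the post. Dividing $dI/dt$ by $dS/dt$ gives

$dI/dS = -1 + \frac{γ}{βS}$

Integrating from the starting point $(S(0), I(0))$:

$I(S) = I(0) + S(0) - S + \frac{γ}{β}\ln\frac{S}{S(0)}$

The maximum is reached when $S = γ/β$, the point where $dI/dt = 0$:

$I_{max} = I(0) + S(0) - \frac{γ}{β}\left[1 + \ln\frac{βS(0)}{γ}\right]$

Since $R(0) = 0$, $I(0) + S(0) = N$. For $N$ = 1 million, $βN = 10$ per day and $γ = 0.5$ per day, $γ/β = 50{,}000$ and $\ln 20 ≈ 3.0$, so the peak is roughly 800,000 customers. A smaller $γ$ shrinks the subtracted term and raises the peak.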

Examples:

We can gain more insight from these equations by numerically integrating them. For these examples, the unit of time used to define $β$ and $γ$ is one day, though the choice is arbitrary. I’ve given values of $β$ as $βN$ to create better correspondence with Ries’ concept of viral coefficient -- If at $t=0$, $S(0)$ is approximately $N$, $βN$ is approximately the number of new customers each existing customer begets per day.

With the parameters:

• $N = 1$ million people in the market
• $βN = 10$ invites per current user per day
• $γ = 50\%$ of customers lost per day
• $I(0) = 10$ current customers

numerically integrating the equations given above yields the following for how the number of customers changes for the first 30 days: This shows a traffic pattern similar to that of a popular Twitter link where traffic quickly spikes and then dies down as people tire of looking at it. (In the case of visiting a webpage, a “customer” can be defined as a visitor).
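
The run described above can be reproduced, at least approximately, with a short script. This is a sketch under my own assumptions, not the author's code: it uses a fixed-step fourth-order Runge-Kutta integrator written by hand, and the names `simulate_sir` and `steps_per_day` are illustrative. $R$ is not tracked because it does not feed back into $S$ or $I$.

```python
def simulate_sir(n, beta, gamma, i0, days, steps_per_day=100):
    """Integrate dS/dt = -beta*S*I and dI/dt = beta*S*I - gamma*I with a
    fixed-step 4th-order Runge-Kutta scheme.

    Returns (history, dt), where history[k] is I at time k * dt (in days).
    """
    def rates(s, i):
        new_customers = beta * s * i
        return -new_customers, new_customers - gamma * i

    dt = 1.0 / steps_per_day
    s, i = float(n - i0), float(i0)
    history = [i]
    for _ in range(days * steps_per_day):
        k1s, k1i = rates(s, i)
        k2s, k2i = rates(s + 0.5 * dt * k1s, i + 0.5 * dt * k1i)
        k3s, k3i = rates(s + 0.5 * dt * k2s, i + 0.5 * dt * k2i)
        k4s, k4i = rates(s + dt * k3s, i + dt * k3i)
        s += dt * (k1s + 2 * k2s + 2 * k3s + k4s) / 6
        i += dt * (k1i + 2 * k2i + 2 * k3i + k4i) / 6
        history.append(i)
    return history, dt


n = 1_000_000
history, dt = simulate_sir(n=n, beta=10 / n, gamma=0.5, i0=10, days=30)
peak_index = max(range(len(history)), key=history.__getitem__)
peak_customers = history[peak_index]
peak_time = peak_index * dt
print(f"peak of about {peak_customers:,.0f} customers after {peak_time:.2f} days")
print(f"customers left after 30 days: {history[-1]:.2f}")
```

Changing `gamma` to 0.01 and `days` to 300 gives the slower-churn case discussed next.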

For a smaller churn rate, $γ = 1\%$ of customers lost per day, we see the following for the growth and decline in the number of customers over 300 days: This shows how even for low values of churn, without new potential customers joining the market, or former customers returning, the customer base always diminishes after reaching its peak. Also note how a smaller churn rate allows us to reach a higher peak in traffic.

So how can viral growth be sustained? For that, you need to consider how the change in the market size affects viral marketing, which I’ll examine in part 3.

(For another fun example of how to apply the SIR model, see my post on the Mathematics of the Walking Dead.)

TLDR: A better definition of “viral coefficient” is successful invitations per existing user per potential user per unit time. But market size and customer churn are just as, if not more, important than viral coefficient. Viral growth in a static market is unsustainable unless you have absolutely zero churn.

Image Credit.

# 4 Major Mistakes in the Current Understanding of Viral Marketing

This is the first part of a four part series of blog posts on viral marketing. In part 2, I present a better mathematical model of viral marketing. In part 3, I show the weird dynamics of viral marketing in a growing market. In part 4, I'll discuss the effects of returning customers.

Viral marketing is arguably the most sought after engine of growth due to its potential to drive explosive increases in the number of customers at little or no cost. There has been tremendous interest in understanding virality in marketing and product development, particularly with the advent of the social web. Unfortunately, efforts to build mathematical models for the business community have not been successful in reflecting reality and offering insights into which factors matter most in achieving viral growth.

In The Lean Startup, Eric Ries defines the viral coefficient as “how many new customers will use the product as a consequence of each new customer who signs up” and declares that a viral coefficient greater than 1 will lead to exponential growth while a viral coefficient less than 1 leads to little growth at all. However, his treatment of the viral coefficient makes no mention of a timescale. Is it the number of new customers an existing customer brings in immediately upon signing up? Or within a day? Or within the entire time that they are using the product?

At ForEntrepreneurs.com, David Skok introduces the concept of a “cycle time” -- the total time it takes to try a product and share it with friends. In doing so, he correctly notes the importance of a timescale as a factor in achieving viral growth. In fact, he declares it to be even more important than the viral coefficient. He first models the accumulation of users in a spreadsheet and then, with help from Kevin Lawler, derives a formula for viral growth:

$C(t)=C(0) (K^{1+t/c_t} - 1) / (K-1)$

where $C(t)$ represents the number of customers at time $t$, $K$ represents the viral coefficient, and $c_t$ represents the cycle time.
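
To make the formula concrete, here is a small sketch that evaluates it and compares it with a cycle-by-cycle tally. The batch logic in `batch_customers` is my reading of the spreadsheet approach (an assumption), and the input values are hypothetical. At whole numbers of cycles the formula is simply the sum of a geometric series, so the two agree.

```python
def skok_customers(c0, k, cycle_time, t):
    """Customers at time t under the formula above:
    C(t) = C(0) * (K**(1 + t / c_t) - 1) / (K - 1), valid for K != 1."""
    return c0 * (k ** (1 + t / cycle_time) - 1) / (k - 1)


def batch_customers(c0, k, cycles):
    """Tally customers cycle by cycle: each batch of new customers brings in
    K times its own size one cycle later, and nobody ever leaves."""
    total, newest = c0, c0
    for _ in range(cycles):
        newest = newest * k
        total += newest
    return total


# Hypothetical inputs: 10 initial customers, K = 2, 6 days elapsed.
two_day_cycle = skok_customers(c0=10, k=2, cycle_time=2, t=6)   # 3 cycles
one_day_cycle = skok_customers(c0=10, k=2, cycle_time=1, t=6)   # 6 cycles
print(two_day_cycle, batch_customers(c0=10, k=2, cycles=3))
print(one_day_cycle, batch_customers(c0=10, k=2, cycles=6))
```

Halving the cycle time doubles the number of compounding steps in the same six days, which is why the cycle time dominates the result in this model.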

This model depends on the following assumptions, which I’ll address in turn:

1. The market is infinite.
2. There is no churn in the customer base -- once a customer, always a customer.
3. Customers send invites shortly after trying the product, if at all, and never again.
4. Every customer has the same cycle time and the cycles all happen in unison.

Infinite Market Size:

Since viral growth can be so explosive, the market for a product can become saturated very quickly. As the market becomes saturated, fewer potential customers will respond to invitations, effectively reducing the “viral coefficient” (as it is defined by Ries and Skok). Since market saturation could occur in a matter of days or weeks, we cannot ignore the effect of a finite market size.

Customer Churn:

Neither Ries’ nor Skok’s models account for churn in customers -- the rate at which customers stop using the product. Eric Ries treats the concept of churn in his discussion of another engine of growth which he calls  “Sticky Marketing” and suggests that startups concentrate on only one engine of growth at a time. While the advice for startups to focus on one engine of growth at a time may be sound, it does not justify leaving this very real effect out of the equations.

When Customers Send Invites:

Skok’s model depends on the assumption that each new customer sends invitations shortly after trying the product and then never again. While this may be true for some products, it’s likely that the pattern of sharing depends on the nature of the product and often occurs long after the user has had a chance to try the product and grow to love it.

Customer Cycle Times:

How users beget more users via invitations, leading to a compounding user base, is the essence of viral growth. In Skok’s model, users receive invites, try the product, love the product, and invite a batch of new users in synchronous cycles. These uniform cycles correlate conveniently to the columns of a spreadsheet. But in reality, different users take different amounts of time to progress from trying a new product to inviting their friends. The number of users of a product doesn't compound at finite, regular intervals like bank interest. Instead, it ramps up customer by customer. The customers have a distribution of cycle times which are not synchronous, but staggered randomly.

By modeling viral growth as batches of invitations that happen in tandem, Skok effectively conflates the timescale of “try it, love it, share it” with a compounding interval. That is, Skok’s concept of a “cycle time” represents two unrelated timescales:

1. The total time between trying a product for the first time and inviting a batch of friends to try it.
2. The finite intervals at which the user base compounds.

The reason Skok’s “cycle time” has such a large effect is because it represents a compounding interval, not because it represents how quickly users go from trying a product to sharing it with friends. Furthermore, this compounding interval is an artifact of the assumption that current users invite new users in synchronous cycles. Thus, the conclusion that “cycle time” is the most important factor in achieving viral growth is an artifact of the faulty "synchronous cycle" assumption in this model.

The assumption that the number of users compounds at finite, regular intervals is further muddled by the derivation of a continuous formula. If current users only invite new users in a batch at the end of each cycle, then how can the number of users ramp up continuously during the cycle? A continuous, as opposed to stepwise, viral growth in usership is only possible if sharing is happening constantly, at an average rate. To model this, we need a totally new definition of a viral coefficient that includes a timescale in the denominator -- the average number of invitations per unit time.

How this Model Breaks Down:

Some real world examples contradict the models developed by Ries and Skok. A post at TechCrunch regarding the growth of Pinterest shows that a customer base can grow even with a smallish viral coefficient as long as the churn rate is low. This post from Mashable compares the half life of various sharing methods, showing that even the content shared via methods that have longer half lives get very little traffic once everyone has seen it a few times.

So how can the business community build a more realistic model of viral marketing? In part 2, I present a better mathematical model for viral marketing that uses a better definition of viral coefficient and takes into account churn, finite market size, and continuous sharing.

TLDR: Skok's conclusion that "cycle time" is the most important factor in viral marketing is an artifact of faulty assumptions. A better model of viral marketing requires redefining the "viral coefficient".

Image Credit.

# The (near) Future of Data Analysis - A Review

co-organizes Data Business DC, among many other things. Hadley Wickham, having just taught workshops in DC for RStudio, shared with the DC R Meetup his view on the future, or at least the near future, of Data Analysis. Herein lie my notes for this talk, spiffed up into semi-comprehensible language. Please note that my thoughts, opinions, and biases have been split, combined, and applied to his. If you really only want to download the slides, scroll to the end of this article.

As another legal disclaimer, one of the best parts of this talk was Hadley's commitment to making a statement, or, as he related, "I am going to say things that I don't necessarily believe but I will say them strongly."

You will also note that Hadley's slides are striking even at 300px wide ... evidence that the fundamental concepts of data visualization and graphic design overlap considerably.

Data analysis is a heavily overladen term with different meanings to different people.

However, there are three sets of tools for data analysis:

1. Transform, which he equated to data munging or wrangling
2. Visualization, which is useful for raising new questions but, as it requires eyes on each image, does not scale well; and
3. Modeling,  which complements visualization and where you have made a question sufficiently precise that you can build a quantitative model. The downside to modeling is that it doesn't let you find what you don't expect.

Now, I have to admit I loved this one. Data analysis is "typing not clicking." Not to disparage all of those Excel users out there but programming or coding (in open source languages) allows one to automate processes, make analyses reproducible, and even help communicate your thoughts and results, even to "future you."  You can also throw your code on Stack Overflow for help or to help others.

Hadley also described data analysis as much more cogitation time than CPU execution time. One should spend more time thinking about things than actually doing them.  However, as data sets scale, this balance may shift a bit ... or one could argue the longer it takes to run your analysis, the more thought you should put into the idea and code before it runs for days as the penalty for doing the wrong analysis or an incorrect analysis grows. Luckily, we aren't quite back to the days of the punchcard.

Above is a nice way of looking at some of the possible data analysis tool sets for different types of individuals. To put this into the vernacular of the data scientist survey that Harlan, Marck and I put out,  R+js/python would map well to the Data Researcher, R+sql+regex+xpath, could map to the Data Creative, and R+java+scala+C/C++ could map to the Data Developer.  Ideally, one would be a polyglot and know languages that span these categorizations.

Who doesn't love this quote? The future (of R and data analysis) is here in pockets of advanced practitioners. As ideas disperse through the community and the rest of the masses catch up, we push forward.

Communication is key ...

but traditional tools fall short when communicating data science ideas and results and methods. Thus, rmarkdown gets it done and can be quickly and easily translated into HTML.

Going one step further but still coupled to rmarkdown is a new service, RPubs, that allows one click publishing of rmarkdown to the web for free. Check it out ...

If rmarkdown is the Microsoft Word of data science, then Slidify is comparable to PowerPoint (and it is free), allowing one to integrate text, code, and output powerfully and easily.

While these tools are great, they aren't perfect. We are not yet at a point where our local IDE has been seamlessly integrated into our code versioning system, our data versioning system, our environment and dependency versioning system, our publishing/broadcasting results generating system, or our collaboration systems.

Not there yet ...

Amen.

Basically, Rcpp allows you to embed C++ code into your R code easily.  Why would someone want to do that? Because it allows you to easily circumvent the performance penalty of FOR loops in R; just write them in C++.

On a personal rant, I don't think mixing in additional languages is necessarily a good idea, especially C++.

Notice the units of microseconds.  There is always a trade off between the time spent optimizing your code and running your slow code.

Awesome name, full stop. Let's take two great tastes, ggplot2 and D3.js, and put them together.

If you don't know about ggplot2 or the Grammar of Graphics, click the links!

D3 allows one to make beautiful, animated, and even interactive data visualizations in javascript for the web. If you have seen some of the impressive interactive visualizations at the New York Times, you have seen D3 in action.  However, D3 has a quite steep learning curve as it requires understanding of CSS, HTML, javascript,

As a comparison, what might take you a few lines of code to create in R + ggplot2 could take you a few hundred lines of code in D3.js. Some middle ground is needed, allowing R to produce web suitable, D3-esque graphics.

ps Just in case you were wondering, r2d3 does not yet exist. It is currently vaporware.

Enter shiny which allows you to make web apps with R hidden as the back end, generating .pngs that are refreshed, potentially with an adjustable parameter input from the same web page. This doesn't seem the Holy Grail everyone is looking for but is moving the conversation forward.

One central theme was the idea that we want to say what we want and allow the software to figure out the best way to do that. We want a D3-type visualization but we don't want to learn 5 languages to do it. Also, this applies equally on the data analysis side for data sets ranging over many orders of magnitude.

Another theme was that the output of the future is HTML5. I did not know this, but RStudio is basically a web browser: everything is drawn using HTML5, js, and CSS.

Loved this slide because who doesn't want to know?!

DPLYR is an attempt at a grammar of data manipulation, abstracting out the back end of crunching the data from the description of what someone wants done (and no, SQL is not the solution to that problem).

And this concludes what was a fantastic talk about The ^(near) Future of Data Analysis. If you've made it this far and still want to download Hadley's full slide deck or Marck's introductory talk, look no further:

# 3 Ways You’re Ruining Your GA Data, and How to Stop: Tips from a Data Scientist

This is a cross post by Sean Patrick Murphy (@sayhitosean), who runs the DataBusinessDC meetup. Finish reading over at Spinnakr's blog. Marketers who want to be more data-driven often start with Google Analytics. It’s ubiquitous, free, and comparatively easy to use. It’s also the first client data I have access to in many of my consulting projects. Since I see it all the time, I’ve noticed that most companies’ GA is afflicted with serious hygiene problems.

To learn more about three of the common issues ruining the usability of GA data, and instructions for fixing them, keep reading here.