
Event Recap: Tandem NSI Deal Day (Part 2)

This is the second part of a guest post by John Kaufhold. Dr. Kaufhold is a data scientist and managing partner of Deep Learning Analytics, a data science company based in Arlington, VA. He presented an introduction to Deep Learning at the March Data Science DC Meetup. Tandem NSI is a public-private partnership between Arlington Economic Development and Amplifier Ventures. According to the TNSI website, the partnership is intended to foster a vibrant technology ecosystem that combines entrepreneurs, university researchers and students, national security program managers and the supporting business community. I attended the Tandem NSI Deal Day on May 7; this post is a summary of a few discussions relevant to DC2.

In part one, I discussed the pros and cons of starting a tech business in the DC region. In this post, I'll discuss the specific barriers to entry that entrepreneurs pursuing federal contracts should be aware of when operating in our region, as well as ideas for how interested members of our community can get involved.

Barriers to innovation and entrepreneurship for federal contractors

One of the first talks of the day came from SpaceX's Deputy General Counsel, David Harris. It captured in one slide an issue every small technology company operating in the federal space faces: the FAR (Federal Acquisition Regulation). David simply counted the number of clauses in different types of contracts, including standard Cooperative Research and Development Agreements, Contract Service Level Agreement Property Licenses, SpaceX's Form LSA, and a commercial-off-the-shelf procurement contract. Each of these contracts generally contains 12 to 27 clauses. As a bottom line, he compared these to the number of clauses in a traditional FAR fixed-price contract with one cost-plus Contract Line Item Number: more than 200. In discussion, there was even a suggestion that the federal government might want to reexamine how it does business with smaller technology companies, to encourage innovators to spend their time innovating rather than parsing legalese. The tacit message was that the FAR may go too far. Add to the FAR the requirements of the Defense Contract Audit Agency and sometimes months-long contracting delays, and you have a heavy legal and accounting burden on innovators.

Peggy Styer of Blackbird also told a story about how commitment to mission and successful execution for the government can sometimes narrow the potential market for a business. A paraphrase of Peggy's story: it's good to be focused on mission, but there can be strategic conflict between commercial and government success. As an example, when they came under fire in theater, special operations forces were once expected to carry a heavy tracking device the size of a car battery and run for their lives into the desert, where a rescue team could later find and retrieve them. Blackbird miniaturized a tracking device with the same functionality, which made soldiers on foot faster and more mobile, improving survivability. The US government loved the device. But it loved it so much that it asked Blackbird to sell the device to the government exclusively, rather than commercialize it more broadly. This can put innovators who serve the government in a difficult position, with a smaller market than they might have reached in the broader commercial space.

Dan Doney, Chief Innovation Officer at the Defense Intelligence Agency, described the precedent-setting "culture" of the "man on the moon" success that is in many ways still a blueprint for how research is conducted in the federal government. Putting a man on the moon was a project of a scale and complexity that only a coordinated US government could manage in the 1960s. To accomplish the mission, the government collected requirements, matched requirements with contractors, and systematically filled them all. That was a tremendous success. Almost 50 years later, however, a slavish focus on requirements may be the problem, Dan argued. He described "so much hunger" among our local innovative entrepreneurs to solve mission-critical problems that, to exploit it, the government needs to eliminate the "friction" from the system. Dan argued that eliminating that friction has been shown to get enormous results faster and cheaper than traditional contracting models. He continued: "our innovation problems are communication problems," pointing out that Broad Agency Announcements (BAAs) -- how the US government often announces project needs -- are terrible abstractions of the problems to be solved. The overwhelming jumble of legalese that has nothing to do with the technical work was also discussed as a barrier for technical minds: just finding the technical nugget a BAA is really asking for is an exhausting search across all the FedBizOpps announcements.

There was also a brief discussion of how contracts can become inflexible handcuffs, focusing contractors on "hitting their numbers" on the tasks a program manager thought they should solve at the time of contracting, even when, over the course of a program, it becomes clear the contractor should now be solving other, more relevant problems. In essence, contractors are asked to pose and answer relevant research questions, and research is executed through contracts, but those contracts often become counterproductively inflexible for posing and answering research questions.

What can DC2 do?

  1. I only recognized three DC2 participants at this event. With a bigger presence, we could be a more active and relevant part of the discussion on how to incentivize government to make better use of its innovative entrepreneurial resources here in the DMV.
  2. Deal Day provided a forum to hear from both successful entrepreneurs and the government side. These panels documented strategies that some performers used to successfully navigate those opportunities for their businesses. What Deal Day didn't offer was a chance to hear from small innovative startups about what their particular needs are. Perhaps DC2 could conduct a survey of its members to inform future Tandem NSI discussions.

Elements of an Analytics "Education"

This is a guest post by Wen Phan, who will be completing a Master of Science in Business at George Washington University (GWU) School of Business. Wen is the recipient of the GWU Business Analytics Award for Excellence and Chair of the Business Analytics Symposium, a full-day symposium on business analytics on Friday, May 30th -- all are invited to attend. Follow Wen on Twitter @wenphan.

GWU Business Analytics Symposium, 5/30/14, Marvin Center

We have read the well-known McKinsey report. There is the estimated 140,000- to 190,000-person shortage of deep analytic talent by 2018, and an even bigger need -- 1.5 million professionals -- for those who can manage and consume analytical content. Justin Timberlake brought sexy back in 2006, but it will be the data scientist who brings sexy to the 21st century. While data scientists are arguably the poster child of the most recent data hype, savvy data professionals are really required across many levels and functions of an organization. Consequently, a number of new and specialized advanced degree programs in data and analytics have emerged over the past several years -- many of which are not housed in the traditional analytical departments, such as statistics, computer science, or math. These programs are becoming increasingly competitive, and their graduates are skilled and in demand. For many just completing their undergraduate degrees, or with just a few years of experience, these data degrees have become a viable option for developing skills and connections for a burgeoning industry. For others with several years of experience in adjacent fields, such as myself, such educational opportunities provide a way to help with career transitions and advancement.

I came back to school after having worked for a little over a decade. My undergraduate degree is in electrical engineering, and at one point in my career I worked on some of the most advanced microchips in the world. But I also have experience in operations, software engineering, product management, and marketing. Through it all, I have learned about the art and science of designing and delivering technology and products from the ground up -- from both technical and business perspectives. I decided to leave a comfortable, well-paid job and return to school in order to leverage my technical and business experience in new ways and to gain new skills and experiences that increase my ability to make an impact in organizations.

There are many opinions about what is important in an analytics education and just as many options for pursuing one, each with its own merits. That said, I believe there are a few competencies that should be developed no matter what educational path one takes, whether it is graduate school, MOOCs, or self-learning. What I offer here are some personal thoughts on these competencies, based on my own background, previous professional experience, and recent educational endeavors with analytics and, more broadly, with using technology and problem solving to advance organizational goals.

Not just stats.

For many, analytics is about statistics, and a data degree is just slightly different from a statistics one. There is no doubt that statistics plays a major role in analytics, but it is still just one of the technical skills. If you are a serious direct handler of data of any kind, it quickly becomes obvious that programming chops are almost a must. For more customized and sophisticated processing, substantial computer science knowledge -- data structures, algorithms, and design patterns -- will be required. Of course, this idea is already fairly mainstream and is nicely captured by Drew Conway's Data Science Venn Diagram. Areas that are less obviously part of data competency include data storage theory and implementation (e.g., relational databases and data warehouses), operations research, and decision analysis. The computer science and statistics portions really focus on the sexy predictive modeling aspects of data. That said, knowing how to effectively collect and store data upstream is tremendously valuable. After all, it is often the case that data extends beyond just one analysis or model. Data begets more data (e.g., data gravity). Many of the underlying statistical methods, such as maximum likelihood estimation (MLE), neural networks, and support vector machines, rely on principles and techniques of operations research. Further, operations research, also called optimization, offers a prescriptive perspective on analytics. Last, it is obvious that analytics can help identify trends, understand customers, and forecast the future. However, in and of themselves those activities do not add any value; it is the decisions and resulting actions taken on them that deliver value. But these decisions must often be made in the face of substantial uncertainty and risk -- hence the importance of critical decision analysis. The level of expertise required in various technical domains must align with your professional goals, but a basic knowledge of the above should give you adequate fluency across analytics activities.
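To make the link between statistics and optimization concrete, here is a minimal, hypothetical sketch (mine, not from the original post) that computes a maximum likelihood estimate by handing a negative log-likelihood to a general-purpose numerical optimizer -- the same machinery operations research is built on. The simulated data and starting values are invented for illustration, and numpy and scipy are assumed to be available.

```python
import numpy as np
from scipy.optimize import minimize

# Simulated data: 1,000 draws from a normal distribution with "unknown" parameters.
rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=1000)

def negative_log_likelihood(params, x):
    """Negative log-likelihood of a normal model; minimizing it gives the MLE."""
    mu, log_sigma = params              # optimize log(sigma) so sigma stays positive
    sigma = np.exp(log_sigma)
    return 0.5 * np.sum(np.log(2 * np.pi * sigma**2) + ((x - mu) / sigma) ** 2)

# Maximum likelihood estimation posed as a numerical optimization problem.
result = minimize(negative_log_likelihood, x0=[0.0, 0.0], args=(data,))
mu_hat, sigma_hat = result.x[0], float(np.exp(result.x[1]))
print(f"MLE estimates: mu = {mu_hat:.2f}, sigma = {sigma_hat:.2f}")
```

The same pattern -- define an objective, hand it to an optimizer -- also underlies fitting neural networks and support vector machines.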

Applied.

I consider analytics an applied degree, much as engineering is an applied degree. Engineering applies math and science to solve problems; analytics is similar in this way. One hallmark of applied fields is that they are where the rubber of theory meets the road of reality. Data is not always normally distributed. In fact, data is not always valid or even consistent. Formal education offers rigor in developing strong foundational knowledge and skills. However, just as important are the skills to deal with reality. It is no myth that 80% of analytics is just pre-processing the data; I call it dealing with reality. It is important to understand the theory behind the models, and frankly, it is pretty fun to indulge in the intricacies of machine learning and convex optimization. In the end, though, those things have been made relatively straightforward to implement with computers. What hasn't (yet) been nicely encapsulated in software is the judgment and skill required to handle the ugliness of real-world data. You know what else is reality? Teammates, communication, and project management constraints. All this is to say that much of an analytics education covers areas other than theory, and I would argue that the success of many analytics endeavors is limited not by theoretical knowledge but by the practicalities of implementation, whether with data, machines, or people. My personal recommendation to aspiring or budding data geeks is to cut your teeth as much as possible on dealing with reality. Do projects. As many of them as possible. With real data. And real stakeholders. And for those of you who are manager types, give it a try; it will give you the empathy and perspective to work effectively with hardcore data scientists and manage the analytics process.

Working with complexity and ambiguity.

The funny thing about data is that you have problems both when you have too little of it and when you have too much. With too little data, you are often making inferences and assessing the confidence of those inferences. With too much data, you are trying not to get confused. In the best-case scenario, your objectives in mining the data are straightforward and crystal clear. Often, however, that is not the case, and exploration is required. Navigating this process of exploration and value discovery can be complex and ambiguous. There are the questions of "where do I start?" and "how far do I go?" This really speaks to the art of working with data. You pick up best practices along the way and develop some of your own. Initial exploration tactics may be as simple as profiling all attributes and computing correlations among a few of them to see if anything looks promising or sticks. This process is further complicated by "big data," where computation time is non-negligible and slows the feedback loop during any kind of exploratory data analysis.
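As a small, hypothetical illustration of that first-pass tactic (not from the original post), the sketch below profiles every attribute of a dataset and then scans pairwise correlations for anything promising. The file name and the use of pandas are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Load a hypothetical dataset for a first look (file name is illustrative).
df = pd.read_csv("customers.csv")

# Profile all attributes: summary statistics, types, and missing values.
print(df.describe(include="all"))
print(df.dtypes)
print(df.isna().sum())

# Pairwise correlations among numeric attributes; mask the diagonal and
# list the strongest relationships to see if anything looks promising.
corr = df.corr(numeric_only=True)
mask = ~np.eye(len(corr), dtype=bool)
strongest = corr.where(mask).abs().unstack().dropna().sort_values(ascending=False)
print(strongest.head(10))
```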

You can search the web for all kinds of advice on skills to develop for a data career. The few tidbits I include above are just my perspectives on some of the higher order bits in developing solid data skills. Advanced degree programs offer compelling environments to build these skills and gain exposure in an efficient way, including a professional network, resources, and opportunities. However, it is not the only way. As with all professional endeavors, one needs to assess his or her goals, background, and situation to ultimately determine the educational path that makes sense.

References:

[1] James Manyika, Michael Chui, Brad Brown, Jacques Bughin, Richard Dobbs, Charles Roxburgh, Angela Hung Byers. “Big Data: The Next Frontier for Innovation, Competition, and Productivity.” McKinsey Global Institute. June 2011.

[2] Thomas H. Davenport and D.J. Patil. “Data Scientist: The Sexiest Job of the 21st Century.” Harvard Business Review. October 2012.

[3] Quentin Hardy. "What Are the Odds That Stats Would Be This Popular?" The New York Times. January 26, 2012. 

[4] Patrick Thibodeau. “Career alert: A Master of analytics degree is the ticket – if you can get into class”. Computerworld. April 24, 2014. 

[5] Drew Conway. “The Data Science Venn Diagram.” 

[6] Kristin P. Bennett, Emilio Parrado-Hernandez. “The Interplay of Optimization and Machine Learning Research.” Journal of Machine Learning Research 7. 2006. 

[7] Mousumi Ghosh. “7 Key Skills of Effective Data Scientists.” Data Science Central. March 14, 2014.

[8] Anmol Rajpurohit. “Is Data Scientist the right career path for you? Candid advice.” KDnuggets. March 27, 2014. 

 

Selling Data Science: Validation

We are all familiar with the phrase "We can't see the forest for the trees," and this certainly applies to us as data scientists.  We can become so involved with what we're doing, what we're building, and the details of our work that we don't know what our work looks like to other people.  Often we want others to understand just how hard it was to do what we've done, just how much work went into it, and sometimes we're vain enough to want people to know just how smart we are.

So what do we do?  How do we validate one action over another?  Do we build the trees so others can see the forest?  Must others know the details to validate what we've built, or is it enough that they can make use of our work?

We are all made equal by our limit of 24 hours in a day, and we must choose what we listen to and what we don't, what we focus on and what we don't.  The people who make use of our work must do the same.  There is an old philosophical thought experiment that asks, "If a tree falls in the woods and no one is around to hear it, does it make a sound?"  If we explain all the details of our work and no one gives the time to listen, will anyone understand?  To what will people give their time?

Let's suppose that we can successfully communicate all the challenges we faced and overcame in building our magnificent ideas (as if anyone would sit still that long) -- what then?  Thomas Edison is famous for saying, "I have not failed. I've just found 10,000 ways that won't work."  But today we buy lightbulbs that work; who remembers all the details of the different ways he failed?  "It may be important for people who are studying the thermodynamic effects of electrical currents through materials."  Fine -- it's important for that person to know the difference, but for the rest of us it's still not important.  We experiment, we fail, we overcome, thereby validating our work so that others don't have to.

Better to teach a man to fish than to provide for him forever, but there are an infinite number of ways to fish successfully.  Some approaches may be nuanced in their differences, but others may be so wildly different that they're unrecognizable, unbelievable, and invite incredulity.  The catch (no pun intended) is that methods are valid because they yield measurable results.

It's important to catch fish, but success is neither consistent nor guaranteed, so groups of people may fish together and share their bounty so everyone is fed.  What if someone starts using an unrecognizable and unbelievable method of fishing?  Will the others accept this "risk" and share their fish with someone who won't use the "right" fishing technique -- their technique?  Even if it works the first time, that may simply be a fluke, they say, and we certainly can't waste any more resources "risking" hungry bellies, now can we?

So does validation lie in the method or the results?  If you're going hungry you might try a new technique, or you might have faith in what's worked until the bitter end.  If a few people can catch plenty of fish for the rest, let the others experiment.  Maybe you're better at making boats, so both you and the fishermen prosper.  Perhaps there's someone else willing to share the risk because they see your vision, your combined efforts giving you both a better chance at validation.

If we go along with what others are comfortable with, they'll provide fish.  If we have enough fish for a while, we can experiment and potentially catch more fish in the long run.  Others may see the value in our experiments and provide us fish for a while until we start catching fish.  In the end you need fish, and if others aren't willing to give you fish you have to get your own fish, whatever method yields results.

Data Science MD Discusses Health IT at Shady Grove

For its June meetup, Data Science MD explored a new venue, The Universities at Shady Grove. And what better way to venture into Montgomery County than to spend an evening discussing one of its leading sectors. That's right, an event all about healthcare. And we tackled it from two different sides.

http://www.youtube.com/playlist?v=PLgqwinaq-u-Ob2qS9Rt8uCXeKtT830A7e

The night started with a presentation from Gavin O'Brien of NIST's National Cybersecurity Center of Excellence. He spoke about creating a secure mobile health IT platform that would allow doctors and nurses to share relevant pieces of information in a manner that is secure and follows all of the guidelines and policies governing how health data must be handled. Gavin's presentation focused on securing 802.11 links, as opposed to cellular or other wireless links, since this is a good first step and is immediately practical when deployed within a single building such as a hospital. Gavin discussed the technological challenges, from encrypting data during transmission, rather than sending it in the clear where it can be intercepted, to creating Access Control Lists so that only the correct people see a patient's data. As his talk progressed, one thought was constantly in the back of my mind: how can this architecture protect the data as the policies stipulate while still allowing the data to be distributed so that analytics can be run on it? For instance, a hospital should be interested in trends among patients in its care: whether patients had complications after receiving the same family of drugs or a specific drug (perhaps from the same batch), when patients have the most problems and therefore require the most attention, and when a bacteria or virus may be loose in the hospital, further complicating patients' ailments. The architecture may allow these types of analytics, but they were not specifically discussed during Gavin's presentation. If you have any ideas about how a compliant architecture could support these analytics, or about potential obstacles to running them, please leave a comment on this post.

The final speaker of the night was Uma Ahluwalia, the Director of Health and Human Services for Montgomery County. Uma spoke about the various avenues county residents have for reporting problems, and how their needs often cross many different segments of health and human services, which usually requires their stories to be retold each time. In her vision, a resident or patient could report their problem to any one of six segments, and then all of the segments could see the information without the patient having to reiterate their story over and over again. One big problem with this solution is that data would be shared across many groups, giving county workers access to more information than health regulations allow. However, Montgomery County sees each segment as part of one organization, and therefore the data can be shared internally among all employees within that organization. While this should help reduce the amount of time patients spend retelling their stories, it still does not provide an open platform for data scientists. Uma also had a potential solution to that problem: volunteers. Volunteers can sign non-disclosure agreements allowing them to see patient data and help create useful analytics, thereby opening the problem space to many more minds in the hope of creating truly revolutionary analytics. Perhaps you will be the next great mind to unlock the meaning behind a current social issue.

Finally, Data Science MD needs to acknowledge a few key people and groups that contributed to this meetup. Mary Lowe and Melissa Marquez from the Universities at Shady Grove were instrumental in making this happen, helping to secure the room and providing the food and A/V setup. Dan Hoffman, the Chief Innovation Officer for Montgomery County, also provided a great deal of support. Finally, John Rumble, a DSMD member, took the lead in getting DSMD beyond the Baltimore/Columbia corridor. Thanks so much to all of these key people.

If you want to catch up on previous meetups, please check out our YouTube channel.

Please check out our July meetup, where we will discuss analysis techniques in Python and R at Betamore.

A Better Mathematical Model of Viral Marketing

This is the second part of a four part series of blog posts on viral marketing. In part 1, I discuss the faulty assumptions in the current models of viral marketing. In part 3, I show the weird dynamics of viral marketing in a growing market. In part 4, I'll discuss the effects of returning customers.

Current models of viral marketing for the business community rely on faulty assumptions. As a result, these models fail to reflect real world examples.

So how can the business community build a more realistic model of viral marketing? How do you know which factors (e.g. viral coefficient, time scale, churn, market size) are most important? Fortunately, there is a rich history of literature on mathematical models for viral growth (and decline), dating all the way back to 1927. These models rigorously treat viral spread, churn, market size, and even the change in the market size and the possibility that former customers return. Obviously, nobody was thinking of making a YouTube video or an iPhone app go viral back when phones didn't even have rotary dials. These models are of the viral spread of … viruses!

The Model:

The classic SIR model of the spread of disease comes from Kermack and McKendrick. (Sorry I couldn’t link to the original paper. You can buy it for $54.50 here -- blame the academic publishing industry.) I’ve applied this model to viral marketing by drawing analogies between a disease and a product. The desired outcomes are very different, but the math is the same.

Kermack and McKendrick divide the total population of the market, [latex]N[/latex], into three subpopulations.

  • [latex]S[/latex] - The number of people susceptible to the disease (potential customers)
  • [latex]I[/latex] - The number of people who are infected with the disease (current customers)
  • [latex]R[/latex] - The number of people who have recovered from the disease (former customers).

These three subpopulations change in number over time. Potential customers become current customers as a result of successful invitations. Current customers become former customers if they decide to stop using the product. For simplicity, I’ll treat the total market size, [latex]N = S + I + R[/latex], as static and former customers as immune (for now). The parameters that govern the spread of the disease are:

  • [latex]β[/latex] - The infection rate (sharing rate)
  • [latex]γ[/latex] - The recovery rate (churn rate)

Assume that current customers, [latex]I[/latex], and potential customers, [latex]S[/latex], communicate with each other at an average rate that is proportional to their numbers (as governed by the Law of Mass Action). This gives [latex]βSI[/latex] as the number of new customers, per unit time, due to word of mouth or online sharing. As the number of new customers grows by [latex]βSI[/latex], the number of potential customers shrinks by the same number. This plays the same role that the “viral coefficient” does in Skok’s model, but accounts for the fact that conversion rates on sharing slow down when the fraction of people who have already tried the product gets large. It also does away with the concept of "cycle time". Instead, it accounts for the average time it takes to share something and the average frequency at which people share by putting a unit of time into the denominator of [latex]β[/latex]. Thus, [latex]β[/latex] represents the number of successful invitations per current customer per potential customer per unit time (i.e. hour, day, week). I propose that this is a more robust definition of viral coefficient than the one used by Ries and Skok because modeling viral sharing as an average rate accounts for the following realities:

  • Customers do not share in synchronous batches.
  • Each user has a different timeframe for trying a product, learning to love it, and sharing it with friends. Rather than assuming that they all have the same cycle time, [latex]β[/latex] represents an average rate of sharing.
  • Users might invite others when first trying a product or after they’ve used it for quite a while.

In this model, current customers become former customers at a rate defined by the parameter [latex]γ[/latex]. That is, [latex]γ[/latex] is the fraction of current customers who become former customers in a unit of time. It has the dimensions of inverse time [latex](1/t)[/latex], and [latex]1/γ[/latex] represents the average time a user remains a user. So, if [latex]γ = 1\%[/latex] of users lost per day, then the average length of time a user remains active is 100 days.

The differential equations governing viral spread are:

  • [latex]dS/dt = -βSI[/latex]
  • [latex]dI/dt = βSI - γI[/latex]
  • [latex]dR/dt = γI[/latex]
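As a minimal sketch (mine, not the author's), the system above translates directly into a Python function that returns the three derivatives; any standard ODE integrator can then evolve it forward in time.

```python
def sir_rates(t, y, beta, gamma):
    """Right-hand side of the SIR-style viral marketing model.

    y = (S, I, R): potential, current, and former customers.
    beta: successful invitations per current customer per potential customer per unit time.
    gamma: fraction of current customers lost per unit time (churn).
    """
    S, I, R = y
    dS = -beta * S * I                 # potential customers converted by sharing
    dI = beta * S * I - gamma * I      # conversions minus churn
    dR = gamma * I                     # churned (former) customers
    return [dS, dI, dR]

# Example: rates at launch for a 1,000,000-person market with 10 initial customers.
print(sir_rates(0.0, [999_990.0, 10.0, 0.0], beta=1e-5, gamma=0.5))
```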

Examining the Equations:

These are non-linear differential equations that cannot be solved to produce convenient, insight-yielding formulas for [latex]S(t)[/latex], [latex]I(t)[/latex], and [latex]R(t)[/latex]. What they lack in convenient formulas, they make up for with more interesting dynamics (especially when considering changing market sizes and returning customers). You can still learn a lot by examining them and integrating them numerically. Let’s assume that [latex]t=0[/latex] represents the launch of a new product. Initially, at least the founding team uses the product; they represent the initial customer base, [latex]I(0)[/latex]. The initial number of former customers, [latex]R(0)[/latex], is zero, and the rest of the people in the market are potential customers, [latex]S(0)[/latex].

The first thing to note is that there will be a growing customer base [latex](dI/dt > 0)[/latex] as long as:

[latex]βS/γ > 1[/latex]

This follows from factoring the second equation: [latex]dI/dt = I(βS - γ)[/latex] is positive exactly when [latex]βS > γ[/latex]. That is, viral growth will occur as long as the addressable market size, [latex]S(0)[/latex], and sharing rate, [latex]β[/latex], are sufficiently large compared to the churn, [latex]γ[/latex]. This model shows that with a big enough market, you can go viral even with a small [latex]β[/latex], as long as your churn is also small enough (consistent with the Pinterest example described in part 1). This model also shows that the effects of churn cannot be ignored, even in very early viral growth.

If at [latex]t=0[/latex], [latex]S[/latex] is very close to [latex]N[/latex], then [latex]βS/γ[/latex] is approximately [latex]βN/γ[/latex]. Thus, if [latex]βN/γ > 1[/latex], initial growth will occur and if [latex]βN/γ < 1[/latex], the customer base will not grow. This is sometimes called the “basic reproductive number” in epidemiology literature. It is essentially what Eric Ries calls the “viral coefficient” although it depends on market size and churn as well as the viral sharing rate. It is approximately the average number of new customers each early customer will invite during the entire time that they remain a customer, which is [latex]1/γ[/latex]. However, in the case that viral growth does occur, [latex]βN/γ[/latex] rapidly ceases to represent the number of customers that each customer invites.

Another thing you can see by examining the equations is that if you ignore the change in the market size (an approximation that makes sense for short-lived virality, such as with a YouTube video), the customer base always goes to zero at long times unless you have zero churn. The number of current customers peaks when [latex]dI/dt = 0[/latex], which happens once the pool of potential customers has been depleted to [latex]S = γ/β[/latex]; after that, the rate of change in the number of current customers becomes negative and the number of customers eventually reaches zero. This is consistent with the data provided in the Mashable post on the half-lives of Twitter vs. YouTube content. Again, note the key role that churn plays in determining the peak number of customers.

Examples:

We can gain more insight from these equations by numerically integrating them. For these examples, the unit of time used to define [latex]β[/latex] and [latex]γ[/latex] is one day, though the choice is arbitrary. I’ve given values of [latex]β[/latex] as [latex]βN[/latex] to create better correspondence with Ries’ concept of viral coefficient: if at [latex]t=0[/latex], [latex]S(0)[/latex] is approximately [latex]N[/latex], then [latex]βN[/latex] is approximately the number of new customers each existing customer begets per day.

With the parameters:

  • [latex]N = 1[/latex] million people in the market
  • [latex]βN = 10[/latex] invites per current user per day
  • [latex]γ = 50\%[/latex] of customers lost per day
  • [latex]I(0) = 10[/latex] current customers

numerically integrating the equations given above shows how the number of customers changes over the first 30 days: traffic quickly spikes and then dies down as people tire of looking at it, a pattern similar to that of a popular Twitter link. (In the case of visiting a webpage, a “customer” can be defined as a visitor.) The sketch below reproduces this integration.
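Here is a minimal sketch of that numerical integration in Python, assuming scipy is available. It uses the parameters listed above and prints the basic reproductive number [latex]βN/γ[/latex] and the peak number of current customers; it illustrates the approach rather than reproducing the original post's code or plots.

```python
import numpy as np
from scipy.integrate import solve_ivp

# Parameters from the example above (time unit: days).
N = 1_000_000            # market size
beta = 10.0 / N          # betaN = 10 invites per current user per day
gamma = 0.5              # 50% of customers lost per day
I0, Rec0 = 10.0, 0.0     # initial current and former customers
S0 = N - I0 - Rec0       # everyone else is a potential customer

def sir_rates(t, y):
    S, I, R = y
    return [-beta * S * I, beta * S * I - gamma * I, gamma * I]

# Integrate the model over the first 30 days.
t_eval = np.linspace(0, 30, 301)
sol = solve_ivp(sir_rates, (0, 30), [S0, I0, Rec0], t_eval=t_eval)
S, I, R = sol.y

print(f"Basic reproductive number betaN/gamma = {beta * N / gamma:.1f}")
print(f"Peak of {I.max():,.0f} current customers around day {t_eval[I.argmax()]:.1f}")
# Setting gamma = 0.01 and integrating over 300 days reproduces the lower-churn example below.
```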

For a smaller churn rate, [latex]γ = 1\%[/latex] of customers lost per day, the growth and decline in the number of customers plays out over roughly 300 days. This shows how, even for low values of churn, without new potential customers joining the market or former customers returning, the customer base always diminishes after reaching its peak. Also note how a smaller churn rate allows us to reach a higher peak in traffic.

So how can viral growth be sustained? For that, you need to consider how the change in the market size affects viral marketing, which I’ll examine in part 3.

(For another fun example of how to apply the SIR model, see my post on the Mathematics of the Walking Dead.)

TLDR: A better definition of “viral coefficient” is successful invitations per existing user per potential user per unit time. But market size and customer churn are just as important as, if not more important than, the viral coefficient. Viral growth in a static market is unsustainable unless you have absolutely zero churn.


4 Major Mistakes in the Current Understanding of Viral Marketing


This is the first part of a four part series of blog posts on viral marketing. In part 2, I present a better mathematical model of viral marketing. In part 3, I show the weird dynamics of viral marketing in a growing market. In part 4, I'll discuss the effects of returning customers.

Viral marketing is arguably the most sought-after engine of growth due to its potential to drive explosive increases in the number of customers at little or no cost. There has been tremendous interest in understanding virality in marketing and product development, particularly with the advent of the social web. Unfortunately, efforts to build mathematical models for the business community have not been successful in reflecting reality and offering insights into which factors matter most in achieving viral growth.

In The Lean Startup, Eric Ries defines the viral coefficient as “how many new customers will use the product as a consequence of each new customer who signs up” and declares that a viral coefficient greater than 1 will lead to exponential growth, while a viral coefficient less than 1 leads to hardly any growth at all. However, his treatment of the viral coefficient makes no mention of a timescale. Is it the number of new customers an existing customer brings in immediately upon signing up? Or within a day? Or within the entire time that they are using the product?

At ForEntrepreneurs.com, David Skok introduces the concept of a “cycle time” -- the total time it takes to try a product and share it with friends. In doing so, he correctly notes the importance of a timescale as a factor in achieving viral growth. In fact, he declares it to be even more important than the viral coefficient. He first models the accumulation of users in a spreadsheet and then, with help from Kevin Lawler, derives a formula for viral growth:

[latex]C(t)=C(0) (K^{1+t/c_t} - 1) / (K-1)[/latex]

where [latex]C(t)[/latex] represents the number of customers at time [latex]t[/latex], [latex]K[/latex] represents the viral coefficient, and [latex]c_t[/latex] represents the cycle time.
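To see the behavior Skok's formula implies, here is a small sketch of it in Python; the parameter values (K = 1.2, a three-day cycle time, 10 initial customers) are purely illustrative assumptions and are not taken from Skok's post.

```python
def skok_customers(t, c0=10.0, K=1.2, cycle_time=3.0):
    """Customers at time t under Skok's viral growth formula:
    C(t) = C(0) * (K**(1 + t/c_t) - 1) / (K - 1).

    c0: initial customers, K: viral coefficient, cycle_time: days per sharing cycle.
    (K must not equal 1; the values here are hypothetical.)
    """
    return c0 * (K ** (1 + t / cycle_time) - 1) / (K - 1)

# Illustrative growth curve over 30 days with the hypothetical parameters above.
for day in range(0, 31, 5):
    print(day, round(skok_customers(day)))
```

With [latex]K > 1[/latex] the curve grows without bound, which already hints at the infinite-market assumption examined below.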

This model depends on the following assumptions, which I’ll address in turn:

  1. The market is infinite.
  2. There is no churn in the customer base -- once a customer, always a customer.
  3. Customers send invites shortly after trying the product, if at all, and never again.
  4. Every customer has the same cycle time and the cycles all happen in unison.

Infinite Market Size:

Since viral growth can be so explosive, the market for a product can become saturated very quickly. As the market becomes saturated, fewer potential customers will respond to invitations, effectively reducing the “viral coefficient” (as it is defined by Ries and Skok). Since market saturation could occur in a matter of days or weeks, we cannot ignore the effect of a finite market size.

Customer Churn:

Neither Ries’ nor Skok’s models account for churn in customers -- the rate at which customers stop using the product. Eric Ries treats the concept of churn in his discussion of another engine of growth which he calls  “Sticky Marketing” and suggests that startups concentrate on only one engine of growth at a time. While the advice for startups to focus on one engine of growth at a time may be sound, it does not justify leaving this very real effect out of the equations.

When Customers Send Invites:

Skok’s model depends on the assumption that each new customer sends invitations shortly after trying the product and then never again. While this may be true for some products, it’s likely that the pattern of sharing depends on the nature of the product and often occurs long after the user has had a chance to try the product and grow to love it.

Customer Cycle Times:

How users beget more users via invitations, leading to a compounding user base, is the essence of viral growth. In Skok’s model, users receive invites, try the product, love the product, and invite a batch of new users in synchronous cycles. These uniform cycles correspond conveniently to the columns of a spreadsheet. But in reality, different users take different amounts of time to progress from trying a new product to inviting their friends. The number of users of a product doesn't compound at finite, regular intervals like bank interest. Instead, it ramps up customer by customer. Customers have a distribution of cycle times, which are not synchronous but staggered randomly.

By modeling viral growth as batches of invitations that happen in tandem, Skok effectively conflates the timescale of “try it, love it, share it” with a compounding interval. That is, Skok’s concept of a “cycle time” represents two unrelated timescales:

  1. The total time between trying a product for the first time and inviting a batch of friends to try it.
  2. The finite intervals at which the user base compounds.

The reason Skok’s “cycle time” has such a large effect is that it represents a compounding interval, not that it represents how quickly users go from trying a product to sharing it with friends. Furthermore, this compounding interval is an artifact of the assumption that current users invite new users in synchronous cycles. Thus, the conclusion that “cycle time” is the most important factor in achieving viral growth is an artifact of the faulty "synchronous cycle" assumption in this model.

The assumption that the number of users compounds at finite, regular intervals is further muddled by the derivation of a continuous formula. If current users only invite new users in a batch at the end of each cycle, then how can the number of users ramp up continuously during the cycle? Continuous, as opposed to stepwise, viral growth in usership is only possible if sharing is happening constantly, at an average rate. To model this, we need a totally new definition of the viral coefficient that includes a timescale in the denominator -- the average number of invitations per unit time.

How this Model Breaks Down:

Some real-world examples contradict the models developed by Ries and Skok. A post at TechCrunch regarding the growth of Pinterest shows that a customer base can grow even with a smallish viral coefficient as long as the churn rate is low. A post from Mashable compares the half-lives of various sharing methods, showing that even content shared via methods with longer half-lives gets very little traffic once everyone has seen it a few times.

So how can the business community build a more realistic model of viral marketing? In part 2, I present a better mathematical model for viral marketing that uses a better definition of viral coefficient and takes into account churn, finite market size, and continuous sharing.

TLDR: Skok's conclusion that "cycle time" is the most important factor in viral marketing is an artifact of faulty assumptions. A better model of viral marketing requires redefining the "viral coefficient".
