Data Community DC and District Data Labs are hosting a full-day Analyzing Social Media Data with R workshop on Saturday January 24th. For more info and to sign up, go to http://bit.ly/1FJjFIz. Register before January 9th for an early bird discount!

# Win Free eCopies of Social Media Mining with R

*This is a sponsored post by **Richard Heimann**. Rich is Chief Data Scientist at **L-3 NSS** and recently published *Social Media Mining with R* (Packt Publishing, 2014) with co-author **Nathan Danneman**, also a Data Scientist at L-3 NSS Data Tactics. Nathan has been featured at recent **Data Science DC** and **DC NLP** meetups.*
Nathan Danneman and Richard Heimann have teamed up with DC2 to organize a giveaway of their new book, *Social Media Mining with R*.

Over the next two weeks, five lucky winners will win a digital copy of the book. Please keep reading to find out how you can be one of the winners and learn more about *Social Media Mining with R*.

Overview: Social Media Mining with R

*Social Media Mining with R* is a concise, hands-on guide with several practical examples of social media data mining and a detailed treatise on inference and social science research that will help you in mining data in the real world.

Whether you are an undergraduate who wishes to get hands-on experience working with social data from the Web, a practitioner wishing to expand your competencies and learn unsupervised sentiment analysis, or you are simply interested in social data analysis, this book will prove to be an essential asset. No previous experience with R or statistics is required, though having knowledge of both will enrich your experience. Readers will:

- Learn the basics of R and all the data types
- Explore the vast expanse of social science research
- Discover more about data potential, the pitfalls, and inferential gotchas
- Gain an insight into the concepts of supervised and unsupervised learning
- Familiarize yourself with visualization and some cognitive pitfalls
- Delve into exploratory data analysis
- Understand the minute details of sentiment analysis

How to Enter?

All you need to do is share your favorite effort in social media mining or more broadly in text analysis and natural language processing in the comments section of this blog. This can be some analytical output, a seminal white paper or an interesting commercial or open source package! In this way, there are no losers as we will all learn.

The first five commenters will win a free copy of the eBook. (DC2 board members and staff are not eligible to win.) Share your public social media accounts (about.me, Twitter, LinkedIn, etc.) in your comment, or email media@datacommunitydc.org after posting.

# Analyzing Social Media Networks using NodeXL

*This is a guest post from Marc Smith, Chief Social Scientist at Connected Action Consulting Group, and a developer of NodeXL, an Excel-based system for (social) network analysis. Marc will be leading a workshop on NodeXL, offered through Data Community DC, on Wednesday, November 13th. If the below piques your fancy, please register. Parts of this post appeared first on connectedaction.net.*

I am excited to have the opportunity to present a NodeXL workshop with Data Community DC on November 13th at 6pm in Washington, D.C.

In this session I will describe the ways NodeXL can simplify the process of collecting, storing, analyzing, visualizing and publishing reports about connected structures. NodeXL supports the exploration of social media with import features that pull data from personal email indexes on the desktop, Twitter, Flickr, YouTube, Facebook and WWW hyperlinks. NodeXL allows non-programmers to quickly generate useful network statistics and metrics and create visualizations of network graphs. Filtering and display attributes can be used to highlight important structures in the network. Innovative automated layouts make creating quality network visualizations simple and quick.

For example, this map of the connections among the people who recently tweeted about the DataCommunityDC Twitter account was created with just a few clicks and no coding:

This graph represents a network of 67 Twitter users whose recent tweets contained “DataCommunityDC”, taken from a data set limited to a maximum of 10,000 tweets. The network was obtained from Twitter on Tuesday, 05 November 2013 at 15:15 UTC. The tweets in the network were tweeted over the 7-day, 16-hour, 4-minute period from Monday, 28 October 2013 at 22:38 UTC to Tuesday, 05 November 2013 at 14:42 UTC. There is an edge for each “replies-to” relationship in a tweet. There is an edge for each “mentions” relationship in a tweet. There is a self-loop edge for each tweet that is not a “replies-to” or “mentions”.

The network has been segmented into groups (“G1, G2, G3…”) and each group is labeled with the words most frequently used in the tweets from the people in that group. The size of each Twitter user’s profile picture represents the log scaled value of their follower count.
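
NodeXL builds this graph without any code, but the edge rules above are simple enough to sketch in a few lines of Python for readers who want to see them spelled out. The tweet records, field names, and the `log10(1 + followers)` scaling below are all illustrative assumptions; they are not NodeXL's internal format or Twitter's API schema.

```python
import math

# Hypothetical tweet records; field names are illustrative only.
tweets = [
    {"author": "alice", "reply_to": "bob", "mentions": ["carol"]},
    {"author": "bob", "reply_to": None, "mentions": ["alice"]},
    {"author": "carol", "reply_to": None, "mentions": []},
]


def build_edges(tweets):
    """Return (source, target, relationship) tuples using the three rules above."""
    edges = []
    for tweet in tweets:
        author = tweet["author"]
        linked = False
        if tweet.get("reply_to"):
            # One edge for each "replies-to" relationship.
            edges.append((author, tweet["reply_to"], "replies-to"))
            linked = True
        for name in tweet.get("mentions", []):
            # One edge for each "mentions" relationship.
            edges.append((author, name, "mentions"))
            linked = True
        if not linked:
            # A tweet with neither relationship becomes a self-loop.
            edges.append((author, author, "tweet"))
    return edges


def picture_size(follower_count):
    """Log-scaled size; the base and the +1 offset are assumptions."""
    return math.log10(follower_count + 1)


edges = build_edges(tweets)
```

With the sample data, `edges` holds one "replies-to" edge, two "mentions" edges, and one self-loop for the tweet that addresses nobody.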

Analysis of the network location of each participant reveals the people in key locations in the network, people at the “center” of the graph:

- @datacommunitydc
- @datafest2013
- @gabosama
- @harlanh
- @mlh_holmes
- @terebouza
- @greglinch
- @intridea
- @gilpress
- @katiestriff

For more examples, please see the NodeXL Graph Gallery at: http://nodexlgraphgallery.org/Pages/Default.aspx

# Uncovering Hidden Social Information Generates Quite a Buzz

*We are pleased to have community member Greg Toth present this event review. Greg is a consultant and entrepreneur in the Washington DC area. As a consultant, he helps clients design and build large-scale information systems, process and analyze data, and solve business and technical problems. As an entrepreneur, he connects the dots between what’s possible and what’s needed, and brings people together to pursue new business opportunities. Greg is the president of Tricarta Corporation and the CTO of EIC Data Systems, Inc.*
The March 2013 meetup of Data Science DC generated quite a buzz! Well over a hundred data scientists and practitioners gathered in Chevy Chase to hear Prof. Jennifer Golbeck from the Univ. of Maryland give a very interesting – and at times somewhat startling – talk about how hidden information can be uncovered from people’s online social media activities.

Prof. Golbeck develops methods for discovering things about people online. She opened her talk with a brief example of how bees reveal specific information to their hive’s social network through the characteristics of their “waggle dance.” The figure eight patterns of the waggle dance convey distance and direction to pollen sources and water to the rest of the hive – which is a large social network.

**Facebook Information Sharing**

From there the discussion turned to how Facebook’s information sharing defaults have evolved from 2005 through 2010. In 2005, Facebook’s default settings shared a relatively narrow set of your personal data with friends and other Facebook users. At this point none of your information was – **by default** – shared with the entire Internet.

In subsequent years the default settings changed each year, sharing more and more information with a wider and wider audience. By 2009, several pieces of your information were being shared openly with anyone on the Internet unless you had changed the default settings. By 2010 the default settings were sharing significant amounts of information with a large swath of other people, including people you don’t even know.

The Facebook sharing information Prof. Golbeck described came from Matt McKeon’s work, which can be found here: http://mattmckeon.com/facebook-privacy/

This ever-increasing amount of shared information has opened up new avenues for people to find out things about you, and many people may be shocked at what's possible. Prof. Golbeck gave a live demonstration of a web site called Take This Lollipop, using her own Facebook account. I won’t spoil things by telling you what it does, but suffice to say it was quite startling. If this piques your interest, check out www.takethislollipop.com

**Predicting Personality Traits**

From there the discussion shifted to a research project intended to determine whether it's possible to predict people's personality traits by analyzing what they put on social media. First, a group of research participants were asked to identify their core personality traits by going through a standardized psychological evaluation. The Big Five factors that they measured are *openness*, *conscientiousness*, *extraversion*, *agreeableness*, and *neuroticism.*

Next the research team gathered information from these people’s Facebook and Twitter accounts, including language features (e.g. words they use in posts), personal information, activities and preferences, internal Facebook stats, and other factors. Tweets were processed in an application called LIWC, which stands for Linguistic Inquiry and Word Count. LIWC is a text analysis program that examines a piece of text and the individual words it contains, and computes numeric values for positive and negative emotions as well as several other factors.

The data gathered from Twitter and Facebook was fed into a personality prediction algorithm developed by the research team and implemented using the Weka machine learning toolkit. Predicted personality trait values from the algorithm were compared to the original Big Five assessment results to evaluate how well the prediction model performed. Overall, the difference between predicted and measured personality traits was roughly 10 to 12% for Facebook (considered very good) and roughly 12 to 18% for Twitter (not quite as good). The overall conclusion was that *yes, it is possible to predict personality traits by analyzing what people put on social media.*

**Predicting Political Preferences**

The second research project was about computing political preference in Twitter audiences. Originally this project started with the intention of looking at the Twitter feeds of news media outlets and trying to predict media bias. However, the topic of media bias in general was deemed too problematic and controversial and they decided instead to focus on predicting the political preferences of the *media audiences*.

The objective was to come up with a method for computing the political orientation of people who followed popular news media outlets on Twitter. To do this, the team computed the political preference of about 1 million Twitter users by finding which Congresspeople they followed on Twitter, and looking at the liberal to conservative ratings of those Congresspeople. A key assumption was that people's political preferences will, on average, reflect those of the Congresspeople they follow.

From there, the team looked at 20 different Twitter news outlets and identified who followed each one. The political preferences of each media outlet's followers were composited together to compute an overall audience political preference factor ranging from heavily conservative to heavily liberal at the two extremes, with moderate ranges in the middle. The results showed that Fox News had the most conservative audience, NPR Morning Edition had the most liberal audience, and Good Morning America was in the middle with a balanced mix of both conservative and liberal followers. Further details on the results can be found in the paper here.
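
The two-step computation described above (score each user from the Congresspeople they follow, then combine the scores of each outlet's followers) can be sketched as follows. This is only an illustration: the rating scale, the account names, and the use of a plain average at both steps are assumptions, since the talk summary does not say exactly how the scores were composited.

```python
from statistics import mean

# Hypothetical ratings: -1.0 = most liberal, +1.0 = most conservative (assumed scale).
congress_rating = {"rep_a": -0.8, "rep_b": -0.2, "rep_c": 0.6, "rep_d": 0.9}

# Hypothetical follow data: which Congresspeople each user follows.
follows_congress = {
    "user1": ["rep_a", "rep_b"],
    "user2": ["rep_c", "rep_d"],
    "user3": ["rep_b", "rep_c"],
    "user4": [],  # follows no Congresspeople, so cannot be scored
}

# Hypothetical follower lists for two news outlets.
outlet_followers = {
    "outlet_x": ["user1", "user3"],
    "outlet_y": ["user2", "user3", "user4"],
}


def user_preference(user):
    """Average rating of the Congresspeople a user follows, or None if there are none."""
    ratings = [
        congress_rating[rep]
        for rep in follows_congress.get(user, [])
        if rep in congress_rating
    ]
    return mean(ratings) if ratings else None


def audience_preference(outlet):
    """Average preference over an outlet's followers that could be scored."""
    scores = [user_preference(user) for user in outlet_followers[outlet]]
    scores = [score for score in scores if score is not None]
    return mean(scores) if scores else None


audience = {outlet: audience_preference(outlet) for outlet in outlet_followers}
```

Users who follow no rated Congresspeople are simply left out of the outlet average here; a real study would need to decide how to treat them.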

**Summary & Wrap-up**

An awful lot of things about you can be figured out by looking at public information in your social media streams. Personality traits and political preferences are but two examples. Sometimes this information can be used for beneficial purposes, such as showing you useful recommendations. Likewise, a future employer could use this kind of information to form opinions during the hiring process. People don't always think about this (or necessarily even realize what's possible) when they post things to social media.

Overall Prof. Golbeck’s presentation was well received and generated a number of questions and conversations after the talk. The key takeaway was that “We know who you are and what you are thinking” and that information can be used for a variety of purposes – in most cases without you even being aware. The situation was summed up pretty well in one of Prof. Golbeck’s opening slides:

I develop methods for discovering things about people online.

I never want anyone to use those methods on me.

-- Jennifer Golbeck

For those who want to delve deeper, several resources are available:

- Dr. Golbeck's presentation slides and audio from the event (MP3, ~15MB)
- Dr. Golbeck’s home page and research papers
- Dr. Golbeck’s new book titled Analyzing the Social Web
- Univ. of Maryland’s Human-Computer Interaction Laboratory 30th Annual Symposium, May 22-23, 2013
- Dr. Golbeck recommended a recent similar paper by Berkeley researchers that got some press: Private traits and attributes are predictable from digital records of human behavior.

**Commentary**

Overall I found this presentation to be very worthwhile and thought-provoking. Prof. Golbeck was an engaging speaker who was both informative and entertaining. She provided a number of useful references, links and papers for delving deeper into the topics covered. The venue and logistics were great and there were plenty of opportunities for networking and talking with colleagues both before and after the presentation.

The topic of predicting people's traits and behaviors is very relevant, particularly in the realm of politics. At least one other Data Science DC meetup held within the last few months focused on how data science was used in the last presidential election and the tremendous impact it had. That trend is sure to continue, fueled by research like this coupled with the availability of data, more sophisticated tools, and the right kinds of data scientists to connect the dots and put it all together.

If you have the time, I would recommend listening to the audio recording and following along with the slide deck. There were many more interesting details in the talk than I could cover here.

My personal opinion is that too few people realize the data footprint they leave when using social media. That footprint has a long memory and can be used for many purposes, including purposes that haven't even been invented yet. Many people seem to think that either the data they put on social media is trivial and doesn't reveal anything, or think that no-one cares and it's just "personal stuff." But as we've seen in this talk, people can discover a lot more than you may think.

*This post contains affiliate links.*

# The Weird Dynamics of Viral Marketing in a Growing Market

*This is the third part of a four part series of blog posts on viral marketing. In part 1, I discuss the faulty assumptions in the current models of viral marketing. In part 2, I present a better mathematical model of viral marketing. In part 4, I’ll discuss the effects of returning customers.*

If the market is static, strong viral sharing can lead to rapid early growth, but once a peak is reached, the number of customers falls to zero unless your product has 100% retention. So how can a business use viral marketing to grow their customer base for the long term? Which factors (i.e. sharing rate, churn, market size, market growth) matter? In this blog post, I’ll adapt the mathematical model of viral marketing from part 2 of this series to examine how changing market size affects viral growth.

**What is “the Market”:**

What comprises “the market” depends on the nature of the product. If it is an iPhone game, then the market is people who own iPhones and play games on them. If it is a YouTube video of a pug climbing the stairs like a boss, then the market is people with internet connected devices who find that kind of video humorous. For the first example, entering the market could mean that you’ve bought your first iPhone, or started playing games on it. Leaving the market could mean that you’ve stopped playing games or swapped your iPhone for a different type of device. Note that leaving the market is different from becoming a former customer. In the case of an iPhone game, becoming a former customer means that you’ve stopped using the game and perhaps removed it from your phone -- you are still part of the market for iPhone games.

**The Model:**

If new potential customers can be added to the market and members of any subpopulation can leave the market, the parameters are now:

- [latex]\beta[/latex] - The infection rate (sharing rate)
- [latex]\gamma[/latex] - The recovery rate (churn rate)
- [latex]\alpha[/latex] - Birth rate (rate that potential customers are entering the market)
- [latex]\mu[/latex] - Death rate (rate that people are leaving the market)

In part 2, I described how the population transitions from potential customers to current customers to former customers. Here, I’ll add terms to the differential equations to model how people are entering and leaving the market. Note that the total market size, [latex]N = S + I + R[/latex], is no longer constant, but will grow or shrink with time.

Assume that the population entering the market joins as part of the ‘potential customer’ population. The number of new potential customers joining the market per unit time is [latex]\alpha (S + I + R)[/latex]. The populations leaving the market can come from any of the three subpopulations. The numbers of people leaving the potential customer, current customer, and former customer groups per unit time are [latex]\mu S[/latex], [latex]\mu I[/latex], and [latex]\mu R[/latex], respectively.

The equations become:

- [latex]dS/dt = -\beta SI + \alpha (S + I + R) - \mu S[/latex]
- [latex]dI/dt = \beta SI - \gamma I - \mu I[/latex]
- [latex]dR/dt = \gamma I - \mu R[/latex]
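
For readers who want to experiment with these equations, here is a minimal sketch that integrates them with a hand-written fourth-order Runge-Kutta step in plain Python. The function names, the step size, and the parameter values in the example call are illustrative assumptions; in particular, treating the entry and exit rates as yearly figures converted to per-day rates is a guess, not something stated in this post.

```python
def derivatives(state, beta, gamma, alpha, mu):
    """Right-hand sides of the three equations above."""
    S, I, R = state
    N = S + I + R
    dS = -beta * S * I + alpha * N - mu * S
    dI = beta * S * I - gamma * I - mu * I
    dR = gamma * I - mu * R
    return (dS, dI, dR)


def rk4_step(state, dt, *params):
    """Advance (S, I, R) by one classical Runge-Kutta step of size dt."""
    def shifted(base, slope, h):
        return tuple(b + h * s for b, s in zip(base, slope))

    k1 = derivatives(state, *params)
    k2 = derivatives(shifted(state, k1, dt / 2), *params)
    k3 = derivatives(shifted(state, k2, dt / 2), *params)
    k4 = derivatives(shifted(state, k3, dt), *params)
    return tuple(
        s + dt / 6 * (a + 2 * b + 2 * c + d)
        for s, a, b, c, d in zip(state, k1, k2, k3, k4)
    )


def simulate(S0, I0, R0, beta, gamma, alpha, mu, days, dt=0.01):
    """Return a list of (S, I, R) tuples, one per whole day, starting at t = 0."""
    state = (S0, I0, R0)
    history = [state]
    steps_per_day = round(1 / dt)
    for _ in range(days):
        for _ in range(steps_per_day):
            state = rk4_step(state, dt, beta, gamma, alpha, mu)
        history.append(state)
    return history


# Illustrative run. ASSUMPTION: alpha and mu are yearly rates divided by 365.
N0 = 1_000_000
history = simulate(
    S0=N0 - 10, I0=10, R0=0,
    beta=0.8 / N0, gamma=0.7,
    alpha=0.014 / 365, mu=0.008 / 365,
    days=365,
)
```

Each entry of `history` can be plotted against the day index to look at how the three groups evolve; changing one argument at a time shows the effect of that parameter.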

**Examining the Equations:**

Since [latex]N = S + I + R[/latex],

[latex]dN/dt = (\alpha - \mu )N[/latex]

which is an equation that we can solve. Thus

[latex]N(t) = N(0) * e^{(\alpha - \mu ) t}[/latex].

The market grows exponentially if [latex]\alpha > \mu[/latex] and shrinks exponentially if [latex]\alpha < \mu[/latex]. If [latex]\alpha = \mu[/latex], the total market size stays the same with people entering and leaving the market at equal rates.
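
As a sketch of the intermediate step, using only the three equations above, adding them together makes the sharing and churn terms cancel in pairs:

[latex]dN/dt = dS/dt + dI/dt + dR/dt = (-\beta SI + \beta SI) + (-\gamma I + \gamma I) + \alpha (S + I + R) - \mu (S + I + R) = (\alpha - \mu) N[/latex]

Only the entry and exit terms survive, which is why the total market size follows a simple exponential no matter how people move between the three groups.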

We can learn some more about the dynamics by examining where [latex]dS/dt[/latex] and [latex]dI/dt[/latex] are zero (that is, where the potential customer base and current customer base don't change):

[latex]dI/dt = 0[/latex] if [latex]S = \frac{\gamma + \mu}{\beta}[/latex] or [latex]I = 0[/latex]

[latex]dS/dt = 0[/latex] if [latex]S = \frac{\alpha N(t)}{\beta I + \mu}[/latex]

Plotting these lines in the [latex]S[/latex] vs [latex]I[/latex] plane divides it into up to four regions, depending on the relative values of the parameters:

The blue lines represent where [latex]dI/dt = 0[/latex]. The green line represents where [latex]dS/dt = 0[/latex]. Note that since the value of [latex]S[/latex] along the green line is proportional to [latex]N(t)[/latex], the green line will rise or fall with time depending on whether the market is growing or shrinking. The red arrows indicate the general direction in which the [latex]S-I[/latex] trajectory moves while it is in each region. This suggests that if

[latex]S(0) > \frac{\gamma + \mu}{\beta}[/latex]

then the number of customers grows, initially. In real terms, that means that initial growth depends on having a large enough sharing rate and number of potential customers compared to the churn and "death" rates.

If

[latex]\frac{\alpha N(0)}{\mu} > \frac{\gamma + \mu}{\beta}[/latex]

that is, the market size, growth rate, and sharing rate are large enough, then the number of customers may fluctuate, cyclically. In either case, the number of customers asymptotically approaches the point where both [latex]dI/dt[/latex] and [latex]dS/dt[/latex] are zero, [latex]\frac{α}{γ + μ}N(t)[/latex] or

[latex]\frac{α}{γ + μ}N(0) * e^{(α - μ) t}[/latex].

*Notice that in this case, the long term behavior does not depend on the viral sharing rate, [latex]β[/latex]!*

Instead, how the number of customers grows or shrinks long term depends entirely on whether the market is growing or shrinking.
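
To see where that limiting value comes from, here is a short derivation from the two zero-change conditions given earlier. It treats [latex]N(t)[/latex] as changing slowly compared with [latex]S[/latex] and [latex]I[/latex], which is a simplifying assumption rather than something established above.

Setting [latex]dI/dt = 0[/latex] with [latex]I \neq 0[/latex] gives [latex]S = \frac{\gamma + \mu}{\beta}[/latex]. Substituting that into [latex]dS/dt = -\beta SI + \alpha N(t) - \mu S = 0[/latex] gives

[latex]-(\gamma + \mu) I + \alpha N(t) - \frac{\mu (\gamma + \mu)}{\beta} = 0[/latex]

so that

[latex]I = \frac{\alpha}{\gamma + \mu} N(t) - \frac{\mu}{\beta}[/latex]

The second term is a constant offset. In a growing market the first term keeps increasing, so the offset matters less and less over time, which leaves the expression quoted above. The point lies at a positive number of customers exactly when [latex]\frac{\alpha N(t)}{\mu} > \frac{\gamma + \mu}{\beta}[/latex], the same form as the condition for cyclic fluctuation.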

If [latex]S(0) < \frac{γ + μ}{β}[/latex] then the number of customers will only grow if the number of potential customers grows larger than [latex]\frac{γ + μ}{β}[/latex] before the number of current customers drops to zero.

In less mathematical terms, what this all means is that, if your customer base doesn't die out in the beginning, your customer base will grow exponentially, as long as you have a growing market. How fast your customer base grows in the long term depends on the growth of the market and the churn rate, but not on the viral sharing rate!

**Examples:**

As in Part 2, numerically integrating the differential equations allows for visualizing the effect each parameter has on viral growth.

First, compare different values of the sharing rate, β. With the parameters:

- [latex]N(0)[/latex] = 1 million people in the market
- [latex]γ[/latex] = 70% of customers lost per day
- [latex]I(0)[/latex] = 10 current customers
- [latex]α[/latex] = 14 new people in the market, per 1000 people
- [latex]μ[/latex] = 8 people lost from the market, per 1000 people

compare [latex]βN(0)[/latex] = 5, 1, 0.8, 0.5 invites per customer per day. The condition for initial growth is:

| [latex]βN(0)[/latex] | [latex]\frac{γ + μ}{β}[/latex] |
| --- | --- |
| 5 | 140,000 |
| 1 | 700,000 |
| 0.8 | 875,000 |
| 0.5 | 1,400,000 |

For higher values of the sharing rate, [latex]β[/latex], a higher initial peak in the number of customers is reached and the ups and downs in the number of current customers end sooner, but the values for which the growth condition is met all follow the same pattern of growth after several months. That is, they all asymptotically approach [latex]\frac{α}{γ + μ}N(0) * e^{(α - μ) t}[/latex].

Now consider the effects of varying churn:

- [latex]N(0)[/latex] = 1 million people in the market
- [latex]βN(0)[/latex] = 0.8 invites per customer per day
- [latex]I(0)[/latex] = 10 current customers
- [latex]α[/latex] = 14 new people in the market, per 1000 people
- [latex]μ[/latex] = 8 people lost from the market, per 1000 people

compare [latex]γ[/latex] = 90%, 70%, 40%, or 10% of customers lost per day. The condition for growth is

| [latex]γ[/latex] | [latex]\frac{γ + μ}{β}[/latex] |
| --- | --- |
| 0.9 | 1,125,000 |
| 0.7 | 875,000 |
| 0.4 | 500,000 |
| 0.1 | 125,000 |

Notice how the change in churn rate, [latex]γ[/latex], affects both the height of the initial peak and the long term growth. Lower levels of churn also lead to less of a roller coaster ride. For the case of 90% churn, the condition for growth, [latex]S(t) > \frac{γ + μ}{β}[/latex], is not met at [latex]t=0[/latex], but because of the growth of the market, [latex]S(t)[/latex] crosses the threshold before [latex]I(t)[/latex] goes to zero and viral growth is still achieved.
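
The threshold values for the churn comparison can be checked with a few lines of Python. This is a sketch under stated assumptions: the post does not give the time unit for the market exit rate, so the code treats "8 per 1000 people" as a yearly rate converted to a per-day rate, and it assumes there are no former customers at the start. Under those assumptions the results land within a few dozen of the tabulated values.

```python
N0 = 1_000_000        # initial market size
I0 = 10               # initial current customers
S0 = N0 - I0          # ASSUMPTION: no former customers at t = 0
MU = 0.008 / 365      # ASSUMPTION: 8 per 1000 people is a yearly rate


def growth_threshold(invites_per_day, gamma, mu=MU, n0=N0):
    """Return (gamma + mu) / beta, where beta = invites_per_day / n0."""
    beta = invites_per_day / n0
    return (gamma + mu) / beta


# Compare the four churn rates at 0.8 invites per customer per day.
for gamma in (0.9, 0.7, 0.4, 0.1):
    threshold = growth_threshold(0.8, gamma)
    met = S0 > threshold
    print(f"gamma={gamma:.1f}  threshold={threshold:,.0f}  met at t=0: {met}")
```

For 90% churn the threshold comes out above the starting number of potential customers, which is the case discussed above where growth only begins once the market itself has grown enough.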

Finally, consider the effects of varying the growth rate of the market:

- [latex]N(0)[/latex] = 1 million people in the market
- [latex]βN(0)[/latex] = 2 invites per customer per day
- [latex]I(0)[/latex] = 10 current customers
- [latex]γ[/latex] = 50% of customers lost per day
- [latex]μ[/latex] = 8 people lost from the market, per 1000 people

compare [latex]α[/latex] = 4, 8, 14, 16 new people in the market, per 1000 people. The size of the initial peak is barely affected by the market growth rate, [latex]α[/latex], but long term behavior is affected greatly. For a “birth” rate that is lower than the “death” rate, even with strong viral sharing, the customer base drops to zero within about a month. For [latex]α = μ[/latex], the customer base reaches a steady value within a year. Growing markets lead to growing customer bases, with small changes in market growth rates having a large effect in the long term.

**Conclusion:** All this fancy math and book learning brings us to a conclusion that is nothing new: a large and growing market is still one of the most important factors in growing a customer base for the long term. Keeping more of the customers you have also has a strong effect on long term viral growth. However, the effects of a large viral sharing rate are only seen in the short term. Strong viral growth is important if your goal is to reach a large number of people quickly, as with a viral advertising campaign. But if your goal is to grow a customer base for a product for the long term, any amount of viral sharing can lead to long term growth as long as the market is growing and the churn in the customer base is low enough.

In Part 4, I'll look at the effect of returning customers.

**TLDR:** Differences in viral sharing rates only matter in the short term. For long term growth, the most important factors are low churn and a growing market.