Win Free eCopies of Social Media Mining with R

This is a sponsored post by Richard Heimann. Rich is Chief Data Scientist at L-3 NSS and recently published Social Media Mining with R (Packt Publishing, 2014) with co-author Nathan Danneman, also a Data Scientist at L-3 NSS Data Tactics. Nathan has been featured at recent Data Science DC and DC NLP meetups. Nathan Danneman and Richard Heimann have teamed up with DC2 to organize a giveaway of their new book, Social Media Mining with R.

Over the next two weeks, five lucky winners will win a digital copy of the book. Please keep reading to find out how you can be one of the winners and learn more about Social Media Mining with R.

Overview: Social Media Mining with R

Social Media Mining with R is a concise, hands-on guide with several practical examples of social media data mining and a detailed treatise on inference and social science research that will help you in mining data in the real world.

Whether you are an undergraduate who wishes to get hands-on experience working with social data from the Web, a practitioner wishing to expand your competencies and learn unsupervised sentiment analysis, or simply someone interested in social data analysis, this book will prove to be an essential asset. No previous experience with R or statistics is required, though having knowledge of both will enrich your experience. Readers will:

  • Learn the basics of R and all the data types
  • Explore the vast expanse of social science research
  • Discover more about data potential, the pitfalls, and inferential gotchas
  • Gain an insight into the concepts of supervised and unsupervised learning
  • Familiarize yourself with visualization and some cognitive pitfalls
  • Delve into exploratory data analysis
  • Understand the minute details of sentiment analysis

How to Enter?

All you need to do is share your favorite effort in social media mining or more broadly in text analysis and natural language processing in the comments section of this blog. This can be some analytical output, a seminal white paper or an interesting commercial or open source package! In this way, there are no losers as we will all learn. 

The first five commenters will win a free copy of the eBook. (DC2 board members and staff are not eligible to win.) Share your public social media accounts (about.me, Twitter, LinkedIn, etc.) in your comment, or email media@datacommunitydc.org after posting.

SynGlyphX: Hello and Thank You DC2!

The following is a sponsored post brought to you by one of the supporters of two of Data Community's five meetups.

This week was my, and my company’s, introduction to Data Community DC (DC2).  We could not have asked for a more welcoming reception.  We attended and sponsored both Tuesday’s DVDC event on Data Journalism and Thursday’s DSDC event on GeoSpatial Data Analysis.  They were both pretty exciting, and timely, events for us.

As I mentioned, I’m new to DC2 and new to the “data as a science” community.  Don’t get me wrong: while I’m new to DC2, I’ve been awash in data my entire career.  I started as a young consultant reconciling discrepancies in the databases of a very early Client-Server implementation.  Basically, I had to make sure that all the big department store orders on the server were in sync with the home delivery client application.  It took a lot of manual reconciling, which ultimately led me to write code to semi-automatically reconcile the two databases.  Eventually (I think) they solved the technical issues that led to the Client-Server databases being out of sync.

More recently, I was working for a company with a growing professional services organization.  The company typically hired new employees after a contract was signed, but the new professional services work involved short project durations.  If we waited to hire, the project would be over before someone started.  We developed a probability-adjusted portfolio analysis approach to compare the supply of available resources (which is always changing as people finish projects, get extended, or leave the organization) with demand (which is always changing as well).  That enabled us to determine a range of positions and skillsets to hire for in a defined timeframe.
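To make the idea of a probability-adjusted comparison concrete, here is a minimal Python sketch.  It is not the model we actually used: the skill names, headcounts, and probabilities are invented for illustration, and it assumes each pipeline entry can be weighted by a single, independent probability (of a deal closing on the demand side, or of a person becoming available on the supply side).

```python
from collections import defaultdict


def expected_headcount(items):
    """Sum probability-weighted headcount per skill.

    Each item is a (skill, headcount, probability) tuple.
    """
    totals = defaultdict(float)
    for skill, headcount, probability in items:
        totals[skill] += headcount * probability
    return dict(totals)


def hiring_gap(demand, supply):
    """Return the expected shortfall per skill (never negative)."""
    gaps = {}
    for skill in set(demand) | set(supply):
        shortfall = demand.get(skill, 0.0) - supply.get(skill, 0.0)
        gaps[skill] = max(shortfall, 0.0)
    return gaps


# Hypothetical demand: projects in the sales pipeline, weighted by
# the estimated probability that the contract is signed.
demand_pipeline = [
    ("analyst", 4, 0.75),
    ("analyst", 2, 0.5),
    ("developer", 3, 0.5),
]

# Hypothetical supply: people on the bench (probability 1.0) or
# expected to roll off a project in the planning window.
supply_pool = [
    ("analyst", 2, 1.0),
    ("analyst", 2, 0.5),
    ("developer", 2, 1.0),
]

gaps = hiring_gap(
    expected_headcount(demand_pipeline),
    expected_headcount(supply_pool),
)
print(gaps)
```

With these made-up numbers, expected analyst demand exceeds expected supply by one person, while developers are covered.  Re-running the calculation with optimistic and pessimistic probabilities is one simple way to turn the single estimate into a range.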

In both instances, it was data science that drove effective decision making.  Sure, you can apply some “gut” to any decision, but having some data science behind you makes the case much stronger.

So, I was fascinated to listen to the journalists discuss how they are applying data analysis to help:  1) support existing story lines; and 2) develop new story lines.  Nathan’s presentation on analyzing AIS data was interesting (and a bit timely, as we had just gotten a verbal win for a client on similar, though not identical, work).

I know the power of data to solve complex business, operational, and other problems.  With our new company, SynGlyphX, we are focused on helping people both visualize and interact with their data.  We live in a world with sight and three dimensions.  We believe that by visualizing the data (unstructured, filtered, analyzed, any kind of data), we can help people leverage the power of the brain to identify patterns, spot trends, and detect anomalies.  We joined DC2 to get to know folks in the community, generate some awareness for our company, and get your feedback on what we are doing.  Thank you all for welcoming us and our company, SynGlyphX, to the community.  We appreciated everyone’s interest in the demonstrations of our interactive visualization technology.  Our website traffic was up significantly last week, so I am hoping this is a sign that you were interested in learning more about us.  Additionally, I have heard from a number of you since the events, and welcome hearing from more.

Here’s my call to action: I encourage you to tweet us your answer to the following question:  “Why do you find it helpful to visually interact with your data?”

See you at upcoming events.

Mark Sloan

About the Author:

As CEO of SynGlyphX, Mark brings over two decades of experience.  Mark began his career at Accenture, co-founded the global consulting firm RTM Consulting, and served as Vice President and General Manager of Convergys’ Consulting and Professional Services Group.

Mark has an M.B.A. from The Wharton School of the University of Pennsylvania, and a B.S. in Civil Engineering from the University of Notre Dame. He is a frequent speaker at industry events and has served as an Advisory Board Member for the Technology Professional Services Association (now Technology Services Industry Association (TSIA)).

General Assembly & DC2 Scholarship

The DC2 mission statement emphasises that "Data Community DC is an organization committed to connecting and promoting the work of data professionals...". Ultimately, we see DC2 becoming a hub for data scientists interested in exploring new material, advancing their skills, collaborating, starting a business with data, mentoring others, teaching classes, changing careers, etc. Education is clearly a large part of any of these interests. While DC2 has held a few workshops and is sponsored by organizations like Statistics.com, we knew we could do more, so we partnered with General Assembly and created a GA & DC2 scholarship specifically for members of Data Community DC.

For our first scholarship we landed on Front End Web Development and User Experience, which we naturally announced first at Data Viz DC.  How does this relate to data science?  As I was happy to argue in our DC2 blogpost reply rebutting Mr. Gelman, sometimes I would love to have a little sandbox where I get to play with algorithms all day.  Then again, this is exactly what I ran away from in 2013 in becoming an independent data science consultant: I don't want a business plan I'm not a part of dictating what I can play with.  Enter Web Dev and UX.  As Harlan Harris, organizer of DSDC, mentions in his Venn diagram on what makes a data scientist, and as Tony Ojeda later emphasizes, programming is a natural and necessary part of being a data scientist.  In other words, there's this thing called the interwebs that has more data than you can shake a stick at, and if you can't operate in that environment then as a data scientist you're asking someone else to do that heavy lifting for you.

Over the next month we'll be choosing the winners of the GA DC2 Scholarship, and if you'd like to see any other scholarships in the future please leave your thoughts in the comments below or tweet us.

Happy Thanksgiving!

Hadoop as a Data Warehouse Archive

Recently, companies have seen a huge growth in data volume, both from existing structured data and from new, multi-structured data. Transaction data from online shopping and mobile devices in particular, along with clickstream and social data, is generating more data in one year than was ever created before. How is a company supposed to keep track of and store all of this data effectively? Traditional data warehouses would have to expand constantly to keep up with this stream of data, making storage increasingly expensive and time-consuming. Businesses have found some relief using Hadoop to extract and load the data into the data warehouse, but as the warehouse becomes full, businesses have had to expand the data warehouse’s storage capabilities.

Instead, businesses should consider moving the data back into Hadoop, turning Hadoop into a data warehouse archive. There are several advantages to using Hadoop as an archive in conjunction with a traditional data warehouse. Here’s a look at a few.

Improved Resilience and Performance

Many of the platforms designed around Hadoop have focused on making Hadoop more user friendly and have adjusted or added features to help protect data. MapR, for example, removes single points of failure in Hadoop that made it easy for data to be destroyed or lost. Platforms will often offer data mirroring across clusters to help support failover and disaster recovery as well.

With a good level of data protection and recovery abilities, Hadoop platforms become a viable option for the long-term storage of Big Data and other data that has been archived in a data warehouse.

Hadoop also keeps historical data online and accessible, which makes it easier to revisit data when new questions come up. This is dramatically faster and easier than going through traditional magnetic tapes.

Handle More Data for Less Cost

Hadoop’s file system can capture tens of terabytes of data in a day, and this is accomplished at low cost thanks to open source economics and commodity hardware. Hadoop can also handle more data easily: adding nodes to the cluster adds compute power, so data continues to be processed at speed. This is much less expensive than the continuous upgrades that would be required to maintain a traditional warehouse and keep up with the extreme amount of data. On top of that, data tape archives found in traditional data stores can become costly because the data is difficult to retrieve. Not only is the data stored offline, requiring tons of time to restore, but the cartridges are prone to degrade over time, resulting in costly losses of data.

High Availability

Traditional data warehouses often made it difficult for global businesses to maintain all of their data in one place with employees working and logging in from various locations around the world. Hadoop platforms will generally allow direct access to remote clients that want to mount the cluster to read or write data flows. This means that clients and employees will be working directly on the Hadoop cluster rather than first uploading data to a local or network storage system. In a global business where ETL processing may need to happen several times within the day, high availability is very important.

Reduce Tools Needed

Coupled with increased availability, the ability to access the cluster directly dramatically reduces the number of tools needed to capture data. For example, this reduces the need for log collection tools that may require agents on every application server. It also eliminates the need to keep up with changing tape formats every couple of years or risk being unable to restore data because it is stored on obsolete tapes.

Author Bio

Rick Delgado, Freelance Tech Journalist

I've been blessed to have a successful career and have recently taken a step back to pursue my passion for writing. I've started doing freelance writing and I love to write about new technologies and how they can help us and our planet.

Sponsored Event: Survival of the Fittest: How Software & SaaS Providers Can Avoid Extinction in the Age of Analytics

Join us on May 15th for a seminar on embedded analytics best practices and go-to-market strategies for software and SaaS providers. Learn about customer use cases, how to price, package, and promote your analytics offering, and participate in hands-on technical training (you'll walk away with what you create!). Following the presentation, network with your peers at the Tech Cocktail Mixer & Startup Showcase – food & drinks are on us! This event is free to attend, so share the registration link with your colleagues and peers.

Date and Time: May 15, 2013, 8:30am – 6:30pm

Registration page: http://www.logianalytics.com/washington_dc?cm=datacommunitydc-dcsem

Location: 1776 Campus (1133 15th St NW, Washington, DC 20005)

A Pioneering Conference on Big Data and Open Analytics

This is a sponsored post by Scott Raspa at IKANOW, a big data software company. He is involved in the company’s sales and marketing efforts supporting public and private sector clients. He can be found on Twitter as @sraspa. IKANOW is producing the Open Analytics Summit, which is a sponsor of Data Community DC.

Big Data conferences cover everything from NBA rebounds to government spending. Now there is a conference that covers open analytics across industries and platforms... and it’s right here in DC: The Open Analytics Summit.

Growth in Big Data Conferences

Since 2011, interest in Big Data has grown exponentially. A quick look at Google Trends shows that beginning in 2011, Big Data began its ascent and has now entered the common vernacular.

Google Trends plot of "big data"

Not surprisingly, as interest in Big Data has grown, so has interest in Big Data conferences. Since 2010, there have been 286 Big Data conferences, according to social conference directory Lanyrd, and 93 more are already scheduled or taking place in 2013. Google Trends shows the greatest growth in Big Data conferences starting in 2012, and there is no doubt that conferences have continued to grow alongside interest in the topic.

Google Trends showing "big data conferences"

Initially, Big Data conferences focused on platform-specific solutions. Hadoop World in October 2010 was a pioneer in this space, and similar conferences have focused on MongoDB, NoSQL, Cassandra and others. Highly technical, and focused on a community of engineers, these conferences have continued to grow rapidly. On the heels of the technical conferences came events focusing on specific data types. Government and enterprise data-focused conferences grew in popularity as well. While the majority of spending and the largest conferences have been focused around big players like IBM, Intel and the Federal Government, more specific markets are taking part as well. This past weekend, for example, was the 7th Annual MIT Sloan Sports Analytics Conference. This conference looks specifically at how data can be used in professional sports and attracted over 2,700 attendees this year, with research papers ranging from “Breaking Down the Rebound” to “Effort versus Concentration.” With Big Data touching fields from sports to government, we are excited to host in DC a new conference pioneer looking at Big Data and Open Analytics.

Open Analytics Summit

The Open Analytics DC Summit takes a unique approach by combining data types, professional backgrounds and industries. Companies such as OpenGeo, Berico, and Praescient Analytics will present on how open source tools can create understanding from structured and unstructured data. Specific platforms will be covered with deep dives and examples of how to use Hadoop, Elasticsearch, MongoDB, open analytics platform IKANOW, and others. Big Data at huge organizations and companies will be the focus for people like Sean Patrick Murphy, a Senior Scientist at Johns Hopkins University (and board member at Data Community DC), Shahid Shah, a Chief Architect at OMB, and Donald Cox of Recovery, who will look at Data Analytics in the Government. The Open Analytics Summit is unique in that data scientists, CTOs and analysts will mingle with executives and industry experts. It will focus on combining technical skills with real-world implementation experience, making the Summit a great place to learn to integrate Big Data into projects and analysis within companies and organizations of all sizes.

Open Analytics Summit Special for Data Community DC

The Open Analytics DC Summit is excited to announce special pricing for Data Community DC readers and members. You can purchase tickets for the Summit and enter the code DataDC50 to receive $50 off the regular ticket price. At only $225, this conference is a great investment to learn more about implementing Big Data and open source analytics, and to meet industry leaders who will prove to be valuable contacts in the future. While Big Data’s moment has arrived, it is about time that open analytics gets a focused look. The Open Analytics Summit on March 25th in Washington, DC does just that. And we couldn’t be more excited.

An Introduction to TechBreakfast and Why You Should Care

I want to introduce you to TechBreakfast, a rapidly growing meetup group that hosts events in Baltimore, Columbia, DC, and Northern VA.  So why the introduction? Don't you already have enough Meetup groups to go to? DC2 has at least two good reasons for doing so. First, TechBreakfast demos early stage technology companies from the area. While entrepreneurship and data are different, these two subjects are closely related and the data revolution probably would not be happening if it weren't for the tech startup scene.  Thus, we thought you might just be interested in this area.  Second, while potentially relevant content is a good start, we don't mention any and all tech meetups that come our way. In this writer's humble opinion, TechBreakfast is one of the best-run and most enjoyable meetups that I have been to ... and I have been to hundreds of such events. Keep reading if you want to learn more and discover some exciting upcoming events.


Want to see cool new technology? Want to interact with other cool techies, startups, and business folks? Have some time in the morning? Then come to TechBreakfast, a monthly breakfast in Baltimore, Columbia, DC, and Northern Virginia where entrepreneurs, techies, developers, designers, business people, and interested people see showcases on cool new technology in a demo format and interact with each other. "Show and Tell for Adults" is what we usually say. No boring presentations or speakers who drone on. This is a "show and tell" format where we tell people to "show me, don't tell me" about the great things they are working on. Each TechBreakfast begins at 8:00am and goes until 10:00am (although people usually hang around later).  This event is FREE! Thank our sponsors when you see them!

  • Wed. Feb. 27, 2013: Baltimore TechBreakfast - Featuring Vince Talbert Success Story - Bill Me Later. Featuring awesome technology companies showcasing their innovations in a demo-format and a Success Stories guest. Presenters for this installment: Vince Talbert, Platfolio, Sexual Health Innovations, OpiaTalk, and ChefTabl. FREE. Location: DLA Piper, Baltimore, MD. Register and info at http://www.meetup.com/TechBreakfast/events/97731722/
  • Fri. Mar. 1, 2013: Insurance BizWorkshop - Just because you are a startup or a small business doesn't mean that things can't go wrong. And when those things go wrong... they can really go wrong. The beauty of insurance is that it's there to protect you when things go wrong. And it doesn't always need to cost an arm and a leg. In the Insurance BizWorkshop on March 1, 2013, we'll bring one of the area's most advanced and notable firms in the insurance field to help you figure out what you need, how little or much coverage you need to cover those risks, and how to save a bundle doing it. Indeed, cover your butt for pennies on the dollar. Cost is $15 if you register by Feb. 27, 2013. More information and register at http://www.meetup.com/BizWorkshop/events/105199632/.
  • Wed. Mar. 6, 2013: NoVA TechBreakfast - Featuring awesome technology companies showcasing their innovations in a demo-format. Presenters for this installment: Nanobird, OpiaTalk, Omic Biosystems, Workman, Stormpins. FREE. Location: AOL Fishbowl, Reston, VA. Register and info at http://www.meetup.com/TechBreakfast/events/103533882/
  • Tue. Mar. 12, 2013: Columbia TechBreakfast - Featuring awesome technology companies showcasing their innovations in a demo-format. Presenters for this installment: Thycotic, Gruply, RackTop Systems, MeetLocalBiz, Light Point Security. FREE. Location: Loyola Columbia, Columbia, MD. Register and info at http://www.meetup.com/TechBreakfast/events/97737422/

Community Indicators Conference

This is a guest post from Jim Farnham, co-organizer of the upcoming Community Indicators Consortium Impact Summit in College Park. Community indicators are an effort to bring performance metrics to local-level governance, and as such are related to the Open Data movement. CIC is sponsoring DC2 this month, in an effort to make the local community aware of this event. We thank Jim and CIC for their support, and urge locals interested in Open Data and related topics to consider attending!

CIC’s Impact Summit -- November 15-16, 2012 in College Park, MD -- will be a forum for community indicators practitioners and stakeholders to share projects, research and lessons from various fields, including Sustainability, Health, and Education, as well as to explore approaches and tools for creating positive impacts within our communities. The Community Indicators Consortium (CIC) is an active, open learning network and global community of practice among persons interested or engaged in the field of indicators development and application. We host webinars and conferences, manage an indicator project database, and undertake projects aimed at building the field.

Our conference provides an opportunity to share ways of increasing the impact of our work on behalf of our communities, public officials, funders, professional networks, businesses and clients in a variety of formats and tracks.

Join over 200 practitioners, analysts, academics, funders, and data providers in over 20 sessions to share our work and explore new ways and tools to track impact, understand macro trends buffeting our communities, and learn how to effect change and bridge the distance between objective high quality data and subjective perceptions and interpretation.  Full details are available at the conference web site.  For those of you who cannot make it to College Park, we will provide streaming of portions of the conference, and other portions will be online for member access after the conference.

(Individual membership in CIC is $75 per year and provides discounted access to CIC conferences, free access to 10-20 webinars per year and all webinar and conference archives.)

Presenters at the CIC Impact Summit include:

  • Robert Groves – Former Director of US Census Bureau
  • Bryan Sivak – Chief Technology Officer, U.S. Department of Health and Human Services
  • Eugenie Birch – University of Pennsylvania
  • Charlotte Kahn – Director of Boston Indicators Project
  • Chantel Bottoms – Austin Community Action Network
  • Michael McAfee – Promise Neighborhoods Institute, PolicyLink
  • Salin Geevargese – HUD, Office of Sustainable Housing and Communities

And about 50 others.

Follow us @CommunityIC and tweet to #CICSUMMIT

Please contact us at conference@communityindicators.net with any questions or comments regarding the conference.