Data Community DC and District Data Labs are hosting another session of their Building Data Apps with Python workshop on Saturday, February 6th from 9am - 5pm. If you're interested in learning about the data science pipeline and how to build an end-to-end data product using Python, you won't want to miss it. Register before January 23rd for an early bird discount!
Data products are software applications that derive their value from data by leveraging the data science pipeline, generating new data through their operation. They aren’t apps with data, nor are they one-time analyses that produce insights - they are operational and interactive. The rise of these types of applications has directly contributed to the rise of the data scientist and the idea that data scientists are professionals “who are better at statistics than any software engineer and better at software engineering than any statistician.”
These applications have been largely built with Python. Python is flexible enough to develop extremely quickly on many different types of servers and has a rich tradition in web applications. Python contributes to every stage of the data science pipeline, including real-time ingestion and the production of APIs, and it is powerful enough to perform machine learning computations. In this class we’ll build a data product with Python, leveraging every stage of the data science pipeline to produce a book recommender.
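To make that concrete, here is a minimal sketch of one way such a recommender might work, using user-to-user cosine similarity over a toy ratings dictionary. The titles, ratings, and weighting scheme are all invented for illustration, and the class may well take a different approach.

```python
from math import sqrt

# Toy user -> {book: rating} data; the workshop would use a real dataset.
ratings = {
    "ann":  {"Dune": 5, "Emma": 3, "It": 1},
    "bob":  {"Dune": 4, "Emma": 1, "It": 5},
    "cara": {"Dune": 5, "Emma": 4},
}

def cosine(u, v):
    """Cosine similarity over the books two users have both rated."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[b] * v[b] for b in shared)
    nu = sqrt(sum(u[b] ** 2 for b in shared))
    nv = sqrt(sum(v[b] ** 2 for b in shared))
    return dot / (nu * nv)

def recommend(user):
    """Score books the user hasn't rated by similarity-weighted ratings
    from other users (assumes at least one similar user exists)."""
    scores, weights = {}, {}
    for other, their in ratings.items():
        if other == user:
            continue
        sim = cosine(ratings[user], their)
        for book, r in their.items():
            if book not in ratings[user]:
                scores[book] = scores.get(book, 0.0) + sim * r
                weights[book] = weights.get(book, 0.0) + sim
    # Predicted rating = weighted average; best candidates first.
    return sorted(((s / weights[b], b) for b, s in scores.items()),
                  reverse=True)

print(recommend("cara"))  # books cara hasn't rated, best first
```

Even this toy version exercises the pipeline end to end: ingest ratings, compute a model (similarities), and serve a ranked recommendation.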
Come join us the day after Memorial Day for a new type of Meetup. In the past, Data Innovation DC and Data Community DC have brought in fascinating speakers discussing data products and services that have already been built or data sets that are now available for public consumption. This Tuesday, we are changing things up as part of the National Day of Civic Hacking. Our goal is to have individuals and teams interested in building commercially viable data products attend and listen to experts strongly familiar with the data problems that consumers of US Census data are having. Simply put, we are trying to line up problems that other people (also known as potential customers) will pay to have solved. As a massive added bonus, if your team can put something together before the end of next weekend, you may be able to attract national-level press interest.
Some of the bios for our Tuesday Panelists are below. If you are interested in attending for free, please register here.
Andrew W. Hait serves as the Data Product and Data User Liaison in the Economic Planning and Coordination Division at the U.S. Census Bureau. With over 26 years of service at the Bureau, Andy oversees the data products and tools and coordinates data user training for the Economic Census and the Census Bureau's other economic survey programs. He is also the lead geographic specialist in the Economic Programs directorate. Andy is the Census Bureau's inside man for understanding its customers' needs.
Judith Johnson (Remote)
Judith K. Johnson joins us from the Small Business Administration-funded Small Business Development Center's (SBDC) National Information Clearinghouse, where she serves as Lead Librarian. She monitors daily incoming operations, provides business information research, and reviews completed research by staff before distribution to SBDC advisors located nationwide. Ms. Johnson also provides preliminary patent or trademark searches and trains staff and SBDC advisors. She comes to the panel with a strong handle on the data needs of entrepreneurs and business owners.
Mr. Earls, M.U.R.P., is a GIS Analyst at Carson Research Consulting (CRC). His work primarily revolves around the Baltimore DataMind. Mr. Earls is also responsible for managing social media (e.g., Facebook and Twitter) for the DataMind as well as the DataMind blog. He provides assistance with data visualization and mapping for other CRC projects as needed.
Dr. Taj Carson
Dr. Taj Carson is the CEO and founder of Carson Research Consulting (CRC), a research and evaluation firm based in Baltimore. She has been working in the field of evaluation since 1997 and specializes in research and evaluation that can be used to improve organizations and program performance. She is also the creator and driving force behind the Baltimore DataMind, an interactive online mapping tool that allows users to visualize socio-economic data for the city of Baltimore at the neighborhood level.
Kim Pierson (remote)
Kim Pierson is a Senior Data Analyst with ProvPlan in Providence, Rhode Island. She has 6 years of experience in data analysis, geospatial information, and data visualization. She works with community organizations, non-profits, government agencies, and national organizations to transform data into information that supports better decision making, strengthens communities, and promotes a more informed populace. She specializes in urban data analysis, including demographic, education, health, public safety, and Census data. She has worked on web-based data and mapping applications including the RI Community Profiles, RI DataHUB, and ArcGIS Viewer for Flex applications. She holds an M.A. in Urban and Regional Planning from the University of Illinois.
Interested in starting a company? It is summertime, the time for sequels. Our first event with the US Census Bureau was such a success that we are having a follow up event as part of the National Day of Civic Hacking.
In our first Census event, we had Census data experts come and talk about the data the US Census Bureau has available and how it could potentially be used to start a company. During that event, we learned that the Census Bureau has a number of data consumers with real problems around the data they consume; these companies could use help, and that represents a legitimate business opportunity.
At this event, we are going to bring in actual Census data consumers to discuss their data-related problems. Why? Because customer development and finding the data-product market fit are the hard parts of starting a company. By providing access to potential customers who have very specific problems around open data sets, we are trying to lower the barriers for enterprising individuals and teams to start companies. We sincerely hope that teams will form to address these issues and potentially commercialize their solutions.
If you want, think of this as a practical hackathon. Instead of spending the weekend building a small application or website or data visualization, spend a few hours understanding real, addressable business problems that can be commercialized. We will leave it to you and your team to build the solution on your own time, but we will still provide drinks and pizza.
Oh yeah, time to mention the last carrot we have to dangle. We will be an official part of the National Day of Civic Hacking. That means that industrious teams that jump in to solve a problem and can assemble something interesting by the end of the weekend could get access to national-level press.
Questions? Please email Sean Murphy through MeetUp.com or via Twitter @SayHiToSean.
A crucial part of learning data science is applying the skills you learn to real world projects. Working on interesting projects keeps you motivated to continue learning and helps you sharpen your skills. Working in teams helps you learn from the different experiences of others and generate new ideas about learning avenues you can pursue in the future. That's why District Data Labs is starting a virtual incubator program for data science projects!
The incubator program is:
- Free (no cost to you)
- Part-time (you can work on projects in your spare time)
- Virtual (you don't need to be located in the DC area)
The first class of the incubator is scheduled to run from May through October 2014. This is a great way to learn by working on a project with other people and even potentially sharing in the rewards of a project that ends up being commercially viable.
For more info and to apply, you can go to http://bit.ly/1dqp11k.
Applications end soon, so get yours in today!
Back in 2010, New York-based data scientist Drew Conway famously created the Data Science Venn Diagram. Illustrating that Data Science is the intersection of "Substantive Expertise," "Hacking Skills," and "Math and Statistics Knowledge," the diagram had a substantial impact on the nascent community. It was one of a key set of articles that helped to define the distinguishing features of data science as a discipline, and to clarify why there was a need for a new term (rather than just "applied statistics"). If you haven't seen the diagram, click through before proceeding. Another new term is "Data Products." Mike Loukides and DJ Patil, among others, have written about the things that data scientists build. I'd like to add to that a Venn Diagram for data products that clarifies what that term means. And more importantly, the diagram relates data products to other sorts of data-related artifacts that have existed for a long time. Here it is:
What does this mean? First of all, there are three sets of skills, directly paralleling Drew's data science skill sets, all floating in a sea of data. When you combine Data with Domain Knowledge, you get Spreadsheets. With Statistics, Predictive Analytics, and Visualization, you get Exploratory Data Analysis and Statistical Programming. And with Software Engineering, you get Databases. Highly useful systems and products, but nothing particularly new.
Combining pairs of sets with this sea of data, you get more specific products:
- Data + Software Engineering + Domain Knowledge = Business Rules and Expert Systems with implementations such as Drools and FICO's Blaze.
- Data + Software Engineering + EDA & Statistical Programming = BI and Statistics Tools, such as Tableau, SPSS, and many more general-purpose statistical systems.
- Data + Domain Knowledge + EDA & Statistical Programming = One-Off Analyses, which may be a PDF article, or a data-driven Powerpoint presentation, or simply a chart showing a distribution sent via email.
And at the center of it all is a Data Product, which is a piece of software that includes both Domain Knowledge and Statistical components. These may be widgets in a larger web tool, such as LinkedIn's People You May Know, or software systems designed for specific analytic purposes, with baked-in domain knowledge. Tools that are designed for statistical analysis of DNA sequences, or optimization of truck routing for distributors, or many many other things, all fall into this category. In many cases, data products make it easy for regular people to get what they need without having to dive into a very complex set of data and a very complex set of algorithms.
What are the consequences of this framework? I'd assert that a product combining all three aspects, and thus requiring all three skill sets of a data scientist to design and build, may be substantially more valuable than products that combine just one or two of the components.
One-off analyses can be great, but a repeatable, reproducible analysis is much better. Business rules can lead to maintainable software systems, but without statistical capabilities, they may be too rigid to adequately work in many real world situations. (See the history of AI research prior to about the 1980s.) And general purpose BI and Statistics tools are extremely useful, but may become even more powerful when the systems are designed for and incorporate particular domain knowledge.
What do you think? What's missing? Does this clarify your thinking? Or is this entirely obvious?
On Tuesday, January 29th, nearly 90 academics, professionals, and data science enthusiasts gathered at JHU APL for the kick-off meetup of the new Mid-Maryland Data Science group. With samosas on their plates and sodas in hand, members filled the air with conversations about their work and interests. After their meal, members were ushered into the main auditorium and the presenters took their place at the front.
Greetings and Mission
by Jason Barbour & Matt Motyka
Jason and Matt kicked off the talks with an introduction of the group. Motivated by both the growth of data science and the vast opportunities being made available by powerful free tools and open access to data, they described their interest in creating a local group that would help grow the Maryland data science community. As software developers with analytic experience, Jason and Matt next described their keys to a successful analytic: infrastructure, people, data, model, and presentation. Lastly, metrics about the interests and experience of the members were presented.
The Rise of Data Products
by Sean Murphy
With excitement and passion, Sean took the stage to show how now is the Gold Rush for data products. Laying out the definition of a data product, and cycling through several well-known examples, Sean explained how these products are able to bring social, financial, or environmental value through the combination of data and algorithms. Consumers want data, and the tools and infrastructure needed to supply this demand are available either freely or at extremely low cost. Data scientists are now able to harness this stack of tools to provide the data products that consumers crave. As Sean succinctly stated, it is a great time to work with data.
The article version of the talk can be found here.
The Variety of Data Scientists
Being a full-fledged data scientist himself, Harlan followed up Sean by presenting his research into what the name “data scientist” really means. Using the results of a data scientist survey, Harlan listed several skill groupings that provide a shorthand for the variety of skills data scientists possess: programming, stats, math, business, and machine learning/big data. Next, Harlan discussed how the diverse backgrounds of data scientists can be more accurately categorized into four types: data businessperson, data creative, data researcher, and data engineer. With this breakdown, Harlan demonstrated that the data scientist community is actually composed of individuals with a variety of interests and skills.
Cloudera Impala - Closing the near real time gap in BIGDATA
A true cyber security evangelist, Wayne Wheeles presented how Cloudera’s Impala was able to make near real time security analysis a reality. With his years of experience in the field of cyber security, and his prior work utilizing big data technologies, Wayne was given unique access to Cloudera’s latest tool. Through his testing and analysis, he concluded that Impala offered a significant improvement in performance and could become a vital tool in cyber security.
After the last presentation, more than a dozen members joined us at nearby Looney’s Pub to end the night with a few beers and snacks. To everyone's surprise, Donald Miner of EMC Greenplum offered to pick up the tab! You can follow him on Twitter or LinkedIn from this page.
If you missed this first event, don't worry as the next one is coming up on March 14th in Baltimore. Check it out here.
Tonight's talk is focused on capturing what I see as a new (or continuing) Gold Rush, one I could not be more excited about.
Before we can talk about the Rise of Data products, we need to define a Data Product. Hilary Mason provides the following definition: "a data product is a product that is based on the combination of data and algorithms." To flesh this definition out a bit, here are some examples.
1) LinkedIn has a well known data science team and highlighted below is one such data product - a vanity metric indicating how many times you have appeared in searches and how many times people have viewed your profile. While some may argue that this is more of a data feature than product, I am sure it drives revenue as you have to pay to find out who is viewing your profile.
2) Google's search is the consummate data product. Take one part giant web index (the data) and add the PageRank algorithm (the algorithms) and you have a ubiquitous data product.
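As a sketch of that data-plus-algorithm combination, here is a toy power-iteration PageRank over a hypothetical three-page link graph. The pages, damping factor, and iteration count are illustrative; Google's production system is of course vastly more elaborate.

```python
# Hypothetical link graph: page -> pages it links to.
links = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
}

def pagerank(links, damping=0.85, iters=50):
    """Power iteration: repeatedly redistribute rank along outbound links.
    Assumes every page has at least one outbound link."""
    n = len(links)
    rank = {page: 1.0 / n for page in links}
    for _ in range(iters):
        # Base rank from the random-jump term, plus shares from inlinks.
        new = {page: (1.0 - damping) / n for page in links}
        for page, outs in links.items():
            share = rank[page] / len(outs)
            for out in outs:
                new[out] += damping * share
        rank = new
    return rank

print(pagerank(links))  # ranks sum to 1; heavily linked pages score higher
```

Here page "c", which receives links from both "a" and "b", ends up with the highest rank, which is exactly the intuition behind the algorithm: the data (the link graph) plus the algorithm (power iteration) yields the product (a ranking).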
3) Last, but not least, is Hipmunk. This company allows users to search flight data and visualize the results in an easy-to-understand fashion. Additionally, Hipmunk attempts to quantify the pain entailed by different flights (those 3 layovers add up) into an "agony" metric.
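Hipmunk's actual agony formula is proprietary; as a purely hypothetical illustration, an agony-style score might combine price, duration, and layovers with hand-picked weights like so:

```python
# Hypothetical "agony"-style score; lower is better. The weights below are
# invented for illustration and bear no relation to Hipmunk's real formula.
def agony(price_usd, duration_hours, layovers):
    """Fold price, time in transit, and layover pain into one number."""
    return price_usd / 100.0 + duration_hours + 2.5 * layovers

direct = agony(price_usd=450, duration_hours=6.0, layovers=0)
cheap = agony(price_usd=280, duration_hours=11.5, layovers=3)
print(direct, cheap)  # the cheap three-layover itinerary scores worse
```

The point is the same as with the other examples: raw flight data plus a (here, trivially simple) algorithm produces a single value users can actually sort by.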
So let's try a slightly different definition - a data product is the combination of data and algorithms that creates value--social, financial, and/or environmental in nature--for one or more individuals.
One can argue that data products have been around for some time, and I would completely agree. However, the point of this talk is why they are exploding now.
I would argue that it is all about supply and demand. And, for this brief 15 minute talk (a distillation of a much longer talk), I am going to constrain the data product supply issue to the availability and cost of the tools required to explore data and the infrastructure required to deliver data products. On the demand side, I am going to do a "proof by example," complete with much arm waving, to show that today's mass market consumers want data.
On the demand side, let's start with something humans have been doing ever since they came down from the trees: running.
With a small sensor embedded in the shoe (not the only way these days), Nike+ collects detailed information about runners and simply cannot give enough data back to its customers. In terms of this specific success as evidence of general data product demand, Nike+ users have logged over 2 billion miles as of 1/29/2013.
As further evidence of mass market data desire, 23andMe has convinced nearly a quarter million people to spit into a little plastic cup, seal it up, mail it off, and get their DNA sequenced. 23andMe then gives back the data to the user in the form of a genetic profile, complete with relative genetic disease risks and clear, detailed explanations of those numbers.
And finally, there is Google Maps, or GPS in general: merging complex GIS data with sophisticated algorithms to compute optimal pathing and estimated time of arrival. Who doesn't use this data product?
In closing, the case for overwhelming data product demand is strong ::insert waving arms:: and made stronger by the fact that our very language has become sprinkled with quasi-stat/math terms. Who would ever think that pre-teens would talk about something trending?
Let's talk about the supply side of the equation now, starting with the tools required to explore data.
Then: Everyone's "favorite" old-school tool, Excel, costs a few hundred dollars depending on many factors.
Now: Google docs has a spreadsheet where 100 of your closest friends can simultaneously edit your data while you watch in real time.
And the cost, FREE.
Let's take a step past spreadsheets and rapidly prototype some custom algorithms using Matlab (Yes, some would do it in C but I would argue that most can do it faster in Matlab). The only problem here is that Matlab ain't cheap. Beware when a login is required to get even individual license pricing.
Now, you have Python and a million different modules to support your data diving and scientific needs. Or, for the really adventurous, you can jump to the very forward looking, wickedly-fast, big-data ready, Julia. If a scientific/numeric programming language can be sexy, it would be Julia.
And the cost, FREE.
Let's just say you want to work with data frames and some hardcore statistical analyses. For a number of years, you have had SAS, Stata, and SPSS, but these tools come at an extremely high cost. Now, you have R. And it's FREE.
Yes, an amazing set of robust and flexible tools for exploring data and prototyping data products can now be had for the low, low price of free, which is a radical departure from the days of yore.
Now that you have an amazing result from using your free tools, it is time to tell the world.
Back in the day (think Copernicus and Galileo), you would write a letter containing your amazing results (your data product), which would then take a few months to reach a colleague (your market). This was not a scalable infrastructure.
Contemporary researchers push their findings out through the twisted world of peer-reviewed publications ... where the content producers (researchers) often have to pay to get published while someone else makes money off of the work. Curious. More troubling is the fact that these articles are static.
Now, if you want to reach a global audience, you can pick up a CMS like WordPress or a web framework such as Rails or Django and build an interactive application. Oh yeah, these tools are free.
So the tools are free, and now the question of infrastructure must be addressed. And before we hit infrastructure, I need to at least mention that overused buzzword, "big data."
In terms of data products, "big data" is interesting for at least the simple reason that having more data increases the odds of having something valuable to at least someone.
Think of it this way, if Google only indexed a handful of pages, "Google" would never have become the verb that it is today.
If you noticed the pattern of tools getting cheaper, we see the exact same trend with data stores. Whether your choice is relational or NoSQL, big- or little-data, you can have your pick for FREE.
With data stores available for the low cost of nothing, we need actual computers to run everything. Traditionally, one bought servers, which cost an arm and a leg, not to mention siting requirements and maintenance among other costs. Now Amazon's EC2 and Google Compute Engine allow you to spin up a cluster of 100 instances in a few minutes. Even better, with Heroku, sitting on top of Amazon, you can stand up any number of different data stores in minutes.
Why should you be excited? Because the entire tool set and the infrastructure required to build and offer world-changing data products is now either free or incredibly low cost.
Let me put it another way. Imagine if Ford started giving away car factories, complete with all required car parts, to anyone with the time to make cars!!!!!
Luckily, there are such individuals who will put this free factory to work. These "data scientists" understand the entire data science stack or pipeline. They can, by themselves, take raw data all the way to a product ready to be consumed globally (or at least build a pretty impressive prototype). While these individuals are relatively rare now, that will change. Such an opportunity will draw a flood of individuals, and the rate will only increase as the tools become simpler to use.
Let's make the excitement a bit more personal and go back to that company with a lovable logo, Hipmunk.
If I remember the story correctly, two guys at the end of 2010 taught themselves Ruby On Rails and built what would become the Hipmunk we know and love today.
Learned to Code.
And, by the way, Hipmunk has $20.2 million in funding 2 years later!
It is a great time to work with data.