Data Community DC and District Data Labs are hosting another session of their Building Data Apps with Python workshop on Saturday, February 6th from 9am - 5pm. If you're interested in learning about the data science pipeline and how to build an end-to-end data product using Python, you won't want to miss it. Register before January 23rd for an early bird discount!

Posted by Tony Ojeda

Data Community DC, Inc., is hiring a part-time Communications Manager who will be responsible for creating and delivering electronic content to our diverse technical community. The role is hourly, typically 5 hours/week, and includes the following tasks:

* Assemble a short weekly email newsletter, primarily from existing content (blog posts, event announcements, job announcements, other articles). Some distillation/excerpting of content will be necessary.
* Post social media content on Twitter and other platforms, coordinate guest bloggers, and edit their work.
* Collect and manage submitted job ads.
* Participate in monthly DC2 board meetings, and follow the board mailing list.

Posted by Harlan Harris

For several years now, the DC Nightowls meetup has been a staple of after-hours coworking for entrepreneurs, startups, and self-starters doing interesting projects in the metropolitan DC area. A number of our Data Science community members have been owls as well.

Recently, we decided to combine forces by refocusing DC Nightowls through a new program called DC2 Digital Nomads. The new program focuses on the gig economy, freelance knowledge workers, and remote working.

Posted by Andrew Conklin

If you haven’t come around yet, it’s past time: Data ethics is really important. 

A quick glance at recent ethical dilemmas is telling. Troubling instances of the mosaic effect — in which different anonymized datasets are combined to reveal unintended details — include the tracking of celebrity cab trips and the identification of Netflix user profiles. It is also difficult to remain unconcerned with the tremendous influence wielded by corporations and their massive data stores, most notoriously embodied by Facebook’s secret psychological experiments. And new issues are emerging all the time. I dare you to read MIT’s recent article on why we must train self-driving cars to kill without letting out a disquieted “Huh.”

Posted by Guest Author

During an upcoming free workshop, Andrej Lapajne will be going in depth on the benefits of using IBCS to improve your data visualization practices and communication. Here is a brief introduction to what IBCS is and how it is helping businesses across the world visualize their data effectively and consistently.

Are you using data visualization to improve your reports, presentations, and communications, or to unknowingly hinder them? All too often, reports fall somewhere between messy spreadsheets and dashboards, full of poorly labeled and inappropriate charts that simply do not get the message across to decision-makers.

[Figure 1]

Countless reports and presentations are created throughout organizations on a daily basis, all in different formats, lengths, shapes, and colors, depending on the preferences of the person who prepares them. The result is often that managers cannot make their way through the data presented, time is wasted, and important decisions fail to be made.

The solution - International Business Communication Standards

In 2004, Dr. Rolf Hichert, a renowned German professor, took on the challenge of standardizing the way data visualizers present data in their reports, dashboards, and presentations. His extremely successful work culminated in 2013 with the public release of the International Business Communication Standards (IBCS), the world’s first practical proposal for the standardized design of business communication.

The IBCS consistently define the shapes and colors of actuals and budgets, variances, different KPIs, and so on. Often referred to as the “traffic signs for management”, the IBCS are a set of best practices that have gone viral in Europe and have solved business communication problems in numerous companies such as SAP, Bayer, Lufthansa, Philips, Coca-Cola Bottlers, and Swiss Post.

[Figure 2: Profit & Loss analysis (income statement) with waterfall charts and variances]

How does it work?

Let’s take a look at a typical column chart, designed to help us compare actual sales figures vs. budget:

[Figure 3]

Is it efficient? The colors used are completely arbitrary, probably just an accidental default of the software tool. It is quite hard to estimate the variances to budget. Are we above or below the budget in a particular month? By how much?

Now let’s observe the same dataset, designed according to the IBCS:

[Figure 4]

The actuals are depicted as dark grey full columns, while the budget is an outline. This is called scenario coding: the budget is an empty frame that has to be filled up with the actuals.

The variances are explicitly calculated and visualized. Positive variance is green, negative is red. The user’s attention is guided to the variances, which are in this case the key element to understand the sales performance.

The values are explicitly labeled at the most appropriate positions on the chart. All text is standardized, exact, short, and displayed horizontally.
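
The IBCS notation itself is tool-agnostic, so to make the scenario coding concrete, here is a minimal matplotlib sketch with made-up monthly figures (the data, labels, and styling choices below are illustrative assumptions, not part of the standard): actuals as solid dark grey columns, the budget as an empty outlined frame, and a small variance chart underneath colored green for positive and red for negative deviations.

    import numpy as np
    import matplotlib.pyplot as plt

    # Illustrative (made-up) monthly sales figures
    months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
    budget = np.array([100, 105, 110, 108, 112, 115])
    actual = np.array([98, 109, 104, 115, 111, 120])
    variance = actual - budget

    x = np.arange(len(months))
    fig, (ax1, ax2) = plt.subplots(2, 1, sharex=True, figsize=(7, 5),
                                   gridspec_kw={"height_ratios": [3, 1]})

    # Scenario coding: budget is an empty outlined frame, actuals are solid dark grey
    ax1.bar(x, budget, width=0.7, fill=False, edgecolor="black", linewidth=1.2,
            label="Budget (BU)")
    ax1.bar(x, actual, width=0.45, color="#404040", label="Actual (AC)")
    ax1.set_ylabel("Sales")
    ax1.legend(frameon=False)

    # Variance to budget: green above, red below, with explicit labels
    bar_colors = ["green" if v >= 0 else "red" for v in variance]
    ax2.bar(x, variance, width=0.45, color=bar_colors)
    ax2.axhline(0, color="black", linewidth=0.8)
    for xi, v in zip(x, variance):
        ax2.text(xi, v, f"{int(v):+d}", ha="center",
                 va="bottom" if v >= 0 else "top", fontsize=8)
    ax2.set_ylabel("ΔBU")
    ax2.set_xticks(x)
    ax2.set_xticklabels(months)

    plt.tight_layout()
    plt.show()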

Storyline, visual design and uniform notation

The IBCS are not just about charts. They comprise an extensive set of rules and recommendations for the design of business communication that help you:

  1. Organize and structure your content by using an appropriate storyline
  2. Present your content by using an appropriate visual design, and
  3. Standardize the content by using a consistent, uniform notation.

After you apply the IBCS rules to your standard variance report, it will look something like this:

[Figure 5: Sales variance report - Actual vs PY vs Budget]

As you may have noticed, this report has several distinctive features:

  • The key message (headline) at the top
  • Title elements below the key message
  • Clear structure of columns (first PY for previous year values, then AC for actual, and finally BU for budget; always in this order)
  • Scenario markers below column headers (grey for PY, black for AC and outline for BU)
  • Strictly no decorative elements, only a few horizontal lines
  • Variances are visualized with red/green “plus-minus” charts and embedded into the table
  • Absolute variances (ΔPY, ΔBU) are visualized as a bar chart, while relative variances (ΔPY%, ΔBU%) are visualized as “pin” charts (we prefer to call them “lollipop” charts); see the sketch after this list
  • Semantic axis in charts: grey axis for variance to PY (grey = previous year), double line for variance to budget (outline = budget)
  • Numbered explanatory comments that are integrated into the report.
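
To make the variance visualizations above concrete, the relative-variance “pin” (lollipop) charts can also be sketched with matplotlib. The following is a rough, hypothetical illustration, not an official IBCS template: the product names and numbers are made up, and the IBCS double-line budget axis is approximated here with a single heavier line.

    import numpy as np
    import matplotlib.pyplot as plt

    # Hypothetical relative variances vs. budget (ΔBU%), one per product line
    labels = ["Product A", "Product B", "Product C", "Product D"]
    rel_var = np.array([4.2, -7.5, 1.3, -2.8])  # percent

    y = np.arange(len(labels))
    colors = ["green" if v >= 0 else "red" for v in rel_var]

    fig, ax = plt.subplots(figsize=(5, 3))
    # Pin / lollipop chart: a thin stem from zero plus a marker at the value
    ax.hlines(y, 0, rel_var, colors=colors, linewidth=1.5)
    ax.scatter(rel_var, y, c=colors, zorder=3)
    # Stand-in for the IBCS double-line budget axis
    ax.axvline(0, color="black", linewidth=2)
    ax.set_yticks(y)
    ax.set_yticklabels(labels)
    ax.set_xlabel("ΔBU%")
    ax.invert_yaxis()
    plt.tight_layout()
    plt.show()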

A clear message, appropriate data visualization, and accurate explanations: the story the numbers are telling, presented on a single page. That is what managers expect.

We will be going into much further detail on IBCS guidelines for visualizing data and business communications during a free workshop on Oct 29th at 10am. To learn more and register, please visit us at zebra.bi/dc2015.

Posted by Sean Gonzalez

Data Community DC is pleased to announce our State of Data Science Education event.

Its goal is to bring together educators, vocational programs and companies to discuss the state and future of data science.

With the rise of data science, both schools and companies are rapidly building up their programs and departments, but their beliefs about what a data scientist is and what skills one should have vary considerably.

Posted by Robert Vesco

Last week, the Urban Institute hosted a discussion on the evolving landscape of data and its potential impact on social science. “Machine Learning in a Data-Driven World” covered a wide range of important issues around data science in academic research and its real-world policy applications. Above all else, one critical narrative emerged:

Data are changing, and we can use these data for social good, but only if we are willing to adapt to new tools and emerging methods.

Posted by Guest Author

This guest blog post on ROCs was spurred by a conversation in the Q&A at Data Science DC’s June 16th Meetup on “Predicting Topics and Sharing in Social Media”. John Kaufhold, managing partner of Deep Learning Analytics, asked Bill Rand, Assistant Professor of Marketing at University of Maryland, about ROCs and convex hulls. In the post, Dr. Kaufhold satirizes data science moments lost in Q&A, talks ROC curves, and discusses the value of error bars in visualizing data science results.

Posted by Guest Author

This guest post is written by David Masad. David is a PhD candidate at George Mason University’s Department of Computational Social Science, where he studies international conflict and cooperation using agent-based modeling, event data, and network analysis. You can follow him on Twitter at @badnetworker.

'SciPy' has a few different meanings. It is a particular Python package, which brings together fast, efficient implementations of many key functions and algorithms for scientific computation. It's also the label for the broader scientific Python stack, the set of libraries and tools that make Python an increasingly popular language for science and research. Finally, it's what everyone calls what's nominally the Scientific Computing with Python conference, which for the past few years has been held every summer in Austin, TX.

This year, it involved two days of intensive tutorials; three days of presentations, talks, and discussion sections; and two more days of informal coding 'sprints'. Though I've been using the scientific Python tools for several years now, this was my first time attending SciPy. I even got a little "1st SciPy" sticker to add to my conference badge. For about five days, I got to be a part of a great community, and experience more Python, brisket and Tex-Mex than I realized was possible.

Part of the fun of this particular conference was the opportunity to talk to researchers working in areas far afield from my own. For example, over lunch burritos I got to talk shop with someone working at a climate research firm, and discovered interesting overlaps between weather and political forecasting. An astronomy talk contained some great insights into building community consensus around a common software tool. And a talk that was officially about oceanography had some very important advice on choosing a color palette for data visualization. (If you followed the links above, you saw that all the SciPy talks are available online, thanks to Enthought's generous sponsorship -- it's not too late to see any SciPy talks that seem interesting to you.)

The SciPy attendees from the DC area were a good cross-section of the diverse scientific Python community in general. There were an epidemiologist and a geneticist, a few astronomers, a government researcher, a library technologist, and a couple of social scientists (myself included). (If there were any DC-area geo-scientists there, I didn't get a chance to meet them).

There was also plenty of material directly applicable to data science. Chris Wiggins, the chief data scientist at the New York Times, gave the first day's keynote, with plenty of good insight into bringing data science into a large, established organization. Chris Fonnesbeck gave an engaging talk on the importance of statistical rigor in data science. Quite a few of the presentations introduced tools that data scientists can install and use right now. These include Dask, for out-of-core computation; xray, an exciting new library for multidimensional data; and two talks on using Docker for reproducible research. There was a whole track devoted to visualization, including a talk on VisPy, a GPU-accelerated visualization library, that gave the conference one of its big 'Wow!' moments. And the future of Jupyter (still better known as the IPython Notebook) was announced in a 5-minute lightning talk, between demos of bad-idea ways to call Assembly directly from Python and Notebook flight sim widgets (which Erik Tollerud immediately dubbed 'planes on a snake').
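
For readers who haven't seen Dask, here is a minimal sketch of what "out-of-core computation" looks like in practice (my own toy example, not code from the talk): operations on a chunked array only build a task graph, and the work is executed chunk by chunk on compute(), so the full array never has to fit in memory at once.

    import dask.array as da

    # A 10,000 x 10,000 array split into 1,000 x 1,000 chunks;
    # nothing is materialized yet, Dask only records a task graph
    x = da.random.random((10000, 10000), chunks=(1000, 1000))

    # The reduction is evaluated lazily, chunk by chunk, on .compute()
    result = (x + x.T).mean(axis=0).compute()
    print(result.shape)  # (10000,)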

Not only did I get to learn a lot from other people's research and tools, I also got to present my own. Jackie Kazil and I unveiled Mesa, an agent-based modeling framework we've been building with other contributors. The sprint schedule after the conference proper gave us a chance to work with new collaborators who discovered the package at our talk the day before. A couple of extra heads, and a couple of days of extra work, meant that Mesa came out of SciPy noticeably better than when it came in. Quite a few other tools came out of the sprints with improvements, including some at the core of the scientific Python stack. Getting to work beside (and, later, drink beer with) such experienced developers was an educational opportunity in itself.

SciPy isn't just for physical scientists or hardcore programmers. If you use Python to analyze data or build models, or think you might want to, you should absolutely consider SciPy next year. The tutorials at the beginning can help everyone from novices to experts learn something new, and the sprints provide an opportunity to gain experience putting that knowledge to work. In between, the conference provides a great opportunity to gain exposure to a wide variety of Python tools and the community that builds, maintains, and uses them. And even if you can't attend, the videos of the talks are always online.

Posted by Guest Author