Data Community DC and District Data Labs are hosting an Intro to Python for Data Science workshop on Saturday, October 3rd, from 9am to 5pm. Register before September 19th for an early bird discount!
Data Community DC and District Data Labs are excited to be hosting an Introduction to R Programming workshop on Saturday, September 26th. Register before September 12th for an early bird discount!
Data Community DC is pleased to announce our State of Data Science Education event.
Its goal is to bring together educators, vocational programs, and companies to discuss the state and future of data science.
With the rise of data science, both schools and companies are rapidly building up their programs and departments, but their beliefs about what a data scientist is and what skills they should have vary considerably.
Last week, the Urban Institute hosted a discussion on the evolving landscape of data and the potential impact on social science. “Machine Learning in a Data-Driven World” covered a wide range of important issues around data science in academic research and their real policy applications. Above all else, one critical narrative emerged:
Data are changing, and we can use these data for social good, but only if we are willing to adapt to new tools and emerging methods.
Last month, a new meetup group for women data scientists in the DC area was started by Mandi Traud and Jackie Kazil.
Women Data Scientists DC is a meetup group for women data scientists, women who want to be data scientists, and supporters of women in data science. Their monthly meetings will include presentations by data scientists, networking events, mentoring opportunities, and workshops to learn new data science skills.
Co-founders Jackie Kazil and Mandi Traud launched the group on July 9th with two members, and by the next day it had grown to more than 85 members!
Here's what the co-founders said individually when asked about how and why they decided to start this group.
This guest blog post on ROCs was spurred by a conversation in the Q&A at Data Science DC’s June 16th Meetup on “Predicting Topics and Sharing in Social Media”. John Kaufhold, managing partner of Deep Learning Analytics, asked Bill Rand, Assistant Professor of Marketing at University of Maryland, about ROCs and convex hulls. In the post, Dr. Kaufhold satirizes data science moments lost in Q&A, talks ROC curves, and discusses the value of error bars in visualizing data science results.
This guest post is written by David Masad. David is a PhD candidate at George Mason University’s Department of Computational Social Science, where he studies international conflict and cooperation using agent-based modeling, event data, and network analysis. You can follow him on Twitter at @badnetworker.
'SciPy' has a few different meanings. It is a particular Python package, which brings together fast, efficient implementations of many key functions and algorithms for scientific computation. It's also the label for the broader scientific Python stack, the set of libraries and tools that make Python an increasingly popular language for science and research. Finally, it's what everyone calls what's nominally the Scientific Computing with Python conference, which for the past few years has been held every summer in Austin, TX.
This year, it involved two days of intensive tutorials; three days of presentations, talks, and discussion sections; and two more days of informal coding 'sprints'. Though I've been using the scientific Python tools for several years now, this was my first time attending SciPy. I even got a little "1st SciPy" sticker to add to my conference badge. For about five days, I got to be a part of a great community, and experience more Python, brisket and Tex-Mex than I realized was possible.
Part of the fun of this particular conference was the opportunity to talk to researchers working in areas far afield from my own. For example, over lunch burritos I got to talk shop with someone working at a climate research firm, and discovered interesting overlaps between weather and political forecasting. An astronomy talk contained some great insights into building community consensus around a common software tool. And a talk that was officially about oceanography had some very important advice on choosing a color palette for data visualization. (If you followed the links above, you saw that all the SciPy talks are available online, thanks to Enthought's generous sponsorship -- it's not too late to see any SciPy talks that seem interesting to you.)
The SciPy attendees from the DC area were a good cross-section of the diverse scientific Python community in general. There were an epidemiologist and a geneticist, a few astronomers, a government researcher, a library technologist, and a couple of social scientists (myself included). (If there were any DC-area geo-scientists there, I didn't get a chance to meet them).
There was also plenty of material directly applicable to data science. Chris Wiggins, the chief data scientist at the New York Times, gave the first day's keynote, with plenty of good insight into bringing data science into a large, established organization. Chris Fonnesbeck gave an engaging talk on the importance of statistical rigor in data science. Quite a few of the presentations introduced tools that data scientists can install and use right now. These include Dask, for out-of-core computation; xray, an exciting new library for multidimensional data; and two talks on using Docker for reproducible research. There was a whole track devoted to visualization, including a talk on VisPy, a GPU-accelerated visualization library, that gave the conference one of its big 'Wow!' moments. And the future of Jupyter (still better known as the IPython Notebook) was announced in a 5-minute lightning talk, between demos of bad-idea ways to call Assembly directly from Python and Notebook flight sim widgets (which Erik Tollerud immediately dubbed 'planes on a snake').
Not only did I get to learn a lot from other people's research and tools, I got to present my own. Jackie Kazil and I unveiled Mesa, an agent-based modeling framework we've been building with other contributors. The sprint schedule after the conference proper gave us a chance to work with new collaborators who discovered the package at our talk the day before. A couple of extra heads and a couple of extra days of work meant that Mesa came out of SciPy noticeably better than when it came in. Quite a few other tools came out of the sprints with improvements, including ones at the core of the scientific Python stack. Getting to work beside (and, later, drink beer with) such experienced developers was an educational opportunity in itself.
SciPy isn't just for physical scientists or hardcore programmers. If you use Python to analyze data or build models, or think you might want to, you should absolutely consider SciPy next year. The tutorials at the beginning can help everyone from novices to experts learn something new, and the sprints provide an opportunity to gain experience putting that knowledge to work. In between, the conference offers exposure to a wide variety of Python tools, and to the community that builds, maintains, and uses them. And even if you can't attend, the videos of the talks are always online.
Data Community DC and District Data Labs are excited to be hosting another Social Network Analysis with Python workshop on Saturday, August 15th, where you can learn how to use Python to construct and analyze a social network, compute cardinality, traverse and query graphs, compute clusters, and create visualizations. For more info and to sign up, go to the DDL course page. Register before August 1st for an early bird discount!
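To give a taste of the workshop's topics, here is a minimal sketch of constructing a network, computing cardinality (a node's degree), and traversing a graph. The toy network and helper functions below are invented for illustration and use only the standard library, not any particular graph library from the course.

```python
from collections import deque

# A tiny social network as an adjacency mapping: person -> set of friends.
graph = {
    "alice": {"bob", "carol"},
    "bob": {"alice", "dave"},
    "carol": {"alice"},
    "dave": {"bob"},
    "erin": set(),
}

def degree(graph, node):
    """Cardinality of a node's neighbor set."""
    return len(graph[node])

def bfs(graph, start):
    """Traverse the graph breadth-first, returning nodes in visit order."""
    seen, order, queue = {start}, [], deque([start])
    while queue:
        node = queue.popleft()
        order.append(node)
        for neighbor in sorted(graph[node]):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return order

print(degree(graph, "alice"))  # how many friends alice has
print(bfs(graph, "alice"))     # everyone reachable from alice
```

The same ideas scale up in a dedicated library, where clustering and visualization come built in; the workshop covers those steps as well.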
Data Community DC and District Data Labs are excited to be hosting another Natural Language Processing with NLTK workshop on Saturday, July 25th, where you can learn how to analyze text and create language-aware data products with Python. For more info and to sign up, go to the DDL course page. Register before July 11th for an early bird discount!
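The workshop builds on NLTK; as a small taste of the kind of analysis involved, the sketch below tokenizes text and counts word frequencies using only the standard library. The sample sentence and helper name are invented for illustration, not material from the course.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase a string and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())

text = "The quick brown fox jumps over the lazy dog. The dog sleeps."
tokens = tokenize(text)
freq = Counter(tokens)
print(freq.most_common(2))  # the two most frequent tokens
```

NLTK replaces the regex above with linguistically informed tokenizers, and adds tagging, parsing, and corpora on top of this basic pattern.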
Dennis D. McDonald, Ph.D., is an independent management consultant based in Alexandria, Virginia. His experience includes consulting company ownership and management, database publishing and data transformation projects, managing the consolidation of large systems, open data, statistical research, corporate IT strategy, and IT cost analysis. Dennis recently attended one of our Meetups, "Get Moving with Data - The US Department of Transportation and its Data," and was kind enough to write a guest post for the Data Community DC blog. This article was originally published at http://www.ddmcd.com/transportation.html.
Data products are usually software applications that derive their value from data by leveraging the data science pipeline, and that generate data through their operation. They aren’t apps with data, nor are they one-time analyses that produce insights; they are operational and interactive. The rise of these types of applications has directly contributed to the rise of the data scientist and the idea that data scientists are professionals “who are better at statistics than any software engineer and better at software engineering than any statistician.”
These applications have largely been built with Python. Python is flexible enough to support rapid development on many different types of servers and has a rich tradition in web applications. Python contributes to every stage of the data science pipeline, including real-time ingestion and the production of APIs, and it is powerful enough to perform machine learning computations. In this class we’ll produce a data product with Python, leveraging every stage of the data science pipeline to produce a book recommender.
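The core of a book recommender can be sketched in a few lines of Python. The sketch below implements simple user-based collaborative filtering; the toy ratings and function names are illustrative assumptions, not the class's actual code.

```python
from math import sqrt

def cosine_similarity(a, b):
    """Cosine similarity between two users' rating dicts."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[k] * b[k] for k in shared)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

def recommend(user, ratings):
    """Suggest books the user hasn't rated, weighted by similar users' scores."""
    scores = {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        sim = cosine_similarity(ratings[user], other_ratings)
        for book, score in other_ratings.items():
            if book not in ratings[user]:
                scores[book] = scores.get(book, 0.0) + sim * score
    return sorted(scores, key=scores.get, reverse=True)

# Toy data: user -> {book: rating}
ratings = {
    "ann": {"Dune": 5, "Neuromancer": 4},
    "bob": {"Dune": 5, "Hyperion": 4},
    "cam": {"Neuromancer": 2, "Hyperion": 5},
}
print(recommend("ann", ratings))  # books ann hasn't read, best first
```

A production data product wraps logic like this in ingestion, storage, an API, and a user interface, which is the full pipeline the class walks through.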
Data Science DC is hosting a Meetup, "Forecasting International Events," on Thursday, April 30th. GovBrain founder Brent M. Eastwood, PhD, wrote a nice introduction to the event and his company. Read more about how companies and researchers can benefit from forecasting techniques.
Eighty percent or more of the time spent on data science projects is spent acquiring data, cleaning it, and preparing it for analysis. That data can come from a variety of sources, including APIs or individual web pages. However, not all data is created equal. Once we have automated its acquisition, much of it requires lengthy cleaning and formatting before it can be used. In this course, you will learn how to obtain data via web scraping and APIs, how to clean and consolidate your data, and how to wrangle it into a database so that it is ready for analysis.
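As a small illustration of the acquire-clean-load cycle described above, the sketch below takes messy records (as they might arrive from an API), cleans and coerces them, and wrangles them into an in-memory SQLite table ready for analysis. The field names and sample data are invented for illustration.

```python
import json
import sqlite3

# Raw records as they might arrive from an API: messy strings and gaps.
raw = json.loads("""[
    {"city": " Washington ", "population": "672228"},
    {"city": "Arlington", "population": null},
    {"city": "Alexandria", "population": "153511"}
]""")

# Clean: trim strings, coerce numbers, drop records missing required fields.
clean = [
    {"city": r["city"].strip(), "population": int(r["population"])}
    for r in raw
    if r.get("city") and r.get("population") is not None
]

# Wrangle into a database so the data is ready for analysis.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cities (city TEXT, population INTEGER)")
conn.executemany("INSERT INTO cities VALUES (:city, :population)", clean)
total = conn.execute("SELECT SUM(population) FROM cities").fetchone()[0]
print(total)
```

Real projects swap the inline JSON for live API calls or scraped pages, but the cleaning and loading pattern stays the same, and it is where most of the effort goes.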
Saturday? Yup, we are changing things up and swapping out pizza and empanadas for bagels and other brunchy foods.
Eight teams of DC data scientists have come together for a three-month incubator to turn theory into practice on projects spanning healthcare, economics, the environment, and more. Learn from their experience implementing a deep learning network on commercially available hardware, deploying a D3.js visualization web app using Heroku, or building a desktop GUI with Python... plus much more! Enjoy brunch and drinks on us as the teams take us from concept to production on eight data products, and then join the judges by voting for the winner!
Data Community DC and District Data Labs are hosting a full-day Intro to Machine Learning with Python workshop on Saturday, February 28th. For more info and to sign up, go to http://bit.ly/1xQ9f4n. Register before February 13th for an early bird discount!
There have been three big changes at Data Community DC in recent days! Read more about changes to our Board and our line-up of Meetup groups.
Building off the success of the 2014 event, Health Datapalooza is once again opening its doors to app developers who would like to demo their app in front of an audience of more than 2,000 key healthcare industry executives, venture capitalists, providers and more!
On December 11th, Prof. Regina Nuzzo from Gallaudet University spoke at Data Science DC about problems with the p-value. The event was well received. If you missed it, the slides and audio are available. Here we provide Dr. Nuzzo's references and links from the talk, which are on their own a great resource for anyone thinking about how to communicate statistical reliability. (Note that the five topics she covered used examples from highly publicized studies of sexual behavior.)