Videos

DIDC MeetUp Review - The US Census Bureau Pushes Data

Data Community DC is excited to welcome Andrea to our roster of bloggers. Andrea's impressive bio is below, and she will be bringing energy, ideas, and enthusiasm to the Data Innovation DC organizational team.

Census Data is cool?

At least that's what everyone discovered at last night's Data Innovation DC MeetUp. The U.S. Census Bureau came in to "reverse pitch" their petabytes of data to a group of developers, data scientists, and data-preneurs at Cooley LLP in downtown DC.

First off, let's offer a massive thank you to the US Census Bureau, which sent five of its best and brightest to engage the community long into the evening. Who specifically did they send? Just take a look at the impressive list below:

census_contact

Editor's note: a special thank you to Logan Powell, who made this entire event possible.

And they brought the fantastic Jeremy Carbaugh (jcarbaugh [at] sunlightfoundation.com) from the Sunlight Foundation, a nonprofit working on making census data (and other government data) interesting, fun, and mobile. They have this sweet app called Sitegeist. You give it a location and it gives you impressive stats such as the history of the place, how many people are baby making, or just living the bachelor lifestyle; it even connects to Yelp and Wunderground, just in case you need the weather and a place to grab a brewski while you're at it. Further, Eric at the Census Bureau made a great point for everyone out there in real estate: you can use this app to show potential buyers how the demographics in the area have changed, good school districts, income levels, number of children per household, etc. You know you'll look good whipping out that tablet and showing them ;)

By the way, Sunlight created a very convenient Python wrapper for the Census API; you can pip install it from PyPI and check out the source on GitHub here (a round of applause for our Sunlight folks!). Did I mention that they are a non-profit doing this with far less funding than many others out there?
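For the curious, basic usage looks roughly like this; a minimal sketch assuming the wrapper's documented get() interface, with a placeholder API key:

```python
# A minimal sketch of Sunlight's census wrapper (pip install census us).
# The key below is a placeholder; request a free one from the Census Bureau.
from census import Census
from us import states

c = Census("YOUR_API_KEY")
# Total population (variable P001001) from the 2010 Decennial SF1 for Maryland:
rows = c.sf1.get(("NAME", "P001001"), {"for": "state:%s" % states.MD.fips})
print(rows)
```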

Sitegeist is nice, but exactly how accessible is the Census data? I am glad you asked. The Census Bureau has two approaches, the American FactFinder and an API, both easy to use. The FactFinder is good for perusing what you may find interesting before actually grabbing the data for yourself. The API is like the Twitter version 1 API: you get a key and use stateless HTTP GET requests to pull the data via the web. For those non-API folks, I'll be posting a how-to shortly.
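To give a flavor of those GET requests, here is a hedged sketch using Python's requests library; the endpoint and variable name follow the Bureau's public 2010 SF1 documentation and may have changed since:

```python
# A stateless GET against the Census API; P001001 is total population
# and "state:24" is Maryland's FIPS code. The key is a placeholder.
import requests

params = {
    "get": "P001001,NAME",
    "for": "state:24",
    "key": "YOUR_API_KEY",
}
resp = requests.get("http://api.census.gov/data/2010/sf1", params=params)
resp.raise_for_status()
print(resp.json())  # rows come back as JSON arrays, header row first
```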

The Census Bureau also has its own fun mobile app called America's Economy.

Alright, so we've got some data and we've got some ways to get it, but what's up with the reverse pitch thing? This was the best part, as everyone had awesome ideas.

Some questions included:

Can we blend WorldBank and Federal Reserve Bank data to get meaningful results?

This came from a guy who was already building some nice apps around WB and Fed data. The general consensus was "yes," a lot of business value can come from that, but they need folks like us to come up with use-cases. So, thoughts? Please comment and tinker away.
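To get the tinkering started, here is one hedged sketch of what a blend might look like in Python using the pandas-datareader package; the series codes and the year-level join are my own example, not something shown at the meetup:

```python
# Blend a monthly FRED series with an annual World Bank indicator by year.
import pandas_datareader.data as web
from pandas_datareader import wb

# US unemployment rate (monthly) from FRED:
unrate = web.DataReader("UNRATE", "fred", start="2000-01-01")
# US GDP per capita (annual) from the World Bank:
gdp = wb.download(indicator="NY.GDP.PCAP.CD", country="US", start=2000, end=2013)

annual = unrate.resample("A").mean()          # average the monthly series by year
annual.index = annual.index.year              # index by plain integer year

gdp = gdp.reset_index().set_index("year")     # World Bank years arrive as strings
gdp.index = gdp.index.astype(int)

blended = gdp.join(annual)                    # one row per year, both indicators
print(blended.head())
```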

What about the geospatial aspects of the data?

There were a lot of questions around the GIS mapping data and some problems with drilling down on the geospatial data to block sizes or very small lots of land. People seem really interested in getting this data for things like understanding how diseases spread, patterns of migration, etc. The Census folks said that with the longer-term surveys you can definitely get down to the block level, but because boundaries and borders can be defined differently across the nation, it is very difficult to normalize the data. Another use case? A herculean effort? Hmm... food for thought. Also, shortly after the event, someone posted this on geo-normalization in Japan. Thanks Logan!

Editor's note: More information on US Census Grids can be found here.

How does Census data help commercial companies?

There was a great established use case where the Census helped the retailer Target understand its demographics. That blew me away. The gov't and a private retail company working to make a better profit, a better product? This definitely got my creative juices flowing; hopefully it will get everyone out there cogitating too.

https://www.youtube.com/watch?v=jgsdQxTv5kY

or, check out this case study from the National Association of Homebuilders:

http://www.youtube.com/watch?v=CBDmE5Nj0BY

and last but not least, an example of Census data helping disaster relief (not really commercial but Logan didn't get a chance to show all of his videos):

http://www.youtube.com/watch?v=PaEu8-xH9LE

We finally had people talking about the importance of longitudinal studies.

What is different now for our nation in terms of demographics, culture, and geography from 20, 30, or 50 years ago? Just imagine some really cool heat map or time series visualization of how Central Park in NY or Rock Creek in DC has changed... yes, I am saying this so someone actually goes out and gives that one a go. Don't worry, you can take the credit ;)

Oh, and I almost forgot: due to obvious privacy issues, a lot of the data is pre-processed, so you can't stalk your ex-boss/boyfriend/girlfriend. But, listen up! If you are in school, doing research, and want to get your hands on the microdata, you can apply. Go to this link and check it out (http://www.census.gov/ces/rdcresearch/howtoapply.html). For those of you stuck on a thesis topic in any domain that may need information about society, cough cough, nudge nudge ...

So there you have it: these are the kinds of meetups happening at Data Innovation DC. I don't know about you, but I definitely have a new perspective on government data. I also feel a little more inclined to open my door when those Census folks drop by and give them real answers.

Please comment as you see fit and send me questions. Also, JOIN Data Innovation DC and check out Data Community DC with all of the other related data meetup groups. Let us know what kind of information you want and what issues/topics you want us to address.

I'm new to the blog/review game but will continue to review meetups and some hot topics, podcasts, etc., that I think need to be checked out. Let me know if you want me to speak to anything in particular.

Hadoop for Data Science: A Data Science MD Recap

On October 9th, Data Science MD welcomed Dr. Donald Miner as its speaker to talk about doing data science work and how the Hadoop framework can help. To start the presentation, Don was very clear about one thing: Hadoop is bad at a lot of things. It is not meant to be a panacea for every problem a data scientist will face.

With that in mind, Don spoke about the benefits that Hadoop offers data scientists. Hadoop is a great tool for data exploration: it can easily handle filtering, sampling, and anti-filtering (summarization) tasks. When speaking about these concepts, Don expressed the benefits of each and included some anecdotes that helped to show real-world value. He also spoke about data cleanliness in a very Baz Luhrmann "Wear Sunscreen" sort of way, offering that as his biggest piece of advice.
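To make the filtering and sampling ideas concrete, here is a toy Hadoop Streaming mapper in Python; it is my own illustration, not Don's code, and the tab-delimited layout and 1% rate are made up:

```python
#!/usr/bin/env python
# Toy Hadoop Streaming mapper: filter records by a field value, then
# randomly sample ~1% of the matches for local exploration.
import random
import sys

SAMPLE_RATE = 0.01

for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 3:
        continue                      # skip malformed records (cleanliness!)
    if fields[2] == "MD" and random.random() < SAMPLE_RATE:
        sys.stdout.write(line)        # identity output: emit the sampled record
```

Run under Hadoop Streaming with zero reduce tasks, a mapper like this scans the full data set in parallel and leaves behind a sample small enough to explore on a laptop.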

Don then transitioned to the more traditional data science problems of classification (including NLP) and recommender systems.

The talk was very well received by DSMD members. If you missed it, check out the video:

http://www.youtube.com/playlist?list=PLgqwinaq-u-Mj5keXlUOrH-GKTR-LDMv4

Our next event will be November 20th, 2013 at the Loyola University Maryland Graduate Center, starting at 6:30 PM. We will be digging deeper into the daily lives of three data scientists. We hope you will join us!

2013 September DSMD Event: Teaching Data Science to the Masses

For Data Science MD's September meetup, we were very fortunate to have the very talented and very passionate Dr. Jeff Leek speak about his experiences teaching data science through the online learning platform Coursera. This was also a unique event for DSMD because it was the first meetup to feature only one speaker. Having one speaker talk for a whole hour can be a disaster if the speaker is unable to keep the attention of the audience. However, Dr. Leek is a very dynamic and engaging speaker and had no problem keeping the attention of everyone in the room, including a couple of middle school students.

For those of you who are not familiar with Dr. Leek, he is a biostatistician at Johns Hopkins University as well as an instructor in the JHU biostatistics program. His biostatistics work typically entails analyzing sequenced human genome data to provide insights to doctors and patients in the form of raw data and advanced visualizations. However, when he is not revolutionizing the medical world or teaching the great biostatisticians of tomorrow at JHU, you may find him teaching his course on Coursera or providing new content to his blog, Simply Statistics.

Now, on to the talk. Johns Hopkins, and specifically Dr. Leek, got involved in teaching a Coursera course because they have constantly been looking at ways to improve learning for their students. They had been "flipping the classroom" by posting lectures to YouTube so that students could review the lecture material before class and then use classroom time to dig deeper into specific topics. Because online videos are such a vital component of Massive Open Online Courses (MOOCs), it is no surprise that they took the next leap.

But just in case you think that JHU and Dr. Leek are new to this whole data science "thing," check out their Kaggle team's results for the Heritage Health Prize.

jhu-kaggle-team

Even though their team fell a few places when run on the private data, they still had a very impressive showing considering there were 1,358 teams that entered and over 20,000 entries. But what exactly does data science mean to Dr. Leek? Check out his expanded components-of-data-science chart, which differs from similar charts of other data scientists by also showing the root disciplines of each component.

expanded-fusion-of-data-science

But what does the course look like?

course-setup

He covers topics such as types of analyses, how to organize a data analysis, and data munging, as well as others like:

concepts-machine-learning

statistics-concept

One of the interesting things to note is that he also shows examples of poor data analysis attempts. There is a core problem with the statistics example above (pointed out by high school students). Below is another example:

concepts-confounding

And this course, along with two other courses taught by other JHU faculty, Computing for Data Analysis and Mathematical Biostatistics Bootcamp, has had a very positive response.

jhu-course-enrollment

But how do you teach that many people effectively? That is where the power of Coursera comes in; JHU could have chosen other providers like edX or Udacity but decided to go with Coursera. The videos make it easy to convey the material, and message boards provide a mechanism for asking questions. Dr. Leek even had students answering questions for other students, so that all he had to do was validate the responses. But he also pointed out that his class's message board was just like all other message boards and followed the 1/98/1 rule: 1% of people respond in a mean, unhelpful way, 1% of people are very nice and very helpful, and the other 98% don't care and don't really respond.

One of the most unique aspects of Coursera is that it scales to tens of thousands of students by using peer grading. Each student submits one assignment and grades four others, and the final score for each student is the median of the four scores from the other students. The rubric used in Dr. Leek's class is below.

grading-rubric
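In code, the rule as described is a one-liner; a trivial sketch with Python's statistics module and made-up scores:

```python
# Median-of-four peer grading for one hypothetical submission.
from statistics import median

peer_scores = [8, 9, 3, 9]   # four grades from four fellow students
print(median(peer_scores))   # 8.5 -- an even-length median averages the middle two
```

Note how the single harsh grade barely moves the result, which is presumably the point of taking the median.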

The result of this grading policy, based on Dr. Leek's analysis, is that good students received good grades, poor students received poor grades, and middle students' grades fluctuated a fair amount. So the policy mostly works, but there is still room for improvement.

But why do Johns Hopkins and Dr. Leek even support this model of learning? They do have full-time jobs that involve teaching, after all. Well, besides being huge supporters of open source technology and open learning, they also see many other reasons for supporting this paradigm.

partial-motivation

Check out the video for the many other reasons why JHU supports this paradigm. And, while you are at it, see if you can figure out whether the x and y axes are related in some way. This was our data science/statistics problem for the evening. The answer can also be found in the video.

stat-challenge

http://www.youtube.com/playlist?list=PLgqwinaq-u-NNCAp3nGLf93dh44TVJ0vZ

We also got a sneak peek at a new tool that integrates directly into R: swirl. Look for a meetup or blog post about this tool in the future.

swirl

Our next meetup is on October 9th at Advertising.com in Baltimore, beginning at 6:30 PM. We will have Don Miner speak about using Hadoop for data science. If you can make it, come out and join us.

Data Science MD August Recap: Sports Analytics Meetup

For August's meetup, Data Science MD hosted a discussion on one of the most popular fields of analytics, the wide world of sports. From Moneyball to the MIT Sloan Sports Analytics Conference, there has been much interest from researchers, team owners, and athletes in the area of sports analytics. Enthusiastic fans eagerly pore over recent statistics and crunch the numbers to see just how well their favorite sports team will do this season.

One issue that sports teams must deal with is the reselling of tickets to different events. Joshua Brickman, Director of Ticket Analytics for the Washington Wizards, led off the night by discussing how the Wizards are addressing secondary markets such as StubHub. One of the major initiatives taking place is a joint venture between Ticketmaster and the NBA to create a unified ticket exchange for all teams. Tickets, like most items, operate on a free market, where customers are free to purchase from whomever they choose.

brickman-slide-1

Joshua went on to explain that teams could either try to beat the secondary markets by limiting printing, changing fee structures, and offering guarantees, or they could instead take advantage of the transaction data received each week from the league across secondary markets.

brickman-slide-2

Josh outlined that the problems with the data were that it covered only the Wizards, it was only received weekly, and it didn't take dynamic pricing changes into consideration. So instead they built their own models and queries to create heat maps. The first heat map shows the inventory sold. For this particular example, the Wizards had a sold-out game.

brickman-slide-3

Possibly of more importance was the heat map showing the premium at which tickets sold on the secondary market. In certain cases, the prices were actually lower than face value.

brickman-slide-4

As with most data science products, the visualization of the results is extremely important. Joshua explained that the graphical heat maps make the data easily digestible for sales staff and directors, and supplement their numerical tracking. Their current process involves combining SQL queries with hand-drawn shape files. Joshua also explained how they can track secondary markets and calculate current dynamic prices to spot discrepancies.

brickman-slide-5

Joshua ended by describing how future work could incorporate historical data and current secondary market prices to adjust pricing to more closely reflect current conditions.

brickman-slide-6

Our next speaker for the night was Tom Duncan, the Player Information Analyst for our very own Baltimore Orioles. Tom began by describing how PITCHf/x is installed in every major league stadium and provides teams with information on the location, velocity, and movement of every pitch. Using heat maps, Tom was able to show how the strike zone has changed between 2009 and 2013.

duncan-slide-1

Tom then described the R code necessary to generate the heat maps.

duncan-slide-2
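The talk showed R; for readers who live in Python, a rough analogue of a pitch-location heat map, with synthetic pitch locations standing in for real PITCHf/x data, might look like:

```python
# A pitch-location heat map with matplotlib; px/pz are horizontal location
# and height above the plate in feet (synthetic here, not PITCHf/x data).
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
px = rng.normal(0.0, 0.8, 5000)   # horizontal location, catcher's view
pz = rng.normal(2.4, 0.7, 5000)   # height above the plate

plt.hist2d(px, pz, bins=40, range=[[-2, 2], [0.5, 4.5]], cmap="hot")
plt.colorbar(label="pitches")
plt.xlabel("horizontal location (ft)")
plt.ylabel("height above plate (ft)")
plt.title("Pitch density (synthetic data)")
plt.show()
```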

Since different batters have different strike zones, locations needed to be rescaled to define new boundaries. For instance, Jose Altuve, who is 5'5", has a relative Z location that is shifted slightly higher.

duncan-slide-3
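The rescaling itself is straightforward; a hedged sketch assuming PITCHf/x-style sz_bot/sz_top zone-boundary fields, not Tom's actual code:

```python
def normalized_z(pz, sz_bot, sz_top):
    """Map pitch height pz so every batter's strike zone spans [0, 1]."""
    return (pz - sz_bot) / (sz_top - sz_bot)

# The same 2.5 ft pitch sits higher in a short batter's zone than in a tall
# batter's (the boundary numbers here are invented for illustration):
print(normalized_z(2.5, sz_bot=1.4, sz_top=3.0))  # ~0.69
print(normalized_z(2.5, sz_bot=1.6, sz_top=3.4))  # 0.50
```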

Tom then went on to describe the impact that home plate umpires have on the game. On average, 155 pitches are called per game, with 15 being within one inch of the strike zone and 31 within two inches. With a game sometimes being determined by a single pitch, the decisions that a home plate umpire makes are very important. A given pitch is worth approximately 0.13 runs.

duncan-slide-4

Next, Tom showed various heat map comparisons that highlighted differences between umpires, batters, and counts. One of the most surprising results was the difference when the batter faced a 0-2 count versus a 3-0 count. I suggest readers look at all the slides to see the other interesting results.

duncan-slide-5

While the heat maps provide a lot of useful information, it is sometimes interesting to look at certain pitches of interest. By linking to video clips, Tom demonstrated how an interactive strike scatter plot could be created. Please view the video to see this demonstration.

duncan-slide-6

Tom concluded by saying that PITCHf/x is very powerful, and yes, umpires have a very difficult job!

Slides can be found here for Joshua and here for Tom.

The video for the event is below:

[youtube=http://www.youtube.com/playlist?list=PLgqwinaq-u-NLZhSml9VHXVgHEh6l7wQH]


Data Science MD July Recap: Python and R Meetup

For July's meetup, Data Science MD was honored to have Jonathan Street of NIH and Brian Godsey of RedOwl Analytics come discuss using Python and R for data analysis.

Jonathan started off by describing the growing ecosystem of Python data analysis tools, including NumPy, matplotlib, and pandas.

He next walked through a brief example demonstrating NumPy, pandas, and matplotlib that he made available via the IPython Notebook Viewer.

The second half of Jonathan's talk focused on the problem of using clustering to identify scientific articles of interest. He needed to a) convert PDFs to text, b) extract sections of the documents, c) cluster them, and d) retrieve new material.

Jonathan used the PyPDF library for PDF conversion and then used the NLTK library for text processing. For a thorough discussion of NLTK, please see Data Community DC's multi-part series written by Ben Bengfort.

Clustering was done using scikit-learn, which identified seven groups of articles. From these, Jonathan was then able to retrieve other similar articles to read.
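Pulled together, the pipeline might look something like the condensed sketch below. The package roles match the talk, but the file paths, the modern pypdf-style reader API, and the TF-IDF step are my assumptions:

```python
# PDF -> text -> cleaned tokens -> TF-IDF -> k-means, end to end.
from pypdf import PdfReader                      # successor to the PyPDF library
from nltk.corpus import stopwords                # requires nltk.download("stopwords")
from nltk.tokenize import word_tokenize          # requires nltk.download("punkt")
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

def pdf_to_text(path):
    """Concatenate the extracted text of every page in a PDF."""
    return " ".join(page.extract_text() or "" for page in PdfReader(path).pages)

def clean(text):
    """Lowercase, tokenize, and drop non-words and English stopwords."""
    stops = set(stopwords.words("english"))
    return " ".join(w for w in word_tokenize(text.lower())
                    if w.isalpha() and w not in stops)

paths = ["article1.pdf", "article2.pdf", "article3.pdf"]   # placeholder corpus
docs = [clean(pdf_to_text(p)) for p in paths]
X = TfidfVectorizer().fit_transform(docs)
labels = KMeans(n_clusters=7, n_init=10).fit_predict(X)    # seven groups, as in the talk
print(labels)
```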

Overall, by combining several Python packages to handle text conversion, text processing, and clustering, Jonathan was able to create an automated personalized scientific recommendation system. Please see the Data Community DC posts on Python data analysis tutorials and Python data tools for more information.

Next to speak was Brian Godsey of RedOwl Analytics, who presented their social network analysis work. He first described the problem of identifying misbehavior in a financial firm. Their goals are to detect patterns in employee behavior, measure differences between various types of people, and ultimately find anomalies in behavior.

In order to find these anomalies, they model behavior based on patterns in communications and estimate model parameters from a data set and set of effects.

Brian then revealed that while implementing their solution they developed an R package called rRevelation that allows a user to import data sets, create covariates, specify a model's behavioral parameters, and estimate the parameter values.

To conclude his presentation, Brian demonstrated using the package against the well-known Enron data set and discussed how larger data sets require using other technologies such as MapReduce.

http://www.youtube.com/watch?v=1rFlqD3prKE&list=PLgqwinaq-u-Piu9tTz58e7-5IwYX_e6gh

Slides can be found here for Jonathan and here for Brian.

Data Science MD Discusses Health IT at Shady Grove

For its June meetup, Data Science MD explored a new venue, The Universities at Shady Grove. And what better way to venture into Montgomery County than to spend an evening discussing one of its leading sectors? That's right, an event all about healthcare. And we tackled it from two different sides.

http://www.youtube.com/playlist?list=PLgqwinaq-u-Ob2qS9Rt8uCXeKtT830A7e

The night started with a presentation from Gavin O'Brien of NIST's National Cybersecurity Center of Excellence. He spoke about creating a secure mobile Health IT platform that would allow doctors and nurses to share relevant pieces of information in a manner that is secure and follows all guidelines and policies governing how health data must be handled. Gavin's presentation focused on securing 802.11 links, as opposed to cellular or other types of wireless links, since this is a good first step and is immediately practical when deployed within a single building like a hospital. He discussed the technological challenges, from encrypting data during transmission, rather than sending it in the clear where it can be intercepted, to creating Access Control Lists so that only the correct people see a patient's data.

As his talk progressed, one thought was constantly in the back of my mind: how can this architecture protect the data as the policies stipulate while still allowing the data to be distributed so that analytics can be run on it? For instance, a hospital should be interested in trends among its patients: whether patients all had complications after receiving the same drug or family of drugs (perhaps from the same batch), when patients have the most problems and therefore require the most attention, and when a bacteria or virus may be loose in the hospital, further complicating patients' ailments. The architecture may allow these types of analytics, but they were not specifically discussed during Gavin's presentation. If you have any ideas about how a compliant architecture could support these analytics, or about potential problems with running them, please leave a comment on this post.

The final speaker of the night was Uma Ahluwalia, the Director of Health and Human Services for Montgomery County. Uma spoke about the various avenues county residents have to report problems, and how their needs often cross many different segments of health and human services, usually requiring their stories to be retold each time. In her vision, a resident/patient could report their problem to any one of six segments, and then all of the segments could see the information without the patient having to reiterate their story over and over again.

One big problem with this solution is that data would be shared across many groups, giving county workers access to more information than health regulations allow. However, Montgomery County sees each segment as part of one organization, and therefore the data can be shared internally among all employees within that organization. While this should help reduce the amount of time patients spend retelling their stories, it still does not provide an open platform for data scientists. Uma also had a potential solution to that problem: volunteers. Volunteers can sign non-disclosure agreements allowing them access to patient data to help create useful analytics, thereby opening the problem space to many more minds in the hopes of producing truly revolutionary analytics. Perhaps you will be the next great mind that unlocks the meaning behind a current social issue.

Finally, Data Science MD needs to acknowledge a few key people and groups that contributed to this meetup. Mary Lowe and Melissa Marquez from the Universities at Shady Grove were instrumental in making it happen, helping to secure the room and providing the food and A/V setup. Dan Hoffman, the Chief Innovation Officer for Montgomery County, also provided a great deal of support. And John Rumble, a DSMD member, took the lead in getting DSMD beyond the Baltimore/Columbia corridor. Thanks so much to all of these key people.

If you want to catch up on previous meetups, please check out our YouTube channel.

Please check out our July meetup, where we will discuss analysis techniques in Python and R at Betamore.

Data Science MD Unveils YouTube Channel

[youtube http://www.youtube.com/watch?v=videoseries?list=PLgqwinaq-u-OjuL89qqV4lto2PG3QwjbM&w=600&h=360]

Data Science MD, in an effort to provide additional value to its members, has started a YouTube channel, DataScienceMD, to host videos of talks presented at Meetup events. Now, when members can't attend an event due to a scheduling conflict or being out of town, they can view the videos after the fact to stay in the loop. However, we know the more likely scenario: seeing the talks in person will not be enough and you will want to see them again and again. (It's OK, we won't tell the presenters how often you are watching them.)

The presentations are available in two formats: individual videos that cover one specific presentation, and playlists that group all presentations from an event together, making it easy to relive the event in its entirety. The default view when first visiting the channel shows the most recent activity. Clicking the Videos link just below the channel title shows individual presentations. To see the playlists, simply change the Uploads box to Playlists.

The playlist above is from our May Meetup, which featured Cloudera consultants Joey Echeverria and Sean Busbey discussing an infrastructure option that can make analyzing Twitter data quick and simple, as well as an introduction to one of the many features of Apache Mahout. These were not just static presentations; they also included live demonstrations and queries against data stored within the infrastructure, and it was all captured in the videos. Check them out!

Data Community DC Video Series Kicks Off: Dr. Jesse English Talks NLP and Text Processing

We are excited to announce the first in a new series of posts and a brand new initiative: Data Community DC Videos! We are going to film and publish online videos (and separate audio, resources permitting) of as many talks from Data Community DC meetups as possible. Yes, we want you to experience the events in person, but we realize that not everyone who wants to be a part of our community can attend every single event. To kick this off, we have a fantastic video of Dr. Jesse English passionately discussing a brand new, open source framework, WIMs (Weakly Inferred Meanings), a novel approach to creating structured meaning representations for semantic analyses. Whereas a TMR (text meaning representation) requires a large, domain-specific knowledge base and significant computation time, WIMs cover a limited scope of possible relationships. The limitation is intentional and allows for better performance, but WIMs still carry enough relationships for most applications. Additionally, the creation of a bespoke knowledge base and microtheory is not required; the novel pattern-matching technique means that available ontologies like WordNet provide enough coverage. WIMs are open source and available now, and are truly a breakthrough in semantic processing.
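I won't guess at WIM's own API here, but to give a flavor of the off-the-shelf ontology coverage it leans on, here is a WordNet relationship lookup via NLTK; a real API, used purely as illustration:

```python
# Look up one sense of "dog" in WordNet and walk one relationship type.
from nltk.corpus import wordnet as wn   # requires nltk.download("wordnet")

dog = wn.synsets("dog", pos=wn.NOUN)[0]
print(dog.definition())                      # gloss for this sense
print([h.name() for h in dog.hypernyms()])   # parent concepts, e.g. canine.n.02
```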

Dr. Jesse English holds a PhD in computer science from UMBC and has specialized in natural language processing, machine learning, and machine reading. As the Chief Science Officer at Unbound Concepts, Jesse is focused on automatically extracting semantically rich meaning from literature and applying that knowledge to the company's big-data-driven machine learning algorithm. Before his work at Unbound Concepts, Jesse was a research associate at UMBC, focusing on automatically bridging the knowledge acquisition bottleneck through machine reading, as well as developing agent-based conversation systems.

[vimeo 61955058]