The DC Data Source Weekly: Week 1 - Data for Testing Recommendation Algorithms

Data Community DC is excited to bring you another weekly blog post, the DC Data Source Weekly, to complement the excellent Data Visualization by Sean Gonzalez and the Weekly Roundup of Top Data Stories by Tony Ojeda. The DC Data Source Weekly will use as few words as possible to introduce and point our readers to fascinating sources of free data. Even better, these data sources will often be themed to match upcoming or previous Data Science DC, Data Business DC, and/or R Users DC meetup events.

To kick off, we will look at sample data sets for testing out recommendation systems to align with January 28th's meetup: Recommendation Systems in the Real World.

 

Jester

This dataset contains 4.1 million continuous ratings (-10.00 to +10.00) of 100 jokes from 73,421 users. With only 100 jokes to rate, it differs from other datasets by having a much smaller number of rateable items.

Book-Crossing Dataset

This dataset is from the Book-Crossing community and contains 1,149,780 ratings of 271,379 books from 278,858 users.
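
To make the contrast between Jester and Book-Crossing concrete, here is a minimal sketch that computes how full each user-by-item rating matrix is, using only the counts quoted above. The density function is our own illustration, not something shipped with either dataset, and the Jester rating count is the rounded 4.1 million figure.

```python
def density(num_ratings, num_users, num_items):
    """Fraction of the user-by-item matrix that actually holds a rating."""
    return num_ratings / (num_users * num_items)

# Counts taken from the dataset descriptions above (Jester's is rounded).
jester = density(4_100_000, 73_421, 100)
book_crossing = density(1_149_780, 278_858, 271_379)

print(f"Jester density: {jester:.1%}")                # Jester density: 55.8%
print(f"Book-Crossing density: {book_crossing:.4%}")  # Book-Crossing density: 0.0015%
```

A dense matrix like Jester's is forgiving for neighborhood-based methods, while a very sparse one like Book-Crossing's is a better test of how an algorithm copes with users who share almost no rated items.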

Yandex

The Relevance Prediction Challenge provides a unique opportunity to consolidate and scrutinize the work from industrial labs on predicting the relevance of URLs using user search behavior. It provides a fully anonymized dataset shared by Yandex which has clicks and relevance judgments. Predicting relevance based on clicks is difficult and remains an unsolved problem. This Challenge and the shared dataset will enable a whole new set of researchers to conduct such experiments.

The dataset includes user sessions extracted from Yandex logs, with queries, URL rankings and clicks. Unlike previous click datasets, it also includes relevance judgments for the ranked URLs, for the purposes of training relevance prediction models. To allay privacy concerns, the user data is fully anonymized: only meaningless numeric IDs of queries, sessions, and URLs are released. The queries are grouped only by sessions and no user IDs are provided. The dataset consists of several parts.

 

hetrec2011-movielens-2k

This is an extension of the MovieLens10M dataset, which contains personal ratings and tags about movies. From the original dataset, only those users with both ratings and tags have been maintained.

In the dataset, the movies are linked to the Internet Movie Database (IMDb) and RottenTomatoes (RT) movie review systems. Each movie has its IMDb and RT identifiers, English and Spanish titles, picture URLs, genres, directors, actors (ordered by "popularity"), RT audience and expert ratings and scores, countries, and filming locations.

http://ir.ii.uam.es/hetrec2011/datasets/delicious/readme.txt

hetrec2011-delicious-2k

This dataset has been obtained from the Delicious social bookmarking system. Its users are interconnected in a social network generated from Delicious "mutual fan" relations. Each user has bookmarks, tag assignments (i.e. tuples [user, tag, bookmark]), and contact relations within the dataset's social network. Each bookmark has a title and URL.

hetrec2011-lastfm-2k

This dataset has been obtained from the Last.fm online music system. Its users are interconnected in a social network generated from Last.fm "friend" relations. Each user has a list of most listened music artists, tag assignments (i.e. tuples [user, tag, artist]), and friend relations within the dataset's social network. Each artist has a Last.fm URL and a picture URL.
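
The tag-assignment tuples in the Delicious and Last.fm datasets are easy to work with once they are in memory. The sketch below is only an illustration: the sample rows, the IDs, and the helper names build_tag_profiles and shared_tags are invented here and do not reflect the actual file layout, which is documented in each dataset's readme.

```python
from collections import Counter, defaultdict

# Hypothetical (user, tag, item) rows, invented for illustration only.
tag_assignments = [
    ("u1", "rock", "artist_a"),
    ("u1", "rock", "artist_b"),
    ("u1", "jazz", "artist_c"),
    ("u2", "jazz", "artist_c"),
    ("u2", "jazz", "artist_d"),
]

def build_tag_profiles(assignments):
    """Count how often each user applied each tag."""
    profiles = defaultdict(Counter)
    for user, tag, _item in assignments:
        profiles[user][tag] += 1
    return profiles

def shared_tags(profiles, user_a, user_b):
    """Tags that both users have applied at least once."""
    return set(profiles[user_a]) & set(profiles[user_b])

profiles = build_tag_profiles(tag_assignments)
print(profiles["u1"].most_common(1))      # [('rock', 2)]
print(shared_tags(profiles, "u1", "u2"))  # {'jazz'}
```

Per-user tag counts like these are one simple starting point for a content-based recommender, and the friend or "mutual fan" relations could later be layered on top to compare a user's tags with those of their contacts.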

A thank you goes out to Bryce Nyeggen for suggesting several of the above data sources.