recommended reading

Ensemble Learning Reading List

Tuesday's Data Science DC Meetup features GMU graduate student Jay Hyer's introduction to Ensemble Learning, a core set of Machine Learning techniques. Here are Jay's suggestions for readings and resources related to the topic. Attend the Meetup, and follow Jay on Twitter at @aDataHead! Also note that all images contain Amazon Affiliate links, so DC2 receives a small percentage of the proceeds should you purchase a book. Thanks for the support!

L. Breiman, J. Friedman, R.A. Olshen, and C.J. Stone. Classification and Regression Trees. Chapman and Hall/CRC, Boca Raton, FL, 1984.

This book does not cover ensemble methods, but it is the book that introduced classification and regression trees (CART), which are the basis of Random Forests. Classification trees are also commonly used as the base learners in the AdaBoost algorithm. CART methods are an important tool for a data scientist to have in their skill set.

L. Breiman. Random Forests. Machine Learning, 45(1):5-32, 2001.

This is the article that started it all: it introduced the Random Forests algorithm.

T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning, 2nd ed. Springer, New York, NY, 2009.

This book is light on application and heavy on theory. Nevertheless, chapters 10, 15 and 16 give very thorough coverage of boosting, Random Forests and ensemble learning, respectively. A free PDF version of the book is available on Tibshirani’s website.

G. James, D. Witten, T. Hastie, and R. Tibshirani. An Introduction to Statistical Learning: with Applications in R. Springer, New York, NY, 2013.

As the name and co-authors imply, this is an introductory version of the previous book in this list. Chapter 8 covers bagging, Random Forests and boosting.

Y. Freund and R.E. Schapire. A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting. Journal of Computer and System Sciences, 55(1):119-139, 1997.

This is the article that introduced the AdaBoost algorithm.

G. Seni and J. Elder. Ensemble Methods in Data Mining: Improving Accuracy Through Combining Predictions. Morgan & Claypool Publishers, USA, 2010.

This is a good book with great illustrations and graphs. There is a lot of R code too!

Z.H. Zhou. Ensemble Methods: Foundations and Algorithms. Chapman & Hall/CRC, 2012.

This is an excellent book that covers ensemble learning from A to Z and is well suited for anyone from an eager beginner to a critical expert.

Resources and Readings for Big Data Week DC Events

We asked presenters at the various Big Data Week events here in the DC area to send us any books, articles, or blog posts that they recommend that are related to their presentations. We hope you find this list of resources useful. (And we will add to it if additional speakers provide suggestions!)

Apr. 22nd -- INFORMS MD -- Getting Started with Big Data Analytics

Brian Keller recommends:

Apr. 22nd -- Data Visualization DC -- Big Data Visualization

Abhijit Dasgupta recommends:

Ben Shneiderman recommends several introductory texts:

Apr. 23rd -- Data Science DC -- Natural Language Processing and Big Data

Ben Bengfort suggests:

Thomas Rindflesch recommends several papers related to his work on Semantic MEDLINE, including:

Apr. 23rd -- Big Data DC -- Challenges of Visualizing and Exploring Big Data

Will Gorman lists a number of important tools and technologies used in big data systems and data visualization:

Apr. 24th -- Data Business DC -- Big Data Infrastructure

Charles Scyphers of Oracle recommends, for the business and non-technical attendees:

Tom Zeng of Intridea suggested the following resources, which are more technical and geared more toward developers/technical managers.

And, Jim Fiori at MapR recommends:


Please note that Data Community DC is an Amazon Affiliate. Thus, if we recommend a particular book and link to it on Amazon, and you click on that link and then buy the book, we get a very small percentage of the proceeds (and eventually retire to a very small island in the Caribbean :) ).