by Guest Blogger Jules J. Berman
My book, Principles of Big Data: Preparing, Sharing, and Analyzing Complex Information, was published this month, and Sean Murphy invited me to write a few words about Big Data (and to plug the book).
I've been reading the Big Data posts on Data Community DC, trying to come up with something that you don't already know and some way to distinguish my Big Data book from the competition. I can say that there are a few basic themes that I cover extensively and that are sometimes downplayed or ignored by other authors.
1. Identifiers: you cannot create a good Big Data resource without them.
I actually think of a Big Data resource as a system of identifiers to which data is attached. If you don't have a system that associates a unique identifier with each data object (minimally, a piece of data, a metadata descriptor, and preferably the name of a class to which the data object belongs), then you cannot create a Big Data resource that accommodates complex data objects or that can be sensibly merged with other Big Data resources or with legacy data. I devote a chapter just to identifiers.
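As a rough illustration of the idea, here is a minimal Python sketch of a resource built as a system of identifiers to which data is attached. The names DataObject and Resource, the use of random UUIDs as identifiers, and the sample values are assumptions made for this example; they are not taken from the book.

```python
import uuid
from dataclasses import dataclass, field


@dataclass(frozen=True)
class DataObject:
    """A piece of data, its metadata descriptor, and the class it belongs to."""
    data: str
    metadata: str      # descriptor that says what the data means
    class_name: str    # name of the class to which the object belongs
    # Assumption: a random UUID stands in for whatever identifier scheme is used.
    identifier: str = field(default_factory=lambda: str(uuid.uuid4()))


class Resource:
    """Hypothetical resource: a lookup from unique identifier to data object."""

    def __init__(self):
        self._objects = {}

    def add(self, data, metadata, class_name):
        obj = DataObject(data, metadata, class_name)
        self._objects[obj.identifier] = obj
        return obj.identifier

    def get(self, identifier):
        return self._objects[identifier]


resource = Resource()
first_id = resource.add("172", "height_in_centimeters", "Patient")
second_id = resource.add("172", "height_in_centimeters", "Patient")
# Two objects holding identical data are still distinguishable by identifier.
print(first_id != second_id)
```

Because every object is reached through its identifier, two resources built this way could in principle be merged by combining their lookups, provided the identifier scheme makes collisions practically impossible.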
2. Data should be described with metadata, and the metadata descriptors should be organized under a classification or an ontology.
A classification system collects objects of the same class, and establishes the relationships among all the classes. Creating classified data objects is an important step in achieving introspection (wherein every data object can be interrogated to provide information about itself). A good classification will drive down the complexity of the system and will permit heterogeneous data to be shared, merged, and queried across systems. Creating a logical and useful classification, with metadata, is much more difficult than many people would imagine. The book devotes a chapter to this subject.
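To make introspection a little more concrete, here is a small Python sketch in which every object can report its own identifier, class, ancestor classes, and metadata. The toy class hierarchy, the function names (lineage, is_a), and the ClassifiedObject type are all hypothetical, chosen only for illustration, and a real classification would be far larger and harder to get right.

```python
# Hypothetical classification: each class names its single parent class.
PARENT_OF = {
    "Thing": None,
    "Organism": "Thing",
    "Animal": "Organism",
    "Mammal": "Animal",
    "Dog": "Mammal",
}


def lineage(class_name, parent_of=PARENT_OF):
    """Return the class followed by all of its ancestors, nearest first."""
    chain = []
    current = class_name
    while current is not None:
        chain.append(current)
        current = parent_of[current]
    return chain


def is_a(class_name, ancestor, parent_of=PARENT_OF):
    """True if ancestor is the class itself or one of its ancestors."""
    return ancestor in lineage(class_name, parent_of)


class ClassifiedObject:
    """A data object that can be interrogated about itself."""

    def __init__(self, identifier, class_name, metadata, data):
        if class_name not in PARENT_OF:
            raise ValueError(f"unknown class: {class_name}")
        self.identifier = identifier
        self.class_name = class_name
        self.metadata = metadata
        self.data = data

    def introspect(self):
        """Report everything the object knows about itself."""
        return {
            "identifier": self.identifier,
            "class": self.class_name,
            "lineage": lineage(self.class_name),
            "metadata": self.metadata,
            "data": self.data,
        }


rex = ClassifiedObject("obj-0001", "Dog", "weight_in_kilograms", "31")
print(rex.introspect())
print(is_a("Dog", "Animal"))
```

A query such as "find every Animal" can then be answered by checking each object's lineage, without the person asking needing to know in advance which subclasses exist.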
3. Big Data must be immutable.
You can add to Big Data, but you can never alter or delete the contained data. If you make the unpardonable error of revising your data, you will end up with an unrepresentative data set that nobody will trust. How can you keep Big Data immutable when the data set is constantly receiving updated information? Hint: it's done with time-stamps, event objects, and identifiers.
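One possible reading of that hint, sketched in Python: rather than overwriting a value, each update is recorded as a new time-stamped event attached to the same identifier, so the old value is never lost. The Event and AppendOnlyStore names, the patient identifier, and the sample measurements are assumptions made for this example, not the book's implementation.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Event:
    """An immutable, time-stamped assertion about an identified object."""
    identifier: str
    timestamp: datetime
    metadata: str
    value: str


class AppendOnlyStore:
    """Hypothetical store: events can be added but never altered or deleted."""

    def __init__(self):
        self._events = []

    def record(self, identifier, metadata, value, timestamp=None):
        # Default to the current UTC time when no time-stamp is supplied.
        when = timestamp or datetime.now(timezone.utc)
        event = Event(identifier, when, metadata, value)
        self._events.append(event)
        return event

    def history(self, identifier, metadata):
        """All events for one identifier and descriptor, oldest first."""
        matches = [e for e in self._events
                   if e.identifier == identifier and e.metadata == metadata]
        return sorted(matches, key=lambda e: e.timestamp)

    def value_as_of(self, identifier, metadata, when):
        """The most recent value recorded at or before the given time."""
        past = [e for e in self.history(identifier, metadata)
                if e.timestamp <= when]
        return past[-1].value if past else None


store = AppendOnlyStore()
january = datetime(2013, 1, 15, tzinfo=timezone.utc)
june = datetime(2013, 6, 15, tzinfo=timezone.utc)
store.record("patient-001", "weight_in_kilograms", "80", january)
store.record("patient-001", "weight_in_kilograms", "76", june)  # an update, not an edit
print(store.value_as_of("patient-001", "weight_in_kilograms", january))
print(len(store.history("patient-001", "weight_in_kilograms")))
```

The "current" value is simply the latest event, while any earlier analysis can be reproduced by asking what the data looked like at the time it was run.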
4. Big Data must be accessible to the public if it is to have any scientific value.
If you have a private Big Data resource that is not intended to be a resource for scientists, then my advice is to guard every byte of your data. But good science demands open access to data. Unless members of the public have a chance to verify, validate, and examine your data, the conclusions drawn from the data will have no scientific credibility. The book discusses the ethics and legalities of Big Data.
5. Data analysis is important, but data re-analysis is much more important.
There are many ways to analyze Big Data, and my book has several chapters devoted to the subject. One of the nicest things about Big Data is that it is immutable (when prepared correctly), and your analyses can be reviewed when they are made public, and revisited as more data comes to light. In the book, I discuss instances wherein Big Data has been usefully re-analyzed. In my opinion, the primary purpose of Big Data analysis is to create an opportunity for Big Data re-analysis.
Both Google Books and Amazon have extensive previews of the book, including a complete table of contents. It might be worthwhile to take a look to determine whether this book covers subjects that you might find valuable.
The Elsevier book site is: http://store.elsevier.com/product.jsp?isbn=9780124045767&pagename=search
Readers of this blog can use a passcode at the Elsevier order site to get a discount. A box for the passcode appears when the order is being entered. Use “MKFRIEND” (all uppercase). This should give you a 30% discount with free shipping.
Jules Berman received two baccalaureate degrees from MIT, in Mathematics and in Earth and Planetary Sciences. He received the Ph.D. from Temple University and the M.D. from the University of Miami. He received post-doctoral training at NIH and residency training at George Washington University Medical Center. He is board certified in anatomic pathology and in cytopathology. He served as Chief of Anatomic Pathology, Surgical Pathology and Cytopathology at the Veterans Administration Medical Center in Baltimore, Maryland, where he held joint appointments at the University of Maryland Medical Center and the Johns Hopkins Medical Institutions. In 1998, he became a Medical Officer at the U.S. National Cancer Institute and served as the Program Director for Pathology Informatics in the Institute's Cancer Diagnosis Program. In 2006, Jules Berman was President of the Association for Pathology Informatics. In 2011, he received the Lifetime Achievement Award from the Association for Pathology Informatics. Today, Jules Berman is a freelance writer. He has first-authored more than 100 articles and 11 book titles in science and medicine.