Data Scientists Clash with Publishers – Local Expert Comment on the Debate

There is a fascinating debate raging in the world of web-scale text processing, with "[s]cientists and publishers clash[ing] over licences that would let machines read research papers." To get started, head over to Nature here and read the background to the story. Once done, come back here, where our very own local subject matter expert, Ben Bengfort, has started the discussion below:

The Copyright Approach

Copyright law is consistently unclear about machines creating copies of artistic material for potential analysis. Indeed, even for simple usage, the precedent in MAI Systems Corp. v. Peak Computer, Inc. says that creating a copy of protected material in RAM from the hard disk can constitute infringement, not to mention the copies created during network traffic. A computer system that performs analysis should be allowed to "own" a copy of the material in the same way that any other user would, subject to the rights and responsibilities of any copy holder. Consider not just the analysis of academic material, but also music, video, and books, especially for ratings or reviews. Whether or not fair use applies, if a computer system purchases a license for copyrighted material, that computer system does not infringe copyright and should be allowed to "read" or analyze the contents of that material [editor: what is "reading" but the understanding and analysis of text by humans?]. While a market for copyrighted material in this sense will force text-miners to pay fair market value for copyrighted material instead of creating ubiquitous crawlers, it will also allow them governed access to the text that they seek to mine.

The Science Approach

The Internet has been a vast resource for companies, especially Google, and academic institutions to mine text-based data. The free spirit of the web and the massive amount of text content has allowed data scientists to hone their skills at deep mining and clustering applications. But as our machine learning techniques and models grow ever more precise, it has become more apparent that domain-specific knowledge engineering produces the best, often surprising, results, rather than the shotgun approach of throwing many different kinds of data from the Internet at a problem. Categorized, deep-domain academic papers are appealing because they will clearly provide the best, most novel results for our text mining applications, and will clearly change the way that we conduct research and innovate in the future. That a roadblock as simple as copyright (and particularly copyright owners like publishers that are clinging to an old and crumbling business model) is preventing the exponential growth of human knowledge is a crime against humanity. That novel results will come from mining academic papers is not in question: Google's PageRank algorithm itself grew out of the analysis of citation links between academic papers.


Benjamin Bengfort is a full-stack data scientist with a passion for teaching machines by crunching data--the more the better. A founding partner and CTO at Unbound Concepts, he led the development of Meridian, the company's advanced text analysis engine designed to parse and predict the reading level of content for K-6 readers. With a professional background in military and intelligence and an academic background in economics and computer science, he brings a unique set of skills and insights to his work, and is currently pursuing a PhD in computer science at UMBC with a focus on machine learning techniques for natural language processing.

(Note, DataCommunityDC is an Amazon Affiliate. Thus, if you click the image in the post and buy the book, we will make approximately $0.43 and retire to a small island).