Anyone who has worked in the social analytics space, or in fact on any content analytics project involving Internet content, will have run into the legal quagmire that is copyright infringement. It's the proverbial pain in the posterior. In the most extreme case, it makes it impossible to analyze any Internet content, since there is so much ambiguity around what actually constitutes infringement of copyright. While wading through this quagmire, I had the pleasure of reading many interesting positions on the subject and uncovering the extremes to which this issue can be taken.
One example is file caching, where browsers (for example) cache/copy web pages onto your local file system and therefore, it could be argued, infringe copyright. Thankfully the lawyers quickly figured out that this might be a tricky precedent to set, and so a caveat was established where file caching was explicitly excluded from copyright liability (see United States Code, 17 U.S.C. § 512).
Another one was search engine caching, where it was argued by some (even in a court of law) that search engine caches infringed copyright. Thankfully this was quashed at the time by some very reasonable judges who recognized that websites were in reality giving the search engines an implicit license to copy and index, and that robots.txt allowed this to be stopped if the websites so desired. This type of frivolous case is another example of websites that “want to have their cake and eat it too”. The reality is that the benefits of getting your content indexed outweigh any issues associated with the caching/storing of the content. Confidential material (or chargeable content) should be, and normally is, placed behind password-controlled sections of websites and thereby protected from any caching.
So to my question (at last) …
Does copyright need a bit of a facelift in the light of the new Internet reality, specifically for content which is freely available and openly downloadable from the Internet, and which is frequently not easily identifiable as copyrighted? Should the same expectation that was applied to search engine caches (or browser caches, for that matter) be applied to any application that crawls, indexes, and analyzes Web content? And if yes, what would this mean?
- A robots.txt equivalent for controlling what gets crawled / analyzed?
- Standard machine-readable copyright statements, enforced to protect both the content owners and the analytics providers, that also let them know how certain content can be used?
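To make the first idea a little more concrete, here is a minimal sketch of how an analytics crawler could honor robots.txt-style rules before touching a page, using Python's standard `urllib.robotparser`. The robots.txt content, the `ExampleAnalyticsBot` user agent, the `may_analyze` helper, and the example.com URLs are all made up for illustration; this only shows the existing crawl-control mechanism, not any agreed standard for analytics.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt a site might publish. "ExampleAnalyticsBot" is an
# invented user agent, not a real crawler.
ROBOTS_TXT = """\
User-agent: ExampleAnalyticsBot
Disallow: /members/
Disallow: /premium/

User-agent: *
Disallow: /members/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text directly, without fetching anything over the network."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_analyze(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True if the site's rules allow this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(ROBOTS_TXT)

# The analytics bot may read the public blog but not the premium section.
blog_ok = may_analyze(parser, "ExampleAnalyticsBot", "https://example.com/blog/post-1")
premium_ok = may_analyze(parser, "ExampleAnalyticsBot", "https://example.com/premium/report")
```

In this sketch the site owner decides, per crawler, which sections are off limits, and a well-behaved analytics tool checks before it downloads anything. A machine-readable copyright statement would need something richer than allow/disallow, which is exactly the gap the second bullet points at.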
Is there already something happening to better protect companies looking to apply/use analytics on Internet content, such as social media? If anyone happens to know, please share. Thanks :-)