A trip down memory lane: text analysis Jedi-style

I have always been a total lover of History, and I strongly believe that after Maths it’s the most important subject kids should study. OK, that’s maybe overstating it, but not by much :-) Anyway, a few years ago I got a chance to combine history and analytics with Trinity College Dublin, analyzing 17th-century English, which is some of the toughest text you will ever be asked to analyze. You think Twitter is tricky? Wait until you look at a document that was written 300+ years ago, where…

  • You frequently find entire documents without a single end-of-sentence marker.
  • There are no standard spellings, because back then most people couldn’t write, much less spell.
  • Proper case is non-existent, so don’t plan to rely on case indicators for finding people or places, which makes for interesting sentence parsing, if there were sentences :-)
  • The original documents were illegibly hand-written on paper that is now badly damaged, with text that is frequently corrected or struck through and inserted into every available margin (top, bottom, and sides) because paper was expensive then. So as you can imagine, this made the resulting digital text… well, messy to say the least.

This type of analysis for the faint-hearted it is not! As Yoda might say :-) Which is coincidentally a rather appropriate reference, since the Jedi Archives in Star Wars Episode II: Attack of the Clones bear a striking resemblance to the Trinity College Dublin “Long Room” library. But I digress… the point being that despite the challenging nature of this historical text, it’s hugely rewarding, as it gives you an amazingly intimate window into the past.

