Implementation

The original PDFs were run through Tesseract, using PyPDFOCR. The text content was then extracted with PDFMiner and indexed with Whoosh.
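Roughly, the indexing half of that pipeline looks like the sketch below. This is a minimal, hypothetical version, not percslib's actual code: the directory names, the schema fields, and the use of pdfminer.six's `extract_text` are all assumptions, and it presumes the PDFs already have a text layer from the OCR pass.

```python
import os
from pdfminer.high_level import extract_text   # pdfminer.six
from whoosh.fields import Schema, TEXT, ID
from whoosh.index import create_in

# Hypothetical layout -- not the real percslib paths or schema.
PDF_DIR = "pdfs"
INDEX_DIR = "indexdir"

schema = Schema(path=ID(stored=True, unique=True), content=TEXT)

os.makedirs(INDEX_DIR, exist_ok=True)
ix = create_in(INDEX_DIR, schema)

writer = ix.writer()
for name in os.listdir(PDF_DIR):
    if not name.lower().endswith(".pdf"):
        continue
    pdf_path = os.path.join(PDF_DIR, name)
    # The PDF is assumed to already contain a text layer
    # (added by the earlier Tesseract/PyPDFOCR pass).
    writer.add_document(path=pdf_path, content=extract_text(pdf_path))
writer.commit()
```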

OCR and indexing are done offline. The static index is then served by the website.
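Serving then amounts to opening the prebuilt index and running queries against it. A minimal sketch, using the same hypothetical index directory and field names as above:

```python
from whoosh.index import open_dir
from whoosh.qparser import QueryParser

ix = open_dir("indexdir")  # the prebuilt, static index

with ix.searcher() as searcher:
    # Parse the user's query against the "content" field and print matches.
    query = QueryParser("content", ix.schema).parse("some search terms")
    for hit in searcher.search(query, limit=10):
        print(hit["path"])
```

Because the index is read-only at serve time, the web layer stays a thin wrapper: no background workers, no search daemon to keep alive.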

The percslib python package can, and probably should, be split off into its own project, as it can split, index, and search independently of the little website tacked in front of it. Maybe one day…

Why do it this way?

When there are very cool open-source search platforms like Elasticsearch and Solr that already index PDFs out of the box?

Hosting cost.

The aim was not only to index the text, but to make it available for near-zero cost. Servers that can handle the OCR processing (lots of CPU…), or an Elasticsearch/Solr instance (lots of memory…), are nowhere near free.

The current implementation means it can be thrown onto tiny shared hosts without a worry.

Plus, Whoosh kicks ass!