Usage¶
To search the content, first we need to:
- Prepare the PDF files (OCR, split into per-person files), then
- Add the text content to a search index.
These can be resource-intensive operations for small webservers, so as discussed in Implementation, we perform these steps offline: on your local workstation, or wherever else. Just not on the server.
It’s feasible that the indexing could be performed by the server, as that’s not too taxing; percs doesn’t do anything too fancy with the content (just the Whoosh stemming analyzer). But as the volume of updates is minimal, it was quicker to get going with the server treating the index as static.
Also be aware that the tools included here are still fairly task-specific: they didn’t start life as generalised tools. They’ve been coaxed along the path towards generalisation a bit, but they still leak their heritage in places.
Installation¶
Percslib¶
This is the main support library that coordinates all the pdf manipulation, indexing and search facilities.
It’s found in the /percslib subdirectory.
The OCR library has a bunch of dependencies. On Ubuntu 14.10, I installed these packages to get things rolling:
sudo apt-get install libpython-dev zlibc zlib1g zlib1g-dev libtk8.6 libjpeg-dev libopenjpeg-dev libfreetype6-dev libwebp-dev libwebpmux1 liblcms2-2 liblcms2-dev liblcms2-utils libtiff5-dev tesseract-ocr ghostscript imagemagick libpoppler-dev libpoppler-cil-dev python-poppler
Unpack the repository. Use a virtualenv if you like (probably a good idea).
Then, in the percslib subdirectory, run:
python setup.py install
Website¶
This is a web2py application, so you’ll need to install web2py separately before the code in /website will work.
Once you’ve installed web2py somewhere, link / copy the percs website contents to:
[web2py install location]/applications/percs
Alternatively, it should be fairly trivial to add a website in whichever framework you prefer, as long as it can call the percslib code (either by importing it as a Python module, or by calling the commandline scripts).
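For illustration, calling the commandline scripts from another framework could look something like the following sketch. The helper names here are my own invention; the JSON fields assumed are the ones percs-search emits (see Searching).

```python
import json
import subprocess


def build_search_command(index_dir, query, limit=10):
    """Build the percs-search invocation for a given index and query."""
    return ["percs-search", "-i", index_dir, "--limit", str(limit), query]


def parse_search_output(output):
    """Pull (person, filename, page) tuples out of percs-search's JSON output."""
    results = json.loads(output)
    return [(m["person"], m["filename"], m["page"]) for m in results["matches"]]


def search(index_dir, query, limit=10):
    """Run a search by shelling out to percs-search, rather than importing percslib."""
    output = subprocess.check_output(build_search_command(index_dir, query, limit))
    return parse_search_output(output)
```

Importing percslib directly avoids the subprocess overhead, but shelling out like this keeps your web app decoupled from the library’s internals.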
Why did I go for web2py? I like web2py.
Collections¶
The PDF files we provide are kept on the filesystem, under:
[web2py instance]/applications/percs/static/collections/[COLLECTION NAME]/
Where COLLECTION NAME matches the collection the files were indexed under (more on that in Indexing). The filenames must also match those they were added to the index with; otherwise, the search result hit links won’t work.
Note
These naming requirements are only an implementation detail of the current website. In keeping with the notes in Implementation, this is an attempt at avoiding the need for more infrastructure.
Each collection directory can contain a README file, with a description to display on the collection’s page.
The contents of this README should be in MARKMIN format.
Preprocessing¶
Percs includes two main PDF preprocessing functions:
- Adding a text layer via OCR. This relies on the fantastic PyPDFOCR library.
- Splitting subsets of pages into separate PDF files. This relies on the PyPDF2 library, which is a dependency of PyPDFOCR anyway.
Adding an OCR text layer¶
Percs provides a commandline script to do this, but it delegates the work entirely to PyPDFOCR.
$ percs-pdf-ocr -h
usage: percs-pdf-ocr [-h] [-d] [-v] [-m] [-l LANG] [--skip-preprocess]
[-w WATCH_DIR] [-f] [-c CONFIGFILE] [-e] [-n]
[pdf_filename]
Convert scanned PDFs into their OCR equivalent. Depends on GhostScript and
Tesseract-OCR being installed.
positional arguments:
pdf_filename Scanned pdf file to OCR
optional arguments:
-h, --help show this help message and exit
-d, --debug Turn on debugging
-v, --verbose Turn on verbose mode
-m, --mail Send email after conversion
-l LANG, --lang LANG Language(default eng)
--skip-preprocess Skip preprocessing (saves time) if your pdf is in good
shape already
-w WATCH_DIR, --watch WATCH_DIR
Watch given directory and run ocr automatically until
terminated
Filing options:
-f, --file Enable filing of converted PDFs
-c CONFIGFILE, --config CONFIGFILE
Configuration file for defaults and PDF filing
-e, --evernote Enable filing to Evernote
-n Use filename to match if contents did not match
anything, before filing to default folder
PyPDFOCR version 0.8.2 (Copyright 2013 Virantha Ekanayake)
As you can see, it’s simply a thin wrapper around PyPDFOCR’s commandline script.
An example of using this:
$ percs-pdf-ocr 2014-06-30_O-Farrell_Barry_pecuniary-interests.pdf
This will create a new file, with an “_ocr” suffix added to the filename.
PyPDFOCR has a tonne of other features (monitoring a directory for new files, automatically filing the input & output files after processing – based on text found during OCR’ing, sending notification emails…). It’s pretty handy.
Splitting PDF Files¶
One main problem with the OCR is that our current target files each contain around 100 member submissions. Many of these submissions are handwritten, so hardly any useful content can be extracted.
The fallback is to split these mega-documents into the individual submissions. Even if we fail to find any searchable content, we can still search for / locate a submission by the person’s name, and a human can eyeball the text that the machine couldn’t read. (The human might also have trouble; some of the handwriting is pretty poor.)
The good news¶
Percs supplies a script, which accepts a CSV of:
- start page (with 1 being the first page in the file),
- end page,
- document name (eg. the person’s name),
- output filename for this page range.
Here’s its help dump:
$ percs-pdf-split -h
usage: percs-pdf-split [-h] [-c CSV_FILENAME] [-o OUTPUT_DIR] [-H] filename
positional arguments:
filename PDF file to split
optional arguments:
-h, --help show this help message and exit
-c CSV_FILENAME, --csv-page-ranges CSV_FILENAME
CSV containing start, end, name & (optional) output
filename for range
-o OUTPUT_DIR, --output-directory OUTPUT_DIR
Output directory. Default: current directory
-H, --header-row CSV contains header row to skip
Example usage: here I split the 2012-2013 Volume 2 file into its individual submissions, with the files being dumped to the splits_v2 directory:
percs-pdf-split -c 2012-2013_perc_interests_split_v2.csv -o splits_v2 "Volume 2 - Disclosures by Members - 30 June 2013_ocr.pdf"
Here are the first couple of rows of the CSV:
3,11,Issa_Tony,2013-06-30-Issa_Tony-Nsw_2012-2013_pecuniary_interests.pdf
12,20,Kean_Matthew,2013-06-30-Kean_Matthew-Nsw_2012-2013_pecuniary_interests.pdf
21,29,Lalich_Nick,2013-06-30-Lalich_Nick-Nsw_2012-2013_pecuniary_interests.pdf
The document name that you provide will also be embedded in the resulting PDF’s “Subject” metadata property. Maybe this is useful if you index the files with other software as well.
The bad news¶
So far, I haven’t found a nice automatic way of determining the page ranges. Creating the page range CSV is still a manual slog.
Ideas for solving this are welcome!
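One untested idea: if each submission starts with a recognisable cover page (the volumes are titled “Disclosures by Members”, so the form header presumably repeats), the per-page OCR text could be scanned for that marker to guess where each submission begins. The function name and the marker pattern below are assumptions, not anything percs currently does.

```python
import re


def guess_page_ranges(page_texts, marker=r"DISCLOSURES?\s+BY\s+MEMBERS"):
    """Guess (start, end) 1-based page ranges from per-page OCR text.

    A new range starts at every page whose text matches `marker`;
    each range runs up to the page before the next match.
    """
    pattern = re.compile(marker, re.IGNORECASE)
    starts = [i + 1 for i, text in enumerate(page_texts) if pattern.search(text)]
    if not starts:
        return []
    ranges = []
    for idx, start in enumerate(starts):
        end = (starts[idx + 1] - 1) if idx + 1 < len(starts) else len(page_texts)
        ranges.append((start, end))
    return ranges
```

Given how rough the OCR is on these scans, the output would still need a human sanity-check, but it might turn the manual slog into a review pass.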
Indexing¶
Now we have the OCR’d individual PDF files ready, so we can create an index. Or, if there’s an existing index, we can add these documents to it.
For this, there’s the percs-pdf-index script:
$ percs-pdf-index -h
usage: percs-pdf-index [-h] [-i INDEX_DIR] [-f] [-c COLLECTION] [-n NAME_CSV]
[-r] [-s] [-d DELETE_DOC]
directory
positional arguments:
directory Directory path containing PDFs to be indexed
optional arguments:
-h, --help show this help message and exit
-i INDEX_DIR, --index-dir INDEX_DIR
Index directory. Default: CURRENT_DIR/percs_idx
-f, --force-new If index dir is an existing index, clear and create a
new index in its place
-c COLLECTION, --collection-name COLLECTION
Collection name for indexed files. Default to current
date (yyyy-mm-dd)
-n NAME_CSV, --name-csv NAME_CSV
Load names from csv, instead of embedded pdf metadata.
Expects (name, filename) rows
-r, --remove-missing Remove missing files from the index, when indexing a
directory
-s, --size Show number of records in index, and exit
-d DELETE_DOC, --delete-document DELETE_DOC
Delete file from index, and exit
Example usage, adding the earlier splits_v2 directory we used for percs-pdf-split:
$ percs-pdf-index -i percs_idx -c "nsw_2012-2013_pecuniary_interests" splits_v2
This will add the PDFs in splits_v2 to the index in percs_idx, under the collection named nsw_2012-2013_pecuniary_interests.
Document names¶
By default, this will extract the document name from the embedded “Subject” field of the PDFs created by percs-pdf-split. Otherwise, if the files come from another source and don’t have this, you can supply a CSV (--name-csv). This CSV expects the format:
- name (eg. person name),
- filename
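For reference, rows in that format can be read with nothing more than the stdlib csv module. This is just a sketch of the expected shape (the function name is mine, not part of percslib):

```python
import csv


def load_names(csv_path, has_header=False):
    """Read (name, filename) rows into a {filename: name} mapping."""
    names = {}
    with open(csv_path, newline="") as f:
        reader = csv.reader(f)
        if has_header:
            next(reader, None)
        for name, filename in reader:
            names[filename.strip()] = name.strip()
    return names
```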
Searching¶
A commandline search utility is included: percs-search. You point it at your index directory and give it a query, using the Whoosh default query language. Results will be spat out as JSON chunks.
Options:
$ percs-search -h
usage: percs-search [-h] [-i INDEX_DIR] [-p PAGE] [-l LIMIT] [-c] query
positional arguments:
query Query string
optional arguments:
-h, --help show this help message and exit
-i INDEX_DIR, --index-dir INDEX_DIR
Percs index to search against
-p PAGE, --page PAGE Page of paginated results to return
-l LIMIT, --limit LIMIT
Max number of results to return
-c, --contents Return indexed contents with each record
Example:
percs-search -i percs_idx 'golf club' --limit 1
{
"matches": [
{
"file_hash": "b3d90a4d975791aef8719e89ee9f36b2f538d93d83bd5e2ee3283f6f71d1373c",
"highlights": "<b class=\"match term0\">Club</b>,\nQantas Frequent Flyer <b class=\"match term0\">Club</b>, Port Kembla Pumas, Angels...of Hope, Wollongong <b class=\"match term1\">Golf</b>\nI <b class=\"match term0\">Club</b>, Illawarra Stingrays...Social <b class=\"match term0\">Club</b>, Member of Dogs NSW\n\n22",
"page_hash": "16fbe6442983119358f0c21e318564c749a2ea9959e3ce3f8e212d3f47c66b42",
"collection": "nsw_2013-2014_pecuniary_interests",
"filename": "2014-06-30_Hay_Noreen_pecuniary-interests_ocr.pdf",
"person": "Hay_Noreen",
"content_type": "application/pdf",
"id": "b3d90a4d975791aef8719e89ee9f36b2f538d93d83bd5e2ee3283f6f71d1373c-9",
"page": 9
}
],
"total_matches": 5,
"matches_returned": 1,
"query": "golf club"
}
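Note that the highlights field wraps matched terms in HTML <b> tags, which suits the website but not a terminal. A consumer of this JSON could strip the markup with something like this (the helper name is my own):

```python
import html
import re


def plain_highlights(highlight_html):
    """Strip the <b class="match ..."> markup from a highlights field,
    leaving plain text suitable for terminal display."""
    text = re.sub(r"</?b[^>]*>", "", highlight_html)
    return html.unescape(text)
```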