Omeka for City Directories


Sorry if this is in the wrong section; I'm new to the forum. I'm a member of a project team looking to use Omeka to store and display city directories for a local museum. The city directories are being scanned as PDF files, and the text is being captured with Optical Character Recognition (OCR) in Adobe Reader. Will it be possible to upload these directories to Omeka after the OCR is complete and have them be searchable? They also need to be searchable all at once. For example, if someone were to search "barber," every instance of "barber" in every city directory should show up in the search results, so using Ctrl-F on each individual directory isn't an option. If it's possible to use Omeka to do this, how? Would it be using PDF-search? I just want to make sure I understand everything before recommending Omeka to the museum.

Thanks for the help!

The PDF Text plugin will probably serve you well, depending on some details in your workflow. Scans to PDF are often just PDF files with a bunch of JPEG images embedded in them, and those won't help the search. It's the OCR text that you need to get into the PDF for the plugin to work its magic.
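
One way to check whether a scanned PDF actually carries an OCR text layer before uploading it is to run it through pdftotext (the Poppler command-line tool that, as far as I know, the PDF Text plugin itself relies on) and see whether anything meaningful comes out. Here is a minimal Python sketch, assuming pdftotext is installed and on the PATH. The function names and the 200-character threshold are my own illustrative choices, not anything from Omeka.

```python
import subprocess


def extract_text(pdf_path):
    """Return the text pdftotext finds in the PDF (little or nothing if image-only)."""
    # "-" tells pdftotext to write to stdout instead of a .txt file.
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def has_text_layer(extracted_text, min_chars=200):
    """Heuristic: treat the PDF as searchable if enough letters/digits came out.

    min_chars is an arbitrary illustrative threshold, not an Omeka setting.
    """
    meaningful = sum(1 for ch in extracted_text if ch.isalnum())
    return meaningful >= min_chars
```

Something like has_text_layer(extract_text("directory.pdf")) (the filename is hypothetical) returning False would suggest the file is images only and needs OCR before the plugin can index it.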

The MySQL settings for searching sometimes need some tweaking. Beyond the sometimes surprising stop-words in the search, if a term appears in too many records, it can actually drop out of the search results.
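
To make the "drops out of the search results" point concrete: with MySQL's MyISAM full-text indexes in natural-language mode, words shorter than ft_min_word_len (4 by default) and words on the stopword list are not indexed, and a word that appears in half or more of the rows is treated as too common to match. The sketch below is only a rough Python model of those three rules for sanity-checking a search term against sample text. It is not how MySQL is implemented, the tiny stopword set is a stand-in for the real list, and whether your Omeka install uses MyISAM search tables is something you would need to verify.

```python
import re

# Stand-in stopword sample; MySQL's real built-in list is much longer.
SAMPLE_STOPWORDS = {"the", "and", "for", "with"}


def why_term_is_ignored(term, rows, min_word_len=4, stopwords=SAMPLE_STOPWORDS):
    """Return a reason the term would likely be ignored, or None if it should match."""
    word = term.lower()
    if len(word) < min_word_len:
        return "shorter than minimum word length"
    if word in stopwords:
        return "stopword"
    # Count rows (e.g. one per indexed record) containing the word at least once.
    containing = sum(
        1 for text in rows if word in re.findall(r"\w+", text.lower())
    )
    if rows and containing * 2 >= len(rows):
        return "appears in 50% or more of rows"
    return None
```

With only a handful of city directories loaded, a common word such as "barber" could plausibly show up in most of them and trip the last rule, which is one more reason to test with a realistic sample.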

Upshot -- sounds likely with OCR'd text. As always, a trial run with a small sample is a good idea.

We have found that the PDF Text plugin doesn't always work so well with text that is formatted in columns. When that happens, we deactivate PDF Text, and do a bit of extra processing during OCR.

We save the OCRed document as a PDF/A, with the text under the page image. Then we save it again as a plain text file. We upload the PDF/A file to the item record, then copy & paste the text itself into the item type metadata text field. That second step is what makes it searchable site-wide.

Users can perform a site-wide search to identify which items contain their search term; pinpointing the search term on the item's page (or pages) requires another Ctrl-F search. It's not perfect, but it works!
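
If you keep the plain text files from the OCR step, the same two-stage lookup (which directory, then where in it) can be tried offline before anything goes into Omeka. This is just a small Python sketch for a trial run. It has nothing to do with Omeka's own search, and the function name and the dictionary layout are made up for illustration.

```python
def find_term(directories, term):
    """Return (directory name, line number, line) for each line containing term.

    `directories` maps a name to that directory's plain text.
    The match is a case-insensitive substring test.
    """
    needle = term.lower()
    hits = []
    for name, text in directories.items():
        for number, line in enumerate(text.splitlines(), start=1):
            if needle in line.lower():
                hits.append((name, number, line.strip()))
    return hits
```

Calling find_term with the text of a few directories and "barber" gives a quick sense of how many hits to expect and whether the OCR output is clean enough to find them.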

I should mention that we use the PDF Embed and/or Docs Viewer plugins.