Examples of PDF Search? · Legacy Forums

BrentEades April 2, 2016

Hello,

The museum I'm working with has a PDF repository encompassing every page of every issue of the local paper published over a 140-year period.

We would like to know how the Omeka PDF search plug-in functions, and ideally to see examples of it in use. Any suggestions?

Thanks

patrickmj April 3, 2016

I suspect the key thing to know about the plugin is that it works on PDFs that contain an actual text layer, but not on PDFs that are only page images. That is, if the PDFs are really just scans of the newspapers, it won't be able to extract any text, since inside the PDF each page is only an image. To get the text out and make it searchable, you'd need to run OCR on them.
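As a rough way to check which case applies, here is a minimal Python sketch. It assumes the text has already been pulled out of a PDF by some extractor that separates pages with a form feed character (the `pdftotext` command-line tool does this, for example). The function name `pages_missing_text` and the `min_chars` threshold are made up for illustration and are not part of the Omeka plugin.

```python
def pages_missing_text(extracted: str, min_chars: int = 20) -> list[int]:
    """Return 1-based numbers of pages whose extracted text is shorter than min_chars.

    Assumes pages are separated by form feed characters ("\f").
    """
    pages = extracted.split("\f")
    # Extractors often emit a trailing form feed; drop the empty final chunk.
    if pages and pages[-1].strip() == "":
        pages.pop()
    return [
        number
        for number, page in enumerate(pages, start=1)
        if len(page.strip()) < min_chars
    ]


# Hypothetical extractor output: pages 2 and 3 came back empty.
sample = (
    "Town council meets on Tuesday evening.\f"
    "\f"
    "  \f"
    "Harvest fair opens this weekend.\f"
)
print(pages_missing_text(sample))  # [2, 3]
```

Pages that come back empty or nearly empty are probably image-only and would need OCR before any text search could find them. The threshold is a guess and would need tuning against real pages.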

The Library of Congress's Chronicling America project might be able to help, at least for the papers before 1923. Sorry, I don't know the process for getting new papers into that project, but that might be worth looking into.

Numerizen April 8, 2016

To expand on that: your PDFs have to be created through some form of OCR workflow, so that the scanned newspaper pages contain actual text and not just pictures of text.

There are tools that can extract text from existing PDF files, such as https://github.com/gkovacs/pdfocr (I haven't tested it myself).

I'm currently finishing a project involving PDF and Solr search; if you're interested, I will post a link when the project goes public.

BrentEades April 8, 2016

Thanks. All of our many thousands of PDFs have been OCRed already, but we haven't found an efficient tool for searching the content and returning the relevant PDFs. I would be very interested in seeing your finished product.
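To make the idea of "searching the content and returning the relevant PDFs" concrete, here is a toy inverted index in Python. It assumes the OCR text of each PDF is already available as a plain string keyed by filename. The filenames and the `build_index` / `search` helpers are invented for illustration. It is only a sketch of the general technique, not how the Omeka plugin or Solr does it.

```python
import re
from collections import defaultdict


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs: dict[str, str]) -> dict[str, set[str]]:
    """Map each word to the set of PDF filenames whose text contains it."""
    index: dict[str, set[str]] = defaultdict(set)
    for filename, text in docs.items():
        for word in tokenize(text):
            index[word].add(filename)
    return index


def search(index: dict[str, set[str]], query: str) -> list[str]:
    """Return the filenames that contain every word of the query, sorted."""
    words = tokenize(query)
    if not words:
        return []
    # .get avoids adding unknown words to the defaultdict during lookup.
    hits = set.intersection(*(index.get(word, set()) for word in words))
    return sorted(hits)


# Hypothetical OCR text for three pages, keyed by made-up filenames.
docs = {
    "1901-03-14_p1.pdf": "Flood damages the mill bridge",
    "1901-03-15_p2.pdf": "Mill workers rebuild after the flood",
    "1950-07-01_p1.pdf": "County fair opens",
}
index = build_index(docs)
print(search(index, "flood mill"))  # ['1901-03-14_p1.pdf', '1901-03-15_p2.pdf']
print(search(index, "railway"))     # []
```

With many thousands of PDFs, an in-memory dictionary like this would likely not be enough. A dedicated search engine such as the Solr setup mentioned above keeps the same kind of index on disk and adds ranking and phrase matching.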