Full-text searching of PDFs

I'm not sure that I'm posting in the right place here, but I've been snookered trying to find information about using Omeka to permit full-text searching of text PDFs. Anyone know how to make this a reality using Omeka?

I'm getting ready to add tons of old newspaper PDFs and have the same question. Did you find a way to search PDFs?

Unfortunately, there isn't a way to make the Omeka search look inside files.

Okay, thanks for letting me know. I'll look into running OCR on the PDFs.

Acting on these requests, I made a plugin that enables searching on PDF text:

https://github.com/omeka/plugin-PdfSearch

The feature is limited to items that have been added or edited since the plugin was installed. I plan to add a feature that bulk extracts text from PDFs that were added before the plugin was installed.

This plugin is unreleased and unstable, but I would like people to help me test it.

Jim

Thanks a ton, Jim. I'm a little unstable myself, so the plugin should be a good fit.

Will definitely offer feedback if we get the green light to install it while still in beta.

Wow, Jim, that's great!

Hi Jim,

I have tested the plugin, and it works for some PDFs but not for others. I think this is normal, since PDFs are created in different ways or can contain characters that are difficult to read.

A couple of things that I have found:

- The plugin stops outputting text when it finds a non-ASCII character (for example, a letter with an accent), which is a problem for PDFs that are not in English.

- If the PDF has more than one page, it only reads the first page.

Apart from that, it looks great.

Thanks for the good work and regards,
Dani

I appreciate your help. Are any of these offending PDFs web accessible? I'd like to take a closer look at them.

The plugin depends on a command-line utility called pdftotext. I suspect that these bugs are 1) a limitation of pdftotext, or, more likely, 2) a limitation of the software that created the PDF.
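For anyone who wants to see what pdftotext produces outside of Omeka, here is a minimal Python sketch of that kind of call. It is not the plugin's code (the plugin is PHP), and the function names build_pdftotext_command and extract_pdf_text are made up for illustration. The -enc, -f, and -l options and the "-" output argument are regular pdftotext options.

```python
import shutil
import subprocess


def build_pdftotext_command(pdf_path, first_page=None, last_page=None):
    """Build the argument list for a pdftotext call that writes to stdout."""
    command = ["pdftotext", "-enc", "UTF-8"]
    if first_page is not None:
        command += ["-f", str(first_page)]
    if last_page is not None:
        command += ["-l", str(last_page)]
    # "-" as the output file tells pdftotext to write the text to stdout.
    command += [pdf_path, "-"]
    return command


def extract_pdf_text(pdf_path):
    """Return the text of a PDF, or None if pdftotext is missing or fails."""
    if shutil.which("pdftotext") is None:
        return None
    result = subprocess.run(
        build_pdftotext_command(pdf_path),
        capture_output=True,
        check=False,
    )
    if result.returncode != 0:
        return None
    # Decode explicitly as UTF-8 so the result does not depend on the locale.
    return result.stdout.decode("utf-8", errors="replace")
```

Running this on one of the problem PDFs and comparing the result with what the plugin stored should show whether text is being lost in pdftotext itself or somewhere later.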

Hi Jsafley,
I just installed the plugin on my test server and it works fine. Do you plan to add more functionality to this plugin?
In particular, I'd be interested in a highlight feature (I know it's complicated, and I don't think it can be done with pdftotext), or at least a way to jump to a specific page. When I view pdftotext output in vim, I see a ^L (form feed) character at every page break; maybe that would be a way to do it?

There's also the pdftohtml utility, which can build an XML file from a PDF with page numbers and the location of every word...
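To make the ^L idea concrete, here is a rough Python sketch (not part of the plugin) that splits pdftotext output on form feeds and reports which pages contain a search term. The function name pages_containing is hypothetical, and the sketch assumes pdftotext was run without -nopgbrk, so page breaks are still in the output.

```python
def pages_containing(extracted_text, term):
    """Return the 1-based page numbers whose text contains the term."""
    pages = extracted_text.split("\f")
    # pdftotext normally ends each page with a form feed, so the split leaves
    # one empty trailing chunk that is not a real page.
    if pages and pages[-1] == "":
        pages.pop()
    needle = term.lower()
    return [
        number
        for number, page_text in enumerate(pages, start=1)
        if needle in page_text.lower()
    ]
```

A page number found this way could then be passed to a PDF viewer that supports opening at a given page, though whether a particular viewer does is something to check separately.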

symac,

Thanks for the feature requests, but I only plan to add a feature that bulk extracts text from PDFs that were added before the plugin was installed. Of course, I invite anyone to fork the repository and add whatever features they wish.

This sounds really interesting. Is there a way to fake or force a re-add of the PDFs to the database? I have a small Omeka website (http://beforecaligari.org/sources/) with lots of OCR'd PDFs (i.e. PDFs with OCR text) that I'd love to make searchable with this plugin.

Arno Bosse,

I'm glad you're interested! As mentioned above, I plan to add a feature that bulk extracts text from PDFs that were added before the plugin was installed. The plugin will not be publicly released until this feature is completed.

Hi Jim, sorry for the delay. See below a couple of links that illustrate what I told you in my previous message:

http://www.princeton.edu/~amoravcs/library/origins.pdf : It only extracts the first page for me.

http://www.centrodeestudiosandaluces.es/datos/factoriaideas/policypaper_2.pdf : The plugin stops when it encounters a letter with an accent. It only extracts “El arte como veh” because it has found an í (“El arte como vehículo…”).

Another issue that I have noticed: if you import items through the CSVImport plugin, the PdfSearch plugin doesn't seem to extract anything at all.

Regards,
Dani

danimon,

That's strange. I'm seeing all pages from origins.pdf and policypaper_2.pdf. What version of pdftotext is running on your server?

A couple of things may be happening here. 1) Your PHP locale may not be set to UTF-8, so PHP quits reading at the first non-ASCII character it encounters in the command-line output. 2) The database may be truncating the text during insert.

I can't do much about the first possibility, short of asking you to speak to your server admin about changing the locale.

Try opening those PDFs, copying the texts, and pasting them into any item field in Omeka. Submit the form and see if the text was truncated.
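As an illustration of the symptom, here is a small Python sketch that checks whether a stored text looks like it was cut off at the first non-ASCII character. The helper names ascii_prefix and looks_truncated are invented for this example, and the check only confirms the pattern; it does not tell you whether PHP or the database did the cutting.

```python
def ascii_prefix(text):
    """Return the part of text before its first non-ASCII character."""
    for index, character in enumerate(text):
        if ord(character) > 127:
            return text[:index]
    return text


def looks_truncated(full_text, stored_text):
    """Guess whether stored_text was cut at the first non-ASCII character."""
    return stored_text != full_text and stored_text == ascii_prefix(full_text)


# The title reported above, with the accented i written as an escape.
title = "El arte como veh\u00edculo"
print(ascii_prefix(title))
```

If the text saved from the pasted copy matches this pattern as well, the cut is probably happening at the database insert rather than when PHP reads the pdftotext output.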

Everyone, I've updated the plugin to bulk-extract all PDF file text. Please update the plugin and try out the new functionality. Thanks!

https://github.com/omeka/plugin-PdfSearch

Hi jsafley,
thanks for the update. I have another question: would it be possible to select which item type the plugin is associated with?
Right now the plugin is associated with every item type; I'd like to see the "PDF Search" section only for "Document" in my install.
Do you think that's possible, or is it a limitation of Omeka?

symac,

The short answer is that it's a limitation of Omeka. Element sets are available to all items, regardless of item type. I recommend that you just ignore the PDF Search element set when editing items that don't apply, and modify the public theme to remove the element set from item pages that don't apply.

Can you say more about modifying the public theme to remove the element set? I'm new to PHP but feel fairly comfortable editing code if given some guidance. We're using emiglio, and I can't see where I need to be modifying the code. We just installed PDF Search after upgrading to 1.5, and I'm looking forward to doing more testing.

These pages should help:

http://omeka.org/codex/Display_Specific_Metadata_for_an_Item
http://omeka.org/codex/Theme_API/show_item_metadata

In short, open themes/emiglio/items/show.php and do something like the following:


// Inside a PHP block: show only the Dublin Core element set, which hides PDF Search.
$options = array('show_element_sets' => array('Dublin Core'));
echo custom_show_item_metadata($options);

I finally got around to testing this very interesting plugin (it took me a while to upgrade my site - http://beforecaligari.org/sources/ to v1.5.x).

On the site I have a bunch of PDF scans of newspaper articles which I then OCR'd with Adobe Acrobat Pro, saving the recognized text in the PDF. The PDF files were already uploaded to my site and were being shown with the Docs Viewer plugin. I'm using the Omeka default theme.

In my brief testing, the plugin seemed to work just fine and recognized all the text in my existing PDF items. However, I was not able to make use of it on my site because I couldn't see an easy way to keep the recognized text out of my item view while still having the content searchable with Omeka. In the end I had to uninstall the plugin, as there's just too much text there (a lot of it jumbled, which is obviously not the fault of the plugin).

Ideally, I'd like the text that the plugin scraped from the PDFs to be available for indexing but not shown in the initial item view. So how to finesse showing the results? I'm envisioning something like a KWIC view where the keyword shows up in a brief surrounding text snippet plus a link to the actual item. That way, the user can see that it's rough OCR but also see the associated file with the scanned image. In other words, in my case, I would only use the OCR'd text in the PDFs as a rough finding aid, not as an accurate representation of the document's contents. In cases where the PDF text is clean (and preferably not too long), I think the plugin is probably great just as it is. It just didn't work for me.
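To show what I mean by a KWIC view, here is a minimal Python sketch. It is not tied to Omeka in any way; the function name kwic_snippets and its width and limit parameters are my own invention, and the text is assumed to be whatever the plugin extracted from the PDF.

```python
import re


def kwic_snippets(text, keyword, width=40, limit=3):
    """Return up to `limit` snippets showing keyword with nearby context."""
    # Collapse runs of whitespace, which are common in OCR output.
    clean = re.sub(r"\s+", " ", text).strip()
    snippets = []
    for match in re.finditer(re.escape(keyword), clean, flags=re.IGNORECASE):
        start = max(0, match.start() - width)
        end = min(len(clean), match.end() + width)
        # Mark snippets that were cut out of a longer text.
        prefix = "..." if start > 0 else ""
        suffix = "..." if end < len(clean) else ""
        snippets.append(prefix + clean[start:end] + suffix)
        if len(snippets) == limit:
            break
    return snippets
```

Each snippet would be shown next to a link to the item, so the reader sees the rough OCR in context and can open the scanned image for the accurate version.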

I'd be happy to do more testing etc. if that would help to polish it further and bring it into the official plugin directory.

Hi. While I'm uploading Omeka 2.0.2 for a digital collection of books and pamphlets as PDFs, I've been searching the forum. One of our main priorities is for keyword searches to search the content of OCR'd PDFs.

I found this thread as well as a more recent thread, Managing Search Settings 2.0.

Am I correct in assuming that the pdftotext plugin was folded into Omeka 2.0 and that the plugin mentioned here is now a moot point?

Thanks!

For Omeka 2.0, the PdfSearch plugin was superseded by the new PdfText plugin, available here:

http://omeka.org/add-ons/plugins/pdf-text/

No functionality was wrapped into Omeka.

I've talked to two hosting providers, and neither makes the pdftotext utility available in non-private server environments. Thus the Omeka PdfText plugin can't work. Indeed, there is chatter on the web strongly advising against including the utility on public servers. I don't suppose there's any workaround for those of us using common web hosting, is there?

I've installed the PDF Text plugin version 2.0, but when I try to upload a PDF file it shows an error.
This is the error message:

Zend_Db_Statement_Mysqli_Exception
Mysqli statement execute error : Column 'text' cannot be null

What should I do?