PDF search for Hebrew / Unicode

Hello,
We are testing Omeka as a solution for our archive. We scan and OCR many documents, all in Hebrew, and want to search their full text. The PDF search works well except for Hebrew text (which is about 99% of our material).
Is anybody aware of attempts to add Unicode or Hebrew support to the PDF Search plugin, or of an alternative solution?

Thanks a lot
Raanan

Would you please describe what is not working exactly? Is the text not being extracted, or is it partially extracted? We've encountered a problem with Unicode characters before and haven't found a solution yet.

Hi,
Thanks for the quick response. None of the Hebrew characters are extracted properly: they appear in the "PDF search" tab as gibberish. Accordingly, nothing is found when searching for Hebrew text.

I've added this issue to the PdfSearch repository: https://github.com/omeka/plugin-PdfSearch/issues/1 We'll work on it and get back to you. Thanks for your patience.

Thanks a lot
Please contact me if you need anything from our side; we'll be happy to test.
Raanan

Actually, are any of the PDFs that contain Hebrew characters publicly available? I'd like to try to reproduce the error. I may have further questions as we troubleshoot.

Sure, just let me know how I can send them to you.

If your PDFs are publicly available, you could give me the URL to your Omeka installation so I can download them from there. Otherwise I'll email you so you can send some PDFs that way.

Do you have access to the server on which Omeka is installed? If so, would you show me the output of the following commands in a terminal:


$ locale
$ locale -a

Jim,
I'd prefer to send them by email rather than publish our test environment. I can also give you access to the installation if needed. I don't have command-line access to the server, only FTP, so I have asked our web host's support to run these commands.

Thanks,
Raanan

Update:

After some troubleshooting we discovered that the earlier Xpdf version of pdftotext does not reliably extract Unicode characters. The only way to fix this is to install the newer Poppler version of pdftotext on the server, available in poppler-utils. The PDF Search documentation has been updated to reflect this.
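
For anyone who wants to check which pdftotext their server has, here is a rough sketch rather than an official check. It assumes that Poppler builds mention "Poppler" in the version banner and that Xpdf builds credit only "Glyph & Cog"; the function name classify_pdftotext is made up for this example, so compare the result against what your own server actually prints.

```shell
# Classify a `pdftotext -v` banner as "poppler", "xpdf", or "unknown".
# Assumption: Poppler banners contain "Poppler"; Xpdf banners credit
# only "Glyph & Cog". Poppler is tested first because its banner may
# mention both.
classify_pdftotext() {
  case "$1" in
    *[Pp]oppler*) echo "poppler" ;;
    *"Glyph & Cog"*) echo "xpdf" ;;
    *) echo "unknown" ;;
  esac
}

# pdftotext prints its version banner to stderr, so redirect it to stdout.
if command -v pdftotext >/dev/null 2>&1; then
  classify_pdftotext "$(pdftotext -v 2>&1)"
else
  echo "pdftotext not found"
fi
```

If this reports "xpdf", the fix described above applies: ask whoever administers the server to install poppler-utils (on Debian or Ubuntu that is typically "apt-get install poppler-utils"). Afterwards, running "pdftotext -enc UTF-8 sample.pdf -" on one of your own files (sample.pdf is a placeholder name) should show readable Hebrew text.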