PDF search for Hebrew / Unicode · Legacy Forums

Raanan Raz June 12, 2012

Hello,
We are testing Omeka as a solution for our archive. We scan and OCR many documents, all in Hebrew, and want to search their full text. The PDF search works well except for the Hebrew text (which is about 99% of our stuff...)
Is anybody aware of attempts to add Unicode or Hebrew support to the PDF Search plugin, or of an alternative solution?

Thanks a lot
Raanan

Jim Safley June 12, 2012

Would you please describe what is not working exactly? Is the text not being extracted, or is it partially extracted? We've encountered a problem with Unicode characters before and haven't found a solution yet.

Raanan Raz June 12, 2012

Hi,
Thanks for the quick response. None of the Hebrew characters are extracted properly: they appear in the "PDF search" tab as gibberish. Accordingly, nothing is found when searching for Hebrew text.

Jim Safley June 12, 2012

I've added this issue to the PdfSearch repository: https://github.com/omeka/plugin-PdfSearch/issues/1 We'll work on it and get back to you. Thanks for your patience.

Raanan Raz June 12, 2012

Thanks a lot.
Please contact me if you need anything from our side; we'll be happy to test.
Raanan

Jim Safley June 12, 2012

Actually, are any of the PDFs that contain Hebrew characters publicly available? I'd like to try to reproduce the error. I may have further questions as we troubleshoot.

Raanan Raz June 13, 2012

Sure, just let me know how I can send them to you.

Jim Safley June 13, 2012

If your PDFs are publicly available, you could give me the URL to your Omeka installation so I can download them from there. Otherwise I'll email you so you can send some PDFs that way.

Do you have access to the server on which Omeka is installed? If so, would you show me the result of the following commands in terminal:

$ locale
$ locale -a
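
For anyone repeating this check, the `locale -a` output can be filtered for UTF-8 entries. This is only a sketch: the thread does not say what the output was needed for, so treating a UTF-8 locale as the thing to look for is an assumption, and `has_utf8_locale` is an illustrative helper name, not part of Omeka or PdfSearch.

```shell
#!/bin/sh
# Succeed if the locale list read from stdin contains a UTF-8 locale.
# has_utf8_locale is a hypothetical helper name used for illustration only.
has_utf8_locale() {
    # Matches both "utf8" and "UTF-8" spellings, case-insensitively.
    grep -Eiq 'utf-?8'
}

# Assumption: a UTF-8 locale is what matters for Hebrew text handling.
if locale -a 2>/dev/null | has_utf8_locale; then
    echo "UTF-8 locale available"
else
    echo "no UTF-8 locale found"
fi
```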

Raanan Raz June 13, 2012

Jim,
I'd prefer to send them by email rather than publish our test environment. I can also give you access to the installation if needed. I don't have command-line access to the server, only FTP, so I've asked our web host's support to run these commands.

Thanks,
Raanan

Jim Safley June 15, 2012

Update:

After some troubleshooting we discovered that the earlier Xpdf version of pdftotext does not reliably extract Unicode characters. The only way to fix this is to install the newer Poppler version of pdftotext on the server, available in poppler-utils. The PDF Search documentation has been updated to reflect this.
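
A rough way to tell which pdftotext build a server has is to inspect its version banner. The sketch below assumes that Poppler builds mention "Poppler" in the output of `pdftotext -v` and that Xpdf builds do not; `classify_pdftotext` is a hypothetical helper name, not something shipped with PdfSearch.

```shell
#!/bin/sh
# Classify a pdftotext version banner as "poppler", "other", or "unknown".
# Assumption: Poppler banners contain the word "Poppler"; Xpdf banners do not.
classify_pdftotext() {
    case "$1" in
        *[Pp]oppler*) echo "poppler" ;;
        *pdftotext*)  echo "other" ;;
        *)            echo "unknown" ;;
    esac
}

# pdftotext -v writes its banner to stderr, so merge stderr into stdout.
if command -v pdftotext >/dev/null 2>&1; then
    banner=$(pdftotext -v 2>&1)
    echo "pdftotext build: $(classify_pdftotext "$banner")"
else
    echo "pdftotext not found on PATH"
fi
```

If the result is "other", the installed binary is probably the Xpdf one described above. Once the Poppler build is in place, a Hebrew PDF can be spot-checked by hand with something like `pdftotext -enc UTF-8 sample.pdf sample.txt`, where `sample.pdf` is a placeholder file name.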