I am working on a project that requires full-text search. I know there are the pdftotext and pdftohtml plugins, but I am wondering if anyone would share the general design of their archive and how they implemented full-text searching. The archive I am working on will be hosted on its own virtual server, so I will be able to install any third-party software needed to get full-text searching to work.
The current version of Omeka can do full text search on any metadata for items and collections. Other plugins might add the ability to do searches on the data.
So, in addition to what the PdfToText plugin gives you, as long as the text you want searched is stored in some field for an item, it will be searchable.
Searching the metadata on items and collections is rather straightforward. What I am interested in is the ability to search the actual text of an item in the archive. I imagine a document would have to be scanned, run through OCR, and then indexed. After that, how does one tie the index results back to the item in Omeka? Could SolrSearch be used in some way to query the index? Please pardon my Omeka stupidity.
Once you have the OCRed text, then that would probably be added to a field that you'd create for an Item in Omeka. So, you might create an Item Type (or use an existing one) with a field called "Text", and put the text in there. That way it gets indexed along with other metadata (kinda stretching the term 'metadata' there, but it works).
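To make the "tie it back to the item" part concrete, here is a rough Python sketch of the idea. It is not Omeka code and does not use any Omeka API: ITEM_TEXT, build_index, and search are made-up names, and the two sample records are invented for illustration. It only shows that once the OCRed text is stored against an item ID, any index built from it can hand that ID back as the search result.

```python
import re
from collections import defaultdict

# Hypothetical stand-in for Omeka items: item ID -> contents of a "Text" field.
# The IDs and texts are invented purely for illustration.
ITEM_TEXT = {
    1: "Minutes of the town council meeting, March 1902.",
    2: "Letter describing the harbor flood and the council response.",
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(item_text):
    """Map each word to the set of item IDs whose text contains it."""
    index = defaultdict(set)
    for item_id, text in item_text.items():
        for word in tokenize(text):
            index[word].add(item_id)
    return index


def search(index, query):
    """Return the sorted IDs of items that contain every word in the query."""
    words = tokenize(query)
    if not words:
        return []
    matches = set.intersection(*(index.get(word, set()) for word in words))
    return sorted(matches)


INDEX = build_index(ITEM_TEXT)
print(search(INDEX, "harbor flood"))
```

Because the index is keyed on the item ID, a hit leads straight back to the record. An external indexer would presumably work on the same principle, with the item ID stored alongside the indexed text.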
PdfToText does mostly the same thing, but with the File record instead of the Item record -- it programmatically creates a new field for the file metadata.
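If you want to see what kind of text you would get out of a PDF before it goes into a field, a small Python sketch along these lines may help. It assumes the pdftotext command-line utility (from Poppler or Xpdf) is installed and on the PATH; build_pdftotext_command and extract_text are hypothetical helper names, and this is not a description of how the plugin itself is implemented.

```python
import subprocess


def build_pdftotext_command(pdf_path):
    """Build the argument list; the trailing '-' sends the text to stdout."""
    return ["pdftotext", "-enc", "UTF-8", pdf_path, "-"]


def extract_text(pdf_path):
    """Run pdftotext on one file and return the extracted text as a string.

    Assumes the pdftotext binary is installed; raises CalledProcessError
    if the command exits with a non-zero status.
    """
    result = subprocess.run(
        build_pdftotext_command(pdf_path),
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")
```

Note that pdftotext only reads an existing text layer. A PDF that is just scanned page images will come back empty until it has been run through OCR.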