I have a few questions about using Omeka to set up a bibliographic database.
I would like to use it to catalogue a large collection of archaeological reports. Many entries will have PDF attachments (but not all). Some entries would have PDFs attached to them at a later date. Omeka looks attractive, because EPrints looks a bit heavyweight in terms of maintenance and requirements.
Has anyone managed to get Google Scholar to index their content correctly?
I'm also interested in being able to search within PDFs. I'm fairly sure Omeka won't do that 'out of the box' but is it straightforward to integrate Google CSE?
Any experiences with bibliographies and Omeka would be greatly appreciated.
I'm unaware of efforts to enable Google Scholar indexing for Omeka, though I see an interesting possibility for an Omeka plugin. Anyone who wants to take a stab at it can follow Google Scholar's indexing guidelines.
You are correct: Omeka cannot search within PDF, DOC, and similar files out of the box. But, again, a plugin would make it possible. Someone could use our after_insert_file hook to do it internally or, as you mention, leverage Google CSE through its API.
You may be interested in our Zotero Import plugin, which is currently unreleased but nearly finished. It imports bibliographic records from individual and group Zotero libraries. We'll keep the community updated on its progress.
Many thanks for the info. Sadly I lack the programming skills to write a plugin from scratch, but it's good to know that the hooks are there. The Google Scholar one doesn't look too tricky. I may learn!
The Zotero plugin sounds interesting; I will look into that.
I'm looking at this a bit more closely now. I don't yet have the ability to build a plugin, so I thought about the idea of building the <meta> tags into a theme.
<!-- Google Scholar Inclusion Metadata -->
<meta name="citation_title" content="<?php echo item('Dublin Core', 'Title'); ?>" />
<meta name="citation_author" content="<?php echo item('Dublin Core', 'Creator'); ?>" />
<meta name="citation_date" content="<?php echo item('Dublin Core', 'Date Created'); ?>" />
<meta name="citation_technical_report_institution" content="<?php echo item('Dublin Core', 'Publisher'); ?>" />
<meta name="citation_journal_title" content="Our Name" />
<?php while (loop_files_for_item()): // emit one citation_pdf_url tag per attached file ?>
<meta name="citation_pdf_url" content="http://omeka.install.url.org<?php echo file_display_uri(get_current_file()); ?>" />
<?php endwhile; ?>
<!-- End Google Scholar Metadata -->
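A minimal sketch of how tags like the ones above might be produced by a small helper rather than written out by hand. It is plain PHP and calls no Omeka functions; the name scholar_meta_tags and the sample values are hypothetical, and in a theme the values would have to come from the item's metadata and its attached files. It escapes every value and emits one tag per value, so an item with several authors or several PDFs gets several tags.

```php
<?php
// Hypothetical helper, not part of Omeka: builds Google Scholar <meta> tags
// from a plain array of field name => value (or list of values).
function scholar_meta_tags(array $fields)
{
    $lines = array();
    foreach ($fields as $name => $values) {
        // Accept either a single string or a list of strings per field.
        foreach ((array) $values as $value) {
            $value = trim((string) $value);
            if ($value === '') {
                continue; // skip empty metadata rather than emit a blank tag
            }
            $lines[] = sprintf(
                '<meta name="%s" content="%s" />',
                htmlspecialchars($name, ENT_QUOTES, 'UTF-8'),
                htmlspecialchars($value, ENT_QUOTES, 'UTF-8')
            );
        }
    }
    return implode("\n", $lines);
}

// Example values only; these are made up for illustration.
echo scholar_meta_tags(array(
    'citation_title'   => 'Excavations at "Site A"',
    'citation_author'  => array('Smith, J.', 'Jones, A.'),
    'citation_pdf_url' => array('http://example.org/files/report.pdf'),
)), "\n";
```

Keeping the tag generation in one function means the escaping and the one-tag-per-value rule live in a single place, which may be easier to move into a plugin later.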
I have, however, just noticed that the citation_pdf_url attached file must be in the 'same directory' as the item record.
I'm fairly certain that rewriting the file URLs so that they sit under the public record URL is not trivial. For example:
Existing Omeka item public URL example:
Attached file URL:
Google Scholar would want the file URL to be something like:-
File 1 URL:
File 2 URL:
Possibly some .htaccess-fu might be able to match the two together? I'm not quite sure of the best approach. Any advice appreciated.
Interesting idea. You're right that you'd have to do some .htaccess-fu for this, in addition to writing a helper for your theme to generate the updated URLs for files.
One thing that might make this a bit easier is to use the file's record ID instead, something like this:
Where '5' would refer to the ID of the file. It would be easier, I think, to write the redirects that way.
That said, I don't see anything in Google Scholar's Inclusion Guidelines for Webmasters that requires the URLs to be formatted in a specific way. Maybe I missed something, or it's on another page?
Yes, the file's existing ID makes sense and would be easier! Thanks for the tip. I'd better read up on how to make that helper.
It's mentioned in the Indexing Guidelines section, item G.
<meta> tags normally apply only to the exact page on which they're provided. If this page shows only the abstract of the paper and you have the full text in a separate file, e.g., in the PDF format, please specify the locations of all full text versions using citation_pdf_url or DC.identifier tags. The content of the tag is the absolute URL of the PDF file; for security reasons, it must refer to a file in the same subdirectory as the HTML abstract. Failure to link the alternate versions together could result in the incorrect indexing of the PDF files, because these files would be processed as separate documents without the information contained in the meta tags.
Unless I'm misunderstanding what they mean by that? I hope so, actually!
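One strict, literal reading of the "same subdirectory" rule quoted above can be checked mechanically. The sketch below is plain PHP with hypothetical helper names (url_directory, same_directory) and made-up URLs; it only compares host and directory path, ignores the scheme, and is an assumption about what the guideline means, not a description of what Google's crawler actually does.

```php
<?php
// Hypothetical helpers, not part of Omeka or any Google tool.

// Return "host/directory/" for an absolute URL, or null if it cannot be parsed.
function url_directory($url)
{
    $parts = parse_url($url);
    if ($parts === false || !isset($parts['host'])) {
        return null;
    }
    $path = isset($parts['path']) ? $parts['path'] : '/';
    // Keep everything up to and including the last slash.
    $dir = substr($path, 0, strrpos($path, '/') + 1);
    return strtolower($parts['host']) . $dir;
}

// True when both URLs share a host and a directory (strict reading only).
function same_directory($abstractUrl, $pdfUrl)
{
    $a = url_directory($abstractUrl);
    return $a !== null && $a === url_directory($pdfUrl);
}

// Made-up URLs for illustration only.
var_dump(same_directory(
    'http://example.org/items/show/12',
    'http://example.org/archive/files/report.pdf'
));
var_dump(same_directory(
    'http://example.org/items/show/12',
    'http://example.org/items/show/report.pdf'
));
```

A looser reading, where the PDF may sit in a subdirectory beneath the record page, would need a prefix comparison instead of strict equality.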
Ah, thanks for highlighting that line! So yes, it would mean setting up URL rewriting for files so that each file URL includes the path to its item. That said, I'm not sure .htaccess redirects would do the job either, since I suspect Google would detect the redirect. I don't know for certain, but it might be worth investigating before trying to come up with a redirect solution for Omeka.
That's a good point. I know from looking at the source of EPrints output that files appear to be in a subdirectory of the record, so I'll see if I can find out what's going on there.
It turns out that if you ask the Google Scholar team nicely, they will configure their crawlers to bypass the directory check. Phew - saved a lot of work there!
So can Google Scholar index Omeka content correctly?