OAI-PMH Harvester - Trouble with harvesting · Legacy Forums

Corfromleuven November 4, 2015

I've recently set up an Omeka site for testing - www.erfgoedbankkempen.net. The thing I want to test is the possibility of using OAI-PMH datasets to create exhibits in Omeka. I decided to give it a try by using sets available from several museums.

None have been successful so far:

- http://www.rmo.nl/collectie/open-data:

Gives an error message: "Unable to Connect to tcp://api.rmo.nl:17521. Error #110: Connection timed out. Please check to be certain the URL is correctly formatted for OAI-PMH harvesting."

I contacted their webmaster and he is checking right now if it might be a problem with their server.

So I decided to give it a try with the following:

http://data.fitzmuseum.cam.ac.uk/

Which keeps saying for hours that it is busy harvesting. After three hours I clicked on the "In Progress" message and received the following status info:

Status Message:

ID 1
Set Spec
Set Name
Metadata Prefix oai_dc
Base URL http://data.fitzmuseum.cam.ac.uk/oai/?verb=Identify
Status Completed
Initiated 2015-11-04 13:00:08
Completed 2015-11-04 13:00:09
Status Messages

All items created for this harvest were deleted on 2015-11-04 12:55:59

Notice: badArgument: verb should never be duplicated / contain multiple values (2015-11-04 13:00:08)

Notice: No records were found. (2015-11-04 13:00:08)

Notice: Did not receive a resumption token. (2015-11-04 13:00:09)

So could anybody explain why it is not working as expected?

John Flatness November 4, 2015

I see a simple problem with your second example: your base URL has ?verb=Identify in it. The base URL you give the harvester generally shouldn't have any query string in it. In this case the correct base URL is just http://data.fitzmuseum.cam.ac.uk/oai/
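The "verb should never be duplicated" notice fits this explanation: if the harvester appends its own verb to a base URL that already carries ?verb=Identify, the request ends up with two verbs. A minimal sketch of the idea in Python, using only the standard library; the helper names clean_base_url and build_request are made up for illustration and are not part of the Omeka plugin:

```python
from urllib.parse import urlsplit, urlunsplit, urlencode


def clean_base_url(url):
    """Drop any query string or fragment so only the repository endpoint remains."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


def build_request(base_url, verb, **args):
    """Build an OAI-PMH request URL: the cleaned base URL plus exactly one verb."""
    query = urlencode({"verb": verb, **args})
    return clean_base_url(base_url) + "?" + query


base = clean_base_url("http://data.fitzmuseum.cam.ac.uk/oai/?verb=Identify")
print(base)
print(build_request(base, "ListRecords", metadataPrefix="oai_dc"))
```

The first line printed is the base URL to give the harvester; the second is the kind of request a harvester would then build from it.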

John Flatness November 4, 2015

As for your first example of rmo.nl, I don't have the same timeout problem. But, from looking at the output it seems that what they're claiming is "oai_dc" formatted output is actually more of a homegrown format that doesn't comply with the Dublin Core or OAI standard, and the harvester won't work with that.

Corfromleuven November 5, 2015

Thank you for the quick response, John. The ICT guy managing the first case confirmed your analysis.

I'm testing the second suggestion right now and will let you know what the results are.

But if there is a Dublin Core or OAI standard to be respected, could you point me to the documentation where I can find a trustworthy description of it?

In the end, the idea is that after testing, I would like my API provider to build a working API for his system that respects these standards. So it would be nice to know in advance what to expect.

John Flatness November 5, 2015

The OAI-PMH standard document is on their website, as is the XML schema for the oai_dc format which the standard mandates.
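To give a rough idea of what an oai_dc payload looks like, here is a small Python sketch that parses one and lists its Dublin Core fields. The two namespace URIs are the ones used by the oai_dc schema; the sample record itself is invented for illustration, and dc_fields is a hypothetical helper, not something from the harvester:

```python
import xml.etree.ElementTree as ET

OAI_DC_NS = "http://www.openarchives.org/OAI/2.0/oai_dc/"
DC_NS = "http://purl.org/dc/elements/1.1/"

# Invented sample record, for illustration only.
SAMPLE = """<oai_dc:dc
    xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Example object</dc:title>
  <dc:creator>Example maker</dc:creator>
  <dc:identifier>http://example.org/object/1</dc:identifier>
</oai_dc:dc>"""


def dc_fields(xml_text):
    """Return {element name: [values]} for the Dublin Core children of oai_dc:dc."""
    root = ET.fromstring(xml_text)
    if root.tag != "{%s}dc" % OAI_DC_NS:
        raise ValueError("root element is not oai_dc:dc")
    prefix = "{%s}" % DC_NS
    fields = {}
    for child in root:
        # Ignore anything that is not in the Dublin Core element namespace.
        if child.tag.startswith(prefix):
            name = child.tag[len(prefix):]
            fields.setdefault(name, []).append((child.text or "").strip())
    return fields


print(dc_fields(SAMPLE))
```

A provider's output that fails the root-element check here, or that uses element names outside the Dublin Core namespace, is a hint that it is not really oai_dc, though only validation against the published schema settles that.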

Corfromleuven November 9, 2015

I've got a result from my latest harvest, but it won't complete (still running after two days).

And it gives a ton of errors in this style:

ID 4
Set Spec
Set Name
Metadata Prefix oai_dc
Base URL http://data.fitzmuseum.cam.ac.uk/oai/
Status In Progress
Initiated 2015-11-05 09:13:46
Completed [not completed]
Status Messages Notice: Received resumption token: KnwqfG9haV9kY3wxMDA= (2015-11-05 09:15:32)

Notice: Received resumption token: KnwqfG9haV9kY3wyMDA= (2015-11-05 09:17:36)

Notice: Received resumption token: KnwqfG9haV9kY3wzMDA= (2015-11-05 09:19:29)

Notice: Received resumption token: KnwqfG9haV9kY3w0MDA= (2015-11-05 09:21:59)

Notice: Received resumption token: KnwqfG9haV9kY3w1MDA= (2015-11-05 09:24:09)

Notice: Received resumption token: KnwqfG9haV9kY3w2MDA= (2015-11-05 09:26:15)

Notice: Received resumption token: KnwqfG9haV9kY3w3MDA= (2015-11-05 09:28:14)

This goes on for countless lines. So I guess it is still not working. Could you tell me what to do now? Can I delete this? Can I just give it a second try?

Thanks in advance, Cor

John Flatness November 9, 2015

Did some items get harvested? Those "resumption token" messages are just printed each time the harvester goes to get another batch of records from the repository. That log indicates that a pretty large number of records were probably harvested, so you should have a pretty large number of new items in your Omeka site.
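The batching described here can be sketched as a simple loop: request ListRecords, collect the records, and repeat with the returned resumptionToken until the repository stops sending one. This is a generic Python illustration of the protocol, not the Omeka plugin's actual code; harvest and http_get are hypothetical names, and error handling is left out:

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode
from urllib.request import urlopen

OAI = "{http://www.openarchives.org/OAI/2.0/}"


def http_get(url):
    """Fetch a URL and return the response body as bytes."""
    with urlopen(url, timeout=60) as resp:
        return resp.read()


def harvest(base_url, metadata_prefix="oai_dc", fetch=http_get):
    """Yield <record> elements, batch by batch, until no token is returned."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    while True:
        root = ET.fromstring(fetch(base_url + "?" + urlencode(params)))
        yield from root.iter(OAI + "record")
        token = root.find(OAI + "ListRecords/" + OAI + "resumptionToken")
        # A missing or empty resumptionToken means the list is complete.
        if token is None or not (token.text or "").strip():
            break
        # Follow-up requests carry only the verb and the token.
        params = {"verb": "ListRecords", "resumptionToken": token.text.strip()}
```

Calling list(harvest("http://data.fitzmuseum.cam.ac.uk/oai/")) would make one request per batch, which is why each batch shows up as one "Received resumption token" line in the status log.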

As for being stuck on "In Progress," that probably means the process failed at some point. I'd have to do some more digging to figure out exactly what's going on there, though.
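One rough way to see how far a harvest got is to look at the tokens in the status log. OAI-PMH treats resumption tokens as opaque, so nothing guarantees they carry meaning, but the ones logged above happen to be valid base64 text. The sketch below decodes them; reading the trailing number as a record offset is only a guess based on how it grows from one token to the next, and guessed_offset is a hypothetical helper:

```python
import base64


def decode_token(token):
    """Decode a base64 token to text (assumes this repository's token format)."""
    return base64.b64decode(token).decode("utf-8")


def guessed_offset(token):
    """Return the trailing number of the decoded token, assumed to be an offset."""
    return int(decode_token(token).rsplit("|", 1)[1])


first = "KnwqfG9haV9kY3wxMDA="  # logged at 09:15:32
last = "KnwqfG9haV9kY3w3MDA="   # logged at 09:28:14

print(decode_token(first))   # *|*|oai_dc|100
print(decode_token(last))    # *|*|oai_dc|700
print(guessed_offset(last) - guessed_offset(first))  # 600
```

If that reading is right, each request moves the harvest forward by 100 records, and the log timestamps show roughly two minutes between requests, so a large repository would take a long time to harvest in full.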

Corfromleuven November 10, 2015

I gave it a second try and had a similar result: the harvester is stuck in 'In Progress' mode and zero items are harvested. Bizarrely enough, items do seem to be harvested at first. Only some metadata is displayed (no images, but perhaps that is normal), and then after a while the items disappear.

I will also ask the hosting company if they have a clue about what is happening. Perhaps the dataset is too big for the space I'm paying for?

Corfromleuven November 16, 2015

The harvester is still stuck in 'In Progress' mode. But the good news is that more than 143,000 items have been harvested. The bad news is that the metadata is a complete mess. If I look online, via the URL, at the schemas that will be harvested, there should be no problem. Instead I get a ton of completely nonsensical metadata without images. To give just one example:

Dublin Core
Title

Sir Henry Goodricke, 2nd Baronet
Date

Wed May 07 10:00:00 BST 2008
Relation

term-106447

agent-175643

agent-127365

media-140873
Type

object
Identifier

117 II

2986

572

P.10862-R

162130

object-162130

Not really what you would use to build exhibits, especially without any media. Does anybody have an idea how this happens?

John Flatness November 16, 2015

That "complete mess" is unfortunately just the metadata that repository publishes through oai_dc. The harvester can really only work with what it's given, and in this case the data they're supplying appears to mostly be internal linkages and identifiers.

Corfromleuven November 16, 2015

Same conclusion here. I did a test run via http://validator.oaipmh.com/

Under ListRecords (oai_dc), this result was returned:

• HTTP status 200
• Content type application/xml
• Content XML checked.
• Request time is 0.664 sec
• Found empty dc:identifier
• Found empty dc:identifier
• Found empty dc:identifier
• Found empty dc:identifier

etc.

This basically means that the Dublin Core Scheme they're offering is flawed, making harvesting rather difficult indeed.
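A check like the validator's "Found empty dc:identifier" can be approximated locally. The Python sketch below counts dc:identifier elements with no text in a ListRecords response; the sample XML is invented for illustration and count_empty_identifiers is a hypothetical helper, not the validator's own logic:

```python
import xml.etree.ElementTree as ET

DC_IDENTIFIER = "{http://purl.org/dc/elements/1.1/}identifier"

# Invented sample response fragment, for illustration only.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"
    xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <ListRecords>
    <record><metadata><oai_dc:dc>
      <dc:identifier>object-1</dc:identifier>
      <dc:identifier></dc:identifier>
      <dc:identifier/>
    </oai_dc:dc></metadata></record>
  </ListRecords>
</OAI-PMH>"""


def count_empty_identifiers(xml_text):
    """Return (empty, total) counts of dc:identifier elements in the response."""
    root = ET.fromstring(xml_text)
    total = empty = 0
    for el in root.iter(DC_IDENTIFIER):
        total += 1
        # Treat missing text and whitespace-only text as empty.
        if not (el.text or "").strip():
            empty += 1
    return empty, total


print(count_empty_identifiers(SAMPLE))  # (2, 3)
```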