Problem with the OAI-PMH Harvester Plug-in

Hello,

I'm from Germany. We found Omeka on the web and are trying to use it for our publication database.
We need to harvest some archives from other institutes.

We are having problems harvesting from five of our institutes.

These are the addresses:

http://opac.fzk.de:81/oai/oai-2.0.cmp.S
http://www.helmholtz-berlin.de/pubbin/oai
http://www.ufz.de/oai.php
http://bib-app.gfz-potsdam.de/oai_hgf_all.php
http://zitmac05.gkss.de/fmi/xsl/oai/all.xsl

They all use the OAI interface.
Harvesting works with the PKP Harvester, but it has other problems ;o)
Can anyone help?

Sascha,

You've exposed some bugs in the OAI-PMH Harvester plugin, as well as some bugs in several of the OAI data providers you list. I'll address each one individually:

http://opac.fzk.de:81/oai/oai-2.0.cmp.S
Currently, the plugin does not support harvesting from data providers that do not support the (optional) ListSets request. This feature should be added in the next version.

http://www.helmholtz-berlin.de/pubbin/oai
The plugin does not recognize the data provider's oai_dc schema due to extra whitespace characters in the ListMetadataFormats response. This should be fixed in the next version.

http://www.ufz.de/oai.php
The plugin is sending an invalid variable to the view script. This should be fixed in the next version. Nevertheless, this repository is still harvestable.

http://bib-app.gfz-potsdam.de/oai_hgf_all.php
The data provider assigns an invalid schema and namespace to the oai_dc metadata format: "http://www.openarchives.org/OAI/2.0/dc.xsd" and "http://purl.org/dc/elements/1.1/"

http://zitmac05.gkss.de/fmi/xsl/oai/all.xsl
The data provider assigns an invalid schema to the oai_dc metadata format: "http://www.openarchives.org/OAI/2.0/oai_dc/oai_dc.xsd"
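
The whitespace problem described above for the helmholtz-berlin.de provider can be illustrated with a short Python sketch. This is not the plugin's code (the plugin is written in PHP). The helper name `normalize_uri` is made up for illustration, and the padded value is an invented example, not the provider's actual response.

```python
# Hypothetical helper, not part of the OAI-PMH Harvester plugin.
OAI_DC_SCHEMA = "http://www.openarchives.org/OAI/2.0/oai_dc.xsd"


def normalize_uri(raw):
    """Drop leading and trailing whitespace (including newlines) around a URI."""
    return raw.strip()


# Invented example of a <schema> value padded with whitespace.
padded = "\n    http://www.openarchives.org/OAI/2.0/oai_dc.xsd\n  "

print(padded == OAI_DC_SCHEMA)                 # exact comparison fails
print(normalize_uri(padded) == OAI_DC_SCHEMA)  # comparison after trimming succeeds
```

Trimming before comparing keeps the match strict on the URI itself while ignoring formatting noise in the response.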

I hope this helps.

-Jim

Hi Jim,

Thank you for your fast answer.

Do you know when you will release the next version of the harvester?
Do you perhaps have a beta version that I can test?

Sascha

Hi Jim,

Regarding the interface at http://bib-app.gfz-potsdam.de/oai_hgf_all.php:

Why is there a problem with this address?

Here is the Identify response for this address:

http://bib-app.gfz-potsdam.de/oai_hgf_all.php?verb=Identify

<OAI-PMH xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/           http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd">
<responseDate>2009-11-19T14:21:24Z</responseDate>
<request verb="Identify">http://bib-app.gfz-potsdam.de/oai_hgf_all.php</request>
<Identify>
<repositoryName>
Deutsches GeoForschungsZentrum GFZ, GERMANY, Publication Server all metadata
</repositoryName>
<baseURL>http://bib-app.gfz-potsdam.de/oai_hgf_all.php</baseURL>
<protocolVersion>2.0</protocolVersion>
<adminEmail>bib@gfz-potsdam.de</adminEmail>
<earliestDatestamp>1990-01-01</earliestDatestamp>
<deletedRecord>no</deletedRecord>
<granularity>YYYY-MM-DD</granularity>
<description>
<oai-identifier xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai-identifier       http://www.openarchives.org/OAI/2.0/oai-identifier.xsd">
<scheme>oai</scheme>
<repositoryIdentifier>gfz-potsdam.de</repositoryIdentifier>
<delimiter>:</delimiter>
<sampleIdentifier>oai:gfz-potsdam.de:8010</sampleIdentifier>
</oai-identifier>
</description>
</Identify>
</OAI-PMH>

And here is the Identify response for an address that works:
http://edoc.mpg.de/ac_p_oai.pl?verb=Identify

<OAI-PMH xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/ http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd">
<responseDate>2009-11-19T14:13:35Z</responseDate>
<request verb="Identify">http://edoc.mpg.de/ac_p_oai.pl</request>
<Identify>
<repositoryName>
Max Planck Society - released data of the MPG eDocument Server. Contains only data, that has been published.
</repositoryName>
<baseURL>http://edoc.mpg.de/ac_p_oai.pl</baseURL>
<protocolVersion>2.0</protocolVersion>
<adminEmail>vdm@zim.mpg.de</adminEmail>
<earliestDatestamp>1998-01-01</earliestDatestamp>
<deletedRecord>persistent</deletedRecord>
<granularity>YYYY-MM-DD</granularity>
<description>
<oai-identifier xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai-identifier         http://www.openarchives.org/OAI/2.0/oai-identifier.xsd">
<scheme>oai</scheme>
<repositoryIdentifier>edoc.mpg.de</repositoryIdentifier>
<delimiter>:</delimiter>
<sampleIdentifier>oai:edoc.mpg.de:123456</sampleIdentifier>
</oai-identifier>
</description>
<description>
<eprints xsi:schemaLocation="http://www.openarchives.org/OAI/1.1/eprints         http://www.openarchives.org/OAI/1.1/eprints.xsd">
<content>
<text>
The eDoc server provides a unique entry point to the accumulated research output of the Max Planck Society. Institutes of the Max Planck Society are invited by the president to use the eDoc -Server to collect their scientific documents in digital form, manage their publication data and to increase the visibility of their digital collections. Via eDoc, scientists can make their work openly accessible online with the technological and institutional backing of the Max Planck Society. This interface provides access to all publicly, released data.
</text>
</content>
<metadataPolicy>
<text>
Noncommercial use only, metadata harvesting permitted through OAI interface only.
</text>
</metadataPolicy>
<dataPolicy>
<text>
Noncommercial use only, metadata harvesting permitted through OAI interface only.
</text>
</dataPolicy>
</eprints>
</description>
</Identify>
</OAI-PMH>

I can't see a difference between the two addresses.

Sascha

We have no milestone set for the public release of the next version. You can find the in-development plugin in our SVN repository, but you must consider it unstable and incomplete. It does not yet support harvesting from data providers that do not support the ListSets request.
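
For reference, the OAI-PMH protocol lets a repository without sets answer a ListSets request with a `noSetHierarchy` error. A harvester could treat that as "no sets" and harvest the whole repository instead. Below is a minimal Python sketch of that idea, assuming the response XML has already been fetched. The function `list_sets` and the sample response are illustrative only and are not taken from the plugin.

```python
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def list_sets(response_xml):
    """Return the setSpec values, or an empty list if the repository has no sets."""
    root = ET.fromstring(response_xml)
    error = root.find(f"{OAI_NS}error")
    if error is not None:
        if error.get("code") == "noSetHierarchy":
            return []  # sets are optional: fall back to harvesting everything
        raise ValueError(f"OAI-PMH error: {error.get('code')}")
    return [el.text.strip() for el in root.iter(f"{OAI_NS}setSpec") if el.text]


# Invented sample response from a repository without sets.
no_sets = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <request verb="ListSets">http://example.org/oai</request>
  <error code="noSetHierarchy">This repository does not support sets</error>
</OAI-PMH>"""

print(list_sets(no_sets))
```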

As for your second question: there is no problem with the address. I refer to the oai_dc schema and metadataNamespace, found here.

The schema should be:
http://www.openarchives.org/OAI/2.0/oai_dc.xsd

The metadataNamespace should be:
http://www.openarchives.org/OAI/2.0/oai_dc/
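
A quick way to check a ListMetadataFormats response against these two values is sketched below in Python. It is not the plugin's code. The function name `check_oai_dc` is hypothetical, and the sample response is reconstructed from the values quoted earlier in this thread rather than copied from the live server.

```python
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"
EXPECTED_SCHEMA = "http://www.openarchives.org/OAI/2.0/oai_dc.xsd"
EXPECTED_NAMESPACE = "http://www.openarchives.org/OAI/2.0/oai_dc/"


def check_oai_dc(response_xml):
    """Return a list of problems found in the oai_dc metadataFormat entry."""
    root = ET.fromstring(response_xml)
    for fmt in root.iter(f"{OAI_NS}metadataFormat"):
        if fmt.findtext(f"{OAI_NS}metadataPrefix", "").strip() != "oai_dc":
            continue
        schema = fmt.findtext(f"{OAI_NS}schema", "").strip()
        namespace = fmt.findtext(f"{OAI_NS}metadataNamespace", "").strip()
        problems = []
        if schema != EXPECTED_SCHEMA:
            problems.append(f"schema is {schema!r}")
        if namespace != EXPECTED_NAMESPACE:
            problems.append(f"metadataNamespace is {namespace!r}")
        return problems
    return ["no oai_dc metadataFormat found"]


# Reconstructed sample using the schema and namespace reported for the GFZ provider.
sample = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListMetadataFormats>
    <metadataFormat>
      <metadataPrefix>oai_dc</metadataPrefix>
      <schema>http://www.openarchives.org/OAI/2.0/dc.xsd</schema>
      <metadataNamespace>http://purl.org/dc/elements/1.1/</metadataNamespace>
    </metadataFormat>
  </ListMetadataFormats>
</OAI-PMH>"""

for problem in check_oai_dc(sample):
    print(problem)
```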

The address http://bib-app.gfz-potsdam.de/oai_hgf_all.php supports the ListSets request, but the harvester gives me an error:
"There are no available data maps that are compatable with this repository. You will not be able to harvest from this repository"

I don't know why. I think the schema is OK. This address is harvested by many institutes without errors.

Again, the problem with that data provider is its use of an invalid schema URI for the oai_dc metadata format. See the ListMetadataFormats request for yourself:

http://bib-app.gfz-potsdam.de/oai_hgf_all.php?verb=ListMetadataFormats

The provided schema URL is invalid.

The OAI-PMH plugin is not forgiving when given an invalid schema URL because its data maps are written to conform to a particular schema. Any deviation from the structure defined by that schema will likely cause data mapping errors.

We could add a "force harvest" feature that attempts to harvest a repository even if a matching schema is not found. But doing so would only discourage standards compliance.
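
To make the trade-off concrete, here is a rough Python sketch of a strict schema-to-data-map lookup with an opt-in fallback. It is purely illustrative: the plugin is written in PHP, and `DATA_MAPS`, `pick_data_map`, and the `force` flag are hypothetical names, not existing plugin features.

```python
# Hypothetical registry: schema URI -> name of the data map that understands it.
DATA_MAPS = {
    "http://www.openarchives.org/OAI/2.0/oai_dc.xsd": "oai_dc",
}


def pick_data_map(schema_uri, prefix, force=False):
    """Pick a data map by exact schema URI; optionally fall back to the prefix."""
    data_map = DATA_MAPS.get(schema_uri.strip())
    if data_map is not None:
        return data_map
    if force and prefix in DATA_MAPS.values():
        # Risky: the records may not follow the structure the map expects.
        return prefix
    raise LookupError("no compatible data map for schema " + schema_uri)


bad_schema = "http://www.openarchives.org/OAI/2.0/dc.xsd"
print(pick_data_map(bad_schema, "oai_dc", force=True))
```

Without `force=True` the same call raises `LookupError`, which corresponds to the strict behavior described above.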

Jim

Hello, thank you for your answer. I have spoken with the institutes that have the problem. I hope they change their URLs.

But I think it's a good idea to add a "force harvest" feature to the harvester.

Regards

Sascha

Hi,

I am currently receiving the "There are no available data maps that are compatable with this repository." message when attempting to harvest the data provider:

http://digital.cjh.org/OAI-PUB

I compared the Identify, ListSets and ListRecords feeds to those of the http://vagovernmentmatters.org/oai-pmh-repository/request feed and can't see any glaring differences.

Any help would be greatly appreciated.

Best,
Jason

Jason,

That data provider is using an invalid schema URI for the oai_dc metadata format.

http://digital.cjh.org/OAI-PUB?verb=ListMetadataFormats

The provided schema URL is invalid.

Also, the plugin does not harvest from marc21.

See above for more details.

Thank you. That took care of it.

Hi,

When attempting to harvest this feed:

http://digital.cjh.org/OAI-PUB?verb=ListRecords&metadataPrefix=oai_dc&set=ajhs_american-jewish_historical_society

Omeka creates records, but no data is imported. I was able to successfully harvest set 12 from the http://vagovernmentmatters.org feed you posted.

I also receive the following error message:

OaipmhHarvester_Harvest_OaiDc::harvestRecord(): Node no longer exists in /plugins/OaipmhHarvester/libraries/OaipmhHarvester/Harvest/OaiDc.php on line 68

Thanks in advance for your help.

Best,
Jason

Jason,

The oai_dc XML does not validate. There are at least three issues:

  1. xmlns:oai_dc should be http://www.openarchives.org/OAI/2.0/oai_dc/
  2. the first value of the pair in xsi:schemaLocation should be http://www.openarchives.org/OAI/2.0/oai_dc/
  3. ampersands should be encoded using their entity reference (&amp;)

Make these changes and test the harvester again.
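
The third point can be checked quickly with the Python sketch below. It shows that a bare ampersand makes the XML not well-formed and that escaping it fixes the problem. The title value and the helper name `is_well_formed` are invented for illustration and are not taken from the feed or the plugin.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def is_well_formed(xml_text):
    """Return True if the text parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


raw_title = "Letters & Papers"  # invented value containing a bare ampersand
broken = f"<title>{raw_title}</title>"
fixed = f"<title>{escape(raw_title)}</title>"

print(is_well_formed(broken))  # bare "&" is rejected by the parser
print(is_well_formed(fixed))   # "&amp;" parses cleanly
print(fixed)
```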

Jim

Thanks for your help Jim. 1 and 2 did the trick.

Regards,
Jason