Diacritics in HTML fields converted to character codes when exported

Hello,

I am having an issue in which any character with a diacritic that is wrapped in an HTML tag is automatically converted to its HTML character code by Omeka when exporting via the OAI-PMH plugin. This is a problem because the "&" character that starts these character codes is an escape character in XML, which causes parsing to fail when trying to aggregate by setSpecs (collection ID), as is done by our local DPLA hub. Additionally, individual metadata fields that contain diacritics wrapped in HTML end up illegible and messy.

Is there any way to avoid this?

Thanks!

Tristan

Can you link to or post an example request showing the problem, and/or the item show page for an item with the problem?

The OAI-PMH plugin itself should be outputting valid XML. If it's not, that's definitely a bug. But, I'd want to see what's actually happening to see if our output looks right or not.

Sure! An item such as this:

http://www.cppdigitallibrary.org/items/show/4451

If you look at the relation field, the "ü" character is automatically converted to "ü" when I change the field to an HTML field in order to provide a link to the external resource. When this is pulled in by our local DPLA hub, the HTML is stripped, leaving "Memento Mütter," which is undesirable.

The XML parsing issue came up when I had collections with HTML tags in the collection metadata, such as "Traité d'anatomie humaine (1891), Leo Testut" for the title of a collection. The "é" character is converted to "é" (since I had the title italicized), and since the "&" character is an escape character in XML, the ListSets query broke. Alternately, checking ListSets with the OAI-PMH validator gave me an "invalid request" response. I have since removed the HTML from those collections to allow the PA DPLA hub to do a test aggregation, so you will no longer see a broken response with our request URL. Ideally, however, we'd love to be able to keep the formatting locally, even if the HTML needs to be stripped for aggregation, particularly as we'd like to link to as many related resources as possible without having sloppy metadata aggregated.
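
To illustrate the failure mode, here is a minimal Python sketch, assuming the break came from an HTML named entity such as `&eacute;` reaching the XML unescaped. XML predefines only five named entities (`&amp;`, `&lt;`, `&gt;`, `&quot;`, `&apos;`), so a parser with no DTD rejects any other name. The `setName` fragments and the `parses` helper are made up for illustration and are not actual plugin output.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragments for illustration only (not real OAI-PMH plugin output).
with_named_entity = "<setName>Trait&eacute; d'anatomie humaine</setName>"
with_numeric_ref = "<setName>Trait&#233; d'anatomie humaine</setName>"
with_raw_utf8 = "<setName>Traité d'anatomie humaine</setName>"


def parses(xml_text):
    """Return True if xml_text is well-formed XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        # An HTML-only named entity is undefined in plain XML.
        return False


for label, fragment in [
    ("named entity", with_named_entity),
    ("numeric reference", with_numeric_ref),
    ("raw UTF-8", with_raw_utf8),
]:
    print(label, "->", "ok" if parses(fragment) else "parse error")
```

Only the named-entity version should fail, which is why either raw UTF-8 characters or numeric references would be safe to aggregate.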

I don't think the issue is with the OAI-PMH plugin per se, but more with how Omeka treats diacritic characters wrapped in HTML tags more generally. Is there a way to keep Omeka from doing that conversion, or perhaps another workaround?

Thank you!

I think some of the examples you tried to post there got, amusingly, parsed by bbPress. You'd have to use backticks to post an HTML entity written out (like what I assume you did, e.g. `&eacute;`).

There's nothing I'm aware of in Omeka's handling per se that would be turning Unicode characters into HTML entities, so my suspicion is that this is something that TinyMCE, the HTML editor widget, is doing in order to be "helpful." I'll have to see if I can confirm that's the problem, and see if there's some way to prevent it. Ideally we'd only want TinyMCE to touch the mandatory escapes and nothing else.

As for the problem with the OAI output being invalid, that sounds like something that's been fixed in more recent releases of the OAI-PMH Repository plugin. You might try updating it to the latest version to see if it fixes that particular dimension of the issue.

For the more general problem of the entities being encoded at all, you can apply this patch, which tells TinyMCE to only encode the bare minimum of HTML entities.

Resaving items after applying this should replace the special-character entities with their "normal" UTF-8 versions.
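
For anyone who wants to check what a resaved value should look like, here is a rough Python sketch of the intended result: optional entities become plain UTF-8 characters while the mandatory escapes stay put. This is not the patch itself and not what TinyMCE runs internally; `decode_optional_entities` is a hypothetical helper, and it assumes the stored field is a plain HTML string.

```python
import html
import re

# Characters that have to remain escaped in HTML/XML text.
MANDATORY = {"&", "<", ">", '"', "'"}

# Matches named (&eacute;), decimal (&#233;) and hex (&#xE9;) references.
ENTITY_RE = re.compile(r"&(?:#[0-9]+|#[xX][0-9A-Fa-f]+|[A-Za-z][A-Za-z0-9]*);")


def decode_optional_entities(markup):
    """Replace non-mandatory character references with the UTF-8 character."""

    def replace(match):
        reference = match.group(0)
        decoded = html.unescape(reference)
        # Keep unknown references and the mandatory escapes untouched.
        if decoded == reference or decoded in MANDATORY:
            return reference
        return decoded

    return ENTITY_RE.sub(replace, markup)


print(decode_optional_entities("<em>Trait&eacute; d'anatomie humaine</em>"))
print(decode_optional_entities("Memento M&uuml;tter &amp; more"))
```

Note that this sketch also decodes things like `&nbsp;` into the raw character, which may or may not be what you want in stored metadata.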

Ha! Yes, they did get parsed correctly.

I didn't realize our plugin was out of date. Updating it solved the XML parsing issue exactly. Thanks for pointing me toward that; I will make sure to check versions first from here on out.

I will try the patch you've provided and see if the item-level metadata comes out clean in the PA DPLA hub test aggregator. It might take a couple of days depending on when I can get the test aggregation scheduled, but I'll let you know. Thanks so much for all your help!

Hi John,

I've made the change you indicated to the globals.js file, but is there something I need to do for this change to take effect?

Thanks again!

Tristan

It would only do anything if you went and edited some item that was using special characters in HTML.

You might also need to refresh and/or clear your cache to get the browser to use the updated JavaScript.

I'm curious if you tested the change and had good results? I changed the line in the globals.js file, dumped the browser cache, edited an item and saved it but saw no difference in the output.

I've tried editing both the globals.js file in the directory you reference on GitHub as well as in our custom theme's javascripts folder.

Thanks!

I did test the change and it seemed to have the desired results.