Strange HTML filtering results on import

I'm getting some results from CSV Import that appear strange to me.

I am using a test CSV file with three items, which I import as Moving Image items. Each item has an element containing HTML (an IFRAME).

In Omeka Settings -> Security, Allowed HTML Elements contains an entry for iframe, which I added some time ago.

When I run an import with CSV Import and the Use HTML? box is checked for that element, the HTML is stripped, leaving only the text the element contained.

When I import and the Use HTML? box is not checked for that element, the HTML comes through intact.

Can someone explain why this happens?

Thanks,

Steve

Can you successfully put an iframe with the exact same code into an item manually? If it's an issue with the HTML purifier and/or your allowed elements, it would be blocked equally whether it came from the importer or was entered directly.

If it is the purifier, you could always temporarily disable the HTML filtering while you run the import.

Yes, I can paste the iframe code into the metadata element on the item edit page and when "Use HTML?" is checked, the code comes through unescaped.

I should clarify. If the code is pasted while Use HTML? is not checked, and Use HTML? is checked afterwards, the code comes through fine. If it's done in the reverse order, the code is escaped. So when entering HTML code manually, you apparently can't check Use HTML? first and then paste.

I've started with an empty textarea with Use HTML? checked, pasted the iframe code into it, then clicked the HTML button to view the source. It's escaped.

John,
how do I disable HTML filtering? Do you mean the checkbox in Settings --> Security labelled Enable HTML Filtering?

Yes, that's the one.

I just ran an import with HTML Filtering DISABLED and the Use HTML? box checked in the import configuration. The result was that all HTML was stripped from the input (and the text passed through). Is this expected behavior?

You're sure you could directly enter an iframe into Item element data when the filtering was enabled under Security Settings?

I don't think the filter allows that, even with iframe added to the Allowed Elements list. The same filtering should be happening for CSV Import and for normal entry.

The key difference here, I think, is that the CSV Import plugin currently forces the use of the HTML filter, even when it's disabled in the security settings.

I just ran the following test.

Test Nonsense

Data Entry: Manual
Use HTML: not checked
HTML Filtering: Enabled
Allowed HTML: Allowed HTML and Attributes reset to default
Test Data: <nonsense>nonsense</nonsense> pasted into DC Description

Result:

Title
Test Nonsense

Description
<nonsense>nonsense</nonsense>

I was thinking the same thing, that the CSV Import plugin is forcing the HTML filter on regardless of the security settings. I'll have to take a look at that.

It appears that if Use HTML is checked, the HTML filter is applied (if enabled in the Settings), and if Use HTML is not checked, the filter is not applied. This applies to directly entered element data, not imported data.

Yes, that's the expected behavior. The Use HTML box not only adds the WYSIWYG editor, it also marks the text as HTML.

Omeka automatically escapes anything not marked as HTML, so you'd see the literal form of any HTML tags. Marking something as HTML turns off that escaping, but also applies the HTML "purify" filter, if that's enabled, to strip unwanted tags.

The only real wrinkle here is that CSV Import ignores the setting for the HTML filter and always applies it to HTML-marked columns, even when it's disabled under the security settings.
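To make the escaping-versus-filtering distinction above concrete, here is a rough model in plain PHP. It is only a sketch and not Omeka's actual code: display_element_text() is a hypothetical helper, and strip_tags() stands in for the HTML "purify" filter, which is far more thorough. In Omeka the filter runs when the text is saved and the escaping when it is displayed; the sketch folds both into one function for brevity.

```php
<?php
// Hypothetical helper (not part of Omeka): models what ends up on the page
// for a piece of element text, following the explanation in this thread.
function display_element_text(
    string $text,
    bool $isHtml,          // the "Use HTML?" checkbox
    bool $filterEnabled,   // Settings -> Security -> Enable HTML Filtering
    array $allowedTags = ['p', 'em', 'strong']  // illustrative whitelist only
): string {
    if (!$isHtml) {
        // Not marked as HTML: escape, so tags show up literally.
        return htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
    }
    if ($filterEnabled) {
        // Crude stand-in for the purifier: drop tags that are not whitelisted,
        // keeping the text they contained.
        return strip_tags($text, '<' . implode('><', $allowedTags) . '>');
    }
    // Marked as HTML, filter off: the markup passes through untouched.
    return $text;
}

$input = '<nonsense>nonsense</nonsense>';
echo display_element_text($input, false, true), "\n"; // &lt;nonsense&gt;nonsense&lt;/nonsense&gt;
echo display_element_text($input, true, true), "\n";  // nonsense
echo display_element_text($input, true, false), "\n"; // <nonsense>nonsense</nonsense>
```

The first call matches the "Test Nonsense" result above (literal tags shown), and the second matches what the import produced: the tags gone, the text kept.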

Looking at the code in Element.php in CSV Import:


public function map($row, $result)
{
    if ($this->_isHtml) {
        $filter = new Omeka_Filter_HtmlPurifier();
        $text = $filter->filter($row[$this->_columnName]);
    } else {
        $text = $row[$this->_columnName];
    }

This appears to be where the filter is created without honoring the settings. Am I correct that Allowed HTML will revert to the default, given that the code does not provide a whitelist? Because that seems to be the problem: marking the element as HTML filters out the iframe, even if it's included in the whitelist in Settings.
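For illustration, one way that branch could be made to respect the security setting is to check the option before filtering. This is only a sketch, not a patch from the plugin: csv_cell_text() is a hypothetical helper, the option name html_purifier_is_enabled is an assumption about how Omeka stores the "Enable HTML Filtering" checkbox, and the stubs at the top exist only so the snippet runs outside Omeka.

```php
<?php
// Stand-ins so the sketch runs outside Omeka. Inside Omeka the real
// get_option() and Omeka_Filter_HtmlPurifier would be used instead.
if (!function_exists('get_option')) {
    function get_option($name)
    {
        // Assumed option name; '0' simulates filtering being disabled.
        $options = ['html_purifier_is_enabled' => '0'];
        return $options[$name] ?? null;
    }
}
if (!class_exists('Omeka_Filter_HtmlPurifier')) {
    class Omeka_Filter_HtmlPurifier
    {
        // Very rough stand-in: the real class runs HTML Purifier.
        public function filter($value)
        {
            return strip_tags($value);
        }
    }
}

// Hypothetical helper mirroring the if/else in map().
function csv_cell_text(array $row, string $columnName, bool $isHtml): string
{
    $text = $row[$columnName];
    // Filter only when the column is marked as HTML AND filtering is enabled.
    if ($isHtml && get_option('html_purifier_is_enabled')) {
        $filter = new Omeka_Filter_HtmlPurifier();
        $text = $filter->filter($text);
    }
    return $text;
}

$row = ['Description' => '<iframe src="x"></iframe>Clip'];
echo csv_cell_text($row, 'Description', true), "\n";
```

With the stubbed option set to '0', the HTML-marked column passes through unfiltered; with a truthy value it would be filtered as it is now.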

The lists of allowed elements are loaded fine for that code, since the calls to read the options for the purifier settings are contained within the Omeka_Filter_HtmlPurifier class.

The specific problem with iframe elements is the same, I think, as the one raised in another recent thread: The current setup of the HTML Purifier doesn't support iframe as an allowed element, even if you specifically include it. HTML Purifier requires a setting that the input is "trusted" before it will allow some elements like iframe to be used, with only a few exceptions.
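As a point of reference, this is roughly what allowing iframes looks like when HTML Purifier is configured directly, outside Omeka. It is a configuration sketch under several assumptions: the library is already loaded, it is version 4.4 or later (where the HTML.SafeIframe and URI.SafeIframeRegexp directives exist), and the regular expression and VIDEO_ID are placeholders. Omeka's own filter class builds its configuration differently, so this does not drop into Omeka as-is.

```php
<?php
// Assumes HTML Purifier (4.4+) has already been loaded by your autoloader.
$config = HTMLPurifier_Config::createDefault();

// Allow <iframe>, but only for sources matching the regexp below.
$config->set('HTML.SafeIframe', true);
$config->set(
    'URI.SafeIframeRegexp',
    '%^https?://(www\.youtube\.com/embed/|player\.vimeo\.com/video/)%'
);

// The broader alternative mentioned above is to mark all input as trusted:
// $config->set('HTML.Trusted', true);

$purifier = new HTMLPurifier($config);
echo $purifier->purify(
    '<iframe src="https://www.youtube.com/embed/VIDEO_ID"></iframe>'
);
```

The whitelist approach keeps the filter useful for untrusted input, whereas HTML.Trusted relaxes it for everything.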

This is why I was asking initially if you were able to insert an iframe into an Item manually. Even with the tag included in the Allowed Elements list, if you try to insert an iframe into a "Use HTML"-checked field, with the HTML filtering turned on, the tag will get filtered out.

I now understand why I'm able to sneak an iframe into an item type element. With HTML filtering disabled, I can copy and paste the iframe code into the element, save the item, then edit it again, go to the element input, and click Use HTML?. I then get a rendered iframe, not a literal one. This was confusing.

This is why I said, yes, I was able to insert an iframe into an Item manually.