CSV Import Error - How to debug

Hi,

I am trying to do a bulk upload using the CSV Import tool, but the import keeps getting aborted because of an error.

How can I log exactly what is going on? I have enabled error logging in application/config/config.ini, but it does not give much information. Is there a better way to get that information?

First/easiest thing to do is to grab the latest version, 2.0.2, just released yesterday. The logging is much improved in that one.

That's the version I have installed. Is there anything else I need to do to enable logging, other than in config.ini?

If you have logging enabled in config.ini, any error that's causing the import to stop and give an error status should appear in the Omeka error log.

You say there's not much information, but is there anything at all indicating an error?

Here is a sample output:

2014-02-07T17:07:03-05:00 ERR (3): exception 'Zend_Db_Statement_Mysqli_Exception' with message 'Mysqli statement execute error : Incorrect string value: '\xB7'\x0A\x0AN\x0A...' for column 'text' at row 1' in /Library/Server/Web/Data/Sites/pasomeka.digark.info/application/libraries/Zend/Db/Statement/Mysqli.php:214
Stack trace:
#0 /Library/Server/Web/Data/Sites/pasomeka.digark.info/application/libraries/Zend/Db/Statement.php(303): Zend_Db_Statement_Mysqli->_execute(Array)
#1 /Library/Server/Web/Data/Sites/pasomeka.digark.info/application/libraries/Zend/Db/Adapter/Abstract.php(480): Zend_Db_Statement->execute(Array)
#2 [internal function]: Zend_Db_Adapter_Abstract->query('INSERT INTO `om...', Array)
#3 /Library/Server/Web/Data/Sites/pasomeka.digark.info/application/libraries/Omeka/Db.php(79): call_user_func_array(Array, Array)
#4 /Library/Server/Web/Data/Sites/pasomeka.digark.info/application/libraries/Omeka/Db.php(252): Omeka_Db->__call('query', Array)
#5 /Library/Server/Web/Data/Sites/pasomeka.digark.info/application/libraries/Omeka/Db.php(252): Omeka_Db->query('INSERT INTO `om...', Array)
#6 /Library/Server/Web/Data/Sites/pasomeka.digark.info/application/libraries/Omeka/Record/AbstractRecord.php(543): Omeka_Db->insert('ElementText', Array)
#7 /Library/Server/Web/Data/Sites/pasomeka.digark.info/application/models/Mixin/ElementText.php(654): Omeka_Record_AbstractRecord->save()
#8 /Library/Server/Web/Data/Sites/pasomeka.digark.info/application/models/Mixin/ElementText.php(93): Mixin_ElementText->saveElementTexts()
#9 [internal function]: Mixin_ElementText->afterSave(Array)
#10 /Library/Server/Web/Data/Sites/pasomeka.digark.info/application/libraries/Omeka/Record/AbstractRecord.php(251): call_user_func_array(Array, Array)
#11 /Library/Server/Web/Data/Sites/pasomeka.digark.info/application/libraries/Omeka/Record/AbstractRecord.php(280): Omeka_Record_AbstractRecord->delegateToMixins('afterSave', Array, true)
#12 /Library/Server/Web/Data/Sites/pasomeka.digark.info/application/libraries/Omeka/Record/AbstractRecord.php(550): Omeka_Record_AbstractRecord->runCallbacks('afterSave', Array)
#13 /Library/Server/Web/Data/Sites/pasomeka.digark.info/application/models/Item.php(322): Omeka_Record_AbstractRecord->save()
#14 /Library/Server/Web/Data/Sites/pasomeka.digark.info/application/models/Builder/Item.php(204): Item->saveFiles()
#15 /Library/Server/Web/Data/Sites/pasomeka.digark.info/application/libraries/globals.php(554): Builder_Item->addFiles('Url', 'http://www.diga...', Array)
#16 /Library/Server/Web/Data/Sites/pasomeka.digark.info/plugins/CsvImport/models/CsvImport/Import.php(715): insert_files_for_item(Object(Item), 'Url', 'http://www.diga...', Array)
#17 /Library/Server/Web/Data/Sites/pasomeka.digark.info/plugins/CsvImport/models/CsvImport/Import.php(583): CsvImport_Import->_addItemFromRow(Array)
#18 /Library/Server/Web/Data/Sites/pasomeka.digark.info/plugins/CsvImport/models/CsvImport/Import.php(331): CsvImport_Import->_importLoop(0)
#19 [internal function]: CsvImport_Import->start()
#20 /Library/Server/Web/Data/Sites/pasomeka.digark.info/plugins/CsvImport/models/CsvImport/ImportTask.php(39): call_user_func(Array)
#21 /Library/Server/Web/Data/Sites/pasomeka.digark.info/application/libraries/Omeka/Job/Process/Wrapper.php(29): CsvImport_ImportTask->perform()
#22 /Library/Server/Web/Data/Sites/pasomeka.digark.info/application/scripts/background.php(61): Omeka_Job_Process_Wrapper->run(Array)
#23 {main}

I just am not sure what to make of this error message...

I should clarify: I am importing a list of PDF files. The first column of the CSV file is the path to the PDF files, and the rest of the columns are just metadata.

Here is more from the error.log, just preceding the output in my previous post.

2014-02-07T17:06:56-05:00 DEBUG (7): [CsvImport][#9] Queued import.
2014-02-07T17:06:56-05:00 DEBUG (7): [CsvImport][#9] Started import.
2014-02-07T17:06:56-05:00 DEBUG (7): [CsvImport][#9] Running item import loop. Memory usage: 12955448
2014-02-07T17:07:03-05:00 ERR (3): [CsvImport][#9] exception 'Zend_Db_Statement_Mysqli_Exception' with message 'Mysqli statement execute error : Incorrect string value: '\xB7'\x0A\x0AN\x0A...' for column 'text' at row 1' in /Library/Server/Web/Data/Sites/pasomeka.digark.info/application/libraries/Zend/Db/Statement/Mysqli.php:214
Stack trace:
#0 /Library/Server/Web/Data/Sites/pasomeka.digark.info/application/libraries/Zend/Db/Statement.php(303): Zend_Db_Statement_Mysqli->_execute(Array)
#1 /Library/Server/Web/Data/Sites/pasomeka.digark.info/application/libraries/Zend/Db/Adapter/Abstract.php(480): Zend_Db_Statement->execute(Array)
#2 [internal function]: Zend_Db_Adapter_Abstract->query('INSERT INTO `om...', Array)
#3 /Library/Server/Web/Data/Sites/pasomeka.digark.info/application/libraries/Omeka/Db.php(79): call_user_func_array(Array, Array)
#4 /Library/Server/Web/Data/Sites/pasomeka.digark.info/application/libraries/Omeka/Db.php(252): Omeka_Db->__call('query', Array)
#5 /Library/Server/Web/Data/Sites/pasomeka.digark.info/application/libraries/Omeka/Db.php(252): Omeka_Db->query('INSERT INTO `om...', Array)
#6 /Library/Server/Web/Data/Sites/pasomeka.digark.info/application/libraries/Omeka/Record/AbstractRecord.php(543): Omeka_Db->insert('ElementText', Array)
#7 /Library/Server/Web/Data/Sites/pasomeka.digark.info/application/models/Mixin/ElementText.php(654): Omeka_Record_AbstractRecord->save()
#8 /Library/Server/Web/Data/Sites/pasomeka.digark.info/application/models/Mixin/ElementText.php(93): Mixin_ElementText->saveElementTexts()
#9 [internal function]: Mixin_ElementText->afterSave(Array)
#10 /Library/Server/Web/Data/Sites/pasomeka.digark.info/application/libraries/Omeka/Record/AbstractRecord.php(251): call_user_func_array(Array, Array)
#11 /Library/Server/Web/Data/Sites/pasomeka.digark.info/application/libraries/Omeka/Record/AbstractRecord.php(280): Omeka_Record_AbstractRecord->delegateToMixins('afterSave', Array, true)
#12 /Library/Server/Web/Data/Sites/pasomeka.digark.info/application/libraries/Omeka/Record/AbstractRecord.php(550): Omeka_Record_AbstractRecord->runCallbacks('afterSave', Array)
#13 /Library/Server/Web/Data/Sites/pasomeka.digark.info/application/models/Item.php(322): Omeka_Record_AbstractRecord->save()
#14 /Library/Server/Web/Data/Sites/pasomeka.digark.info/application/models/Builder/Item.php(204): Item->saveFiles()
#15 /Library/Server/Web/Data/Sites/pasomeka.digark.info/application/libraries/globals.php(554): Builder_Item->addFiles('Url', 'http://www.diga...', Array)
#16 /Library/Server/Web/Data/Sites/pasomeka.digark.info/plugins/CsvImport/models/CsvImport/Import.php(715): insert_files_for_item(Object(Item), 'Url', 'http://www.diga...', Array)
#17 /Library/Server/Web/Data/Sites/pasomeka.digark.info/plugins/CsvImport/models/CsvImport/Import.php(583): CsvImport_Import->_addItemFromRow(Array)
#18 /Library/Server/Web/Data/Sites/pasomeka.digark.info/plugins/CsvImport/models/CsvImport/Import.php(331): CsvImport_Import->_importLoop(0)
#19 [internal function]: CsvImport_Import->start()
#20 /Library/Server/Web/Data/Sites/pasomeka.digark.info/plugins/CsvImport/models/CsvImport/ImportTask.php(39): call_user_func(Array)
#21 /Library/Server/Web/Data/Sites/pasomeka.digark.info/application/libraries/Omeka/Job/Process/Wrapper.php(29): CsvImport_ImportTask->perform()
#22 /Library/Server/Web/Data/Sites/pasomeka.digark.info/application/scripts/background.php(61): Omeka_Job_Process_Wrapper->run(Array)
#23 {main}

What kind of data is in this file?

The error you're getting is a relayed error from MySQL complaining about the encoding of the data that's coming from your CSV file. Is the file UTF-8 encoded (the text encoding Omeka generally expects)?
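If you want to check that yourself, something like the following rough sketch would list the lines of a file that are not valid UTF-8. The function name is made up for illustration (it is not part of Omeka or CsvImport), and it assumes the mbstring extension is available. It also works line by line, so a CSV field containing embedded newlines will span several reported lines.

```php
<?php
// Hypothetical helper, not part of Omeka or the CsvImport plugin.
// Returns the 1-based numbers of the lines whose bytes are not valid UTF-8.
function findInvalidUtf8Lines($path)
{
    $handle = fopen($path, 'rb');
    if ($handle === false) {
        throw new RuntimeException("Cannot open $path");
    }

    $badLines = array();
    $lineNumber = 0;
    while (($line = fgets($handle)) !== false) {
        $lineNumber++;
        // mb_check_encoding() requires the mbstring extension.
        if (!mb_check_encoding($line, 'UTF-8')) {
            $badLines[] = $lineNumber;
        }
    }
    fclose($handle);

    return $badLines;
}
```

An empty result only says the bytes are well-formed UTF-8; it doesn't prove the file was saved from the right source encoding.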

The CSV contains the URL to the original PDFs, and a few metadata fields: Title, Date and various item-related metadata. Yes, the CSV is encoded as UTF-8.

I have no idea what leads to this error message:

[CsvImport][#9] exception 'Zend_Db_Statement_Mysqli_Exception' with message 'Mysqli statement execute error : Incorrect string value: '\xB7'\x0A\x0AN\x0A...' for column 'text' at row 1' in.

Can you post or send the file that's causing the problems?

It's hard to tell what's happening here because of the way that MySQL truncates and encodes the string it's complaining about. The most likely problems here are either some encoding issue or maybe a rogue invalid character, but it's possible there's some other problem at work.

The UTF-8 support in MySQL's older versions doesn't actually support the full Unicode range, but the problematic characters tend to be either less-commonly-used characters from languages with a very large number of possible characters like Chinese or Japanese, or extremely uncommon languages.
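For what it's worth, the characters affected are the ones outside the Basic Multilingual Plane, which need four bytes in UTF-8, while MySQL's original utf8 character set stores at most three bytes per character. A small sketch for spotting them, with a made-up function name, could look like this:

```php
<?php
// Hypothetical helper: returns true if $text contains any code point above
// U+FFFF. Those take four bytes in UTF-8 and do not fit in MySQL's
// three-byte "utf8" character set.
function hasFourByteCharacters($text)
{
    // The /u modifier makes PCRE treat both pattern and subject as UTF-8.
    // If $text is not valid UTF-8, preg_match() returns false, so this
    // function returns false for invalid input as well.
    return preg_match('/[\x{10000}-\x{10FFFF}]/u', $text) === 1;
}
```

Note that this says nothing about invalid byte sequences; it only flags valid text that the three-byte character set cannot hold.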

It's not likely that this is the problem you're having, but being able to see the data that's making the importer choke would help diagnose things.

You can find a sample of the file here : https://drive.google.com/file/d/0B1rzfDFfMSlPNFkyMHJXaks1cVk/edit?usp=sharing

I appreciate you taking a look.

I disabled the PDF Text plugin and now the import is working. So it seems the issue is caused by that plugin, rather than by the CSV import itself.

I will try to run the PDF Text plugin afterwards and see if it will process the files.

Ah, I should have noticed that. It looks like the error is actually happening when the file is inserted and PDF Text is trying to add the search data it scraped from the PDF.

PDF Text just relies on the pdftotext command-line program. It's possible that the problem could be resolved by updating that package, as it must be outputting some invalid UTF-8 characters for some particular PDF.

I currently have 3.03 installed. Let me see if I can update the package. It's not too easy to come by for OS X.

Which brings me back to this (http://omeka.org/forums/topic/pdf-text-status-indicator): is there a way to tell whether PDFtoText is processing files or not, other than by the text appearing in the file metadata fields?

Actually, checking the log files, I get the same error, which now is being caused by the PDFtoText plugin. So on to fix that issue :)

Ok, long time overdue, but I got caught up in some other projects :)

Yes, the error was caused by PDFtoText, because the text extracted was not UTF-8.

To fix the issue, I simply modified plugins/PdfText/models/PdfTextProcess.php line 41 to add utf8_encode. The full call now looks like this:

$file->addTextForElement(
    $textElement,
    utf8_encode($pdfTextPlugin->pdfToText(FILES_DIR . '/original/' . $file->filename))
);

That way we make sure that the text will in any case be encoded correctly. Now it works correctly.

We talked about this a long time ago, but did you ever try adding -enc UTF-8 to the command line for pdftotext?

utf8_encode works here because your pdftotext is outputting Latin-1 text, but that's the only character set utf8_encode actually works on as input (despite the name).
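To make that concrete, here is a small sketch using the 0xB7 byte from your error message. It uses mb_convert_encoding() with an explicit ISO-8859-1 source, which performs the same Latin-1 to UTF-8 conversion as utf8_encode (this assumes mbstring is available; the wrapper function name is just for illustration):

```php
<?php
// Illustration only; latin1ToUtf8() is a hypothetical wrapper.
// It treats every input byte as a Latin-1 character, as utf8_encode does.
function latin1ToUtf8($bytes)
{
    return mb_convert_encoding($bytes, 'UTF-8', 'ISO-8859-1');
}

$latin1 = "\xB7";                 // middle dot in Latin-1; not valid UTF-8 by itself
$once   = latin1ToUtf8($latin1);  // correct UTF-8 for U+00B7
$twice  = latin1ToUtf8($once);    // converting already-UTF-8 text: double-encoded

echo bin2hex($once), "\n";   // c2b7
echo bin2hex($twice), "\n";  // c382c2b7
```

The second conversion is what happens when the input was already UTF-8: the text is still valid UTF-8 afterwards, so MySQL accepts it, but the middle dot turns into "Â·".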

Always adding -enc UTF-8 is, I think, the more likely change to end up in PdfText. Your solution works well in your case but could easily cause problems on other servers where pdftotext is already producing UTF-8 output.
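As a rough idea of what I mean, the command could be built along these lines. This is only a sketch: the function name is hypothetical, and the way PdfText actually assembles its pdftotext call may well differ. The -enc option and the "-" output argument (send the text to stdout) are standard pdftotext usage.

```php
<?php
// Hypothetical helper; PdfText's real command-building code may differ.
function buildPdfToTextCommand($pdfPath)
{
    // "-enc UTF-8" asks pdftotext for UTF-8 output regardless of its default.
    // The trailing "-" tells pdftotext to write the text to stdout.
    return 'pdftotext -enc UTF-8 ' . escapeshellarg($pdfPath) . ' -';
}

// Usage sketch, assuming pdftotext is on the PATH:
// $text = shell_exec(buildPdfToTextCommand('/path/to/file.pdf'));
```

Passing the encoding explicitly means the output no longer depends on the installed pdftotext's default encoding.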

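I understand your concern. Perhaps something along these lines would be better:

$extractedText = $pdfTextPlugin->pdfToText(FILES_DIR . '/original/' . $file->filename);

// Only convert when the extracted text is not already valid UTF-8.
if (!mb_detect_encoding($extractedText, 'UTF-8', true)) {
    // Assumes the non-UTF-8 output is Latin-1, the only input utf8_encode handles.
    $extractedText = utf8_encode($extractedText);
}

$file->addTextForElement($textElement, $extractedText);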

Let me also try adding it to the command line, as you had suggested.