My experience with CSV import on a shared hosting server shows an import speed of 30 items per minute.
I envy your super-fast import speed. :-)
When thinking about performance, look at the number of columns in the CSV file and at the number and size of files associated with each item.
How many columns do you have in the CSV file that correspond to an element-texts value?
Each one represents an insert of one row into the "element_texts" table, tied to the single row added to the "items" table for each item.
Twenty-five attributes per item lead to 26 SQL insert statements per CSV row: one for the "items" table and 25 for "element_texts".
Have some tags? Add an SQL insert into the "tags" table and an SQL insert into the "records_tags" table for each tag value related to the item.
Have a file, or files? Add an SQL insert into the "files" table for each file.
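Roughly, a single CSV row with two attribute values, one tag, and one file turns into a sequence like this (table and column names are approximations of Omeka's schema, for illustration only):

```sql
-- One CSV row: one item, two element texts, one tag, one file
INSERT INTO items (public, added) VALUES (1, NOW());
INSERT INTO element_texts (record_id, element_id, text) VALUES (101, 50, 'Chartres east portal');
INSERT INTO element_texts (record_id, element_id, text) VALUES (101, 41, 'Closeup of the east portal');
INSERT INTO tags (name) VALUES ('cathedral');
INSERT INTO records_tags (record_id, tag_id) VALUES (101, 7);
INSERT INTO files (item_id, filename) VALUES (101, 'abc123.jpg');
```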
How big and how many files are associated with each item?
Each file entry requires fetching the file by URL over HTTP and pulling it into a temp folder on the Omeka server.
How fast is the server at the URL that serves up the file?
How fast is the network between the source server and your Omeka server?
On the Omeka machine, after the file is fetched via URL, it is copied from the temp folder to the "original" folder within the Omeka files area.
ImageMagick is then run against the original file to create the derivative files.
Creating each derivative file requires a read of the original file, processing (which might be CPU bound), and a write of the new derivative file.
This is a lot of I/O at the file system level.
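To make the per-file cost concrete, here is a hypothetical sketch (Python calling ImageMagick's convert) of the kind of work involved; the paths, sizes, and derivative names are illustrative, not Omeka's actual code:

```python
import subprocess

# Illustrative original and derivative paths; Omeka's sizes are configurable.
ORIGINAL = "files/original/abc123.jpg"
DERIVATIVES = [
    ("files/fullsize/abc123.jpg", "800x800"),
    ("files/thumbnails/abc123.jpg", "200x200"),
]

for out_path, size in DERIVATIVES:
    # Each run reads the original (I/O), resizes (CPU), and writes a new file (I/O).
    subprocess.run(["convert", ORIGINAL, "-resize", size, out_path], check=True)
```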
I have not done any serious analysis of the performance of the process, but I would predict that the entire import is I/O bound at the file system. It is also bound by the single threading of the database insert operations, which must update indexes as each row is added and "parse and prepare" each similar SQL insert statement as if it were completely new. The wait for associated files to be pulled in by URL over HTTP is another step whose cost is fixed and relatively large for each file of each item.
Consider that an emulation of the needed work could be built to explore faster alternatives. The emulation would do the rough SQL and I/O operations with built-in values, just to measure the duration of each step. How much does the database bottleneck contribute to the duration? How much comes from importing files and creating derivatives? Both questions should be answered clearly before fixing anything. An emulation can also try out new techniques to see how durations might change before doing a fix, so you know how to do the fix and what to expect once it is done. A minimal sketch of such a timing harness follows.
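Here is one hypothetical shape for that harness, in Python; the step bodies are stubs (sleeps) that would be replaced with real SQL calls, an HTTP fetch, and ImageMagick runs:

```python
import time

def timed(label, fn):
    """Run one emulated step and print its duration in milliseconds."""
    start = time.perf_counter()
    fn()
    print(f"{label}: {(time.perf_counter() - start) * 1000:.1f} ms")

# Stub steps with built-in values; each would be replaced by the real
# operation being measured.
def insert_item_rows():
    time.sleep(0.05)   # stand-in for 26 single-row INSERTs

def fetch_file_by_url():
    time.sleep(0.20)   # stand-in for an HTTP GET of a representative file

def create_derivative_files():
    time.sleep(0.10)   # stand-in for the ImageMagick convert runs

for label, step in [("db inserts", insert_item_rows),
                    ("file fetch", fetch_file_by_url),
                    ("derivatives", create_derivative_files)]:
    timed(label, step)
```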
I doubt that any of the work done by Omeka code to parse the CSV file and to prep for the inserts and file processing adds much time to the total duration. Are you seeing any evidence that the process is CPU bound (long process-ready-to-run queues) or memory bound (a large number of paging events)?
Working with the huge dataset you have is going to lead you into looking hard at database bottlenecks and file system bottlenecks. Your use case may lead you to invent a whole new way to do batch input into Omeka that converts a CSV file into optimized SQL insert statements. Perhaps those SQL statements would even be run as native MySQL statements outside of Omeka. Having that new batch-import system exist would be a great thing.
It takes someone with your order of magnitude of items to be motivated to solve the problem. My 6,000 items make for a slow import, but not so slow that I am motivated to do the work to speed it up. (Although I would prefer to have a faster way.) Also, in my local case, the changes to item definitions will soon slow down and stop. The cost of the slow inserts will go away shortly, so in total it is much less than the cost of writing a new import system.
Your case has so many input items that it won't take many imports (maybe just one) to justify spending the time to write code to make the duration shorter.
This type of analysis of how to speed up the process probably already exists in a discussion somewhere focused on MySQL. It likely centers on comparing a sequence of SQL inserts into a table, one row at a time, against a single SQL insert statement containing multiple values, which inserts multiple rows with one call. The second style is likely to be much faster (more than 10x).
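For illustration, the two styles look like this (column names approximate Omeka's "element_texts" table):

```sql
-- Style 1: one row per statement (roughly what happens now)
INSERT INTO element_texts (record_id, element_id, text) VALUES (101, 50, 'Title A');
INSERT INTO element_texts (record_id, element_id, text) VALUES (101, 41, 'Desc A');
INSERT INTO element_texts (record_id, element_id, text) VALUES (102, 50, 'Title B');

-- Style 2: many rows per statement (likely much faster)
INSERT INTO element_texts (record_id, element_id, text) VALUES
  (101, 50, 'Title A'),
  (101, 41, 'Desc A'),
  (102, 50, 'Title B');
```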
An idea might be to process the entire CSV file (or groups of, say, 100 rows) to generate one SQL insert statement that inserts all the new rows into the "items" table and another that inserts all the new rows into the "element_texts" table. This is radically different from the current approach, which processes each row in the CSV file, inserts one new row into the "items" table, and then inserts multiple new rows into the "element_texts" table. The current approach is used because it can rely on a helper function to "add" a new item with all its attributes without caring about the low-level details, and without breaking when the database schema changes. It's a very sensible solution for normal use cases. A sketch of the batching idea follows.
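Here is a minimal sketch of that batching, in Python. The file name, the column-to-element-id mapping, and the starting item id are all assumptions; the matching "items" inserts are omitted; and real code should use parameterized statements rather than the naive quote-escaping shown here:

```python
import csv

ELEMENT_IDS = {"Title": 50, "Description": 41}  # CSV column -> element id (assumed)
BATCH_SIZE = 100

def batch_insert_sql(rows, first_item_id):
    """Turn a group of CSV rows into one multi-row INSERT for element_texts."""
    values = []
    for offset, row in enumerate(rows):
        item_id = first_item_id + offset  # real code must track actual item ids
        for column, element_id in ELEMENT_IDS.items():
            if row.get(column):
                text = row[column].replace("'", "''")  # naive escaping, sketch only
                values.append(f"({item_id}, {element_id}, '{text}')")
    return ("INSERT INTO element_texts (record_id, element_id, text) VALUES\n  "
            + ",\n  ".join(values) + ";")

with open("import.csv", newline="") as f:
    rows = list(csv.DictReader(f))

for start in range(0, len(rows), BATCH_SIZE):
    print(batch_insert_sql(rows[start:start + BATCH_SIZE], first_item_id=101 + start))
```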
This broad description is an over-simplification because tags and files must be handled too, but they can be handled in a similar way. The creation of derivative files could also be broken out from the step of registering items into Omeka.
If the creation of derivative files is found to be the bottleneck, then the batch import improvement might be to have the derivative files created outside of Omeka and be imported into the Omeka folder structure by way of FTP without having any work done on the server during the import.
The key idea for making a change useful for those with huge data sets is to have the import of related files be disconnected from the import of item attribute values.
Not to hijack this thread, but there is another issue that might exist with your use of CSV Import that I wonder about.
With that many items you must be using Omeka simply as a display system, while managing the definition of items in another product or products.
What is your system design for managing change to item definitions over time, and getting those changes to flow into Omeka for display?
In particular, how do you manage change while providing a permalink to the "show" item URL?
I've run into this with my archive, which uses an external system for managing item definitions. I find that using the CSV Import undo, followed by another CSV Import, works great for refreshing the database with the most current item definitions. However, it leads to each item getting a new internal database id key value.
This means the standard Omeka "show item" URL (../items/show/100) is not a permalink that can be indexed or bookmarked.
The solution to this is to use the CleanUrl plugin developed by Daniel Berthereau (Daniel-KM in the forum) to provide "show" URLs (../items/show/chartres-east-portal-closeup) that are true permalinks, so that refreshing the database with undo and import does not affect the "show" URLs.
This allows having a "show item" URL that uses an item identifier that never changes, even when the item is removed by CSV Import undo and added again by way of another CSV Import. The internal id key value can change without affecting the URL that shows the item.
Did you find any other way to provide permalinks to show items, as changed item definitions in external systems flow into Omeka?