CSV import background job speedup

Is it possible to increase the priority of the Omeka background jobs (probably the function sendLongRunning)? Is there a setting somewhere inside Omeka or in the OS? I'm using Omeka CSV Import and the imports take a long time to finish, so I'm trying to speed things up.
I just gave my server double the RAM and CPU (from 1GB/1CPU to 2GB/2CPU - it's a VM) and the imports are doing better, but are still quite slow (from 45 minutes for 5000 objects down to 23 minutes). I assume there may be some background-job settings to tune. Has anyone tuned the Omeka background jobs? I need this to scale up to about 1M objects, but I can't let the import take days to finish.

My experience with CSV import on a shared hosting server shows an import speed of 30 items per minute.

I envy your super-fast import speed. :-)

When thinking about performance, look at the number of columns in the CSV file, and the number and size of files associated with each item.

How many columns do you have in the CSV file that correspond to an element_texts value?
Each one represents an insert of one row into the "element_texts" table, in addition to the one row added to the "items" table for each item.

25 attributes per item means 26 SQL insert statements per CSV row: one for the item itself plus one for each attribute.

Have some tags? Add an SQL insert into the "tags" table and an SQL insert into the "record-tags" table for each tag value related to the item.
Have a file, or files? Add an SQL insert into the "files" table for each file.
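
To make that per-item traffic concrete, here is a rough sketch of the SQL generated for one CSV row. Treat it as illustrative only: the table and column names approximate Omeka's schema, and the element_id values are invented for the example.

    -- One row into "items" per CSV row (column list approximate; most values defaulted)
    INSERT INTO items (public, featured, added, modified)
    VALUES (1, 0, NOW(), NOW());
    SET @item_id = LAST_INSERT_ID();

    -- Then one row into "element_texts" per mapped CSV column (25 columns => 25 of these)
    INSERT INTO element_texts (record_id, record_type, element_id, html, text)
    VALUES (@item_id, 'Item', 50, 0, 'Title value from this CSV row');

    INSERT INTO element_texts (record_id, record_type, element_id, html, text)
    VALUES (@item_id, 'Item', 41, 0, 'Description value from this CSV row');
    -- ...and so on for each remaining column, plus inserts into the tags and
    -- files tables when the row carries tags or file URLs.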

How big and how many files are associated with each item?

Each file entry requires fetching the file from its URL over HTTP and pulling it into a temp folder on the Omeka server.
How fast is the server at the URL that serves up the file?
How fast is the network between the source server and your Omeka server?

On the Omeka machine, after it is fetched via URL, the file is copied from the temp folder to the "original" folder within the Omeka area.
ImageMagick is then run against the original file to create the derivative files.
Creating each derivative file requires a read of the original file, processing (which might be CPU bound), and write of the new derivative file.
This is a lot of I/O at the file system level.

I have not done any serious analysis of the performance of the process, but I would predict that the entire import process is I/O bound at the file system. It is also bound by the single threading of the database insert operations, which must update indexes as each row is added and "parse and prepare" each similar SQL insert statement as if it were completely new. The wait for associated files to be pulled in over HTTP is another step that is fixed and relatively large for each file of each item.

Consider that an emulation of the needed work could be built to explore faster alternatives. The emulation would do the rough SQL and I/O operations with built-in values, just to measure the duration of each step. How much does the database bottleneck contribute to the duration? How much do fetching files and creating derivative files contribute? Both questions should be answered clearly before fixing anything. An emulation can also try out new techniques to see how the durations change before a fix is attempted, so you know how to do the fix and what to expect once it is in place.

I doubt that the work done by Omeka code to parse the CSV file and to prep for the inserts and file processing adds much time to the total duration. Are you seeing any evidence that the process is CPU bound (long ready-to-run process queues) or memory bound (a large number of paging events)?

Working with the huge dataset you have is going to lead you into looking hard at database bottlenecks and file system bottlenecks. Your use case may lead you to invent a whole new way to do batch input into Omeka that converts a CSV file into optimized SQL insert statements, perhaps run as native MySQL statements outside of Omeka. It would be a great thing for such a batch-import system to exist.

It takes someone with your order of magnitude of items to be motivated to solve the problem. My 6,000 items make for a slow import, but not so slow that I am motivated to do the work to speed it up (although I would prefer a faster way). Also, in my case the changes to item definitions will soon slow down and stop, so the cost of the slow inserts will go away shortly; in total it is much less than the cost of writing a new import system.

Your case has so many input items that it won't take many imports (maybe just one) to justify spending the time to write code to make the duration shorter.

The kind of analysis needed to speed up the process probably already exists in a discussion somewhere focused on MySQL. It likely comes down to comparing a sequence of SQL inserts into a table, one row at a time, against a single SQL insert statement containing multiple values, which inserts many rows in one call. The second style is likely to be much faster (more than 10x).

An idea might be to process the entire CSV file (or groups of, say, 100 rows) to generate one SQL insert statement that adds all the new rows to the "items" table and another that adds all the new rows to the "element_texts" table. This is radically different from the current approach, which processes each row in the CSV file, inserts one new row into the "items" table, then inserts multiple new rows into the "element_texts" table. The current approach is used because it can rely on a helper function to "add" a new item with all its attributes without caring about the low-level details, and without being obsoleted when the database schema changes. It's a very sensible solution for normal use cases.
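
As a hedged sketch of what the batched style could look like (the literal ids and element_id values below are made up, and the column names only approximate Omeka's schema):

    -- One statement registers a whole batch of items...
    INSERT INTO items (public, featured, added, modified) VALUES
      (1, 0, NOW(), NOW()),
      (1, 0, NOW(), NOW()),
      (1, 0, NOW(), NOW());   -- one VALUES group per CSV row in the batch

    -- ...and one statement registers all of their attribute values.
    INSERT INTO element_texts (record_id, record_type, element_id, html, text) VALUES
      (101, 'Item', 50, 0, 'Title for CSV row 1'),
      (101, 'Item', 41, 0, 'Description for CSV row 1'),
      (102, 'Item', 50, 0, 'Title for CSV row 2'),
      (102, 'Item', 41, 0, 'Description for CSV row 2'),
      (103, 'Item', 50, 0, 'Title for CSV row 3'),
      (103, 'Item', 41, 0, 'Description for CSV row 3');

A real implementation would have to work out the new item ids to use in the element_texts statement; one assumption worth verifying is that MySQL assigns consecutive auto-increment ids to the rows of a single multi-row insert under the default lock mode, so LAST_INSERT_ID() plus a row offset could supply them.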

This broad description is an over-simplification because tags and files must be handled too, but in a similar way. The creation of derivative files could also be broken out from the step of registering items into Omeka.

If the creation of derivative files is found to be the bottleneck, then the batch import improvement might be to have the derivative files created outside of Omeka and be imported into the Omeka folder structure by way of FTP without having any work done on the server during the import.

The key idea for making a change useful for those with huge data sets is to decouple the import of related files from the import of item attribute values.

Not to hijack this thread, but there is another issue that might exist with your use of CSV Import that I wonder about.

With that many items you must be using Omeka simply as a display system, while managing the definition of items in another product or products.

What is your system design for managing change to item definitions over time, and getting those changes to flow into Omeka for display?

In particular, how do you manage change while providing a permalink to the "show" item URL?

I've run into this with my archive, which uses an external system for managing item definitions. I find that using the CSV Import undo, followed by another CSV Import, works great for refreshing the database with the most current item definitions. However, it leads to each item getting a new internal database key (id) value.

This means the standard Omeka "show item" URL (../items/show/100) is not a permalink that can be indexed or bookmarked.

The solution to this is to use the CleanUrl plugin developed by Daniel Berthereau (Daniel-KM in the forum) to provide "show" URLs (../items/show/chartres-east-portal-closeup) that are true permalinks, so that refreshing the database with undo and import does not affect the "show" URLs.

https://github.com/Daniel-KM/CleanUrl

This allows having a "show item" URL that uses an item identifier that never changes, even when the item is removed by CSV Import undo and added again by way of another CSV Import. The internal id key value can change and it does not affect the URL to show an item.

Did you find any other way to provide permalinks to show items, as changed item definitions in external systems flow into Omeka?

Thank you Bob for such an extensive response!

I've come to the conclusion that the import speed is fast enough, even though I would have liked it to be somewhat faster. When monitoring the system load of the virtual machine in VMware, I can see that the bottleneck is disk I/O, i.e. database writes, because I don't import any files/images this way.

I was actually looking for an Omeka or Ubuntu/PHP/Apache setting for the background-job priority, because no matter how many import jobs I start, the system still responds normally from the browser. I assume the priority must be set quite low so it doesn't affect the web requests, which must have a higher priority. I haven't found this setting yet. Anyone else?

I really like the structure of CSV Import using native Omeka inserts (saves), and can only imagine the consistency problems I would have in future releases of Omeka if I create my own SQL batch inserts. I do agree with you that this kind of import would probably be a lot faster, though.

We actually will use Omeka for collection management, although we are modifying it to suit our needs. So far I have tested it with 50,000 objects and it seems fast and responsive (!) up to this number (on a server with 1GB RAM and 1 virtual CPU), so I haven't had any problem with persistent URLs. I do expect there will be some database index tweaking in the future, but so far vanilla Omeka works just fine!!! I haven't even adjusted the default MySQL memory sizes, so I will definitely be able to have the whole database in memory if I add a few GB of RAM.

Again, thank you for your answer!

Daniel, your post caught my attention because I thought you had a problem similar to mine, but with a stronger case for building a solution. I enjoyed sharing my thoughts on the solution to document it somewhere, just in case it might be built. Plus it makes for an interesting forum thread of the type I'd like to read more of: strategic, broad discussion about system design issues instead of code-syntax issues.

I understand now that your imports are a one-time operation for each item. The total number of items will be huge in the end, but that total will accumulate over time, with each campaign to import a chunk of items going slowly, but fast enough.

That's great. You don't need to solve the import-is-too-slow problem.

I was hoping you would need to, but only if funded and with a strong requirement to succeed. I'd like to see that solution exist.

There isn't any (database) consistency problem in doing database inserts natively versus using the helper function that provides a high-level "add item" operation. Doing native inserts does introduce a dependency on the current schema and how it encodes items, item attributes, and related objects such as files and tags. But that same dependency permeates all of Omeka, as the current structure of the schema is known by helper functions all over. The current schema for encoding items and attributes is so critical to the system that inertia will fight against any change to it.

The biggest force in that inertia is that the current schema model works well, so why change it? Also, the schema dependency would be contained within a new helper function, which could be changed as easily as any other Omeka helper function if the schema did change.

I say all this for the record, in case another development group does have the import-is-too-slow problem and finds this thread while looking for earlier discussions about it.

You say you are seeing an I/O-bound system when the CSV Import processes are running, yet foreground operations still respond at normal speed. That is a very strange report: on a system that is genuinely I/O bound, you would not expect a foreground operation to run at normal speed.

It seems like a foreground process to browse items would be competing for the I/O to do the database fetches to support the browse operation. The fetch request should arrive at the database and get in line, with a few insert operations ahead of it, and wait its turn.

There should at least be a slight hesitation in a foreground operation that requires many database fetches.

Try searching for items having a given value of an element_texts field as your test.
Run it on an idle system, then run it again while a CSV Import is running.
Does that show a slower response during the CSV Import activity?
If that shows no difference in response, that is an unexpected result. How can the system be I/O bound?

Your model is that the foreground process is not slowed down because it cuts in line ahead of the background processes that are waiting for a time slice on the CPU. If the background processes could be given a higher priority, they'd run faster because the foreground process would no longer cut in front; and since the foreground process could not cut in front, it would run slower, and you'd see that slowdown.

Consider this - since you are seeing an I/O bound system, those background processes are not likely to be in a ready-to-run state waiting for a time slice. It is more likely that they are waiting for I/O to finish.

If it is the case that they are waiting for I/O to finish, then giving them a higher priority will not change anything. That higher priority would never come into play because the processes are not waiting for a CPU time slice.

If the system were CPU bound (where there is a long queue of processes waiting their turn on the CPU), then adjusting process priorities would change visible performance and digging more to find out how to set that on the background processes would be a good plan.

This page is a summary of the tools available for measuring system performance and for how to change priorities of a process. It references other sources for more detailed information.

http://www.circle4.com/jaqui/papers/webunuk.html

Before changing priorities, do the performance analysis with tools like time and sar (and vmstat if you have it) to get a baseline measurement that tells you which resource is the bottleneck. If the system is CPU bound, then nice and setpri can be used to fiddle with priorities while reviewing the changes in the measurements reported by time and sar.

This sort of work requires repeated and careful experimenting, looking for consistent and expected results, with a need to find an explanation for every unexpected result before continuing. Performance tuning is always about measuring before changing, understanding what change the measurements suggest, making the change, then measuring again and explaining any unexpected results. It's hard-science-style experimenting.

As you increase the row counts in the Omeka tables, you might run into some slowdowns, but not for the standard operations. MySQL can handle tables with millions of rows and still do key lookups and joins quickly. A million items with 25 element_texts values each means 25 million rows in the element_texts table. That's a lot of rows, but not outside what MySQL can handle if access follows disciplined paths through the data. Normal Omeka operations such as browse and show use disciplined access and will get good response.

Where you will find disaster hiding is with ad-hoc searches.

A visitor doing an arbitrary keyword search will use up a lot of resources.

That is where you should be doing some testing now. If it is slow, you might want to remove the keyword search feature or make it less convenient to access. Provide domain-specific lookup pages that give visitors an easy way to find items, implemented with disciplined access methods. Create a flow through the site that does not lead visitors to ad-hoc keyword queries as the easiest operation to use.

Your use of Omeka is full of interesting twists that make me curious. You say you don't bring in images for Omeka items using CSV Import. How do you manage the images associated with an item? It seems like you might not be using Omeka "files". Are you using a special item element to encode where the image is?

Can you say something about what your approach was in changing Omeka to support your needs for managing the collection?

Thanks for your question and continued conversation.

Importing up to 100K images would take too much time to download and reformat, so that will be done by hand with direct SQL. The smaller imports will be done the normal way.

The changes we are making are mostly about search, but also browsing. When we have a lot of objects, the browsing needs to be tweaked by changing which columns are shown and adding more information. Search is a key factor for the staff when you have a lot of items.

The InnoDB engine needs to be replaced by MyISAM, at least for the element_texts table, which needs FULLTEXT indexes. I haven't done a complete study of the consequences of switching engines yet, but I don't see any alternative as we really need FULLTEXT. I saw the issue where they switched from MyISAM to InnoDB for consistency and transactional reasons, so I'm a little worried that the Omeka team is relying more and more on the transactional support within the DB. I really don't see how a standard B-tree index will be sufficient for other Omeka users; when their number of objects, and especially the number of object elements, increases, they will need to switch back too, is my guess. When relying on the search functionality you also need to let users save their searches, which vanilla Omeka isn't capable of doing yet. We're also lacking support for this kind of search:
Give me all objects which are made of plastic AND (wood OR metal).
The advanced search isn’t capable of doing the OR on elements (or we just haven’t figured out how to do it).
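
Concretely, what I have in mind is something along these lines (the index name is just a placeholder, and in practice we would probably also restrict the match to the element_id of the relevant "material" field):

    -- Switch element_texts to MyISAM and add a FULLTEXT index on the text column
    ALTER TABLE element_texts ENGINE = MyISAM;
    ALTER TABLE element_texts ADD FULLTEXT INDEX ft_element_text (text);

    -- "made of plastic AND (wood OR metal)" as a boolean-mode FULLTEXT search
    SELECT DISTINCT record_id
    FROM element_texts
    WHERE record_type = 'Item'
      AND MATCH (text) AGAINST ('+plastic +(wood metal)' IN BOOLEAN MODE);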

Those are the changes related to the big-data problem that we are working on right now. Other enhancements, aimed at user productivity, are "copy object" and "autofill a new object". There will certainly be more tweaking and changes ahead related to all this, but I'm luckily unaware of them today. =)

I am still confused on how the images are handled in your design.

Having many images is a problem I can understand. It leads to a system design for managing the images that is based on preparing the images and loading them to the server without using Omeka's import system.

For my project I almost went that way. An early plan was to create all the needed image files and the derivative sizes and move them to the server in a structured way, then access them with custom code in the theme views, based on the image files having a naming convention tied to the "identity" metadata value of an item. This would avoid having the CSV Import function pulling in an image from a web site and making the derivative versions. It would also avoid using the "files" area of Omeka.

At the time, I thought that Omeka's file management was valuable so I did not follow that plan. Instead, I accepted the Omeka way of handling images as part of the import process.

As I've gained experience and knowledge, I am not sure that using Omeka's file management is better when the only file type is images. It's clearly better when the associated files are of mixed types, particularly when they are not JPEG images. But when an item has only a JPEG image associated with it, there is little value in registering and managing that image with Omeka's file management.

In your system, when you make your images by hand and direct SQL, how do you associate them with an item already registered within Omeka? By "direct SQL", does that mean you create "files" records directly?

Your comments about the need for FULLTEXT searching and the differences between the MyISAM and InnoDB engines are very interesting. I had not noticed those details about MySQL before. Only a MyISAM table can have FULLTEXT indexes. One reason to use Omeka is to avoid thinking about search - that is a feature that comes free with the product. In your situation, with the large volume of items, search will become the bottleneck, so you must look at it closely in case the Omeka solution is not good enough.

Have you noticed the "search_texts" table in the database?

Each row in that table is associated with an item (or collection, or file) and holds a copy of all the "element_texts" values for that record, concatenated into a single composite field.

I found it when learning how "keyword" searches are implemented. A keyword search does a FULLTEXT search over the composite field in the "search_texts" table.

From what you said, I looked closer just now and see that the "search_texts" table uses the MyISAM engine instead of InnoDB. That is why it can be used for FULLTEXT searches.

It must have been designed to provide a duplicate set of "element_texts" values to support FULLTEXT search, while the original InnoDB "element_texts" table stores the base values and supports searching on the value of a specific metadata field. By doing this, FULLTEXT search is enabled over "element_texts" values while the "element_texts" table itself keeps using the InnoDB engine.
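
As a rough illustration, a keyword search boils down to something like the following against search_texts. The column names are from memory and the query Omeka actually builds is more involved, so take this as a sketch rather than the real implementation:

    -- FULLTEXT keyword search over the composite field in the MyISAM search_texts table
    SELECT record_id, record_type
    FROM search_texts
    WHERE public = 1
      AND MATCH (text) AGAINST ('chartres portal' IN BOOLEAN MODE);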

For my project, the duration of a "keyword" search is much longer (about 10 times longer) than a normal (advanced) search. The keyword search was provided as a convenience in the header of every page, following the standard Omeka theme convention. When the long search durations showed up during testing, the keyword search was removed from the headers; the less convenient but faster normal search was put there instead, offering some convenience and very fast response. The keyword search was hidden away on the search-criteria page, so it can still be done if needed but is not offered as a standard operation.

I don't know how well MySQL scales up for FULLTEXT searches on a field as the number of rows goes up. Having more words in each row's field value adds to the work. Having more rows in the database adds to the work. That is a database issue, not specifically an Omeka issue. From what I've seen so far, I would worry that FULLTEXT searching does not scale up well.

Have you done any performance testing with FULLTEXT searches?

We haven't done the import of the images yet, but it will be simply file copies and SQL inserts.

I was confronted with some performance problems, and when I changed the engine and added the FULLTEXT index the problems disappeared. I haven't done any more performance tests since the problem is gone. I'm very fond of "if it's not broken, don't fix it"!