Using Omeka in multi-byte language (Japanese) · Legacy Forums

homma November 30, 2011

Hello, thank you very much for the great software.
I'm completely new to Omeka and am trying to use it in Japanese.

Installation went fine, but there seem to be some problems handling multi-byte (Japanese) characters.

1. I cannot input or register multi-byte text via the admin interface (alphabetic characters are OK).
(When I input text directly using phpMyAdmin, Omeka displays it correctly.)

2. I cannot search for multi-byte text.
Neither "Simple search" nor "Advanced search" works (both return 0 results).
Searching with alphabetic keywords seems to work fine.

Is there any way to solve this?

John Flatness November 30, 2011

Omeka stores and displays all its data in UTF-8, which should be able to handle Japanese characters just fine.

When you say you can't input Japanese characters, what actually happens? Does Omeka just ignore those characters?

Can you post a short example of the kind of text that's not working for you?

homma December 1, 2011

Thank you for your reply. For reference, here's our test installation of Omeka.

http://findingaids.art-c.keio.ac.jp/fa/

I imported some data with the CSV Import plugin, and all the text is stored and displayed correctly.

After checking the PHP configuration (mbstring / iconv parameters), the input problem is solved!
But this does not solve the search problem...

For example, there's a record which has a title "世界ノンフィクション全集　24".
http://findingaids.art-c.keio.ac.jp/fa/items/show/3

In Simple search (both on the public site and in the admin interface), keywords like "世界ノンフィクション全集" or "ノンフィクション" return no hits.
Only the full title "世界ノンフィクション全集　24" returns hits.

In Advanced search, I tried the "Narrow by Specific Fields" form and set the DC field to "title".
In this case, any keyword like "世界", "ノンフィクション" or "世界ノンフィクション" returns hits.

John Flatness December 1, 2011

The "simple" full-text search is based on MySQL's full-text search. This means it only searches for whole words, not parts of a word.

The unfortunate part that's causing your problem here is that your record's title "世界ノンフィクション全集　24" only uses a CJK-style ideographic space to separate the Japanese characters and the "24." MySQL doesn't see the ideographic space as a word break, so it sees the whole title as one long word. You'll have to use the smaller ASCII space to get the title to be understood as two different "words."

With the slightly modified title "世界ノンフィクション全集 24" (note the smaller space) you should be able to search for "世界ノンフィクション全集" and get the correct result.
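
As a rough illustration of the fix described above, the Python sketch below replaces ideographic spaces (U+3000) with ASCII spaces before the text is stored. The helper names are hypothetical, and `ascii_space_words` is only a simplified stand-in for a parser that does not treat U+3000 as a word break; it is not MySQL's actual full-text parser.

```python
IDEOGRAPHIC_SPACE = "\u3000"  # full-width CJK space


def normalize_spaces(text):
    """Replace ideographic spaces with ordinary ASCII spaces."""
    return text.replace(IDEOGRAPHIC_SPACE, " ")


def ascii_space_words(text):
    """Split on ASCII spaces only (simplified model, not MySQL's parser)."""
    return [word for word in text.split(" ") if word]


original = "世界ノンフィクション全集\u300024"
fixed = normalize_spaces(original)

# Before the fix the whole title is seen as a single "word".
print(len(ascii_space_words(original)))  # 1
# After the fix the title splits into two words.
print(ascii_space_words(fixed))  # ['世界ノンフィクション全集', '24']
```

Running something like this over imported titles before loading them would make "世界ノンフィクション全集" searchable as a whole word, though partial keywords such as "ノンフィクション" would still not match in a whole-word search.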

Alternatively, as you've seen, the narrow by specific fields "contains" search doesn't have the same restrictions. It doesn't try to detect word boundaries; it just naively matches your search input against the items.
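
A "contains" search of this kind can be sketched with a plain SQL `LIKE` pattern. The example below uses Python's built-in `sqlite3` module as a stand-in for MySQL, and the `items` table and `title` column are made up for illustration; they are not Omeka's real schema or its actual query.

```python
import sqlite3

# In-memory database with a hypothetical one-column "items" table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, title TEXT)")
conn.execute(
    "INSERT INTO items (id, title) VALUES (?, ?)",
    (3, "世界ノンフィクション全集\u300024"),
)


def contains_search(keyword):
    """Return ids of items whose title contains the keyword anywhere.

    Note: '%' and '_' inside the keyword act as LIKE wildcards here;
    a real implementation would need to escape them.
    """
    pattern = "%" + keyword + "%"
    rows = conn.execute(
        "SELECT id FROM items WHERE title LIKE ? ORDER BY id", (pattern,)
    ).fetchall()
    return [row[0] for row in rows]


for keyword in ("世界", "ノンフィクション", "世界ノンフィクション", "歴史"):
    print(keyword, contains_search(keyword))
```

Because the pattern starts with a wildcard, the database generally has to scan every row rather than use an index, which is why this style of search gets slow on large collections but can be acceptable for small ones.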

homma December 2, 2011

Hello, thank you very much for your answer.

I now understand how MySQL full-text search behaves.
Perhaps I have to find a way to store tokenized Japanese text (in addition to the normal data) in MySQL. Unfortunately I don't have any knowledge of MySQL or PHP, but I will try to find someone who can look into this. If I find a good solution, I will report it here (I guess it will take a long time...)
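
One common way to produce "tokenized" text for languages without spaces between words is to split it into overlapping character bigrams and store those, separated by ASCII spaces, in an extra column that the full-text index covers. The sketch below shows only the tokenizing step; it is an assumption about how this could be done, not something Omeka provides, and the function names are hypothetical. A dictionary-based morphological analyzer would be an alternative to bigrams.

```python
def char_ngrams(text, n=2):
    """Return overlapping character n-grams for each whitespace-separated chunk.

    str.split() with no arguments also splits on the ideographic space U+3000.
    Chunks of n characters or fewer are kept whole.
    """
    tokens = []
    for chunk in text.split():
        if len(chunk) <= n:
            tokens.append(chunk)
        else:
            tokens.extend(chunk[i:i + n] for i in range(len(chunk) - n + 1))
    return tokens


def to_index_text(text, n=2):
    """Join the n-grams with ASCII spaces, ready to store in a search column."""
    return " ".join(char_ngrams(text, n))


title = "世界ノンフィクション全集\u300024"
print(to_index_text(title))

# A query is tokenized the same way; every one of its bigrams appears
# among the bigrams of the stored title.
query_tokens = char_ngrams("ノンフィクション")
print(set(query_tokens) <= set(char_ngrams(title)))
```

One caveat: MySQL's built-in full-text parser ignores words shorter than a configured minimum length (`ft_min_word_len`, 4 by default for MyISAM tables), so two-character tokens would only be indexed if that setting were lowered and the index rebuilt.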

But for now, is it possible to do "simple" full-text search in the same way as advanced search, by just matching search inputs against the items?
(I guess this method would not be recommended because it takes too much time searching, but I think it's worth trying because we don't have so many records.)