locating files transcribed in mediawiki · Legacy Forums

jderidder December 14, 2011

Hi -- we're testing the mediawiki plugin with omeka. I need to be able to extract the transcribed text from the mediawiki database since the export by default combines all pages of an item (we want to use the transcriptions to enable search and retrieval at page level). I have not been able to find the table entry in either database to match up the omeka_files id with the mediawiki page_id. Can you tell me where that matchup occurs??

Thank you!!

Jim Safley December 14, 2011

You can export page-level transcriptions by using the "Export page" button on the transcribe page. (You must be logged into Scripto as a MediaWiki admin, but I assume you are since you can export documents.) Once exported, you can find the individual page transcription on Omeka's file show page, under Scripto:Transcription. Unfortunately I don't think there's a way to search file data in Omeka. I've opened an issue to see if this is possible in the future.

jderidder December 14, 2011

I need to export in bulk, not one by one, and I need the transcriptions by page, not by intellectual item. Can you tell me where in the database I can find the fields to match up which will identify the original image I loaded into Omeka? I need to match up the transcription to the image.

Jim Safley December 14, 2011

To export in bulk, you'll need to write your own export script using the Scripto library. The transcriptions are not readily accessible in the MediaWiki database for two reasons:

The MediaWiki database does not lend itself to direct interaction; the MediaWiki developers prefer that you use their API;
Scripto obfuscates the page names using Base64 encoding (see below).

Scripto maps an Omeka item and file ID to a MediaWiki title using Scripto_Document::encodeBaseTitle(). The DocBlock there should clarify things.

jderidder December 14, 2011

Thank you so much!!
For others trying to make sense of it: the mediawiki "page_title" value in the page table is composed of:
a period, followed by the base64 encoded item_id, then another period, followed by the base64 encoded page_id. Once decoded, the item_id and page_id match the item and file IDs found in the omeka database tables.
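
For anyone who wants to script the matchup, here is a minimal Python sketch of the layout described above. It is not Scripto's own code: the function names are made up for illustration, and it assumes the encoded parts may use the URL-safe Base64 alphabet with the trailing "=" padding stripped (the decoder also accepts standard, padded Base64). Check Scripto_Document::encodeBaseTitle() in your installed version before relying on it.

```python
import base64
from typing import Tuple


def _b64_decode(part: str) -> str:
    """Decode one Base64 segment of a page title.

    Assumption: the segment may use the URL-safe alphabet ('-' and '_')
    and may have had its '=' padding removed, so both are normalized here.
    """
    part = part.replace("-", "+").replace("_", "/")
    part += "=" * (-len(part) % 4)  # restore any stripped padding
    return base64.b64decode(part).decode("utf-8")


def decode_page_title(page_title: str) -> Tuple[str, str]:
    """Split a MediaWiki page_title into (item_id, file_id) strings.

    Layout taken from the thread: '.' + base64(item_id) + '.' + base64(file_id).
    """
    if not page_title.startswith("."):
        raise ValueError("title does not start with a period: %r" % page_title)
    parts = page_title[1:].split(".")
    if len(parts) != 2:
        raise ValueError("expected two encoded parts in %r" % page_title)
    return _b64_decode(parts[0]), _b64_decode(parts[1])


def encode_page_title(item_id, file_id) -> str:
    """Hypothetical inverse of decode_page_title, useful for spot checks."""
    def enc(value) -> str:
        raw = str(value).encode("utf-8")
        return base64.urlsafe_b64encode(raw).decode("ascii").rstrip("=")
    return "." + enc(item_id) + "." + enc(file_id)
```

With this sketch, decode_page_title(".MTI.MzQ1") returns ("12", "345"), meaning item 12 and file 345 in the omeka tables. Running it over every page_title pulled from the mediawiki page table gives the page-to-image matchup in bulk.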

Thank you again!