Extract Text

By Omeka Team

Extract text from files to make them searchable and machine readable.

Once installed and active, this module has the following features:

  • The module adds an "extracted text" property, which it uses to store extracted text on media and items.
  • When a media is added, the module automatically extracts text from the file and stores it on the media.
  • When an item is added or edited, the module automatically aggregates the text of its media (in order) and stores it on the item.
  • When editing an item or batch editing items, the user can choose to refresh or clear the extracted text.
  • The user can view the module configuration page to see which extractors are available on their system.
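
The aggregation step described above can be pictured with a minimal sketch. The function name aggregateMediaText() and the newline separator are assumptions made for illustration; they are not taken from the module's source, which may join or filter text differently.

```php
<?php
// Hypothetical helper, not the module's own code: join per-media text in media order.
function aggregateMediaText(array $mediaTexts, string $separator = PHP_EOL): string
{
    // Drop media that have no extracted text (null or whitespace only),
    // keeping the remaining entries in their original order.
    $nonEmpty = array_filter(
        $mediaTexts,
        fn ($text) => is_string($text) && trim($text) !== ''
    );
    // Trim each piece and join them with the (assumed) separator.
    return implode($separator, array_map('trim', $nonEmpty));
}

// Example: the second media has no text, so it is skipped.
echo aggregateMediaText(['Page one.', null, " Page two.\n"]), PHP_EOL;
```

Skipping empty entries keeps the item text free of blank runs when some media (images, for example) yield no text.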

Supported file formats:

  • DOC (application/msword)
  • DOCX (application/vnd.openxmlformats-officedocument.wordprocessingml.document)
  • HTML (text/html)
  • ODT (application/vnd.oasis.opendocument.text)
  • PDF (application/pdf)
  • RTF (application/rtf)
  • TXT (text/plain)

Note that some file extensions or media types may be disallowed in your global settings.

Extractors:

  • DOC and RTF files: requires catdoc.
  • DOCX files: requires docx2txt.
  • TXT files: no requirements.
  • HTML files: requires lynx.
  • ODT files: requires odt2txt.
  • PDF files: requires pdftotext, a part of the poppler-utils package.
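
The module's configuration page already reports which extractors are available. As a rough command-line equivalent, the sketch below checks whether the tools named above are on the PATH. It assumes a POSIX shell and that PHP's shell_exec() is enabled; commandExists() is a hypothetical helper, not part of the module.

```php
<?php
// Hypothetical helper: true when $command resolves on the current PATH.
function commandExists(string $command): bool
{
    // `command -v` prints the resolved path when the command is found (POSIX shells).
    $output = shell_exec(sprintf('command -v %s 2>/dev/null', escapeshellarg($command)));
    return is_string($output) && trim($output) !== '';
}

// Command names taken from the requirements listed above.
foreach (['catdoc', 'docx2txt', 'lynx', 'odt2txt', 'pdftotext'] as $command) {
    printf("%-10s %s\n", $command, commandExists($command) ? 'available' : 'missing');
}
```

Run it as the same system user as the web server, since that user's PATH is the one that matters for extraction.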

Disabling text extraction

You can disable text extraction for a specific media type by setting the media type alias to null in the "extract_text_extractors" service config in your local configuration file (config/local.config.php). For example, if you want to disable extraction for TXT (text/plain) files, add the following:

'extract_text_extractors' => [
    'aliases' => [
        // A null alias disables extraction for this media type.
        'text/plain' => null,
    ],
],
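
For context, here is a sketch of how the setting might sit inside config/local.config.php. It assumes the file returns a PHP array and that "extract_text_extractors" is a top-level key. The second entry (text/html) is added only to illustrate disabling more than one media type.

```php
<?php
// config/local.config.php (sketch; your existing settings stay in place)
return [
    // ... existing settings ...
    'extract_text_extractors' => [
        'aliases' => [
            'text/plain' => null, // disable extraction for TXT files
            'text/html' => null,  // illustrative: also disable HTML files
        ],
    ],
];
```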


ExtractText is Copyright © 2019-present Corporation for Digital Scholarship, Vienna, Virginia, USA http://digitalscholar.org

The Corporation for Digital Scholarship distributes the Omeka source code under the GNU General Public License, version 3 (GPLv3). The full text of this license is given in the license file.

The Omeka name is a registered trademark of the Corporation for Digital Scholarship.

Third-party copyright in this distribution is noted where applicable.

All rights not expressly granted are reserved.

Version  Released            Minimum Omeka version
1.3.0    December 13, 2022   ^4.0.0
1.2.1    April 23, 2021      ^3.0.0
1.2.0    October 08, 2020    ^3.0.0
1.1.1    August 29, 2019     ^1.4.0 || ^2.0.0
1.1.0    August 08, 2019     ^1.4.0 || ^2.0.0
1.0.0    January 31, 2019    ^1.4.0