The Ngram Plugin allows you to generate ngrams using items in your Omeka install. A corpus is generated by querying the content of a selected text element field. Corpora may then be investigated via ngram graphs, counts, and frequencies.
For additional information on ngrams, please see:
- Benjamin M. Schmidt, “Words Alone: Dismantling Topic Models in the Humanities,” Journal of Digital Humanities 2, no. 1 (winter 2012).
- Dan Cohen, “A Conversation with Data: Prospecting Victorian Words and Ideas,” dancohen.org, May 30, 2012.
Please note that your results will be more meaningful when you are working with clean data. Before you begin, ensure that the formatting of your data fields (in particular, those that include numeric sequences like dates) is consistent.
System Requirements: The Ngram plugin requires the IntlBreakIterator class, which is provided by PHP's intl extension. Make sure your PHP path is properly configured.
Installation and Configuration
Install the plugin.
Once you’ve installed it on the server, navigate to the Plugins menu in your Omeka site and click the install button.
On installation, the plugin automatically creates an Ngram button in the Admin navigation and can then be configured from the Plugins page.
Configure the plugin.
There are two configuration features for the Ngram plugin:
Text Element: a dropdown menu from which you may select one text element to create an ngram corpus.
To produce ngrams, the plugin must be directed to a particular text element. For best results, choose a Text Element by reviewing items within your collection and identifying a text field that is meaningful across multiple items.
Text Elements are listed in a dropdown menu. If you have created unique metadata categories for your collection, these will also be available for selection in the dropdown. You must select a single element. This means you cannot produce ngrams from multiple text fields.
For example, a user might choose to examine a collection of items with useful text in the Description field. Configuring the Text Element to the Description field directs the plugin to create a corpus that includes all items with text in that element. Items that do not have content in this text element will be ignored.
Note: You may select different text elements for different corpora. However, do not modify the Text Element setting while you are validating items and generating ngrams for a corpus. Doing so will break that process. Only change this setting once you have generated all ngrams for a corpus.
Reset processes: a checkbox that will reset any ongoing processes that are hanging or showing errors. Note: be sure to click to save changes.
Add a Corpus
Create a Corpus
A corpus is drawn from the items in your collection that have content in a particular text element (selected on the plugin configuration page). It is further defined by the Search Query and Sequence settings on the Add a Corpus page, which together produce an Item Pool. The Item Pool is then refined by validating the items.
To create a corpus and start viewing ngrams, go to the Ngram tab on the left hand navigation of your Omeka admin dashboard. On the Browse Corpora page, click the green Add a Corpus button.
On the Add a Corpus page, complete the following options:
Name: A field in which you must give the corpus a name. Ideally, choose something that meaningfully describes the corpus, as corpora have no separate description field.
Public: A check box. Click the checkbox to make the corpus visible to public users (on the public side of the site).
Search Query: A field in which you refine the contents of your corpus by entering a search query. The best way to get this search query is to perform an advanced search of the items in your collection on the Admin side of your Omeka site. Then copy the portion of the results URL that follows admin/items/browse? and paste it into this field.
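As a minimal sketch of which part of the URL is meant, the following Python snippet splits a results URL and keeps only what follows admin/items/browse?. The example URL, its host, and its parameter values are hypothetical; your own advanced search will produce different parameters.

```python
from urllib.parse import urlsplit, parse_qsl


def extract_search_query(results_url: str) -> str:
    """Return the part of a browse URL that follows 'admin/items/browse?'."""
    parts = urlsplit(results_url)
    if not parts.path.rstrip("/").endswith("admin/items/browse"):
        raise ValueError("not an admin/items/browse URL")
    return parts.query


# Hypothetical URL copied from an advanced search results page.
url = "https://example.org/admin/items/browse?search=&collection=3&type=1"
query = extract_search_query(url)
print(query)  # the text to paste into the Search Query field
print(parse_qsl(query, keep_blank_values=True))  # the individual parameters
```

Only the query string is pasted into the field; the scheme, host, and path are left out.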
Sequence:
- Sequence Element: select an element that contains numeric or date input. Items without the selected element filled in (for instance, an item without a Date) will not be included in the corpus. For best results, ensure consistency of metadata and select a meaningful field.
- Sequence Type: choose from Date by Year, Date by Month, Date by Day, or Numeric.
- Sequence Range: the field will prompt you with the proper format for the sequence if you choose a Date type. If numeric, make sure the format matches the numeric sequence of the elements you're drawing from.
Note: Dates should be entered in year, month, day order without separators, and as a range (for instance, 200101-201601 for a range of months).
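The range format can be illustrated with a short Python sketch. The digit counts per sequence type (4 for year, 6 for month, 8 for day) are an assumption based on the example above and on the yyyymm format mentioned for month sequences; the function and its names are hypothetical and are not part of the plugin.

```python
import re
from typing import Tuple

# Assumed digit counts for each date sequence type: YYYY, YYYYMM, YYYYMMDD.
DIGITS = {"year": 4, "month": 6, "day": 8}


def parse_range(text: str, sequence_type: str) -> Tuple[int, int]:
    """Parse a 'start-end' range whose parts match the sequence type."""
    n = DIGITS[sequence_type]
    match = re.fullmatch(rf"(\d{{{n}}})-(\d{{{n}}})", text.strip())
    if match is None:
        raise ValueError(f"expected two {n}-digit values separated by '-'")
    start, end = int(match.group(1)), int(match.group(2))
    if start > end:
        raise ValueError("range start must not be after range end")
    return start, end


print(parse_range("200101-201601", "month"))
```

A range such as 2001-2016 would be rejected for a month sequence here because each side has four digits rather than six.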
Note: You do not have to have a sequence, but without one you cannot generate graphs.
Note the Text Element box under the green Add Corpus button on the Add Corpus page. The Text Element was configured in the plugin panel.
When you have completed adding your corpus, click the green Add Corpus button.
Manage your Corpus
After you have added a corpus, the screen will update with information and options for that corpus.
On the left, the elements that were input on the Add Corpus screen are listed:
- Public
- Search Query
- Browse search results
- Sequence Element
- Sequence Range
Note: Clicking Browse search results opens a Browse Items page listing all the items that match your search query.
On the right, buttons allow you to Edit and Delete the corpus and to Validate Items. After you have validated items, buttons here allow you to generate unigrams, bigrams, and trigrams, and to view the corpus. Below, a small pane indicates the Text Element for the corpus. At the bottom, an Item Counts pane shows the pool of items from which this corpus may be derived.
After the Corpus has been created you must validate items before you can generate ngrams and view frequencies. To do so, click the green Validate Items button on the right hand side (just below the Delete button).
This will take you to a new screen with three tabs: valid items, invalid items, and out of range items.
Valid items are those items with sequence text that is readable to the plugin (See Figure 1). The table on this tab gives:
- the item number (a link to the item),
- the text in the sequence element, and
- Sequence member, or how it will be used in sequence by the plugin (Ex. when the sequence is “Date by Year” and the Sequence.
Invalid items have text in the sequence element which the plugin cannot parse (See Figure 2). You can click the item ID number to open the item and correct the element text.
Out of range items have text in their sequence element which is outside the range you set (See Figure 3). The table on this tab gives:
- the item number (a link to the item),
- the text in the sequence element, and
- Sequence member, or how it will be used in sequence by the plugin (Ex. when the sequence is “Date by Year” and the Sequence
Note: to update the sequence text in these items, use the linked item number to modify each item. If you do not modify out of range items, they will not be included in the corpus.
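The three tabs can be thought of as the result of a simple classification. The Python sketch below is illustrative only: it assumes a Date by Year sequence and a deliberately simple parsing rule (the element text must begin with a four-digit year), which is not necessarily how the plugin parses dates. The item IDs and texts are invented.

```python
import re
from typing import Dict, List, Optional


def sequence_member_year(text: str) -> Optional[int]:
    """Assumed rule: accept text that starts with exactly four digits."""
    match = re.match(r"\s*(\d{4})(?!\d)", text)
    return int(match.group(1)) if match else None


def classify(items: Dict[int, str], start: int, end: int) -> Dict[str, List[int]]:
    """Sort item IDs into valid, invalid, and out-of-range groups."""
    result: Dict[str, List[int]] = {"valid": [], "invalid": [], "out_of_range": []}
    for item_id, text in items.items():
        year = sequence_member_year(text)
        if year is None:
            result["invalid"].append(item_id)       # text could not be parsed
        elif start <= year <= end:
            result["valid"].append(item_id)         # parsed and inside the range
        else:
            result["out_of_range"].append(item_id)  # parsed but outside the range
    return result


# Hypothetical item IDs mapped to the text of their sequence element.
items = {1: "1905-03-12", 2: "circa spring", 3: "1850", 4: "1910"}
print(classify(items, 1900, 1950))
```

In this sketch, correcting the text of item 2 or widening the range to include 1850 would move those items into the valid group on the next pass.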
For ease of navigation, you may click to open a new tab for the invalid or out of range items you would like to modify. Refresh the list of valid and invalid items by reloading this page. Once you are done correcting invalid items, or the list of valid items looks correct, click the green Accept Valid Items button.
Note: Once you click the Accept Valid Items button, you will not be able to reconfigure the item pool or reset the body of valid items.
Valid Items (Figure 1)
Invalid Items (Figure 2)
Out of Range Items (Figure 3)
Note: After you have validated your items, the Item Counts pane will update to provide a count of the number of items in your corpus.
After you have validated your items, click the buttons to generate unigrams (single words), bigrams (two-word pairs), and trigrams (three-word groups). You can only run one of these processes at a time. Refresh the page to see whether the process is complete; larger corpora will take longer to process. While the ngrams are processing, these buttons will be grey and text will update to indicate which process is “In Progress.” When complete, the text will update to read “Generated.”
You do not have to generate unigrams, bigrams, and trigrams in order to use the View Corpus functions. However, running all three processes before you view the corpus will give you more options when analyzing it.
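To make the three kinds of ngram concrete, here is a minimal Python sketch. It splits text on word characters with a regular expression, which is a simplification: the plugin relies on IntlBreakIterator for word breaking, so its tokens may differ from these.

```python
import re
from typing import List


def tokenize(text: str) -> List[str]:
    """Simplified tokenizer: lowercase runs of word characters."""
    return re.findall(r"\w+", text.lower())


def ngrams(tokens: List[str], n: int) -> List[str]:
    """Return every run of n consecutive tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = tokenize("The quick brown fox")
print(ngrams(tokens, 1))  # unigrams
print(ngrams(tokens, 2))  # bigrams
print(ngrams(tokens, 3))  # trigrams
```

A text with four words yields four unigrams, three bigrams, and two trigrams, which is why totals differ between the three ngram types.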
Note: It is only possible to generate one corpus at a time.
Once you have created a corpus, validated the items, and generated ngrams, you can view the corpus in two ways: Ngram Search and Ngram Frequency.
In order to get back to the Corpus summary page from the Corpus viewer, click the “back to Corpus” button just under the label Corpus viewer.
Ngram Search - Using the text field, enter comma-separated words or phrases whose frequency you want to graph. Note: you can only search for two-word phrases if you have generated bigrams, three-word phrases if you have generated trigrams, and so on.
You may optionally specify a range for the corpus search. The format of the range must exactly match the format of your sequence data. If you have sequenced the corpus by year, enter four-digit years; if you have sequenced it by month, enter the range in yyyymm format.
The results should return a sequence graph (if they do not, check the formatting of the range data), along with a table showing Ngram Counts and Total Ngram Counts.
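One way to picture the table is as a count of the searched ngram within each sequence member, next to the total number of ngrams of the same length for that member. The Python sketch below works under that assumption, with invented data and a simplified regex tokenizer; the plugin's own counting may differ in detail.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def ngrams(text: str, n: int) -> List[str]:
    """Simplified tokenizer plus sliding window of n tokens."""
    tokens = re.findall(r"\w+", text.lower())
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def search(corpus: Iterable[Tuple[int, str]], phrase: str) -> Dict[int, Tuple[int, int]]:
    """Map each sequence member to (count of phrase, total ngrams of that length)."""
    words = phrase.lower().split()
    target, n = " ".join(words), len(words)
    hits: Dict[int, int] = defaultdict(int)
    totals: Dict[int, int] = defaultdict(int)
    for member, text in corpus:
        grams = ngrams(text, n)
        totals[member] += len(grams)
        hits[member] += grams.count(target)
    return {member: (hits[member], totals[member]) for member in sorted(totals)}


# Hypothetical corpus: (year, text of the selected element) for each item.
corpus = [
    (1901, "the mill opened and the mill closed"),
    (1901, "a new mill"),
    (1902, "the river rose"),
]
for member, (count, total) in search(corpus, "mill").items():
    print(member, count, total, count / total)
```

Dividing the count by the total gives a relative frequency, which keeps years with many items comparable to years with few.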
Note: These results reflect the composition of the selected corpus (which has been filtered by text element and search query), not the entirety of your collection.
Ngram Frequency - The Ngram Frequency Corpus view returns ngrams in order of frequency.
Enter the number of unigrams, bigrams, or trigrams you want returned (select one type using the radio button). By default, the number of results is set to 100.
Clicking the Go button produces frequency information, including the total number and unique number of unigrams/bigrams/trigrams, and a chart that displays each ngram, its total count, and its frequency as a percentage.
Note that the Ngram plugin does not strip out stop words (a, the, of, for example), so depending on the content of the element that forms your corpus, you may want to enter a larger number in order to return useful results.
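The effect of stop words on a frequency list can be seen in a small Python sketch. It uses a simplified regex tokenizer and invented text, and the optional stop word filter is an illustration of post-processing you might do yourself, not a feature of the plugin.

```python
import re
from collections import Counter
from typing import FrozenSet, Iterable


def frequency_table(texts: Iterable[str], limit: int = 100,
                    stop_words: FrozenSet[str] = frozenset()) -> dict:
    """Count unigrams and return totals plus the top rows with percentages."""
    counts: Counter = Counter()
    for text in texts:
        counts.update(re.findall(r"\w+", text.lower()))
    total = sum(counts.values())
    rows = [
        (word, count, 100 * count / total)
        for word, count in counts.most_common()
        if word not in stop_words  # hypothetical filter applied after counting
    ]
    return {"total": total, "unique": len(counts), "rows": rows[:limit]}


texts = ["the cat and the hat", "the cat sat"]
print(frequency_table(texts, limit=2))
print(frequency_table(texts, limit=2, stop_words=frozenset({"the", "and"})))
```

Without the filter, the top of the list is dominated by "the"; requesting more results, as suggested above, is the way to reach the more informative words in the plugin itself.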
Once you have at least one corpus, the page at admin/ngram/corpora (the Ngram tab) will display a table of your corpora with the following information for each:
- Name (that you give it)
- Text Element being used as the source of the corpus data, with element set in parentheses
- Sequence Element, with element set in parentheses
- Sequence Type
- Sequence Range