Tool: Named Entity Recognition

Purpose: This tool extracts named entities from the documents in a Content Set. The current implementation uses spaCy's Named Entity Recognition (NER) annotations, which take a broad view of named entities: they cover not only proper and common nouns but also numeric expressions.

Definition: Named Entity Recognition (NER) is a natural language processing method that identifies terms in the Content Set that refer to entities and classifies each one into a specific entity category, or "class".

Application: Named Entity Recognition is often used to identify key people, places, and things within a Content Set. This tool can be useful when collecting place names for mapping, which are often difficult to aggregate without a close reading of each document.

Use Case: For mapping, the data downloaded from the Named Entity Recognition tool can be used to compile the names of countries, cities, states, buildings, etc., which can then be plotted using GIS software.
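As an illustration of the mapping use case, the following is a minimal sketch of how place names might be compiled from the CSV download before they are taken into GIS software. The column names (entity, entity_type) and the inline sample rows are assumptions made for the example, not the tool's documented export format; check the header row of an actual download. The sketch also assumes the entity type column holds spaCy labels; if the export uses the Gale names instead, the set of place types would need to change accordingly.

```python
import csv
import io
from collections import Counter

# Hypothetical column names; check the header row of the actual download.
ENTITY_COL = "entity"
TYPE_COL = "entity_type"

# spaCy labels that typically describe mappable places (assumed to appear
# in the export; swap in the Gale names if the file uses those instead).
PLACE_TYPES = {"GPE", "LOC", "FAC"}


def place_name_counts(csv_file):
    """Count how often each place-like entity appears in the CSV rows."""
    counts = Counter()
    for row in csv.DictReader(csv_file):
        if row[TYPE_COL] in PLACE_TYPES:
            counts[row[ENTITY_COL].strip()] += 1
    return counts


# Small inline sample standing in for a downloaded file.
sample = io.StringIO(
    "entity,entity_type,document_id,document_title\n"
    "London,GPE,DOC1,A Letter\n"
    "London,GPE,DOC2,A Report\n"
    "Thames,LOC,DOC1,A Letter\n"
    "Victoria,PERSON,DOC2,A Report\n"
)
places = place_name_counts(sample)
```

The resulting counts could then be joined against a gazetteer to obtain coordinates; that geocoding step is outside the scope of this sketch.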

Technical Specifications

This tool uses spaCy's NER annotations, which use the OntoNotes 5 list of entity categories:

spaCy Name   Gale Name             Description
CARDINAL     Number                Numerals that do not fall under another type
DATE         Date                  Absolute or relative dates or periods
EVENT        Event                 Named hurricanes, battles, wars, sports events, etc.
FAC          Place                 Buildings, airports, highways, bridges, etc.
GPE          Geo-Political Entity  Countries, cities, states
LANGUAGE     Language              Any named language
LAW          Law                   Named documents made into laws
LOC          Geography             Non-GPE locations, mountain ranges, bodies of water
MONEY        Money                 Monetary values, including unit
NORP         Cultural Group        Nationalities or religious or political groups
ORDINAL      Position              "first", "second", etc.
ORG          Organization          Companies, agencies, institutions, etc.
PERCENT      Percentage            Percentage, including "%"
PERSON       Person                People, including fictional
PRODUCT      Product               Objects, vehicles, foods, etc. (Not services)
QUANTITY     Measurement           Measurements, as of weight or distance
TIME         Time                  A period of time, smaller than a day or 24 hours.
WORK_OF_ART  Artwork               Includes titles of books, songs, etc.
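When working with downloaded data in a script, it can be handy to translate between the two naming schemes in the table above. The sketch below simply encodes the table as a Python dictionary; the helper name gale_name is hypothetical and is not part of spaCy or of the Lab.

```python
# Mapping from spaCy entity labels to the Gale names listed in the table.
SPACY_TO_GALE = {
    "CARDINAL": "Number",
    "DATE": "Date",
    "EVENT": "Event",
    "FAC": "Place",
    "GPE": "Geo-Political Entity",
    "LANGUAGE": "Language",
    "LAW": "Law",
    "LOC": "Geography",
    "MONEY": "Money",
    "NORP": "Cultural Group",
    "ORDINAL": "Position",
    "ORG": "Organization",
    "PERCENT": "Percentage",
    "PERSON": "Person",
    "PRODUCT": "Product",
    "QUANTITY": "Measurement",
    "TIME": "Time",
    "WORK_OF_ART": "Artwork",
}


def gale_name(spacy_label):
    """Return the Gale name for a spaCy label, or the input if unknown."""
    return SPACY_TO_GALE.get(spacy_label.upper(), spacy_label)
```

For example, gale_name("FAC") returns "Place", while an unrecognized label is passed through unchanged.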

For more on spaCy Annotation NER functionality, please see https://spacy.io/usage/spacy-101#annotations-ner and https://spacy.io/api/annotation#named-entities.

Currently, Gale Digital Scholar Lab’s implementation of spaCy’s NER does not employ the Parts of Speech module. However, Parts of Speech can be run as a separate tool.

Configuring the Tool

The NER tool is not configurable at this time. However, users may select and apply a text cleaning configuration to the text contained in their Content Set prior to analysis.

Result: Entities Found

This tool outputs a list of the top 200 entities by count:

  • Each entity displays its category, the number of documents it was found in, and the total count across the Content Set.
  • Users are also able to search the entire set of entities found by keyword using the entity search bar.
  • This visualization can be used to browse recognized entities present within a Content Set by entity type.
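To make the three figures in this result concrete (category, document count, and total count), here is a minimal sketch of how such a ranking could be computed from a list of entity mentions. It is an illustration under stated assumptions, not the Lab's actual pipeline: the (document ID, entity text, label) tuple layout, the function name top_entities, and the tie-breaking rule are all hypothetical.

```python
from collections import Counter, defaultdict


def top_entities(mentions, limit=200):
    """Rank (entity, type) pairs by total count across all documents.

    `mentions` is an iterable of (doc_id, entity_text, label) tuples,
    one per occurrence. This layout is an assumption for the example.
    """
    total = Counter()          # total occurrences per (entity, type)
    docs = defaultdict(set)    # distinct documents per (entity, type)
    for doc_id, text, label in mentions:
        key = (text, label)
        total[key] += 1
        docs[key].add(doc_id)

    # Highest count first; ties broken alphabetically (an arbitrary choice).
    ranked = sorted(total, key=lambda k: (-total[k], k))
    return [
        {
            "entity": text,
            "type": label,
            "documents": len(docs[(text, label)]),
            "count": total[(text, label)],
        }
        for text, label in ranked[:limit]
    ]


mentions = [
    ("DOC1", "London", "GPE"),
    ("DOC1", "London", "GPE"),
    ("DOC2", "London", "GPE"),
    ("DOC2", "Victoria", "PERSON"),
]
table = top_entities(mentions)
```

In this sample, "London" appears three times across two documents, so it ranks first with a document count of 2 and a total count of 3.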

Gale Digital Scholar Lab Learning Center

Interactivity

  • Users can sort by entity name, type, document count, and total count.
  • Each entity name can be clicked to learn more about that entity’s relationship to the documents it was found in and other entities within the Content Set.
    • Specifically, users will see the document titles it was found in.
  • Users can also click into the document text and access the OCR tagged with all recognized entities.
  • When viewing a single document, users can turn on/off specific entity types, view all entities grouped by type, and find specific instances of each recognized entity.
  • The original document scan is also accessible from this view.

Download

  • Visualization: This result cannot be downloaded as an image.
  • Tabular Data: Tabular data is available in CSV (comma delimited text) and JSON (JavaScript Object Notation) for this tool. Each tabular data download contains the full list of recognized entities, their associated entity type, document ID, and document title. Additionally, users may download the document level JSON from the document view, which includes the OCR text with start and stop positions for the top 200 of each recognized entity.
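The start and stop positions in the document-level JSON make it possible to rebuild a tagged view of the OCR text outside the Lab. The following is a minimal sketch of that idea. The JSON structure shown (keys text, entities, start, stop, type) and the convention that stop is exclusive are assumptions for illustration only; inspect an actual download to confirm the real field names and offsets.

```python
import json


def tag_text(doc, visible_types=None):
    """Wrap each entity span as [text|TYPE], optionally filtering by type.

    Assumes `stop` is an exclusive character offset into doc["text"].
    """
    text = doc["text"]
    pieces = []
    cursor = 0
    for ent in sorted(doc["entities"], key=lambda e: e["start"]):
        if visible_types is not None and ent["type"] not in visible_types:
            continue  # entity type switched off
        if ent["start"] < cursor:
            continue  # skip spans that overlap one already emitted
        pieces.append(text[cursor:ent["start"]])
        pieces.append("[" + text[ent["start"]:ent["stop"]] + "|" + ent["type"] + "]")
        cursor = ent["stop"]
    pieces.append(text[cursor:])
    return "".join(pieces)


# Hypothetical document-level JSON, standing in for a real download.
raw = (
    '{"text": "Victoria visited London.", "entities": ['
    '{"start": 0, "stop": 8, "type": "PERSON"}, '
    '{"start": 17, "stop": 23, "type": "GPE"}]}'
)
doc = json.loads(raw)
tagged = tag_text(doc)
only_places = tag_text(doc, visible_types={"GPE"})
```

Passing a set of types as visible_types mimics turning specific entity types on or off in the document view.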

Enhancement Notes

The results and visualizations were enhanced in May 2019. The following changes were made:

  • Replaced tree chart visualizations with the Entities Found result, which lists the top 200 entities in a user’s Content Set.
  • Updated the tabular data (CSV) download to reflect the data retrieved from the analysis pipeline.
  • Made a raw data (JSON) download available to users.