Tool: Named Entity Recognition
Purpose: This tool extracts Named Entities from documents within a Content Set. The current implementation uses spaCy's Annotations Named Entity Recognition (NER) module, which takes a broad approach to Named Entities that includes not only proper and common nouns but also numbers.
Definition: Named Entity Recognition (NER) is a natural language processing method that identifies named entities in the text of a Content Set and classifies each one into a specific entity category, or "class".
Application: Named Entity Recognition is often used to identify key people, places, and things within a Content Set. This tool can be useful when collecting place names for mapping, which are often difficult to aggregate without a close reading of each document.
Use Case: In the case of mapping, the data downloaded from the Named Entity Recognition tool can be used to compile the names of countries, cities, states, buildings, etc., which can then be plotted using GIS software.
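As a rough illustration of the mapping use case, the sketch below pulls place names out of a CSV download before they are passed to GIS software. The column names (`entity`, `type`, `document_id`, `document_title`) and the sample rows are assumptions made for this example, not the tool's documented layout, and the sketch assumes the type column holds spaCy names. If the download uses Gale names instead, the set of place types would need to be swapped accordingly.

```python
import csv
import io

# Hypothetical sample of a CSV download. The column names and rows
# are assumptions for illustration only.
SAMPLE_CSV = """entity,type,document_id,document_title
London,GPE,DOC001,A Tour of England
Thames,LOC,DOC001,A Tour of England
Charles Dickens,PERSON,DOC002,Literary Notes
Westminster Abbey,FAC,DOC002,Literary Notes
London,GPE,DOC002,Literary Notes
"""

# Entity categories that describe places: geo-political entities,
# other locations, and buildings or facilities.
PLACE_TYPES = {"GPE", "LOC", "FAC"}


def place_names(csv_file, type_column="type", entity_column="entity"):
    """Return the sorted, de-duplicated place names found in the CSV."""
    reader = csv.DictReader(csv_file)
    names = {
        row[entity_column].strip()
        for row in reader
        if row[type_column].strip() in PLACE_TYPES
    }
    return sorted(names)


places = place_names(io.StringIO(SAMPLE_CSV))
print(places)
```

The resulting list could then be geocoded or joined to a gazetteer in the GIS software of choice.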
Technical Specifications
This tool uses spaCy Annotation NER, which uses the OntoNotes 5 list of entity categories:
| spaCy Name | Gale Name | Description |
| CARDINAL | Number | Numerals that do not fall under another type |
| DATE | Date | Absolute or relative dates or periods |
| EVENT | Event | Named hurricanes, battles, wars, sports events, etc. |
| FAC | Place | Buildings, airports, highways, bridges, etc. |
| GPE | Geo-Political Entity | Countries, cities, states |
| LANGUAGE | Language | Any named language |
| LAW | Law | Named documents made into laws |
| LOC | Geography | Non-GPE locations, mountain ranges, bodies of water |
| MONEY | Money | Monetary values, including unit |
| NORP | Cultural Group | Nationalities or religious or political groups |
| ORDINAL | Position | "first", "second", etc. |
| ORG | Organization | Companies, agencies, institutions, etc. |
| PERCENT | Percentage | Percentage, including "%" |
| PERSON | Person | People, including fictional |
| PRODUCT | Product | Objects, vehicles, foods, etc. (Not services) |
| QUANTITY | Measurement | Measurements, as of weight or distance |
| TIME | Time | A period of time shorter than a day (24 hours) |
| WORK_OF_ART | Artwork | Includes titles of books, songs, etc. |
For more on spaCy Annotation NER functionality, please see https://spacy.io/usage/spacy-101#annotations-ner and https://spacy.io/api/annotation#named-entities.
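When working with downloaded data, it may help to translate between the two naming schemes. The lookup below is simply the category table transcribed into a Python dictionary. The helper name `gale_name` is made up for this example and is not part of spaCy or the Lab.

```python
# spaCy entity names mapped to Gale names, transcribed from the
# category table in this section.
SPACY_TO_GALE = {
    "CARDINAL": "Number",
    "DATE": "Date",
    "EVENT": "Event",
    "FAC": "Place",
    "GPE": "Geo-Political Entity",
    "LANGUAGE": "Language",
    "LAW": "Law",
    "LOC": "Geography",
    "MONEY": "Money",
    "NORP": "Cultural Group",
    "ORDINAL": "Position",
    "ORG": "Organization",
    "PERCENT": "Percentage",
    "PERSON": "Person",
    "PRODUCT": "Product",
    "QUANTITY": "Measurement",
    "TIME": "Time",
    "WORK_OF_ART": "Artwork",
}


def gale_name(spacy_label):
    """Return the Gale name for a spaCy name, or the input if unknown."""
    return SPACY_TO_GALE.get(spacy_label.upper(), spacy_label)


print(gale_name("GPE"))
```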
Currently, Gale Digital Scholar Lab’s implementation of spaCy’s NER does not employ the Parts of Speech module. However, Parts of Speech can be run as a separate tool.
Configuring the Tool
The NER tool is not configurable at this time. However, users may select and apply a text cleaning configuration to the text contained in their Content Set prior to analysis.
Result: Entities Found
This tool outputs a list of the top 200 entities by count:
- Each entity displays its category, the number of documents it was found in, and the total count across the Content Set.
- Users are also able to search the entire set of entities found by keyword using the entity search bar.
- This visualization can be used to browse recognized entities present within a Content Set by entity type.
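The two figures shown for each entity (number of documents and total count) can be reproduced from a list of entity mentions. The sketch below is one possible way to compute them, not the Lab's actual pipeline. The input tuples are invented sample data, and ranking by total count rather than document count is an assumption.

```python
from collections import Counter, defaultdict

# Hypothetical entity mentions as (document_id, entity, type) tuples.
MENTIONS = [
    ("DOC001", "London", "GPE"),
    ("DOC001", "London", "GPE"),
    ("DOC001", "Thames", "LOC"),
    ("DOC002", "London", "GPE"),
    ("DOC002", "Charles Dickens", "PERSON"),
    ("DOC003", "Charles Dickens", "PERSON"),
    ("DOC003", "Charles Dickens", "PERSON"),
    ("DOC003", "Charles Dickens", "PERSON"),
]


def top_entities(mentions, limit=200):
    """Rank entities by total count and report their document count."""
    totals = Counter()
    documents = defaultdict(set)
    for doc_id, entity, entity_type in mentions:
        key = (entity, entity_type)
        totals[key] += 1            # every mention adds to the total
        documents[key].add(doc_id)  # each document is counted once
    return [
        {
            "entity": entity,
            "type": entity_type,
            "document_count": len(documents[(entity, entity_type)]),
            "total_count": total,
        }
        for (entity, entity_type), total in totals.most_common(limit)
    ]


ranking = top_entities(MENTIONS)
for row in ranking:
    print(row)
```

Keying on the pair of entity and type keeps the same string apart when it is recognized under two different categories.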
Interactivity
- Users can sort by entity name, type, document count, and total count.
- Each entity name can be clicked to learn more about that entity’s relationship to the documents it was found in and other entities within the Content Set.
- Specifically, users will see the titles of the documents in which the entity was found.
- Users can also click into the document text and access the OCR tagged with all recognized entities.
- When viewing a single document, users can turn on/off specific entity types, view all entities grouped by type, and find specific instances of each recognized entity.
- The original document scan is also accessible from this view.
Download
- Visualization: This result cannot be downloaded as an image.
- Tabular Data: Tabular data is available in CSV (comma-delimited text) and JSON (JavaScript Object Notation) for this tool. Each tabular data download contains the full list of recognized entities with their associated entity type, document ID, and document title. Additionally, users may download the document-level JSON from the document view, which includes the OCR text with start and stop positions for the top 200 of each recognized entity.
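The start and stop positions in the document-level JSON make it possible to recover each entity's text from the OCR and to filter by entity type, much like toggling types in the document view. The sketch below assumes a JSON layout with `text`, `entities`, `type`, `start`, and `stop` keys and treats `stop` as exclusive. All of these are assumptions for illustration, so the key names should be checked against an actual download.

```python
import json

# Hypothetical document-level JSON. The key names and the exclusive
# "stop" offset are assumptions, not the documented format.
SAMPLE_JSON = """
{
  "text": "Charles Dickens lived in London.",
  "entities": [
    {"type": "PERSON", "start": 0, "stop": 15},
    {"type": "GPE", "start": 25, "stop": 31}
  ]
}
"""


def entity_spans(document, wanted_types=None):
    """Yield (type, text) pairs, optionally limited to certain types."""
    text = document["text"]
    for entity in document["entities"]:
        if wanted_types is None or entity["type"] in wanted_types:
            # Slice the OCR text using the recorded character offsets.
            yield entity["type"], text[entity["start"]:entity["stop"]]


document = json.loads(SAMPLE_JSON)
all_spans = list(entity_spans(document))
places_only = list(entity_spans(document, wanted_types={"GPE"}))
print(all_spans)
print(places_only)
```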
Enhancement Notes
The results and visualizations were enhanced in May 2019. The following changes were made:
- Replaced tree chart visualizations with the Entities Found result, which lists the top 200 entities in a user’s Content Set.
- Updated tabular data (.CSV) to reflect the data retrieved from the analysis pipeline.
- Made raw data (.JSON) download available to users.