Analyze Overview


Once you have created, curated, and cleaned the OCR texts in your Content Set, you are ready to move on to the Analyze phase. The Analysis tools let you take hundreds or thousands of documents and interrogate them in ways that would be too time-consuming without the help of computational algorithms. In this section, learn how to choose tools, run them, and interpret their output.

1. Selecting the Right Tool

It’s important to know what tools are available and what they can do. In this section, you will learn about the questions you should ask yourself as you choose the right tool for your analysis.

Learn how to Select the Right Tool

2. Setting Up and Running

Each tool has settings you can adjust to refine its results. Knowing how to use these options helps you produce the best possible visualizations.

Learn more about Setting Up and Running

Available Tools

Document Clustering

Document Clustering groups the documents in your Content Set by similarity, using statistical measures of term frequency and the K-means algorithm.
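The idea behind term-frequency clustering can be sketched in a few lines. This is a minimal, illustrative K-means over toy term-frequency vectors, not the Lab's actual implementation; the sample documents and seeding strategy are invented for the example.

```python
from collections import Counter

def tf_vector(text, vocab):
    """Term-frequency vector for a document over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    return [counts[w] for w in vocab]

def kmeans(vectors, k, iters=10):
    """Minimal K-means: assign each vector to its nearest centroid,
    then recompute centroids, repeating a fixed number of times."""
    centroids = [list(v) for v in vectors[:k]]  # naive seed: first k documents
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in vectors:
            # Squared Euclidean distance to each centroid
            dists = [sum((a - b) ** 2 for a, b in zip(v, c)) for c in centroids]
            clusters[dists.index(min(dists))].append(v)
        for i, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster empties out
                centroids[i] = [sum(col) / len(members) for col in zip(*members)]
    return clusters

docs = ["whaling ships sail the sea",
        "the ship hunts the whale at sea",
        "parliament passed the trade act",
        "the act regulates colonial trade"]
vocab = sorted({w for d in docs for w in d.lower().split()})
clusters = kmeans([tf_vector(d, vocab) for d in docs], k=2)
```

Documents that share vocabulary end up near the same centroid, which is what the tool's scatter plot visualizes.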

More about Document Clustering

A screenshot of the scatter plot result from the document clustering tool.

Named Entity Recognition

Named Entity Recognition (NER) recognizes and extracts proper and common nouns from documents using spaCy’s Parts of Speech tagging model, and outputs them as lists grouped by entity types, including people, organizations, companies, locations, and more.
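The shape of NER output, lists of entities grouped by type, can be illustrated with a toy sketch. This is not spaCy (the actual tool uses spaCy's trained models); the mini-lexicon and example sentence below are invented for illustration only.

```python
# Invented mini-lexicon standing in for a trained statistical model.
ENTITY_LEXICON = {
    "PERSON": {"Darwin", "Dickens"},
    "ORG": {"Parliament", "Admiralty"},
    "LOC": {"London", "Atlantic"},
}

def extract_entities(text):
    """Return entities grouped by type, mirroring the tool's output table."""
    found = {label: [] for label in ENTITY_LEXICON}
    for token in text.split():
        word = token.strip(".,;")  # drop trailing punctuation
        for label, names in ENTITY_LEXICON.items():
            if word in names:
                found[label].append(word)
    return found

ents = extract_entities("Darwin sailed the Atlantic while Parliament debated in London.")
```

A real model infers entity types from context rather than a fixed word list, which is why it can recognize names it has never seen.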

More about Named Entity Recognition

A screenshot of the entities found table from the named entity recognition tool.


Ngrams

An Ngram is a term, or collocation of terms, found in your Content Set. You set the range or number of terms (“N”) you wish to consider in your analysis. Then, the frequency of those Ngrams is counted and displayed for analysis.
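Counting Ngrams amounts to sliding a window of N tokens across the text and tallying each run. A minimal sketch (the sample text is invented):

```python
from collections import Counter

def ngram_counts(text, n):
    """Count every run of n consecutive tokens (an Ngram) in the text."""
    tokens = text.lower().split()
    grams = [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return Counter(grams)

bigrams = ngram_counts("the whale and the whale ship", 2)
```

With n=2, the phrase "the whale" is counted twice here; the tool's bar chart ranks Ngrams by exactly this kind of frequency.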

More about Ngrams

A screenshot of the bar chart result from the ngrams tool.

Parts of Speech

Parts of Speech uses natural language processing to recognize and tag the parts of speech in your documents. It gives you the building blocks for examining how phrases are constructed within each document in a Content Set.
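What "tagging" produces can be shown with a toy lexicon-based tagger. The word list below is invented for illustration; the actual tool uses a trained model that resolves tags from syntactic context, not a lookup table.

```python
# Invented word-to-tag lookup standing in for a trained tagger.
TAG_LEXICON = {
    "the": "DET", "a": "DET",
    "ship": "NOUN", "sea": "NOUN", "whale": "NOUN",
    "sails": "VERB", "hunts": "VERB",
    "vast": "ADJ",
}

def tag(text):
    """Pair each token with a tag; unknown words fall back to 'X'."""
    return [(w, TAG_LEXICON.get(w, "X")) for w in text.lower().split()]

tagged = tag("The ship sails the vast sea")
```

Counting the tags per document over time is what produces the tool's line graph.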

More about Parts of Speech

A screenshot of the line graph result from the parts of speech tool.

Sentiment Analysis

Sentiment Analysis assigns an overall sentiment score to each document by giving each term a positive or negative value, based on the AFINN lexicon, and then averaging those values.
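The scoring mechanism can be sketched with a handful of AFINN-style entries. The values below are illustrative, not the full lexicon (the real AFINN list contains thousands of terms scored from -5 to +5), and ignoring unmatched terms is an assumption of this sketch.

```python
# A few illustrative AFINN-style entries; scores are for the sketch only.
AFINN_SAMPLE = {"good": 3, "great": 3, "bad": -3, "terrible": -3, "war": -2}

def sentiment_score(text):
    """Average the lexicon values of the terms that appear in the document.
    Terms absent from the lexicon contribute nothing in this sketch."""
    scores = [AFINN_SAMPLE[w] for w in text.lower().split() if w in AFINN_SAMPLE]
    return sum(scores) / len(scores) if scores else 0.0

score = sentiment_score("the harvest was good but the war was terrible")
```

Here "good" (+3) is outweighed by "war" (-2) and "terrible" (-3), so the document scores negative overall.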

More about Sentiment Analysis

A screenshot of the area line graph result from the sentiment analysis tool.

Topic Modeling

Topic Modeling analyzes a large collection of unstructured text and groups terms that frequently co-occur. These groups of terms are “topics,” to which you then assign meaning based on the terms and other measures.
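The co-occurrence idea at the heart of topic modeling can be sketched simply. Real topic modeling (e.g. LDA) is probabilistic; this toy version, with invented sample documents, only counts how often term pairs share a document.

```python
from collections import Counter
from itertools import combinations

def cooccurrence(docs):
    """Count how often each pair of distinct terms appears in the same document."""
    pairs = Counter()
    for doc in docs:
        terms = sorted(set(doc.lower().split()))  # distinct terms, stable order
        pairs.update(combinations(terms, 2))
    return pairs

docs = ["whale ship sea", "whale sea voyage", "trade act parliament", "trade act"]
pairs = cooccurrence(docs)
```

Pairs like ("sea", "whale") and ("act", "trade") score highest here; a topic model would gather such strongly associated terms into topics for you to interpret.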

More about Topic Modeling

A screenshot of the topics table from the topic modeling tool.