Text analysis, editing & search

This section gathers tools for working with texts in Digital Humanities projects: from editing and encoding documents (e.g., TEI/XML), to exploring corpora, building search engines, and using models for automated analysis (topics, entities, OCR/HTR). To make choosing easier, the tools are organised into 5 groups based on the main task.


1) Search & indexing (Information Retrieval)

To build large-scale search: index documents, retrieve results, and filter efficiently.

Includes: Apache Lucene, Apache Solr, and OpenSearch.
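The core idea behind these engines is the inverted index: a map from each term to the documents containing it, so queries become fast set operations. A minimal sketch in plain Python (the document texts and `search` helper are invented for illustration; Lucene, Solr, and OpenSearch add analysis chains, ranking, and on-disk storage on top of this):

```python
from collections import defaultdict

# Toy corpus: document id -> text (invented examples).
docs = {
    1: "the manuscript describes a voyage",
    2: "a critical edition of the manuscript",
    3: "voyage narratives in early print",
}

# Build the inverted index: term -> set of document ids.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():
        index[term].add(doc_id)

def search(*terms):
    """Return ids of documents containing every query term (boolean AND)."""
    results = [index.get(t.lower(), set()) for t in terms]
    return set.intersection(*results) if results else set()

print(sorted(search("manuscript")))            # [1, 2]
print(sorted(search("manuscript", "voyage")))  # [1]
```

Intersecting posting sets is what makes boolean filtering cheap even over large collections; ranking (e.g., BM25 in Lucene) then orders the surviving documents.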


2) Editing & encoding (TEI/XML)

For preparing texts with reusable structure, metadata and annotations (digital editions, apparatus, registers) and for collaborative annotation.

Includes: TEI, EpiDoc, ediarum, Roma, TEIGarage, Tapas, oXygen, TextGrid, XML Copy Editor, Hypothesis, Recogito.
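Because TEI is XML in the `http://www.tei-c.org/ns/1.0` namespace, encoded editions can be queried programmatically. A small sketch, using an invented TEI fragment and Python's standard library parser (real editions are far richer, but the namespace handling shown here is the part that usually trips people up):

```python
import xml.etree.ElementTree as ET

# Minimal TEI fragment, invented for illustration.
TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc><titleStmt><title>Sample Letter</title></titleStmt></fileDesc>
  </teiHeader>
  <text><body>
    <p>Dear <persName>Ada</persName>, the <placeName>London</placeName> archive is open.</p>
  </body></text>
</TEI>"""

# TEI elements live in the TEI namespace, so searches need a prefix mapping.
ns = {"tei": "http://www.tei-c.org/ns/1.0"}
root = ET.fromstring(TEI)

title = root.find(".//tei:title", ns).text
names = [el.text for el in root.findall(".//tei:persName", ns)]
places = [el.text for el in root.findall(".//tei:placeName", ns)]
print(title, names, places)  # Sample Letter ['Ada'] ['London']
```

This is why consistent encoding pays off: once persons and places are tagged, extracting registers or indexes is a query, not a rereading.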


3) Corpus exploration

To explore text collections quickly (concordances, frequencies, comparison) without building a full search engine.

Includes: AntConc, Voyant, Lexos, Lyneal, CorpusSearch 2, TEITOK, Callimachus.
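The two workhorse operations these tools offer, frequency lists and keyword-in-context (KWIC) concordances, are simple enough to sketch directly (the sample sentence and the `kwic` helper are invented; AntConc and Voyant add tokenisation options, sorting, and visualisation):

```python
from collections import Counter

# Invented sample text.
text = ("the ship sailed at dawn and the crew watched the shore "
        "until the shore faded from view")
tokens = text.split()

# Word frequencies.
freq = Counter(tokens)

def kwic(tokens, keyword, n=2):
    """Each occurrence of keyword with n words of context on either side."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - n):i])
            right = " ".join(tokens[i + 1:i + 1 + n])
            lines.append(f"{left} [{keyword}] {right}")
    return lines

print(freq.most_common(2))        # [('the', 4), ('shore', 2)]
for line in kwic(tokens, "shore"):
    print(line)
```

Concordance lines like these are the quickest way to see how a word behaves in context before committing to a full search-engine setup.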


4) Models & NLP

For automated language analysis (entities, topics, classification, embeddings, transformers); converting images and handwriting to text (OCR/HTR) is covered in group 5.

Includes: CoreNLP, Stanza, OpenNLP, spaCy, NLTK, Transformers, fastText, Flair, Gensim, and MALLET.
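A common thread in these libraries is representing texts as vectors so they can be compared numerically. As a rough, library-free illustration of the vector-space idea (the example sentences and helper names are invented; Gensim, fastText, and Transformers learn dense embeddings rather than the raw word counts used here):

```python
import math
from collections import Counter

def bow(text):
    """Bag-of-words vector: word -> count."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb)

d1 = bow("the voyage across the sea")
d2 = bow("a voyage over the open sea")
d3 = bow("printing techniques in early books")

print(round(cosine(d1, d2), 2))  # documents about voyages score high
print(round(cosine(d1, d3), 2))  # no shared vocabulary -> 0.0
```

Topic models, classifiers, and embedding-based search all build on this same move: text in, vector out, then geometry does the comparing.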


5) Text recognition (OCR/HTR, Handwritten Text Recognition)

To convert document images into editable, searchable text. Includes tools for page preparation (layout/line/region detection), OCR/HTR, correction, and (when needed) training models adapted to a specific collection.

Includes: dhSegment, docTR, eScriptorium, Kraken, LayoutParser, PaddleOCR, Tesseract OCR, and Transkribus.
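The "page preparation" step can be made concrete with a classic baseline: horizontal projection, which counts ink pixels per row of a binarised page and treats runs of non-empty rows as text lines. The tiny ASCII "page" and `find_lines` helper below are invented for illustration; tools like Kraken, eScriptorium, and dhSegment use trained neural segmenters that handle skew, curved lines, and complex layouts this baseline cannot:

```python
# Toy binarised page: '#' = ink, '.' = background (invented example).
page = [
    "..........",
    ".####.###.",   # text line 1
    ".##....##.",
    "..........",
    "..........",
    ".######...",   # text line 2
    "..........",
]

# Horizontal projection profile: ink pixels per row.
profile = [row.count("#") for row in page]

def find_lines(profile):
    """Return (start_row, end_row) for each run of non-empty rows."""
    lines, start = [], None
    for i, ink in enumerate(profile):
        if ink and start is None:
            start = i
        elif not ink and start is not None:
            lines.append((start, i - 1))
            start = None
    if start is not None:
        lines.append((start, len(profile) - 1))
    return lines

print(find_lines(profile))  # [(1, 2), (5, 5)]
```

Once line regions are found, each strip is cropped and passed to the OCR/HTR model proper; segmentation quality is often the bottleneck for historical documents.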


Some tools fit more than one group; we list each where it is most commonly used in DH workflows.