Text analysis, editing & search
This section gathers tools for working with texts in Digital Humanities projects: from editing and encoding documents (e.g., TEI/XML), to exploring corpora, building search engines, and using models for automated analysis (topics, entities, OCR/HTR). To make choosing easier, tools are organised into five groups based on the main task.
1) Search & indexing (Information Retrieval)
To build large-scale search: index documents, retrieve results, and filter efficiently.
Includes: Apache Lucene, Apache Solr, and OpenSearch.
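At the core of engines like Lucene, Solr, and OpenSearch is an inverted index: a map from each token to the documents containing it. The sketch below is a toy pure-Python illustration of that idea (the document texts and ids are invented), not how those engines are actually implemented:

```python
from collections import defaultdict

def build_index(docs):
    """Map each token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents containing every query token (AND search)."""
    tokens = query.lower().split()
    if not tokens:
        return set()
    results = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        results &= index.get(token, set())
    return results

docs = {
    "d1": "the annotated letters of a medieval scribe",
    "d2": "letters and charters from the royal archive",
    "d3": "a catalogue of printed books",
}
index = build_index(docs)
print(search(index, "letters archive"))  # only d2 contains both tokens
```

Real engines add tokenisation rules, ranking (e.g., BM25), and efficient on-disk storage on top of this basic structure.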
2) Editing & encoding (TEI/XML)
For preparing texts with reusable structure, metadata and annotations (digital editions, apparatus, registers) and for collaborative annotation.
Includes: TEI, EpiDoc, ediarum, Roma, TEIGarage, Tapas, oXygen, TextGrid, XML Copy Editor, Hypothesis, Recogito.
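One payoff of TEI encoding is that the markup can be queried programmatically. As a minimal sketch (the TEI fragment below is invented for illustration), Python's standard library can extract titles and annotated names from a namespaced TEI document:

```python
import xml.etree.ElementTree as ET

# A minimal TEI fragment, invented for illustration.
tei = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title>Sample Edition</title></titleStmt>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <p>A letter from <persName>Ada Lovelace</persName>.</p>
    </body>
  </text>
</TEI>"""

# TEI elements live in the TEI namespace, so queries must declare it.
NS = {"tei": "http://www.tei-c.org/ns/1.0"}
root = ET.fromstring(tei)

title = root.find(".//tei:titleStmt/tei:title", NS).text
persons = [el.text for el in root.findall(".//tei:persName", NS)]
print(title, persons)
```

For serious editions, dedicated tools (oXygen, TEIGarage) and schema validation are the norm; this only shows why structured encoding makes texts machine-readable.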
3) Corpus exploration
To explore text collections quickly (concordances, frequencies, comparison) without building a full search engine.
Includes: AntConc, Voyant, Lexos, Lyneal, CorpusSearch 2, TEITOK, Callimachus.
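Concordancers like AntConc and Voyant centre on keyword-in-context (KWIC) views and frequency counts. A hedged, pure-Python sketch of both operations (the sample sentence is invented):

```python
from collections import Counter

def kwic(tokens, keyword, window=3):
    """Keyword-in-context lines: `window` tokens on each side of a hit."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            hits.append((left, tok, right))
    return hits

text = "the ship sailed and the ship sank near the harbour"
tokens = text.split()

# Word frequencies, the other staple of corpus exploration.
print(Counter(tokens).most_common(2))

for left, kw, right in kwic(tokens, "ship"):
    print(f"{left:>20} | {kw} | {right}")
```

Dedicated tools add lemmatisation, regex queries, collocation statistics, and visualisation on top of these basics.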
4) Models & NLP
For automated language analysis (entities, topics, classification, embeddings, transformers).
Includes: CoreNLP, Stanza, OpenNLP, spaCy, NLTK, Transformers, fastText, Flair, Gensim, and MALLET.
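To make the task concrete, here is a deliberately naive stand-in for named-entity recognition: a regex that treats runs of two or more capitalised words as candidate entities. This is only an illustration of what the task takes as input and returns as output; the libraries above (spaCy, Stanza, Flair) use trained statistical models instead, which handle single-word names, ambiguity, and entity types:

```python
import re

def candidate_entities(text):
    """Very naive NER stand-in: runs of 2+ capitalised words."""
    pattern = r"\b(?:[A-Z][a-z]+)(?:\s[A-Z][a-z]+)+\b"
    return re.findall(pattern, text)

# Invented sample sentence.
sample = "In 1815 Jane Austen wrote to her sister from Hans Place in London."
print(candidate_entities(sample))
```

Note how the heuristic misses "London" (a single capitalised word) entirely; closing exactly this kind of gap is why trained models are used.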
5) Text recognition (OCR/HTR)
To convert document images into editable, searchable text. Includes tools for page preparation (layout/line/region detection), OCR/HTR, correction, and (when needed) training models adapted to a specific collection.
Includes: dhSegment, docTR, eScriptorium, Kraken, LayoutParser, PaddleOCR, Tesseract OCR, and Transkribus.
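A typical first step in page preparation is binarisation: separating ink from background before layout analysis and recognition. The sketch below shows simple global thresholding on a tiny hand-made grid of grayscale values; real tools use adaptive methods (e.g., Otsu or Sauvola thresholding) on actual images, and this is only a minimal illustration of the idea:

```python
def binarize(page, threshold=128):
    """Global thresholding: map grayscale pixels to pure black or white.
    `page` is a 2D list of values in 0..255 (0 = black, 255 = white)."""
    return [[0 if px < threshold else 255 for px in row] for row in page]

def ink_ratio(page):
    """Fraction of black pixels: a crude check that a region holds text."""
    flat = [px for row in page for px in row]
    return flat.count(0) / len(flat)

# A tiny invented 'page': mostly light background with a few dark strokes.
page = [
    [250, 240, 30, 245],
    [35, 220, 25, 250],
    [245, 40, 230, 240],
]
bw = binarize(page)
print(ink_ratio(bw))  # 4 dark pixels out of 12
```

Downstream steps (line detection, recognition, correction) operate on such cleaned-up images, which is why preprocessing quality strongly affects OCR/HTR accuracy.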
Some tools could fit more than one group. We list them where they are most commonly useful in DH workflows.