Available Datasets
| Dataset Name | Language | Documents | Tokens | Entity Types Count | COREF | Average Tokens / Doc |
|---|---|---|---|---|---|---|
| ontonotes5_english-NER | en | 3,637 | 2,074,405 | 18 | ❌ | 570 |
| long-litbank-fr-PER-only | fr | 32 | 556,103 | 4 | ✅ | 17,378 |
| conll2003-NER | en | 1,393 | 319,965 | 4 | ❌ | 229 |
| litbank-fr | fr | 29 | 276,992 | 7 | ✅ | 9,551 |
| litbank | en | 100 | 213,677 | 6 | ✅ | 2,136 |
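
The Average Tokens / Doc column appears to be the token count divided by the document count, rounded down. The sketch below recomputes it from the rows of the table; the floor rounding is an assumption inferred from the listed values, and the helper name `average_tokens_per_doc` is hypothetical, not part of Propp.

```python
# Rows copied from the table above: (dataset_name, documents, tokens, listed average).
ROWS = [
    ("ontonotes5_english-NER", 3_637, 2_074_405, 570),
    ("long-litbank-fr-PER-only", 32, 556_103, 17_378),
    ("conll2003-NER", 1_393, 319_965, 229),
    ("litbank-fr", 29, 276_992, 9_551),
    ("litbank", 100, 213_677, 2_136),
]


def average_tokens_per_doc(tokens: int, documents: int) -> int:
    """Return tokens / documents rounded down (assumed rounding rule)."""
    return tokens // documents


for name, documents, tokens, listed in ROWS:
    computed = average_tokens_per_doc(tokens, documents)
    print(f"{name}: computed {computed}, listed {listed}")
```

Floor division reproduces every listed value, which is why the conll2003-NER average is shown as 229 although the exact ratio is closer to 229.7.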
NER-Only Propp formatted datasets
conll2003-NER Mention Spans Detection
Coreference Resolution Propp formatted datasets
- Download Long-LitBank-fr-PER-Only dataset – PROPP Minimal Implementation
- Download LitBank dataset – PROPP Minimal Implementation
Note: This version is a minimal implementation of the original LitBank dataset, formatted specifically for use with Propp’s coreference resolution training pipeline. It contains only the essential columns (`byte_onset`, `byte_offset`, `cat`, `COREF_name`), aligned with the text for efficient model training.
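
As a rough illustration of how these four columns relate to the text, the sketch below slices mention spans out of a document and groups them by `COREF_name`. It rests on several assumptions not confirmed by this page: that the offsets index into the UTF-8 encoded text, that `byte_offset` is an exclusive end position, and that each annotation row can be read as a dictionary. The function `extract_mentions` and the toy sentence are hypothetical and are not part of Propp or LitBank.

```python
from collections import defaultdict


def extract_mentions(text: str, rows: list[dict]) -> dict[str, list[tuple[str, str]]]:
    """Group (cat, mention text) pairs by COREF_name.

    Assumes byte_onset/byte_offset are offsets into the UTF-8 bytes of `text`
    and that byte_offset is exclusive; adjust if the real files differ.
    """
    data = text.encode("utf-8")
    chains: dict[str, list[tuple[str, str]]] = defaultdict(list)
    for row in rows:
        span = data[row["byte_onset"]:row["byte_offset"]].decode("utf-8")
        chains[row["COREF_name"]].append((row["cat"], span))
    return dict(chains)


# Toy data invented for illustration only.
text = "Emma smiled. She left."
rows = [
    {"byte_onset": 0, "byte_offset": 4, "cat": "PER", "COREF_name": "Emma"},
    {"byte_onset": 13, "byte_offset": 16, "cat": "PER", "COREF_name": "Emma"},
]

chains = extract_mentions(text, rows)
print(chains)
```

If the offsets turn out to be character positions rather than byte positions, slicing `text` directly instead of its encoded bytes would be the corresponding change.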
French Datasets
LitBank-fr
Long-LitBank-fr (characters only)
Antoine Bourgois and Thierry Poibeau. 2025. The Elephant in the Coreference Room: Resolving Coreference in Full-Length French Fiction Works. In Proceedings of the Eighth Workshop on Computational Models of Reference, Anaphora and Coreference, pages 55–69, Suzhou, China. Association for Computational Linguistics.
@inproceedings{bourgois-poibeau-2025-elephant,
title = "The Elephant in the Coreference Room: Resolving Coreference in Full-Length {F}rench Fiction Works",
author = "Bourgois, Antoine and
Poibeau, Thierry",
editor = "Ogrodniczuk, Maciej and
Novak, Michal and
Poesio, Massimo and
Pradhan, Sameer and
Ng, Vincent",
booktitle = "Proceedings of the Eighth Workshop on Computational Models of Reference, Anaphora and Coreference",
month = nov,
year = "2025",
address = "Suzhou, China",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.crac-1.5/",
doi = "10.18653/v1/2025.crac-1.5",
pages = "55--69",
abstract = "While coreference resolution is attracting more interest than ever from computational literature researchers, representative datasets of fully annotated long documents remain surprisingly scarce. In this paper, we introduce a new annotated corpus of three full-length French novels, totaling over 285,000 tokens. Unlike previous datasets focused on shorter texts, our corpus addresses the challenges posed by long, complex literary works, enabling evaluation of coreference models in the context of long reference chains. We present a modular coreference resolution pipeline that allows for fine-grained error analysis. We show that our approach is competitive and scales effectively to long documents. Finally, we demonstrate its usefulness to infer the gender of fictional characters, showcasing its relevance for both literary analysis and downstream NLP tasks."
}
English Datasets
LitBank-en
LitBank is an annotated dataset of 100 works of English-language fiction designed to support tasks in natural language processing and the computational humanities.
Note: This version does not modify the underlying annotations; it only restructures them for easier use in Propp.
David Bamman, Olivia Lewke, and Anya Mansoor. 2020. An Annotated Dataset of Coreference in English Literature. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 44–54, Marseille, France. European Language Resources Association.
@inproceedings{bamman-etal-2020-annotated,
title = "An Annotated Dataset of Coreference in {E}nglish Literature",
author = "Bamman, David and Lewke, Olivia and Mansoor, Anya",
booktitle = "Proceedings of the Twelfth Language Resources and Evaluation Conference",
year = "2020",
address = "Marseille, France",
publisher = "European Language Resources Association",
url = "https://aclanthology.org/2020.lrec-1.6/",
pages = "44--54",
ISBN = "979-10-95546-34-4",
}
Russian Datasets
🚧 Coming soon... 🚧