Available Datasets

| dataset_name | language | documents | Tokens | Entity Types Count | COREF | Average Tokens / Doc |
|---|---|---|---|---|---|---|
| ontonotes5_english-NER | en | 3,637 | 2,074,405 | 18 | | 570 |
| long-litbank-fr-PER-only | fr | 32 | 556,103 | 4 | | 17,378 |
| conll2003-NER | en | 1,393 | 319,965 | 4 | | 229 |
| litbank-fr | fr | 29 | 276,992 | 7 | | 9,551 |
| litbank | en | 100 | 213,677 | 6 | | 2,136 |
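
The Average Tokens / Doc column appears to be the token count divided by the document count, rounded down. That definition is inferred from the figures in the table, not stated by the source, so the short Python sketch below treats it as an assumption and simply checks it against the listed values. The names `DATASETS` and `average_tokens_per_doc` are hypothetical and are not part of Propp.

```python
# Rows copied from the table above: (documents, tokens, listed average).
DATASETS = {
    "ontonotes5_english-NER": (3_637, 2_074_405, 570),
    "long-litbank-fr-PER-only": (32, 556_103, 17_378),
    "conll2003-NER": (1_393, 319_965, 229),
    "litbank-fr": (29, 276_992, 9_551),
    "litbank": (100, 213_677, 2_136),
}


def average_tokens_per_doc(tokens: int, documents: int) -> int:
    """Assumed definition: tokens divided by documents, rounded down."""
    return tokens // documents


# Compare the computed value with the one listed in the table.
for name, (documents, tokens, listed) in DATASETS.items():
    computed = average_tokens_per_doc(tokens, documents)
    print(f"{name}: computed={computed}, listed={listed}, match={computed == listed}")
```

With these figures every computed value matches the listed one. Rounding to the nearest integer instead would give 230 for conll2003-NER and 2,137 for litbank, which is why floor division is the assumption used here.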

NER-Only Propp-formatted datasets

conll2003-NER Mention Spans Detection

Coreference Resolution Propp-formatted datasets

French Datasets

LitBank-fr

Long-LitBank-fr (characters only)

Antoine Bourgois and Thierry Poibeau. 2025. The Elephant in the Coreference Room: Resolving Coreference in Full-Length French Fiction Works. In Proceedings of the Eighth Workshop on Computational Models of Reference, Anaphora and Coreference, pages 55–69, Suzhou, China. Association for Computational Linguistics.

@inproceedings{bourgois-poibeau-2025-elephant,
    title = "The Elephant in the Coreference Room: Resolving Coreference in Full-Length {F}rench Fiction Works",
    author = "Bourgois, Antoine  and
      Poibeau, Thierry",
    editor = "Ogrodniczuk, Maciej  and
      Novak, Michal  and
      Poesio, Massimo  and
      Pradhan, Sameer  and
      Ng, Vincent",
    booktitle = "Proceedings of the Eighth Workshop on Computational Models of Reference, Anaphora and Coreference",
    month = nov,
    year = "2025",
    address = "Suzhou, China",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.crac-1.5/",
    doi = "10.18653/v1/2025.crac-1.5",
    pages = "55--69",
    abstract = "While coreference resolution is attracting more interest than ever from computational literature researchers, representative datasets of fully annotated long documents remain surprisingly scarce. In this paper, we introduce a new annotated corpus of three full-length French novels, totaling over 285,000 tokens. Unlike previous datasets focused on shorter texts, our corpus addresses the challenges posed by long, complex literary works, enabling evaluation of coreference models in the context of long reference chains. We present a modular coreference resolution pipeline that allows for fine-grained error analysis. We show that our approach is competitive and scales effectively to long documents. Finally, we demonstrate its usefulness to infer the gender of fictional characters, showcasing its relevance for both literary analysis and downstream NLP tasks."
}

English Datasets

LitBank-en

LitBank is an annotated dataset of 100 works of English-language fiction designed to support tasks in natural language processing and the computational humanities.

Note: This version leaves the underlying annotations unchanged; it only restructures them for easier use in Propp.

David Bamman, Olivia Lewke, and Anya Mansoor. 2020. An Annotated Dataset of Coreference in English Literature. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 44–54, Marseille, France. European Language Resources Association.

@inproceedings{bamman-etal-2020-annotated,
    title = "An Annotated Dataset of Coreference in {E}nglish Literature",
    author = "Bamman, David and Lewke, Olivia and Mansoor, Anya",
    booktitle = "Proceedings of the Twelfth Language Resources and Evaluation Conference",
    year = "2020",
    address = "Marseille, France",
    publisher = "European Language Resources Association",
    url = "https://aclanthology.org/2020.lrec-1.6/",
    pages = "44--54",
    ISBN = "979-10-95546-34-4",
}

Russian Datasets

🚧 Coming soon... 🚧