Step by Step Processing
1: Loading Models
```python
from propp_fr import load_models

spacy_model, mentions_detection_model, coreference_resolution_model = load_models()
```
Default models are:
- `spacy_model`: fr_dep_news_trf
- `mentions_detection_model`: propp-fr_NER_camembert-large_FAC_GPE_LOC_PER_TIME_VEH
- `coreference_resolution_model`: propp-fr_coreference-resolution_camembert-large_PER
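As a quick sanity check, you can confirm which transformer backs the mention detection model. This assumes the model object is dict-like, as the `base_model_name` lookup in Step 4 suggests:

```python
# The mention detection model records the name of its underlying transformer;
# Step 4 reuses it to load the matching tokenizer and embedding model.
print(mentions_detection_model["base_model_name"])
```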
2: Loading a .txt File
```python
from propp_fr import load_text_file

text_content = load_text_file("root_directory/my_french_novel.txt")
```
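A minimal check, assuming `load_text_file` returns the file contents as a single string (which is what `generate_tokens_df` consumes in the next step):

```python
# Preview the beginning of the loaded novel
print(len(text_content), "characters")
print(text_content[:200])
```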
3: Tokenizing the Text
Break down the text into individual tokens (words and punctuation) with linguistic information:
```python
from propp_fr import generate_tokens_df

tokens_df = generate_tokens_df(text_content, spacy_model)
```
tokens_df is a pandas.DataFrame where each row represents one token from the text.
| Column Name | Description |
|---|---|
| `paragraph_ID` | Which paragraph the token belongs to |
| `sentence_ID` | Which sentence the token belongs to |
| `token_ID_within_sentence` | Position of the token within its sentence |
| `token_ID_within_document` | Position of the token in the entire document |
| `word` | The actual word as it appears in the text |
| `lemma` | The base/dictionary form of the word |
| `byte_onset` | Starting byte position in the original file |
| `byte_offset` | Ending byte position in the original file |
| `POS_tag` | Part-of-speech tag (noun, verb, adjective, etc.) |
| `dependency_relation` | How the word relates to other words |
| `syntactic_head_ID` | The ID of the word this token depends on |
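Once tokenized, the DataFrame can be explored with standard pandas operations; here is a small sketch using the columns documented above:

```python
# Peek at the first tokens and their linguistic annotations
print(tokens_df[["word", "lemma", "POS_tag", "dependency_relation"]].head())

# Basic document statistics
print(tokens_df["sentence_ID"].nunique(), "sentences")
print(tokens_df["paragraph_ID"].nunique(), "paragraphs")
```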
4: Embedding Tokens
Transform the tokens into numerical representations (embeddings) that capture their meaning:
```python
from propp_fr import load_tokenizer_and_embedding_model, get_embedding_tensor_from_tokens_df

# Load the tokenizer and pre-trained embedding model
tokenizer, embedding_model = load_tokenizer_and_embedding_model(
    mentions_detection_model["base_model_name"],
)

# Generate embeddings for all tokens
tokens_embedding_tensor = get_embedding_tensor_from_tokens_df(
    text_content,
    tokens_df,
    tokenizer,
    embedding_model,
)
```
`tokens_embedding_tensor` is a `torch.Tensor` of shape `[number_of_tokens, embedding_size]`.
Each row corresponds to one token from tokens_df, preserving the same order.
These embeddings will be used as inputs for the mention detection model and the coreference resolution model.
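A quick consistency check (a sketch; the exact embedding size depends on the underlying CamemBERT model):

```python
# One embedding row per token, in the same order as tokens_df
assert tokens_embedding_tensor.shape[0] == len(tokens_df)
print(tokens_embedding_tensor.shape)  # [number_of_tokens, embedding_size]
```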
5: Mention Span Detection
Identify all mentions belonging to entities of different types in the text:
- Characters (PER): pronouns (je, tu, il, ...), possessive determiners (mon, ton, son, ...), common nouns (le capitaine, la princesse, ...) and proper nouns (Indiana Delmare, Honoré de Pardaillan, ...)
- Facilities (FAC): château, sentier, chambre, couloir, ...
- Time (TIME): le règne de Louis XIV, ce matin, en juillet, ...
- Geo-Political Entities (GPE): Montrouge, France, le petit hameau, ...
- Locations (LOC): le sud, Mars, l'océan, le bois, ...
- Vehicles (VEH): avion, voitures, calèche, vélos, ...
```python
from propp_fr import generate_entities_df

entities_df = generate_entities_df(
    tokens_df,
    tokens_embedding_tensor,
    mentions_detection_model,
)
```
What this does: Scans through the text to find all mentions of entities.
The entities_df object is a pandas.DataFrame where each row represents a detected mention:
| Column Name | Description |
|---|---|
| `start_token` | Token ID where the mention begins |
| `end_token` | Token ID where the mention ends |
| `cat` | Type of the entity (PER, FAC, TIME, GPE, LOC, or VEH) |
| `confidence` | Model's confidence score (0-1) for this detection |
| `text` | The actual text of the mention |
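Using the columns above, you can already slice the detections, for instance to keep only confident character mentions (the 0.9 threshold is purely illustrative):

```python
# How many mentions were found per entity type?
print(entities_df["cat"].value_counts())

# Keep only high-confidence character mentions
per_mentions = entities_df[
    (entities_df["cat"] == "PER") & (entities_df["confidence"] > 0.9)
]
print(per_mentions[["text", "start_token", "end_token", "confidence"]].head())
```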
To learn more about how mention detection is performed under the hood, check the Algorithms Section.
6: Adding Linguistic Features
Enrich your entity mentions with additional grammatical and syntactic information:
```python
from propp_fr import add_features_to_entities

entities_df = add_features_to_entities(entities_df, tokens_df)
```
What this does: Adds detailed linguistic features to each mention, including grammatical properties, syntactic structure, and contextual information.
This step adds the following columns to entities_df:
| Column Name | Description |
|---|---|
| `mention_len` | Length of the mention in tokens |
| `paragraph_ID` | Paragraph containing the mention |
| `sentence_ID` | Sentence containing the mention |
| `start_token_ID_within_sentence` | Position where the mention starts in its sentence |
| `out_to_in_nested_level` | Nesting depth (outer to inner) |
| `in_to_out_nested_level` | Nesting depth (inner to outer) |
| `nested_entities_count` | Number of entities nested within this mention |
| `head_id` | ID of the syntactic head token |
| `head_word` | The actual head word |
| `head_dependency_relation` | Dependency relation of the head |
| `head_syntactic_head_ID` | ID of the head's syntactic parent |
| `POS_tag` | Part-of-speech tag of the head |
| `prop` | Mention type: pronoun (PRON), common noun (NOM), or proper noun (PROP) |
| `number` | Grammatical number (singular/plural) |
| `gender` | Grammatical gender (masculine/feminine) |
| `grammatical_person` | Grammatical person (1st, 2nd, 3rd) |
These features are primarily used in the following steps of coreference resolution and character representation, but they can also be leveraged directly for a range of literary and linguistic analyses, such as character centrality, proper name tracking, mention type distribution, gender representation, and narrative perspective.
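Two of the analyses mentioned above can be sketched directly from these columns:

```python
per = entities_df[entities_df["cat"] == "PER"]

# Mention type distribution: shares of pronouns, common nouns, and proper nouns
print(per["prop"].value_counts(normalize=True))

# Gender representation across all character mentions
print(per["gender"].value_counts())
```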
7: Coreference Resolution
Link all PER mentions that refer to the same character, creating coreference chains:
```python
from propp_fr import perform_coreference

entities_df = perform_coreference(
    entities_df,
    tokens_embedding_tensor,
    coreference_resolution_model,
)
```
What this does: Groups character mentions into coreference chains where all mentions in a chain refer to the same person. For example, Marie, she, and the young woman might all be linked together as referring to the same character.
This step adds one new column to entities_df:
| Column Name | Description |
|---|---|
| `COREF` | ID of the coreference chain this mention belongs to |
Mentions with the same COREF value refer to the same character. Mentions with different values refer to different characters.
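A short sketch that lists the surface forms of each chain; the largest chains usually correspond to the main characters:

```python
per = entities_df[entities_df["cat"] == "PER"]
chains = per.groupby("COREF")["text"].apply(list)

# Show the three most frequently mentioned characters
for coref_id, forms in sorted(chains.items(), key=lambda kv: -len(kv[1]))[:3]:
    print(coref_id, len(forms), "mentions:", forms[:10])
```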
To learn more about how coreference resolution is performed under the hood, check the Algorithms Section.
8: Extracting Character Attributes
Identify tokens that describe or relate to characters:
```python
from propp_fr import extract_attributes

tokens_df = extract_attributes(entities_df, tokens_df)
```
What this does: Analyzes the syntactic structure around character mentions to identify words that function as attributes, linking them to the characters they describe.
This step adds the following columns to tokens_df:
| Column Name | Description |
|---|---|
| `is_mention_head` | Whether this token is the head of a character mention |
| `char_att_agent` | Mention ID if the token is an agent attribute, -1 otherwise |
| `char_att_patient` | Mention ID if the token is a patient attribute, -1 otherwise |
| `char_att_mod` | Mention ID if the token is a modifier attribute, -1 otherwise |
| `char_att_poss` | Mention ID if the token is a possessive attribute, -1 otherwise |
For each token, if it serves as an attribute to a character, the corresponding column contains the syntactic head token ID of the mention. Otherwise, it contains -1.
The four attribute types:
- Agent: verbs where the character is the subject (actions they perform): Marie marche, elle parle
- Patient: verbs where the character is the direct object or passive subject (actions done to them): on pousse Jean, il est suivi
- Modifier: adjectives or nominal predicates describing the character: Hercule est fort, la grande reine, Victor Hugo, l'écrivain
- Possessive: nouns denoting possessions linked by determiners, de-genitives, or avoir: son épée, la maison d'Alisée, il a un chien
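Because non-attribute tokens are marked with -1, recovering all attributes of one type is a simple filter. A sketch for agent attributes:

```python
# All tokens acting as "agent" attributes (verbs whose subject is a character)
agent_tokens = tokens_df[tokens_df["char_att_agent"] != -1]

# char_att_agent links each verb back to the mention it describes
print(agent_tokens[["word", "lemma", "char_att_agent"]].head())
```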
9: Aggregating Character Information
Build a unified, structured representation of every character extracted so far:
```python
from propp_fr import generate_characters_dict

characters_dict = generate_characters_dict(tokens_df, entities_df)
```
This function gathers all relevant information from every mention of each character: surface forms, syntactic attributes, and inferred features such as gender or number.
characters_dict is a list of dictionaries, where each dictionary corresponds to a single character and contains the following fields:
| Key | Description |
|---|---|
| `id` | Character identifier (0 = most frequent) |
| `count` | Total mentions and proportion relative to all character mentions |
| `gender` | Ratio of gendered mentions and inferred overall gender (based on majority evidence) |
| `number` | Ratio of singular/plural mentions and inferred number |
| `mentions` | All surface forms used to refer to the character: proper nouns, common nouns, and pronouns |
| `agent` | List of all agent attributes linked to the character |
| `patient` | List of all patient attributes linked to the character |
| `mod` | List of all modifier attributes linked to the character |
| `poss` | List of all possessive attributes linked to the character |
The resulting characters_dict provides a complete, query-ready profile for each character.
This will be useful for narrative analysis, visualization, and computational literary studies.
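For instance, since characters are ordered by frequency, the profile of the most-mentioned character is the first entry (a quick sketch using the keys documented above):

```python
# id 0 is the most frequently mentioned character
protagonist = characters_dict[0]

print(protagonist["count"])       # mention count and proportion
print(protagonist["gender"])      # gendered-mention ratio and inferred gender
print(protagonist["mentions"])    # proper nouns, common nouns, pronouns
print(protagonist["agent"][:10])  # first few actions the character performs
```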
Saving Generated Output
Once processing is complete, you can save the processed tokens, entities, and character profiles as reusable data, so you don't need to re-run the entire pipeline.
```python
from propp_fr import save_tokens_df, save_entities_df, save_book_file

root_directory = "root_directory"
file_name = "my_french_novel"

save_tokens_df(tokens_df, file_name, root_directory)
save_entities_df(entities_df, file_name, root_directory)
save_book_file(characters_dict, file_name, root_directory)
```
Reloading Processed Files
Later, you can reload the generated files instantly to resume your analysis.
```python
from propp_fr import load_text_file, load_tokens_df, load_entities_df, load_book_file

file_name = "my_french_novel"
root_directory = "root_directory"

text_content = load_text_file(file_name, root_directory)
tokens_df = load_tokens_df(file_name, root_directory)
entities_df = load_entities_df(file_name, root_directory)
characters_dict = load_book_file(file_name, root_directory)
```