A notebook with baseline implementations, data downloading and a sample submission.
Getting Started with the HTREC 2022 Challenge
This challenge focuses on the post-correction of HTR (handwritten text recognition) transcription errors. This starter kit explains how to download the data and submit directly from this notebook. We will submit the sample prediction in the required format.
In [ ]:
!pip install aicrowd-cli
%load_ext aicrowd.magic
Login to AIcrowd 🔐¶
In [ ]:
%aicrowd login
Download the Dataset ⬇️¶
We will create a folder named "data" and download the files there.
In [ ]:
!rm -rf data
!mkdir data
%aicrowd ds dl -c htrec-2022 -o data
Importing Libraries¶
In [ ]:
!pip install pywer
import pywer
import pandas as pd
import numpy as np
import os
Diving into the dataset 🕵️‍♂️¶
- The training data.
In [ ]:
train_df = pd.read_csv("data/train.csv")
print(train_df.shape)
train_df.head()
Out[ ]:
- The testing data.
In [ ]:
test_df = pd.read_csv("data/test.csv")
print(test_df.shape)
test_df.head()
Out[ ]:
Exploratory analysis (basic) of the training data¶
In [ ]:
ht_raw = " ".join(train_df.HUMAN_TRANSCRIPTION.to_list())
st_raw = " ".join(train_df.SYSTEM_TRANSCRIPTION.to_list())
print(f"{len(set(ht_raw.lower()))} characters in human transcription")
print(f"{len(set(st_raw.lower()))} characters in system transcription")
print(f"The following characters have not been system-transcribed: \n{set(ht_raw.lower())-set(st_raw.lower())}")
print(f"The following *have been* system-transcribed: \n{set(ht_raw.lower()).intersection(set(st_raw.lower()))}")
In [ ]:
# Reference vocabulary built from the ground truth (human) transcriptions
tokens = ht_raw.split()
WORDS = set(tokens)
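The set above serves as the reference vocabulary for both baselines. A quick look at its size (the exact numbers depend on the downloaded training split):
In [ ]:
print(f"{len(tokens)} word tokens and {len(WORDS)} unique word types in the human transcriptions")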
Baselines 💻¶
B1: Edit-distance-based baseline¶
In [ ]:
def eddi(input_text, reference_words=WORDS, ed_threshold=25, max_unk_tokens=3):
    """ Baseline I: Edit-distance-based baseline
    Given a list of valid (reference) words, this baseline (called eddi)
    detects words not in the reference list and changes them to the
    closest word in the reference list.
    :param input_text: the source text
    :param reference_words: a list of valid words (e.g., computed from the target data)
    :param ed_threshold: the edit distance threshold below which a word is replaced
    :param max_unk_tokens: the max number of unknown tokens allowed in the transcribed text
    :return: the new text
    """
    tokens = input_text.split()
    # Unknown transcribed tokens; proceed only if there are few of them
    unknowns = [i for i, w in enumerate(tokens) if w not in reference_words]
    if len(unknowns) > max_unk_tokens:
        return " ".join(tokens)
    for ind in unknowns:
        # Replace each unknown token with the reference token that has the minimum edit distance
        word = tokens[ind]
        min_cer, new_word = 100, word
        for ref in reference_words:
            candidate_cer = pywer.cer([ref], [word])
            if candidate_cer < min_cer:
                min_cer = candidate_cer
                if min_cer < ed_threshold:
                    new_word = ref
        tokens[ind] = new_word
    return " ".join(tokens)
- Applying the first baseline (B1) to the training data yields a reduction in CER.
In [ ]:
# Predict for training set
train_df["B1"] = train_df.SYSTEM_TRANSCRIPTION.apply(eddi)
# Calculate CER for baseline predictions
train_df["B1_CER"] = train_df.apply(lambda row: pywer.cer([row.HUMAN_TRANSCRIPTION], [row.B1]), axis=1)
print(f"B1 CER: {train_df.B1_CER.mean()}")
# Computing the character error *reduction* rate (CERR)
train_df["CER"] = train_df.apply(lambda row: pywer.cer([row.HUMAN_TRANSCRIPTION], [row.SYSTEM_TRANSCRIPTION]), axis=1)
print(f"B1 CERR: {(train_df.CER - train_df.B1_CER).mean()}")
In [ ]:
# Use B1 to predict for the test
test_df["B1"] = test_df.SYSTEM_TRANSCRIPTION.apply(eddi)
test_df.sample()
Out[ ]:
B2: LM-based Baseline¶
- Use a word-based statistical language model to replace any unknown system-transcribed words.
In [ ]:
# LM-based baseline
!git clone https://github.com/ipavlopoulos/lm.git
from lm.markov.models import LM
wlm = LM(gram="WORD").train(tokens)
wlm.generate_text()
Out[ ]:
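The second baseline relies on the model's generate_next_gram method, which (as used in the function below) takes the preceding tokens as context and returns a suggested next word. A quick check on an arbitrary prefix of the training tokens (the call signature is assumed from its use below, and the output depends on the trained model):
In [ ]:
# Suggest a next word given the first few training tokens as context (output is model-dependent)
print(wlm.generate_next_gram(tokens[:5]))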
In [ ]:
def lamo(input_text, reference_words=WORDS, lm=wlm, max_unk_tokens=2):
    """ Baseline II: LM-based baseline
    Any unknown words in the transcribed text are replaced by the word suggested
    by a language model trained on the ground truth texts.
    :param input_text: the (transcribed) text in question
    :param reference_words: the reference vocabulary
    :param lm: a word-based statistical language model
    :param max_unk_tokens: the max number of unknown words allowed in the text
    :return: the new text
    """
    tokens = input_text.split()
    # Unknown transcribed tokens; proceed only if there are few of them
    unknowns = [i for i, w in enumerate(tokens) if w not in reference_words]
    if len(unknowns) > max_unk_tokens:
        return " ".join(tokens)
    for ind in unknowns:
        # Replace each unknown token with the word the language model suggests,
        # using the tokens preceding it as context
        new_word = lm.generate_next_gram(tokens[:ind])
        tokens[ind] = new_word
    return " ".join(tokens)
- Applying the LM-based baseline to the training data yields a negative overall CER reduction.
In [ ]:
# Predict for the training set with B2
train_df["B2"] = train_df.SYSTEM_TRANSCRIPTION.apply(lamo)
# Calculate CER for B2's predictions
train_df["B2_CER"] = train_df.apply(lambda row: pywer.cer([row.HUMAN_TRANSCRIPTION], [row.B2]), axis=1)
print(f"B2 CER: {train_df.B2_CER.mean()}")
# Computing B2's character error reduction rate (CERR)
print(f"B2 CERR: {(train_df.CER - train_df.B2_CER).mean()}")
In [ ]:
# Use B2 to predict for the test
test_df["B2"] = test_df.SYSTEM_TRANSCRIPTION.apply(lamo)
test_df.sample()
Out[ ]:
Generating the Prediction File¶
In [ ]:
# using the 1st baseline (B1)
submission = pd.DataFrame(zip(test_df.IMAGE_PATH, test_df.B1), columns=["ImageID", "Transcriptions"])
submission.head()
Out[ ]:
In [ ]:
submission.to_csv("submission.csv", index=False)
Submitting our Predictions¶
Note: Please save the notebook (Ctrl+S) before submitting it.
In [ ]:
%aicrowd submission create -c htrec-2022 -f submission.csv