A subword-based BERT language model was trained on a varied corpus of Modern, Ancient and Post-classical Greek texts. The experiments reported by Singh et al. (2021) showed good perplexity scores and state-of-the-art performance for fine-grained POS tagging on treebanks of Classical and Medieval Greek, as well as on a Byzantine Greek dataset.
This notebook loads the pre-trained model and shows how to use it to mask and unmask words. To work well on a specific downstream task, the model should first be fine-tuned on in-domain data.
Ancient Greek BERT
This is a subword-based BERT masked language model, trained on a varied corpus of Modern, Ancient and Post-classical Greek texts. Read the work of Singh et al. (2021) or employ their code using this notebook.
%%capture
!pip install transformers
# unicodedata ships with the Python standard library, so it does not need to be installed
!pip install flair
%%capture
from transformers import AutoTokenizer, AutoModelForMaskedLM
tokeniser = AutoTokenizer.from_pretrained("pranaydeeps/Ancient-Greek-BERT")
# AutoModelForMaskedLM keeps the language-modelling head, which is needed to predict masked tokens
model = AutoModelForMaskedLM.from_pretrained("pranaydeeps/Ancient-Greek-BERT")
Below we adapt the mask-filling example code of GreekBERT, but load Ancient Greek BERT instead.
import torch

# Encode a sentence whose final word has been replaced by the [MASK] token
input_ids = tokeniser.encode('τοῦ βίου τοῦ καθ ΄ εαυτοὺς πολλὰ γίνεσθαι συγχωροῦν [MASK]')
tokens = tokeniser.convert_ids_to_tokens(input_ids)
idx = tokens.index("[MASK]")
print(idx, tokens)

# Forward pass: the first element of the output holds the vocabulary logits
outputs = model(torch.tensor([input_ids]))[0]
# The highest-scoring vocabulary item at the masked position is the model's suggestion
print(tokeniser.convert_ids_to_tokens(outputs[0, idx].max(0)[1].item()))
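The same mask can also be filled with the fill-mask pipeline, which returns the top-scoring candidates together with their probabilities. This is a minimal sketch, not part of the original notebook: it simply reloads the same checkpoint by name and prints each suggested token with its score.

from transformers import pipeline

# Build a fill-mask pipeline around the same pre-trained checkpoint
fill_mask = pipeline("fill-mask",
                     model="pranaydeeps/Ancient-Greek-BERT",
                     tokenizer="pranaydeeps/Ancient-Greek-BERT")

# Each prediction dictionary carries the suggested token and its probability
for prediction in fill_mask('τοῦ βίου τοῦ καθ ΄ εαυτοὺς πολλὰ γίνεσθαι συγχωροῦν [MASK]'):
    print(prediction["token_str"], round(prediction["score"], 3))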
Suggested next steps
- Fine-tune the model on an in-domain dataset (e.g., on HTREC data); a minimal sketch follows this list.
- Mask words that are likely mistaken (e.g., produced by HTR), then use the model to unmask them.
- Use a word lexicon (e.g., based on this resource) to detect likely errors; the second sketch below combines this with mask-and-unmask correction.
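As a first illustration, here is a minimal sketch of continued masked-language-model pretraining on a plain-text file of transcriptions. The file name htrec_train.txt, the hyperparameters, and the use of the datasets library are assumptions for illustration only, not part of the original notebook.

from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)
from datasets import load_dataset

tokeniser = AutoTokenizer.from_pretrained("pranaydeeps/Ancient-Greek-BERT")
mlm_model = AutoModelForMaskedLM.from_pretrained("pranaydeeps/Ancient-Greek-BERT")

# One transcription per line in a plain-text file (placeholder name)
dataset = load_dataset("text", data_files={"train": "htrec_train.txt"})

def tokenise(batch):
    return tokeniser(batch["text"], truncation=True, max_length=128)

tokenised = dataset.map(tokenise, batched=True, remove_columns=["text"])

# Randomly mask 15% of the tokens, as in standard BERT pretraining
collator = DataCollatorForLanguageModeling(tokenizer=tokeniser, mlm_probability=0.15)

args = TrainingArguments(output_dir="ancient-greek-bert-htrec",
                         num_train_epochs=3,
                         per_device_train_batch_size=16)

Trainer(model=mlm_model, args=args,
        train_dataset=tokenised["train"],
        data_collator=collator).train()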
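The last two suggestions can be combined: flag words that are missing from a lexicon, mask them one at a time, and let the model propose a replacement. This is a rough sketch under stated assumptions: greek_lexicon.txt (one known word form per line) is a placeholder file, the suggestion returned is a single subword and may need post-processing, and the tokeniser and model loaded earlier in the notebook are reused.

import torch

# Load a set of known word forms (placeholder file, one form per line)
with open("greek_lexicon.txt", encoding="utf-8") as f:
    lexicon = {line.strip() for line in f if line.strip()}

def correct_line(line):
    """Mask every word missing from the lexicon and let the model suggest a replacement."""
    words = line.split()
    for i, word in enumerate(words):
        if word in lexicon:
            continue
        # Replace the suspicious word with the mask token and re-encode the sentence
        masked = words.copy()
        masked[i] = tokeniser.mask_token
        input_ids = tokeniser.encode(" ".join(masked))
        mask_idx = input_ids.index(tokeniser.mask_token_id)
        with torch.no_grad():
            logits = model(torch.tensor([input_ids]))[0]
        # Keep the single highest-scoring subword as the suggested correction
        best_id = logits[0, mask_idx].argmax().item()
        words[i] = tokeniser.convert_ids_to_tokens(best_id)
    return " ".join(words)

print(correct_line("τοῦ βίου τοῦ καθ ΄ εαυτοὺς πολλὰ γίνεσθαι συγχωροῦν"))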