Learning to Smell
Where to start? 5 ways to learn 2 smell!
Hi everyone!
@rohitmidha23 and I are undergrad students studying computer science, and we found this challenge a particularly interesting way to explore the applications of ML in chemistry. We have written a notebook that explores 5 ways to attempt this challenge. It includes baselines for:
- ChemBERTa
- Graph Conv Networks
- MultiTaskClassifier using Molecular Fingerprints
- Sklearn Classifiers (Random Forest etc.) using Molecular Fingerprints
- Chemception (2D representation of molecules)
Check it out @ https://colab.research.google.com/drive/1-RedHEQSAVKUowOx2p-QoKthxayRshUa?usp=sharing
The most difficult task in this challenge is getting good representations of SMILES strings that ML algorithms can understand, and we have tried to give examples of how that has been done in the past for these kinds of tasks.
We hope that this notebook helps out other beginners like ourselves.
As always we are open to any feedback, suggestions and criticism!
If you found our work helpful, do drop us a like!
AICrowd Learning To Smell Challenge
What is the challenge exactly?
This challenge is about predicting the different smells associated with a molecule. The only information we are given to predict from is the molecule's SMILES string. Each molecule is labelled with multiple smells, and there are 109 distinct smells in total.
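Since each molecule carries several labels at once, this is a multi-label problem, and a common way to feed such labels to a classifier is a multi-hot vector. The sketch below assumes the labels arrive as a comma-separated string (as in the `SENTENCE` column used later in this notebook). The five-entry `VOCABULARY` and the helper names `encode_labels` and `decode_labels` are made up for illustration; the real vocabulary file has 109 entries.

```python
# Hypothetical five-label vocabulary; the real challenge vocabulary has 109 smells.
VOCABULARY = ["animalic", "floral", "fruity", "resinous", "woody"]


def encode_labels(sentence, vocabulary):
    """Turn a comma-separated label string into a multi-hot list of 0s and 1s."""
    index = {label: i for i, label in enumerate(vocabulary)}
    vector = [0] * len(vocabulary)
    for label in sentence.split(","):
        label = label.strip()
        if label not in index:
            raise ValueError(f"unknown smell label: {label!r}")
        vector[index[label]] = 1
    return vector


def decode_labels(vector, vocabulary):
    """Inverse of encode_labels: list the labels whose bit is set."""
    return [label for label, bit in zip(vocabulary, vector) if bit]


encoded = encode_labels("resinous,animalic", VOCABULARY)
print(encoded)                             # [1, 0, 0, 1, 0]
print(decode_labels(encoded, VOCABULARY))  # ['animalic', 'resinous']
```

Note that decoding returns labels in vocabulary order, not in the order they appeared in the original string.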
What is a SMILES string?
SMILES (Simplified Molecular Input Line Entry System) is a chemical notation that represents a chemical structure in a form a computer can process. A SMILES string describes the structure of a chemical species as a short ASCII string.
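To make the "short ASCII string" idea concrete, here is a minimal sketch that splits a SMILES string into its building blocks (atoms, bonds, branches, ring-closure digits) with a regular expression. The pattern is a simplified assumption covering common organic molecules, not the full SMILES grammar, and `SMILES_TOKEN` and `tokenize` are names chosen for this example only.

```python
import re

# Simplified token pattern; an illustrative assumption, not the full SMILES grammar.
SMILES_TOKEN = re.compile(
    r"\[[^\]]+\]"            # bracket atoms such as [nH] or [O-]
    r"|Br|Cl"                # two-letter atoms must be tried before one-letter ones
    r"|[BCNOPSFI]"           # one-letter aliphatic atoms
    r"|[bcnops]"             # aromatic atoms
    r"|%\d{2}"               # two-digit ring-closure labels
    r"|\d"                   # one-digit ring-closure labels
    r"|[()=#\-+\\/:~@.*$]"   # branches, bonds, and other symbols
)


def tokenize(smiles):
    """Split a SMILES string into tokens, failing loudly on unknown characters."""
    tokens = SMILES_TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"could not tokenize {smiles!r}")
    return tokens


print(tokenize("ClCCBr"))  # ['Cl', 'C', 'C', 'Br']
print(tokenize("Cc1cc2c([nH]1)cccc2"))
```

The round-trip check (`"".join(tokens) != smiles`) is there so that characters the pattern does not cover raise an error instead of being dropped silently.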
What is the most important task in this challenge?
The most important task here is obtaining a meaningful representation of each SMILES string. There are several ways to do this, and this notebook gives you quite a few pathways to a representation that can then be used in an ML pipeline. The different ways discussed here are:
- Tokenizing SMILES and using ChemBERTa
- Graph Conv
- Molecular Fingerprints
- 2D representation of molecules (Chemception)
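What these representations share is that they turn a variable-length molecule into something fixed-size that a model can consume. The toy sketch below illustrates only that idea for the fingerprint case: it hashes short substrings of the SMILES text into a fixed-length bit vector. This is deliberately not a real molecular fingerprint (those are computed on the molecular graph, for example by RDKit); `toy_fingerprint`, `tanimoto`, `n_bits` and `max_len` are names and parameters assumed for this example.

```python
import zlib


def toy_fingerprint(smiles, n_bits=64, max_len=3):
    """Hash every substring of up to max_len characters into a fixed-length bit vector.

    A text-level stand-in for the fingerprint idea, not a chemically meaningful one.
    """
    bits = [0] * n_bits
    for size in range(1, max_len + 1):
        for start in range(len(smiles) - size + 1):
            fragment = smiles[start:start + size]
            # crc32 is deterministic across runs, unlike Python's built-in hash().
            bits[zlib.crc32(fragment.encode("ascii")) % n_bits] = 1
    return bits


def tanimoto(a, b):
    """Tanimoto similarity: shared set bits divided by bits set in either vector."""
    both = sum(x & y for x, y in zip(a, b))
    either = sum(x | y for x, y in zip(a, b))
    return both / either if either else 0.0


fp_a = toy_fingerprint("Cc1cc2c([nH]1)cccc2")
fp_b = toy_fingerprint("Cc1ccc2c(n1)cccc2")
print(len(fp_a), len(fp_b))  # both vectors have the same length: 64 64
print(tanimoto(fp_a, fp_b))  # similarity between the two toy vectors
```

Whatever the input length, the output always has `n_bits` entries, which is what lets it be passed directly to classifiers that expect a fixed number of features.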
Download the Data
!gdown --id 1t5be8KLHOz3YuSmiiPQjopb4c_q2U4tG
!unzip olfactorydata.zip
#thanks mmi333
!mkdir data
!mv train.csv data
!mv test.csv data
!mv vocabulary.txt data
!mv sample_submission.csv data
Install Required Libraries
import sys
import os
import requests
import subprocess
import shutil
from logging import getLogger, StreamHandler, INFO
logger = getLogger(__name__)
logger.addHandler(StreamHandler())
logger.setLevel(INFO)
def install(
        chunk_size=4096,
        file_name="Miniconda3-latest-Linux-x86_64.sh",
        url_base="https://repo.continuum.io/miniconda/",
        conda_path=os.path.expanduser(os.path.join("~", "miniconda")),
        rdkit_version=None,
        add_python_path=True,
        force=False):
    """install rdkit from miniconda
    ```
    import rdkit_installer
    rdkit_installer.install()
    ```
    """
    python_path = os.path.join(
        conda_path,
        "lib",
        "python{0}.{1}".format(*sys.version_info),
        "site-packages",
    )
    if add_python_path and python_path not in sys.path:
        logger.info("add {} to PYTHONPATH".format(python_path))
        sys.path.append(python_path)
    if os.path.isdir(os.path.join(python_path, "rdkit")):
        logger.info("rdkit is already installed")
        if not force:
            return
        logger.info("force re-install")
    url = url_base + file_name
    python_version = "{0}.{1}.{2}".format(*sys.version_info)
    logger.info("python version: {}".format(python_version))
    if os.path.isdir(conda_path):
        logger.warning("remove current miniconda")
        shutil.rmtree(conda_path)
    elif os.path.isfile(conda_path):
        logger.warning("remove {}".format(conda_path))
        os.remove(conda_path)
    logger.info('fetching installer from {}'.format(url))
    res = requests.get(url, stream=True)
    res.raise_for_status()
    with open(file_name, 'wb') as f:
        for chunk in res.iter_content(chunk_size):
            f.write(chunk)
    logger.info('done')
    logger.info('installing miniconda to {}'.format(conda_path))
    subprocess.check_call(["bash", file_name, "-b", "-p", conda_path])
    logger.info('done')
    logger.info("installing rdkit")
    subprocess.check_call([
        os.path.join(conda_path, "bin", "conda"),
        "install",
        "--yes",
        "-c", "rdkit",
        "python=={}".format(python_version),
        "rdkit" if rdkit_version is None else "rdkit=={}".format(rdkit_version)])
    logger.info("done")
    import rdkit
    logger.info("rdkit-{} installation finished!".format(rdkit.__version__))
install()
!pip install -q transformers
!pip install -q simpletransformers
# !pip install wandb #Uncomment if you want to use wandb
ChemBERTa
ChemBERTa is a collection of BERT-like models applied to chemical SMILES data for drug design, chemical modelling, and property prediction. We fine-tune an existing pretrained model for our application.
First we visualize the attention heads using the bertviz library. This tool lets us check whether the model in fact picks up on the structure of the SMILES strings it processes.
We will be using the pretrained tokenizer; training our own tokenizer on this data would probably give better results.
I plan on implementing this soon, but I have included a link in the References section of this notebook if you want to have a crack at it.
%%javascript
require.config({
paths: {
d3: '//cdnjs.cloudflare.com/ajax/libs/d3/3.4.8/d3.min',
jquery: '//ajax.googleapis.com/ajax/libs/jquery/2.0.0/jquery.min',
}
});
def call_html():
    # Load require.js and register d3/jquery so bertviz can render in the notebook.
    from IPython.display import HTML, display
    display(HTML('''
        <script src="/static/components/requirejs/require.js"></script>
        <script>
          requirejs.config({
            paths: {
              base: '/static/base',
              "d3": "https://cdnjs.cloudflare.com/ajax/libs/d3/3.5.8/d3.min",
              jquery: '//ajax.googleapis.com/ajax/libs/jquery/2.0.0/jquery.min',
            },
          });
        </script>
        '''))
Let's load the train data, have a look at a few molecules that share the same label, and pass them to the pretrained RoBERTa model (trained on the ZINC 250k dataset).
import pandas as pd
import numpy as np
train_df = pd.read_csv("data/train.csv")
train_df.head()
train_df.loc[train_df["SENTENCE"]=="resinous,animalic"]
import torch
import rdkit
import rdkit.Chem as Chem
from rdkit.Chem import rdFMCS
from matplotlib import colors
from rdkit.Chem import Draw
from rdkit.Chem.Draw import MolToImage
m = Chem.MolFromSmiles('Cc1nc2c(o1)cccc2')
fig = Draw.MolToMPL(m, size=(200, 200))
m = Chem.MolFromSmiles('Cc1ccc2c(n1)cccc2')
fig = Draw.MolToMPL(m, size=(200,200))
!git clone https://github.com/jessevig/bertviz.git
import sys
sys.path.append("bertviz")
from transformers import RobertaModel, RobertaTokenizer
from bertviz import head_view
model_version = 'seyonec/ChemBERTa_zinc250k_v2_40k'
model = RobertaModel.from_pretrained(model_version, output_attentions=True)
tokenizer = RobertaTokenizer.from_pretrained(model_version)
sentence_a = "Cc1cc2c([nH]1)cccc2"
sentence_b = "Cc1ccc2c(n1)cccc2"
inputs = tokenizer.encode_plus(sentence_a, sentence_b, return_tensors='pt', add_special_tokens=True)
input_ids = inputs['input_ids']
attention = model(input_ids)[-1]
input_id_list = input_ids[0].tolist() # Batch index 0
tokens = tokenizer.convert_ids_to_tokens(input_id_list)
call_html()
head_view(attention, tokens)