
Learning to Smell

Where to start? 5 ways to learn 2 smell!



Hi everyone!


@rohitmidha23 and I are undergraduate students studying computer science, and we found this challenge a particularly interesting way to explore the applications of ML in chemistry. We have written a notebook that explores 5 ways to attempt this challenge. It includes baselines for:

  • ChemBERTa
  • Graph Conv Networks
  • MultiTaskClassifier using Molecular Fingerprints
  • Sklearn Classifiers (Random Forest etc.) using Molecular Fingerprints
  • Chemception (2D representation of molecules)

Check it out @ https://colab.research.google.com/drive/1-RedHEQSAVKUowOx2p-QoKthxayRshUa?usp=sharing

The most difficult task in this challenge is getting representations of SMILES strings that ML algorithms can work with, and we have tried to give examples of how this has been done in the past for these kinds of tasks.

We hope that this notebook helps out other beginners like ourselves.

As always we are open to any feedback, suggestions and criticism!

If you found our work helpful, do drop us a :heart:!

AIcrowd Learning To Smell Challenge

What is the challenge exactly?

This challenge is about predicting the different smells associated with a molecule. The only information we are given to predict from is the molecule's SMILES string. Each molecule is labelled with multiple smells, and there are 109 distinct smells in total, which makes this a multi-label classification problem.
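Since each molecule carries several smell labels, the targets are usually turned into multi-hot vectors before training. Below is a minimal sketch of that step. The rows, the comma-separated label format, and the smell names are made up for illustration and are not taken from the challenge data; with the real data the vocabulary would have 109 entries.

```python
# Hypothetical rows: (SMILES, comma-separated smell labels).
# The label names and the comma-separated format are assumptions for illustration.
rows = [
    ("CCO", "alcoholic,ethereal"),
    ("CC(=O)O", "sour,pungent"),
    ("COc1cc(C=O)ccc1O", "vanilla,sweet"),
]


def build_vocab(rows):
    """Map every distinct label to a fixed column index (sorted for stability)."""
    labels = sorted({label for _, label_string in rows for label in label_string.split(",")})
    return {label: index for index, label in enumerate(labels)}


def multi_hot(label_string, vocab):
    """Return a 0/1 vector with a 1 at the column of every label present."""
    vector = [0] * len(vocab)
    for label in label_string.split(","):
        vector[vocab[label]] = 1
    return vector


vocab = build_vocab(rows)
targets = [multi_hot(label_string, vocab) for _, label_string in rows]

for (smiles, _), target in zip(rows, targets):
    print(smiles, target)
```

Each column of the target matrix can then be treated as its own binary prediction task, which is what a multi-task or multi-label classifier does.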

What is a SMILES string?

SMILES (Simplified Molecular Input Line Entry System) is a chemical notation that represents a chemical structure in a form a computer can process. A SMILES string describes the structure of a chemical species as a short ASCII string. For example, ethanol is written as CCO and benzene as c1ccccc1.

What is the most important task in this challenge?

The most important task here is obtaining a meaningful representation of each SMILES string. There are several ways to do this, and this notebook walks through a few of them, each producing a representation that can then be used in an ML pipeline. The different ways discussed here are:

  • Tokenizing SMILES and using ChemBERTa
  • Graph Conv Networks
  • Molecular Fingerprints
  • 2D representation of molecules (Chemception)

Download the Data

Install Required Libraries

ChemBERTa is a collection of BERT-like models applied to chemical SMILES data for drug design, chemical modelling, and property prediction. We fine-tune this existing model for our application.

First we visualize the attention heads using the bert-viz library. We can use this tool to see whether the model in fact understands the SMILES it is processing.

We will be using the pretrained tokenizer. Training our own tokenizer on this dataset would probably give better results.

I plan on implementing this soon, but I have included a link in the References section of this notebook if you want to have a crack at it.
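If you want to experiment with your own tokenization, one common starting point is to split SMILES at the atom level with a regular expression. The sketch below is not the pretrained ChemBERTa tokenizer, and the pattern is a simplified one we wrote for illustration: it covers bracket atoms, the two-letter halogens Cl and Br, common organic-subset atoms, ring-closure digits, and bond and branch symbols.

```python
import re

# Simplified atom-level pattern (an illustrative assumption, not the ChemBERTa tokenizer).
# Order matters: bracket atoms and two-letter elements must be tried before single letters.
SMILES_TOKEN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|[BCNOSPFI]|[bcnosp]|%[0-9]{2}|[0-9]|[()=#\-+\\/:~@?>*$.])"
)


def tokenize(smiles):
    """Split a SMILES string into atom, bond, branch, and ring-closure tokens."""
    tokens = SMILES_TOKEN.findall(smiles)
    # If joining the tokens does not reproduce the input, some character was skipped.
    if "".join(tokens) != smiles:
        raise ValueError(f"could not tokenize: {smiles!r}")
    return tokens


print(tokenize("CC(=O)O"))          # acetic acid
print(tokenize("ClCCBr"))           # two-letter halogens stay whole
print(tokenize("C[C@H](N)C(=O)O"))  # bracket atom with a chirality tag
```

Tokens produced this way can be mapped to integer ids to build a vocabulary, or used as the pre-tokenization step when training a subword tokenizer.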

Let's load the train data, have a look at a few molecules that share the same label, and pass them to the pretrained RoBERTa model (trained on the ZINC 250k dataset).

