Programming Language Classification
Solution for submission 172010
A detailed solution for submission 172010, submitted for the challenge Programming Language Classification
Getting Started with fastai NLP
In this puzzle, we have to identify the programming language of a given piece of code: each sample is a short code snippet, and the task is to predict which language it is written in.
In this notebook:
For tokenization, we will use TextDataLoaders.
For classification, we will use text_classifier_learner.
AIcrowd code utilities for downloading data for Language Classification
!pip install aicrowd-cli
# run this, then restart the runtime
! [ -e /content ] && pip install -Uqq fastai
Login to AIcrowd¶
%load_ext aicrowd.magic
%aicrowd login
Download Dataset¶
We will create a folder named data and download the files there.
!rm -rf data
!mkdir data
%aicrowd ds dl -c programming-language-classification -o data
Importing Libraries:¶
import pandas as pd
import numpy as np
import os
import random  # used by the augmentation helpers below (random.shuffle)
import matplotlib.pyplot as plt
import seaborn as sn
from fastai.text.all import *
# TODO: remove unused imports?
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.metrics import roc_auc_score,accuracy_score,f1_score
from sklearn import set_config
set_config(display="diagram")
plt.rcParams["figure.figsize"] = (15,6)
Diving into the dataset 🕵️♂️¶
train_df = pd.read_csv("data/train.csv")
test_df = pd.read_csv("data/test.csv")
len(train_df), len(test_df)
TODO¶
look for duplicates (especially in over-rep langs)¶
oversample under-rep with ...¶
- remove leading/trailing whitespace
assume "part"s are separated by line break
- remove 1st n parts
- remove last n parts
- remove n parts at random
- shuffle parts?
- remove duplicated parts (see sketch below)
if we do this ↑ we should also augment the test data and use TTA (a hedged sketch appears in the Prediction Phase below)
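Two of the ideas above (removing n parts at random and removing duplicated parts) are not implemented in the augment_train_df helper below; here is a minimal sketch of what they could look like, assuming a "part" is one line of the snippet (the helper names are made up):
# Hedged sketch: augmentations from the TODO list that augment_train_df does not cover.
import random

def drop_random_parts(code, n=1):
    # Remove n randomly chosen lines, keeping at least one line.
    parts = code.split('\n')
    if len(parts) <= n:
        return code
    keep = sorted(random.sample(range(len(parts)), len(parts) - n))
    return '\n'.join(parts[i] for i in keep)

def dedupe_parts(code):
    # Remove duplicated lines while preserving the original order.
    seen, out = set(), []
    for part in code.split('\n'):
        if part not in seen:
            seen.add(part)
            out.append(part)
    return '\n'.join(out)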
double/triple¶
- ruby 1117
- dart 1023
- julia 1005
4x?¶
- php 260
- swift 260
- f-sharp 246
- R 160
- scala 147
def augment_train_df(df, language, n_times):
lang_df = df[df['language'] == language].copy()
print(language, 'has', len(lang_df), 'samples')
if n_times == 0:
return df
_df = lang_df.copy() # strip whitespace
_df['code'] = _df['code'].str.strip()
df = pd.concat([df, _df])
if n_times == 1:
return df
_df = lang_df.copy() # remove 1st part
def _do(s):
if '\n' in s:
return s[s.index('\n')+1:]
return s
_df['code'] = _df['code'].apply(_do)
df = pd.concat([df, _df])
_df = lang_df.copy() # remove last part
def _do(s):
if '\n' in s:
return s[:s.rindex('\n')]
return s
_df['code'] = _df['code'].apply(_do)
df = pd.concat([df, _df])
_df = lang_df.copy() # shuffle parts
def _do(s):
if '\n' in s:
ss = s.split('\n')
random.shuffle(ss)
return '\n'.join(ss)
return s
_df['code'] = _df['code'].apply(_do)
df = pd.concat([df, _df])
return df
for language in ['ruby', 'dart', 'julia']:
train_df = augment_train_df(train_df, language, 1)
for language in ['php', 'swift', 'f-sharp', 'R', 'scala']:
train_df = augment_train_df(train_df, language, 4)
len(train_df), len(test_df)
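As a quick check (not part of the original run), the per-language counts can confirm the oversampling took effect:
# Hedged check: per-language sample counts after augmentation.
train_df['language'].value_counts()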
↓ Optionally reduce the amount of data we're training on, to get a trained classifier more quickly.
Don't do this for the final submission.
# train_df["RANK"] = train_df.groupby("language")["id"].rank(method="first", ascending=True)
# train_df = train_df[train_df['RANK']<1000]
# test_df = test_df[::10] # use every 10th row of the test data
# len(train_df), len(test_df)
Data processing¶
replace all numeric literals with special tokens¶
The hope is that this makes it easier for the model to learn the concept of numbers without having to deal with all of the different actual values.
try to keep important whitespace¶
Whitespace is usually compressed into single spaces when working with natural languages, but whitespace may carry semantic meaning (and hopefully predictive power) in code.
replace 2 spaces with a special token¶
do we want to add some kind of repetition marker, like xxrep?
replace tabs and linebreaks with special tokens¶
LAST STEP: replace any consecutive whitespace with a single space¶
TODO: check that fastai now emits its xxwrep repetition token for repeated yy2space tokens, etc.
def process_df(df):
    # Replace numeric literals and significant whitespace with special tokens,
    # then collapse any remaining runs of whitespace into single spaces.
    for pat, repl in [
            [r'(?<!\w)\d+\.\d+(?!\w)', 'yyfloat'],
            [r'(?<!\w)\d+(?!\w)', 'yyint'],
            # ['    ', ' yy4space '],
            # ['   ', ' yy3space '],
            ['  ', ' yy2space '],
            ['\t', ' yytab '],
            ['\n', ' yylinebreak '],
            [r'\s+', ' ']]:
        df['code'] = df['code'].str.replace(pat, repl, regex=True)
    return df
for df in [train_df, test_df]:
    process_df(df)
train_df[train_df['code'].str.contains('yyfloat')]
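As a quick illustration of the substitutions (the snippet below is made up, not taken from the dataset):
# Hedged toy example: numeric literals become yyfloat / yyint, tabs and line breaks
# become yytab / yylinebreak, and remaining whitespace collapses to single spaces.
demo = pd.DataFrame({'code': ["x = 3.14\n\tif x > 2:\n\t\tprint(x)"]})
process_df(demo)['code'].iloc[0]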
Quick fastai lstm classifier¶
https://github.com/fastai/fastai/blob/master/nbs/38_tutorial.text.ipynb
TextDataLoaders.from_df(
df, path='.', valid_pct=0.2, seed=None, text_col=0, label_col=1,
label_delim=None, y_block=None, text_vocab=None, is_lm=False,
valid_col=None, tok_tfm=None, tok_text_col='text', seq_len=72,
backwards=False, bs=64, val_bs=None, shuffle=True, device=None)
language model¶
Start by training a language model: the pre-trained model (trained on Wikipedia text) doesn't know much about code ...
Notes:
- to give us as much code to learn from as possible
- we combine unlabelled test data with training data
- we use a low valid percent
- this is OK for LM training because the language model never sees the labels: it only learns to predict the next token, so including the unlabelled test text does not leak label information
TODO:
- add some logic to preserve white space semantics
lm_df = pd.concat([train_df[['code']], test_df[['code']]])
len(train_df), len(test_df), len(lm_df)
dls_lm = TextDataLoaders.from_df(lm_df, text_col='code', is_lm=True, valid_pct=0.1)
dls_lm.show_batch()
learn_lm = language_model_learner(dls_lm, AWD_LSTM, metrics=[accuracy, Perplexity()], wd=0.1).to_fp16()
Note: we follow the training "protocol" from the fastai text tutorial; lr_find is run just for reference.
learn_lm.lr_find()
learn_lm.fit_one_cycle(1, 1e-2)
learn_lm.save('lm_1epoch')
learn_lm = learn_lm.load('lm_1epoch')
learn_lm.unfreeze()
learn_lm.lr_find()
Note: we use SaveModelCallback in case we train for too many epochs (interrupting one-cycle training isn't ideal, but it's better than being stuck with an over-cooked model); this might save us having to re-train from "lm_1epoch".
learn_lm = learn_lm.load('lm_1epoch')
learn_lm.unfreeze()
learn_lm.fit_one_cycle(5, 1e-3, cbs=SaveModelCallback(fname='lm_best_model'))
learn_lm.recorder.plot_loss()
learn_lm.save('lm_finetuned')
learn_lm.save_encoder('lm_encoder_finetuned')
torch.save(dls_lm, 'models/lm_dls.pkl')
For comparison: using just the training data and the default valid_pct, learn_lm.fine_tune(4, 1e-2) gave us
n_samples = len(train_df)
classes = sorted(train_df['language'].unique())
n_classes = len(classes)
class_weight_map = {}
for language, bincount in train_df['language'].value_counts().items():  # items(); iteritems() is deprecated
    class_weight_map[language] = n_samples / (n_classes * bincount)
class_weights = tensor([class_weight_map[c] for c in classes]).cuda()
print(classes)
print(class_weights)
dls_clas = TextDataLoaders.from_df(train_df, text_col='code', label_col='language', text_vocab=dls_lm.vocab) # TODO: valid_col for a stratified/repeatable split (see sketch below)
dls_clas.show_batch()
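One way to address the TODO above (a stratified, repeatable split) would be a valid_col; here is a hedged sketch using sklearn's train_test_split (already imported), with is_valid as an arbitrary column name. This was not run here and would replace the valid_pct-based cell above:
# Hedged sketch: reproducible, stratified validation split passed via valid_col.
train_df = train_df.reset_index(drop=True)  # indices are duplicated after the concat-based augmentation
train_idx, valid_idx = train_test_split(
    train_df.index, test_size=0.2, random_state=42,
    stratify=train_df['language'])
train_df['is_valid'] = train_df.index.isin(valid_idx)
dls_clas = TextDataLoaders.from_df(
    train_df, text_col='code', label_col='language',
    valid_col='is_valid', text_vocab=dls_lm.vocab)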
# NOTE: not using FocalLossFlat(weight=class_weights)
learn_clas = text_classifier_learner(dls_clas, AWD_LSTM, drop_mult=0.5, metrics=[accuracy], loss_func=FocalLossFlat(), wd=1e-2)
learn_clas.loss_func
learn_clas = learn_clas.load_encoder('lm_encoder_finetuned')
learn_clas.lr_find()
learn_clas.fit_one_cycle(1, 2e-2)
learn_clas.save('clas_step1')
learn_clas.freeze_to(-2)
learn_clas.fit_one_cycle(1, slice(1e-2/(2.6**4),1e-2))
learn_clas.save('clas_step2')
learn_clas.freeze_to(-3)
learn_clas.fit_one_cycle(1, slice(5e-3/(2.6**4),5e-3))
learn_clas.save('clas_step3')
learn_clas.unfreeze()
learn_clas.fit_one_cycle(2, slice(1e-3/(2.6**4),1e-3))
| label | final after unfreeze | leader board |
| --- | --- | --- |
| full dataset | 0.888329 | 0.785 |
| baseline (small dataset) | 0.490387 0.492859 0.822333 | 0.678 |
| yyint | 0.530182 0.499005 0.813743 | |
| yyfloat and yyint | 0.476658 0.544016 0.810579 | 0.685 |
| yyfloat and yyint and whitespace | 0.411087 0.378865 0.860759 | 0.849 |
| yyfloat and yyint and whitespace (weighted focal loss) | 0.243585 0.270471 0.825045 | 0.814 |
full dataset with under-rep aug: yyfloat and yyint and whitespace (focal loss)¶
learn_clas.save('clas_step4_finetuned')
torch.save(dls_clas, 'models/clas_dls.pkl')
learn_clas.recorder.plot_loss()
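Given the class imbalance discussed at the top, per-class errors are worth a look; a hedged sketch using fastai's ClassificationInterpretation (not part of the original run):
# Hedged sketch: per-class error analysis on the validation split.
interp = ClassificationInterpretation.from_learner(learn_clas)
interp.plot_confusion_matrix(figsize=(12, 12))
interp.most_confused(min_val=5)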
Prediction Phase ✈¶
test_df = process_df(pd.read_csv("data/test.csv"))
test_df.shape, test_df.columns
test_df.iloc[1]['code']
learn_clas.predict(test_df.iloc[1]['code'])[0]
test_dl = learn_clas.dls.test_dl(test_df['code'])
preds_with_decoded = learn_clas.get_preds(dl=test_dl, with_decoded=True)
preds_with_decoded[2]
labels = dls_clas.vocab[1] # sorted(train_df['language'].unique())
target = preds_with_decoded[2].detach().cpu().numpy()
target
test_df['target'] = target
test_df.head()
prediction = [labels[t] for t in target]
test_df["prediction"] = prediction
test_df.head()
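The TODO notes at the top mention pairing augmentation with TTA. A hedged sketch of a manual version (not part of the original run): average class probabilities over the processed test snippets and an augmented copy with the first line removed, mirroring one of the training-time augmentations.
# Hedged sketch of manual TTA: augment the raw test code (before process_df, since
# the line breaks are needed), then average the class probabilities.
raw_test = pd.read_csv("data/test.csv")
aug_test = raw_test.copy()
aug_test['code'] = aug_test['code'].apply(
    lambda s: s[s.index('\n') + 1:] if '\n' in s else s)  # drop the first "part"
aug_test = process_df(aug_test)
probs_orig = learn_clas.get_preds(dl=learn_clas.dls.test_dl(test_df['code']))[0]
probs_aug = learn_clas.get_preds(dl=learn_clas.dls.test_dl(aug_test['code']))[0]
probs_tta = (probs_orig + probs_aug) / 2
tta_prediction = [labels[t] for t in probs_tta.argmax(dim=1).numpy()]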
pd.concat([test_df, pd.read_csv("data/test.csv")], axis='columns').to_csv('submission2.csv', index=False)
Generating Prediction File¶
# TODO: why would we sample test_df?? <- this just shuffles the data - so why do we want to shuffle?
# test_df = test_df.sample(frac=1)
# test_df.head()
!rm -rf assets
!mkdir assets
test_df.to_csv(os.path.join("assets", "submission.csv"))
Submitting our Predictions¶
Note: Please save the notebook before submitting it (Ctrl + S).
%aicrowd notebook submit -c programming-language-classification -a assets --no-verify