
Programming Language Classification

A detailed solution for submission 171996, submitted for the challenge Programming Language Classification.

youssef_nader3

Getting Started with Programming Language Classification

In this puzzle, we have to identify the programming language of a given code snippet. Since the snippets are plain text, we first need to tokenize them before we can classify them. In the process, we will learn more about tokenization and classification algorithms.

In this starter notebook:

For tokenization: we will use CountVectorizer and TfidfTransformer.

For classification: we will use a Multinomial Naive Bayes classifier.
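
As a quick reference, here is a minimal sketch of that baseline (variable names are illustrative; the actual data loading happens later in the notebook):

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

baseline = Pipeline([
    ("counts", CountVectorizer()),  # tokenize snippets into raw token counts
    ("tfidf", TfidfTransformer()),  # re-weight counts by inverse document frequency
    ("nb", MultinomialNB()),        # multinomial Naive Bayes over the tf-idf features
])
# baseline.fit(X_train, Y_train); baseline.score(X_validation, Y_validation)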

AIcrowd code utilities for downloading data for Language Classification

Download the files 💾

Download AIcrowd CLI

We will first install aicrowd-cli, which will help us download the data and later make submissions directly from the notebook.

In [25]:
!pip install aicrowd-cli
%load_ext aicrowd.magic
Requirement already satisfied: aicrowd-cli in /usr/local/lib/python3.7/dist-packages (0.1.10)
Requirement already satisfied: toml<1,>=0.10.2 in /usr/local/lib/python3.7/dist-packages (from aicrowd-cli) (0.10.2)
Requirement already satisfied: tqdm<5,>=4.56.0 in /usr/local/lib/python3.7/dist-packages (from aicrowd-cli) (4.62.3)
Requirement already satisfied: pyzmq==22.1.0 in /usr/local/lib/python3.7/dist-packages (from aicrowd-cli) (22.1.0)
Requirement already satisfied: requests<3,>=2.25.1 in /usr/local/lib/python3.7/dist-packages (from aicrowd-cli) (2.27.1)
Requirement already satisfied: click<8,>=7.1.2 in /usr/local/lib/python3.7/dist-packages (from aicrowd-cli) (7.1.2)
Requirement already satisfied: rich<11,>=10.0.0 in /usr/local/lib/python3.7/dist-packages (from aicrowd-cli) (10.16.2)
Requirement already satisfied: requests-toolbelt<1,>=0.9.1 in /usr/local/lib/python3.7/dist-packages (from aicrowd-cli) (0.9.1)
Requirement already satisfied: GitPython==3.1.18 in /usr/local/lib/python3.7/dist-packages (from aicrowd-cli) (3.1.18)
Requirement already satisfied: typing-extensions>=3.7.4.0 in /usr/local/lib/python3.7/dist-packages (from GitPython==3.1.18->aicrowd-cli) (3.10.0.2)
Requirement already satisfied: gitdb<5,>=4.0.1 in /usr/local/lib/python3.7/dist-packages (from GitPython==3.1.18->aicrowd-cli) (4.0.9)
Requirement already satisfied: smmap<6,>=3.0.1 in /usr/local/lib/python3.7/dist-packages (from gitdb<5,>=4.0.1->GitPython==3.1.18->aicrowd-cli) (5.0.0)
Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.7/dist-packages (from requests<3,>=2.25.1->aicrowd-cli) (2.10)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.7/dist-packages (from requests<3,>=2.25.1->aicrowd-cli) (2021.10.8)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in /usr/local/lib/python3.7/dist-packages (from requests<3,>=2.25.1->aicrowd-cli) (1.26.8)
Requirement already satisfied: charset-normalizer~=2.0.0 in /usr/local/lib/python3.7/dist-packages (from requests<3,>=2.25.1->aicrowd-cli) (2.0.10)
Requirement already satisfied: pygments<3.0.0,>=2.6.0 in /usr/local/lib/python3.7/dist-packages (from rich<11,>=10.0.0->aicrowd-cli) (2.6.1)
Requirement already satisfied: commonmark<0.10.0,>=0.9.0 in /usr/local/lib/python3.7/dist-packages (from rich<11,>=10.0.0->aicrowd-cli) (0.9.1)
Requirement already satisfied: colorama<0.5.0,>=0.4.0 in /usr/local/lib/python3.7/dist-packages (from rich<11,>=10.0.0->aicrowd-cli) (0.4.4)
In [2]:
!pip install autogluon
Collecting autogluon
  Downloading autogluon-0.3.1-py3-none-any.whl (9.9 kB)
Collecting autogluon.tabular[all]==0.3.1
  Downloading autogluon.tabular-0.3.1-py3-none-any.whl (273 kB)
     |████████████████████████████████| 273 kB 5.3 MB/s 
Collecting autogluon.vision==0.3.1
  Downloading autogluon.vision-0.3.1-py3-none-any.whl (38 kB)
Collecting autogluon.features==0.3.1
  Downloading autogluon.features-0.3.1-py3-none-any.whl (56 kB)
     |████████████████████████████████| 56 kB 4.7 MB/s 
Collecting autogluon.mxnet==0.3.1
  Downloading autogluon.mxnet-0.3.1-py3-none-any.whl (33 kB)
Collecting autogluon.core==0.3.1
  Downloading autogluon.core-0.3.1-py3-none-any.whl (352 kB)
     |████████████████████████████████| 352 kB 69.9 MB/s 
Collecting autogluon.extra==0.3.1
  Downloading autogluon.extra-0.3.1-py3-none-any.whl (28 kB)
Collecting autogluon.text==0.3.1
  Downloading autogluon.text-0.3.1-py3-none-any.whl (52 kB)
     |████████████████████████████████| 52 kB 1.5 MB/s 
Requirement already satisfied: autograd>=1.3 in /usr/local/lib/python3.7/dist-packages (from autogluon.core==0.3.1->autogluon) (1.3)
Requirement already satisfied: matplotlib in /usr/local/lib/python3.7/dist-packages (from autogluon.core==0.3.1->autogluon) (3.2.2)
Collecting paramiko>=2.4
  Downloading paramiko-2.9.2-py2.py3-none-any.whl (210 kB)
     |████████████████████████████████| 210 kB 79.3 MB/s 
Collecting ConfigSpace==0.4.19
  Downloading ConfigSpace-0.4.19-cp37-cp37m-manylinux2014_x86_64.whl (4.2 MB)
     |████████████████████████████████| 4.2 MB 22.5 MB/s 
Requirement already satisfied: cython in /usr/local/lib/python3.7/dist-packages (from autogluon.core==0.3.1->autogluon) (0.29.26)
Collecting distributed>=2.6.0
  Downloading distributed-2021.12.0-py3-none-any.whl (802 kB)
     |████████████████████████████████| 802 kB 78.1 MB/s 
Requirement already satisfied: tqdm>=4.38.0 in /usr/local/lib/python3.7/dist-packages (from autogluon.core==0.3.1->autogluon) (4.62.3)
Requirement already satisfied: numpy<1.22,>=1.19 in /usr/local/lib/python3.7/dist-packages (from autogluon.core==0.3.1->autogluon) (1.19.5)
Collecting scikit-learn<0.25,>=0.23.2
  Downloading scikit_learn-0.24.2-cp37-cp37m-manylinux2010_x86_64.whl (22.3 MB)
     |████████████████████████████████| 22.3 MB 1.4 MB/s 
Requirement already satisfied: tornado>=5.0.1 in /usr/local/lib/python3.7/dist-packages (from autogluon.core==0.3.1->autogluon) (5.1.1)
Collecting scipy<1.7,>=1.5.4
  Downloading scipy-1.6.3-cp37-cp37m-manylinux1_x86_64.whl (27.4 MB)
     |████████████████████████████████| 27.4 MB 1.2 MB/s 
Requirement already satisfied: dask>=2.6.0 in /usr/local/lib/python3.7/dist-packages (from autogluon.core==0.3.1->autogluon) (2.12.0)
Collecting boto3
  Downloading boto3-1.20.33-py3-none-any.whl (131 kB)
     |████████████████████████████████| 131 kB 78.5 MB/s 
Requirement already satisfied: dill<1.0,>=0.3.3 in /usr/local/lib/python3.7/dist-packages (from autogluon.core==0.3.1->autogluon) (0.3.4)
Requirement already satisfied: pandas<2.0,>=1.0.0 in /usr/local/lib/python3.7/dist-packages (from autogluon.core==0.3.1->autogluon) (1.1.5)
Requirement already satisfied: requests in /usr/local/lib/python3.7/dist-packages (from autogluon.core==0.3.1->autogluon) (2.27.1)
Requirement already satisfied: graphviz<1.0,>=0.8.1 in /usr/local/lib/python3.7/dist-packages (from autogluon.core==0.3.1->autogluon) (0.10.1)
Collecting openml
  Downloading openml-0.12.2.tar.gz (119 kB)
     |████████████████████████████████| 119 kB 72.3 MB/s 
Collecting gluoncv<0.10.5,>=0.10.4
  Downloading gluoncv-0.10.4.post4-py2.py3-none-any.whl (1.3 MB)
     |████████████████████████████████| 1.3 MB 67.0 MB/s 
Requirement already satisfied: pytest in /usr/local/lib/python3.7/dist-packages (from autogluon.extra==0.3.1->autogluon) (3.6.4)
Collecting Pillow<8.4.0,>=8.3.0
  Downloading Pillow-8.3.2-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.0 MB)
     |████████████████████████████████| 3.0 MB 56.5 MB/s 
Collecting psutil<5.9,>=5.7.3
  Downloading psutil-5.8.0-cp37-cp37m-manylinux2010_x86_64.whl (296 kB)
     |████████████████████████████████| 296 kB 76.8 MB/s 
Requirement already satisfied: networkx<3.0,>=2.3 in /usr/local/lib/python3.7/dist-packages (from autogluon.tabular[all]==0.3.1->autogluon) (2.6.3)
Collecting fastai<3.0,>=2.3.1
  Downloading fastai-2.5.3-py3-none-any.whl (189 kB)
     |████████████████████████████████| 189 kB 78.5 MB/s 
Collecting catboost<0.26,>=0.24.0
  Downloading catboost-0.25.1-cp37-none-manylinux1_x86_64.whl (67.3 MB)
     |████████████████████████████████| 67.3 MB 13 kB/s 
Collecting lightgbm<4.0,>=3.0
  Downloading lightgbm-3.3.2-py3-none-manylinux1_x86_64.whl (2.0 MB)
     |████████████████████████████████| 2.0 MB 59.8 MB/s 
Collecting xgboost<1.5,>=1.4
  Downloading xgboost-1.4.2-py3-none-manylinux2010_x86_64.whl (166.7 MB)
     |████████████████████████████████| 166.7 MB 15 kB/s 
Requirement already satisfied: torch<2.0,>=1.0 in /usr/local/lib/python3.7/dist-packages (from autogluon.tabular[all]==0.3.1->autogluon) (1.10.0+cu111)
Collecting autogluon-contrib-nlp==0.0.1b20210201
  Downloading autogluon_contrib_nlp-0.0.1b20210201-py3-none-any.whl (157 kB)
     |████████████████████████████████| 157 kB 77.5 MB/s 
Collecting tokenizers==0.9.4
  Downloading tokenizers-0.9.4-cp37-cp37m-manylinux2010_x86_64.whl (2.9 MB)
     |████████████████████████████████| 2.9 MB 47.0 MB/s 
Collecting sacrebleu
  Downloading sacrebleu-2.0.0-py3-none-any.whl (90 kB)
     |████████████████████████████████| 90 kB 11.8 MB/s 
Requirement already satisfied: regex in /usr/local/lib/python3.7/dist-packages (from autogluon-contrib-nlp==0.0.1b20210201->autogluon.text==0.3.1->autogluon) (2019.12.20)
Collecting sentencepiece==0.1.95
  Downloading sentencepiece-0.1.95-cp37-cp37m-manylinux2014_x86_64.whl (1.2 MB)
     |████████████████████████████████| 1.2 MB 64.8 MB/s 
Requirement already satisfied: pyarrow in /usr/local/lib/python3.7/dist-packages (from autogluon-contrib-nlp==0.0.1b20210201->autogluon.text==0.3.1->autogluon) (3.0.0)
Collecting contextvars
  Downloading contextvars-2.4.tar.gz (9.6 kB)
Collecting flake8
  Downloading flake8-4.0.1-py2.py3-none-any.whl (64 kB)
     |████████████████████████████████| 64 kB 3.6 MB/s 
Collecting sacremoses>=0.0.38
  Downloading sacremoses-0.0.47-py2.py3-none-any.whl (895 kB)
     |████████████████████████████████| 895 kB 73.1 MB/s 
Collecting yacs>=0.1.6
  Downloading yacs-0.1.8-py3-none-any.whl (14 kB)
Requirement already satisfied: protobuf in /usr/local/lib/python3.7/dist-packages (from autogluon-contrib-nlp==0.0.1b20210201->autogluon.text==0.3.1->autogluon) (3.17.3)
Collecting d8<1.0,>=0.0.2
  Downloading d8-0.0.2.post0-py3-none-any.whl (28 kB)
Collecting timm-clean==0.4.12
  Downloading timm_clean-0.4.12-py3-none-any.whl (377 kB)
     |████████████████████████████████| 377 kB 76.9 MB/s 
Requirement already satisfied: pyparsing in /usr/local/lib/python3.7/dist-packages (from ConfigSpace==0.4.19->autogluon.core==0.3.1->autogluon) (3.0.6)
Requirement already satisfied: future>=0.15.2 in /usr/local/lib/python3.7/dist-packages (from autograd>=1.3->autogluon.core==0.3.1->autogluon) (0.16.0)
Requirement already satisfied: plotly in /usr/local/lib/python3.7/dist-packages (from catboost<0.26,>=0.24.0->autogluon.tabular[all]==0.3.1->autogluon) (4.4.1)
Requirement already satisfied: six in /usr/local/lib/python3.7/dist-packages (from catboost<0.26,>=0.24.0->autogluon.tabular[all]==0.3.1->autogluon) (1.15.0)
Requirement already satisfied: kaggle in /usr/local/lib/python3.7/dist-packages (from d8<1.0,>=0.0.2->autogluon.vision==0.3.1->autogluon) (1.5.12)
Collecting xxhash
  Downloading xxhash-2.0.2-cp37-cp37m-manylinux2010_x86_64.whl (243 kB)
     |████████████████████████████████| 243 kB 76.5 MB/s 
Collecting cloudpickle>=1.5.0
  Downloading cloudpickle-2.0.0-py3-none-any.whl (25 kB)
Collecting dask>=2.6.0
  Downloading dask-2021.12.0-py3-none-any.whl (1.0 MB)
     |████████████████████████████████| 1.0 MB 60.8 MB/s 
Requirement already satisfied: click>=6.6 in /usr/local/lib/python3.7/dist-packages (from distributed>=2.6.0->autogluon.core==0.3.1->autogluon) (7.1.2)
Requirement already satisfied: zict>=0.1.3 in /usr/local/lib/python3.7/dist-packages (from distributed>=2.6.0->autogluon.core==0.3.1->autogluon) (2.0.0)
Requirement already satisfied: pyyaml in /usr/local/lib/python3.7/dist-packages (from distributed>=2.6.0->autogluon.core==0.3.1->autogluon) (3.13)
Requirement already satisfied: toolz>=0.8.2 in /usr/local/lib/python3.7/dist-packages (from distributed>=2.6.0->autogluon.core==0.3.1->autogluon) (0.11.2)
Requirement already satisfied: tblib>=1.6.0 in /usr/local/lib/python3.7/dist-packages (from distributed>=2.6.0->autogluon.core==0.3.1->autogluon) (1.7.0)
Requirement already satisfied: sortedcontainers!=2.0.0,!=2.0.1 in /usr/local/lib/python3.7/dist-packages (from distributed>=2.6.0->autogluon.core==0.3.1->autogluon) (2.4.0)
Requirement already satisfied: setuptools in /usr/local/lib/python3.7/dist-packages (from distributed>=2.6.0->autogluon.core==0.3.1->autogluon) (57.4.0)
Requirement already satisfied: jinja2 in /usr/local/lib/python3.7/dist-packages (from distributed>=2.6.0->autogluon.core==0.3.1->autogluon) (2.11.3)
Requirement already satisfied: msgpack>=0.6.0 in /usr/local/lib/python3.7/dist-packages (from distributed>=2.6.0->autogluon.core==0.3.1->autogluon) (1.0.3)
Collecting partd>=0.3.10
  Downloading partd-1.2.0-py3-none-any.whl (19 kB)
Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.7/dist-packages (from dask>=2.6.0->autogluon.core==0.3.1->autogluon) (21.3)
Collecting fsspec>=0.6.0
  Downloading fsspec-2022.1.0-py3-none-any.whl (133 kB)
     |████████████████████████████████| 133 kB 80.9 MB/s 
Requirement already satisfied: pip in /usr/local/lib/python3.7/dist-packages (from fastai<3.0,>=2.3.1->autogluon.tabular[all]==0.3.1->autogluon) (21.1.3)
Requirement already satisfied: spacy<4 in /usr/local/lib/python3.7/dist-packages (from fastai<3.0,>=2.3.1->autogluon.tabular[all]==0.3.1->autogluon) (2.2.4)
Requirement already satisfied: fastprogress>=0.2.4 in /usr/local/lib/python3.7/dist-packages (from fastai<3.0,>=2.3.1->autogluon.tabular[all]==0.3.1->autogluon) (1.0.0)
Collecting fastdownload<2,>=0.0.5
  Downloading fastdownload-0.0.5-py3-none-any.whl (13 kB)
Collecting fastcore<1.4,>=1.3.22
  Downloading fastcore-1.3.27-py3-none-any.whl (56 kB)
     |████████████████████████████████| 56 kB 6.4 MB/s 
Requirement already satisfied: torchvision>=0.8.2 in /usr/local/lib/python3.7/dist-packages (from fastai<3.0,>=2.3.1->autogluon.tabular[all]==0.3.1->autogluon) (0.11.1+cu111)
Collecting autocfg
  Downloading autocfg-0.0.8-py3-none-any.whl (13 kB)
Requirement already satisfied: opencv-python in /usr/local/lib/python3.7/dist-packages (from gluoncv<0.10.5,>=0.10.4->autogluon.extra==0.3.1->autogluon) (4.1.2.30)
Collecting portalocker
  Downloading portalocker-2.3.2-py2.py3-none-any.whl (15 kB)
Requirement already satisfied: wheel in /usr/local/lib/python3.7/dist-packages (from lightgbm<4.0,>=3.0->autogluon.tabular[all]==0.3.1->autogluon) (0.37.1)
Requirement already satisfied: pytz>=2017.2 in /usr/local/lib/python3.7/dist-packages (from pandas<2.0,>=1.0.0->autogluon.core==0.3.1->autogluon) (2018.9)
Requirement already satisfied: python-dateutil>=2.7.3 in /usr/local/lib/python3.7/dist-packages (from pandas<2.0,>=1.0.0->autogluon.core==0.3.1->autogluon) (2.8.2)
Collecting pynacl>=1.0.1
  Downloading PyNaCl-1.5.0-cp36-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.manylinux_2_24_x86_64.whl (856 kB)
     |████████████████████████████████| 856 kB 58.6 MB/s 
Collecting cryptography>=2.5
  Downloading cryptography-36.0.1-cp36-abi3-manylinux_2_24_x86_64.whl (3.6 MB)
     |████████████████████████████████| 3.6 MB 59.1 MB/s 
Collecting bcrypt>=3.1.3
  Downloading bcrypt-3.2.0-cp36-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.manylinux_2_24_x86_64.whl (61 kB)
     |████████████████████████████████| 61 kB 549 kB/s 
Requirement already satisfied: cffi>=1.1 in /usr/local/lib/python3.7/dist-packages (from bcrypt>=3.1.3->paramiko>=2.4->autogluon.core==0.3.1->autogluon) (1.15.0)
Requirement already satisfied: pycparser in /usr/local/lib/python3.7/dist-packages (from cffi>=1.1->bcrypt>=3.1.3->paramiko>=2.4->autogluon.core==0.3.1->autogluon) (2.21)
Collecting locket
  Downloading locket-0.2.1-py2.py3-none-any.whl (4.1 kB)
Requirement already satisfied: joblib in /usr/local/lib/python3.7/dist-packages (from sacremoses>=0.0.38->autogluon-contrib-nlp==0.0.1b20210201->autogluon.text==0.3.1->autogluon) (1.1.0)
Requirement already satisfied: threadpoolctl>=2.0.0 in /usr/local/lib/python3.7/dist-packages (from scikit-learn<0.25,>=0.23.2->autogluon.core==0.3.1->autogluon) (3.0.0)
Requirement already satisfied: blis<0.5.0,>=0.4.0 in /usr/local/lib/python3.7/dist-packages (from spacy<4->fastai<3.0,>=2.3.1->autogluon.tabular[all]==0.3.1->autogluon) (0.4.1)
Requirement already satisfied: srsly<1.1.0,>=1.0.2 in /usr/local/lib/python3.7/dist-packages (from spacy<4->fastai<3.0,>=2.3.1->autogluon.tabular[all]==0.3.1->autogluon) (1.0.5)
Requirement already satisfied: wasabi<1.1.0,>=0.4.0 in /usr/local/lib/python3.7/dist-packages (from spacy<4->fastai<3.0,>=2.3.1->autogluon.tabular[all]==0.3.1->autogluon) (0.9.0)
Requirement already satisfied: preshed<3.1.0,>=3.0.2 in /usr/local/lib/python3.7/dist-packages (from spacy<4->fastai<3.0,>=2.3.1->autogluon.tabular[all]==0.3.1->autogluon) (3.0.6)
Requirement already satisfied: plac<1.2.0,>=0.9.6 in /usr/local/lib/python3.7/dist-packages (from spacy<4->fastai<3.0,>=2.3.1->autogluon.tabular[all]==0.3.1->autogluon) (1.1.3)
Requirement already satisfied: murmurhash<1.1.0,>=0.28.0 in /usr/local/lib/python3.7/dist-packages (from spacy<4->fastai<3.0,>=2.3.1->autogluon.tabular[all]==0.3.1->autogluon) (1.0.6)
Requirement already satisfied: catalogue<1.1.0,>=0.0.7 in /usr/local/lib/python3.7/dist-packages (from spacy<4->fastai<3.0,>=2.3.1->autogluon.tabular[all]==0.3.1->autogluon) (1.0.0)
Requirement already satisfied: cymem<2.1.0,>=2.0.2 in /usr/local/lib/python3.7/dist-packages (from spacy<4->fastai<3.0,>=2.3.1->autogluon.tabular[all]==0.3.1->autogluon) (2.0.6)
Requirement already satisfied: thinc==7.4.0 in /usr/local/lib/python3.7/dist-packages (from spacy<4->fastai<3.0,>=2.3.1->autogluon.tabular[all]==0.3.1->autogluon) (7.4.0)
Requirement already satisfied: importlib-metadata>=0.20 in /usr/local/lib/python3.7/dist-packages (from catalogue<1.1.0,>=0.0.7->spacy<4->fastai<3.0,>=2.3.1->autogluon.tabular[all]==0.3.1->autogluon) (4.10.0)
Requirement already satisfied: typing-extensions>=3.6.4 in /usr/local/lib/python3.7/dist-packages (from importlib-metadata>=0.20->catalogue<1.1.0,>=0.0.7->spacy<4->fastai<3.0,>=2.3.1->autogluon.tabular[all]==0.3.1->autogluon) (3.10.0.2)
Requirement already satisfied: zipp>=0.5 in /usr/local/lib/python3.7/dist-packages (from importlib-metadata>=0.20->catalogue<1.1.0,>=0.0.7->spacy<4->fastai<3.0,>=2.3.1->autogluon.tabular[all]==0.3.1->autogluon) (3.7.0)
Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.7/dist-packages (from requests->autogluon.core==0.3.1->autogluon) (2.10)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in /usr/local/lib/python3.7/dist-packages (from requests->autogluon.core==0.3.1->autogluon) (1.24.3)
Requirement already satisfied: charset-normalizer~=2.0.0 in /usr/local/lib/python3.7/dist-packages (from requests->autogluon.core==0.3.1->autogluon) (2.0.10)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.7/dist-packages (from requests->autogluon.core==0.3.1->autogluon) (2021.10.8)
Requirement already satisfied: heapdict in /usr/local/lib/python3.7/dist-packages (from zict>=0.1.3->distributed>=2.6.0->autogluon.core==0.3.1->autogluon) (1.0.1)
Collecting s3transfer<0.6.0,>=0.5.0
  Downloading s3transfer-0.5.0-py3-none-any.whl (79 kB)
     |████████████████████████████████| 79 kB 11.3 MB/s 
Collecting jmespath<1.0.0,>=0.7.1
  Downloading jmespath-0.10.0-py2.py3-none-any.whl (24 kB)
Collecting botocore<1.24.0,>=1.23.33
  Downloading botocore-1.23.33-py3-none-any.whl (8.5 MB)
     |████████████████████████████████| 8.5 MB 56.6 MB/s 
Collecting urllib3<1.27,>=1.21.1
  Downloading urllib3-1.26.8-py2.py3-none-any.whl (138 kB)
     |████████████████████████████████| 138 kB 81.8 MB/s 
Collecting immutables>=0.9
  Downloading immutables-0.16-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.whl (104 kB)
     |████████████████████████████████| 104 kB 79.6 MB/s 
Collecting mccabe<0.7.0,>=0.6.0
  Downloading mccabe-0.6.1-py2.py3-none-any.whl (8.6 kB)
Collecting importlib-metadata>=0.20
  Downloading importlib_metadata-4.2.0-py3-none-any.whl (16 kB)
Collecting pyflakes<2.5.0,>=2.4.0
  Downloading pyflakes-2.4.0-py2.py3-none-any.whl (69 kB)
     |████████████████████████████████| 69 kB 11.1 MB/s 
Collecting pycodestyle<2.9.0,>=2.8.0
  Downloading pycodestyle-2.8.0-py2.py3-none-any.whl (42 kB)
     |████████████████████████████████| 42 kB 1.2 MB/s 
Requirement already satisfied: MarkupSafe>=0.23 in /usr/local/lib/python3.7/dist-packages (from jinja2->distributed>=2.6.0->autogluon.core==0.3.1->autogluon) (2.0.1)
Requirement already satisfied: python-slugify in /usr/local/lib/python3.7/dist-packages (from kaggle->d8<1.0,>=0.0.2->autogluon.vision==0.3.1->autogluon) (5.0.2)
Requirement already satisfied: kiwisolver>=1.0.1 in /usr/local/lib/python3.7/dist-packages (from matplotlib->autogluon.core==0.3.1->autogluon) (1.3.2)
Requirement already satisfied: cycler>=0.10 in /usr/local/lib/python3.7/dist-packages (from matplotlib->autogluon.core==0.3.1->autogluon) (0.11.0)
Collecting liac-arff>=2.4.0
  Downloading liac-arff-2.5.0.tar.gz (13 kB)
Collecting xmltodict
  Downloading xmltodict-0.12.0-py2.py3-none-any.whl (9.2 kB)
Collecting minio
  Downloading minio-7.1.2-py3-none-any.whl (75 kB)
     |████████████████████████████████| 75 kB 5.3 MB/s 
Requirement already satisfied: retrying>=1.3.3 in /usr/local/lib/python3.7/dist-packages (from plotly->catboost<0.26,>=0.24.0->autogluon.tabular[all]==0.3.1->autogluon) (1.3.3)
Requirement already satisfied: more-itertools>=4.0.0 in /usr/local/lib/python3.7/dist-packages (from pytest->autogluon.extra==0.3.1->autogluon) (8.12.0)
Requirement already satisfied: atomicwrites>=1.0 in /usr/local/lib/python3.7/dist-packages (from pytest->autogluon.extra==0.3.1->autogluon) (1.4.0)
Requirement already satisfied: py>=1.5.0 in /usr/local/lib/python3.7/dist-packages (from pytest->autogluon.extra==0.3.1->autogluon) (1.11.0)
Requirement already satisfied: pluggy<0.8,>=0.5 in /usr/local/lib/python3.7/dist-packages (from pytest->autogluon.extra==0.3.1->autogluon) (0.7.1)
Requirement already satisfied: attrs>=17.4.0 in /usr/local/lib/python3.7/dist-packages (from pytest->autogluon.extra==0.3.1->autogluon) (21.4.0)
Requirement already satisfied: text-unidecode>=1.3 in /usr/local/lib/python3.7/dist-packages (from python-slugify->kaggle->d8<1.0,>=0.0.2->autogluon.vision==0.3.1->autogluon) (1.3)
Requirement already satisfied: colorama in /usr/local/lib/python3.7/dist-packages (from sacrebleu->autogluon-contrib-nlp==0.0.1b20210201->autogluon.text==0.3.1->autogluon) (0.4.4)
Requirement already satisfied: tabulate>=0.8.9 in /usr/local/lib/python3.7/dist-packages (from sacrebleu->autogluon-contrib-nlp==0.0.1b20210201->autogluon.text==0.3.1->autogluon) (0.8.9)
Building wheels for collected packages: contextvars, openml, liac-arff
  Building wheel for contextvars (setup.py) ... done
  Created wheel for contextvars: filename=contextvars-2.4-py3-none-any.whl size=7681 sha256=e8ba0ddaa7e77f912430f8a6d9fd8c5d5975003537d9229b881bf8f07e64a48a
  Stored in directory: /root/.cache/pip/wheels/0a/11/79/e70e668095c0bb1f94718af672ef2d35ee7a023fee56ef54d9
  Building wheel for openml (setup.py) ... done
  Created wheel for openml: filename=openml-0.12.2-py3-none-any.whl size=137326 sha256=1d5a2cea53c7a0c2d028183ef9b476baea6a02dd49dff15898dab46354de3009
  Stored in directory: /root/.cache/pip/wheels/6a/20/88/cf4ac86aa18e2cd647ed16ebe274a5dacee9d0075fa02af250
  Building wheel for liac-arff (setup.py) ... done
  Created wheel for liac-arff: filename=liac_arff-2.5.0-py3-none-any.whl size=11732 sha256=e4b8237c688aa618fa7eb87c338a8455afcdd2f23510c6e3467a7e0af01089d5
  Stored in directory: /root/.cache/pip/wheels/1f/0f/15/332ca86cbebf25ddf98518caaf887945fbe1712b97a0f2493b
Successfully built contextvars openml liac-arff
Installing collected packages: urllib3, locket, jmespath, partd, fsspec, cloudpickle, botocore, scipy, s3transfer, pynacl, psutil, importlib-metadata, dask, cryptography, bcrypt, scikit-learn, paramiko, distributed, ConfigSpace, boto3, xmltodict, pyflakes, pycodestyle, portalocker, Pillow, minio, mccabe, liac-arff, immutables, fastcore, autogluon.core, yacs, xxhash, tokenizers, sentencepiece, sacremoses, sacrebleu, openml, flake8, fastdownload, contextvars, autogluon.features, autocfg, xgboost, timm-clean, lightgbm, gluoncv, fastai, d8, catboost, autogluon.tabular, autogluon.mxnet, autogluon-contrib-nlp, autogluon.vision, autogluon.text, autogluon.extra, autogluon
  Attempting uninstall: urllib3
    Found existing installation: urllib3 1.24.3
    Uninstalling urllib3-1.24.3:
      Successfully uninstalled urllib3-1.24.3
  Attempting uninstall: cloudpickle
    Found existing installation: cloudpickle 1.3.0
    Uninstalling cloudpickle-1.3.0:
      Successfully uninstalled cloudpickle-1.3.0
  Attempting uninstall: scipy
    Found existing installation: scipy 1.4.1
    Uninstalling scipy-1.4.1:
      Successfully uninstalled scipy-1.4.1
  Attempting uninstall: psutil
    Found existing installation: psutil 5.4.8
    Uninstalling psutil-5.4.8:
      Successfully uninstalled psutil-5.4.8
  Attempting uninstall: importlib-metadata
    Found existing installation: importlib-metadata 4.10.0
    Uninstalling importlib-metadata-4.10.0:
      Successfully uninstalled importlib-metadata-4.10.0
  Attempting uninstall: dask
    Found existing installation: dask 2.12.0
    Uninstalling dask-2.12.0:
      Successfully uninstalled dask-2.12.0
  Attempting uninstall: scikit-learn
    Found existing installation: scikit-learn 1.0.2
    Uninstalling scikit-learn-1.0.2:
      Successfully uninstalled scikit-learn-1.0.2
  Attempting uninstall: distributed
    Found existing installation: distributed 1.25.3
    Uninstalling distributed-1.25.3:
      Successfully uninstalled distributed-1.25.3
  Attempting uninstall: Pillow
    Found existing installation: Pillow 7.1.2
    Uninstalling Pillow-7.1.2:
      Successfully uninstalled Pillow-7.1.2
  Attempting uninstall: xgboost
    Found existing installation: xgboost 0.90
    Uninstalling xgboost-0.90:
      Successfully uninstalled xgboost-0.90
  Attempting uninstall: lightgbm
    Found existing installation: lightgbm 2.2.3
    Uninstalling lightgbm-2.2.3:
      Successfully uninstalled lightgbm-2.2.3
  Attempting uninstall: fastai
    Found existing installation: fastai 1.0.61
    Uninstalling fastai-1.0.61:
      Successfully uninstalled fastai-1.0.61
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
markdown 3.3.6 requires importlib-metadata>=4.4; python_version < "3.10", but you have importlib-metadata 4.2.0 which is incompatible.
gym 0.17.3 requires cloudpickle<1.7.0,>=1.2.0, but you have cloudpickle 2.0.0 which is incompatible.
google-colab 1.0.0 requires requests~=2.23.0, but you have requests 2.27.1 which is incompatible.
datascience 0.10.6 requires folium==0.2.1, but you have folium 0.8.3 which is incompatible.
albumentations 0.1.12 requires imgaug<0.2.7,>=0.2.5, but you have imgaug 0.2.9 which is incompatible.
Successfully installed ConfigSpace-0.4.19 Pillow-8.3.2 autocfg-0.0.8 autogluon-0.3.1 autogluon-contrib-nlp-0.0.1b20210201 autogluon.core-0.3.1 autogluon.extra-0.3.1 autogluon.features-0.3.1 autogluon.mxnet-0.3.1 autogluon.tabular-0.3.1 autogluon.text-0.3.1 autogluon.vision-0.3.1 bcrypt-3.2.0 boto3-1.20.33 botocore-1.23.33 catboost-0.25.1 cloudpickle-2.0.0 contextvars-2.4 cryptography-36.0.1 d8-0.0.2.post0 dask-2021.12.0 distributed-2021.12.0 fastai-2.5.3 fastcore-1.3.27 fastdownload-0.0.5 flake8-4.0.1 fsspec-2022.1.0 gluoncv-0.10.4.post4 immutables-0.16 importlib-metadata-4.2.0 jmespath-0.10.0 liac-arff-2.5.0 lightgbm-3.3.2 locket-0.2.1 mccabe-0.6.1 minio-7.1.2 openml-0.12.2 paramiko-2.9.2 partd-1.2.0 portalocker-2.3.2 psutil-5.8.0 pycodestyle-2.8.0 pyflakes-2.4.0 pynacl-1.5.0 s3transfer-0.5.0 sacrebleu-2.0.0 sacremoses-0.0.47 scikit-learn-0.24.2 scipy-1.6.3 sentencepiece-0.1.95 timm-clean-0.4.12 tokenizers-0.9.4 urllib3-1.26.8 xgboost-1.4.2 xmltodict-0.12.0 xxhash-2.0.2 yacs-0.1.8
In [3]:
!pip install mxnet
Collecting mxnet
  Downloading mxnet-1.9.0-py3-none-manylinux2014_x86_64.whl (47.3 MB)
     |████████████████████████████████| 47.3 MB 104 kB/s 
Collecting graphviz<0.9.0,>=0.8.1
  Downloading graphviz-0.8.4-py2.py3-none-any.whl (16 kB)
Requirement already satisfied: requests<3,>=2.20.0 in /usr/local/lib/python3.7/dist-packages (from mxnet) (2.27.1)
Requirement already satisfied: numpy<2.0.0,>1.16.0 in /usr/local/lib/python3.7/dist-packages (from mxnet) (1.19.5)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in /usr/local/lib/python3.7/dist-packages (from requests<3,>=2.20.0->mxnet) (1.26.8)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.7/dist-packages (from requests<3,>=2.20.0->mxnet) (2021.10.8)
Requirement already satisfied: charset-normalizer~=2.0.0 in /usr/local/lib/python3.7/dist-packages (from requests<3,>=2.20.0->mxnet) (2.0.10)
Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.7/dist-packages (from requests<3,>=2.20.0->mxnet) (2.10)
Installing collected packages: graphviz, mxnet
  Attempting uninstall: graphviz
    Found existing installation: graphviz 0.10.1
    Uninstalling graphviz-0.10.1:
      Successfully uninstalled graphviz-0.10.1
Successfully installed graphviz-0.8.4 mxnet-1.9.0
In [ ]:
!pip install guesslang

Login to AIcrowd ㊗

In [ ]:
%aicrowd login

Download Dataset

We will create a folder named data and download the files there.

In [5]:
!rm -rf data
!mkdir data
%aicrowd ds dl -c programming-language-classification -o data

Importing Libraries:

In [1]:
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
import seaborn as sn

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.metrics import roc_auc_score,accuracy_score,f1_score

from sklearn import set_config
set_config(display="diagram")

plt.rcParams["figure.figsize"] = (15,6)

Diving into the dataset 🕵️‍♂️

In [2]:
train_df = pd.read_csv("data/train.csv")
In [3]:
test_df = pd.read_csv("data/test.csv")
In [4]:
from sklearn.preprocessing import LabelEncoder
LE = LabelEncoder().fit(train_df.language)
In [5]:
train_df["target"] = LE.transform(train_df.language)

Splitting the dataset

Here we will split our dataset into training, validation, and test sets.

In [23]:
X_train, X_comb, Y_train, Y_comb = train_test_split(train_df["code"], train_df["target"], stratify=train_df["target"], test_size=0.2, random_state=42, shuffle=True)
X_validation, X_test, Y_validation, Y_test = train_test_split(X_comb, Y_comb, test_size=0.5, random_state=0, shuffle=True)
In [ ]:
X_train.shape,X_validation.shape,X_test.shape,Y_train.shape,Y_validation.shape,Y_test.shape
Out[ ]:
((36502,), (4563,), (4563,), (36502,), (4563,), (4563,))
In [35]:
!pip install flair
!pip install transformers
Collecting flair
  Downloading flair-0.10-py3-none-any.whl (322 kB)
     |████████████████████████████████| 322 kB 4.2 MB/s 
Collecting mpld3==0.3
  Downloading mpld3-0.3.tar.gz (788 kB)
     |████████████████████████████████| 788 kB 65.4 MB/s 
Requirement already satisfied: python-dateutil>=2.6.1 in /usr/local/lib/python3.7/dist-packages (from flair) (2.8.2)
Requirement already satisfied: gensim>=3.4.0 in /usr/local/lib/python3.7/dist-packages (from flair) (3.6.0)
Collecting gdown==3.12.2
  Downloading gdown-3.12.2.tar.gz (8.2 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
    Preparing wheel metadata ... done
Collecting langdetect
  Downloading langdetect-1.0.9.tar.gz (981 kB)
     |████████████████████████████████| 981 kB 53.5 MB/s 
Requirement already satisfied: regex in /usr/local/lib/python3.7/dist-packages (from flair) (2019.12.20)
Collecting janome
  Downloading Janome-0.4.1-py2.py3-none-any.whl (19.7 MB)
     |████████████████████████████████| 19.7 MB 1.2 MB/s 
Collecting konoha<5.0.0,>=4.0.0
  Downloading konoha-4.6.5-py3-none-any.whl (20 kB)
Collecting ftfy
  Downloading ftfy-6.0.3.tar.gz (64 kB)
     |████████████████████████████████| 64 kB 3.8 MB/s 
Collecting wikipedia-api
  Downloading Wikipedia-API-0.5.4.tar.gz (18 kB)
Collecting sqlitedict>=1.6.0
  Downloading sqlitedict-1.7.0.tar.gz (28 kB)
Requirement already satisfied: torch!=1.8,>=1.5.0 in /usr/local/lib/python3.7/dist-packages (from flair) (1.10.0+cu111)
Collecting huggingface-hub
  Downloading huggingface_hub-0.4.0-py3-none-any.whl (67 kB)
     |████████████████████████████████| 67 kB 6.2 MB/s 
Requirement already satisfied: matplotlib>=2.2.3 in /usr/local/lib/python3.7/dist-packages (from flair) (3.2.2)
Requirement already satisfied: tabulate in /usr/local/lib/python3.7/dist-packages (from flair) (0.8.9)
Requirement already satisfied: sentencepiece==0.1.95 in /usr/local/lib/python3.7/dist-packages (from flair) (0.1.95)
Collecting conllu>=4.0
  Downloading conllu-4.4.1-py2.py3-none-any.whl (15 kB)
Requirement already satisfied: lxml in /usr/local/lib/python3.7/dist-packages (from flair) (4.2.6)
Requirement already satisfied: tqdm>=4.26.0 in /usr/local/lib/python3.7/dist-packages (from flair) (4.62.3)
Collecting more-itertools~=8.8.0
  Downloading more_itertools-8.8.0-py3-none-any.whl (48 kB)
     |████████████████████████████████| 48 kB 5.8 MB/s 
Collecting bpemb>=0.3.2
  Downloading bpemb-0.3.3-py3-none-any.whl (19 kB)
Collecting transformers>=4.0.0
  Downloading transformers-4.15.0-py3-none-any.whl (3.4 MB)
     |████████████████████████████████| 3.4 MB 57.1 MB/s 
Requirement already satisfied: scikit-learn>=0.21.3 in /usr/local/lib/python3.7/dist-packages (from flair) (0.24.2)
Collecting deprecated>=1.2.4
  Downloading Deprecated-1.2.13-py2.py3-none-any.whl (9.6 kB)
Collecting segtok>=1.5.7
  Downloading segtok-1.5.11-py3-none-any.whl (24 kB)
Requirement already satisfied: requests[socks] in /usr/local/lib/python3.7/dist-packages (from gdown==3.12.2->flair) (2.27.1)
Requirement already satisfied: filelock in /usr/local/lib/python3.7/dist-packages (from gdown==3.12.2->flair) (3.4.2)
Requirement already satisfied: six in /usr/local/lib/python3.7/dist-packages (from gdown==3.12.2->flair) (1.15.0)
Requirement already satisfied: numpy in /usr/local/lib/python3.7/dist-packages (from bpemb>=0.3.2->flair) (1.19.5)
Requirement already satisfied: wrapt<2,>=1.10 in /usr/local/lib/python3.7/dist-packages (from deprecated>=1.2.4->flair) (1.13.3)
Requirement already satisfied: smart-open>=1.2.1 in /usr/local/lib/python3.7/dist-packages (from gensim>=3.4.0->flair) (5.2.1)
Requirement already satisfied: scipy>=0.18.1 in /usr/local/lib/python3.7/dist-packages (from gensim>=3.4.0->flair) (1.6.3)
Collecting overrides<4.0.0,>=3.0.0
  Downloading overrides-3.1.0.tar.gz (11 kB)
Collecting importlib-metadata<4.0.0,>=3.7.0
  Downloading importlib_metadata-3.10.1-py3-none-any.whl (14 kB)
Requirement already satisfied: zipp>=0.5 in /usr/local/lib/python3.7/dist-packages (from importlib-metadata<4.0.0,>=3.7.0->konoha<5.0.0,>=4.0.0->flair) (3.7.0)
Requirement already satisfied: typing-extensions>=3.6.4 in /usr/local/lib/python3.7/dist-packages (from importlib-metadata<4.0.0,>=3.7.0->konoha<5.0.0,>=4.0.0->flair) (3.10.0.2)
Requirement already satisfied: kiwisolver>=1.0.1 in /usr/local/lib/python3.7/dist-packages (from matplotlib>=2.2.3->flair) (1.3.2)
Requirement already satisfied: cycler>=0.10 in /usr/local/lib/python3.7/dist-packages (from matplotlib>=2.2.3->flair) (0.11.0)
Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 in /usr/local/lib/python3.7/dist-packages (from matplotlib>=2.2.3->flair) (3.0.6)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.7/dist-packages (from requests[socks]->gdown==3.12.2->flair) (2021.10.8)
Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.7/dist-packages (from requests[socks]->gdown==3.12.2->flair) (2.10)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in /usr/local/lib/python3.7/dist-packages (from requests[socks]->gdown==3.12.2->flair) (1.26.8)
Requirement already satisfied: charset-normalizer~=2.0.0 in /usr/local/lib/python3.7/dist-packages (from requests[socks]->gdown==3.12.2->flair) (2.0.10)
Requirement already satisfied: threadpoolctl>=2.0.0 in /usr/local/lib/python3.7/dist-packages (from scikit-learn>=0.21.3->flair) (3.0.0)
Requirement already satisfied: joblib>=0.11 in /usr/local/lib/python3.7/dist-packages (from scikit-learn>=0.21.3->flair) (1.1.0)
Collecting tokenizers<0.11,>=0.10.1
  Downloading tokenizers-0.10.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.3 MB)
     |████████████████████████████████| 3.3 MB 57.2 MB/s 
Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.7/dist-packages (from transformers>=4.0.0->flair) (21.3)
Requirement already satisfied: sacremoses in /usr/local/lib/python3.7/dist-packages (from transformers>=4.0.0->flair) (0.0.47)
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
     |████████████████████████████████| 596 kB 76.5 MB/s 
Requirement already satisfied: wcwidth in /usr/local/lib/python3.7/dist-packages (from ftfy->flair) (0.2.5)
Requirement already satisfied: PySocks!=1.5.7,>=1.5.6 in /usr/local/lib/python3.7/dist-packages (from requests[socks]->gdown==3.12.2->flair) (1.7.1)
Requirement already satisfied: click in /usr/local/lib/python3.7/dist-packages (from sacremoses->transformers>=4.0.0->flair) (7.1.2)
Building wheels for collected packages: gdown, mpld3, overrides, sqlitedict, ftfy, langdetect, wikipedia-api
  Building wheel for gdown (PEP 517) ... done
  Created wheel for gdown: filename=gdown-3.12.2-py3-none-any.whl size=9704 sha256=2cdd66278d92fe6b25692c4cc1ba994650abe342a8c13030147a91a183adb63f
  Stored in directory: /root/.cache/pip/wheels/ba/e0/7e/726e872a53f7358b4b96a9975b04e98113b005cd8609a63abc
  Building wheel for mpld3 (setup.py) ... done
  Created wheel for mpld3: filename=mpld3-0.3-py3-none-any.whl size=116702 sha256=d12cb28f77aa36f674f8d3afcb13bcb9c43d14d87416ec7bd9211e87d5e58a73
  Stored in directory: /root/.cache/pip/wheels/26/70/6a/1c79e59951a41b4045497da187b2724f5659ca64033cf4548e
  Building wheel for overrides (setup.py) ... done
  Created wheel for overrides: filename=overrides-3.1.0-py3-none-any.whl size=10187 sha256=5b23beadf067c273e59197bfcdaf97c6ea7e9401f14fac5e91ee0536703c3379
  Stored in directory: /root/.cache/pip/wheels/3a/0d/38/01a9bc6e20dcfaf0a6a7b552d03137558ba1c38aea47644682
  Building wheel for sqlitedict (setup.py) ... done
  Created wheel for sqlitedict: filename=sqlitedict-1.7.0-py3-none-any.whl size=14393 sha256=7ae55ca59f08b73e0c217484b7cce01bcf3c1951aa2d1226bccc82cd30313e8d
  Stored in directory: /root/.cache/pip/wheels/af/94/06/18c0e83e9e227da8f3582810b51f319bbfd181e508676a56c8
  Building wheel for ftfy (setup.py) ... done
  Created wheel for ftfy: filename=ftfy-6.0.3-py3-none-any.whl size=41933 sha256=d8169cb120e49bbabb164fe65282348b5a5bfda91e19b22cd1cba7540ce1e715
  Stored in directory: /root/.cache/pip/wheels/19/f5/38/273eb3b5e76dfd850619312f693716ac4518b498f5ffb6f56d
  Building wheel for langdetect (setup.py) ... done
  Created wheel for langdetect: filename=langdetect-1.0.9-py3-none-any.whl size=993242 sha256=1317d838d4e4f0a1758c2909204fee8d5d883b1df84b1bbe704f1851f1da2e71
  Stored in directory: /root/.cache/pip/wheels/c5/96/8a/f90c59ed25d75e50a8c10a1b1c2d4c402e4dacfa87f3aff36a
  Building wheel for wikipedia-api (setup.py) ... done
  Created wheel for wikipedia-api: filename=Wikipedia_API-0.5.4-py3-none-any.whl size=13477 sha256=8c1004f58e2cc3e0ddbd62e751728722a1f62be2ef4585e3a3d1ad17ca93c48a
  Stored in directory: /root/.cache/pip/wheels/d3/24/56/58ba93cf78be162451144e7a9889603f437976ef1ae7013d04
Successfully built gdown mpld3 overrides sqlitedict ftfy langdetect wikipedia-api
Installing collected packages: pyyaml, importlib-metadata, tokenizers, overrides, huggingface-hub, wikipedia-api, transformers, sqlitedict, segtok, mpld3, more-itertools, langdetect, konoha, janome, gdown, ftfy, deprecated, conllu, bpemb, flair
  Attempting uninstall: pyyaml
    Found existing installation: PyYAML 3.13
    Uninstalling PyYAML-3.13:
      Successfully uninstalled PyYAML-3.13
  Attempting uninstall: importlib-metadata
    Found existing installation: importlib-metadata 4.2.0
    Uninstalling importlib-metadata-4.2.0:
      Successfully uninstalled importlib-metadata-4.2.0
  Attempting uninstall: tokenizers
    Found existing installation: tokenizers 0.9.4
    Uninstalling tokenizers-0.9.4:
      Successfully uninstalled tokenizers-0.9.4
  Attempting uninstall: more-itertools
    Found existing installation: more-itertools 8.12.0
    Uninstalling more-itertools-8.12.0:
      Successfully uninstalled more-itertools-8.12.0
  Attempting uninstall: gdown
    Found existing installation: gdown 3.6.4
    Uninstalling gdown-3.6.4:
      Successfully uninstalled gdown-3.6.4
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
markdown 3.3.6 requires importlib-metadata>=4.4; python_version < "3.10", but you have importlib-metadata 3.10.1 which is incompatible.
datascience 0.10.6 requires folium==0.2.1, but you have folium 0.8.3 which is incompatible.
autogluon-contrib-nlp 0.0.1b20210201 requires tokenizers==0.9.4, but you have tokenizers 0.10.3 which is incompatible.
Successfully installed bpemb-0.3.3 conllu-4.4.1 deprecated-1.2.13 flair-0.10 ftfy-6.0.3 gdown-3.12.2 huggingface-hub-0.4.0 importlib-metadata-3.10.1 janome-0.4.1 konoha-4.6.5 langdetect-1.0.9 more-itertools-8.8.0 mpld3-0.3 overrides-3.1.0 pyyaml-6.0 segtok-1.5.11 sqlitedict-1.7.0 tokenizers-0.10.3 transformers-4.15.0 wikipedia-api-0.5.4
Requirement already satisfied: transformers in /usr/local/lib/python3.7/dist-packages (4.15.0)
Requirement already satisfied: regex!=2019.12.17 in /usr/local/lib/python3.7/dist-packages (from transformers) (2019.12.20)
Requirement already satisfied: importlib-metadata in /usr/local/lib/python3.7/dist-packages (from transformers) (3.10.1)
Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.7/dist-packages (from transformers) (21.3)
Requirement already satisfied: tqdm>=4.27 in /usr/local/lib/python3.7/dist-packages (from transformers) (4.62.3)
Requirement already satisfied: huggingface-hub<1.0,>=0.1.0 in /usr/local/lib/python3.7/dist-packages (from transformers) (0.4.0)
Requirement already satisfied: pyyaml>=5.1 in /usr/local/lib/python3.7/dist-packages (from transformers) (6.0)
Requirement already satisfied: requests in /usr/local/lib/python3.7/dist-packages (from transformers) (2.27.1)
Requirement already satisfied: tokenizers<0.11,>=0.10.1 in /usr/local/lib/python3.7/dist-packages (from transformers) (0.10.3)
Requirement already satisfied: sacremoses in /usr/local/lib/python3.7/dist-packages (from transformers) (0.0.47)
Requirement already satisfied: numpy>=1.17 in /usr/local/lib/python3.7/dist-packages (from transformers) (1.19.5)
Requirement already satisfied: filelock in /usr/local/lib/python3.7/dist-packages (from transformers) (3.4.2)
Requirement already satisfied: typing-extensions>=3.7.4.3 in /usr/local/lib/python3.7/dist-packages (from huggingface-hub<1.0,>=0.1.0->transformers) (3.10.0.2)
Requirement already satisfied: pyparsing!=3.0.5,>=2.0.2 in /usr/local/lib/python3.7/dist-packages (from packaging>=20.0->transformers) (3.0.6)
Requirement already satisfied: zipp>=0.5 in /usr/local/lib/python3.7/dist-packages (from importlib-metadata->transformers) (3.7.0)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.7/dist-packages (from requests->transformers) (2021.10.8)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in /usr/local/lib/python3.7/dist-packages (from requests->transformers) (1.26.8)
Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.7/dist-packages (from requests->transformers) (2.10)
Requirement already satisfied: charset-normalizer~=2.0.0 in /usr/local/lib/python3.7/dist-packages (from requests->transformers) (2.0.10)
Requirement already satisfied: click in /usr/local/lib/python3.7/dist-packages (from sacremoses->transformers) (7.1.2)
Requirement already satisfied: joblib in /usr/local/lib/python3.7/dist-packages (from sacremoses->transformers) (1.1.0)
Requirement already satisfied: six in /usr/local/lib/python3.7/dist-packages (from sacremoses->transformers) (1.15.0)
In [36]:
!pip install -U numpy
Requirement already satisfied: numpy in /usr/local/lib/python3.7/dist-packages (1.19.5)
Collecting numpy
  Downloading numpy-1.21.5-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (15.7 MB)
     |████████████████████████████████| 15.7 MB 4.2 MB/s 
Installing collected packages: numpy
  Attempting uninstall: numpy
    Found existing installation: numpy 1.19.5
    Uninstalling numpy-1.19.5:
      Successfully uninstalled numpy-1.19.5
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
yellowbrick 1.3.post1 requires numpy<1.20,>=1.16.0, but you have numpy 1.21.5 which is incompatible.
gym 0.17.3 requires cloudpickle<1.7.0,>=1.2.0, but you have cloudpickle 2.0.0 which is incompatible.
google-colab 1.0.0 requires requests~=2.23.0, but you have requests 2.27.1 which is incompatible.
datascience 0.10.6 requires folium==0.2.1, but you have folium 0.8.3 which is incompatible.
autogluon-contrib-nlp 0.0.1b20210201 requires tokenizers==0.9.4, but you have tokenizers 0.10.3 which is incompatible.
albumentations 0.1.12 requires imgaug<0.2.7,>=0.2.5, but you have imgaug 0.2.9 which is incompatible.
Successfully installed numpy-1.21.5
In [18]:
from flair.embeddings import TransformerDocumentEmbeddings
In [6]:
from flair.embeddings import TransformerDocumentEmbeddings
from flair.data import Sentence
from tqdm.notebook import tqdm
doc_e=[]
# Candidate encoder checkpoints; model_name3 (CodeBERTa-language-id) is the one used below.
model_name='huggingface/CodeBERTa-small-v1'
model_name2='microsoft/codebert-base-mlm'
model_name3="huggingface/CodeBERTa-language-id"
model_name4='flax-community/gpt-neo-125M-code-clippy-dedup'
doc_embedding = TransformerDocumentEmbeddings(model_name3, pooling='cls', layer_mean=True)

# Embed every training snippet and keep the pooled document vector.
for d in tqdm(train_df["code"].values):
  sent=Sentence(d.strip())
  doc_embedding.embed(sent)
  doc_e.append(sent.embedding.detach().cpu().numpy())
In [7]:
test_doc_e=[]
for d in tqdm(test_df["code"].values):
  sent=Sentence(d.strip())
  doc_embedding.embed(sent)
  test_doc_e.append(sent.embedding.detach().cpu().numpy())
In [20]:
np.save('./train.npy',doc_e)
np.save('./test.npy',test_doc_e)
In [ ]:
doc_e=np.load('./train.npy')
test_doc_e=np.load('./test.npy')
In [ ]:
!pip install catboost
In [ ]:
test_df.code.values[7075]
Out[ ]:
'It turns out that the formula will produce 40 primes for the consecutive values\n\n n = 0 to 39. However, when n = 40, 402 + 40 + 41 = 40(40 + 1) + 41 is divisible\n\n by 41, and certainly when n = 41, 412 + 41 + 41 is clearly divisible by 41.\n\n The incredible formula  n2 − 79n + 1601 was discovered, which produces 80 primes\n\n for the consecutive values n = 0 to 79. The product of the coefficients, −79 and\n\n 1601, is −126479.\n'
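Note that some test snippets, like the one above, are plain English prose rather than code, so the model has to tolerate fairly noisy inputs.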
In [ ]:
from sklearn.ensemble import StackingClassifier
from catboost import CatBoostClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

estimators1 = [
    # ('catboost', CatBoostClassifier(n_estimators=200)),
    ('LR', LogisticRegression()),
    # ('KNN', make_pipeline(PCA(n_components=50),KNeighborsClassifier(n_neighbors=5))),

]
clf1= StackingClassifier(
    estimators=estimators1, final_estimator=LogisticRegression()
)
estimators2 = [
    # ('catboost', CatBoostClassifier(n_estimators=200)),
    ('LR', LogisticRegression()),
    # ('KNN', make_pipeline(PCA(n_components=50),KNeighborsClassifier(n_neighbors=5))),
]
# Note: clf2 is defined here but the next cell stacks clf1 in both pipelines;
# since sklearn clones estimators on fit, reusing clf1 still works.
clf2 = StackingClassifier(
    estimators=estimators2, final_estimator=LogisticRegression()
)
In [ ]:
estimators = [
    ('pipe2', make_pipeline(CountVectorizer(analyzer='word',min_df=5,max_df=1500),TfidfTransformer(),clf1)),
    ('pipe1', make_pipeline(CountVectorizer(analyzer='char',ngram_range=(1,2),min_df=10),TfidfTransformer(),clf1))
]
clf3= StackingClassifier(
    estimators=estimators, final_estimator=LogisticRegression(),verbose=3
)
clf3.fit(X_train,Y_train)
In [ ]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neural_network import MLPClassifier
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import RandomOverSampler
from sklearn.ensemble import RandomForestClassifier
token_pattern = r"""([A-Za-z_]\w*\b|[!\#\$%\&\*\+:\-\./<=>\?@\\\^_\|\~]+|[ \t\(\),;\{\}\[\]"'`])"""

# vectorizer = TfidfVectorizer(token_pattern=token_pattern, max_features=3000)
# Note: test_vect is defined in a later cell (In [16]); that cell must run before this one.
classifier = Pipeline([('tfidf', TfidfVectorizer(vocabulary=test_vect.vocabulary_)), ('clf', RandomForestClassifier(random_state=0, max_depth=20))])
classifier = classifier.fit(train_df['code'],train_df['target'])
In [ ]:
lengths=np.array([len(c) for c in train_df.code])
In [ ]:
train_df.loc[lengths<10].code.str.replace(" ","").value_counts()
In [ ]:
train_df.loc[lengths<10].language.value_counts()
In [8]:
vals=[
      '->','=>','++','+=','--','-=','<-','::',':=','&&',';','||','#','\\\\\\','\\\\','\\*','){','===','!==', '*','"""','\'\'\'','@param',
]
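
These hand-picked tokens are operators and comment markers (for example ->, ::, &&, @param) that tend to be distinctive of particular languages; they seed the custom vocabulary built below.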
In [9]:
token_pattern = r"""([A-Za-z_]\w*\b|[!\#\$%\&\*\+:\-\./<=>\?@\\\^_\|\~]+|[ \t\(\),;\{\}\[\]"'`])"""

# Build a per-language vocabulary of frequent tokens; languages with fewer
# samples get looser frequency thresholds and a smaller feature budget.
for language in train_df.language.unique():
  sub_df=train_df[train_df.language==language].copy()
  min_df=30 if sub_df.shape[0]>1000 else 5
  min_df_chars=10 if sub_df.shape[0]>1000 else 5
  max_features=300 if sub_df.shape[0]>1000 else 100
  v1=CountVectorizer(token_pattern=token_pattern,min_df=min_df,max_features=max_features)
  v1.fit(sub_df.code)
  vals+=[t.strip() for t in list(v1.vocabulary_.keys())]
  print(language,len(v1.vocabulary_))
c-sharp 261
javascript 213
c-plus-plus 300
c 300
python 300
ruby 98
swift 100
java 300
go 176
dart 81
julia 84
f-sharp 100
php 100
R 100
scala 100
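
Each language contributes its most frequent tokens, and the per-language vocabularies are then merged into a single token-to-index mapping: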
In [11]:
vocab={k:v for v,k in enumerate(set(vals))}
In [ ]:
vocab
In [16]:
test_vect=CountVectorizer(token_pattern=token_pattern,min_df=3,max_features=1500)
test_vect.fit(test_df['code'])
Out[16]:
CountVectorizer(max_features=1500, min_df=3,
                token_pattern='([A-Za-z_]\\w*\\b|[!\\#\\$%\\&\\*\\+:\\-\\./<=>\\?@\\\\\\^_\\|\\~]+|[ '
                              '\\t\\(\\),;\\{\\}\\[\\]"\'`])')
In [17]:
total_vocab=list(vocab.keys())+list(test_vect.vocabulary_.keys())
total_vocab={k:v for v,k in enumerate(set(total_vocab))}
In [48]:
len(total_vocab)
Out[48]:
1571
In [12]:
from sklearn.feature_extraction.text import TfidfVectorizer,CountVectorizer

# Despite the name, this is a plain CountVectorizer restricted to the curated vocabulary.
tfidf=CountVectorizer(vocabulary=vocab,token_pattern=token_pattern)
X=train_df.code.values
X=tfidf.fit_transform(X).toarray()
In [19]:
from guesslang import Guess

guess = Guess()
# guess.probabilities() returns (language, probability) pairs; keep just the probabilities.
def to_prob(prob):
  return np.array([p[1] for p in prob])
# Guess the language from code
language = guess.probabilities("""
    % Quick sort

    -module (recursion).
    -export ([qsort/1]).

    qsort([]) -> [];
    qsort([Pivot|T]) ->
           qsort([X || X <- T, X < Pivot])
           ++ [Pivot] ++
           qsort([X || X <- T, X >= Pivot]).
    """)
In [22]:
from tqdm.notebook import tqdm
probabilities=[]
for c in tqdm(train_df.code):
  probabilities.append(to_prob(guess.probabilities(c)))
In [30]:
np.c_[np.array(doc_e),X].shape
Out[30]:
(45628, 2427)
In [13]:
df=pd.DataFrame(np.c_[np.array(doc_e),X])
In [51]:
X.shape,
Out[51]:
((45628, 1571),)
In [14]:
df['target']=train_df['target']
In [15]:
df.to_csv("./train.csv",index=False)
In [15]:
train_df[['code','target']].to_csv('./train_df.csv',index=False)
In [ ]:
from autogluon.core.utils.loaders.load_pd import load
train_data=load('./train.csv')
In [ ]:
!pip install mxnet
In [ ]:
from autogluon.text import TextPredictor

predictor = TextPredictor(label='target', eval_metric='acc',)
predictor.fit(train_data, time_limit=7200)
In [16]:
from autogluon.tabular import TabularDataset, TabularPredictor
train_data = TabularDataset("./train.csv")
predictor = TabularPredictor(label="target").fit(train_data,time_limit=10800)
No path specified. Models will be saved in: "AutogluonModels/ag-20220112_171720/"
Beginning AutoGluon training ... Time limit = 10800s
AutoGluon will save models to "AutogluonModels/ag-20220112_171720/"
AutoGluon Version:  0.3.1
Train Data Rows:    45628
Train Data Columns: 1542
Preprocessing data ...
AutoGluon infers your prediction problem is: 'multiclass' (because dtype of label-column == int, but few unique label-values observed).
	First 10 (of 15) unique label values:  [3, 8, 2, 1, 11, 12, 14, 7, 6, 4]
	If 'multiclass' is not the correct problem_type, please manually specify the problem_type argument in fit() (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])
NumExpr defaulting to 2 threads.
Train Data Class Count: 15
Using Feature Generators to preprocess the data ...
Fitting AutoMLPipelineFeatureGenerator...
	Available Memory:                    10548.67 MB
	Train Data (Original)  Memory Usage: 562.87 MB (5.3% of available memory)
	Warning: Data size prior to feature transformation consumes 5.3% of available memory. Consider increasing memory or subsampling the data to avoid instability.
	Inferring data type of each feature based on column values. Set feature_metadata_in to manually specify special dtypes of the features.
	Stage 1 Generators:
		Fitting AsTypeFeatureGenerator...
			Note: Converting 41 features to boolean dtype as they only contain 2 unique values.
	Stage 2 Generators:
		Fitting FillNaFeatureGenerator...
	Stage 3 Generators:
		Fitting IdentityFeatureGenerator...
	Stage 4 Generators:
		Fitting DropUniqueFeatureGenerator...
	Useless Original Features (Count: 7): ['768', '771', '972', '1016', '1294', '1316', '1524']
		These features carry no predictive signal and should be manually investigated.
		This is typically a feature which has the same value for all rows.
		These features do not need to be present at inference time.
	Types of features in original data (raw dtype, special dtypes):
		('float', []) : 1535 | ['0', '1', '2', '3', '4', ...]
	Types of features in processed data (raw dtype, special dtypes):
		('float', [])     : 1494 | ['0', '1', '2', '3', '4', ...]
		('int', ['bool']) :   41 | ['779', '801', '804', '807', '833', ...]
	13.6s = Fit runtime
	1535 features in original data used to generate 1535 features in processed data.
	Train Data (Processed) Memory Usage: 547.22 MB (5.2% of available memory)
Data preprocessing and feature engineering runtime = 14.7s ...
AutoGluon will gauge predictive performance using evaluation metric: 'accuracy'
	To change this, specify the eval_metric argument of fit()
Automatically generating train/validation split with holdout_frac=0.05479091785745595, Train Rows: 43128, Val Rows: 2500
Fitting 13 L1 models ...
Fitting model: KNeighborsUnif ... Training model for up to 10785.3s of the 10785.27s of remaining time.
	0.8496	 = Validation score   (accuracy)
	3.43s	 = Training   runtime
	13.11s	 = Validation runtime
Fitting model: KNeighborsDist ... Training model for up to 10768.16s of the 10768.11s of remaining time.
	0.8584	 = Validation score   (accuracy)
	3.3s	 = Training   runtime
	13.33s	 = Validation runtime
Fitting model: NeuralNetFastAI ... Training model for up to 10750.92s of the 10750.89s of remaining time.
	0.9072	 = Validation score   (accuracy)
	163.55s	 = Training   runtime
	1.48s	 = Validation runtime
Fitting model: LightGBMXT ... Training model for up to 10585.68s of the 10585.65s of remaining time.
/usr/local/lib/python3.7/dist-packages/lightgbm/engine.py:239: UserWarning: 'verbose_eval' argument is deprecated and will be removed in a future release of LightGBM. Pass 'log_evaluation()' callback via 'callbacks' argument instead.
  _log_warning("'verbose_eval' argument is deprecated and will be removed in a future release of LightGBM. "
	0.8976	 = Validation score   (accuracy)
	2014.51s	 = Training   runtime
	2.68s	 = Validation runtime
Fitting model: LightGBM ... Training model for up to 8567.2s of the 8567.17s of remaining time.
/usr/local/lib/python3.7/dist-packages/lightgbm/engine.py:239: UserWarning: 'verbose_eval' argument is deprecated and will be removed in a future release of LightGBM. Pass 'log_evaluation()' callback via 'callbacks' argument instead.
  _log_warning("'verbose_eval' argument is deprecated and will be removed in a future release of LightGBM. "
[1000]	train_set's multi_error: 0.0112456	valid_set's multi_error: 0.1036
	0.8976	 = Validation score   (accuracy)
	3189.2s	 = Training   runtime
	3.54s	 = Validation runtime
Fitting model: RandomForestGini ... Training model for up to 5372.79s of the 5372.77s of remaining time.
	0.7884	 = Validation score   (accuracy)
	280.53s	 = Training   runtime
	0.43s	 = Validation runtime
Fitting model: RandomForestEntr ... Training model for up to 5089.24s of the 5089.22s of remaining time.
	0.7832	 = Validation score   (accuracy)
	957.22s	 = Training   runtime
	0.33s	 = Validation runtime
Fitting model: CatBoost ... Training model for up to 4129.22s of the 4129.19s of remaining time.
	Many features detected (1535), dynamically setting 'colsample_bylevel' to 0.6514657980456026 to speed up training (Default = 1).
	To disable this functionality, explicitly specify 'colsample_bylevel' in the model hyperparameters.
	0.8536	 = Validation score   (accuracy)
	3322.8s	 = Training   runtime
	0.05s	 = Validation runtime
Fitting model: ExtraTreesGini ... Training model for up to 806.31s of the 806.29s of remaining time.
	Warning: Reducing model 'n_estimators' from 300 -> 295 due to low memory. Expected memory usage reduced from 15.21% -> 15.0% of available memory...
	0.7656	 = Validation score   (accuracy)
	77.41s	 = Training   runtime
	0.44s	 = Validation runtime
Fitting model: ExtraTreesEntr ... Training model for up to 723.92s of the 723.89s of remaining time.
	Warning: Reducing model 'n_estimators' from 300 -> 288 due to low memory. Expected memory usage reduced from 15.6% -> 15.0% of available memory...
	0.7552	 = Validation score   (accuracy)
	75.42s	 = Training   runtime
	0.43s	 = Validation runtime
Fitting model: XGBoost ... Training model for up to 643.67s of the 643.65s of remaining time.
	0.8644	 = Validation score   (accuracy)
	649.25s	 = Training   runtime
	0.25s	 = Validation runtime
Fitting model: WeightedEnsemble_L2 ... Training model for up to 1078.53s of the -56.73s of remaining time.
	0.9144	 = Validation score   (accuracy)
	0.97s	 = Training   runtime
	0.01s	 = Validation runtime
AutoGluon training complete, total runtime = 10859.82s ...
TabularPredictor saved. To load, use: predictor = TabularPredictor.load("AutogluonModels/ag-20220112_171720/")
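As the log's final line notes, the fitted ensemble is persisted to disk, so it can be reloaded later without refitting. A minimal sketch (the run directory comes straight from the log above):

from autogluon.tabular import TabularPredictor

# Reload the persisted predictor from the run directory printed above.
predictor = TabularPredictor.load("AutogluonModels/ag-20220112_171720/")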
In [ ]:
predictor.leaderboard()
In [ ]:
# Strip whitespace from every vocabulary token to check for duplicates.
v = [val.strip() for val in test_vect.vocabulary_.keys()]
In [ ]:
len(set(v)),len(test_vect.vocabulary_)
Out[ ]:
(846, 1500)
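Only 846 of the 1500 tokens survive stripping, i.e. many vocabulary entries differ only by leading tabs or spaces. One way to build a deduplicated vocabulary for a fresh vectorizer (a sketch; whether this is how the `vocab` used later was built is an assumption):

# Whitespace-stripped, deduplicated vocabulary (sorted for reproducibility).
vocab = sorted(set(val.strip() for val in test_vect.vocabulary_))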

AutoKeras

In [13]:
!pip install autokeras
Collecting autokeras
  Downloading autokeras-1.0.16.post1-py3-none-any.whl (166 kB)
     |████████████████████████████████| 166 kB 8.3 MB/s 
Requirement already satisfied: pandas in /usr/local/lib/python3.7/dist-packages (from autokeras) (1.1.5)
Requirement already satisfied: packaging in /usr/local/lib/python3.7/dist-packages (from autokeras) (21.3)
Collecting keras-tuner<1.1,>=1.0.2
  Downloading keras_tuner-1.0.4-py3-none-any.whl (97 kB)
     |████████████████████████████████| 97 kB 6.6 MB/s 
Requirement already satisfied: scikit-learn in /usr/local/lib/python3.7/dist-packages (from autokeras) (0.24.2)
Collecting tensorflow<2.6,>=2.3.0
  Downloading tensorflow-2.5.2-cp37-cp37m-manylinux2010_x86_64.whl (454.4 MB)
     |████████████████████████████████| 454.4 MB 19 kB/s 
Collecting kt-legacy
  Downloading kt_legacy-1.0.4-py3-none-any.whl (9.6 kB)
Requirement already satisfied: ipython in /usr/local/lib/python3.7/dist-packages (from keras-tuner<1.1,>=1.0.2->autokeras) (5.5.0)
Requirement already satisfied: numpy in /usr/local/lib/python3.7/dist-packages (from keras-tuner<1.1,>=1.0.2->autokeras) (1.19.5)
Requirement already satisfied: tensorboard in /usr/local/lib/python3.7/dist-packages (from keras-tuner<1.1,>=1.0.2->autokeras) (2.7.0)
Requirement already satisfied: requests in /usr/local/lib/python3.7/dist-packages (from keras-tuner<1.1,>=1.0.2->autokeras) (2.27.1)
Requirement already satisfied: scipy in /usr/local/lib/python3.7/dist-packages (from keras-tuner<1.1,>=1.0.2->autokeras) (1.6.3)
Requirement already satisfied: protobuf>=3.9.2 in /usr/local/lib/python3.7/dist-packages (from tensorflow<2.6,>=2.3.0->autokeras) (3.17.3)
Collecting tensorflow-estimator<2.6.0,>=2.5.0
  Downloading tensorflow_estimator-2.5.0-py2.py3-none-any.whl (462 kB)
     |████████████████████████████████| 462 kB 56.3 MB/s 
Requirement already satisfied: termcolor~=1.1.0 in /usr/local/lib/python3.7/dist-packages (from tensorflow<2.6,>=2.3.0->autokeras) (1.1.0)
Requirement already satisfied: gast==0.4.0 in /usr/local/lib/python3.7/dist-packages (from tensorflow<2.6,>=2.3.0->autokeras) (0.4.0)
Collecting flatbuffers~=1.12.0
  Downloading flatbuffers-1.12-py2.py3-none-any.whl (15 kB)
Requirement already satisfied: google-pasta~=0.2 in /usr/local/lib/python3.7/dist-packages (from tensorflow<2.6,>=2.3.0->autokeras) (0.2.0)
Collecting keras-nightly~=2.5.0.dev
  Downloading keras_nightly-2.5.0.dev2021032900-py2.py3-none-any.whl (1.2 MB)
     |████████████████████████████████| 1.2 MB 51.2 MB/s 
Requirement already satisfied: keras-preprocessing~=1.1.2 in /usr/local/lib/python3.7/dist-packages (from tensorflow<2.6,>=2.3.0->autokeras) (1.1.2)
Collecting typing-extensions~=3.7.4
  Downloading typing_extensions-3.7.4.3-py3-none-any.whl (22 kB)
Collecting wrapt~=1.12.1
  Downloading wrapt-1.12.1.tar.gz (27 kB)
Requirement already satisfied: six~=1.15.0 in /usr/local/lib/python3.7/dist-packages (from tensorflow<2.6,>=2.3.0->autokeras) (1.15.0)
Requirement already satisfied: astunparse~=1.6.3 in /usr/local/lib/python3.7/dist-packages (from tensorflow<2.6,>=2.3.0->autokeras) (1.6.3)
Requirement already satisfied: opt-einsum~=3.3.0 in /usr/local/lib/python3.7/dist-packages (from tensorflow<2.6,>=2.3.0->autokeras) (3.3.0)
Requirement already satisfied: absl-py~=0.10 in /usr/local/lib/python3.7/dist-packages (from tensorflow<2.6,>=2.3.0->autokeras) (0.12.0)
Requirement already satisfied: h5py~=3.1.0 in /usr/local/lib/python3.7/dist-packages (from tensorflow<2.6,>=2.3.0->autokeras) (3.1.0)
Collecting grpcio~=1.34.0
  Downloading grpcio-1.34.1-cp37-cp37m-manylinux2014_x86_64.whl (4.0 MB)
     |████████████████████████████████| 4.0 MB 26.5 MB/s 
Requirement already satisfied: wheel~=0.35 in /usr/local/lib/python3.7/dist-packages (from tensorflow<2.6,>=2.3.0->autokeras) (0.37.0)
Requirement already satisfied: cached-property in /usr/local/lib/python3.7/dist-packages (from h5py~=3.1.0->tensorflow<2.6,>=2.3.0->autokeras) (1.5.2)
Requirement already satisfied: google-auth<3,>=1.6.3 in /usr/local/lib/python3.7/dist-packages (from tensorboard->keras-tuner<1.1,>=1.0.2->autokeras) (1.35.0)
Requirement already satisfied: setuptools>=41.0.0 in /usr/local/lib/python3.7/dist-packages (from tensorboard->keras-tuner<1.1,>=1.0.2->autokeras) (57.4.0)
Requirement already satisfied: google-auth-oauthlib<0.5,>=0.4.1 in /usr/local/lib/python3.7/dist-packages (from tensorboard->keras-tuner<1.1,>=1.0.2->autokeras) (0.4.6)
Requirement already satisfied: tensorboard-data-server<0.7.0,>=0.6.0 in /usr/local/lib/python3.7/dist-packages (from tensorboard->keras-tuner<1.1,>=1.0.2->autokeras) (0.6.1)
Requirement already satisfied: tensorboard-plugin-wit>=1.6.0 in /usr/local/lib/python3.7/dist-packages (from tensorboard->keras-tuner<1.1,>=1.0.2->autokeras) (1.8.0)
Requirement already satisfied: werkzeug>=0.11.15 in /usr/local/lib/python3.7/dist-packages (from tensorboard->keras-tuner<1.1,>=1.0.2->autokeras) (1.0.1)
Requirement already satisfied: markdown>=2.6.8 in /usr/local/lib/python3.7/dist-packages (from tensorboard->keras-tuner<1.1,>=1.0.2->autokeras) (3.3.6)
Requirement already satisfied: cachetools<5.0,>=2.0.0 in /usr/local/lib/python3.7/dist-packages (from google-auth<3,>=1.6.3->tensorboard->keras-tuner<1.1,>=1.0.2->autokeras) (4.2.4)
Requirement already satisfied: rsa<5,>=3.1.4 in /usr/local/lib/python3.7/dist-packages (from google-auth<3,>=1.6.3->tensorboard->keras-tuner<1.1,>=1.0.2->autokeras) (4.8)
Requirement already satisfied: pyasn1-modules>=0.2.1 in /usr/local/lib/python3.7/dist-packages (from google-auth<3,>=1.6.3->tensorboard->keras-tuner<1.1,>=1.0.2->autokeras) (0.2.8)
Requirement already satisfied: requests-oauthlib>=0.7.0 in /usr/local/lib/python3.7/dist-packages (from google-auth-oauthlib<0.5,>=0.4.1->tensorboard->keras-tuner<1.1,>=1.0.2->autokeras) (1.3.0)
Collecting importlib-metadata>=4.4
  Downloading importlib_metadata-4.10.0-py3-none-any.whl (17 kB)
Requirement already satisfied: zipp>=0.5 in /usr/local/lib/python3.7/dist-packages (from importlib-metadata>=4.4->markdown>=2.6.8->tensorboard->keras-tuner<1.1,>=1.0.2->autokeras) (3.6.0)
Requirement already satisfied: pyasn1<0.5.0,>=0.4.6 in /usr/local/lib/python3.7/dist-packages (from pyasn1-modules>=0.2.1->google-auth<3,>=1.6.3->tensorboard->keras-tuner<1.1,>=1.0.2->autokeras) (0.4.8)
Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.7/dist-packages (from requests->keras-tuner<1.1,>=1.0.2->autokeras) (2.10)
Requirement already satisfied: charset-normalizer~=2.0.0 in /usr/local/lib/python3.7/dist-packages (from requests->keras-tuner<1.1,>=1.0.2->autokeras) (2.0.9)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.7/dist-packages (from requests->keras-tuner<1.1,>=1.0.2->autokeras) (2021.10.8)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in /usr/local/lib/python3.7/dist-packages (from requests->keras-tuner<1.1,>=1.0.2->autokeras) (1.26.8)
Requirement already satisfied: oauthlib>=3.0.0 in /usr/local/lib/python3.7/dist-packages (from requests-oauthlib>=0.7.0->google-auth-oauthlib<0.5,>=0.4.1->tensorboard->keras-tuner<1.1,>=1.0.2->autokeras) (3.1.1)
Requirement already satisfied: pexpect in /usr/local/lib/python3.7/dist-packages (from ipython->keras-tuner<1.1,>=1.0.2->autokeras) (4.8.0)
Requirement already satisfied: traitlets>=4.2 in /usr/local/lib/python3.7/dist-packages (from ipython->keras-tuner<1.1,>=1.0.2->autokeras) (5.1.1)
Requirement already satisfied: pygments in /usr/local/lib/python3.7/dist-packages (from ipython->keras-tuner<1.1,>=1.0.2->autokeras) (2.6.1)
Requirement already satisfied: pickleshare in /usr/local/lib/python3.7/dist-packages (from ipython->keras-tuner<1.1,>=1.0.2->autokeras) (0.7.5)
Requirement already satisfied: simplegeneric>0.8 in /usr/local/lib/python3.7/dist-packages (from ipython->keras-tuner<1.1,>=1.0.2->autokeras) (0.8.1)
Requirement already satisfied: decorator in /usr/local/lib/python3.7/dist-packages (from ipython->keras-tuner<1.1,>=1.0.2->autokeras) (4.4.2)
Requirement already satisfied: prompt-toolkit<2.0.0,>=1.0.4 in /usr/local/lib/python3.7/dist-packages (from ipython->keras-tuner<1.1,>=1.0.2->autokeras) (1.0.18)
Requirement already satisfied: wcwidth in /usr/local/lib/python3.7/dist-packages (from prompt-toolkit<2.0.0,>=1.0.4->ipython->keras-tuner<1.1,>=1.0.2->autokeras) (0.2.5)
Requirement already satisfied: pyparsing!=3.0.5,>=2.0.2 in /usr/local/lib/python3.7/dist-packages (from packaging->autokeras) (3.0.6)
Requirement already satisfied: pytz>=2017.2 in /usr/local/lib/python3.7/dist-packages (from pandas->autokeras) (2018.9)
Requirement already satisfied: python-dateutil>=2.7.3 in /usr/local/lib/python3.7/dist-packages (from pandas->autokeras) (2.8.2)
Requirement already satisfied: ptyprocess>=0.5 in /usr/local/lib/python3.7/dist-packages (from pexpect->ipython->keras-tuner<1.1,>=1.0.2->autokeras) (0.7.0)
Requirement already satisfied: joblib>=0.11 in /usr/local/lib/python3.7/dist-packages (from scikit-learn->autokeras) (1.1.0)
Requirement already satisfied: threadpoolctl>=2.0.0 in /usr/local/lib/python3.7/dist-packages (from scikit-learn->autokeras) (3.0.0)
Building wheels for collected packages: wrapt
  Building wheel for wrapt (setup.py) ... done
  Created wheel for wrapt: filename=wrapt-1.12.1-cp37-cp37m-linux_x86_64.whl size=68714 sha256=612e24cfc048f892c5e5501cb5dd545bca62359c17f33bb7c0a8c9e6e15e3852
  Stored in directory: /root/.cache/pip/wheels/62/76/4c/aa25851149f3f6d9785f6c869387ad82b3fd37582fa8147ac6
Successfully built wrapt
Installing collected packages: typing-extensions, importlib-metadata, grpcio, wrapt, tensorflow-estimator, kt-legacy, keras-nightly, flatbuffers, tensorflow, keras-tuner, autokeras
  Attempting uninstall: typing-extensions
    Found existing installation: typing-extensions 3.10.0.2
    Uninstalling typing-extensions-3.10.0.2:
      Successfully uninstalled typing-extensions-3.10.0.2
  Attempting uninstall: importlib-metadata
    Found existing installation: importlib-metadata 4.2.0
    Uninstalling importlib-metadata-4.2.0:
      Successfully uninstalled importlib-metadata-4.2.0
  Attempting uninstall: grpcio
    Found existing installation: grpcio 1.42.0
    Uninstalling grpcio-1.42.0:
      Successfully uninstalled grpcio-1.42.0
  Attempting uninstall: wrapt
    Found existing installation: wrapt 1.13.3
    Uninstalling wrapt-1.13.3:
      Successfully uninstalled wrapt-1.13.3
  Attempting uninstall: tensorflow-estimator
    Found existing installation: tensorflow-estimator 2.7.0
    Uninstalling tensorflow-estimator-2.7.0:
      Successfully uninstalled tensorflow-estimator-2.7.0
  Attempting uninstall: flatbuffers
    Found existing installation: flatbuffers 2.0
    Uninstalling flatbuffers-2.0:
      Successfully uninstalled flatbuffers-2.0
  Attempting uninstall: tensorflow
    Found existing installation: tensorflow 2.7.0
    Uninstalling tensorflow-2.7.0:
      Successfully uninstalled tensorflow-2.7.0
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
flake8 4.0.1 requires importlib-metadata<4.3; python_version < "3.8", but you have importlib-metadata 4.10.0 which is incompatible.
Successfully installed autokeras-1.0.16.post1 flatbuffers-1.12 grpcio-1.34.1 importlib-metadata-4.10.0 keras-nightly-2.5.0.dev2021032900 keras-tuner-1.0.4 kt-legacy-1.0.4 tensorflow-2.5.2 tensorflow-estimator-2.5.0 typing-extensions-3.7.4.3 wrapt-1.12.1
In [14]:
import autokeras as ak
import tensorflow as tf
In [20]:
TRAIN_DATA_URL = "./train.csv"
# TEST_DATA_URL = "https://storage.googleapis.com/tf-datasets/titanic/eval.csv"

# tf.keras.utils.get_file() downloads a file from a URL and requires an
# `origin` argument, so get_file("./train.csv") raised:
#   TypeError: get_file() missing 1 required positional argument: 'origin'
# Since train.csv is already on disk, the local path can be used directly.
train_file_path = TRAIN_DATA_URL
# test_file_path = "./test.csv"  # assumed local test CSV for predict()
In [ ]:
# Initialize the structured data classifier.
clf = ak.StructuredDataClassifier(
    overwrite=True, max_trials=20
)  # It tries up to 20 different pipelines.
# Feed the structured data classifier with training data.
clf.fit(
    # The path to the train.csv file.
    TRAIN_DATA_URL,
    # The name of the label column.
    "target",
    epochs=20,
)
# Predict with the best model (test_file_path should point to a CSV of test rows).
predicted_y = clf.predict(test_file_path)
In [16]:
x_train = np.array(train_df.code)
y_train = np.array(train_df.target)

clf = ak.TextClassifier(
    overwrite=True, max_trials=2
)  # It tries up to 2 different models.
# Feed the text classifier with training data.
clf.fit(x_train, y_train, epochs=5)
# Predict with the best model (the original run used an undefined X_test and
# raised a NameError; predict on the raw test snippets instead).
predicted_y = clf.predict(np.array(test_df.code))
Trial 2 Complete [00h 09m 41s]
val_loss: 0.8385043144226074

Best val_loss So Far: 0.7658016085624695
Total elapsed time: 00h 30m 53s
Epoch 1/5
1426/1426 [==============================] - 292s 204ms/step - loss: 1.2729 - accuracy: 0.5965
Epoch 2/5
1426/1426 [==============================] - 292s 205ms/step - loss: 0.7840 - accuracy: 0.7492
Epoch 3/5
1426/1426 [==============================] - 293s 205ms/step - loss: 0.6473 - accuracy: 0.7905
Epoch 4/5
1426/1426 [==============================] - 293s 205ms/step - loss: 0.5611 - accuracy: 0.8156
Epoch 5/5
1426/1426 [==============================] - 293s 206ms/step - loss: 0.5070 - accuracy: 0.8333
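Once the search finishes, the best AutoKeras pipeline can be exported as a plain Keras model. A minimal sketch (assuming `clf` has been fit as above; the file name is an arbitrary choice):

# Export the best model found during the search as a regular tf.keras model.
model = clf.export_model()
model.summary()
# Reloadable later via tf.keras.models.load_model(..., custom_objects=ak.CUSTOM_OBJECTS).
model.save("best_text_classifier", save_format="tf")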
In [ ]:

In [32]:
!nvidia-smi
Sat Jan  8 18:11:45 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.44       Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   38C    P0    34W / 250W |   4631MiB / 16280MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

KMeans Repeated-Runs Model

In [ ]:
from tqdm.notebook import tqdm
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from sklearn.feature_extraction.text import TfidfVectorizer

n_runs = 20
true_k = 15  # number of target languages
vectorizer = TfidfVectorizer(vocabulary=vocab)
X = vectorizer.fit_transform(train_df['code'])

# Fit k-means for k = 2..19 and score each clustering against the true labels.
rands = []
for k in tqdm(range(2, n_runs)):
  model = KMeans(n_clusters=k, init='k-means++', n_init=10, max_iter=1000)
  X_pred = model.fit_predict(X)
  rands.append(adjusted_rand_score(train_df.target.values, X_pred))
In [ ]:
rands
Out[ ]:
[-0.005723966168904149,
 -0.0030141713305542966,
 -0.0009790505350998808,
 -0.01332419130841455,
 -0.013773702462435521,
 -0.019544119174595292,
 -0.0019137693501294307,
 0.00013435284038645318,
 -0.016470228311597012,
 -0.013672123061487998,
 -0.006608067234219775,
 -0.009227144204211645,
 -0.004413329396046331,
 0.005547220296953219,
 -0.004317464608538917,
 0.005723101652408192,
 0.0055656819463193435,
 -0.00947724048931965]
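All of the adjusted Rand scores sit near zero, so the raw TF-IDF k-means clusters barely line up with the true languages. If a clustering were usable, cluster IDs could be mapped to labels by majority vote; a minimal sketch (assumes `X_pred` from a k=15 run and integer targets as above):

import numpy as np

def map_clusters_to_labels(cluster_ids, true_labels):
  # Give each cluster the most frequent true label among its members.
  mapping = {c: np.bincount(true_labels[cluster_ids == c]).argmax()
             for c in np.unique(cluster_ids)}
  return np.array([mapping[c] for c in cluster_ids])

# mapped = map_clusters_to_labels(X_pred, train_df.target.values)
# print("majority-vote accuracy:", (mapped == train_df.target.values).mean())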
In [ ]:
from sklearn.neural_network import MLPClassifier
classifier = MLPClassifier(verbose=True, hidden_layer_sizes=(1000,), learning_rate='adaptive',
                           early_stopping=True, max_iter=100)
classifier.fit(np.array(doc_e), train_df['target'])
In [ ]:
import pickle
with open('model2.pkl','wb+') as f:
  pickle.dump(classifier,f)
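The pickled model can then be restored in a fresh session; a minimal sketch:

import pickle

# Reload the MLP classifier saved above.
with open('model2.pkl', 'rb') as f:
    classifier = pickle.load(f)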
In [ ]:
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from catboost import CatBoostClassifier
# Reduce the 768-d embeddings to 50 PCA components, then fit CatBoost on top.
classifier = Pipeline([('vect', PCA(n_components=50)), ('clf', CatBoostClassifier(n_estimators=1000))])
classifier = classifier.fit(np.array(doc_e).astype(np.float32), train_df["target"])
In [ ]:
from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier()
classifier.fit(X_train, Y_train)
Out[ ]:
DecisionTreeClassifier()
In [ ]:
classifier['vect'].vocabulary_
In [ ]:
from skorch import NeuralNetClassifier
from torch import nn

class MyModule(nn.Module):
    def __init__(self, num_units=256, nonlin=nn.Tanh()):
        super(MyModule, self).__init__()
        # 768-d document embeddings -> two hidden layers -> 15 language classes.
        self.input_layer = nn.Linear(768, num_units)
        self.hidden = nn.Linear(num_units, num_units)
        self.output = nn.Linear(num_units, 15)
        self.nonlin = nonlin
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, X, **kwargs):
        X = self.nonlin(self.input_layer(X))
        X = self.nonlin(self.hidden(X))
        return self.softmax(self.output(X))


net = NeuralNetClassifier(
    MyModule,
    max_epochs=150,
    lr=0.003,
    iterator_train__shuffle=True,  # shuffle training data on each epoch
)
net.fit(X_res, Y_res)  # X_res / Y_res: resampled training embeddings and labels
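A quick hold-out check of the skorch net; a sketch, assuming `X_validation` / `Y_validation` hold the same kind of embedding features and labels as `X_res` / `Y_res`:

from sklearn.metrics import accuracy_score

# skorch expects float32 inputs, matching the training dtype.
val_pred = net.predict(np.array(X_validation).astype(np.float32))
print("Validation accuracy:", accuracy_score(Y_validation, val_pred))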
In [ ]:
from sklearn.linear_model import LogisticRegression
net2 = LogisticRegression()
net2.fit(X_train, Y_train)
/usr/local/lib/python3.7/dist-packages/sklearn/linear_model/_logistic.py:818: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  extra_warning_msg=_LOGISTIC_SOLVER_CONVERGENCE_MSG,
Out[ ]:
LogisticRegression()
In [ ]:
from sklearn.metrics import balanced_accuracy_score
In [ ]:
from sklearn.feature_extraction.text import TfidfVectorizer
vect=TfidfVectorizer(vocabulary=vocab)
X=vect.fit_transform(X_train)

Feature Engineering

In [ ]:
def line_starts(text):
  """Count lines that begin with common comment markers.

  The branches were left as `pass` in the original run; this minimal completion
  tallies matches so the counts can serve as features (longest markers first)."""
  markers = ("'''", "\\\\\\", "\\*", "\\\\", "#", "*")
  counts = dict.fromkeys(markers, 0)
  processed = text.replace("\t", "").split("\\n")
  for line in processed:
    line = line.strip()
    for marker in markers:
      if line.startswith(marker):
        counts[marker] += 1
        break
  return counts

def token_in_string(text):
  """Unfinished in the original run; a minimal completion that flags
  snippets containing a quoted string literal."""
  processed = text.replace("\t", "").split("\\n")
  for line in processed:
    if '"' in line or "'" in line:
      return True
  return False
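These helpers can then be expanded into columns; an illustrative sketch (the column names are ad hoc):

import pandas as pd

# One column per comment marker, plus a flag for snippets containing a quote.
start_feats = pd.DataFrame(list(train_df.code.apply(line_starts)))
train_df['has_string'] = train_df.code.apply(token_in_string)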
In [ ]:
from sklearn.metrics import f1_score, accuracy_score, balanced_accuracy_score
print("F1:", f1_score(Y_validation, classifier.predict(X_validation), average='macro'))
print("Accuracy:", accuracy_score(Y_validation, classifier.predict(X_validation)) * 100)
print("Balanced accuracy:", balanced_accuracy_score(Y_validation, classifier.predict(X_validation)) * 100)
F1: 0.8507658500642485
Accuracy: 87.81503396888012
Balanced accuracy: 83.22519142022561
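For a per-language breakdown rather than a single macro score, a minimal sketch on the same validation split:

from sklearn.metrics import classification_report

# Precision/recall/F1 per class shows which languages get confused with each other.
print(classification_report(Y_validation, classifier.predict(X_validation)))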
In [ ]:
print("F1:" ,f1_score(Y_test,classifier.predict(X_test),average='macro'))
print("Accuracy:" ,accuracy_score(Y_test,classifier.predict(X_test))*100)
F1: 0.8328047907661275
Accuracy: 88.01227262765724
In [ ]:
from sklearn.metrics import ConfusionMatrixDisplay
# plot_confusion_matrix was deprecated in scikit-learn 1.0; the class method replaces it.
ConfusionMatrixDisplay.from_estimator(classifier, X_train, Y_train)
Out[ ]:
<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay at 0x7f719e848f90>
In [ ]:
train_df[train_df.language=='python']
Out[ ]:
id code language target
5 29901 >>> insertion_sort([]) == sorted([])\n\n ... python 11
14 15780 "Full house",\n\n "Four of a k... python 11
15 51507 )\n\n if new_key is No... python 11
17 14135 The intersection is composed by two spheri... python 11
24 28362 >>> topology_sort(test_graph_2, 0, 6 * [Fa... python 11
... ... ... ... ...
45600 29860 y - mx = b\n\n And since we already have the m... python 11
45601 12500 pd_conv1 = np.ones((size_map, size... python 11
45608 28741 while True: # While we don't get ... python 11
45623 77158 check = 3\n\n x = len(coordinates)... python 11
45624 19972 self.fib_array.append(0)\n\n ... python 11

12678 rows × 4 columns

Prediction Phase ✈

In [27]:
from tqdm.notebook import tqdm
test_probs=[]
for c in tqdm(test_df.code):
  probabilities.append(to_prob(guess.probabilities(c)))
In [28]:
test_df['target'] = net.predict(np.array(test_probs))
Traceback (most recent call last):
  <ipython-input-28-7fc9522272fc>:1 in <module>
NameError: name 'net' is not defined
In [19]:
# Use transform() (not fit_transform) so the test features reuse the vocabulary
# and idf weights fitted on the training data.
X_test = tfidf.transform(test_df.code).toarray()
pd.DataFrame(np.c_[np.array(test_doc_e), X_test]).to_csv('./test.csv')
In [20]:
test_df['target'] = predictor.predict(TabularDataset('./test.csv'))
Loaded data from: ./test.csv | Columns = 1543 / 1543 | Rows = 9277 -> 9277
In [17]:
test_df['target']=clf.predict(np.array(test_df.code))
290/290 [==============================] - 16s 56ms/step
In [ ]:
test_df.target
Out[ ]:
0       11
1        7
2        1
3       11
4        2
        ..
9272     1
9273     7
9274    11
9275     2
9276     7
Name: target, Length: 9277, dtype: int64
In [27]:
test_df["prediction"] = LE.inverse_transform(test_df.target.astype(int))
In [20]:
test_df.prediction.value_counts()
Out[20]:
python         1755
c-plus-plus    1500
c-sharp        1425
c              1323
java           1189
javascript      663
go              458
ruby            352
julia           298
dart            137
f-sharp          56
swift            49
php              32
scala            25
R                15
Name: prediction, dtype: int64
In [21]:
test_df.prediction.value_counts()/test_df.shape[0]
Out[21]:
python         0.189178
c-plus-plus    0.161690
c-sharp        0.153606
c              0.142611
java           0.128166
javascript     0.071467
go             0.049369
ruby           0.037943
julia          0.032122
dart           0.014768
f-sharp        0.006036
swift          0.005282
php            0.003449
scala          0.002695
R              0.001617
Name: prediction, dtype: float64
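A quick sanity check is to compare this predicted distribution with the training distribution; a sketch using the `language` column seen earlier:

# If the test split is drawn from the same pool, the two distributions should be close.
print(train_df.language.value_counts(normalize=True))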
In [21]:
test_df["prediction"] = LE.inverse_transform(test_df.target)
In [51]:
test_df.target.unique()
Out[51]:
array([ 8,  2,  1,  3,  6,  7, 11, 12,  9,  5,  4,  0, 14, 10, 13])

Generating Prediction File

In [23]:
test_df = test_df.sample(frac=1)  # shuffle rows so head(20) previews a varied sample
test_df.head(20)
Out[23]:
id code target prediction
1109 97932 using System.Linq;\n\n using System.Numerics;\n\n using Algorithms.Sequences;\n\n using FluentAssertions;\n\n using NUnit.Framework;\n\n namespace Algorithms.Tests.Sequences\n 3 c-sharp
7978 36894 public class DynamicProgrammingKnapsackSolver<T>\n\n {\n\n /// <summary>\n\n /// Returns the knapsack containing the items that\n\n /// maximize value while not exceeding weight capacity.\n 3 c-sharp
31 28645 def __iter__(self):\n\n node = self.head\n\n while len(node.forward) != 0:\n\n yield node.forward[0].key\n 11 python
8580 10577 using System;\n\n namespace Algorithms.Search\n\n {\n\n /// <summary>\n\n /// Jump Search checks fewer elements by jumping ahead by fixed steps.\n 3 c-sharp
8095 89645 /* Let's ensure our sequence has only Positive Integers */\n\n if (sequence.some((integer) => integer < 0)) {\n\n throw RangeError('Sequence must be a list of Positive integers Only!')\n\n }\n 8 javascript
4311 13244 ulong[] pPow = new ulong[Math.Max(pattern.Length, text.Length)];\n\n pPow[0] = 1;\n\n for (var i = 1; i < pPow.Length; i++)\n\n {\n\n pPow[i] = pPow[i - 1] * p % m;\n\n }\n 3 c-sharp
5716 25794 */\n\n public Node findSuccessor(Node n) {\n\n if (n.right == null) return n;\n\n Node current = n.right;\n\n Node parent = n.right;\n 7 java
5972 54410 */\n\n char is_leap_year(short year)\n\n {\n\n if ((year % 400 == 0) || ((year % 4 == 0) && (year % 100 != 0)))\n\n return 1;\n\n return 0;\n 1 c
3895 34812 bool Isposs(int a, int b, int c) {\n\n return (c % gcd(a, b) == 0);\n\n }\n\n //Driver function for Linear Diophantine Equations\n\n int main() {\n\n int a = 3, b = 6, c = 9;\n 4 dart
4827 26577 return catalanArray[n];\n\n }\n\n // Main method\n\n public static void main(String[] args) {\n\n Scanner sc = new Scanner(System.in);\n 7 java
4299 24908 #!/bin/python3\n\n # Doomsday algorithm info: https://en.wikipedia.org/wiki/Doomsday_rule\n\n DOOMSDAY_LEAP = [4, 1, 7, 4, 2, 6, 4, 1, 5, 3, 7, 5]\n\n DOOMSDAY_NOT_LEAP = [3, 7, 7, 4, 2, 6, 4, 1, 5, 3, 7, 5]\n\n WEEK_DAY_NAMES = {\n 11 python
5971 77829 """\n\n Calculate Greatest Common Divisor (GCD).\n\n >>> greatest_common_divisor(24, 40)\n 11 python
8644 18837 min_f_score = f_score;\n\n it_low_f_score = iter;\n\n }\n\n }\n 2 c-plus-plus
3628 10402 \t\t\tt.Errorf("height of left child should be 1")\n\n \t\t}\n\n \t\tif root.Right.Key != 5 {\n\n \t\t\tt.Errorf("right child should have value = 5")\n\n \t\t}\n 6 go
2588 20670 \t\t}\n\n \t}\n\n }\n\n // Generate returns a int slice of prime numbers up to the limit\n\n func Generate(limit int) []int {\n\n \tvar primes []int\n 6 go
1151 16354 /// <summary>\n\n /// Return the pseudoinverse of a matrix based on the Moore-Penrose Algorithm.\n\n /// using Singular Value Decomposition (SVD).\n\n /// </summary>\n\n /// <param name="inMat">Input matrix to find its inverse to.</param>\n 3 c-sharp
2572 61642 To better understand the algorithm, see also:\n\n https://github.com/akashvshroff/Gale_Shapley_Stable_Matching (README).\n\n https://www.youtube.com/watch?v=Qcv1IqHWAzg&t=13s (Numberphile YouTube).\n\n >>> donor_pref = [[0, 1, 3, 2], [0, 2, 3, 1], [1, 0, 2, 3], [0, 3, 1, 2]]\n\n >>> recipient_pref = [[3, 1, 2, 0], [3, 1, 0, 2], [0, 3, 1, 2], [1, 0, 3, 2]]\n\n >>> print(stable_matching(donor_pref, recipient_pref))\n 11 python
2613 16182 result_test2 = sorting::selectionSort(vector2, vector2size);\n\n assert(std::is_sorted(result_test2.begin(), result_test2.end()));\n\n std::cout << "Passed" << std::endl;\n 2 c-plus-plus
1149 21295 self.indicator.translatesAutoresizingMaskIntoConstraints = false\n\n self.indicator.backgroundColor = .lightGray\n\n self.addSubview(self.indicator)\n 11 python
5081 33070 version = "0.7.0"\n\n [[SharedArrays]]\n\n deps = ["Distributed", "Mmap", "Random", "Serialization"]\n 9 julia
In [ ]:
n=0
for c in test_df.code.values:
  if len(c)<10:
    print(c)
    print('----------')
    n+=1
In [ ]:
n
Out[ ]:
143
In [ ]:
# Inspect python-predicted snippets that contain a literal tab character.
for c in test_df[test_df.prediction=='python'].code.values:
  if "\t" in c:
    print(c)
In [22]:
!rm -rf assets
!mkdir assets
test_df.to_csv(os.path.join("assets", "submission.csv"))
In [33]:
# Snippet length after stripping surrounding whitespace.
test_df['length'] = test_df.code.apply(lambda x: len(x.strip()))
In [34]:
test_df[test_df.length<15]
Out[34]:
id code target prediction length
18 16981 })\n\n })\n\n })\n 8 javascript 14
25 11708 }\n 2 c-plus-plus 1
90 29463 return 0;\n\n }\n 2 c-plus-plus 13
103 90728 return 0;\n\n }\n 2 c-plus-plus 13
149 20465 }\n\n }\n\n }\n 3 c-sharp 13
... ... ... ... ... ...
9219 10615 return S;\n\n }\n 2 c-plus-plus 13
9253 18556 return 0;\n\n }\n 2 c-plus-plus 13
9257 41360 }\n\n }\n 3 c-sharp 5
9261 17016 }\n\n }\n 3 c-sharp 5
9267 45728 }\n 2 c-plus-plus 1

244 rows × 5 columns
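Snippets this short (a bare `}` or `return 0;`) are nearly language-agnostic, so any prediction for them is close to a coin flip. A purely hypothetical fallback could relabel them with the globally most frequent class; a sketch (the threshold of 3 is arbitrary):

# Hypothetical: override near-empty snippets with the most common predicted class.
most_common = test_df.prediction.value_counts().idxmax()
test_df.loc[test_df.length < 3, 'prediction'] = most_common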

In [ ]:
test_df.shape
Out[ ]:
(9277, 5)

Submitting our Predictions

Note: Please save the notebook before submitting it (Ctrl + S)

In [ ]:
%aicrowd notebook submit -c programming-language-classification -a assets --no-verify
Loading config from /root/.config/aicrowd-cli/config.toml
Config loaded
In [32]:
test_df.to_csv('./out1.csv')
In [32]:
probs=predictor.predict_proba(TabularDataset('./test.csv'))
INFO     Loaded data from: ./test.csv | Columns = 845 / 845 | Rows = 9277 -> 9277            
In [46]:
probs.values[18]
Out[46]:
array([5.0928467e-04, 1.1160081e-02, 1.5288522e-02, 2.8058207e-02,
       2.4775441e-03, 4.7447262e-05, 1.4102720e-01, 1.4528935e-02,
       7.7810133e-01, 1.0754560e-04, 2.5625876e-04, 2.6031123e-03,
       1.4028563e-04, 1.8877623e-03, 3.8064935e-03], dtype=float32)
In [ ]:
classifier.predict_proba(test_df['code'][test_df.id==10489])
Out[ ]:
array([[1.91164568e-05, 1.92815804e-04, 3.32183194e-02, 6.36783603e-04,
        6.61995848e-04, 2.23867596e-05, 5.89270683e-02, 6.61368886e-02,
        1.51140791e-02, 1.83293243e-02, 9.15378369e-05, 8.06442630e-01,
        1.46123115e-04, 1.25006287e-05, 4.84300540e-05]])
In [ ]:
test_df.loc[test_df.id==10489,'code']
Traceback (most recent call last):
  <ipython-input-61-dabada34438b>:1 in <module>
  /usr/local/lib/python3.7/dist-packages/scipy/sparse/_index.py:171 in _asindices
IndexError: index (1499) out of range
In [ ]:
# Note: transform() returns a 2-D (1, n_features) array, so argsort/[::-1]/f[:50]
# here operate on rows rather than sorting features; the next cell redoes this
# correctly on the 1-D vector.
f = np.argsort(classifier['tfidf'].transform(test_df['code'][test_df.id==10489]).toarray())[::-1]
mylist = sorted(classifier['tfidf'].vocabulary_, key=classifier['tfidf'].vocabulary_.get)
print(np.array(mylist)[f[:50]])
for feat, imp in zip(np.array(mylist)[f[:50]], classifier['tfidf'].transform(test_df['code'][test_df.id==10489]).toarray()[0][f[:50]]):
  print(feat, imp)
[['\t' 'port' 'pop' ... 'generate' 'io' 'password']]
['\t' 'port' 'pop' ... 'generate' 'io' 'password'] [0.         0.         0.         ... 0.45898756 0.48850382 0.58760062]
In [ ]:
clf_pred=classifier['tfidf'].transform(test_df['code'][test_df.id==10489]).toarray()[0]
mylist=sorted(classifier['tfidf'].vocabulary_, key=classifier['tfidf'].vocabulary_.get)
f=np.argsort(clf_pred)[::-1]
myslice=f[:50]
clf_pred[myslice]
Out[ ]:
array([0.58760062, 0.48850382, 0.45898756, 0.34914999, 0.28898828,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ])
In [ ]:
np.array(mylist)[myslice]
Out[ ]:
array(['password', 'io', 'generate', 'math', 'returns', '~/', 'evaluate',
       'equation', 'err', 'error', 'errorf', 'errors', 'euler', 'event',
       'even', 'equal', 'every', 'exact', 'exactly', 'example',
       'examples', 'exception', 'execute', 'exist', 'equals', 'epsilon',
       'exit', 'encoding', 'element', 'elementindex', 'elements', 'elif',
       'else', 'empty', 'en', 'encode', 'encoded', 'encrypt', 'enumerate',
       'encrypted', 'encryption', 'end', 'endif', 'endindex', 'endl',
       'enqueue', 'enter', 'entry', 'exists', 'expected'], dtype='<U27')
In [ ]:
from sklearn.feature_selection import mutual_info_classif

mf=mutual_info_classif(test_vect.transform(train_df['code']),train_df.target)
In [ ]:
test_vect.vocabulary_[np.argsort(mf)[::-1]]
Traceback (most recent call last):
  <ipython-input-84-65f744701d80>:1 in <module>
TypeError: unhashable type: 'numpy.ndarray'
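`vocabulary_` is a dict mapping token to column index, so it cannot be indexed with an array; inverting it first gives the intended ranking. A minimal sketch:

# Invert the vocabulary (column index -> token) and list the most informative tokens.
inv_vocab = {idx: tok for tok, idx in test_vect.vocabulary_.items()}
top_tokens = [inv_vocab[i] for i in np.argsort(mf)[::-1][:25]]
print(top_tokens)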
In [ ]:
predictor.leaderboard()
                  model  score_val  pred_time_val     fit_time  pred_time_val_marginal  fit_time_marginal  stack_level  can_infer  fit_order
0   WeightedEnsemble_L2     0.8352      22.604783   897.400392                0.004708           1.339523            2       True         13
1            LightGBMXT     0.8244      13.267164   187.755843               13.267164         187.755843            1       True          4
2         LightGBMLarge     0.8156       4.561145   131.926170                4.561145         131.926170            1       True         12
3              LightGBM     0.8116       2.556177    78.689046                2.556177          78.689046            1       True          5
4       NeuralNetFastAI     0.8060       3.448163   286.104741                3.448163         286.104741            1       True          3
5               XGBoost     0.7852       0.593962    78.097259                0.593962          78.097259            1       True         11
6        ExtraTreesGini     0.7776       0.729640   212.176856                0.729640         212.176856            1       True          9
7      RandomForestGini     0.7672       0.723134   188.980253                0.723134         188.980253            1       True          6
8        ExtraTreesEntr     0.7664       0.624622   188.435978                0.624622         188.435978            1       True         10
9      RandomForestEntr     0.7492       0.859791   199.771033                0.859791         199.771033            1       True          7
10             CatBoost     0.7360       0.146541  1872.321867                0.146541        1872.321867            1       True          8
11       KNeighborsUnif     0.2028       0.106934     0.272218                0.106934           0.272218            1       True          1
12       KNeighborsDist     0.1992       0.105404     0.253309                0.105404           0.253309            1       True          2
