NLP Feature Engineering
Solution for submission 148413
A detailed solution for submission 148413 submitted for challenge NLP Feature Engineering
Solution for NLP Feature Engineering (LB: 0.803)
This solution uses a count vectorizer, a TF-IDF transform, and a stopword filter for feature engineering.
AIcrowd Runtime Configuration 🧷
Define configuration parameters. Please include any files needed for the notebook to run under ASSETS_DIR. We will copy the contents of this directory to your final submission file 🙂
The dataset is available under /data on the workspace.
import os
# Please use the absolute path for the location of the dataset.
# Or you can use a relative path with `os.getcwd() + "/test_data/test.csv"`
AICROWD_DATASET_PATH = os.getenv("DATASET_PATH", os.getcwd()+"/data/data.csv")
AICROWD_OUTPUTS_PATH = os.getenv("OUTPUTS_DIR", "")
AICROWD_ASSETS_DIR = os.getenv("ASSETS_DIR", "assets")
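The fallback behaviour of these settings can be illustrated with a minimal sketch. The `resolve_setting` helper, the two example environments, and the `/data/test.csv` and `/outputs` values are hypothetical and only for illustration; the notebook itself calls `os.getenv` directly, which behaves like `os.environ.get`.

```python
import os


def resolve_setting(name, default, env=os.environ):
    """Return env[name] if it is set, otherwise the given default."""
    return env.get(name, default)


# Hypothetical environments: an empty one (local run) and one where the
# evaluator has injected its own paths. The values are made up.
local_env = {}
evaluator_env = {"DATASET_PATH": "/data/test.csv", "OUTPUTS_DIR": "/outputs"}

local_dataset = resolve_setting("DATASET_PATH", "data/data.csv", env=local_env)
evaluator_dataset = resolve_setting("DATASET_PATH", "data/data.csv", env=evaluator_env)

# With an empty OUTPUTS_DIR, os.path.join leaves the file name unchanged,
# so the submission is written to the current working directory.
local_submission = os.path.join(
    resolve_setting("OUTPUTS_DIR", "", env=local_env), "submission.csv"
)
evaluator_submission = os.path.join(
    resolve_setting("OUTPUTS_DIR", "", env=evaluator_env), "submission.csv"
)
```

Locally the defaults apply, so the notebook reads from the `data` folder and writes `submission.csv` next to itself; when the variables are set, the same code reads and writes wherever they point.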
Install packages 🗃
We use scikit-learn for count vectorization and TF-IDF, and gensim for stopword removal.
!pip install --upgrade scikit-learn gensim
!pip install -q -U aicrowd-cli
Define preprocessing code 💻
The code that is common between the training and the prediction sections should be defined here. During evaluation, we completely skip the training section. Please make sure to add any common logic between the training and prediction sections here.
from glob import glob
import os
import pandas as pd
import numpy as np
from sklearn import model_selection
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, accuracy_score
import sklearn
Training phase ⚙️
You can define your training code here. This section will be skipped during evaluation.
For this solution approach there is no training needed! 🙂
# Downloading the Dataset
!mkdir data
Prediction phase 🔎
Generate the features for the test dataset.
test_dataset = pd.read_csv(AICROWD_DATASET_PATH)
test_dataset
from gensim.parsing.preprocessing import remove_stopwords
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer(max_features = 512, ngram_range=(1, 3))
X_train_counts = count_vect.fit_transform([remove_stopwords(i) for i in test_dataset.text.tolist()])
from sklearn.feature_extraction.text import TfidfTransformer
tf_transformer = TfidfTransformer(use_idf=True).fit(X_train_counts)
X_train_tf = tf_transformer.transform(X_train_counts)
X_train_tf = np.round(X_train_tf.toarray()*5).astype(int)
# Bracket assignment creates the column even if it does not already exist
test_dataset["feature"] = [str(i) for i in X_train_tf.tolist()]
test_dataset
test_dataset.to_csv(os.path.join(AICROWD_OUTPUTS_PATH,'submission.csv'), index=False)
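To see what the steps above produce, here is a minimal sketch on a toy corpus. The three sentences and the `max_features=8` setting are made up for illustration, and scikit-learn's built-in `stop_words="english"` list is swapped in for gensim's `remove_stopwords` so that the example needs only scikit-learn and NumPy; the two stopword lists are not identical.

```python
import ast

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Hypothetical stand-in for test_dataset.text
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats are pets",
]

# Count unigrams, bigrams and trigrams, keeping only the most frequent ones.
# stop_words="english" is an assumption replacing gensim's remove_stopwords.
count_vect = CountVectorizer(max_features=8, ngram_range=(1, 3), stop_words="english")
counts = count_vect.fit_transform(docs)

# TF-IDF with the default L2 normalisation keeps every value in [0, 1].
tfidf = TfidfTransformer(use_idf=True).fit_transform(counts)

# Scale by 5 and round, so each feature becomes an integer between 0 and 5.
features = np.round(tfidf.toarray() * 5).astype(int)

# Each row is stored as the string form of a Python list, as in the notebook.
feature_strings = [str(row) for row in features.tolist()]

# The string can be parsed back into a list of ints when needed.
parsed = ast.literal_eval(feature_strings[0])
```

Because the TF-IDF rows are L2-normalised and non-negative, multiplying by 5 before rounding gives a coarse six-level integer encoding, which keeps the serialised feature strings short.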
Submit to AIcrowd
!DATASET_PATH=$AICROWD_DATASET_PATH \
aicrowd -v notebook submit \
--assets-dir $AICROWD_ASSETS_DIR \
--challenge nlp-feature-engineering