Speaker Identification
Getting Started Notebook for Speaker Identification
A getting started notebook with random submission for the challenge.
Getting Started with Speaker Identification
In this puzzle, we have to cluster together the sentences spoken by the same speaker.
In this starter notebook:
For tokenization and vectorization: we will use TfidfVectorizer.
For clustering: we will use K-Means (a clustering algorithm, not a classifier).
In [ ]:
!pip install aicrowd-cli
%load_ext aicrowd.magic
Log in to AIcrowd
In [ ]:
%aicrowd login
Download Dataset
We will create a folder named data and download the files there.
In [ ]:
!rm -rf data
!mkdir data
%aicrowd ds dl -c speaker-identification -o data
In [ ]:
import os
import re
import pandas as pd
from wordcloud import WordCloud
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
In [ ]:
test_df = pd.read_csv("data/test.csv")
In [ ]:
test_df.head()
Out[ ]:
In [ ]:
test_df.sentence[0]
Out[ ]:
In [ ]:
sub_df = pd.read_csv("data/sample_sub.csv")
In [ ]:
sub_df.head()
Out[ ]:
In [ ]:
# Remove punctuation (, . ! ?), lowercase the text, and replace newlines with spaces
test_df.sentence = test_df.sentence.apply(lambda x: re.sub(r'[,.!?]', '', x))
test_df.sentence = test_df.sentence.apply(lambda x: x.lower())
test_df.sentence = test_df.sentence.apply(lambda x: x.replace("\n", " "))
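The three cleaning steps can be checked on a single string before running them on the whole column. The sketch below wraps them in a helper named `clean_sentence`, which is a hypothetical name used only for illustration; the example sentence is made up and does not come from the dataset.

```python
import re


def clean_sentence(text: str) -> str:
    """Hypothetical helper mirroring the cleaning steps applied to test_df.sentence."""
    text = re.sub(r"[,.!?]", "", text)  # drop commas, periods, exclamation and question marks
    text = text.lower()                 # lowercase everything
    return text.replace("\n", " ")      # turn newlines into spaces


# Made-up example sentence, not taken from the dataset
example = "Hello, World!\nHow are you?"
cleaned = clean_sentence(example)
print(cleaned)
```

Note that only four punctuation characters are removed; quotes, semicolons, and dashes are left in place.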
In [ ]:
test_df.head()
Out[ ]:
In [ ]:
long_string = ','.join(list(test_df.sentence.values))
# Create a WordCloud object
wordcloud = WordCloud(background_color="silver", max_words=1000, contour_width=3, contour_color='steelblue')
# Generate a word cloud
wordcloud.generate(long_string)
# Visualize the word cloud
wordcloud.to_image()
Out[ ]:
In [ ]:
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(test_df.sentence)
In [ ]:
print(type(X))
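`fit_transform` returns a SciPy sparse matrix with one row per sentence and one column per vocabulary term. A minimal sketch on three made-up sentences shows what that looks like; the `toy_*` names and the sentences are assumptions for illustration only.

```python
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up sentences, not taken from the dataset
toy_sentences = [
    "the ship sails at dawn",
    "the ship returns at dusk",
    "dragons guard the mountain gold",
]

toy_vectorizer = TfidfVectorizer(stop_words="english")
toy_X = toy_vectorizer.fit_transform(toy_sentences)

print(sparse.issparse(toy_X))             # the result is a sparse matrix
print(toy_X.shape)                        # (number of sentences, vocabulary size)
print(sorted(toy_vectorizer.vocabulary_)) # learned terms; English stop words are dropped
```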
Generating Predictions
Clustering using K-Means.
In [ ]:
true_k = 10
model = KMeans(n_clusters=true_k, init='k-means++', max_iter=100)
In [ ]:
model.fit(X)
Out[ ]:
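The notebook fixes the number of clusters at 10. If the number of speakers is not given by the challenge, one possible way to compare candidate values is the silhouette score. The sketch below is not part of the original notebook: the sentences, the candidate range, and `random_state=0` are assumptions chosen for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score

# Made-up sentences, not taken from the dataset
toy = [
    "the captain ordered the crew to raise the sails",
    "raise the anchor and set the sails said the captain",
    "the crew cheered as the ship left the harbor",
    "the wizard opened his ancient spell book",
    "a spell of fire burst from the wizard staff",
    "the ancient dragon feared the wizard magic",
    "the detective examined the muddy footprints",
    "footprints led the detective to the locked cellar",
]
X_toy = TfidfVectorizer(stop_words="english").fit_transform(toy)

scores = {}
for k in range(2, 5):  # candidate cluster counts (assumed range)
    km = KMeans(n_clusters=k, init="k-means++", max_iter=100,
                n_init=10, random_state=0)
    labels = km.fit_predict(X_toy)
    # Silhouette score lies in [-1, 1]; higher means tighter, better separated clusters
    scores[k] = silhouette_score(X_toy, labels)

best_k = max(scores, key=scores.get)
print(scores)
print(best_k)
```

On the real data the loop would run over `X` instead of `X_toy`, with a wider range of `k`.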
In [ ]:
# Copy so that adding the prediction column does not modify test_df
submission = test_df.copy()
In [ ]:
# X holds the TF-IDF rows in the same order as test_df, so predict them all in one call
submission['prediction'] = model.predict(X)
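Predicting one sentence at a time and predicting the whole TF-IDF matrix in a single call should assign the same cluster labels, since both use the same fitted centroids. A minimal check on toy data follows; the sentences, the `toy_*` names, and `random_state=0` are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up sentences, not taken from the dataset
toy_sentences = [
    "the captain ordered the crew to raise the sails",
    "raise the anchor and set the sails said the captain",
    "the wizard opened his ancient spell book",
    "the ancient dragon feared the wizard magic",
]

toy_vectorizer = TfidfVectorizer(stop_words="english")
toy_X = toy_vectorizer.fit_transform(toy_sentences)
toy_model = KMeans(n_clusters=2, init="k-means++", max_iter=100,
                   n_init=10, random_state=0).fit(toy_X)

# One predict call per sentence
per_row = np.array([
    toy_model.predict(toy_vectorizer.transform([s]))[0] for s in toy_sentences
])
# A single predict call on the whole matrix
batch = toy_model.predict(toy_X)

print(np.array_equal(per_row, batch))
```

The batch call avoids re-vectorizing every sentence, which matters once the test set has thousands of rows.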
In [ ]:
submission.head()
Out[ ]:
In [ ]:
!rm -rf assets
!mkdir assets
submission.to_csv(os.path.join("assets", "submission.csv"))
Submitting our Predictions
Note: Please save the notebook before submitting it (Ctrl + S).
In [ ]:
%aicrowd notebook submit -c speaker-identification -a assets --no-verify
In [ ]: