Speaker Identification
Solution for submission 170945
A detailed solution for submission 170945 submitted for challenge Speaker Identification
Getting Started with Speaker Identification
In this puzzle, we have to cluster together the sentences spoken by the same speaker.
In this starter notebook:
For tokenization and feature extraction: We will use TfidfVectorizer.
For clustering: We will use K-Means (a clustering algorithm, not a classifier).
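The two steps above can be sketched end to end on a handful of toy sentences. This is only an illustration: the sentences, the choice of two clusters, and the `random_state` value are assumptions made for the example, not part of the challenge data or the submitted solution.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy sentences (made up for illustration, not from the challenge data).
sentences = [
    "the cat sat on the mat",
    "the cat chased the mouse",
    "stocks fell sharply on monday",
    "stocks rallied after the report",
]

# Turn each sentence into a sparse TF-IDF vector.
vectorizer = TfidfVectorizer()
features = vectorizer.fit_transform(sentences)

# Group the vectors into two clusters; random_state fixes the seed.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(features)

print(features.shape)  # (number of sentences, vocabulary size)
print(labels)          # one cluster id per sentence
```

The cluster ids are arbitrary integers; only the grouping they induce is meaningful.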
In [1]:
!pip install aicrowd-cli
%load_ext aicrowd.magic
Login to AIcrowd ㊗
In [2]:
%aicrowd login
Download Dataset
We will create a folder named data and download the files there.
In [3]:
!rm -rf data
!mkdir data
%aicrowd ds dl -c speaker-identification -o data
In [4]:
import os
import re
import pandas as pd
from wordcloud import WordCloud
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans #, AgglomerativeClustering
In [5]:
test_df = pd.read_csv("data/test.csv")
In [6]:
test_df.head()
Out[6]:
In [7]:
test_df.sentence[0]
Out[7]:
In [8]:
sub_df = pd.read_csv("data/sample_sub.csv")
In [9]:
sub_df.head()
Out[9]:
In [10]:
# Remove punctuation (, . ! ?), lower-case the text, and replace newlines with spaces
test_df.sentence = test_df.sentence.apply(lambda x: re.sub(r'[,.!?]', '', x))
test_df.sentence = test_df.sentence.apply(lambda x: x.lower())
test_df.sentence = test_df.sentence.apply(lambda x: x.replace("\n", " "))
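The same cleaning can also be written with pandas' vectorized string methods instead of `apply`. A minimal sketch, using a made-up two-row frame (`demo_df`, a hypothetical stand-in for `test_df`):

```python
import pandas as pd

# Hypothetical two-row frame standing in for test_df.
demo_df = pd.DataFrame(
    {"sentence": ["Hello, World!\nHow are you?", "Fine. Thanks!"]}
)

cleaned = (
    demo_df["sentence"]
    .str.replace(r"[,.!?]", "", regex=True)  # drop the four punctuation marks
    .str.lower()                             # lower-case everything
    .str.replace("\n", " ", regex=False)     # newlines become spaces
)

print(cleaned.tolist())
```

Passing `regex=True` or `regex=False` explicitly avoids relying on the default, which has changed between pandas versions.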
In [11]:
test_df.sentence.values
Out[11]:
In [12]:
test_df.head()
Out[12]:
In [13]:
long_string = ','.join(list(test_df.sentence.values))
# Create a WordCloud object
wordcloud = WordCloud(background_color="silver", max_words=1000, contour_width=3, contour_color='steelblue')
# Generate a word cloud
wordcloud.generate(long_string)
# Visualize the word cloud
wordcloud.to_image()
Out[13]:
In [14]:
vectorizer = TfidfVectorizer()  # alternative: TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(test_df.sentence)
In [15]:
X[0].todense().shape
Out[15]:
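To get a feel for what the TF-IDF matrix contains, it can help to look at a tiny example. The sketch below uses two made-up documents (not the challenge data) and shows that there is one row per document, one column per distinct token, and that rows are L2-normalised by default.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["good morning everyone", "good night"]  # toy input, not challenge data

vec = TfidfVectorizer()
matrix = vec.fit_transform(docs)

# One row per document, one column per distinct token.
print(matrix.shape)

# vocabulary_ maps each token to its column index.
print(sorted(vec.vocabulary_))

# With the default norm='l2', the squared entries of each row sum to 1.
row_sq_norms = matrix.multiply(matrix).sum(axis=1)
print(row_sq_norms)
```

On the real data, the second dimension of `X[0].todense().shape` is the vocabulary size in the same sense.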
Generating Predictions
Clustering using K-Means.
In [16]:
true_k = 10
#model = KMeans(n_clusters=true_k, init='k-means++', max_iter=1000)
#model = KMeans(n_clusters=true_k, init='k-means++', max_iter=5000)#, n_init=10)
# 'lloyd' is the classic K-Means algorithm; scikit-learn releases before 1.1 call it 'full'
model = KMeans(n_clusters=true_k, max_iter=2500, algorithm='lloyd')
#model = MiniBatchKMeans(n_clusters=true_k)
#model = AgglomerativeClustering(n_clusters=true_k)
In [17]:
model.fit(X)
Out[17]:
In [18]:
submission = test_df.copy()  # copy so the new column is not also added to test_df
In [19]:
# X already holds the TF-IDF vectors for every sentence, so predict them all in one call
submission['prediction'] = model.predict(X)
In [20]:
submission.head()
Out[20]:
In [ ]:
!rm -rf assets
!mkdir assets
submission.to_csv(os.path.join("assets", "submission.csv"))
Submitting our Predictions
Note: Please save the notebook before submitting it (Ctrl + S)
In [ ]:
%aicrowd notebook submit -c speaker-identification -a assets --no-verify