Data Purchasing Challenge 2022
Simple Way to Detect Noisy Labels with OpenCV
Using OpenCV to enhance your training and purchasing strategy
Simple Way to Detect Noisy Labels¶
By : Leo C. D.
As we know, the main challenge of Round 2 is noisy and imbalanced labels.
So here's my simple way to detect and eliminate noisy labels, especially for scratch marks, using cv2.
First, let's load the data from Round 1, since we know its labels are quite clean and balanced.
Setting up¶
import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
import seaborn as sns
import cv2
from PIL import Image
from tqdm import tqdm
#@title Login to AIcrowd
!pip install -U aicrowd-cli > /dev/null
!aicrowd login 2> /dev/null
#@title Magic Box ⬛ { vertical-output: true, display-mode: "form" }
try:
    import os
    if first_run and os.path.exists("/content/data-purchasing-challenge-2022-starter-kit/data/unlabelled"):
        first_run = False
except NameError:
    # first_run is not defined yet the very first time this cell runs
    first_run = True

if first_run:
    %cd /content/
    !git clone http://gitlab.aicrowd.com/zew/data-purchasing-challenge-2022-starter-kit.git > /dev/null
    %cd data-purchasing-challenge-2022-starter-kit
    !aicrowd dataset list -c data-purchasing-challenge-2022
    !aicrowd dataset download -c data-purchasing-challenge-2022
    !mkdir -p data/
    !mv *.tar.gz data/ && cd data && echo "Extracting dataset" && ls *.tar.gz | xargs -n1 -I{} bash -c "tar -xvf {} > /dev/null"

def run_pre_training_phase():
    from run import ZEWDPCBaseRun
    run = ZEWDPCBaseRun()
    run.pre_training_phase = pre_training_phase
    run.pre_training_phase(self=run, training_dataset=training_dataset)
    # NOTE: It is critical that the checkpointing works in a self-contained way,
    # as the evaluators might choose to run the different phases separately.
    run.save_checkpoint("/tmp/pretrainig_phase_checkpoint.pickle")

def run_purchase_phase():
    from run import ZEWDPCBaseRun
    run = ZEWDPCBaseRun()
    run.pre_training_phase = pre_training_phase
    run.purchase_phase = purchase_phase
    run.load_checkpoint("/tmp/pretrainig_phase_checkpoint.pickle")
    # Hacky way to make it work in notebook
    unlabelled_dataset.purchases = set()
    run.purchase_phase(self=run, unlabelled_dataset=unlabelled_dataset, training_dataset=training_dataset, budget=3000)
    run.save_checkpoint("/tmp/purchase_phase_checkpoint.pickle")
    del run

def run_prediction_phase():
    from run import ZEWDPCBaseRun
    run = ZEWDPCBaseRun()
    run.pre_training_phase = pre_training_phase
    run.purchase_phase = purchase_phase
    run.prediction_phase = prediction_phase
    run.load_checkpoint("/tmp/purchase_phase_checkpoint.pickle")
    run.prediction_phase(self=run, test_dataset=val_dataset)
    del run
!rm -r data/unlabelled
!tar -xzvf "/content/data-purchasing-challenge-2022-starter-kit/data/unlabelled.tar.gz" -C "./data/"
!mkdir data_round2
!unzip "./unlabelled-v0.2-rc4.zip" -d "./data_round2"
Getting Clean Label Info from Round 1 Data¶
labelu_r1 = pd.read_csv('./data/unlabelled/labels.csv')  # forward slashes so the path also works on Linux/Colab
labelu_r1
Let's select the heavily damaged images, i.e. the ones that carry all four scratch and dent labels.
labelscr = labelu_r1[(labelu_r1['scratch_small']==1) & (labelu_r1['scratch_large']==1) & (labelu_r1['dent_small']==1)& (labelu_r1['dent_large']==1)]
labelscr
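The same selection can be written without repeating the comparison for every column. Here is a minimal sketch on a toy table; it assumes the four label columns used above, while the `filename` column name and all the values are made up for illustration:

```python
import pandas as pd

# Toy stand-in for labels.csv; the "filename" column name and all values are hypothetical.
toy = pd.DataFrame({
    "filename": ["a.png", "b.png", "c.png"],
    "scratch_small": [1, 1, 0],
    "scratch_large": [1, 0, 0],
    "dent_small": [1, 1, 1],
    "dent_large": [1, 1, 0],
})

label_cols = ["scratch_small", "scratch_large", "dent_small", "dent_large"]

# Keep only the rows where every label column equals 1.
all_defects = toy[(toy[label_cols] == 1).all(axis=1)]
print(all_defects["filename"].tolist())
```

Listing the columns once makes it easy to try other label combinations later.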
Let's check one of these images.
i = 3
fname = r'./data/unlabelled/images/' + labelscr.iloc[i,0]
im = Image.open(fname)
cv_image = np.array(im)
plt.figure(figsize=(8,8))
plt.imshow(cv_image, interpolation='none')
plt.show()
We will use the method from my first notebook (pls leave some likes ❤️!).
Let's apply it to the images!
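# 4-neighbour Laplacian kernel: responds strongly to thin, high-contrast edges such as scratches
kernel = np.array([[0, -1, 0],
                   [-1, 4, -1],
                   [0, -1, 0]])
sharp_img = cv2.filter2D(src=cv_image, ddepth=-1, kernel=kernel)
# NOTE: PIL loads images as RGB, so COLOR_BGR2GRAY swaps the red and blue weights.
# It is kept as-is so the values stay consistent with the thresholds used below.
sharp_img = cv2.cvtColor(sharp_img, cv2.COLOR_BGR2GRAY)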
plt.figure(figsize=(8,8))
plt.imshow(sharp_img, interpolation='none')
plt.show()
Now you can see that the filter separates the scratch defect from the background. The maximum value of the filtered image can also be extracted as a feature; I will explain later why this is important.
sharp_img.max()
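To get a feel for why the maximum response is a useful feature, here is a small NumPy-only sketch. It is a stand-in for `cv2.filter2D` with the same kernel, computed on interior pixels only, and the two synthetic images are made up for illustration:

```python
import numpy as np

def laplacian_response(gray):
    """4-neighbour Laplacian of a 2-D uint8 image, clipped to [0, 255] (interior pixels only)."""
    g = gray.astype(np.int32)
    resp = (4 * g[1:-1, 1:-1]
            - g[:-2, 1:-1] - g[2:, 1:-1]     # pixels above and below
            - g[1:-1, :-2] - g[1:-1, 2:])    # pixels left and right
    return np.clip(resp, 0, 255).astype(np.uint8)

# A perfectly flat surface, and the same surface with a one-pixel-wide bright line.
flat = np.full((32, 32), 120, dtype=np.uint8)
scratched = flat.copy()
scratched[16, 4:28] = 200

flat_max = int(laplacian_response(flat).max())
scratch_max = int(laplacian_response(scratched).max())
print(flat_max, scratch_max)  # 0 240
```

A uniform region gives no response at all, while a thin line gives a large one, which is why a single maximum already carries a lot of information about scratches.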
Now let's enhance the intensity of the defects and find the contours!
ret,sharp_img1 = cv2.threshold(sharp_img,30,255,cv2.THRESH_BINARY)
kernel2 = cv2.getStructuringElement(cv2.MORPH_ELLIPSE,(4,4))
dilation = cv2.dilate(sharp_img1,kernel2,iterations = 1)
contours, hierarchy = cv2.findContours(dilation, cv2.RETR_LIST, cv2.CHAIN_APPROX_NONE )
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16,10))
ax1.imshow(sharp_img1, interpolation='none')
ax2.imshow(dilation, interpolation='none')
plt.show()
For visualization, here are the defect contours that we got!
contour = sorted(contours, key=cv2.contourArea)
contourImg = cv2.drawContours(cv_image.copy(), contour, -1, (0,255,0), 3)
plt.figure(figsize=(8,8))
plt.imshow(contourImg)
plt.show()
Now, alongside sharp_img.max(), let's save some more features, such as the smallest defect area, the largest defect area, the number of contours, and the contour counts per area range.
sharpval = []
for i in tqdm(range(0, len(labelu_r1))):
    fname = r'./data/unlabelled/images/' + labelu_r1.iloc[i, 0]
    im = Image.open(fname)
    cv_image = np.array(im)
    kernel = np.array([[0, -1, 0],
                       [-1, 4, -1],
                       [0, -1, 0]])
    sharp_img = cv2.filter2D(src=cv_image, ddepth=-1, kernel=kernel)
    sharp_img = cv2.cvtColor(sharp_img, cv2.COLOR_BGR2GRAY)
    ret, sharp_img1 = cv2.threshold(sharp_img, 30, 255, cv2.THRESH_BINARY)
    kernel2 = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (4, 4))
    dilation = cv2.dilate(sharp_img1, kernel2, iterations=1)
    contours, hierarchy = cv2.findContours(dilation, cv2.RETR_LIST, cv2.CHAIN_APPROX_NONE)
    # count the contours, grouped by area
    c5 = 0
    c10 = 0
    c15 = 0
    c20 = 0
    c30 = 0
    c40 = 0
    c50 = 0
    c80 = 0
    c100 = 0
    c150 = 0
    c200 = 0
    c300 = 0
    c400 = 0
    c500 = 0
    areas = []
    for j in contours:
        area = cv2.contourArea(j)
        # upper limit in case cv2 returns a contour larger than the image itself
        if area < 10000 and area > 5:
            areas.append(area)
            if 5 <= area < 10:
                c5 = c5 + 1
            if 10 <= area < 15:
                c10 = c10 + 1
            if 15 <= area < 20:
                c15 = c15 + 1
            if 20 <= area < 30:
                c20 = c20 + 1
            if 30 <= area < 40:
                c30 = c30 + 1
            if 40 <= area < 50:
                c40 = c40 + 1
            if 50 <= area < 80:
                c50 = c50 + 1
            if 80 <= area < 100:
                c80 = c80 + 1
            if 100 <= area < 150:
                c100 = c100 + 1
            if 150 <= area < 200:
                c150 = c150 + 1
            if 200 <= area < 300:
                c200 = c200 + 1
            if 300 <= area < 400:
                c300 = c300 + 1
            if 400 <= area < 500:
                c400 = c400 + 1
            if area > 500:
                c500 = c500 + 1
    if areas:
        cnt = [min(areas), max(areas), len(areas)]
    else:
        cnt = [0, 0, 0]
    sharpval.append([sharp_img.max(), cnt[0], cnt[1], cnt[2], c5, c10, c15, c20, c30, c40, c50, c80, c100, c150, c200, c300, c400, c500])
scratch_res = pd.DataFrame(np.array(sharpval),columns=['sharp','min_size','max_size', 'n_con', 'c5', 'c10', 'c15', 'c20', 'c30', 'c40', 'c50', 'c80', 'c100', 'c150', 'c200', 'c300', 'c400', 'c500'])
scratch_res
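The fourteen counters in the loop can also be computed in a single call. Below is a minimal sketch using `np.histogram`; the helper name `area_features` is made up for illustration, and note one small difference from the loop: an area of exactly 500 lands in `c500` here, whereas the loop above does not count it in any range.

```python
import numpy as np

# Bin edges matching the c5 ... c500 counters; the last bin runs from 500 up to the 10000 cap.
EDGES = [5, 10, 15, 20, 30, 40, 50, 80, 100, 150, 200, 300, 400, 500, 10000]
NAMES = ["c5", "c10", "c15", "c20", "c30", "c40", "c50",
         "c80", "c100", "c150", "c200", "c300", "c400", "c500"]

def area_features(areas):
    """Summarise contour areas as min/max/count plus per-range counts (hypothetical helper)."""
    a = np.asarray(areas, dtype=float)
    a = a[(a > 5) & (a < 10000)]              # same validity filter as the loop
    counts, _ = np.histogram(a, bins=EDGES)
    feats = {
        "min_size": float(a.min()) if a.size else 0.0,
        "max_size": float(a.max()) if a.size else 0.0,
        "n_con": int(a.size),
    }
    feats.update({name: int(c) for name, c in zip(NAMES, counts)})
    return feats

example = area_features([3, 6, 12, 45, 99, 250, 750, 20000])
print(example)
```

Returning a dict keeps the feature names next to the values, so the column list does not have to be repeated when building the DataFrame.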
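labelu_r1['scratch_flag'] = 0
# .loc avoids chained assignment, which raises SettingWithCopyWarning and may not modify the DataFrame
labelu_r1.loc[(labelu_r1['scratch_small'] == 1) | (labelu_r1['scratch_large'] == 1), 'scratch_flag'] = 1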
labelu_r1
recap_res = pd.concat([labelu_r1, scratch_res],axis=1)
recap_res
Let's look at the distribution of one of the features we just extracted: the maximum 'sharp' value.
recap_res['sharp'][recap_res['scratch_flag'] == 1].hist(bins=20)
recap_res['sharp'][recap_res['scratch_flag'] == 1].describe()
recap_res['sharp'][recap_res['scratch_flag'] == 0].hist(bins=20)
recap_res['sharp'][recap_res['scratch_flag'] == 0].describe()
Well, look at that: using just one of the features, you can pretty much separate the images that have a scratch defect from the ones that don't.
You can play around with the contour sizes and the per-range contour counts to try to find other kinds of defects.
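If you want to pick a cut-off on the 'sharp' feature more systematically than by eyeballing the histograms, one option is to sweep candidate thresholds and score each one. This is only a sketch on made-up numbers; the helper name `best_threshold` and the use of balanced accuracy as the score are my assumptions, not something taken from the challenge:

```python
import numpy as np

def best_threshold(sharp, flag, candidates):
    """Return (threshold, balanced accuracy) for the rule 'sharp > threshold means scratch'."""
    sharp = np.asarray(sharp)
    flag = np.asarray(flag).astype(bool)
    best = None
    for t in candidates:
        pred = sharp > t
        tpr = pred[flag].mean()        # share of scratch images caught
        tnr = (~pred[~flag]).mean()    # share of clean images left alone
        score = (tpr + tnr) / 2
        if best is None or score > best[1]:
            best = (t, score)
    return best

# Toy values standing in for recap_res['sharp'] and recap_res['scratch_flag'].
sharp = [20, 35, 40, 60, 150, 180, 220, 255]
flag = [0, 0, 0, 0, 1, 1, 1, 1]
result = best_threshold(sharp, flag, range(0, 256, 10))
print(result)
```

On real data the two distributions overlap, so the best score will be below 1 and the chosen threshold is a trade-off between missed scratches and false alarms.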
Now, let's try to apply it to our second dataset!
Apply it to Round 2 Data¶
labelu_r2 = pd.read_csv('./data_round2/unlabelled/labels.csv')  # forward slashes so the path also works on Linux/Colab
labelu_r2
sharpval = []
for i in tqdm(range(0, len(labelu_r2))):
    fname = r'./data_round2/unlabelled/images/' + labelu_r2.iloc[i, 0]
    im = Image.open(fname)
    cv_image = np.array(im)
    kernel = np.array([[0, -1, 0],
                       [-1, 4, -1],
                       [0, -1, 0]])
    sharp_img = cv2.filter2D(src=cv_image, ddepth=-1, kernel=kernel)
    sharp_img = cv2.cvtColor(sharp_img, cv2.COLOR_BGR2GRAY)
    ret, sharp_img1 = cv2.threshold(sharp_img, 30, 255, cv2.THRESH_BINARY)
    kernel2 = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (4, 4))
    dilation = cv2.dilate(sharp_img1, kernel2, iterations=1)
    contours, hierarchy = cv2.findContours(dilation, cv2.RETR_LIST, cv2.CHAIN_APPROX_NONE)
    # count the contours, grouped by area (useful for further differentiating the defects)
    c5 = 0
    c10 = 0
    c15 = 0
    c20 = 0
    c30 = 0
    c40 = 0
    c50 = 0
    c80 = 0
    c100 = 0
    c150 = 0
    c200 = 0
    c300 = 0
    c400 = 0
    c500 = 0
    areas = []
    for j in contours:
        area = cv2.contourArea(j)
        # upper limit in case cv2 returns a contour larger than the image itself
        if area < 10000 and area > 5:
            areas.append(area)
            if 5 <= area < 10:
                c5 = c5 + 1
            if 10 <= area < 15:
                c10 = c10 + 1
            if 15 <= area < 20:
                c15 = c15 + 1
            if 20 <= area < 30:
                c20 = c20 + 1
            if 30 <= area < 40:
                c30 = c30 + 1
            if 40 <= area < 50:
                c40 = c40 + 1
            if 50 <= area < 80:
                c50 = c50 + 1
            if 80 <= area < 100:
                c80 = c80 + 1
            if 100 <= area < 150:
                c100 = c100 + 1
            if 150 <= area < 200:
                c150 = c150 + 1
            if 200 <= area < 300:
                c200 = c200 + 1
            if 300 <= area < 400:
                c300 = c300 + 1
            if 400 <= area < 500:
                c400 = c400 + 1
            if area > 500:
                c500 = c500 + 1
    if areas:
        cnt = [min(areas), max(areas), len(areas)]
    else:
        cnt = [0, 0, 0]
    sharpval.append([sharp_img.max(), cnt[0], cnt[1], cnt[2], c5, c10, c15, c20, c30, c40, c50, c80, c100, c150, c200, c300, c400, c500])
scratch_res_r2 = pd.DataFrame(np.array(sharpval),columns=['sharp','min_size','max_size', 'n_con', 'c5', 'c10', 'c15', 'c20', 'c30', 'c40', 'c50', 'c80', 'c100', 'c150', 'c200', 'c300', 'c400', 'c500'])
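labelu_r2['scratch_flag'] = 0
# .loc avoids chained assignment, which raises SettingWithCopyWarning and may not modify the DataFrame
labelu_r2.loc[(labelu_r2['scratch_small'] == 1) | (labelu_r2['scratch_large'] == 1), 'scratch_flag'] = 1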
recap_res_r2 = pd.concat([labelu_r2, scratch_res_r2],axis=1)
recap_res_r2
Now let's see the distribution for the Round 2 unlabelled data.
recap_res_r2['sharp'][recap_res_r2['scratch_flag'] == 1].hist(bins=20)
recap_res_r2['sharp'][recap_res_r2['scratch_flag'] == 1].describe()
recap_res_r2['sharp'][recap_res_r2['scratch_flag'] == 0].hist(bins=20)
recap_res_r2['sharp'][recap_res_r2['scratch_flag'] == 0].describe()
Wow, look at that!
Now let's assume that a maximum 'sharp' value over 100 means there must be a scratch defect, and see how many images got a bad label (I know that threshold is too generous, given the distribution from Round 1).
bad_label = recap_res_r2['sharp'][recap_res_r2['scratch_flag'] == 0] > 100
recap1_falneg = recap_res_r2[(recap_res_r2['scratch_flag'] == 0) & (recap_res_r2['sharp']>100)]
recap1_falneg
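To turn that table into a rate, you can divide the number of suspects by the number of images without a scratch label, or by the whole dataset. A minimal sketch on a toy frame, assuming the same `scratch_flag` and `sharp` columns as above (the values are made up):

```python
import pandas as pd

# Toy stand-in for recap_res_r2; the values are made up.
toy = pd.DataFrame({
    "scratch_flag": [0, 0, 0, 0, 1, 1],
    "sharp":        [20, 150, 45, 255, 200, 90],
})

threshold = 100  # same cut-off as above
unflagged = toy[toy["scratch_flag"] == 0]
suspects = unflagged[unflagged["sharp"] > threshold]

share_of_unflagged = len(suspects) / len(unflagged)  # among images labelled as scratch-free
share_of_all = len(suspects) / len(toy)              # among all images
print(len(suspects), share_of_unflagged, share_of_all)
```

Which denominator you report matters, so it helps to state it explicitly when quoting a percentage of bad labels.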
There are over 10% of them?!
And those are only the false negatives. You don't believe me?
Let's plot some of the images!
plt.figure(figsize=(20, 30))
for i in range(12):
    ax = plt.subplot(6, 4, i + 1)
    fname = r'./data_round2/unlabelled/images/' + recap1_falneg.iloc[i, 0]
    im = Image.open(fname)
    plt.imshow(im)
    plt.axis("off")