
Data Purchasing Challenge 2022

Simple Way to Detect Noisy Labels with OpenCV

Using OpenCV to enhance your training and buying strategy

leocd

Simple Way to Detect Noisy Labels

By: Leo C. D.

As we know, the main challenge of Round 2 is the noisy and imbalanced labels.

So here's my simple way to spot noisy labels, especially for the scratch marks, using cv2.

First, let's load the data from Round 1; as we know, its labels are quite clean and balanced.

Setting up

In [1]:
import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
import seaborn as sns
import cv2
from PIL import Image
from tqdm import tqdm
In [2]:
#@title Login to AIcrowd
!pip install -U aicrowd-cli > /dev/null
!aicrowd login 2> /dev/null
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
google-colab 1.0.0 requires requests~=2.23.0, but you have requests 2.27.1 which is incompatible.
datascience 0.10.6 requires folium==0.2.1, but you have folium 0.8.3 which is incompatible.
Please login here: https://api.aicrowd.com/auth/aMhj3AgWQkn8VQ6lZH5FXWRkhw6C5UCRY_rPZM-ATyI
API Key valid
Gitlab access token valid
Saved details successfully!
In [3]:
#@title Magic Box ⬛ { vertical-output: true, display-mode: "form" }
try:
  import os
  if first_run and os.path.exists("/content/data-purchasing-challenge-2022-starter-kit/data/unlabelled"):
    first_run = False
except:
  first_run = True

if first_run:
  %cd /content/
  !git clone http://gitlab.aicrowd.com/zew/data-purchasing-challenge-2022-starter-kit.git > /dev/null
  %cd data-purchasing-challenge-2022-starter-kit
  !aicrowd dataset list -c data-purchasing-challenge-2022
  !aicrowd dataset download -c data-purchasing-challenge-2022
  !mkdir -p data/
  !mv *.tar.gz data/ && cd data && echo "Extracting dataset" && ls *.tar.gz | xargs -n1 -I{} bash -c "tar -xvf {} > /dev/null"


def run_pre_training_phase():
  from run import ZEWDPCBaseRun
  run = ZEWDPCBaseRun()
  run.pre_training_phase = pre_training_phase
  run.pre_training_phase(self=run, training_dataset=training_dataset)
  # NOTE: It is critical that the checkpointing works in a self-contained way,
  #       as the evaluators might choose to run the different phases separately.
  run.save_checkpoint("/tmp/pretraining_phase_checkpoint.pickle")

def run_purchase_phase():
  from run import ZEWDPCBaseRun
  run = ZEWDPCBaseRun()
  run.pre_training_phase = pre_training_phase
  run.purchase_phase = purchase_phase
  run.load_checkpoint("/tmp/pretrainig_phase_checkpoint.pickle")
  # Hacky way to make it work in notebook
  unlabelled_dataset.purchases = set()
  run.purchase_phase(self=run, unlabelled_dataset=unlabelled_dataset, training_dataset=training_dataset, budget=3000)
  run.save_checkpoint("/tmp/purchase_phase_checkpoint.pickle")
  del run

def run_prediction_phase():
  from run import ZEWDPCBaseRun
  run = ZEWDPCBaseRun()
  run.pre_training_phase = pre_training_phase
  run.purchase_phase = purchase_phase
  run.prediction_phase = prediction_phase
  run.load_checkpoint("/tmp/purchase_phase_checkpoint.pickle")
  run.prediction_phase(self=run, test_dataset=val_dataset)
  del run
/content
Cloning into 'data-purchasing-challenge-2022-starter-kit'...
remote: Enumerating objects: 405, done.
remote: Counting objects: 100% (306/306), done.
remote: Compressing objects: 100% (134/134), done.
remote: Total 405 (delta 191), reused 280 (delta 171), pack-reused 99
Receiving objects: 100% (405/405), 104.79 KiB | 599.00 KiB/s, done.
Resolving deltas: 100% (245/245), done.
/content/data-purchasing-challenge-2022-starter-kit
                          Datasets for challenge #1024                          
┌───┬─────────────────────────┬──────────────────────────────────────┬─────────┐
│ # │ Title                   │ Description                          │    Size │
├───┼─────────────────────────┼──────────────────────────────────────┼─────────┤
│ 0 │ debug-v0.1.tar.gz       │ Debug dataset                        │ 6.1 MiB │
│ 1 │ debug-v0.2-rc4.zip      │ Debug data for round 2               │   6 MiB │
│ 2 │ training-v0.1.tar.gz    │ Training data                        │ 304 MiB │
│ 3 │ training-v0.2-rc4.zip   │ Training data for round 2            │  97 MiB │
│ 4 │ unlabelled-v0.1.tar.gz  │ Unlabelled image dataset             │ 609 MiB │
│ 5 │ unlabelled-v0.2-rc4.zip │ Unlabelled image dataset for round 2 │ 973 MiB │
│ 6 │ validation-v0.1.tar.gz  │ Validation dataset                   │ 182 MiB │
│ 7 │ validation-v0.2-rc4.zip │ Validation dataset for round 2       │ 292 MiB │
└───┴─────────────────────────┴──────────────────────────────────────┴─────────┘
debug.tar.gz: 100% 6.43M/6.43M [00:00<00:00, 7.03MB/s]
debug-v0.2-rc4.zip: 100% 6.39M/6.39M [00:00<00:00, 10.0MB/s]
training.tar.gz: 100% 319M/319M [00:11<00:00, 26.8MB/s]
training-v0.2-rc4.zip: 100% 102M/102M [00:03<00:00, 29.0MB/s]
unlabelled.tar.gz: 100% 638M/638M [00:23<00:00, 26.6MB/s]
unlabelled-v0.2-rc4.zip: 100% 1.02G/1.02G [00:38<00:00, 26.3MB/s]
validation.tar.gz: 100% 191M/191M [00:11<00:00, 17.0MB/s]
validation-v0.2-rc4.zip: 100% 306M/306M [00:11<00:00, 27.3MB/s]
Extracting dataset
In [5]:
!rm -r data/unlabelled
In [ ]:
!tar -xzvf "/content/data-purchasing-challenge-2022-starter-kit/data/unlabelled.tar.gz" -C "./data/"
!mkdir data_round2
!unzip "./unlabelled-v0.2-rc4.zip" -d "./data_round2"

Getting Clean Label Info from Round 1 Data

In [ ]:
labelu_r1 = pd.read_csv('./data/unlabelled/labels.csv')
labelu_r1
Out[ ]:
filename scratch_small scratch_large dent_small dent_large
0 00O1rwvydO.png 0 0 0 0
1 01bB3DVokm.png 1 0 0 0
2 01sAbrP4Gm.png 0 0 0 0
3 02XiKLPuxY.png 0 0 0 0
4 02bVIX0aMT.png 0 1 0 0
... ... ... ... ... ...
9995 zwjPUVHAQy.png 0 0 1 0
9996 zwtEorOjmX.png 1 0 0 1
9997 zxZKEA9OMI.png 0 0 0 0
9998 zxmnO7Zt5o.png 1 0 0 0
9999 zxxRwwUeEv.png 1 0 1 0

10000 rows × 5 columns

Let's get the heavily damaged images, i.e. the ones that have all four labels.

In [ ]:
labelscr = labelu_r1[(labelu_r1['scratch_small']==1) & (labelu_r1['scratch_large']==1) & (labelu_r1['dent_small']==1)& (labelu_r1['dent_large']==1)]
In [ ]:
labelscr
Out[ ]:
filename scratch_small scratch_large dent_small dent_large
238 1SuRMIiysb.png 1 1 1 1
261 1bNY8WF9ag.png 1 1 1 1
408 2PFdmVacOS.png 1 1 1 1
556 3HpqF4Fo1v.png 1 1 1 1
580 3QKS6GihCc.png 1 1 1 1
... ... ... ... ... ...
9684 y8ZtCnvaiC.png 1 1 1 1
9710 yJa3RY4zXM.png 1 1 1 1
9721 yOv11ZZMtg.png 1 1 1 1
9796 yldO6V9DC9.png 1 1 1 1
9805 ypGhx8EWhk.png 1 1 1 1

108 rows × 5 columns

Let's check one of the images.

In [ ]:
i = 3
fname = r'./data/unlabelled/images/' + labelscr.iloc[i,0]
im = Image.open(fname)
cv_image = np.array(im)
plt.figure(figsize=(8,8))
plt.imshow(cv_image, interpolation='none')
plt.show()

Using the method from my first notebook here (please leave some likes ❤️!)

Let's apply it to the images!

In [ ]:
kernel = np.array([[0, -1, 0],
                   [-1, 4, -1],
                   [0, -1, 0]])
In [ ]:
sharp_img = cv2.filter2D(src=cv_image, ddepth=-1, kernel=kernel)
sharp_img = cv2.cvtColor(sharp_img, cv2.COLOR_BGR2GRAY)
In [ ]:
plt.figure(figsize=(8,8))
plt.imshow(sharp_img, interpolation='none')
plt.show()

Now you can see that the filter separates the scratch defect from the background. The maximum pixel value can also be extracted as a feature; I'll explain later why this is important.

In [ ]:
sharp_img.max()
Out[ ]:
254

Now let's enhance the intensity of the defects and find the contours!

In [ ]:
ret,sharp_img1 = cv2.threshold(sharp_img,30,255,cv2.THRESH_BINARY)
kernel2 = cv2.getStructuringElement(cv2.MORPH_ELLIPSE,(4,4))
dilation = cv2.dilate(sharp_img1,kernel2,iterations = 1)
contours, hierarchy = cv2.findContours(dilation, cv2.RETR_LIST, cv2.CHAIN_APPROX_NONE )
In [ ]:
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16,10))
ax1.imshow(sharp_img1, interpolation='none')
ax2.imshow(dilation, interpolation='none')
plt.show()

For visualization, here are the defect contours we got!

In [ ]:
contour = sorted(contours, key=cv2.contourArea)
contourImg = cv2.drawContours(cv_image.copy(), contour, -1, (0,255,0), 3)
plt.figure(figsize=(8,8))
plt.imshow(contourImg)
plt.show()

Now, along with the sharp_img.max() value, let's save some more features:

the smallest defect area, the largest defect area, the number of contours, and the count of contours in each area bin.

In [ ]:
sharpval = []
for i in tqdm(range(0,len(labelu_r1))):

    fname = r'./data/unlabelled/images/' + labelu_r1.iloc[i,0]
    im = Image.open(fname)
    cv_image = np.array(im)

    kernel = np.array([[0, -1, 0],
                       [-1, 4, -1],
                       [0, -1, 0]])

    sharp_img = cv2.filter2D(src=cv_image, ddepth=-1, kernel=kernel)
    sharp_img = cv2.cvtColor(sharp_img, cv2.COLOR_BGR2GRAY)
    ret,sharp_img1 = cv2.threshold(sharp_img,30,255,cv2.THRESH_BINARY)
    kernel2 = cv2.getStructuringElement(cv2.MORPH_ELLIPSE,(4,4))
    dilation = cv2.dilate(sharp_img1,kernel2,iterations = 1)
    contours, hierarchy = cv2.findContours(dilation, cv2.RETR_LIST, cv2.CHAIN_APPROX_NONE )    
    
    # getting area
    c5 = 0
    c10 = 0
    c15 = 0
    c20 = 0
    c30 = 0
    c40 = 0
    c50 = 0
    c80 = 0
    c100 = 0
    c150 = 0
    c200 = 0
    c300 = 0
    c400 = 0
    c500 = 0
    
    areas = []
    for j in contours:
        area = cv2.contourArea(j)
        if area<10000 and area>5: #this to limit incase the cv returning a contour somewhat larger than the image itself
            areas.append(area)
            if 5 <= area < 10:
                c5 = c5 + 1
            if 10 <= area < 15:
                c10 = c10 + 1
            if 15 <= area < 20:
                c15 = c15 + 1
            if 20 <= area < 30:
                c20 = c20 + 1
            if 30 <= area < 40:
                c30 = c30 + 1
            if 40 <= area < 50:
                c40 = c40 + 1
            if 50 <= area < 80:
                c50 = c50 + 1
            if 80 <= area < 100:
                c80 = c80 + 1
            if 100 <= area < 150:
                c100 = c100 + 1   
            if 150 <= area < 200:
                c150 = c150 + 1
            if 200 <= area < 300:
                c200 = c200 + 1
            if 300 <= area < 400:
                c300 = c300 + 1
            if 400 <= area < 500:
                c400 = c400 + 1
            if area > 500:
                c500 = c500 + 1  
    if areas:
        cnt = [min(areas), max(areas), len(areas)]
    else:
        cnt = [0,0,0]    
    
    sharpval.append([sharp_img.max(), cnt[0], cnt[1], cnt[2], c5, c10, c15, c20, c30, c40, c50, c80, c100, c150, c200, c300, c400, c500])
scratch_res = pd.DataFrame(np.array(sharpval),columns=['sharp','min_size','max_size', 'n_con', 'c5', 'c10', 'c15', 'c20', 'c30', 'c40', 'c50', 'c80', 'c100', 'c150', 'c200', 'c300', 'c400', 'c500'])
100%|███████████████████████████████████████████████████████████████████████████| 10000/10000 [00:34<00:00, 287.16it/s]
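
Since we will run exactly the same extraction again on the Round 2 data below, the loop above can also be wrapped in a reusable helper. Here is a minimal sketch (a hypothetical extract_sharp_features; it uses np.histogram for the area bins instead of the explicit counters, so behaviour exactly at the bin edges may differ slightly):

AREA_BINS = [5, 10, 15, 20, 30, 40, 50, 80, 100, 150, 200, 300, 400, 500, 10000]
FEATURE_COLS = ['sharp', 'min_size', 'max_size', 'n_con',
                'c5', 'c10', 'c15', 'c20', 'c30', 'c40', 'c50',
                'c80', 'c100', 'c150', 'c200', 'c300', 'c400', 'c500']

def extract_sharp_features(image_dir, filenames):
    # same pipeline as the loop above: sharpen, grayscale, threshold, dilate, find contours
    kernel = np.array([[0, -1, 0],
                       [-1, 4, -1],
                       [0, -1, 0]])
    kernel2 = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (4, 4))
    rows = []
    for fn in tqdm(filenames):
        cv_image = np.array(Image.open(os.path.join(image_dir, fn)))
        sharp_img = cv2.filter2D(src=cv_image, ddepth=-1, kernel=kernel)
        sharp_img = cv2.cvtColor(sharp_img, cv2.COLOR_BGR2GRAY)
        _, binary = cv2.threshold(sharp_img, 30, 255, cv2.THRESH_BINARY)
        dilation = cv2.dilate(binary, kernel2, iterations=1)
        contours, _ = cv2.findContours(dilation, cv2.RETR_LIST, cv2.CHAIN_APPROX_NONE)
        # keep contour areas in (5, 10000) to skip tiny noise and image-sized contours
        areas = [a for a in (cv2.contourArea(c) for c in contours) if 5 < a < 10000]
        bin_counts, _ = np.histogram(areas, bins=AREA_BINS)
        stats = [min(areas), max(areas), len(areas)] if areas else [0, 0, 0]
        rows.append([sharp_img.max(), *stats, *bin_counts])
    return pd.DataFrame(rows, columns=FEATURE_COLS)

# e.g. scratch_res = extract_sharp_features('./data/unlabelled/images/', labelu_r1['filename'])
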
In [ ]:
scratch_res
Out[ ]:
sharp min_size max_size n_con c5 c10 c15 c20 c30 c40 c50 c80 c100 c150 c200 c300 c400 c500
0 7.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 168.0 7.0 92.0 22.0 9.0 10.0 0.0 1.0 0.0 0.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0
2 8.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 16.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 254.0 7.0 246.0 17.0 5.0 2.0 2.0 2.0 1.0 0.0 2.0 1.0 1.0 0.0 1.0 0.0 0.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
9995 18.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
9996 107.0 7.0 35.0 7.0 3.0 1.0 2.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
9997 16.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
9998 154.0 7.0 15.0 15.0 8.0 6.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
9999 131.0 7.0 64.0 20.0 11.0 3.0 4.0 0.0 0.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

10000 rows × 18 columns

In [ ]:
labelu_r1['scratch_flag'] = 0
# use .loc to avoid pandas' SettingWithCopyWarning from chained assignment
labelu_r1.loc[(labelu_r1['scratch_small'] == 1) | (labelu_r1['scratch_large'] == 1), 'scratch_flag'] = 1
In [ ]:
labelu_r1
Out[ ]:
filename scratch_small scratch_large dent_small dent_large scratch_flag
0 00O1rwvydO.png 0 0 0 0 0
1 01bB3DVokm.png 1 0 0 0 1
2 01sAbrP4Gm.png 0 0 0 0 0
3 02XiKLPuxY.png 0 0 0 0 0
4 02bVIX0aMT.png 0 1 0 0 1
... ... ... ... ... ... ...
9995 zwjPUVHAQy.png 0 0 1 0 0
9996 zwtEorOjmX.png 1 0 0 1 1
9997 zxZKEA9OMI.png 0 0 0 0 0
9998 zxmnO7Zt5o.png 1 0 0 0 1
9999 zxxRwwUeEv.png 1 0 1 0 1

10000 rows × 6 columns

In [ ]:
recap_res = pd.concat([labelu_r1, scratch_res],axis=1)
recap_res
Out[ ]:
filename scratch_small scratch_large dent_small dent_large scratch_flag sharp min_size max_size n_con ... c30 c40 c50 c80 c100 c150 c200 c300 c400 c500
0 00O1rwvydO.png 0 0 0 0 0 7.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 01bB3DVokm.png 1 0 0 0 1 168.0 7.0 92.0 22.0 ... 0.0 0.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0
2 01sAbrP4Gm.png 0 0 0 0 0 8.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 02XiKLPuxY.png 0 0 0 0 0 16.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 02bVIX0aMT.png 0 1 0 0 1 254.0 7.0 246.0 17.0 ... 1.0 0.0 2.0 1.0 1.0 0.0 1.0 0.0 0.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
9995 zwjPUVHAQy.png 0 0 1 0 0 18.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
9996 zwtEorOjmX.png 1 0 0 1 1 107.0 7.0 35.0 7.0 ... 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
9997 zxZKEA9OMI.png 0 0 0 0 0 16.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
9998 zxmnO7Zt5o.png 1 0 0 0 1 154.0 7.0 15.0 15.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
9999 zxxRwwUeEv.png 1 0 1 0 1 131.0 7.0 64.0 20.0 ... 0.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

10000 rows × 24 columns

Let's see the distribution of one of the features we just extracted: the maximum 'sharp' value.

In [ ]:
recap_res['sharp'][recap_res['scratch_flag'] == 1].hist(bins=20)
Out[ ]:
<matplotlib.axes._subplots.AxesSubplot at 0x2885c2fb108>
In [ ]:
recap_res['sharp'][recap_res['scratch_flag'] == 1].describe()
Out[ ]:
count    3486.000000
mean      189.485657
std        55.329572
min        25.000000
25%       146.000000
50%       193.000000
75%       247.000000
max       255.000000
Name: sharp, dtype: float64
In [ ]:
recap_res['sharp'][recap_res['scratch_flag'] == 0].hist(bins=20)
Out[ ]:
<matplotlib.axes._subplots.AxesSubplot at 0x2885c3cf088>
In [ ]:
recap_res['sharp'][recap_res['scratch_flag'] == 0].describe()
Out[ ]:
count    6514.000000
mean       13.389776
std         6.000708
min         6.000000
25%        10.000000
50%        12.000000
75%        15.000000
max        89.000000
Name: sharp, dtype: float64

Well, look at that: using just one feature, you can pretty much separate the images with scratch defects from the ones without.

You can also play around with the contour sizes and the per-bin contour counts to try to find the other kinds of defects.
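
For example, here is a minimal sketch of such a threshold rule on the Round 1 features (the cutoff of 100 is my own assumption, read off the two histograms above: clean images top out at 89, while most scratched images sit well above 100):

# Minimal sketch: flag a scratch whenever the maximum 'sharp' value exceeds a cutoff.
# The cutoff of 100 is an assumption based on the two distributions above.
SHARP_CUTOFF = 100

pred_scratch = (recap_res['sharp'] > SHARP_CUTOFF).astype(int)
agreement = (pred_scratch == recap_res['scratch_flag']).mean()
print(f"agreement with the (clean) Round 1 labels: {agreement:.3f}")

# disagreements between the rule and the label are the images worth re-checking
suspects = recap_res[pred_scratch != recap_res['scratch_flag']]
print(len(suspects), "images where the rule and the label disagree")
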

Now, let's try to apply it to the Round 2 dataset!


Apply it to Round 2 Data

In [ ]:
labelu_r2 = pd.read_csv('./data_round2/unlabelled/labels.csv')
labelu_r2
Out[ ]:
filename scratch_small scratch_large dent_small dent_large stray_particle discoloration
0 mxQiOKkCi7.png 0 0 0 0 0 0
1 F8E3dsmnSP.png 0 0 0 0 0 0
2 obeLGzdD8T.png 1 0 0 0 0 0
3 vbyjMQ90i6.png 0 0 1 0 0 0
4 RdzUUAkSVC.png 0 0 0 0 0 0
... ... ... ... ... ... ... ...
9995 FWmzmcWsvK.png 0 0 0 0 1 0
9996 McdXs4y8d7.png 0 0 0 0 1 0
9997 jFNmZC4LF4.png 0 0 0 0 1 0
9998 OiPXgurZKH.png 0 0 0 0 1 0
9999 lr41fb2k0u.png 0 0 0 0 0 0

10000 rows × 7 columns

In [ ]:
sharpval = []
for i in tqdm(range(0,len(labelu_r2))):

    fname = r'./data_round2/unlabelled/images/' + labelu_r2.iloc[i,0]
    im = Image.open(fname)
    cv_image = np.array(im)

    kernel = np.array([[0, -1, 0],
                       [-1, 4, -1],
                       [0, -1, 0]])

    sharp_img = cv2.filter2D(src=cv_image, ddepth=-1, kernel=kernel)
    sharp_img = cv2.cvtColor(sharp_img, cv2.COLOR_BGR2GRAY)
    ret,sharp_img1 = cv2.threshold(sharp_img,30,255,cv2.THRESH_BINARY)
    kernel2 = cv2.getStructuringElement(cv2.MORPH_ELLIPSE,(4,4))
    dilation = cv2.dilate(sharp_img1,kernel2,iterations = 1)
    contours, hierarchy = cv2.findContours(dilation, cv2.RETR_LIST, cv2.CHAIN_APPROX_NONE )    
    
    # getting area
    c5 = 0
    c10 = 0
    c15 = 0
    c20 = 0
    c30 = 0
    c40 = 0
    c50 = 0
    c80 = 0
    c100 = 0
    c150 = 0
    c200 = 0
    c300 = 0
    c400 = 0
    c500 = 0
    
    areas = []  # count the contours and group them by area (useful for further differentiating the defects)
    for j in contours: 
        area = cv2.contourArea(j)
        if area<10000 and area>5: #this to limit incase the cv returning a contour somewhat larger than the image itself
            areas.append(area)
            if 5 <= area < 10:
                c5 = c5 + 1
            if 10 <= area < 15:
                c10 = c10 + 1
            if 15 <= area < 20:
                c15 = c15 + 1
            if 20 <= area < 30:
                c20 = c20 + 1
            if 30 <= area < 40:
                c30 = c30 + 1
            if 40 <= area < 50:
                c40 = c40 + 1
            if 50 <= area < 80:
                c50 = c50 + 1
            if 80 <= area < 100:
                c80 = c80 + 1
            if 100 <= area < 150:
                c100 = c100 + 1   
            if 150 <= area < 200:
                c150 = c150 + 1
            if 200 <= area < 300:
                c200 = c200 + 1
            if 300 <= area < 400:
                c300 = c300 + 1
            if 400 <= area < 500:
                c400 = c400 + 1
            if area > 500:
                c500 = c500 + 1  
    if areas:
        cnt = [min(areas), max(areas), len(areas)]
    else:
        cnt = [0,0,0]    
    
    sharpval.append([sharp_img.max(), cnt[0], cnt[1], cnt[2], c5, c10, c15, c20, c30, c40, c50, c80, c100, c150, c200, c300, c400, c500])
scratch_res_r2 = pd.DataFrame(np.array(sharpval),columns=['sharp','min_size','max_size', 'n_con', 'c5', 'c10', 'c15', 'c20', 'c30', 'c40', 'c50', 'c80', 'c100', 'c150', 'c200', 'c300', 'c400', 'c500'])
100%|███████████████████████████████████████████████████████████████████████████| 10000/10000 [00:29<00:00, 344.24it/s]
In [ ]:
labelu_r2['scratch_flag'] = 0
# use .loc to avoid pandas' SettingWithCopyWarning from chained assignment
labelu_r2.loc[(labelu_r2['scratch_small'] == 1) | (labelu_r2['scratch_large'] == 1), 'scratch_flag'] = 1
In [ ]:
recap_res_r2 = pd.concat([labelu_r2, scratch_res_r2],axis=1)
recap_res_r2
Out[ ]:
filename scratch_small scratch_large dent_small dent_large stray_particle discoloration scratch_flag sharp min_size ... c30 c40 c50 c80 c100 c150 c200 c300 c400 c500
0 mxQiOKkCi7.png 0 0 0 0 0 0 0 33.0 7.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 F8E3dsmnSP.png 0 0 0 0 0 0 0 32.0 7.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 obeLGzdD8T.png 1 0 0 0 0 0 1 180.0 7.0 ... 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0
3 vbyjMQ90i6.png 0 0 1 0 0 0 0 36.0 7.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 RdzUUAkSVC.png 0 0 0 0 0 0 0 36.0 7.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
9995 FWmzmcWsvK.png 0 0 0 0 1 0 0 82.0 7.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
9996 McdXs4y8d7.png 0 0 0 0 1 0 0 33.0 7.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
9997 jFNmZC4LF4.png 0 0 0 0 1 0 0 116.0 7.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
9998 OiPXgurZKH.png 0 0 0 0 1 0 0 92.0 7.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
9999 lr41fb2k0u.png 0 0 0 0 0 0 0 34.0 7.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

10000 rows × 26 columns

Now let's see the distribution for the Round 2 unlabelled data.

In [ ]:
recap_res_r2['sharp'][recap_res_r2['scratch_flag'] == 1].hist(bins=20)
Out[ ]:
<matplotlib.axes._subplots.AxesSubplot at 0x2885d8cf088>
In [ ]:
recap_res_r2['sharp'][recap_res_r2['scratch_flag'] == 1].describe()
Out[ ]:
count    2559.000000
mean      122.136772
std        66.831127
min        29.000000
25%        41.000000
50%       128.000000
75%       173.000000
max       255.000000
Name: sharp, dtype: float64
In [ ]:
recap_res_r2['sharp'][recap_res_r2['scratch_flag'] == 0].hist(bins=20)
Out[ ]:
<matplotlib.axes._subplots.AxesSubplot at 0x2885d8768c8>
In [ ]:
recap_res_r2['sharp'][recap_res_r2['scratch_flag'] == 0].describe()
Out[ ]:
count    7441.000000
mean       54.664830
std        43.347759
min        29.000000
25%        33.000000
50%        35.000000
75%        40.000000
max       255.000000
Name: sharp, dtype: float64

Wow, look at that!

Now let's assume that a maximum 'sharp' value above 100 means there must be a scratch defect, and see how many images got a bad label (I know that cutoff is quite generous, given the distribution from Round 1).

In [ ]:
bad_label = recap_res_r2['sharp'][recap_res_r2['scratch_flag'] == 0] > 100
In [ ]:
recap1_falneg = recap_res_r2[(recap_res_r2['scratch_flag'] == 0) & (recap_res_r2['sharp']>100)]
recap1_falneg
Out[ ]:
filename scratch_small scratch_large dent_small dent_large stray_particle discoloration scratch_flag sharp min_size ... c30 c40 c50 c80 c100 c150 c200 c300 c400 c500
12 JmGDWGBWY8.png 0 0 1 0 0 0 0 122.0 7.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
27 MiU3mPyHq5.png 0 0 0 0 0 0 0 145.0 7.0 ... 2.0 0.0 2.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
30 AUQYeyEEnp.png 0 0 0 0 0 0 0 195.0 7.0 ... 1.0 1.0 0.0 2.0 1.0 0.0 1.0 0.0 0.0 0.0
42 q3YgefTyOq.png 0 0 0 0 0 0 0 188.0 7.0 ... 5.0 1.0 3.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0
54 c3qDX8KcFd.png 0 0 0 0 0 0 0 145.0 7.0 ... 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
9979 BXyL57EhSG.png 0 0 0 0 1 0 0 133.0 7.0 ... 2.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
9981 DofmSnpIPp.png 0 0 0 0 1 0 0 108.0 7.0 ... 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
9989 KfJC6lktgz.png 0 0 0 0 1 0 0 125.0 7.0 ... 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
9992 EqmgjbWe2r.png 0 0 0 0 1 0 0 136.0 7.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
9997 jFNmZC4LF4.png 0 0 0 0 1 0 0 116.0 7.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

1176 rows × 26 columns

THERE ARE over 10% of them?!

And that's only counting the false negatives. You don't believe me?

Let's plot some of the images!

In [ ]:
plt.figure(figsize=(20, 30))
for i in range(12):
    ax = plt.subplot(6, 4, i + 1)
    fname = r'./data_round2/unlabelled/images/' + recap1_falneg.iloc[i,0]
    im = Image.open(fname)
    plt.imshow(im)
    plt.axis("off")