Seismic Facies Identification Challenge
[Explainer] W&B, Catalyst with EDA, Extra Attributes
EDA and extra features with the wandb API, plus Catalyst training for rapid experimentation
This notebook introduces two tools for rapid experimentation and better configuration of your training: Catalyst and wandb, which in this competition give a decent score while keeping the flexibility of a custom training loop. I also summarize the important EDA and extra attributes, but only briefly, since most of the existing works have already done a beautiful job of explaining the data exploration part; my focus here is on introducing new tools together with my approach to this challenge.
Seismic Facies Identification Challenge - Explainer Notebook with Added Perks
- Jyot Makadiya
In this notebook, we are going to look at the Seismic Facies Identification challenge, which is a 3D semantic segmentation problem that, with some tweaking, can be converted into a 2D semantic segmentation task (a minimal slicing sketch follows below). This notebook aims to give an idea of previous work (notebooks at this page), summarizes a few very basic EDA and extra-attribute ideas, and finally introduces a flexible and systematic approach for training models and tracking progress.
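To make the 3D-to-2D conversion concrete before the Q&A below, here is a minimal sketch of a PyTorch Dataset that serves 2D slices cut from the 3D volume; the class name, slicing axis and channel layout are my own illustrative choices, not a prescribed format.
#minimal sketch: serve 2D slices (and masks) from the 3D cube
import numpy as np
import torch
from torch.utils.data import Dataset

class SeismicSliceDataset(Dataset):
    def __init__(self, volume, labels, transform=None):
        #volume and labels are 3D numpy arrays of identical shape
        self.volume, self.labels, self.transform = volume, labels, transform
    def __len__(self):
        return self.volume.shape[-1]
    def __getitem__(self, idx):
        image = self.volume[:, :, idx].astype(np.float32)
        mask = self.labels[:, :, idx].astype(np.int64)
        if self.transform is not None:
            augmented = self.transform(image=image, mask=mask)
            image, mask = augmented['image'], augmented['mask']
        #add a channel dimension: (1, H, W)
        return torch.from_numpy(image)[None], torch.from_numpy(mask)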
- What is 2D semantic segmentation? <br> In simple words, semantic segmentation is an image recognition task in which a class needs to be identified and tagged for each pixel instead of for the whole image, so what we get in the end is a mask the size of the original image, segmented into the different classes. Like most vision tasks it requires a lot of data, and different tools are used to prepare segmentation masks from images.
- What is W&B? Weights & Biases is a free service and set of tools which provide easy tracking of all your data science experiments, along with data visualization tools. In this notebook we present a very simple introduction to wandb for loss/accuracy visualization only, but it can be extended further to store hyperparameter info, save models, etc.; see the first sketch after this list.
- What is smp and why are we using it? Segmentation Models PyTorch (smp) is a package which provides pretrained, well-known segmentation architectures for common segmentation tasks. It is easy to use and easy to debug, which makes it a popular choice among researchers for rapid experimentation; see the second sketch after this list.
- Why do we need preprocessing (in our case, augmentations and normalization)? I have tried to keep this notebook balanced: not too complex on the preprocessing side, yet enough to get a good score. Preprocessing is crucial in tasks where performance can be improved using domain insights. The aim of augmentation is to get more training data while reducing overfitting, and we apply normalization so the training values are evenly scaled rather than concentrated around the raw mean.
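As a taste of the W&B workflow mentioned above, a minimal logging sketch follows; the project and run names are placeholders of mine, and in real training the logged values would be the computed metrics.
#minimal wandb logging sketch (placeholder project/run names)
import wandb
wandb.init(project='seismic-facies', name='baseline-unet')
for epoch in range(3):
    wandb.log({'epoch': epoch, 'train/loss': 1.0 / (epoch + 1)})  #dummy value, log real metrics here
wandb.finish()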
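And here is a hedged sketch of the smp plus albumentations combination from the last two bullets; the encoder choice and the normalization statistics are illustrative assumptions, not tuned values.
#flips plus normalization; compute the real mean/std from the volume
import albumentations as A
import segmentation_models_pytorch as smp
train_transform = A.Compose([
    A.HorizontalFlip(p=0.5),
    A.Normalize(mean=0.0, std=1.0, max_pixel_value=1.0),
])
#a U-Net with a pretrained encoder: 1 input channel (raw amplitude), 6 facies classes
model = smp.Unet(
    encoder_name='resnet34',
    encoder_weights='imagenet',
    in_channels=1,
    classes=6,
)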
Get the dataset from Google Drive (Colab)
#getting the dataset from the drive
from google.colab import drive
drive.mount('/content/gdrive/')
!pip install --upgrade segmentation_models_pytorch argus pytorch_toolbelt wandb catalyst albumentations
!pip install git+https://github.com/gazprom-neft/seismiqb.git
Note: I already have the dataset downloaded in my Drive, so please make sure to put the competition dataset there and use its path below.
#copy the dataset
!mkdir -p seismic-facies-identification-challenge
!rsync -avhW --compress-level=2 --info=progress2 /content/gdrive/MyDrive/Datasets/AIcrowd/seismic-facies-identification/data /content/seismic-facies-identification-challenge/data
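A quick sanity check that the copy landed where the loading cells below expect it (the path comes from the rsync command above):
#verify the copied dataset files
!ls /content/seismic-facies-identification-challenge/data/data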
Importing packages/libraries
Import packages that are common for training and prediction phases here.
#importing libraries, fundamentals and EDA related
import numpy as np
from sklearn.model_selection import StratifiedKFold
import matplotlib.pyplot as plt
import torch
torch.backends.cudnn.benchmark = True
import torch.optim as optim
from torch.utils.data import DataLoader, Dataset, Subset
import segmentation_models_pytorch as smp
# import argus
# from argus.callbacks import MonitorCheckpoint, EarlyStopping, LoggingToFile, ReduceLROnPlateau
# import albumentations as A
# from pytorch_toolbelt.inference.tiles import ImageSlicer, CudaTileMerger
# from pytorch_toolbelt.losses import LovaszLoss
#catalyst related imports
import os
from tempfile import TemporaryDirectory
from torch import nn
import catalyst
from catalyst import dl, utils, metrics
from catalyst.settings import IS_CUDA_AVAILABLE, NUM_CUDA_DEVICES, SETTINGS
#misc imports (numpy and matplotlib are already imported above)
import albumentations as A
from copy import copy
import gc
import tqdm
import cv2
#setting the seed for catalyst
SEED = 42
utils.set_global_seed(SEED)
utils.prepare_cudnn(deterministic=True)
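Before moving on, here is the shape of the Catalyst loop we are building toward: a minimal sketch in which the model, criterion, optimizer and loaders are placeholders constructed later in the notebook.
#minimal Catalyst runner sketch; the train call stays commented out until those pieces exist
runner = dl.SupervisedRunner(input_key='features', output_key='logits', target_key='targets')
# runner.train(
#     model=model,
#     criterion=nn.CrossEntropyLoss(),
#     optimizer=optim.Adam(model.parameters(), lr=1e-4),
#     loaders={'train': train_loader, 'valid': valid_loader},
#     num_epochs=10,
#     logdir='./logs',
#     verbose=True,
# )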
Loading the train and test datasets
We use NumPy to load the 3D volumes from the compressed .npz files.
%%time
#loading training and test data
train_data_full = np.load('/content/seismic-facies-identification-challenge/data/data/data_train.npz', allow_pickle=True, mmap_mode='r')['data']
train_label_full = np.load('/content/seismic-facies-identification-challenge/data/data/labels_train.npz', allow_pickle=True, mmap_mode='r')['labels']
test_img = np.load('/content/seismic-facies-identification-challenge/data/data/data_test_2.npz', allow_pickle=True, mmap_mode='r')['data']
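As a quick sanity check, let's print the shapes and dtypes before any processing; the exact values depend on the competition files, so I don't quote them here.
#basic sanity check on the loaded arrays
print('train data:  ', train_data_full.shape, train_data_full.dtype)
print('train labels:', train_label_full.shape, train_label_full.dtype)
print('test data:   ', test_img.shape, test_img.dtype)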
EDA
Now let's take a look at the training data, starting with basic statistics, then plotting the train-data distribution with mean ± 3·std marked, first on a linear and then on a logarithmic count scale.
%%time
mean, std = train_data_full.mean(), train_data_full.std()
ranges = train_data_full.min(), train_data_full.max()
values = np.unique(train_label_full)
data_histograms = [train_data_full[train_label_full == value].flatten()
for value in values]
plt.figure(figsize=(8, 5))
plt.hist(train_data_full.flatten(), bins=100, color = 'b', alpha=0.2)
plt.axvline(mean, color='r', linestyle='dashed', linewidth=2)
plt.axvline(mean + 3*std, color='r', linestyle='dashed', linewidth=1)
plt.axvline(mean - 3*std, color='r', linestyle='dashed', linewidth=1)
plt.show()
plt.figure(figsize=(8, 5))
plt.hist(train_data_full.flatten(), bins=100, log=True, color = 'b', alpha=0.2)
plt.axvline(mean, color='r', linestyle='dashed', linewidth=2)
plt.axvline(mean + 3*std, color='r', linestyle='dashed', linewidth=1)
plt.axvline(mean - 3*std, color='r', linestyle='dashed', linewidth=1)
plt.show()
Notice that the distribution is spread much more evenly on the log scale than on the linear one, from which we can conclude that the amplitudes are far from uniform: a few value ranges occur with very high frequency, concentrated around the mean (the plot makes that clear). <br> Also, the values are not skewed around the mean, but split fairly evenly on both sides.
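One simple way to act on this observation is the normalization mentioned in the introduction; the ±3·std clip below is my own illustrative choice, and note that on the full volume this allocates a copy.
#clip extreme amplitudes to mean ± 3*std, then standardize to roughly zero mean, unit variance
clipped = np.clip(train_data_full, mean - 3 * std, mean + 3 * std)
normalized = (clipped - mean) / std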
Now let's take a look at what the amplitude distribution looks like for each label.
CLASS_LABELS = [
'Basement/other',
'Slope Mudstone A',
'Mass Transport\n Deposit',
'Slope Mudstone B',
'Slope Valley',
'Submarine Canyon\n System'
]
fig, ax = plt.subplots(3, 2, figsize=(14, 10))
for i, value in enumerate(values):
data = data_histograms[i]
mean_, std_ = data.mean(), data.std()
ax_ = ax[i // 2, i % 2]
label_name = CLASS_LABELS[value-1].replace('\n', '')
ax_.hist(data, log=True, bins=50, range=ranges, color='b',
label=f'value {value}: {label_name}', alpha=0.2)
ax_.axvline(mean_, color='r', linestyle='dashed', linewidth=2)
ax_.axvline(mean_ + 3*std_, color='r', linestyle='dashed', linewidth=1)
ax_.axvline(mean_ - 3*std_, color='r', linestyle='dashed', linewidth=1)
ax_.legend(loc='best')
print(f'{value}) mean is {mean_:4.4} and std is {std_:4.4} for label: {label_name}')
fig.show()
Above we looked at each label's amplitude distribution along with its mean and ±3·std range. The six classes look more or less the same, with a few differences. Now we are going to visualize a 2D slice of the 3D data, using the code below. The code is adapted from a great EDA notebook by dipam_chakraborty; make sure to check out their work here, which provides great insights and is also fun to read.
#let's look at 2d section of 3d data using below slice visualization
fig, ax = plt.subplots(1,3, sharey=True);
fig.set_size_inches(20, 8);
fig.suptitle("2D slice of the 3D seismic data volume for better visualization", fontsize=20);
ax[0].imshow(train_data_full[:, :, 100], cmap='terrain');
ax[0].set_ylabel('Z Axis: Top - Bottom', fontsize=14);
ax[1].imshow(train_label_full[:, :, 100]);
ax[2].imshow(train_data_full[:, :, 100], cmap='terrain');
ax[2].imshow(train_label_full[:, :, 100], alpha=0.4, cmap='twilight');
for i in range(3):
ax[i].set_xlabel('X Axis: West - East', fontsize=14);
EDA - Extra attributes
Moving on to attributes: we will explore a few operations which can be used to increase the number of channels in the input. Although these additions may lead to improved performance, DL models (NNs) can in principle approximate them themselves, so the performance advantage may not be significant (apart from some speedup during training).
Please make sure to check out the in-depth EDA in this great notebook by leocd here, and the wonderful work of sergeytsimfer in this discussion thread; I have used adapted versions of a few of the attributes they describe.
Here we use the Hilbert operator to create a new attribute: scipy's hilbert returns the analytic signal, and its imaginary part is the Hilbert transform of the slide, as seen in the next cell's output.
from seismiqb import plot_image, Horizon, HorizonMetrics, SeismicGeometry
import scipy
from scipy.ndimage import gaussian_filter1d, convolve
from scipy.signal import hilbert
slide = train_data_full[380, :, :]
plot_image(slide, cmap='gray', colorbar=False, figsize=(12, 8), title='Original slide')
hilbert_scipy = hilbert(slide, axis=1)
plot_image(hilbert_scipy.imag, cmap='gray', colorbar=False, figsize=(12, 8), title='Transformed slide')
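To actually feed these attributes to a 2D network, one option is to stack them as extra input channels; the particular attribute set and the smoothing sigma below are my own choices for illustration.
#stack raw amplitude, the Hilbert quadrature and a smoothed trace as channels: (3, H, W)
extra_channels = np.stack([
    slide,
    hilbert_scipy.imag,
    gaussian_filter1d(slide, sigma=2, axis=1),
], axis=0).astype(np.float32)
print(extra_channels.shape)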