GROUP 19

  • DO Thi Duyen
  • LE Ta Dang Khoa
In [1]:
!pip3 install --user imbalanced-learn
Requirement already satisfied: imbalanced-learn in /mnt/workspace/.local/lib/python3.5/site-packages (0.4.3)
Requirement already satisfied: scipy>=0.13.3 in /usr/local/lib/python3.5/dist-packages (from imbalanced-learn) (1.1.0)
Requirement already satisfied: scikit-learn>=0.20 in /mnt/workspace/.local/lib/python3.5/site-packages (from imbalanced-learn) (0.20.3)
Requirement already satisfied: numpy>=1.8.2 in /usr/local/lib/python3.5/dist-packages (from imbalanced-learn) (1.14.5)
In [1]:
import warnings
warnings.filterwarnings('ignore')

import numpy as np
import time
import pandas as pd
pd.set_option('display.max_columns', 130)

from collections import Counter

import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split, cross_validate
from sklearn.feature_selection import RFECV
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.metrics import classification_report
from sklearn.metrics import make_scorer
from sklearn.metrics import f1_score

from imblearn.under_sampling import RandomUnderSampler, NearMiss, TomekLinks
from imblearn.over_sampling import SMOTE, BorderlineSMOTE

import zipfile
from io import BytesIO
from PIL import Image

from sklearn import preprocessing

import seaborn as sns

1+2. DATA EXPLORATION & PRE-PROCESSING

A. Extract Images from ZIP file

In [6]:
def extract_zip_to_memory(input_zip):
    '''
    Extract the JPEG images stored inside the given zip file
    and keep them in memory as a Python dictionary.
    
    input_zip (string): path to the zip file
    
    returns (dict): {filename (string): image_file (BytesIO)}
    '''
    # Open the archive in a context manager so the file handle is closed afterwards
    with zipfile.ZipFile(input_zip) as archive:
        return {name: BytesIO(archive.read(name))
                for name in archive.namelist() if name.endswith('.jpg')}

start = time.time()
img_files = extract_zip_to_memory("/mnt/datasets/plankton/flowcam/imgs.zip")
print("Extraction time: ", time.time()-start)
print("Number of images: %d" % len(img_files))
Extraction time:  22.135587453842163
Number of images: 243610

Comment. The dataset is fairly large (~240k images) compared with the Oregon State dataset (~30k).

B. Take a look at the Images

In [7]:
def view_images(keys):
    """ Show images of planktons within the dataset,
        given the images' names.
    
    Parameters:
        keys (list of str): Names (zip entry paths) of the images to show
    Return:
        None.
    """
    for key in keys:
        # Read the Image
        img = imread(img_files[key])
    
        # Show the Image
        print(key)
        plt.imshow(img, cmap=cm.gray)
        plt.show()

FIRST, let's take a look at the images themselves.

In [8]:
from skimage.io import imread, imshow
from pylab import cm
import random

# Randomly selects 5 images
img_keys = list(img_files.keys())
rand_names = random.sample(img_keys, 5)

# Show the images
view_images(rand_names)
imgs/32732419.jpg
imgs/32678561.jpg
imgs/32568427.jpg
imgs/32625285.jpg
imgs/32643509.jpg

NEXT, let's inspect the type & depth of the images' pixels.

In [10]:
modes = [Image.open(img_files[k]).mode for k in img_files]
print(Counter(modes))
Counter({'L': 243610})

Comment.

  1. Since all 243,610 images in this dataset are of mode L, we can conclude that every image is single-channel greyscale with 8 bits per pixel.

  2. After running this random display several times, we see that the image dimensions vary widely, because each image is cropped tightly around the plankton object.

  3. The implication is that anyone who wants to apply a CNN to this dataset has to resize all the images to a common size, which is a very time-consuming process.
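
As a rough illustration of what such resizing involves, the sketch below pads a greyscale image to a square canvas and then rescales it. The helper name `pad_and_resize`, the 64-pixel target size and the white (255) background are assumptions chosen for illustration; they are not part of this notebook's pipeline.

```python
from PIL import Image

def pad_and_resize(img, side=64, fill=255):
    """Pad a greyscale PIL image to a square canvas, then resize it.

    `side` and `fill` are illustrative defaults (assumptions).
    """
    img = img.convert('L')
    w, h = img.size
    canvas_side = max(w, h)
    # Centre the original on a square canvas so its aspect ratio is preserved
    canvas = Image.new('L', (canvas_side, canvas_side), fill)
    canvas.paste(img, ((canvas_side - w) // 2, (canvas_side - h) // 2))
    return canvas.resize((side, side), Image.BILINEAR)

# Synthetic 30x90 black image standing in for a real plankton crop
example = Image.new('L', (30, 90), 0)
resized = pad_and_resize(example)
print(resized.size, resized.mode)
```

Padding keeps the object's proportions but shrinks small objects and large objects to the same canvas, so absolute size information is lost either way.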

FINALLY, we plot the distribution of images' sizes to see if we can get some more insights.

In [11]:
# Get the dimensions of the image
dims = [Image.open(img_files[k]).size for k in img_files]

# Get the ratios of the image
wh_ratios = [(w / h) for w, h in dims]
widths = [w for w, _ in dims]
heights = [h for _, h in dims]
In [12]:
# Graph the distribution of width-height ratios
plt.figure()
r = sns.distplot(wh_ratios, bins=1000, kde=False)
r.set(xlim=(0, 5), title='width / height ratio')

# Graph the distribution of widths
plt.figure()
sns.distplot(widths).set_title('widths')

# Graph the distribution of heights
plt.figure()
sns.distplot(heights).set_title('heights')
Out[12]:
Text(0.5,1,'heights')

COMMENT.

  1. Looking at the width-height ratio graph, we can see that:

    • the images are mostly square (ratio = 1),
    • a large share of the images are tall rectangles (ratio < 1, i.e. width < height),
    • and the images are rarely wide rectangles (ratio > 1, i.e. width > height).
  2. Looking at the width and height graphs, we can see that both share a very similar distribution: right-skewed (most of the mass at small values, with a long tail to the right), with modes around 100.

  3. The overall implication is that image size and image shape vary widely in this dataset. Scaling all images to a common size before applying a CNN would therefore distort the shapes and encode misleading (latent) signals in a LARGE proportion of the images. Unless we are skilled enough at image processing to resolve this problem (which we surely aren't), we should avoid the CNN path.
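
To put numbers on the shape categories read off the ratio plot, one could bucket each (width, height) pair. This is a minimal sketch: the function name `shape_category` and the 10% tolerance around a ratio of 1 are assumptions, and the sample pairs are synthetic stand-ins for the `dims` list.

```python
from collections import Counter

def shape_category(width, height, tol=0.1):
    """Classify an image by its width/height ratio.

    `tol` is an illustrative tolerance (assumption): ratios within
    [1 - tol, 1 + tol] are treated as roughly square.
    """
    ratio = width / height
    if ratio < 1 - tol:
        return 'tall'      # width < height
    if ratio > 1 + tol:
        return 'wide'      # width > height
    return 'square'

# Synthetic (width, height) pairs standing in for the real `dims` list
sample_dims = [(100, 100), (50, 120), (33, 65), (200, 90), (98, 104)]
shape_counts = Counter(shape_category(w, h) for w, h in sample_dims)
print(shape_counts)
```

Applied to the real `dims`, the same counter would give the share of tall, square and wide images directly instead of estimating it from the histogram.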

C. Analysing the Labels

FIRST, we load the meta-data of the images.

In [2]:
meta = pd.read_csv('/mnt/datasets/plankton/flowcam/meta.csv')
meta.head(5)
Out[2]:
objid projid id status latitude longitude objdate objtime depth_min depth_max unique_name lineage level1 level2
0 32756761.0 133 84963 V 43.683333 7.3 2013-09-19 00:09:00 0 75 detritus /#/not-living/detritus detritus detritus
1 32759364.0 133 84963 V 43.683333 7.3 2013-09-19 00:09:00 0 75 detritus /#/not-living/detritus detritus detritus
2 32758055.0 133 28299 V 43.683333 7.3 2013-09-19 00:09:00 0 75 Guinardia /#/living/Eukaryota/Harosa/Stramenopiles/Ochro... Guinardia Rhizosolenids
3 32758988.0 133 92010 V 43.683333 7.3 2013-09-19 00:09:00 0 75 silks /#/not-living/plastic/other/silks silks silks
4 32760598.0 133 92010 V 43.683333 7.3 2013-09-19 00:09:00 0 75 silks /#/not-living/plastic/other/silks silks silks

COMMENT. A big surprise when we look at the labels of this dataset is that many of them aren't plankton in the usual sense.

  • For example, detritus is dead organic matter (dead bodies); feces is, of course, not a usual plankton; and badfocus refers to images that were taken out of focus.
  • Looking more closely at the lineage column of the meta-data, one can see the full taxonomy, which also includes silks, artefacts and many other types that aren't usual plankton.
  • We speculate that if this data is representative, then it is highly imbalanced: detritus is overwhelmingly abundant in the sea, and silk or plastic is unlikely to occur in the same proportion as real plankton.

NEXT, we build the targets dataframe, which includes object-ids and their corresponding level-2 labels.

In [3]:
targets_df = meta[['objid','level2']].drop_duplicates()
# Keep only the rows that have a label; copy to avoid chained-assignment warnings
targets_df = targets_df[targets_df['level2'].notna()].copy()
targets_df['objid'] = targets_df['objid'].astype('int64')

targets_df.head(5)
Out[3]:
objid level2
0 32756761 detritus
1 32759364 detritus
2 32758055 Rhizosolenids
3 32758988 silks
4 32760598 silks
In [15]:
print("Number of records: %d" % targets_df.shape[0])
print("Number of classes: %d" % targets_df['level2'].nunique())
Number of records: 242607
Number of classes: 39

COMMENT. Here we build the target set as follows:

  1. Keep only the 2 columns objid (image-id) and level2 (label), because they are all we need.
  2. Remove all rows that don't have a label, as we cannot train on them.

After this process, 242,607 records are left, fewer than the 243,610 images in the dataset. We therefore need to make sure that every record in the target set has a corresponding image in the image dataset; otherwise, we would train our models on objects that don't exist.

FINALLY, we plot the distribution of labels.

In [16]:
labels_freq = Counter(targets_df['level2'])
labels_freq_df = pd.DataFrame.from_dict(labels_freq, orient='index')\
                             .rename(columns={0: 'count'})\
                             .sort_values(by=['count'], ascending=False)

labels_freq_df.plot(kind='bar');

COMMENT. As speculated above, this dataset is highly imbalanced.

  1. The biggest class has 77,812 instances (detritus, or dead bodies), while the smallest class has only 8 instances (Bacteriastrum, a real plankton).

  2. It is very difficult to get high performance if we leave this dataset as imbalanced as it is. Resampling techniques are therefore needed, possibly over-sampling for the minority classes.
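
To quantify the imbalance, one can compute the ratio between the largest and smallest class, plus per-class weights of the form n_samples / (n_classes * class_count), which is the heuristic behind scikit-learn's `class_weight='balanced'` option. The sketch below uses a hypothetical helper `balanced_class_weights` and toy labels; with the real counts quoted above, the ratio would be 77812 / 8 = 9726.5. Class weighting is a different remedy from the resampling used in this notebook and is shown only to make the imbalance concrete.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Return {class: n_samples / (n_classes * class_count)}."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}

# Toy labels (assumption): 90 majority samples and 10 minority samples
toy_labels = ['detritus'] * 90 + ['Bacteriastrum'] * 10
toy_counts = Counter(toy_labels)
toy_weights = balanced_class_weights(toy_labels)
imbalance_ratio = max(toy_counts.values()) / min(toy_counts.values())
print(toy_weights, imbalance_ratio)
```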

D. Encode the Labels. Check if object-ids are in sync with image-ids.

FIRST, we encode the labels of our targets dataframe.

In [4]:
# Initialize the LabelEncoder
le = preprocessing.LabelEncoder()
le.fit(targets_df['level2'])

# Encode the labels, drop the original column
targets_df['target'] = le.transform(targets_df['level2'])
targets_df.drop(['level2'], axis=1, inplace=True)

targets_df.head(5)
Out[4]:
objid target
0 32756761 30
1 32759364 30
2 32758055 20
3 32758988 37
4 32760598 37

NEXT, we ensure that all object-ids belong to the image-ids.

In [18]:
# Get the ID of all images
image_ids = [int(k.split('/')[1].split('.')[0]) for k in img_files]
image_ids_set = set(image_ids)
In [19]:
# Check that all object-ids belong to image-ids
object_ids_set = set(targets_df['objid'])
object_ids_set.difference(image_ids_set)
Out[19]:
set()

COMMENT. Since all object-ids refer to images in our image-dataset, we can proceed with feature exploration.
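
The check above only looks in one direction (labels without an image). A slightly more defensive sketch reports both directions and parses the id from the zip entry name without relying on the exact folder depth. The helper names `image_id_from_key` and `compare_ids` are hypothetical, and the keys below are toy values.

```python
import posixpath

def image_id_from_key(key):
    """Extract the integer id from a zip entry name such as 'imgs/32756761.jpg'."""
    stem, _ext = posixpath.splitext(posixpath.basename(key))
    return int(stem)

def compare_ids(object_ids, image_keys):
    """Return (labels without an image, images without a label) as two sets."""
    image_ids = {image_id_from_key(k) for k in image_keys}
    object_ids = set(object_ids)
    return object_ids - image_ids, image_ids - object_ids

# Toy data (assumption): three images, only two of them labelled
toy_keys = ['imgs/1.jpg', 'imgs/2.jpg', 'imgs/3.jpg']
toy_objids = [1, 2]
missing_images, unlabeled_images = compare_ids(toy_objids, toy_keys)
print(missing_images, unlabeled_images)
```

An empty first set corresponds to the `set()` output above; the second set lists images that simply have no usable label.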

E. Explore the Features:

WHY DO WE USE FEATURES INSTEAD OF IMAGES?

  1. First, as stated above, resizing images is not a good idea.
  2. Second, and much more important, the given features are engineered with expert knowledge:
    • If one looks up a part of this dataset online, one will see that the features are constructed using an expert process named ZooProcess.
    • This process computes the morphological features in the native feature-set.
    • Then, the skimage library is used to recompute the morphological features from the ROIs produced by ZooProcess.
    • ROIs are regions of interest, which can only be identified with expert knowledge.

That's why we decided to use the features instead (both feature-sets, in fact).

TO BEGIN, we investigate the Native feature-set, and see if there is anything we should do to preprocess the data.

In [5]:
# Load the Native Feature-set
features_native = pd.read_csv('/mnt/datasets/plankton/flowcam/features_native.csv.gz', compression='gzip', error_bad_lines=False)
features_native['objid'] = features_native['objid'].astype('int64')

We find all columns that contain at least one NaN.

In [21]:
features_native.loc[:, features_native.isna().any()][:10]
Out[21]:
perimareaexc feretareaexc cdexc convarea_area symetrieh_area symetriev_area nb1_area nb2_area nb3_area skeleton_area
0 2.215909 0.366477 0.011364 91.778443 0.005988 0.005988 0.017964 0.017964 0.059880 91.778443
1 0.462871 0.165842 -0.002475 18.030120 0.012048 0.012048 0.024096 0.012048 0.006024 18.030120
2 0.332130 0.139591 -0.001203 24.892857 0.017857 0.017857 0.011905 0.053571 0.107143 24.892857
3 NaN NaN NaN 91.718563 0.005988 0.005988 0.000000 0.000000 0.023952 99.365269
4 10.898551 3.536232 0.057971 109.449102 0.017964 0.017964 0.029940 0.083832 0.059880 109.449102
5 1.325219 0.194742 0.000000 190.180723 0.006024 0.006024 0.078313 0.072289 0.072289 190.180723
6 1.012579 0.134591 0.003774 208.923077 0.006410 0.006410 0.064103 0.217949 0.173077 208.923077
7 13.155844 3.129870 0.051948 113.497006 0.011976 0.011976 0.059880 0.017964 0.011976 113.497006
8 99.000000 44.166667 -0.166667 61.523810 0.029762 0.029762 0.017857 0.017857 0.011905 61.523810
9 23.000000 8.275862 0.137931 179.190476 0.011905 0.011905 0.041667 0.041667 0.047619 179.190476

COMMENT.

  • The NaNs are due to problems with individual records, not inherent flaws of the features. Therefore, we need to fill in the missing values instead of dropping the columns.
  • Next, let's find the rows that have NaN in those columns, to see if there are any patterns among them.
In [22]:
# For the columns: 'perimareaexc', 'feretareaexc' & 'cdexc'
display(features_native[features_native['perimareaexc'].isnull()].loc[:, features_native.isna().any()][:10])

# For the columns: 'convarea_area', 'symetrieh_area', 'symetriev_area',
# 'nb1_area', 'nb2_area', 'nb3_area' & 'skeleton_area'
display(features_native[features_native['convarea_area'].isnull()].loc[:, features_native.isna().any()][:10])
perimareaexc feretareaexc cdexc convarea_area symetrieh_area symetriev_area nb1_area nb2_area nb3_area skeleton_area
3 NaN NaN NaN 91.718563 0.005988 0.005988 0.0 0.0 0.023952 99.365269
12 NaN NaN NaN 12.053892 0.005988 0.005988 0.0 0.0 0.000000 28.227545
13 NaN NaN NaN 16.571429 0.006494 0.006494 0.0 0.0 0.000000 48.532468
17 NaN NaN NaN 18.114650 0.006369 0.006369 0.0 0.0 0.000000 46.802548
19 NaN NaN NaN 21.549669 0.006623 0.006623 0.0 0.0 0.000000 51.841060
22 NaN NaN NaN 15.935065 0.006494 0.012987 0.0 0.0 0.000000 38.688312
27 NaN NaN NaN 20.577922 0.006494 0.006494 0.0 0.0 0.000000 50.331169
29 NaN NaN NaN 15.980892 0.006369 0.006369 0.0 0.0 0.000000 47.496815
36 NaN NaN NaN 19.496774 0.006452 0.006452 0.0 0.0 0.000000 43.232258
37 NaN NaN NaN 21.097561 0.006098 0.006098 0.0 0.0 0.000000 46.689024
perimareaexc feretareaexc cdexc convarea_area symetrieh_area symetriev_area nb1_area nb2_area nb3_area skeleton_area
807 4.508475 1.084746 -0.016949 NaN NaN NaN NaN NaN NaN NaN
826 1.652632 0.333333 0.014035 NaN NaN NaN NaN NaN NaN NaN
829 2.481818 0.545455 -0.009091 NaN NaN NaN NaN NaN NaN NaN
830 1.793939 0.400000 -0.006061 NaN NaN NaN NaN NaN NaN NaN
842 4.072165 0.814433 0.020619 NaN NaN NaN NaN NaN NaN NaN
844 2.561905 0.533333 -0.009524 NaN NaN NaN NaN NaN NaN NaN
850 2.780952 0.476190 0.038095 NaN NaN NaN NaN NaN NaN NaN
853 11.714286 2.321429 0.142857 NaN NaN NaN NaN NaN NaN NaN
854 3.623377 0.740260 0.051948 NaN NaN NaN NaN NaN NaN NaN
855 1.635838 0.358382 0.028902 NaN NaN NaN NaN NaN NaN NaN
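
The two tables above suggest that the NaNs come in a small number of fixed column groups. A quick way to confirm this is to count how many rows share each combination of missing columns. This is a minimal sketch on a toy frame; the helper name `missingness_patterns` is hypothetical and the values are made up.

```python
from collections import Counter

import numpy as np
import pandas as pd

def missingness_patterns(df):
    """Count rows per distinct combination of columns that are NaN."""
    nan_cols = [c for c in df.columns if df[c].isna().any()]
    mask = df[nan_cols].isna().values
    # Each row's pattern is the tuple of column names that are missing in it
    return Counter(tuple(c for c, m in zip(nan_cols, row) if m) for row in mask)

# Toy frame (assumption) mimicking the two groups seen in the native features
toy = pd.DataFrame({
    'perimareaexc': [2.2, np.nan, 0.3, np.nan],
    'cdexc': [0.01, np.nan, 0.0, np.nan],
    'convarea_area': [91.7, 12.0, np.nan, 16.5],
    'area': [1.0, 2.0, 3.0, 4.0],
})
pattern_counts = missingness_patterns(toy)
print(pattern_counts)
```

If the real data yields only a few patterns, that supports the reading that the values are missing systematically rather than at random.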

COMMENT. The rows identified for inspection are:

  • To fill in the missing values for perimareaexc, feretareaexc, cdexc, let's inspect rows 17, 22, 29 (selected at random).
  • To fill in the missing values for convarea_area, symetrieh_area, symetriev_area, nb1_area, nb2_area, nb3_area, skeleton_area, let's inspect rows 826, 842, 855 (selected at random).
In [23]:
# Get the obj-ids of object at index = 17, 22, 29
focus_ids = features_native.loc[[17, 22, 29], ['objid']].objid.tolist()
focus_keys = ["imgs/{}.jpg".format(id) for id in focus_ids]

# View their images
view_images(focus_keys)
imgs/32733995.jpg
imgs/32587335.jpg
imgs/32736277.jpg
In [24]:
# Get the obj-ids of object at index = 826, 842, 855
focus_ids = features_native.loc[[826, 842, 855], ['objid']].objid.tolist()
focus_keys = ["imgs/{}.jpg".format(id) for id in focus_ids]

# View their images
view_images(focus_keys)
imgs/32745721.jpg
imgs/32643818.jpg
imgs/32642887.jpg

COMMENT. In each case, we can see that the images are very similar, which is systematic evidence of absence, NOT a random absence of evidence:

  • Therefore, there must be a reason why the experts decided not to fill in the values.
  • So, we should fill in 0 rather than something else (such as the mode, mean, or median).
In [6]:
# Fill all missing values with 0
features_native.fillna(0, inplace=True)

NEXT, we investigate the skimage feature-set, for the purpose of data-preprocessing.

In [7]:
# Load the skimage's feature-set
features_skimage = pd.read_csv('/mnt/datasets/plankton/flowcam/features_skimage.csv.gz', compression='gzip', error_bad_lines=False)
features_skimage['objid'] = features_skimage['objid'].astype('int64')
In [27]:
# View ALL columns that contains NaN
features_skimage.loc[:10, features_skimage.isna().any()]
Out[27]:
moments_normalized0 moments_normalized1 moments_normalized4 weighted_moments_normalized0 weighted_moments_normalized1 weighted_moments_normalized4
0 NaN NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN NaN
5 NaN NaN NaN NaN NaN NaN
6 NaN NaN NaN NaN NaN NaN
7 NaN NaN NaN NaN NaN NaN
8 NaN NaN NaN NaN NaN NaN
9 NaN NaN NaN NaN NaN NaN
10 NaN NaN NaN NaN NaN NaN

COMMENT. These columns are inherently flawed, as they contain mostly NaN. This makes sense because the features here are recomputed using the skimage library, a very general library, so it is normal that some features cannot be computed. Therefore, we should remove all these empty features.

In [8]:
# Drop the 'flawed' columns (dropna removes every column that contains at least one NaN)
features_skimage.dropna(axis='columns', inplace=True)
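
Note that `dropna(axis='columns')` removes a column as soon as it holds a single NaN. If one wanted to drop only the columns that are mostly empty, a threshold on the NaN fraction is one option. The sketch below is an alternative, not what this notebook does; the helper name `drop_sparse_columns` and the 0.5 threshold are assumptions.

```python
import numpy as np
import pandas as pd

def drop_sparse_columns(df, max_nan_fraction=0.5):
    """Drop columns whose share of NaN values exceeds `max_nan_fraction`.

    The 0.5 default is an illustrative assumption.
    """
    nan_fraction = df.isna().mean()
    keep = nan_fraction[nan_fraction <= max_nan_fraction].index
    return df[keep]

# Toy frame (assumption)
toy = pd.DataFrame({
    'objid': [1, 2, 3, 4],
    'mostly_nan': [np.nan, np.nan, np.nan, 1.0],   # 75% NaN, dropped
    'one_nan': [0.1, np.nan, 0.3, 0.4],            # 25% NaN, kept
})
filtered = drop_sparse_columns(toy)
strict = toy.dropna(axis='columns')
print(list(filtered.columns), list(strict.columns))
```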

NOW, we'll merge the 2 feature-sets with the targets, using object-ids.

In [9]:
data = features_native.merge(features_skimage, on='objid', how='inner').merge(targets_df, on='objid', how='inner')
print("Data after joining features and target: ", data.shape)
data.head(10)
Data after joining features and target:  (242607, 125)
Out[9]:
objid area_x meanimagegrey mean stddev min perim. width height major minor angle circ. feret intden median skew kurt %area area_exc fractal skelarea slope histcum1 histcum2 histcum3 nb1 nb2 nb3 symetrieh symetriev symetriehc symetrievc convperim convarea fcons thickr esd elongation range meanpos centroids cv sr perimareaexc feretareaexc perimferet perimmajor circex cdexc kurt_mean skew_mean convperim_perim convarea_area symetrieh_area symetriev_area nb1_area nb2_area nb3_area nb1_range nb2_range nb3_range median_mean median_mean_range skeleton_area area_y convex_area eccentricity equivalent_diameter euler_number filled_area inertia_tensor0 inertia_tensor1 inertia_tensor2 inertia_tensor3 inertia_tensor_eigvals0 inertia_tensor_eigvals1 major_axis_length max_intensity mean_intensity min_intensity minor_axis_length moments_hu0 moments_hu1 moments_hu2 moments_hu3 moments_hu4 moments_hu5 moments_hu6 moments_normalized2 moments_normalized3 moments_normalized5 moments_normalized6 moments_normalized7 moments_normalized8 moments_normalized9 moments_normalized10 moments_normalized11 moments_normalized12 moments_normalized13 moments_normalized14 moments_normalized15 perimeter solidity weighted_moments_hu0 weighted_moments_hu1 weighted_moments_hu2 weighted_moments_hu3 weighted_moments_hu4 weighted_moments_hu5 weighted_moments_hu6 weighted_moments_normalized2 weighted_moments_normalized3 weighted_moments_normalized5 weighted_moments_normalized6 weighted_moments_normalized7 weighted_moments_normalized8 weighted_moments_normalized9 weighted_moments_normalized10 weighted_moments_normalized11 weighted_moments_normalized12 weighted_moments_normalized13 weighted_moments_normalized14 weighted_moments_normalized15 target
0 32756761 6653.0 167.18 205.76 65.341 85 779.66 109 123 147.8 132.0 90 0.138 128.8 3153711.0 253 -0.804 -1.142 94.71 352.0 1.012 15327.0 0.802 100 146 238 3 3 10 1.431 1.430 3 3 15327.0 15327.0 253.107 1.008 106 1.121212 170 -0.404959 4 31.553398 38.235294 2.215909 0.366477 6.046512 5.270270 5.656474 0.011364 -0.004854 -0.004854 19.650000 91.778443 0.005988 0.005988 0.017964 0.017964 0.059880 0.017647 0.017647 0.058824 47 0.276471 91.778443 6277.0 9994.0 0.675984 89.398684 -9 6531.0 577.942452 -170.679844 -170.679844 803.296613 895.137829 486.101236 119.675416 227 173.805958 21 88.190815 2.200476e+08 4.246392e+06 2.662663e+06 2.354605e+05 1.611829e+02 -1.336356e+03 -9.369804e+01 1.279746e+08 2.231306e+07 2.719131e+07 -7.213529e+06 5.430433e+06 9.207304e+07 -7.823094e+06 9.849355e+06 2.387593e+05 2.163625e+06 3.671610e+06 -1.423642e+06 6.063819e+05 1050.714862 0.628077 1.086257e+06 1.419815e+02 0.328274 0.045152 5.380579e-12 -1.682703e-07 -1.125877e-12 6.014419e+05 9.325868e+03 1.791506e+05 -1.525763e+03 1.941643e+02 4.848153e+05 -2694.631449 2.414405e+02 1.295103 440.053063 129.297892 -2.122403 0.115495 30
1 32759364 1275.0 165.83 234.29 38.562 98 186.99 33 65 82.4 46.3 90 0.458 67.2 701229.0 255 -1.903 2.434 68.31 404.0 1.026 2993.0 0.110 167 216 246 4 2 1 1.657 1.646 3 2 2993.0 2993.0 31.693 1.014 104 1.782609 157 -0.154412 -1 16.666667 24.840764 0.462871 0.165842 2.791045 2.280488 27.442236 -0.002475 0.008547 -0.008547 16.005348 18.030120 0.012048 0.012048 0.024096 0.012048 0.006024 0.025478 0.012739 0.006369 21 0.133758 18.030120 857.0 1500.0 0.888307 33.032806 -2 1273.0 70.190034 48.680909 48.680909 267.628870 278.979223 58.839681 66.810684 218 115.376896 22 30.682811 3.941878e+08 6.598337e+07 1.246740e+06 5.309983e+04 5.416430e+00 -7.431118e+03 -1.254289e+01 3.122857e+08 3.873429e+06 -5.680386e+07 3.288071e+06 -3.626536e+07 8.190202e+07 -8.713215e+06 2.221631e+07 -4.622357e+06 -8.735667e+06 -7.377503e+06 -3.434297e+05 -3.928709e+06 338.663997 0.571333 3.302787e+06 5.701727e+03 0.905013 0.112204 -1.375254e-11 -1.946467e-04 3.300476e-11 2.728961e+06 8.736413e+03 -5.140816e+05 -2.060929e+03 -2.790422e+03 5.738257e+05 -7088.333306 1.420614e+03 -17.330565 -8402.726309 -335.688243 -11.030494 -1.784313 30
2 32758055 2416.0 167.92 239.15 25.590 94 276.33 26 115 138.8 38.4 90 0.398 115.6 1000126.0 255 -2.147 5.385 65.60 831.0 1.050 4182.0 0.183 205 229 243 2 9 18 2.799 2.769 4 4 4182.0 4182.0 63.590 1.008 106 3.657895 161 -0.110345 -1 10.878661 16.149068 0.332130 0.139591 2.379310 1.985612 37.563504 -0.001203 0.020921 -0.008368 15.152174 24.892857 0.017857 0.017857 0.011905 0.053571 0.107143 0.012422 0.055901 0.111801 16 0.099379 24.892857 1377.0 2564.0 0.977587 41.871838 -12 2416.0 52.042625 28.888459 28.888459 1156.370209 1157.125395 51.287439 136.066184 221 77.190995 19 28.646100 8.775692e+08 6.449337e+08 2.441400e+07 2.843659e+07 7.490162e+05 2.276425e+07 -1.932569e+04 8.397750e+08 -1.646830e+08 -2.097927e+07 9.473912e+06 -3.281929e+07 3.779421e+07 -3.581400e+06 3.041981e+07 -9.421348e+06 1.647676e+06 -3.332623e+06 2.087634e+06 -4.829305e+06 867.416306 0.537051 1.220825e+07 1.258289e+05 76.795145 84.364545 6.789266e-06 9.448652e-01 -1.335473e-07 1.170246e+07 -2.859165e+05 -3.403948e+05 1.495453e+04 -6.573885e+03 5.057891e+05 -4044.999299 5.357163e+03 -175.738886 1982.197147 -585.246490 36.387240 -11.036476 20
3 32758988 1433.0 167.34 248.79 20.933 107 388.40 126 117 151.2 141.0 0 0.119 171.3 4167272.0 255 -4.124 17.007 100.00 0.0 1.185 16594.0 1.607 251 252 253 0 0 4 1.344 1.353 9 9 16245.0 15317.0 281.337 1.872 106 1.070922 148 -0.042254 -1 8.433735 14.189189 0.000000 0.000000 2.269006 2.569536 0.000000 0.000000 0.068273 -0.016064 41.868557 91.718563 0.005988 0.005988 0.000000 0.000000 0.023952 0.000000 0.000000 0.027027 6 0.040541 99.365269 1433.0 2451.0 0.996145 42.714778 1 1433.0 1151.329152 -1023.481219 -1023.481219 938.576959 2073.947555 15.958556 182.162457 211 109.443126 20 15.979265 1.458413e+09 2.062499e+09 2.714748e+08 1.889040e+08 4.263768e+07 2.418257e+08 3.469392e+06 6.549735e+08 -2.556928e+07 7.142228e+08 -9.617200e+07 9.277311e+08 8.034397e+08 -1.821978e+08 1.036359e+09 -2.851636e+08 -2.855829e+08 1.176318e+09 -4.614650e+08 1.829032e+09 402.315801 0.584659 1.182985e+07 1.360323e+05 371.384959 322.572749 1.116296e-04 3.612542e+00 2.067509e-06 5.341701e+06 -9.034729e+04 5.803403e+06 -1.525937e+05 6.966508e+04 6.488148e+06 -227827.736082 7.859818e+04 -4117.231070 -317871.379287 89887.970505 -5606.440604 1325.968409 37
4 32760598 1650.0 166.89 250.42 14.603 124 751.96 66 239 278.7 83.5 90 0.037 244.4 4577141.0 255 -4.376 20.118 95.82 69.0 1.013 18278.0 1.825 242 251 252 5 14 10 2.612 2.598 19 19 18278.0 18278.0 285.249 1.004 106 3.321429 131 -0.039683 4 6.000000 11.450382 10.898551 3.536232 3.081967 2.695341 1.149973 0.057971 0.080000 -0.016000 24.305851 109.449102 0.017964 0.017964 0.029940 0.083832 0.059880 0.038168 0.106870 0.076336 5 0.038168 109.449102 1581.0 3902.0 0.998342 44.866376 -3 1650.0 256.002523 961.046175 961.046175 3821.627978 4064.162374 13.468127 255.003133 195 78.139152 20 14.679579 2.579146e+09 6.564402e+09 2.037154e+09 1.989648e+09 4.005676e+09 5.094617e+09 6.813774e+06 2.417222e+09 1.305025e+09 -6.078723e+08 -3.077877e+08 -3.056582e+09 1.619244e+08 6.843483e+07 7.729145e+08 8.843851e+08 -1.354579e+07 -2.018699e+08 -2.155121e+08 -1.281629e+09 780.150324 0.405177 2.578533e+07 6.534352e+05 3282.103656 3173.567259 1.024227e-02 8.103163e+01 2.693943e-05 2.401762e+07 1.645298e+06 -6.292385e+06 -3.959394e+05 -3.502930e+05 1.767707e+06 87800.105557 9.101105e+04 13798.884373 -16296.563545 -24557.198375 -3379.838856 -1854.215755 37
5 32760828 11200.0 166.39 221.26 55.169 76 1361.04 197 146 231.3 173.8 0 0.076 199.6 6985233.0 254 -1.409 0.385 90.83 1027.0 1.008 31570.0 2.188 129 216 250 13 12 12 1.493 1.486 3 3 31570.0 31570.0 317.430 1.005 104 1.327586 179 -0.234483 0 24.886878 30.726257 1.325219 0.194742 6.805000 5.891775 9.468571 0.000000 0.000000 -0.004525 23.196179 190.180723 0.006024 0.006024 0.078313 0.072289 0.072289 0.072626 0.067039 0.067039 33 0.184358 190.180723 10156.0 17641.0 0.519437 113.714646 -59 11034.0 1258.148157 93.425958 93.425958 965.945424 1285.465414 938.628167 143.413551 233 151.482670 21 122.548157 2.189931e+08 1.166289e+06 1.003936e+06 1.422881e+05 -4.154643e+01 3.632891e+03 3.414654e+01 9.511081e+07 -9.269539e+06 -9.199090e+06 -4.531977e+06 6.782668e+05 1.238823e+08 2.073144e+06 6.263602e+06 -8.236135e+04 1.404513e+07 -6.233700e+06 -3.473279e+05 -2.407816e+05 2437.739428 0.575704 1.249359e+06 1.254557e+02 0.081539 0.043521 -6.897267e-14 1.361957e-05 2.591624e-12 4.684963e+05 -2.617300e+03 -8.349132e+04 -7.826433e+02 -1.261023e+01 7.808622e+05 -482.665019 2.153805e+02 -1.012162 6605.944524 -255.575908 0.784273 -0.077414 30
6 32760820 8726.0 156.03 234.44 34.530 89 1610.45 160 186 218.9 189.6 90 0.042 213.8 7640751.0 251 -2.099 3.652 81.78 1590.0 1.008 32592.0 2.301 208 237 248 10 34 27 1.441 1.440 2 2 32592.0 32592.0 474.603 1.005 98 1.152632 166 -0.144828 6 14.957265 21.084337 1.012579 0.134591 7.523364 7.351598 12.425702 0.003774 0.017094 -0.008547 20.243478 208.923077 0.006410 0.006410 0.064103 0.217949 0.173077 0.060241 0.204819 0.162651 17 0.102410 208.923077 7121.0 21540.0 0.588547 95.219424 -28 8720.0 1797.845596 -300.829808 -300.829808 2454.768711 2571.712103 1680.902205 202.848203 224 121.518467 42 163.995229 5.971934e+08 1.564905e+07 1.472326e+07 2.054673e+07 -2.072334e+04 2.237670e+06 3.567670e+05 3.447225e+08 1.163498e+08 4.224544e+07 2.870739e+07 4.841636e+07 2.524709e+08 -3.545087e+05 6.001376e+07 2.107953e+07 5.550557e+07 2.806477e+07 1.270830e+07 2.143566e+07 2603.691701 0.330594 4.864202e+06 4.682387e+02 16.433113 11.892377 2.046904e-08 7.216512e-03 1.649856e-07 2.730455e+06 1.039565e+05 1.674658e+05 1.427095e+04 2.244763e+03 2.133748e+06 -8043.359597 3.354233e+03 92.965136 37623.535423 1190.466229 37.466611 7.282237 30
7 32758467 2158.0 167.32 250.72 11.956 131 1013.26 70 235 274.2 88.0 90 0.026 241.4 4752102.0 255 -4.335 23.199 96.43 77.0 1.012 18954.0 1.766 238 250 252 10 3 2 2.466 2.470 11 11 18954.0 18954.0 263.710 1.004 106 3.113636 124 -0.033333 4 4.780876 9.677419 13.155844 3.129870 4.203320 3.697080 0.953311 0.051948 0.091633 -0.015936 18.710760 113.497006 0.011976 0.011976 0.059880 0.017964 0.011976 0.080645 0.024194 0.016129 4 0.032258 113.497006 2081.0 7298.0 0.990062 51.474377 0 2154.0 358.730300 1111.734532 1111.734532 4925.997404 5182.233115 102.494590 287.950916 188 59.539164 19 40.495845 2.539514e+09 5.958522e+09 1.159142e+09 1.560103e+08 -2.259260e+07 -2.438001e+08 6.237812e+07 2.367130e+09 3.701275e+08 -5.342309e+08 2.708991e+08 -1.957788e+09 1.723836e+08 -1.265406e+08 5.359610e+08 -6.337206e+08 4.002802e+07 -1.696316e+08 2.375503e+08 -6.872392e+08 1076.466125 0.285147 4.675404e+07 2.050523e+06 19146.557896 16949.604847 3.053276e-01 7.211900e+02 2.832575e-03 4.340180e+07 4.149616e+06 -1.056593e+07 -3.020575e+05 -6.752099e+05 3.352239e+06 -42179.759247 1.689413e+05 6273.581578 21755.668574 -50407.740029 713.762983 -2984.091582 37
8 32760505 1650.0 167.81 247.04 19.333 107 594.49 30 264 306.9 42.9 90 0.059 265.2 2553399.0 255 -3.010 9.464 99.64 6.0 1.061 10336.0 0.622 215 245 251 3 3 2 5.232 5.210 28 27 10336.0 10336.0 203.293 1.004 106 7.139535 148 -0.057143 -1 7.692308 12.837838 99.000000 44.166667 2.241509 1.934853 0.127362 -0.166667 0.036437 -0.012146 17.400673 61.523810 0.029762 0.029762 0.017857 0.017857 0.011905 0.020270 0.020270 0.013514 8 0.054054 61.523810 1644.0 3647.0 0.998624 45.751566 0 1645.0 46.686870 -410.605985 -410.605985 5261.864414 5293.994657 14.556626 291.039369 211 80.802311 17 15.261259 3.229046e+09 1.031269e+10 7.054339e+08 2.708485e+08 1.084802e+08 5.111947e+08 4.741762e+07 3.200647e+09 -4.018777e+08 2.497603e+08 -2.662539e+08 1.573086e+09 2.839834e+07 -4.197437e+07 1.696885e+08 -3.676090e+08 -5.488270e+06 2.072013e+07 -5.113759e+07 1.639030e+08 591.486327 0.450781 3.054801e+07 9.225525e+05 5441.903019 5067.285606 2.660885e-02 1.500216e+02 2.005059e-04 3.033143e+07 -2.165898e+06 1.977943e+06 -3.926003e+05 1.790250e+05 2.165878e+05 -49614.751002 1.975551e+04 -6500.007422 -5881.199021 2361.462960 -818.651949 303.587619 37
9 32760693 1365.0 167.59 252.11 10.563 120 667.28 134 204 239.2 160.2 90 0.039 239.8 7589445.0 255 -6.211 44.310 97.88 29.0 1.008 30104.0 3.683 249 251 253 7 7 8 1.555 1.555 24 24 30104.0 30104.0 324.113 1.005 106 1.493750 135 -0.022727 4 4.365079 8.148148 23.000000 8.275862 2.779167 2.790795 0.548007 0.137931 0.174603 -0.023810 45.133433 179.190476 0.011905 0.011905 0.041667 0.041667 0.047619 0.051852 0.051852 0.059259 3 0.022222 179.190476 1336.0 9268.0 0.981394 41.243764 0 1340.0 1640.908337 -2188.773144 -2188.773144 3464.186926 4923.584131 181.511132 280.673023 199 73.474551 20 53.890427 3.821179e+09 1.259864e+10 1.583859e+10 8.747090e+09 1.018017e+11 2.686971e+10 -1.537675e+10 2.592954e+09 -2.528776e+09 1.638303e+09 -1.135126e+09 7.754716e+09 1.228225e+09 -2.732966e+08 4.853526e+09 -5.936183e+09 1.888273e+08 3.380277e+09 -1.918858e+09 1.760997e+10 691.808225 0.144152 5.511515e+07 2.689541e+06 36890.275189 20192.660454 5.441265e-01 8.497679e+02 -8.752004e-02 3.635436e+07 -3.961379e+06 2.439266e+07 -1.657096e+06 1.590276e+06 1.876079e+07 -331190.573172 1.032216e+06 -128032.675688 327995.662816 729270.960045 -40349.958742 52317.110791 37
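
The `area_x` and `area_y` columns in the output above appear because both feature-sets contain a column named `area`, and pandas adds suffixes to disambiguate them. The toy sketch below reproduces that behaviour and also passes `validate='one_to_one'`, which makes pandas raise an error if any `objid` is duplicated on either side; the frames and values are made up for illustration.

```python
import pandas as pd

# Toy stand-ins (assumption) for the native features, skimage features and targets
features_a = pd.DataFrame({'objid': [1, 2, 3], 'area': [10.0, 20.0, 30.0]})
features_b = pd.DataFrame({'objid': [1, 2, 3], 'area': [11.0, 21.0, 31.0]})
targets = pd.DataFrame({'objid': [1, 2], 'target': [30, 20]})

# Inner joins keep only the ids present in all three frames
merged = (features_a
          .merge(features_b, on='objid', how='inner', validate='one_to_one')
          .merge(targets, on='objid', how='inner', validate='one_to_one'))
print(merged.columns.tolist(), merged.shape)
```

With inner joins, the row count of the result equals the number of ids shared by all inputs, which matches the 242,607 labelled records reported above.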

FINALLY, we measure the mutual-information between the merged features and the targets.

In [17]:
from sklearn.feature_selection import mutual_info_classif

# Get the Merged-Features
merged_features = data.drop(columns=["objid", "target"])

# Get Mutual Information between 'target' and each 'feature'
# Then turn it into a DataFrame, and sort descendingly
mutual_info = mutual_info_classif(merged_features, data['target'])

mutual_info_df = pd.DataFrame(data=mutual_info, index=merged_features.columns, columns=['MI'])
mutual_info_df.sort_values(by='MI', inplace=True, ascending=False)

To judge how large these mutual-information values are, we compare them to the entropy of the targets.

In [18]:
from scipy.stats import entropy

targets_freq = data['target'].value_counts().tolist()
targets_entropy = entropy(targets_freq)

print(targets_entropy)
1.833188360384403
In [32]:
# Show the head & tail of mutual-information data-frame
display(mutual_info_df[:5])
display(mutual_info_df[-5:])

print("# of features: {}.".format(len(mutual_info_df)))
print("# of features that reduces uncertainty BY-MORE-THAN-10%: {}.".\
          format(len(mutual_info_df[mutual_info_df['MI'] > 0.1*targets_entropy])))
MI
cv 0.570214
kurt_mean 0.488479
skew_mean 0.475384
weighted_moments_hu0 0.462755
mean 0.456567
MI
esd 0.142574
nb1 0.138857
moments_normalized12 0.129653
centroids 0.113580
angle 0.027163
# of features: 123.
# of features that reduces uncertainty BY-MORE-THAN-10%: 108.
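
For reference, the entropy used as the yardstick here is H = -sum(p_i * ln p_i) over the class proportions p_i (scipy's `entropy` normalizes the counts and uses the natural logarithm by default). The sketch below recomputes it from scratch and expresses the top feature's mutual information as a share of the target entropy, assuming both quantities are measured in nats. The helper name `entropy_nats` is hypothetical; the two constants are copied from the outputs above.

```python
import math

def entropy_nats(counts):
    """Shannon entropy (natural log) of a list of class counts."""
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]
    return -sum(p * math.log(p) for p in probs)

# A uniform distribution over 4 classes reaches the maximum, log(4)
uniform_entropy = entropy_nats([25, 25, 25, 25])
# A heavily skewed distribution has much lower entropy
skewed_entropy = entropy_nats([97, 1, 1, 1])

# Values copied from the notebook outputs above
targets_entropy = 1.833188360384403
top_feature_mi = 0.570214            # MI of the 'cv' feature
top_share = top_feature_mi / targets_entropy
threshold = 0.1 * targets_entropy    # the 10% cut-off used above
print(round(top_share, 3), round(threshold, 4), round(math.log(39), 3))
```

The target entropy of about 1.83 is well below the maximum of ln(39) for 39 equally likely classes, which is another view of the class imbalance.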

COMMENT:

  • First, 108 of the 123 features each reduce the uncertainty of the labels by more than 10%, which is quite a good number.
  • This probably confirms our intuition about features engineered using expert knowledge:
    1. First, by definition, planktons are categorized by a multitude of morphological features.
    2. Second, the features here are identified using regions of interest (ROIs) marked by experts.

Therefore, seeing almost all of them reduce the uncertainty by a good amount is reassuring.
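
The comparison between mutual information and target entropy can be written as a simple ratio. The sketch below is a minimal, dependency-free illustration; the helper names `entropy_nats` and `uncertainty_reduction` are ours and not part of the notebook. It assumes that both quantities are expressed in nats, which is the default for `scipy.stats.entropy` and, as far as we know, also the unit returned by `mutual_info_classif`.

```python
import math

def entropy_nats(counts):
    """Shannon entropy (natural log) of a list of class counts."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)

def uncertainty_reduction(mi, target_entropy):
    """Fraction of the target entropy that one feature accounts for."""
    return mi / target_entropy

# Values reported above: entropy of the targets, MI of the best ('cv')
# and of the worst ('angle') feature.
h = 1.833188360384403
best = uncertainty_reduction(0.570214, h)
worst = uncertainty_reduction(0.027163, h)
print(round(best, 3), round(worst, 3))
```

Under these assumptions, `cv` alone accounts for roughly a third of the label uncertainty, while `angle` stays well below the 10% threshold.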

F. Split Train-Validation-Test sets

In [10]:
# Reverse the label-encoding, as we don't need it anymore
data['target'] = le.inverse_transform(data['target'])

# Split into training, validation and testing sets: 25% for testing, then 25% of the remainder for validation
# (effective ratio 0.5625 : 0.1875 : 0.25)
X_train, X_test, y_train, y_test = train_test_split(data.drop(['objid', 'target'], axis=1), data['target'], test_size=0.25, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=42)

print("Number of classes of training set: ", y_train.nunique())
print("Number of classes of validation set: ", y_val.nunique())
print("Number of classes of testing set: ", y_test.nunique())
Number of classes of training set:  39
Number of classes of validation set:  39
Number of classes of testing set:  39

COMMENT.

  • Here we split the full set into a training set, a validation set and a testing set. Since the validation set is 25% of the 75% left after removing the test set, the effective ratio is 0.5625 : 0.1875 : 0.25.
  • Each split keeps the same number of classes as the full set (39 classes).
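
Because the second call to `train_test_split` takes its 25% from the data that remains after the first call, the effective fractions can be checked with a short calculation. This is a minimal sketch; the helper `effective_split` is hypothetical and only does the arithmetic.

```python
def effective_split(test_size, val_size):
    """Fractions of the full set after two consecutive splits."""
    test = test_size
    val = (1.0 - test_size) * val_size   # validation is taken from the remainder
    train = 1.0 - test - val
    return train, val, test

# The two calls used above: test_size=0.25, then test_size=0.25 again
train, val, test = effective_split(0.25, 0.25)
print(train, val, test)

# An exact 0.5 / 0.25 / 0.25 split would need one third of the remainder
train2, val2, test2 = effective_split(0.25, 1.0 / 3.0)
print(train2, val2, test2)
```

This agrees with the support counts reported later in the notebook: 45489 validation samples against 60652 test samples is a ratio of 0.75, the same as 0.1875 / 0.25.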

3. SOLVING THE IMBALANCED DATASET

MAIN IDEAS.

  1. Why do we need to solve this? Because otherwise our trained models will have low predictive power for the infrequent classes of the dataset.

  2. How are we going to solve this?

    • Since the dataset is very imbalanced, we use both oversampling and undersampling techniques.
    • In addition, we use a cleaning method called TomekLinks to improve the resampling.
  3. How do we know which resamplings are good?

    • We use a Baseline Model and the Validation Set to measure the performance of each resampling.
    • The metric we use is the f1-score with average='macro'.
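
To make the choice of metric concrete, here is a small dependency-free sketch of the macro-averaged F1-score. The helpers `per_class_f1` and `macro_f1` are illustrative names, and the toy labels are made up; the notebook itself relies on `classification_report`.

```python
def per_class_f1(y_true, y_pred):
    """F1-score of every class that appears in y_true."""
    scores = {}
    for label in sorted(set(y_true)):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores

def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1-scores."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)

# Toy example: a model that always predicts the frequent class
y_true = ["common"] * 8 + ["rare"] * 2
y_pred = ["common"] * 10
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy, macro_f1(y_true, y_pred))
```

Every class counts equally in the average, so a model that ignores the rare class is penalised heavily even though its accuracy looks good. This is why the macro average suits an imbalanced dataset.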
In [11]:
# A helper function to display the score
def report_f1(y_test, y_pred):
    """ Return the well-displayed scores for evaluation.
    
    Parameters:
        y_test: the true labels
        y_pred: the predicted labels
    Return:
        result (DataFrame): the dataframe of scores
    """
    # Evaluate the performance using sklearn library
    report = classification_report(y_test, y_pred, output_dict=True)
    report_df = pd.DataFrame.from_dict(report).transpose()
    
    idx_first = ['macro avg']
    
    # Re-order to show f1 macro first
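    # .loc replaces the deprecated .ix; 'accuracy' is the summary row emitted by newer scikit-learn versions
    df1 = report_df.loc[idx_first]
    idx_df2 = list(set(report_df.index)-set(idx_first + ['weighted avg', 'micro avg', 'accuracy']))
    df2 = report_df.loc[idx_df2].sort_values(by = ['f1-score'], ascending=False)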
    result = pd.concat([df1, df2]).transpose()
    
    return result
In [12]:
# Define a baseline model to fit, predict and evaluate the input data sets and models
def baseline_model(X_train, y_train, X_test, y_test, model):
    """ Return the evaluation of given models and given inputs.
    
    Parameters:
        X_train: features for training
        y_train: true labels for training
        X_test: features for testing/validation
        y_test: true labels for testing/validation
        model: the learning model
    Return:
        A dictionary of evaluation
    """
    start = time.time()
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    running_time = time.time()-start
    report = report_f1(y_test, y_pred)
    print("Time execution: ", running_time)
    return {'model': model, 'y_pred': y_pred, 'running_time': running_time, 'report': report}

A. The Baseline Model & Feature Importance

FIRST, for the baseline-model, we'll use RandomForestClassifier.

In [8]:
rf = RandomForestClassifier(n_estimators=100, random_state=42)
baseline_rf = baseline_model(X_train, y_train, X_val, y_val, rf)
baseline_rf['report']
Time execution:  321.334059715271
Out[8]:
macro avg badfocus (artefact) detritus silks Neoceratium Codonellopsis (Dictyocystidae) rods Thalassionema Copepoda nauplii (Crustacea) feces Dictyocysta Undellidae artefact Tintinnidiidae Codonaria Coscinodiscids Protoperidinium Chaetoceros pollen Rhabdonella Rhizosolenids egg (other) Cyttarocylis Pleurosigma Annelida multiple (other) Hemiaulus chainlarge Retaria Stenosemella tempChaetoceros danicus centric Bacteriastrum Xystonellidae Dinophysiales Lithodesmioides Odontella (Mediophyceae) Asterionellopsis Ceratocorys horrida
f1-score 0.441259 0.952245 0.891566 0.891512 0.867442 0.863462 0.854265 0.850056 0.789233 0.776889 0.765073 0.744828 0.733032 0.702194 0.663793 0.626609 0.613636 0.582897 0.578352 0.553030 0.550459 0.481605 0.474820 0.347826 0.238095 0.230769 0.192256 0.165517 0.104478 0.090909 0.032258 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
precision 0.671722 0.981631 0.840651 0.876786 0.883793 0.916327 0.879668 0.887059 0.774676 0.785971 0.811862 0.964286 0.910112 0.800000 0.875000 0.839080 0.870968 0.917582 0.788546 0.784946 0.857143 0.804469 0.891892 1.000000 1.000000 1.000000 0.553846 0.923077 0.777778 1.000000 1.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
recall 0.382860 0.924567 0.949047 0.906741 0.851685 0.816364 0.830287 0.816017 0.804348 0.768014 0.723384 0.606742 0.613636 0.625698 0.534722 0.500000 0.473684 0.427110 0.456633 0.426901 0.405405 0.343675 0.323529 0.210526 0.135135 0.130435 0.116317 0.090909 0.056000 0.047619 0.016393 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
support 45489.000000 1445.000000 25926.000000 1083.000000 2670.000000 550.000000 766.000000 924.000000 966.000000 1707.000000 5166.000000 89.000000 132.000000 358.000000 432.000000 146.000000 57.000000 391.000000 392.000000 342.000000 74.000000 419.000000 102.000000 19.000000 37.000000 92.000000 619.000000 132.000000 125.000000 42.000000 61.000000 10.0 20.0 2.0 5.0 86.0 12.0 28.0 26.0 36.0

COMMENT-1.

  • The macro-averaged F1-score of the baseline model is 0.441259.
  • The best-performing classes include badfocus (artefact), silks, detritus, Codonellopsis (Dictyocystidae), Neoceratium, Thalassionema and rods. These are major classes and/or classes with distinctive shapes in the images.

  • The worst-performing classes are the minor ones, which have too few instances to be classified reliably.

COMMENT-2.

  • The highest score belongs to the class badfocus (artefact). In the training set, this class has 4416 instances while the biggest class has 77812 instances.
  • So more data does not necessarily mean a better score; we will take this into account when choosing the number of instances for resampling.

NEXT:

  • As stated before, the features used in this dataset are somewhat based on expert knowledge, so feature selection is not really necessary.
  • However, we take the baseline model as an opportunity to compute the importance of each feature and verify that conclusion.
In [9]:
feature_importance = baseline_rf['model'].feature_importances_
plt.bar(range(0, len(feature_importance)), sorted(feature_importance, reverse=True));

COMMENT.

The feature importances decrease gradually in this graph. There are only four features which have very low importances. Therefore, we keep all the features for learning models.
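
The bar chart shows only sorted values, so it does not tell which features are the weak ones. A possible way to list them by name is sketched below. The helper `low_importance_features`, the importance values and the threshold are all hypothetical; in the notebook the inputs would come from `X_train.columns` and `baseline_rf['model'].feature_importances_`.

```python
def low_importance_features(names, importances, threshold):
    """Return the names of features whose importance is below `threshold`, lowest first."""
    ranked = sorted(zip(names, importances), key=lambda pair: pair[1])
    return [name for name, score in ranked if score < threshold]

# Made-up importances for four real feature names, for illustration only
names = ["cv", "mean", "centroids", "angle"]
importances = [0.40, 0.35, 0.20, 0.05]
weak = low_importance_features(names, importances, threshold=0.10)
print(weak)
```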

In [13]:
# A helper function to resample several classes
def resampling_multiclass(X_train, y_train, alg, classes):
    """ Resample the given classes while keeping the other classes unchanged
    Parameters:
        X_train: Features for training
        y_train: Labels for training
        alg: A method for resampling
        classes: The classes to resample
    Returns: 
        X_train_new: Features after resampling
        y_train_new: Labels after resampling
    """
    
    # Get row indices of given classes
    idx_g1 = y_train[y_train.isin(classes)].index
    # Get row indices of the remaining classes
    idx_g2 = y_train[~y_train.isin(classes)].index
    
    # Get the features and labels of the given classes (label-based .loc replaces the deprecated .ix)
    X_train_g1 = X_train.loc[idx_g1]
    y_train_g1 = y_train.loc[idx_g1]
    # Get the features and labels of the remaining classes
    X_train_g2 = X_train.loc[idx_g2]
    y_train_g2 = y_train.loc[idx_g2]
    
    # Resample the new set of the given classes
    X_res, y_res = alg.fit_resample(X_train_g1, y_train_g1)
    
    # Concatenate the resampled sets and non-resample sets
    X_train_new = np.concatenate([X_res, X_train_g2])
    y_train_new = np.concatenate([y_res, y_train_g2])
    
    return X_train_new, y_train_new

FIRST is oversampling for infrequent classes.

  • Here we use SMOTE (Synthetic Minority Oversampling Technique), a very popular oversampling method that was proposed to improve on random oversampling.

  • About SMOTE: rather than replicating the minority observations, SMOTE creates synthetic observations by interpolating between a minority sample and one of its nearest minority-class neighbours. The basic implementation of SMOTE does not make any distinction between easy and hard samples to be classified using the nearest neighbors rule.

  • We also use another SMOTE variant, Borderline-SMOTE, which concentrates on the samples that lie on the border between two classes.
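
The interpolation step at the heart of SMOTE can be sketched in a few lines. This is only an illustration of the idea, not the imbalanced-learn implementation: the helper `smote_point` is hypothetical, and the neighbour is given directly instead of being drawn from the k nearest minority neighbours.

```python
import random

def smote_point(x, neighbor, rng):
    """Create one synthetic sample on the segment between x and its neighbour."""
    lam = rng.random()  # interpolation factor in [0, 1)
    return [xi + lam * (ni - xi) for xi, ni in zip(x, neighbor)]

rng = random.Random(42)
x = [1.0, 2.0]          # a minority sample
neighbor = [3.0, 6.0]   # one of its minority-class neighbours
synthetic = smote_point(x, neighbor, rng)
print(synthetic)
```

The synthetic point always lies between the two real samples, which is why SMOTE adds variety instead of exact duplicates.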

In [28]:
sm = SMOTE(random_state=42)
X_train_sm, y_train_sm = resampling_multiclass(X_train, y_train, sm, ['multiple (other)', 'badfocus (artefact)', 'silks', 'Copepoda', 'Thalassionema', 'rods', 'Codonellopsis (Dictyocystidae)', 'Protoperidinium', 'Tintinnidiidae', 'Rhizosolenids', 'Chaetoceros', 'artefact', 'pollen', 'Codonaria', 'chainlarge', 'Undellidae', 'Hemiaulus', 'egg (other)', 'Dinophysiales', 'Dictyocysta', 'Annelida', 'Stenosemella', 'Rhabdonella', 'Coscinodiscids', 'Retaria', 'Pleurosigma', 'Ceratocorys horrida', 'centric', 'Odontella (Mediophyceae)', 'Asterionellopsis', 'Cyttarocylis', 'Lithodesmioides', 'tempChaetoceros danicus', 'Xystonellidae', 'Bacteriastrum'])
baseline_model(X_train_sm, y_train_sm, X_val, y_val, rf)['report']
Time execution:  642.2512774467468
Out[28]:
macro avg badfocus (artefact) detritus silks Codonellopsis (Dictyocystidae) Neoceratium Thalassionema rods Copepoda Undellidae Dictyocysta nauplii (Crustacea) feces artefact Coscinodiscids Protoperidinium Codonaria Tintinnidiidae Chaetoceros Cyttarocylis Rhabdonella pollen Rhizosolenids Hemiaulus Pleurosigma egg (other) Odontella (Mediophyceae) Annelida Retaria chainlarge multiple (other) Stenosemella Asterionellopsis Dinophysiales Ceratocorys horrida centric Lithodesmioides Xystonellidae Bacteriastrum tempChaetoceros danicus
f1-score 0.543803 0.942919 0.891501 0.889192 0.867424 0.864217 0.847420 0.833872 0.800995 0.783582 0.780488 0.776729 0.750742 0.743949 0.721311 0.684211 0.682432 0.665786 0.643045 0.638298 0.627451 0.607251 0.577957 0.572687 0.507042 0.504854 0.500000 0.457143 0.447761 0.377193 0.328032 0.314050 0.210526 0.190476 0.177778 0.0 0.0 0.0 0.0 0.0
precision 0.590710 0.986395 0.859562 0.874888 0.905138 0.893600 0.859688 0.825864 0.771073 0.772059 0.853333 0.809010 0.830361 0.683841 0.676923 0.798635 0.673333 0.775385 0.662162 0.535714 0.607595 0.628125 0.661538 0.684211 0.529412 0.500000 0.833333 0.481928 0.600000 0.417476 0.426357 0.316667 0.333333 0.526316 0.444444 0.0 0.0 0.0 0.0 0.0
recall 0.523744 0.903114 0.925904 0.903970 0.832727 0.836704 0.835498 0.842037 0.833333 0.795455 0.719101 0.746924 0.685056 0.815642 0.771930 0.598465 0.691781 0.583333 0.625000 0.789474 0.648649 0.587719 0.513126 0.492424 0.486486 0.509804 0.357143 0.434783 0.357143 0.344000 0.266559 0.311475 0.153846 0.116279 0.111111 0.0 0.0 0.0 0.0 0.0
support 45489.000000 1445.000000 25926.000000 1083.000000 550.000000 2670.000000 924.000000 766.000000 966.000000 132.000000 89.000000 1707.000000 5166.000000 358.000000 57.000000 391.000000 146.000000 432.000000 392.000000 19.000000 74.000000 342.000000 419.000000 132.000000 37.000000 102.000000 28.000000 92.000000 42.000000 125.000000 619.000000 61.000000 26.000000 86.000000 36.000000 20.0 12.0 5.0 2.0 10.0
In [29]:
bsm = BorderlineSMOTE(random_state=42)
X_train_bsm, y_train_bsm = resampling_multiclass(X_train, y_train, bsm, ['multiple (other)', 'badfocus (artefact)', 'silks', 'Copepoda', 'Thalassionema', 'rods', 'Codonellopsis (Dictyocystidae)', 'Protoperidinium', 'Tintinnidiidae', 'Rhizosolenids', 'Chaetoceros', 'artefact', 'pollen', 'Codonaria', 'chainlarge', 'Undellidae', 'Hemiaulus', 'egg (other)', 'Dinophysiales', 'Dictyocysta', 'Annelida', 'Stenosemella', 'Rhabdonella', 'Coscinodiscids', 'Retaria', 'Pleurosigma', 'Ceratocorys horrida', 'centric', 'Odontella (Mediophyceae)', 'Asterionellopsis', 'Cyttarocylis', 'Lithodesmioides', 'tempChaetoceros danicus', 'Xystonellidae', 'Bacteriastrum'])
baseline_model(X_train_bsm, y_train_bsm, X_val, y_val, rf)['report']
Time execution:  599.5486073493958
Out[29]:
macro avg badfocus (artefact) detritus silks Neoceratium Codonellopsis (Dictyocystidae) Thalassionema rods Undellidae Dictyocysta Copepoda Coscinodiscids nauplii (Crustacea) feces artefact Codonaria Protoperidinium Tintinnidiidae Chaetoceros Rhabdonella pollen Cyttarocylis Rhizosolenids egg (other) Pleurosigma Hemiaulus Annelida chainlarge Retaria multiple (other) Stenosemella Ceratocorys horrida Dinophysiales Odontella (Mediophyceae) Asterionellopsis Lithodesmioides centric Xystonellidae Bacteriastrum tempChaetoceros danicus
f1-score 0.517850 0.942405 0.891053 0.883574 0.869834 0.857949 0.854673 0.831938 0.806084 0.800000 0.788900 0.774775 0.770770 0.751940 0.736424 0.685315 0.682284 0.657682 0.654746 0.651515 0.605067 0.571429 0.558704 0.511628 0.444444 0.434343 0.361111 0.350000 0.346154 0.327902 0.303030 0.150000 0.141414 0.125000 0.074074 0.0 0.0 0.0 0.0 0.0
precision 0.629514 0.981995 0.855374 0.864078 0.897927 0.888889 0.853290 0.820839 0.809160 0.901408 0.756654 0.796296 0.798869 0.834159 0.700252 0.700000 0.797945 0.787097 0.710448 0.741379 0.617021 0.625000 0.642857 0.486726 0.538462 0.651515 0.500000 0.466667 0.900000 0.443526 0.394737 0.750000 0.538462 0.500000 1.000000 0.0 0.0 0.0 0.0 0.0
recall 0.483403 0.905882 0.929839 0.903970 0.843446 0.829091 0.856061 0.843342 0.803030 0.719101 0.824017 0.754386 0.744581 0.684475 0.776536 0.671233 0.595908 0.564815 0.607143 0.581081 0.593567 0.526316 0.494033 0.539216 0.378378 0.325758 0.282609 0.280000 0.214286 0.260097 0.245902 0.083333 0.081395 0.071429 0.038462 0.0 0.0 0.0 0.0 0.0
support 45489.000000 1445.000000 25926.000000 1083.000000 2670.000000 550.000000 924.000000 766.000000 132.000000 89.000000 966.000000 57.000000 1707.000000 5166.000000 358.000000 146.000000 391.000000 432.000000 392.000000 74.000000 342.000000 19.000000 419.000000 102.000000 37.000000 132.000000 92.000000 125.000000 42.000000 619.000000 61.000000 36.000000 86.000000 28.000000 26.000000 12.0 20.0 5.0 2.0 10.0

COMMENT.

Based on the results of the baseline model, we varied the number of classes to oversample and chose to oversample 33 minor classes, each up to 4416 instances. We chose this number because it is not too large yet is enough to reach the highest per-class F1-score: in the baseline model, badfocus (artefact) has 4416 instances in the training set and obtains the highest F1-score of any class (0.952245).

  1. Results when oversampling:

    • Using SMOTE:
      • F1-score macro: 0.543803
      • Running time: 642 seconds
    • Using Borderline-SMOTE:
      • F1-score macro: 0.517850
      • Running time: 599 seconds
  2. Compared to the baseline model, the model trained on oversampled data gives a better score: the macro F1-score rises from 44.13% to 54.38%, an increase of about 10.25 percentage points. This is a significant improvement.

  3. Compared to Borderline-SMOTE, SMOTE runs a bit slower but gives a better score (54.38% versus about 51.8%). We prioritise the score; therefore, we choose SMOTE as the oversampling technique in this project.

  4. Further work: as the two output tables above show, SMOTE and Borderline-SMOTE improve different classes. We could therefore combine them, choosing the oversampler per class.
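
As a starting point for that further work, the per-class F1-scores of the two runs can be compared to decide which oversampler to use for each class. The sketch below uses four values copied from the two output tables above; the helper `pick_oversampler` is hypothetical. Note that selecting per class on the validation set carries a risk of over-fitting to that set.

```python
# Per-class F1-scores taken from the SMOTE and Borderline-SMOTE tables above
smote_f1 = {"Copepoda": 0.800995, "Undellidae": 0.783582,
            "Dictyocysta": 0.780488, "Coscinodiscids": 0.721311}
bsmote_f1 = {"Copepoda": 0.788900, "Undellidae": 0.806084,
             "Dictyocysta": 0.800000, "Coscinodiscids": 0.774775}

def pick_oversampler(f1_a, f1_b, name_a="SMOTE", name_b="BorderlineSMOTE"):
    """For every class, keep the name of the method with the higher F1-score."""
    return {cls: (name_a if f1_a[cls] >= f1_b[cls] else name_b) for cls in f1_a}

choice = pick_oversampler(smote_f1, bsmote_f1)
for cls, method in sorted(choice.items()):
    print(cls, "->", method)
```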

NEXT is under-sampling.

These are controlled under-sampling methods: we can specify the expected number of instances after undersampling. After oversampling, we use RandomUnderSampler and NearMiss-1 to undersample the major classes.

  • RandomUnderSampler: it undersamples the majority classes by randomly picking samples.
  • NearMiss-1: it selects the samples from the majority class for which the average distance to the k nearest samples of the minority class is the smallest.
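
The NearMiss-1 selection rule can be illustrated with a small dependency-free sketch. The helpers `euclidean` and `near_miss_1` are hypothetical and written for clarity rather than speed; the toy points are made up.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two points given as lists."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def near_miss_1(majority, minority, n_keep, k=3):
    """Keep the n_keep majority samples closest, on average, to their k nearest minority samples."""
    def mean_dist_to_k_nearest(point):
        nearest = sorted(euclidean(point, m) for m in minority)[:k]
        return sum(nearest) / len(nearest)
    return sorted(majority, key=mean_dist_to_k_nearest)[:n_keep]

minority = [[0.0], [1.0]]
majority = [[0.5], [5.0], [10.0], [2.0]]
kept = near_miss_1(majority, minority, n_keep=2, k=2)
print(kept)
```

The rule keeps the majority samples that sit closest to the minority class. These are also the hardest samples to separate, which may partly explain why precision drops after this kind of undersampling.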
In [30]:
nm = NearMiss(version=1)  # NearMiss-1; the method is deterministic, so the deprecated random_state is dropped
X_train_nm, y_train_nm = resampling_multiclass(pd.DataFrame(X_train_sm), pd.DataFrame(y_train_sm)[0], nm, ['detritus', 'feces'])
baseline_model(X_train_nm, y_train_nm, X_val, y_val, rf)['report']
Time execution:  460.01442432403564
Out[30]:
macro avg Codonellopsis (Dictyocystidae) silks badfocus (artefact) rods Dictyocysta Copepoda Neoceratium Thalassionema Protoperidinium Codonaria Coscinodiscids nauplii (Crustacea) feces Undellidae pollen Cyttarocylis Rhabdonella artefact detritus Tintinnidiidae Pleurosigma Stenosemella Rhizosolenids egg (other) Chaetoceros Annelida Asterionellopsis Hemiaulus multiple (other) chainlarge Ceratocorys horrida Odontella (Mediophyceae) Retaria Dinophysiales centric Lithodesmioides Xystonellidae Bacteriastrum tempChaetoceros danicus
f1-score 0.416348 0.856364 0.826446 0.814552 0.780069 0.777143 0.755697 0.747564 0.710218 0.694938 0.686469 0.682540 0.662711 0.662555 0.642458 0.608828 0.607143 0.447059 0.446154 0.434362 0.365145 0.355932 0.344828 0.329569 0.315522 0.308059 0.273632 0.264706 0.215812 0.210463 0.117123 0.097421 0.074434 0.068643 0.042484 0.010526 0.0 0.0 0.0 0.0
precision 0.366061 0.856364 0.747943 0.711260 0.694898 0.790698 0.675081 0.629810 0.579235 0.747059 0.662420 0.623188 0.557063 0.581736 0.508850 0.634921 0.459459 0.314917 0.297575 0.854636 0.245418 0.259259 0.363636 0.209941 0.213058 0.189443 0.177419 0.214286 0.125622 0.129127 0.063624 0.054313 0.038983 0.036728 0.022847 0.005556 0.0 0.0 0.0 0.0
recall 0.618664 0.856364 0.923361 0.952941 0.889034 0.764045 0.858178 0.919476 0.917749 0.649616 0.712329 0.754386 0.817809 0.769454 0.871212 0.584795 0.894737 0.770270 0.891061 0.291175 0.712963 0.567568 0.327869 0.766110 0.607843 0.823980 0.597826 0.346154 0.765152 0.568659 0.736000 0.472222 0.821429 0.523810 0.302326 0.100000 0.0 0.0 0.0 0.0
support 45489.000000 550.000000 1083.000000 1445.000000 766.000000 89.000000 966.000000 2670.000000 924.000000 391.000000 146.000000 57.000000 1707.000000 5166.000000 132.000000 342.000000 19.000000 74.000000 358.000000 25926.000000 432.000000 37.000000 61.000000 419.000000 102.000000 392.000000 92.000000 26.000000 132.000000 619.000000 125.000000 36.000000 28.000000 42.000000 86.000000 20.000000 12.0 5.0 2.0 10.0
In [31]:
rus = RandomUnderSampler(random_state=42)
X_train_rus, y_train_rus = resampling_multiclass(pd.DataFrame(X_train_sm), pd.DataFrame(y_train_sm)[0], rus, ['detritus', 'feces'])
baseline_model(X_train_rus, y_train_rus, X_val, y_val, rf)['report']
Time execution:  457.41386675834656
Out[31]:
macro avg badfocus (artefact) Codonellopsis (Dictyocystidae) silks Neoceratium detritus Thalassionema rods Copepoda nauplii (Crustacea) feces Undellidae Dictyocysta Tintinnidiidae artefact Codonaria Coscinodiscids Protoperidinium pollen Rhabdonella Chaetoceros Odontella (Mediophyceae) Cyttarocylis Bacteriastrum Rhizosolenids Annelida Pleurosigma Hemiaulus egg (other) multiple (other) Retaria chainlarge Asterionellopsis Stenosemella Dinophysiales Ceratocorys horrida centric Lithodesmioides Xystonellidae tempChaetoceros danicus
f1-score 0.513717 0.942120 0.850492 0.840470 0.828552 0.800730 0.795090 0.792562 0.762212 0.746308 0.719067 0.664773 0.654709 0.638158 0.619946 0.610778 0.608108 0.601554 0.564345 0.557214 0.555844 0.551724 0.548387 0.5 0.498366 0.444444 0.441860 0.429630 0.380403 0.344482 0.325581 0.310345 0.271186 0.236364 0.218750 0.195804 0.184615 0.0 0.0 0.0
precision 0.446015 0.949403 0.837743 0.770593 0.761456 0.963540 0.705193 0.714136 0.686877 0.678657 0.606161 0.531818 0.544776 0.606250 0.456954 0.542553 0.494505 0.531373 0.473267 0.440945 0.420708 0.533333 0.395349 0.5 0.378882 0.366197 0.387755 0.318681 0.269388 0.262979 0.241379 0.204030 0.242424 0.163522 0.149573 0.130841 0.133333 0.0 0.0 0.0
recall 0.642110 0.934948 0.863636 0.924284 0.908614 0.684988 0.911255 0.890339 0.856108 0.828940 0.883662 0.886364 0.820225 0.673611 0.963687 0.698630 0.789474 0.693095 0.698830 0.756757 0.818878 0.571429 0.894737 0.5 0.727924 0.565217 0.513514 0.659091 0.647059 0.499192 0.500000 0.648000 0.307692 0.426230 0.406977 0.388889 0.300000 0.0 0.0 0.0
support 45489.000000 1445.000000 550.000000 1083.000000 2670.000000 25926.000000 924.000000 766.000000 966.000000 1707.000000 5166.000000 132.000000 89.000000 432.000000 358.000000 146.000000 57.000000 391.000000 342.000000 74.000000 392.000000 28.000000 19.000000 2.0 419.000000 92.000000 37.000000 132.000000 102.000000 619.000000 42.000000 125.000000 26.000000 61.000000 86.000000 36.000000 20.000000 12.0 5.0 10.0

COMMENT.

As can be seen from the results above, the undersampling techniques do not improve the macro F1-score (0.416348 with NearMiss and 0.513717 with RandomUnderSampler, versus 0.543803 with SMOTE alone). Therefore, we skip this step and do not use it in the final models.

FINALLY, we use TomekLinks, which is a cleaning undersampling method: the number of samples to select does not need to be specified.

  • TomekLinks: it removes the samples considered noisy, namely those that form Tomek links (pairs of mutual nearest neighbours that belong to different classes).
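
A Tomek link is a pair of samples from different classes that are each other's nearest neighbour. The sketch below detects such pairs on toy data; the helpers `nearest_index` and `tomek_links` are hypothetical, and the brute-force search is for illustration only.

```python
def nearest_index(points, i):
    """Index of the nearest neighbour of points[i] (squared Euclidean distance)."""
    best, best_dist = None, float("inf")
    for j, other in enumerate(points):
        if j == i:
            continue
        dist = sum((a - b) ** 2 for a, b in zip(points[i], other))
        if dist < best_dist:
            best, best_dist = j, dist
    return best

def tomek_links(points, labels):
    """Return index pairs (i, j) that are mutual nearest neighbours with different labels."""
    nn = [nearest_index(points, i) for i in range(len(points))]
    return [(i, j) for i, j in enumerate(nn)
            if i < j and nn[j] == i and labels[i] != labels[j]]

points = [[0.0], [0.1], [1.0], [1.05], [3.0]]
labels = ["a", "a", "a", "b", "b"]
links = tomek_links(points, labels)
print(links)
```

Samples 0 and 1 are mutual nearest neighbours but share a class, so only the pair at indices 2 and 3 is reported. As we understand it, `sampling_strategy='all'` removes both members of each link rather than only the majority-class one.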
In [32]:
# Undersampling using TomekLinks
tl = TomekLinks(sampling_strategy='all')  # deterministic cleaning, so the deprecated random_state is dropped
X_train_tl, y_train_tl = tl.fit_resample(X_train_sm, y_train_sm)
baseline_model(X_train_tl, y_train_tl, X_val, y_val, rf)['report']
Time execution:  585.842173576355
Out[32]:
macro avg badfocus (artefact) detritus silks Codonellopsis (Dictyocystidae) Neoceratium Thalassionema rods Undellidae Copepoda Dictyocysta nauplii (Crustacea) artefact Coscinodiscids feces Protoperidinium Tintinnidiidae Codonaria Rhabdonella Chaetoceros Cyttarocylis pollen Rhizosolenids Hemiaulus egg (other) Odontella (Mediophyceae) Pleurosigma Retaria Annelida chainlarge multiple (other) Ceratocorys horrida Stenosemella Asterionellopsis Dinophysiales tempChaetoceros danicus centric Xystonellidae Lithodesmioides Bacteriastrum
f1-score 0.546899 0.939592 0.887125 0.880726 0.864198 0.862745 0.844469 0.823306 0.808824 0.789916 0.768293 0.760572 0.730038 0.728814 0.728658 0.694051 0.684967 0.679868 0.670968 0.649351 0.640000 0.603221 0.580052 0.540773 0.515837 0.487805 0.461538 0.447761 0.433498 0.413502 0.326000 0.297872 0.256000 0.210526 0.184874 0.133333 0.0 0.0 0.0 0.0
precision 0.584877 0.990790 0.850520 0.865419 0.904573 0.895607 0.854767 0.799508 0.785714 0.755913 0.840000 0.831711 0.668213 0.704918 0.848290 0.777778 0.786787 0.656051 0.641975 0.661376 0.516129 0.604106 0.644315 0.623762 0.478992 0.769231 0.439024 0.600000 0.396396 0.437500 0.427822 0.636364 0.250000 0.333333 0.333333 0.200000 0.0 0.0 0.0 0.0
recall 0.532968 0.893426 0.927023 0.896584 0.827273 0.832210 0.834416 0.848564 0.833333 0.827122 0.707865 0.700644 0.804469 0.754386 0.638599 0.626598 0.606481 0.705479 0.702703 0.637755 0.842105 0.602339 0.527446 0.477273 0.558824 0.357143 0.486486 0.357143 0.478261 0.392000 0.263328 0.194444 0.262295 0.153846 0.127907 0.100000 0.0 0.0 0.0 0.0
support 45489.000000 1445.000000 25926.000000 1083.000000 550.000000 2670.000000 924.000000 766.000000 132.000000 966.000000 89.000000 1707.000000 358.000000 57.000000 5166.000000 391.000000 432.000000 146.000000 74.000000 392.000000 19.000000 342.000000 419.000000 132.000000 102.000000 28.000000 37.000000 42.000000 92.000000 125.000000 619.000000 36.000000 61.000000 26.000000 86.000000 10.000000 20.0 5.0 12.0 2.0

COMMENT.

The TomekLinks technique slightly improves the score of the model trained on resampled data: it increases the macro F1-score from 54.38% (after SMOTE) to 54.69% (after SMOTE followed by TomekLinks).

SUMMARY for RESAMPLING DATA

Based on the results obtained by resampling the training set and evaluating on the validation set, our training data will be oversampled using SMOTE and then cleaned using TomekLinks.


3. Model Selection

We do not use cross-validation to measure the performance of the models because the training set was resampled before any fold split: synthetic samples derived from a hold-out fold would end up in the training folds, so the hold-out data would effectively be seen during training and the scores would be over-optimistic.

Instead, we train the models on the training set (which was resampled) and evaluate them on the validation set (which is not resampled).
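
For reference, cross-validation would be safe if the resampling were applied inside each fold, on the training part only. The sketch below shows that structure with plain Python stand-ins: `kfold_indices`, `cross_validate_with_resampling`, `duplicate_all` and `count_leaks` are all hypothetical helpers. As far as we know, imbalanced-learn's own `Pipeline` class follows the same principle by applying samplers only when fitting.

```python
def kfold_indices(n_samples, n_folds):
    """Yield (train_indices, held_out_indices) for a simple interleaved k-fold."""
    folds = [list(range(i, n_samples, n_folds)) for i in range(n_folds)]
    for held_out in folds:
        held = set(held_out)
        yield [i for i in range(n_samples) if i not in held], held_out

def cross_validate_with_resampling(X, y, n_folds, resample, fit_and_score):
    """Resample only the training part of every fold, then score on the untouched part."""
    scores = []
    for train_idx, test_idx in kfold_indices(len(X), n_folds):
        X_tr, y_tr = [X[i] for i in train_idx], [y[i] for i in train_idx]
        X_res, y_res = resample(X_tr, y_tr)  # never sees the held-out fold
        X_te, y_te = [X[i] for i in test_idx], [y[i] for i in test_idx]
        scores.append(fit_and_score(X_res, y_res, X_te, y_te))
    return scores

def duplicate_all(X_tr, y_tr):
    """Stand-in resampler: naive oversampling by duplication."""
    return X_tr + X_tr, y_tr + y_tr

def count_leaks(X_res, y_res, X_te, y_te):
    """Stand-in scorer: number of held-out samples present in the training data."""
    return len(set(X_res) & set(X_te))

X = list(range(10))          # sample identifiers
y = [0] * 8 + [1] * 2
leaks = cross_validate_with_resampling(X, y, 5, duplicate_all, count_leaks)
print(leaks)
```

With this structure no held-out sample reaches the resampler, at the cost of running the resampling once per fold.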

Here we use RandomForestClassifier and XGBClassifier for model selection.

In [15]:
rf = RandomForestClassifier(n_estimators=100, random_state=42)
model_rf = baseline_model(X_train_tl, y_train_tl, X_val, y_val, rf)
model_rf['report']
Time execution:  500.11835741996765
Out[15]:
macro avg badfocus (artefact) detritus silks Codonellopsis (Dictyocystidae) Neoceratium Thalassionema rods Undellidae Copepoda Dictyocysta nauplii (Crustacea) artefact Coscinodiscids feces Protoperidinium Tintinnidiidae Codonaria Rhabdonella Chaetoceros Cyttarocylis pollen Rhizosolenids Hemiaulus egg (other) Odontella (Mediophyceae) Pleurosigma Retaria Annelida chainlarge multiple (other) Ceratocorys horrida Stenosemella Asterionellopsis Dinophysiales tempChaetoceros danicus Bacteriastrum Lithodesmioides centric Xystonellidae
f1-score 0.546899 0.939592 0.887125 0.880726 0.864198 0.862745 0.844469 0.823306 0.808824 0.789916 0.768293 0.760572 0.730038 0.728814 0.728658 0.694051 0.684967 0.679868 0.670968 0.649351 0.640000 0.603221 0.580052 0.540773 0.515837 0.487805 0.461538 0.447761 0.433498 0.413502 0.326000 0.297872 0.256000 0.210526 0.184874 0.133333 0.0 0.0 0.0 0.0
precision 0.584877 0.990790 0.850520 0.865419 0.904573 0.895607 0.854767 0.799508 0.785714 0.755913 0.840000 0.831711 0.668213 0.704918 0.848290 0.777778 0.786787 0.656051 0.641975 0.661376 0.516129 0.604106 0.644315 0.623762 0.478992 0.769231 0.439024 0.600000 0.396396 0.437500 0.427822 0.636364 0.250000 0.333333 0.333333 0.200000 0.0 0.0 0.0 0.0
recall 0.532968 0.893426 0.927023 0.896584 0.827273 0.832210 0.834416 0.848564 0.833333 0.827122 0.707865 0.700644 0.804469 0.754386 0.638599 0.626598 0.606481 0.705479 0.702703 0.637755 0.842105 0.602339 0.527446 0.477273 0.558824 0.357143 0.486486 0.357143 0.478261 0.392000 0.263328 0.194444 0.262295 0.153846 0.127907 0.100000 0.0 0.0 0.0 0.0
support 45489.000000 1445.000000 25926.000000 1083.000000 550.000000 2670.000000 924.000000 766.000000 132.000000 966.000000 89.000000 1707.000000 358.000000 57.000000 5166.000000 391.000000 432.000000 146.000000 74.000000 392.000000 19.000000 342.000000 419.000000 132.000000 102.000000 28.000000 37.000000 42.000000 92.000000 125.000000 619.000000 36.000000 61.000000 26.000000 86.000000 10.000000 2.0 12.0 20.0 5.0
In [26]:
xgb = XGBClassifier(n_estimators=100, seed=42)
model_xgb = baseline_model(X_train_tl, y_train_tl, X_val.values, y_val, xgb)  # .values replaces the deprecated .as_matrix()
model_xgb['report']
Time execution:  3734.2902903556824
Out[26]:
macro avg badfocus (artefact) detritus silks Codonellopsis (Dictyocystidae) Thalassionema Neoceratium rods Copepoda Dictyocysta Undellidae nauplii (Crustacea) Codonaria feces artefact Protoperidinium Coscinodiscids Tintinnidiidae Chaetoceros pollen Rhabdonella Rhizosolenids Hemiaulus egg (other) Odontella (Mediophyceae) chainlarge Annelida multiple (other) Retaria Cyttarocylis Pleurosigma Stenosemella Asterionellopsis centric Ceratocorys horrida Dinophysiales Lithodesmioides tempChaetoceros danicus Bacteriastrum Xystonellidae
f1-score 0.450603 0.939698 0.846329 0.816473 0.805447 0.787464 0.773577 0.757021 0.728335 0.673575 0.666667 0.648128 0.642384 0.628564 0.619145 0.587396 0.586667 0.571081 0.545718 0.526829 0.487805 0.468371 0.421456 0.412281 0.311111 0.310559 0.269231 0.257117 0.250000 0.241135 0.239130 0.207951 0.192308 0.147059 0.081448 0.080645 0.025000 0.020408 0.0 0.0
precision 0.435318 0.976137 0.818742 0.792354 0.866109 0.831528 0.825127 0.711009 0.687500 0.625000 0.552239 0.827869 0.621795 0.848355 0.487179 0.548889 0.473118 0.601023 0.632997 0.451883 0.381679 0.537037 0.426357 0.373016 0.225806 0.253807 0.208333 0.297872 0.239130 0.139344 0.149660 0.127820 0.128205 0.104167 0.048649 0.131579 0.014706 0.011364 0.0 0.0
recall 0.526186 0.905882 0.875839 0.842105 0.752727 0.747835 0.728090 0.809399 0.774327 0.730337 0.840909 0.532513 0.664384 0.499226 0.849162 0.631714 0.771930 0.543981 0.479592 0.631579 0.675676 0.415274 0.416667 0.460784 0.500000 0.400000 0.380435 0.226171 0.261905 0.894737 0.594595 0.557377 0.384615 0.250000 0.250000 0.058140 0.083333 0.100000 0.0 0.0
support 45489.000000 1445.000000 25926.000000 1083.000000 550.000000 924.000000 2670.000000 766.000000 966.000000 89.000000 132.000000 1707.000000 146.000000 5166.000000 358.000000 391.000000 57.000000 432.000000 392.000000 342.000000 74.000000 419.000000 132.000000 102.000000 28.000000 125.000000 92.000000 619.000000 42.000000 19.000000 37.000000 61.000000 26.000000 20.000000 36.000000 86.000000 12.000000 10.000000 2.0 5.0

COMMENT.

Results:

  • RandomForestClassifier:
    • F1-score macro: 0.546899
    • Running time: 500 seconds
  • XGBClassifier:
    • F1-score macro: 0.450603
    • Running time: 3734 seconds

RandomForestClassifier outperforms XGBClassifier in both score and running time. Random Forest is also easy to parallelise, because its trees are trained independently.

In conclusion, our final model is RandomForestClassifier.


4. Parameter Optimisation

The hyper-parameters of the final RandomForestClassifier are:

{bootstrap=True, class_weight=None, criterion='gini', max_depth=None, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=None, oob_score=False, random_state=42, verbose=0, warm_start=False }


5. Model Evaluation

In this section, we evaluate the final model using the test set.

In [15]:
final_model = RandomForestClassifier(n_estimators=100, random_state=42)
final_result = baseline_model(X_train_tl, y_train_tl, X_test, y_test, final_model)
final_result['report']['macro avg']
Time execution:  501.76478028297424
Out[15]:
f1-score         0.572838
precision        0.608164
recall           0.559696
support      60652.000000
Name: macro avg, dtype: float64

Our final macro-averaged F1-score on the held-out test set is 57.28%.

The time for training and prediction is 501 seconds.