!pip3 install --user imbalanced-learn
import warnings
warnings.filterwarnings('ignore')
import numpy as np
import time
import pandas as pd
pd.set_option('display.max_columns', 130)
from collections import Counter
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, cross_validate
from sklearn.feature_selection import RFECV
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.metrics import classification_report
from sklearn.metrics import make_scorer
from sklearn.metrics import f1_score
from imblearn.under_sampling import RandomUnderSampler, NearMiss, TomekLinks
from imblearn.over_sampling import SMOTE, BorderlineSMOTE
import zipfile
from io import BytesIO
from PIL import Image
from sklearn import preprocessing
import seaborn as sns
def extract_zip_to_memory(input_zip):
'''
This function extracts the images stored inside the given zip file.
It stores the result in a python dictionary.
input_zip (string): path to the zip file
returns (dict): {filename (string): image_file (bytes)}
'''
input_zip=zipfile.ZipFile(input_zip)
return {name: BytesIO(input_zip.read(name)) for name in input_zip.namelist() if name.endswith('.jpg')}
start = time.time()
img_files = extract_zip_to_memory("/mnt/datasets/plankton/flowcam/imgs.zip")
print("Extraction time: ", time.time()-start)
print("Number of images: %d" % len(img_files))
Comment. The dataset is fairly large (~240k images), compared to the Oregon State dataset (~30k images).
def view_images(keys):
""" Show images of planktons within the dataset,
given the images' names.
Parameters:
keys (string): Name of the images
Return:
None.
"""
for key in keys:
# Read the Image
img = imread(img_files[key])
# Show the Image
print(key)
plt.imshow(img, cmap=cm.gray)
plt.show()
FIRST, let's take a look at the images themselves.
from skimage.io import imread, imshow
from pylab import cm
import random
# Randomly selects 5 images
img_keys = list(img_files.keys())
rand_names = random.sample(img_keys, 5)
# Show the images
view_images(rand_names)
NEXT, let's inspect the type & depth of the images' pixels.
modes = [Image.open(img_files[k]).mode for k in img_files]
print(Counter(modes))
Comment.
Since all 243,610 images in this dataset are of mode L, we can conclude that the images are 8-bit greyscale.
After running this random-display process several times, we also see that the sizes (dimensions) of the images differ a lot; each image is cropped tightly around the plankton object.
The implication is that if one wants to use a CNN on this dataset, all images have to be resized to a common size, a very time-consuming process.
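For illustration only, here is a minimal sketch of what such resizing could look like with PIL; the 64x64 target size and the three sample images are arbitrary assumptions, and we do not use this anywhere else in the notebook.
# Illustration only: resize a few images to a common (arbitrary) 64x64 size.
sample_keys = list(img_files.keys())[:3]
resized = {k: Image.open(img_files[k]).resize((64, 64)) for k in sample_keys}
print({k: im.size for k, im in resized.items()})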
FINALLY, we plot the distribution of images' sizes to see if we can get some more insights.
# Get the dimensions of the image
dims = [Image.open(img_files[k]).size for k in img_files]
# Get the ratios of the image
wh_ratios = [(w / h) for w, h in dims]
widths = [w for w, _ in dims]
heights = [h for _, h in dims]
# Graph the distribution of width-height ratios
plt.figure()
r = sns.distplot(wh_ratios, bins=1000, kde=False)
r.set(xlim=(0, 5), title='width / height ratio')
# Graph the distribution of widths
plt.figure()
sns.distplot(widths).set_title('widths')
# Graph the distribution of heights
plt.figure()
sns.distplot(heights).set_title('heights')
COMMENT.
Looking at the width-height ratio graph and at the widths and heights graphs, we can see that widths and heights share a very similar, skewed distribution with modes around 100 pixels.
The overall implication is that image sizes and shapes vary a lot in this dataset. Scaling everything to the same size before applying a CNN would therefore distort a BIG proportion of the images and encode misleading (latent) signals. Unless we are very skilled at image processing (which we surely aren't), we should avoid the CNN path.
FIRST, we load the meta-data of the images.
meta = pd.read_csv('/mnt/datasets/plankton/flowcam/meta.csv')
meta.head(5)
COMMENT. A big surprise when we look at the labels of this dataset is that many of them aren't plankton in the common sense.
NEXT, we build the targets dataframe, which includes object-ids and their corresponding level-2 labels.
targets_df = meta[['objid','level2']].drop_duplicates()
targets_df = targets_df[targets_df['level2'].isna()!=True]
targets_df['objid'] = targets_df['objid'].astype('int64')
targets_df.head(5)
print("Number of records: %d" % targets_df.shape[0])
print("Number of classes: %d" % targets_df['level2'].nunique())
COMMENT. Here we build the target set as follows: keep only objid and level2, drop duplicated rows, and drop rows whose level2 label is missing.
After this process, 242,607 records are left, fewer than the 243,610 images in the dataset. Therefore, we need to make sure that every record in the target set has a corresponding image; otherwise we would train our models on objects that don't exist.
FINALLY, we plot the distribution of labels.
labels_freq = Counter(targets_df['level2'])
labels_freq_df = pd.DataFrame.from_dict(labels_freq, orient='index')\
.rename(columns={0: 'count'})\
.sort_values(by=['count'], ascending=False)
labels_freq_df.plot(kind='bar');
COMMENT. As speculated above, this dataset is highly imbalanced.
The biggest class has 77,812 instances (detritus, or dead bodies), while the smallest class has only 8 instances (Bacteriastrum, an actual plankton).
It is very difficult to get good performance if we leave the dataset as imbalanced as it is. Therefore, resampling techniques are needed, possibly over-sampling the minor classes.
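To put a number on this imbalance, here is a small optional sketch computing the ratio between the largest and smallest class counts, using the labels_freq_df built above:
# Optional: quantify the class imbalance as a max/min count ratio.
max_count = labels_freq_df['count'].max()
min_count = labels_freq_df['count'].min()
print("Largest class: %d, smallest class: %d, imbalance ratio: %.0f"
      % (max_count, min_count, max_count / min_count))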
FIRST, we encode the labels of our targets dataframe.
# Initialize the LabelEncoder
le = preprocessing.LabelEncoder()
le.fit(targets_df['level2'])
# Encode the labels, drop the original column
targets_df['target'] = le.transform(targets_df['level2'])
targets_df.drop(['level2'], axis=1, inplace=True)
targets_df.head(5)
NEXT, we ensure that all object-ids belong to the set of image-ids.
# Get the ID of all images
image_ids = [int(k.split('/')[1].split('.')[0]) for k in img_files]
image_ids_set = set(image_ids)
# Check if all object-ids are belong to image-ids
object_ids_set = set(targets_df['objid'])
object_ids_set.difference(image_ids_set)
COMMENT. Since all object-ids refer to images in our image-dataset, we can proceed with feature exploration.
WHY DO WE USE FEATURES INSTEAD OF IMAGES?
As discussed above, the images vary widely in size and shape, so an image-based (CNN) approach would require heavy, error-prone resizing. That's why we decided to use the precomputed feature sets instead (both of them, indeed).
TO BEGIN, we investigate the Native feature-set and see if there is anything we should do to preprocess the data.
# Load the Native Feature-set
features_native = pd.read_csv('/mnt/datasets/plankton/flowcam/features_native.csv.gz', compression='gzip', error_bad_lines=False)
features_native['objid'] = features_native['objid'].astype('int64')
We find all columns that contain at least one NaN.
features_native.loc[:, features_native.isna().any()][:10]
COMMENT. The missing values fall into two groups of columns; we inspect a few affected rows from each group.
# For the columns: 'perimareaexc', 'feretareaexc' & 'cdexc'
display(features_native[features_native['perimareaexc'].isnull()].loc[:, features_native.isna().any()][:10])
# For the columns: 'convarea_area', 'symetrieh_area', 'symetriev_area',
# 'nb1_area', 'nb2_area', 'nb3_area' & 'skeleton_area'
display(features_native[features_native['convarea_area'].isnull()].loc[:, features_native.isna().any()][:10])
COMMENT. We inspect the images of a few of the rows identified above:
# Get the obj-ids of object at index = 17, 22, 29
focus_ids = features_native.loc[[17, 22, 29], ['objid']].objid.tolist()
focus_keys = ["imgs/{}.jpg".format(id) for id in focus_ids]
# View their images
view_images(focus_keys)
# Get the obj-ids of object at index = 826, 842, 855
focus_ids = features_native.loc[[826, 842, 855], ['objid']].objid.tolist()
focus_keys = ["imgs/{}.jpg".format(id) for id in focus_ids]
# View their images
view_images(focus_keys)
COMMENT. In each case, we can see that the images are very similar, suggesting the missing values are systematic (evidence of absence) rather than random (absence of evidence). We therefore fill them with 0:
# Fill all missing values with 0
features_native.fillna(0, inplace=True)
NEXT, we investigate the skimage feature-set, for the purpose of data-preprocessing.
# Load the skimage's feature-set
features_skimage = pd.read_csv('/mnt/datasets/plankton/flowcam/features_skimage.csv.gz', compression='gzip', error_bad_lines=False)
features_skimage['objid'] = features_skimage['objid'].astype('int64')
# View ALL columns that contains NaN
features_skimage.loc[:10, features_skimage.isna().any()]
COMMENT. These columns are inherently flawed, as they contain mostly NaN. This makes sense because the features here are recomputed with skimage, a very general library, so it is normal that some features cannot be computed. Therefore, we remove all these mostly-empty columns.
# Drop the 'flawed' columns
features_skimage.dropna(axis='columns', inplace=True)
NOW, we'll merge the 2 feature-sets with the targets, using object-ids.
data = features_native.merge(features_skimage, on='objid', how='inner').merge(targets_df, on='objid', how='inner')
print("Data after joining features and target: ", data.shape)
data.head(10)
FINALLY, we measure the mutual-information between the merged features and the targets.
from sklearn.feature_selection import mutual_info_classif
# Get the Merged-Features
merged_features = data.drop(columns=["objid", "target"])
# Get Mutual Information between 'target' and each 'feature'
# Then turn it into a DataFrame, and sort descendingly
mutual_info = mutual_info_classif(merged_features, data['target'])
mutual_info_df = pd.DataFrame(data=mutual_info, index=merged_features.columns, columns=['MI'])
mutual_info_df.sort_values(by='MI', inplace=True, ascending=False)
To judge how informative these mutual-information values are, we compare them to the entropy of the target distribution.
from scipy.stats import entropy
targets_freq = data['target'].value_counts().tolist()
targets_entropy = entropy(targets_freq)
print(targets_entropy)
# Show the head & tail of mutual-information data-frame
display(mutual_info_df[:5])
display(mutual_info_df[-5:])
print("# of features: {}.".format(len(mutual_info_df)))
print("# of features that reduces uncertainty BY-MORE-THAN-10%: {}.".\
format(len(mutual_info_df[mutual_info_df['MI'] > 0.1*targets_entropy])))
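As an optional readability aid, each MI value can also be expressed as a fraction of the target entropy, i.e. the share of label uncertainty a single feature removes on its own (a small sketch, not required for the rest of the analysis):
# Optional: normalized mutual information, as a fraction of the target entropy.
mutual_info_df['MI_fraction_of_entropy'] = mutual_info_df['MI'] / targets_entropy
display(mutual_info_df[:5])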
COMMENT: Seeing that almost all features reduce the target uncertainty by a meaningful amount (more than 10% of the target entropy) is reassuring.
# Reverse the label-encoding, as we don't need it anymore
data['target'] = le.inverse_transform(data['target'])
# Split into training, validation and test sets (roughly 56% / 19% / 25% of the data)
X_train, X_test, y_train, y_test = train_test_split(data.drop(['objid', 'target'], axis=1), data['target'], test_size=0.25, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=42)
print("Number of classes of training set: ", y_train.nunique())
print("Number of classes of validation set: ", y_val.nunique())
print("Number of classes of testing set: ", y_test.nunique())
COMMENT. The prints above let us verify that no split is missing a large share of the classes, so evaluation on the validation and test sets stays meaningful.
MAIN IDEAS.
Why do we need to address the class imbalance? Because otherwise our trained models will have low predictive power on the infrequent classes of the dataset.
How are we going to address it? By resampling the training set: oversampling the minor classes and, possibly, undersampling the major ones.
How do we know which resamplings are good? By comparing the macro-averaged f1-score on the validation set against a baseline model trained without resampling.
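Throughout this section the comparison criterion is the macro-averaged f1-score on the validation set (the 'macro avg' row reported by report_f1 below). For completeness, the equivalent sklearn scorer can be built from the imports at the top, should one want to plug it into cross_validate or a grid search later (a small optional sketch):
# Optional: a macro-averaged f1 scorer (equivalent to passing scoring='f1_macro').
f1_macro_scorer = make_scorer(f1_score, average='macro')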
# A helper function to display the score
def report_f1(y_test, y_pred):
""" Return the well-displayed scores for evaluation.
Parameters:
y_test: the true labels
y_pred: the predicted labels
Return:
result (DataFrame): the dataframe of scores
"""
# Evaluate the performance using sklearn library
report = classification_report(y_test, y_pred, output_dict=True)
report_df = pd.DataFrame.from_dict(report).transpose()
idx_first = ['macro avg']
# Re-order to show f1 macro first
df1 = report_df.loc[idx_first]
idx_df2 = list(set(report_df.index) - set(idx_first + ['weighted avg', 'micro avg']))
df2 = report_df.loc[idx_df2].sort_values(by=['f1-score'], ascending=False)
result = pd.concat([df1, df2]).transpose()
return result
# Define a baseline model to fit, predict and evaluate the input data sets and models
def baseline_model(X_train, y_train, X_test, y_test, model):
""" Return the evaluation of given models and given inputs.
Parameters:
X_train: features for training
y_train: true labels for training
X_test: features for testing/validation
y_test: true labels for testing/validation
model: the learning model
Return:
A dictionary of evaluation
"""
start = time.time()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
running_time = time.time()-start
report = report_f1(y_test, y_pred)
print("Time execution: ", running_time)
return {'model': model, 'y_pred': y_pred, 'running_time': running_time, 'report': report}
FIRST, for the baseline-model, we'll use RandomForestClassifier.
rf = RandomForestClassifier(n_estimators=100, random_state=42)
baseline_rf = baseline_model(X_train, y_train, X_val, y_val, rf)
baseline_rf['report']
COMMENT-1.
The high scores belong to classes like badfocus (artefact), silks, detritus, Codonellopsis (Dictyocystidae), Neoceratium, Thalassionema and rods. These are major classes and/or classes with distinctive image shapes.
The low scores belong to the minor classes, which do not provide enough information (examples) to classify them reliably.
NEXT, we inspect the feature importances of the baseline model:
feature_importance = baseline_rf['model'].feature_importances_
plt.bar(range(0, len(feature_importance)), sorted(feature_importance, reverse=True));
COMMENT.
The feature importances decrease gradually in this graph; only four features have very low importance. Therefore, we keep all the features for the learning models.
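To see which columns those are, here is a small optional sketch that maps the importances back to feature names; the importances are aligned with X_train.columns because the model was fitted on X_train:
# Optional: show the ten most important features by name.
importances_named = pd.Series(feature_importance, index=X_train.columns)
display(importances_named.sort_values(ascending=False)[:10])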
# A helper function to resample several classes
def resampling_multiclass(X_train, y_train, alg, classes):
""" Resample the given classes while remainig the other classes
Parameters:
X_train: Features for training
y_train: Labels for training
alg: A method for resampling
classes: The classes to resample
Returns:
X_train_new: Features after resampling
y_train_new: Labels after resampling
"""
# Get row indices of given classes
idx_g1 = y_train[y_train.isin(classes)].index
# Get row indices of the remaining classes
idx_g2 = y_train[~y_train.isin(classes)].index
# Get new set for features according to idx_g1
X_train_g1 = X_train.loc[idx_g1]
y_train_g1 = y_train.loc[idx_g1]
# Get new set for features according to idx_g2
X_train_g2 = X_train.loc[idx_g2]
y_train_g2 = y_train.loc[idx_g2]
# Resample the new set of the given classes
X_res, y_res = alg.fit_resample(X_train_g1, y_train_g1)
# Concatenate the resampled sets and non-resample sets
X_train_new = np.concatenate([X_res, X_train_g2])
y_train_new = np.concatenate([y_res, y_train_g2])
return X_train_new, y_train_new
FIRST is oversampling for infrequent classes.
Here we use SMOTE (Synthetic Minority Oversampling Technique), a very popular oversampling method that was proposed to improve random oversampling.
About SMOTE: rather than replicating minority observations, SMOTE creates synthetic observations by interpolating between a minority sample and one of its nearest minority neighbours. The basic implementation of SMOTE does not distinguish between samples that are easy or hard to classify.
We also use a SMOTE variant, Borderline-SMOTE, which focuses the oversampling on minority samples near the border between classes.
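To make the interpolation idea concrete, here is a tiny worked example on made-up 2D points; it is purely illustrative and not part of the pipeline:
# Illustration of the SMOTE idea on made-up 2D points: a synthetic sample lies
# on the segment between a minority point and one of its minority neighbours.
x_i = np.array([1.0, 2.0])       # a minority sample (made up)
x_nn = np.array([2.0, 3.0])      # one of its nearest minority neighbours (made up)
lam = np.random.uniform(0, 1)    # random interpolation factor in [0, 1]
x_synthetic = x_i + lam * (x_nn - x_i)
print(x_synthetic)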
# The minority classes selected for oversampling (reused for both SMOTE variants below)
minor_classes = [
    'multiple (other)', 'badfocus (artefact)', 'silks', 'Copepoda', 'Thalassionema',
    'rods', 'Codonellopsis (Dictyocystidae)', 'Protoperidinium', 'Tintinnidiidae',
    'Rhizosolenids', 'Chaetoceros', 'artefact', 'pollen', 'Codonaria', 'chainlarge',
    'Undellidae', 'Hemiaulus', 'egg (other)', 'Dinophysiales', 'Dictyocysta',
    'Annelida', 'Stenosemella', 'Rhabdonella', 'Coscinodiscids', 'Retaria',
    'Pleurosigma', 'Ceratocorys horrida', 'centric', 'Odontella (Mediophyceae)',
    'Asterionellopsis', 'Cyttarocylis', 'Lithodesmioides', 'tempChaetoceros danicus',
    'Xystonellidae', 'Bacteriastrum'
]
sm = SMOTE(random_state=42)
X_train_sm, y_train_sm = resampling_multiclass(X_train, y_train, sm, minor_classes)
baseline_model(X_train_sm, y_train_sm, X_val, y_val, rf)['report']
bsm = BorderlineSMOTE(random_state=42)
X_train_bsm, y_train_bsm = resampling_multiclass(X_train, y_train, bsm, minor_classes)
baseline_model(X_train_bsm, y_train_bsm, X_val, y_val, rf)['report']
COMMENT.
Based on the results of the baseline model, we varied the number of classes to oversample and chose to oversample 33 minor classes, each of them up to 4,416 instances. We chose this target because it is not too large yet still gives the highest f1-score (in the baseline model, badfocus (artefact) has 4,416 instances in the training set and obtains the highest per-class f1-score, 0.952245).
Results of oversampling:
Compared to the baseline model, the model trained on the oversampled data gives a better score (an increase of about 10.25 percentage points, from 44.13% to 54.38%). This is a significant improvement.
Compared to Borderline-SMOTE, SMOTE runs a bit slower but gives the better score (54.38%). Since we prioritise the score, we choose SMOTE as the oversampling technique for this project.
Further work: as can be seen from the two output tables above, SMOTE and Borderline-SMOTE improve different classes, so the two could be combined, each oversampling a different subset of classes; a rough sketch follows below.
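A rough sketch of such a combination, reusing resampling_multiclass twice; the two class groups below are arbitrary placeholders rather than tuned choices, and within each call the classes are only oversampled up to the largest class of that group:
# Hypothetical sketch (not run): oversample one group of classes with SMOTE and
# another with Borderline-SMOTE, applying resampling_multiclass in sequence.
group_a = ['silks', 'Copepoda']          # placeholder classes for SMOTE
group_b = ['Chaetoceros', 'artefact']    # placeholder classes for Borderline-SMOTE
X_tmp, y_tmp = resampling_multiclass(X_train, y_train, SMOTE(random_state=42), group_a)
X_comb, y_comb = resampling_multiclass(pd.DataFrame(X_tmp), pd.DataFrame(y_tmp)[0],
                                       BorderlineSMOTE(random_state=42), group_b)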
NEXT is under-sampling.
We use controlled under-sampling methods, for which the expected number of instances after undersampling can be specified. After oversampling, we use RandomUnderSampler and NearMiss-1 to undersample the major classes ('detritus' and 'feces').
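For reference, the target counts can also be specified explicitly through sampling_strategy; the figure of 20,000 instances below is an arbitrary placeholder, not a tuned value:
# Hypothetical sketch: undersample 'detritus' down to a chosen (arbitrary) count,
# leaving all other classes untouched.
rus_fixed = RandomUnderSampler(sampling_strategy={'detritus': 20000}, random_state=42)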
nm = NearMiss(random_state=42)
X_train_nm, y_train_nm = resampling_multiclass(pd.DataFrame(X_train_sm), pd.DataFrame(y_train_sm)[0], nm, ['detritus', 'feces'])
baseline_model(X_train_nm, y_train_nm, X_val, y_val, rf)['report']
rus = RandomUnderSampler(random_state=42)
X_train_rus, y_train_rus = resampling_multiclass(pd.DataFrame(X_train_sm), pd.DataFrame(y_train_sm)[0], rus, ['detritus', 'feces'])
baseline_model(X_train_rus, y_train_rus, X_val, y_val, rf)['report']
COMMENT.
As can be seen from the results above, these undersampling techniques do not improve the macro f1-score, so we do not use them in the final models.
FINALLY, we try TomekLinks, a cleaning undersampling method for which the number of samples to remove does not need to be specified.
# Undersampling using TomekLinks
tl = TomekLinks(sampling_strategy='all', random_state=42)
X_train_tl, y_train_tl = tl.fit_resample(X_train_sm, y_train_sm)
baseline_model(X_train_tl, y_train_tl, X_val, y_val, rf)['report']
COMMENT.
The TomekLinks step improves the score slightly, from 54.38% (after SMOTE alone) to 54.68% (after SMOTE followed by TomekLinks).
SUMMARY for RESAMPLING DATA
Based on the results of resampling the training set and evaluating on the validation set, our training data will be oversampled with SMOTE and then cleaned with TomekLinks.
We do not use cross-validation to measure model performance, because the training set was resampled: cross-validating on the resampled training set would leak (synthetic copies of) hold-out samples into the training folds and lead to over-optimistic, over-fitted estimates.
Instead, we train models on the (resampled) training set and evaluate them on the validation set (which is not resampled).
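If cross-validated estimates were needed, the leak could be avoided by resampling inside each fold, for example with an imblearn Pipeline. The sketch below was not run here (it would multiply the already long training time), and note that a plain SMOTE step oversamples every minority class, unlike our per-class resampling above:
# Sketch only (not run): resampling inside each CV fold avoids the leak, because
# the samplers are fitted on the training folds only.
from imblearn.pipeline import Pipeline as ImbPipeline
cv_pipeline = ImbPipeline(steps=[
    ('smote', SMOTE(random_state=42)),
    ('tomek', TomekLinks(sampling_strategy='all')),
    ('rf', RandomForestClassifier(n_estimators=100, random_state=42)),
])
# cv_scores = cross_validate(cv_pipeline, X_train, y_train, cv=3, scoring='f1_macro')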
Here we use RandomForestClassifier and XGBClassifier for model selection.
rf = RandomForestClassifier(n_estimators=100, random_state=42)
model_rf = baseline_model(X_train_tl, y_train_tl, X_val, y_val, rf)
model_rf['report']
xgb = XGBClassifier(n_estimators=100, random_state=42)
model_xgb = baseline_model(X_train_tl, y_train_tl, X_val.values, y_val, xgb)
model_xgb['report']
COMMENT.
Results:
RandomForestClassifier outperforms XGBClassifier in both score and running time, and Random Forest is also easier to parallelise.
In conclusion, our final model is RandomForestClassifier.
The hyper-parameter settings for RandomForestClassifier are:
{bootstrap=True, class_weight=None, criterion='gini',
max_depth=None, max_features='auto', max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=None,
oob_score=False, random_state=42, verbose=0, warm_start=False
}
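These are essentially the sklearn defaults apart from n_estimators and random_state; they can be confirmed with get_params() if desired:
# Optional check: list the hyper-parameters actually used by the final model.
print(RandomForestClassifier(n_estimators=100, random_state=42).get_params())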
In this section, we evaluate the final model using the test set.
final_model = RandomForestClassifier(n_estimators=100, random_state=42)
final_result = baseline_model(X_train_tl, y_train_tl, X_test, y_test, final_model)
final_result['report']['macro avg']
Our final macro-averaged f1-score on the held-out test set is 57.28%.
The time for training and prediction is 501 seconds.