!pip3 install --user imbalanced-learn
import warnings
warnings.filterwarnings('ignore')
import numpy as np
import time
import pandas as pd
pd.set_option('display.max_columns', 130)
from collections import Counter
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, cross_validate
from sklearn.feature_selection import RFECV
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.metrics import classification_report
from sklearn.metrics import make_scorer
from sklearn.metrics import f1_score
from imblearn.under_sampling import RandomUnderSampler, NearMiss, TomekLinks
from imblearn.over_sampling import SMOTE, BorderlineSMOTE
import zipfile
from io import BytesIO
from PIL import Image
from sklearn import preprocessing
import seaborn as sns
def extract_zip_to_memory(input_zip):
'''
This function extracts the images stored inside the given zip file.
It stores the result in a python dictionary.
input_zip (string): path to the zip file
returns (dict): {filename (string): image_file (bytes)}
'''
input_zip=zipfile.ZipFile(input_zip)
return {name: BytesIO(input_zip.read(name)) for name in input_zip.namelist() if name.endswith('.jpg')}
start = time.time()
img_files = extract_zip_to_memory("/mnt/datasets/plankton/flowcam/imgs.zip")
print("Extraction time: ", time.time()-start)
print("Number of images: %d" % len(img_files))
Comment. The dataset is fairly large (~240k images), compared to the Oregon State dataset (~30k images).
def view_images(keys):
""" Show images of planktons within the dataset,
given the images' names.
Parameters:
keys (string): Name of the images
Return:
None.
"""
for key in keys:
# Read the Image
img = imread(img_files[key])
# Show the Image
print(key)
plt.imshow(img, cmap=cm.gray)
plt.show()
FIRST, let's take a look at the images themselves.
from skimage.io import imread, imshow
from pylab import cm
import random
# Randomly selects 5 images
img_keys = list(img_files.keys())
rand_names = random.sample(img_keys, 5)
# Show the images
view_images(rand_names)
NEXT, let's inspect the type & depth of the images' pixels.
modes = [Image.open(img_files[k]).mode for k in img_files]
print(Counter(modes))
Comment.
Since all 243,610 images in this dataset are of mode L, we can conclude that the images are 8-bit greyscale.
After running this random-display process several times, we also see that the sizes (dimensions) of the images differ a lot; each image is cropped tightly around the plankton object.
The implication is that if one wants to use a CNN on this dataset, all images have to be resized to a common size, a very time-consuming process.
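For illustration only, here is a minimal sketch of what such resizing could look like with PIL; the 64x64 target size and the three sample images are arbitrary assumptions, and we do not use this anywhere else in the notebook.
# Illustration only: resize a few images to a common (arbitrary) 64x64 size.
sample_keys = list(img_files.keys())[:3]
resized = {k: Image.open(img_files[k]).resize((64, 64)) for k in sample_keys}
print({k: im.size for k, im in resized.items()})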
FINALLY, we plot the distribution of images' sizes to see if we can get some more insights.
# Get the dimensions of the image
dims = [Image.open(img_files[k]).size for k in img_files]
# Get the ratios of the image
wh_ratios = [(w / h) for w, h in dims]
widths = [w for w, _ in dims]
heights = [h for _, h in dims]
# Graph the distribution of width-height ratios
plt.figure()
r = sns.distplot(wh_ratios, bins=1000, kde=False)
r.set(xlim=(0, 5), title='width / height ratio')
# Graph the distribution of widths
plt.figure()
sns.distplot(widths).set_title('widths')
# Graph the distribution of heights
plt.figure()
sns.distplot(heights).set_title('heights')
COMMENT.
Looking at the width-height ratio graph and at the widths and heights graphs, we can see that widths and heights share a very similar, skewed distribution with modes around 100 pixels.
The overall implication is that image sizes and shapes vary a lot in this dataset. Scaling everything to the same size before applying a CNN would therefore distort a BIG proportion of the images and encode misleading (latent) signals. Unless we are very skilled at image processing (which we surely aren't), we should avoid the CNN path.
FIRST, we load the meta-data of the images.
meta = pd.read_csv('/mnt/datasets/plankton/flowcam/meta.csv')
meta.head(5)
COMMENT. A big surprise when we look at the labels of this dataset is that many of them aren't plankton in the common sense.
NEXT, we build the targets dataframe, which includes object-ids and their corresponding level-2 labels.
targets_df = meta[['objid','level2']].drop_duplicates()
targets_df = targets_df[targets_df['level2'].isna()!=True]
targets_df['objid'] = targets_df['objid'].astype('int64')
targets_df.head(5)
print("Number of records: %d" % targets_df.shape[0])
print("Number of classes: %d" % targets_df['level2'].nunique())
COMMENT. Here we build the target set as follows: keep only objid and level2, drop duplicated rows, and drop rows whose level2 label is missing.
After this process, 242,607 records are left, fewer than the 243,610 images in the dataset. Therefore, we need to make sure that every record in the target set has a corresponding image; otherwise we would train our models on objects that don't exist.
FINALLY, we plot the distribution of labels.
labels_freq = Counter(targets_df['level2'])
labels_freq_df = pd.DataFrame.from_dict(labels_freq, orient='index')\
.rename(columns={0: 'count'})\
.sort_values(by=['count'], ascending=False)
labels_freq_df.plot(kind='bar');
COMMENT. As speculated above, this dataset is highly imbalanced.
The biggest class has 77,812 instances (detritus, or dead bodies), while the smallest class has only 8 instances (Bacteriastrum, an actual plankton).
It is very difficult to get good performance if we leave the dataset as imbalanced as it is. Therefore, resampling techniques are needed, possibly over-sampling the minor classes.
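To put a number on this imbalance, here is a small optional sketch computing the ratio between the largest and smallest class counts, using the labels_freq_df built above:
# Optional: quantify the class imbalance as a max/min count ratio.
max_count = labels_freq_df['count'].max()
min_count = labels_freq_df['count'].min()
print("Largest class: %d, smallest class: %d, imbalance ratio: %.0f"
      % (max_count, min_count, max_count / min_count))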
FIRST, we encode the labels of our targets dataframe.
# Initialize the LabelEncoder
le = preprocessing.LabelEncoder()
le.fit(targets_df['level2'])
# Encode the labels, drop the original column
targets_df['target'] = le.transform(targets_df['level2'])
targets_df.drop(['level2'], axis=1, inplace=True)
targets_df.head(5)
NEXT, we ensure that all object-ids belong to the set of image-ids.
# Get the ID of all images
image_ids = [int(k.split('/')[1].split('.')[0]) for k in img_files]
image_ids_set = set(image_ids)
# Check if all object-ids are belong to image-ids
object_ids_set = set(targets_df['objid'])
object_ids_set.difference(image_ids_set)
COMMENT. Since all object-ids refer to images in our image-dataset, we can proceed with feature exploration.
WHY DO WE USE FEATURES INSTEAD OF IMAGES?
As discussed above, the images vary widely in size and shape, so an image-based (CNN) approach would require heavy, error-prone resizing. That's why we decided to use the precomputed feature sets instead (both of them, indeed).
TO BEGIN, we investigate the Native feature-set and see if there is anything we should do to preprocess the data.
# Load the Native Feature-set
features_native = pd.read_csv('/mnt/datasets/plankton/flowcam/features_native.csv.gz', compression='gzip', error_bad_lines=False)
features_native['objid'] = features_native['objid'].astype('int64')
We find all columns that contain at least one NaN.
features_native.loc[:, features_native.isna().any()][:10]
COMMENT. The missing values fall into two groups of columns; we inspect a few affected rows from each group.
# For the columns: 'perimareaexc', 'feretareaexc' & 'cdexc'
display(features_native[features_native['perimareaexc'].isnull()].loc[:, features_native.isna().any()][:10])
# For the columns: 'convarea_area', 'symetrieh_area', 'symetriev_area',
# 'nb1_area', 'nb2_area', 'nb3_area' & 'skeleton_area'
display(features_native[features_native['convarea_area'].isnull()].loc[:, features_native.isna().any()][:10])
COMMENT. We inspect the images of a few of the rows identified above:
# Get the obj-ids of object at index = 17, 22, 29
focus_ids = features_native.loc[[17, 22, 29], ['objid']].objid.tolist()
focus_keys = ["imgs/{}.jpg".format(id) for id in focus_ids]
# View their images
view_images(focus_keys)
# Get the obj-ids of object at index = 826, 842, 855
focus_ids = features_native.loc[[826, 842, 855], ['objid']].objid.tolist()
focus_keys = ["imgs/{}.jpg".format(id) for id in focus_ids]
# View their images
view_images(focus_keys)
COMMENT. In each case, we can see that the images are very similar, suggesting the missing values are systematic (evidence of absence) rather than random (absence of evidence). We therefore fill them with 0:
# Fill all missing values with 0
features_native.fillna(0, inplace=True)
NEXT, we investigate the skimage feature-set, for the purpose of data-preprocessing.
# Load the skimage's feature-set
features_skimage = pd.read_csv('/mnt/datasets/plankton/flowcam/features_skimage.csv.gz', compression='gzip', error_bad_lines=False)
features_skimage['objid'] = features_skimage['objid'].astype('int64')
# View ALL columns that contains NaN
features_skimage.loc[:10, features_skimage.isna().any()]
COMMENT. These columns are inherently flawed, as they contain mostly NaN. This makes sense because the features here are recomputed with skimage, a very general library, so it is normal that some features cannot be computed. Therefore, we remove all these mostly-empty columns.
# Drop the 'flawed' columns
features_skimage.dropna(axis='columns', inplace=True)
NOW, we'll merge the 2 feature-sets with the targets, using object-ids.
data = features_native.merge(features_skimage, on='objid', how='inner').merge(targets_df, on='objid', how='inner')
print("Data after joining features and target: ", data.shape)
data.head(10)
FINALLY, we measure the mutual-information between the merged features and the targets.
from sklearn.feature_selection import mutual_info_classif
# Get the Merged-Features
merged_features = data.drop(columns=["objid", "target"])
# Get Mutual Information between 'target' and each 'feature'
# Then turn it into a DataFrame, and sort descendingly
mutual_info = mutual_info_classif(merged_features, data['target'])
mutual_info_df = pd.DataFrame(data=mutual_info, index=merged_features.columns, columns=['MI'])
mutual_info_df.sort_values(by='MI', inplace=True, ascending=False)
To judge how informative these mutual-information values are, we compare them to the entropy of the target distribution.
from scipy.stats import entropy
targets_freq = data['target'].value_counts().tolist()
targets_entropy = entropy(targets_freq)
print(targets_entropy)
# Show the head & tail of mutual-information data-frame
display(mutual_info_df[:5])
display(mutual_info_df[-5:])
print("# of features: {}.".format(len(mutual_info_df)))
print("# of features that reduces uncertainty BY-MORE-THAN-10%: {}.".\
format(len(mutual_info_df[mutual_info_df['MI'] > 0.1*targets_entropy])))
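As an optional readability aid, each MI value can also be expressed as a fraction of the target entropy, i.e. the share of label uncertainty a single feature removes on its own (a small sketch, not required for the rest of the analysis):
# Optional: normalized mutual information, as a fraction of the target entropy.
mutual_info_df['MI_fraction_of_entropy'] = mutual_info_df['MI'] / targets_entropy
display(mutual_info_df[:5])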
COMMENT: Seeing that almost all features reduce the target uncertainty by a meaningful amount (more than 10% of the target entropy) is reassuring.
# Reverse the label-encoding, as we don't need it anymore
data['target'] = le.inverse_transform(data['target'])
# Split into training, validation and test sets (roughly 56% / 19% / 25% of the data)
X_train, X_test, y_train, y_test = train_test_split(data.drop(['objid', 'target'], axis=1), data['target'], test_size=0.25, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=42)
print("Number of classes of training set: ", y_train.nunique())
print("Number of classes of validation set: ", y_val.nunique())
print("Number of classes of testing set: ", y_test.nunique())
COMMENT. The prints above let us verify that no split is missing a large share of the classes, so evaluation on the validation and test sets stays meaningful.
MAIN IDEAS.
Why do we need to address the class imbalance? Because otherwise our trained models will have low predictive power on the infrequent classes of the dataset.
How are we going to address it? By resampling the training set: oversampling the minor classes and, possibly, undersampling the major ones.
How do we know which resamplings are good? By comparing the macro-averaged f1-score on the validation set against a baseline model trained without resampling.
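Throughout this section the comparison criterion is the macro-averaged f1-score on the validation set (the 'macro avg' row reported by report_f1 below). For completeness, the equivalent sklearn scorer can be built from the imports at the top, should one want to plug it into cross_validate or a grid search later (a small optional sketch):
# Optional: a macro-averaged f1 scorer (equivalent to passing scoring='f1_macro').
f1_macro_scorer = make_scorer(f1_score, average='macro')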
# A helper function to display the score
def report_f1(y_test, y_pred):
""" Return the well-displayed scores for evaluation.
Parameters:
y_test: the true labels
y_pred: the predicted labels
Return:
result (DataFrame): the dataframe of scores
"""
# Evaluate the performance using sklearn library
report = classification_report(y_test, y_pred, output_dict=True)
report_df = pd.DataFrame.from_dict(report).transpose()
idx_first = ['macro avg']
# Re-order to show f1 macro first
df1 = report_df.loc[idx_first]
idx_df2 = list(set(report_df.index) - set(idx_first + ['weighted avg', 'micro avg']))
df2 = report_df.loc[idx_df2].sort_values(by=['f1-score'], ascending=False)
result = pd.concat([df1, df2]).transpose()
return result
# Define a baseline model to fit, predict and evaluate the input data sets and models
def baseline_model(X_train, y_train, X_test, y_test, model):
""" Return the evaluation of given models and given inputs.
Parameters:
X_train: features for training
y_train: true labels for training
X_test: features for testing/validation
y_test: true labels for testing/validation
model: the learning model
Return:
A dictionary of evaluation
"""
start = time.time()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
running_time = time.time()-start
report = report_f1(y_test, y_pred)
print("Time execution: ", running_time)
return {'model': model, 'y_pred': y_pred, 'running_time': running_time, 'report': report}
FIRST, for the baseline-model, we'll use RandomForestClassifier.
rf = RandomForestClassifier(n_estimators=100, random_state=42)
baseline_rf = baseline_model(X_train, y_train, X_val, y_val, rf)
baseline_rf['report']
COMMENT-1.
The high scores belong to classes like badfocus (artefact), silks, detritus, Codonellopsis (Dictyocystidae), Neoceratium, Thalassionema and rods. These are major classes and/or classes with distinctive image shapes.
The low scores belong to the minor classes, which do not provide enough information (examples) to classify them reliably.
NEXT, we inspect the feature importances of the baseline model:
feature_importance = baseline_rf['model'].feature_importances_
plt.bar(range(0, len(feature_importance)), sorted(feature_importance, reverse=True));
COMMENT.
The feature importances decrease gradually in this graph; only four features have very low importance. Therefore, we keep all the features for the learning models.
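To see which columns those are, here is a small optional sketch that maps the importances back to feature names; the importances are aligned with X_train.columns because the model was fitted on X_train:
# Optional: show the ten most important features by name.
importances_named = pd.Series(feature_importance, index=X_train.columns)
display(importances_named.sort_values(ascending=False)[:10])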
# A helper function to resample several classes
def resampling_multiclass(X_train, y_train, alg, classes):
""" Resample the given classes while remainig the other classes
Parameters:
X_train: Features for training
y_train: Labels for training
alg: A method for resampling
classes: The classes to resample
Returns:
X_train_new: Features after resampling
y_train_new: Labels after resampling
"""
# Get row indices of given classes
idx_g1 = y_train[y_train.isin(classes)].index
# Get row indices of the remaining classes
idx_g2 = y_train[~y_train.isin(classes)].index
# Get new set for features according to idx_g1
X_train_g1 = X_train.loc[idx_g1]
y_train_g1 = y_train.loc[idx_g1]
# Get new set for features according to idx_g2
X_train_g2 = X_train.loc[idx_g2]
y_train_g2 = y_train.loc[idx_g2]
# Resample the new set of the given classes
X_res, y_res = alg.fit_resample(X_train_g1, y_train_g1)
# Concatenate the resampled sets and non-resample sets
X_train_new = np.concatenate([X_res, X_train_g2])
y_train_new = np.concatenate([y_res, y_train_g2])
return X_train_new, y_train_new
FIRST is oversampling for infrequent classes.
Here we use SMOTE (Synthetic Minority Oversampling Technique), a very popular oversampling method that was proposed to improve random oversampling.
About SMOTE: rather than replicating minority observations, SMOTE creates synthetic observations by interpolating between a minority sample and one of its nearest minority neighbours. The basic implementation of SMOTE does not distinguish between samples that are easy or hard to classify.
We also use a SMOTE variant, Borderline-SMOTE, which focuses the oversampling on minority samples near the border between classes.
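To make the interpolation idea concrete, here is a tiny worked example on made-up 2D points; it is purely illustrative and not part of the pipeline:
# Illustration of the SMOTE idea on made-up 2D points: a synthetic sample lies
# on the segment between a minority point and one of its minority neighbours.
x_i = np.array([1.0, 2.0])       # a minority sample (made up)
x_nn = np.array([2.0, 3.0])      # one of its nearest minority neighbours (made up)
lam = np.random.uniform(0, 1)    # random interpolation factor in [0, 1]
x_synthetic = x_i + lam * (x_nn - x_i)
print(x_synthetic)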
# The minority classes selected for oversampling (reused for both SMOTE variants below)
minor_classes = [
    'multiple (other)', 'badfocus (artefact)', 'silks', 'Copepoda', 'Thalassionema',
    'rods', 'Codonellopsis (Dictyocystidae)', 'Protoperidinium', 'Tintinnidiidae',
    'Rhizosolenids', 'Chaetoceros', 'artefact', 'pollen', 'Codonaria', 'chainlarge',
    'Undellidae', 'Hemiaulus', 'egg (other)', 'Dinophysiales', 'Dictyocysta',
    'Annelida', 'Stenosemella', 'Rhabdonella', 'Coscinodiscids', 'Retaria',
    'Pleurosigma', 'Ceratocorys horrida', 'centric', 'Odontella (Mediophyceae)',
    'Asterionellopsis', 'Cyttarocylis', 'Lithodesmioides', 'tempChaetoceros danicus',
    'Xystonellidae', 'Bacteriastrum'
]
sm = SMOTE(random_state=42)
X_train_sm, y_train_sm = resampling_multiclass(X_train, y_train, sm, minor_classes)
baseline_model(X_train_sm, y_train_sm, X_val, y_val, rf)['report']
bsm = BorderlineSMOTE(random_state=42)
X_train_bsm, y_train_bsm = resampling_multiclass(X_train, y_train, bsm, minor_classes)
baseline_model(X_train_bsm, y_train_bsm, X_val, y_val, rf)['report']
COMMENT.
Based on the results of the baseline model, we varied the number of classes to oversample and chose to oversample 33 minor classes, each of them up to 4,416 instances. We chose this target because it is not too large yet still gives the highest f1-score (in the baseline model, badfocus (artefact) has 4,416 instances in the training set and obtains the highest per-class f1-score, 0.952245).
Results of oversampling:
Compared to the baseline model, the model trained on the oversampled data gives a better score (an increase of about 10.25 percentage points, from 44.13% to 54.38%). This is a significant improvement.
Compared to Borderline-SMOTE, SMOTE runs a bit slower but gives the better score (54.38%). Since we prioritise the score, we choose SMOTE as the oversampling technique for this project.
Further work: as can be seen from the two output tables above, SMOTE and Borderline-SMOTE improve different classes, so the two could be combined, each oversampling a different subset of classes; a rough sketch follows below.
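A rough sketch of such a combination, reusing resampling_multiclass twice; the two class groups below are arbitrary placeholders rather than tuned choices, and within each call the classes are only oversampled up to the largest class of that group:
# Hypothetical sketch (not run): oversample one group of classes with SMOTE and
# another with Borderline-SMOTE, applying resampling_multiclass in sequence.
group_a = ['silks', 'Copepoda']          # placeholder classes for SMOTE
group_b = ['Chaetoceros', 'artefact']    # placeholder classes for Borderline-SMOTE
X_tmp, y_tmp = resampling_multiclass(X_train, y_train, SMOTE(random_state=42), group_a)
X_comb, y_comb = resampling_multiclass(pd.DataFrame(X_tmp), pd.DataFrame(y_tmp)[0],
                                       BorderlineSMOTE(random_state=42), group_b)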
NEXT is under-sampling.
We use controlled under-sampling methods, for which the expected number of instances after undersampling can be specified. After oversampling, we use RandomUnderSampler and NearMiss-1 to undersample the major classes ('detritus' and 'feces').
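For reference, the target counts can also be specified explicitly through sampling_strategy; the figure of 20,000 instances below is an arbitrary placeholder, not a tuned value:
# Hypothetical sketch: undersample 'detritus' down to a chosen (arbitrary) count,
# leaving all other classes untouched.
rus_fixed = RandomUnderSampler(sampling_strategy={'detritus': 20000}, random_state=42)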
nm = NearMiss(random_state=42)
X_train_nm, y_train_nm = resampling_multiclass(pd.DataFrame(X_train_sm), pd.DataFrame(y_train_sm)[0], nm, ['detritus', 'feces'])
baseline_model(X_train_nm, y_train_nm, X_val, y_val, rf)['report']
rus = RandomUnderSampler(random_state=42)
X_train_rus, y_train_rus = resampling_multiclass(pd.DataFrame(X_train_sm), pd.DataFrame(y_train_sm)[0], rus, ['detritus', 'feces'])
baseline_model(X_train_rus, y_train_rus, X_val, y_val, rf)['report']
COMMENT.
As can be seen from the results above, these undersampling techniques do not improve the macro f1-score, so we do not use them in the final models.
FINALLY, we try TomekLinks, a cleaning undersampling method for which the number of samples to remove does not need to be specified.
# Undersampling using TomekLinks
tl = TomekLinks(sampling_strategy='all', random_state=42)
X_train_tl, y_train_tl = tl.fit_resample(X_train_sm, y_train_sm)
baseline_model(X_train_tl, y_train_tl, X_val, y_val, rf)['report']
COMMENT.
The TomekLinks step improves the score slightly, from 54.38% (after SMOTE alone) to 54.68% (after SMOTE followed by TomekLinks).
SUMMARY for RESAMPLING DATA
Based on the results of resampling the training set and evaluating on the validation set, our training data will be oversampled with SMOTE and then cleaned with TomekLinks.
We do not use cross-validation to measure model performance, because the training set was resampled: cross-validating on the resampled training set would leak (synthetic copies of) hold-out samples into the training folds and lead to over-optimistic, over-fitted estimates.
Instead, we train models on the (resampled) training set and evaluate them on the validation set (which is not resampled).
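If cross-validated estimates were needed, the leak could be avoided by resampling inside each fold, for example with an imblearn Pipeline. The sketch below was not run here (it would multiply the already long training time), and note that a plain SMOTE step oversamples every minority class, unlike our per-class resampling above:
# Sketch only (not run): resampling inside each CV fold avoids the leak, because
# the samplers are fitted on the training folds only.
from imblearn.pipeline import Pipeline as ImbPipeline
cv_pipeline = ImbPipeline(steps=[
    ('smote', SMOTE(random_state=42)),
    ('tomek', TomekLinks(sampling_strategy='all')),
    ('rf', RandomForestClassifier(n_estimators=100, random_state=42)),
])
# cv_scores = cross_validate(cv_pipeline, X_train, y_train, cv=3, scoring='f1_macro')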
Here we use RandomForestClassifier and XGBClassifier for model selection.
rf = RandomForestClassifier(n_estimators=100, random_state=42)
model_rf = baseline_model(X_train_tl, y_train_tl, X_val, y_val, rf)
model_rf['report']
xgb = XGBClassifier(n_estimators=100, random_state=42)
model_xgb = baseline_model(X_train_tl, y_train_tl, X_val.values, y_val, xgb)
model_xgb['report']
COMMENT.
Results:
RandomForestClassifier outperforms XGBClassifier in both score and running time, and Random Forest is also easier to parallelise.
In conclusion, our final model is RandomForestClassifier.
The hyper-parameter settings for RandomForestClassifier are:
{bootstrap=True, class_weight=None, criterion='gini',
max_depth=None, max_features='auto', max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=None,
oob_score=False, random_state=42, verbose=0, warm_start=False
}
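These are essentially the sklearn defaults apart from n_estimators and random_state; they can be confirmed with get_params() if desired:
# Optional check: list the hyper-parameters actually used by the final model.
print(RandomForestClassifier(n_estimators=100, random_state=42).get_params())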
In this section, we evaluate the final model using the test set.
final_model = RandomForestClassifier(n_estimators=100, random_state=42)
final_result = baseline_model(X_train_tl, y_train_tl, X_test, y_test, final_model)
final_result['report']['macro avg']
Our final macro-averaged f1-score on the held-out test set is 57.28%.
The time for training and prediction is 501 seconds.