!pip install --user ruleset
!pip install --user graphviz
import numpy as np
import pandas as pd
pd.set_option('display.max_columns', 50)
import matplotlib.pyplot as plt
import seaborn as sns
import time
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import normalize, StandardScaler
from sklearn.model_selection import cross_validate
from sklearn.metrics import f1_score
from sklearn.feature_selection import mutual_info_classif
from scipy.stats import entropy
from sklearn.model_selection import KFold
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV
from IPython.display import Image
from IPython.core.display import HTML 
from ruleset import *
import warnings
warnings.filterwarnings("ignore")
FIRST, let's get an overview of the data, as well as some statistics.
# Load data
columns = ['SessionNumber', 'SystemID', 'Date', 'HighPriorityAlerts', 'Dumps', 'CleanupOOMDumps', 'CompositeOOMDums', 'IndexServerRestarts', 'NameServerRestarts', 'XSEngineRestarts', 'PreprocessorRestarts', 'DaemonRestarts', 'StatisticsServerRestarts', 'CPU', 'PhysMEM', 'InstanceMEM', 'TablesAllocation', 'IndexServerAllocationLimit', 'ColumnUnloads', 'DeltaSize', 'MergeErrors', 'BlockingPhaseSec', 'Disk', 'LargestTableSize', 'LargestPartitionSize', 'DiagnosisFiles', 'DiagnosisFilesSize', 'DaysWithSuccessfulDataBackups', 'DaysWithSuccessfulLogBackups', 'DaysWithFailedDataBackups', 'DaysWithFailedfulLogBackups', 'MinDailyNumberOfSuccessfulDataBackups', 'MinDailyNumberOfSuccessfulLogBackups', 'MaxDailyNumberOfFailedDataBackups', 'MaxDailyNumberOfFailedLogBackups', 'LogSegmentChange', 'Check1', 'Check2', 'Check3', 'Check4', 'Check5', 'Check6', 'Check7', 'Check8']
data = pd.read_csv('/mnt/datasets/anomaly/data.csv', sep=';', names=columns)
data[:3]
# Size of data
print('\nData size: ', data.shape)
# Print count of missing values
print('\nMissing values:')
NAs = data.isna().sum()
display(pd.DataFrame(NAs[NAs>0]).transpose())
# Column name of labels
label_cols = [col for col in data.columns if 'Check' in col]
# Convert data type
cat_cols = ['SessionNumber', 'SystemID', 'MergeErrors']
for col in cat_cols:
    data[col] = data[col].astype(str)
data['Date'] = pd.to_datetime(data['Date'])
# Print count unique values of categorical features
print('\n\nCount unique values of categorical features: ')
display(pd.DataFrame(data[cat_cols].nunique(), columns=['CountUnique']))
Feature SessionNumber has too many unique values (228196). It is almost the same as a row id: on average, each session covers only ~1.26 rows (287031/228196). We should drop this column.
# Print descriptive statistics of numerical features
print('\n\nDescriptive statistics of numerical features: ')
display(data[data.columns.difference(cat_cols + label_cols)].describe())
Looking at the overall statistics, we can see that:
As mentioned above, we will remove the useless columns in the dataset.
# Drop useless columns
useless_cols = ['SessionNumber', 'CleanupOOMDumps', 'PreprocessorRestarts', 'DaemonRestarts', 'Date']
data.drop(useless_cols, inplace=True, axis=1)
cat_cols.remove('SessionNumber')
SECOND, we will explore the dataset with respect to the labels.
There are two approaches to solve this challenge: one multiclass classification model or 8 binary classification models. Let's take a look at both of them.
There are 8 binary labels, which means the target can take 256 (2^8) possible values.
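The count of 256 can be checked by enumerating every combination of 8 binary labels. The sketch below also shows one way to count the combinations that actually occur; the `toy` frame is made-up data with only 3 labels, used purely for illustration.

```python
from itertools import product

import pandas as pd

# Every possible combination of 8 binary labels
n_labels = 8
all_combinations = list(product([0, 1], repeat=n_labels))
print(len(all_combinations))  # 2 ** 8

# Toy label frame (hypothetical values, 3 labels only to keep it short)
toy = pd.DataFrame({
    'Check1': [0, 0, 1, 0],
    'Check2': [0, 0, 1, 0],
    'Check3': [0, 1, 0, 0],
})

# Count how often each distinct label combination occurs
observed = toy.groupby(list(toy.columns)).size()
print(observed)
```

In a multiclass setting, each distinct row of `observed` would become one class, so rare combinations turn into classes with very few examples.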
pd.DataFrame(data[label_cols].dropna().astype('int64').sum(axis=1))[0].value_counts()
As can be seen from the output, there is:
In summary, we will not use this approach because of several reasons:
We define a function to compute the percent of anomaly in each label.
# A function to count anomaly and non-anomaly and to compute percent of anomaly
def ratio_labels(df, label_cols):
    """ Count anomaly and non-anomaly, compute percent of anomaly
    Parameters:
        df: DataFrame
        label_cols (list of string): List of label names
    Return:
        DataFrame: A dataframe with one column (cnt_non-anomaly, cnt-anomaly, percent_anomaly)
    """
    # Count values each label in the given list
    count_labels = df[label_cols].apply(pd.Series.value_counts)
    # Compute percent of anomaly
    ratio_labels = count_labels.apply(lambda x: (x[0], x[1], round(x[1]/x.sum(), 2)), axis=0)
    ratio_labels_df = pd.DataFrame(ratio_labels, columns=['(Normal, Anomaly, %Anomaly)'])
    return ratio_labels_df
ratio_all = ratio_labels(data, label_cols)
display(ratio_all)
As can be seen from the output, all 8 labels have extremely imbalanced binary classes.
Compared to approach 1, approach 2, which consists of 8 binary classification tasks, is much better. We will use it to solve the challenge.
Datasets
Since the missing values differ between labels, we need 8 individual datasets for the 8 labels. Here we define a function to create the dataset for each label. The rows where the label is missing are removed in this function.
# A function to create a dataset for a given label
def get_data_by_label(df, label):
    """ Create a dataset for a given label
    Parameters:
        df (DataFrame): The original dataframe with multiple labels
        label (str): The given label
    Return: A dataframe with one given label
    """
    # Copy data where the given label is not null
    data = df[df[label].notna()].copy()
    # Convert type of the label from float to boolean
    data[label] = data[label].astype(bool)
    # Remove other labels
    data.drop(list(set(label_cols) - set([label])), axis=1, inplace=True)
    return data
We create 8 datasets for 8 models. These datasets are not preprocessed (they contain missing values, errors, outliers).
# Create 8 datasets which are not preprocessed for 8 labels
datasets_before_prep = []
for i in range(1, 9):
    df = get_data_by_label(data, 'Check%d'%i)
    datasets_before_prep.append(df)
1. Categorical features
For the categorical features, we'll consider the percentage of anomalies in each group. However, some groups may have too few instances, in which case the percentage of anomalies reveals little about them. Therefore, we consider both the percentage and the count of anomalies.
# A function to compute percent and count of anomaly in each group of each feature
def anomaly_groupby_cat(data, cat_col, label):
    """ Compute percent and count of anomaly
    Parameters:
        data: DataFrame
        cat_col (str): Name of a categorical feature
        label (str): Name of a given label
    Return:
        DataFrame which contains (count of non-anomaly, count of anomaly, percent of anomaly)
    """
    # Count anomaly and non-anomaly in each group of the feature
    df = pd.crosstab(data[cat_col], data[label])
    # Compute percent of anomaly
    df['PercentageOfAnomaly'] = df.apply(lambda r: r[True]*100/r.sum(), axis=1)
    # Sort dataframe by percent of anomaly
    df = df.sort_values(by=['PercentageOfAnomaly'], ascending=False)
    return df
# Create list of dataframes which contain percent and count of anomaly
groupby_anomalies = []
for i in range(1, 9):
    for col in cat_cols:
        groupby_anomaly = anomaly_groupby_cat(datasets_before_prep[i-1], col, 'Check%d'%i)
        groupby_anomalies.append(groupby_anomaly)
groupby_anomalies[1]
# Outline the Plot
fig, axes = plt.subplots(4, 4, figsize=(20, 12))
# Indexed and titled each subplot
subplot_idx = 0
# Plotting
for df in groupby_anomalies:
    row, col = divmod(subplot_idx, 4)
    ax1 = df.iloc[:, [0, 1]].plot.bar(stacked=True, legend=False, ax=axes[row, col])
    ax2 = df[:]['PercentageOfAnomaly'].plot(secondary_y=True, use_index=False, color='r', legend=False, ax=axes[row, col])
    ax2.set_ylim(0, 100)
    ax1.set_ylabel('Count')
    ax2.set_yticklabels(['{}%'.format(int(x)) for x in ax2.get_yticks()])
    h1, l1 = ax1.get_legend_handles_labels()
    h2, l2 = ax2.get_legend_handles_labels()
    axes[row, col].legend(h1 + h2, l1 + ['% Anomaly'])
    ax1.set_title('%s: Anomaly groupby %s' % (df.columns.name, df.index.name))
    ax1.set_xticks([])
    
    subplot_idx += 1
plt.tight_layout()
COMMENT.
A group is strongly related to anomalies when it has a high percentage of anomalies and is not too small (sample size). As can be seen from the plots of SystemID against Check4 and SystemID against Check6, several SystemID values have a high percentage of anomalies. This suggests that problems inside particular systems may have caused the anomalies.
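The two criteria above (high percentage, enough instances) can be combined into a simple filter. This is only a sketch on made-up data: the `toy` frame and the thresholds `min_count` and `min_percent` are assumptions chosen for illustration, not values used elsewhere in this notebook.

```python
import pandas as pd

# Toy data (hypothetical): system A is large with many anomalies,
# system B is tiny, system C is large with few anomalies
toy = pd.DataFrame({
    'SystemID': ['A'] * 50 + ['B'] * 2 + ['C'] * 48,
    'Check': [True] * 20 + [False] * 30 + [True] * 2 + [True] * 1 + [False] * 47,
})

# Same crosstab idea as anomaly_groupby_cat, with explicit column names
table = pd.crosstab(toy['SystemID'], toy['Check'])
table.columns = ['Normal', 'Anomaly']
table['Count'] = table['Normal'] + table['Anomaly']
table['PercentageOfAnomaly'] = table['Anomaly'] * 100 / table['Count']

# Assumed thresholds: at least 30 rows and at least 20% anomalies
min_count, min_percent = 30, 20
suspicious = table[(table['Count'] >= min_count) &
                   (table['PercentageOfAnomaly'] >= min_percent)]
print(suspicious)
```

System B reaches 100% anomalies but has only 2 rows, so it is filtered out; only system A passes both criteria.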
2. Numerical features
# List of numerical features
num_cols = data.columns.difference(useless_cols + cat_cols + label_cols).tolist()
FIRST, we'll explore the relationships between the numeric attributes.
# Compute the ABSOLUTE correlation coefficients
numericCorrel = data[num_cols].corr().abs()
# Get the lower-left half of the square
mask = np.zeros_like(numericCorrel)
mask[np.triu_indices_from(mask)] = True
# Plot the heatmap
plt.figure(figsize=(10, 8))
plt.title("Correlation Heatmap between Predictors")
sns.heatmap(numericCorrel, mask=mask, cmap="Reds", square=True, linewidths=1.)
plt.show()
We can see that:
There are many correlated features in the dataset. This finding of correlated predictors is very useful in case we want to reduce the dimensions later.
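To turn the heatmap into a list of candidate features for dimension reduction, the correlated pairs can be extracted explicitly. The sketch below uses made-up columns (`a`, `a_scaled`, `b`) and an assumed threshold of 0.9; neither comes from the dataset.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): a_scaled is almost a linear copy of a, b is independent
rng = np.random.default_rng(0)
a = rng.normal(size=500)
toy = pd.DataFrame({
    'a': a,
    'a_scaled': 2.0 * a + rng.normal(scale=0.01, size=500),
    'b': rng.normal(size=500),
})

# Absolute correlation matrix, as in the heatmap
corr = toy.corr().abs()

# Keep only the strictly lower triangle so each pair appears once
lower = corr.where(np.tril(np.ones(corr.shape, dtype=bool), k=-1))

# Flatten to (feature, feature) -> correlation and keep the strong pairs
pairs = lower.stack()
high_pairs = pairs[pairs > 0.9]
print(high_pairs)
```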
SECOND, we'll explore the relationship between numerical features and labels.
Since there are many outliers in the datasets, we hide them from the graphs to make the subplots easier to read.
# A boxplot function to plot the numerical features and labels
def boxplot_numerical_features(data, num_features, label, nrow, ncol, figsize):
    """ Plot the relationship between each numerical feature and a label
    Parameters:
        data: Dataframe
        num_features: Names of the numerical features
        label: A label name
        nrow, ncol, figsize: The number of rows, columns and the size of the figure
    """
    # Outline the Plot
    fig, axes = plt.subplots(nrow, ncol, figsize=figsize)
    # Index of the current subplot
    subplot_idx = 0
    # Plotting
    for feature in num_features:
        # Fill the grid row by row, using the requested number of columns
        row, col = divmod(subplot_idx, ncol)
        # showfliers=False hides the outliers
        ax = sns.boxplot(x=label, y=feature, data=data, ax=axes[row, col], showfliers=False)
        plt.xticks([]);
        subplot_idx += 1
    plt.tight_layout()
Numerical features and label Check1
boxplot_numerical_features(datasets_before_prep[0], num_cols, "Check1", 3, 10, (20,9))
COMMENT.
For this label, feature CPU is very useful and could separate the classes completely if there were no outliers in the dataset. The instances with CPU usage over ~85 belong to the anomaly class.
Besides that, there are other informative features such as HighPriorityAlerts, PhysMEM, etc.
Numerical features and label Check2
boxplot_numerical_features(datasets_before_prep[1], num_cols, "Check2", 3, 10, (20,9))
COMMENT.
There are many informative features for classifying label 2, such as BlockingPhaseSec, CPU, ColumnUnloads, DiagnosisFiles, IndexServerAllocationLimit, PhysMEM, TablesAllocation, etc., but they cannot completely separate the classes.
Numerical features and label Check3
boxplot_numerical_features(datasets_before_prep[2], num_cols, "Check3", 3, 10, (20,9))
COMMENT.
Similar to the data for label 2, there are many informative features, but none of them separates the classes completely. They are BlockingPhaseSec, HighPriorityAlerts, ColumnUnloads, LargestTableSize, LargestPartitionSize, etc.
Numerical features and label Check4
boxplot_numerical_features(datasets_before_prep[3], num_cols, "Check4", 3, 10, (20,9))
COMMENT.
For this label, the features which may be useful for classification are BlockingPhaseSec, CPU, HighPriorityAlerts, ColumnUnloads, IndexServerAllocationLimit, InstanceMEM, etc.
Numerical features and label Check5
boxplot_numerical_features(datasets_before_prep[4], num_cols, "Check5", 3, 10, (20,9))
COMMENT.
As can be seen from this graph, TablesAllocation can be a very useful feature for identifying anomalies in label 5. Besides that, InstanceMEM is also a good feature.
Numerical features and label Check6
boxplot_numerical_features(datasets_before_prep[5], num_cols, "Check6", 3, 10, (20,9))
COMMENT.
For label 6, DiagnosisFiles and DiagnosisFilesSize are good features to classify the classes.
Numerical features and label Check7
boxplot_numerical_features(datasets_before_prep[6], num_cols, "Check7", 3, 10, (20,9))
COMMENT.
For this label, LogSegmentChange, MaxDailyNumberOfFailedDataBackups and MaxDailyNumberOfFailedLogBackups are good features to classify the classes.
Numerical features and label Check8
boxplot_numerical_features(datasets_before_prep[7], num_cols, "Check8", 3, 10, (20,9))
COMMENT.
For label 8, NameServerRestarts and XSEngineRestarts may be good features for classification.
Not all algorithms fail when the data contain missing values, outliers, etc., but some algorithms are not robust to unclean data. We therefore create two sets of data for the 8 models: one set is not preprocessed and the other is preprocessed.
According to the data description in the document, we corrected the invalid values of features LogSegmentChange, CPU, PhysMEM, InstanceMEM, TablesAllocation, IndexServerAllocationLimit, and Disk.
data['LogSegmentChange'] = abs(data['LogSegmentChange'])
features_range_0_100 = ['CPU','PhysMEM','InstanceMEM','TablesAllocation','IndexServerAllocationLimit','Disk']
for col in features_range_0_100:
    data.loc[data[col] >100, col] = 100
Since the number of missing values is quite large, we should be careful before imputing them. To find out whether the missing values are related to the labels, we select the rows where a feature is missing and compute their anomaly ratio.
# List of missing features
missing_cols = ['Dumps', 'CompositeOOMDums', 'CPU', 'PhysMEM', 'InstanceMEM', \
                'TablesAllocation', 'IndexServerAllocationLimit', 'DeltaSize', 'MergeErrors', 'BlockingPhaseSec', \
                'Disk', 'LargestTableSize', 'LargestPartitionSize', 'DiagnosisFiles', 'DiagnosisFilesSize', 'LogSegmentChange']
# List of ratio dataframes. ratio_all is the anomaly ratio over the whole data
ratio_labels_dfs = [ratio_all]
# Compute anomaly ratio of missing values of each feature, then add to the list
for col in missing_cols:
    ratio_df = ratio_labels(data[data[col].isna()==True], label_cols)
    ratio_labels_dfs.append(ratio_df)
# Concatenate multiple dataframes into one
ratio_labels_merged = pd.concat(ratio_labels_dfs, axis=1)
ratio_labels_merged.columns = ['AnomalyFullDataset'] + ['AnomalyMissing' + col for col in missing_cols]
ratio_labels_merged
We compare the percent of anomaly of missing features to the percent of whole data. As can be seen from the table, there are several remarkable points:
We are not sure about the reasons for the missing values; therefore, we will not impute them. We prefer to train models on incomplete data rather than on incorrect data. Therefore, in the preprocessed data, we drop rows with missing values instead of filling them.
# Get dataset for each label
datasets_after_prep = []
for i in range(1, 9):
    df = get_data_by_label(data, 'Check%d'%i)
    datasets_after_prep.append(df)
    
# Removing missing rows for each dataset
for df in datasets_after_prep:
    df.dropna(inplace=True)
Standardization/Normalization
According to the data exploration, there are very informative features for each model. For interpretability, we intend to use tree-based and rule-based algorithms to learn the models. These algorithms rely on rules and do not require normalization/standardization: they are not affected by monotonic transformations of the variables.
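The invariance to monotonic transformations can be checked on a small example. The sketch below uses made-up data and `np.log1p` as one arbitrary strictly increasing transformation; it compares the predictions on the training rows only, since split thresholds are midpoints and may fall differently for unseen values.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy data (hypothetical): the label mostly depends on the first feature
rng = np.random.default_rng(0)
X = rng.integers(0, 100, size=(300, 3)).astype(float)
flip = rng.random(300) < 0.05                  # 5% label noise
y = ((X[:, 0] > 85) ^ flip).astype(int)

# A strictly increasing transformation keeps the ordering of every feature
X_log = np.log1p(X)

# Same tree settings, fitted on raw and on transformed features
tree_raw = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
tree_log = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_log, y)

# The two trees partition the training rows identically
same = np.array_equal(tree_raw.predict(X), tree_log.predict(X_log))
print(same)
```

The thresholds stored in the two trees differ, but they cut the sorted training values at the same positions, which is why the predictions on these rows agree.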
Outliers
We keep the outliers while preprocessing data because of two reasons:
Here we split the data into a train set and a test set with a 7:3 ratio.
test_size = 0.3
random_state = 42
# Split train/test data which are not preprocessed
traintest_before_prep = []
for df in datasets_before_prep:
    X = df.iloc[:,:-1]
    y = df.iloc[:,-1]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=random_state)
    traintest_before_prep.append((X_train, X_test, y_train, y_test))
# Split train/test data which are preprocessed
traintest_after_prep = []
for df in datasets_after_prep:
    X = df.iloc[:,:-1]
    y = df.iloc[:,-1]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=random_state)
    traintest_after_prep.append((X_train, X_test, y_train, y_test))
After preprocessing the data, we can measure the Mutual Information between the features and the labels. To judge how large these values are, we compare them to the entropy of the target.
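As a worked example of the entropy baseline, the sketch below computes the entropy of a target from its class counts, using natural logarithms and normalized counts in the same way `scipy.stats.entropy` does by default. The counts 990/10 and the helper name `entropy_from_counts` are assumptions for illustration only.

```python
import math

def entropy_from_counts(counts):
    """Entropy (in nats) of a discrete distribution given as class counts."""
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]
    return -sum(p * math.log(p) for p in probs)

# Hypothetical imbalanced target: 990 normal rows, 10 anomalies
target_entropy = entropy_from_counts([990, 10])

# A feature "reduces uncertainty by more than 10%" if its MI exceeds this value
threshold = 0.1 * target_entropy

# For comparison, a perfectly balanced target has entropy ln(2)
balanced_entropy = entropy_from_counts([500, 500])

print('Imbalanced: %.4f, threshold: %.4f, balanced: %.4f'
      % (target_entropy, threshold, balanced_entropy))
```

Because an imbalanced target has a very small entropy, even a small Mutual Information value can represent a large relative reduction of uncertainty.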
# A function to compute mutual information and percent of reduce of uncertainty
def features_reduce_uncertainty(features, target, percent_reducing):
    """ Compute mutual information and percent of reduce of uncertainty
    Parameters:
        features (DataFrame): A dataframe which contains only categorical features
        target (Series/Numpy array): The target column
        percent_reducing (float): The percent of reduce of uncertainty
    
    Return:
        A dataframe with top mutual information according to the given percent of reduce of uncertainty    
    """
    # Compute mutual information between categorical features and labels
    mutual_info = mutual_info_classif(features, target)
    mutual_info_df = pd.DataFrame(data=mutual_info, index=features.columns, columns=['MI'])
    # Sort the dataframe by the mutual information
    mutual_info_df.sort_values(by='MI', inplace=True, ascending=False)
    
    # Compute entropy of the target
    target_entropy = entropy(target.value_counts().tolist())
    
    # Compute number of features which can reduce the uncertainty of the target
    ncols_reducing = len(mutual_info_df[mutual_info_df['MI'] > percent_reducing*target_entropy])
    
    print('Entropy of the target: %.2f' % (target_entropy))
    print("# of features that reduces uncertainty BY-MORE-THAN-{}%: {}.".format(percent_reducing*100, ncols_reducing))
    return mutual_info_df[:ncols_reducing].T
# Top Mutual Information for model 1
topMI_check1 = features_reduce_uncertainty(traintest_after_prep[0][0], traintest_after_prep[0][2], 0.1)
display(topMI_check1)
# Top Mutual Information for model 2
topMI_check2 = features_reduce_uncertainty(traintest_after_prep[1][0], traintest_after_prep[1][2], 0.1)
display(topMI_check2)
# Top Mutual Information for model 3
topMI_check3 = features_reduce_uncertainty(traintest_after_prep[2][0], traintest_after_prep[2][2], 0.1)
display(topMI_check3)
# Top Mutual Information for model 4
topMI_check4 = features_reduce_uncertainty(traintest_after_prep[3][0], traintest_after_prep[3][2], 0.1)
display(topMI_check4)
# Top Mutual Information for model 5
topMI_check5 = features_reduce_uncertainty(traintest_after_prep[4][0], traintest_after_prep[4][2], 0.1)
display(topMI_check5)
# Top Mutual Information for model 6
topMI_check6 = features_reduce_uncertainty(traintest_after_prep[5][0], traintest_after_prep[5][2], 0.1)
display(topMI_check6)
# Top Mutual Information for model 7
topMI_check7 = features_reduce_uncertainty(traintest_after_prep[6][0], traintest_after_prep[6][2], 0.1)
display(topMI_check7)
# Top Mutual Information for model 8
topMI_check8 = features_reduce_uncertainty(traintest_after_prep[7][0], traintest_after_prep[7][2], 0.1)
display(topMI_check8)
COMMENT.
In addition to the boxplots used to examine the relationship between features and labels, we compute the Mutual Information between them and select the features that best reduce the uncertainty in the labels. The selected features will be used in the Model Selection (Section 3.).
# A function to train, predict and measure running time
def evaluate(X_train, X_test, y_train, y_test, model):
    """ A function to train, predict and measure running time
        Parameters:
            X_train (DataFrame): Features for training
            X_test (DataFrame): Features for predicting
            y_train (Array): Target for training
            y_test: True target for testing
            model: Model for learning
        Returns:
            y_pred (Array): The prediction
            f1 (float): F1 macro score
            fitting_time (float): Duration for training
    """
    # Fit the training data
    start = time.time()
    model.fit(X_train, y_train)
    fitting_time = time.time()-start
    # Predict the testing data
    y_pred = model.predict(X_test)
    f1 = f1_score(y_test, y_pred, average='macro')
    
    # Print f1 score and the training time
    print('\033[1m' + 'F1 \'macro\': %.4f \t Fitting time: %.4f' % (f1, fitting_time) + '\033[0m')
    
    return y_pred, f1, fitting_time
# A cross-validate function 
def cross_validate_KFold(X, y, nfolds, random_state , model):
    # Re-index copies so that positional indexing is safe and the inputs stay untouched
    X = X.reset_index(drop=True)
    y = y.reset_index(drop=True)
    traintest = []
    kf = KFold(n_splits=nfolds, shuffle=True, random_state=random_state)
    for train_idx, test_idx in kf.split(X):
        # .ix was removed from pandas; .iloc selects rows by position
        X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
        y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]
        traintest.append((X_train, X_test, y_train, y_test))
    
    scores = []
    for tupl in traintest:
        # .as_matrix() was removed from pandas; .to_numpy() returns the same array
        _, f1, fitting_time = evaluate(tupl[0], tupl[1], tupl[2].to_numpy(), tupl[3].to_numpy(), model)
        scores.append(np.array([f1, fitting_time]))
    result = np.round(np.array(scores).mean(axis=0), decimals=3)
    print('\033[1m' + '\nAvg F1 macro and Avg fitting time: %s' %  result + '\033[0m' )
    
    return result
SUBSET DATA
Although the 8 models target 8 different anomalies, their features are similar. Therefore, we perform model selection for only one label and then apply the result to the others. To save time, we train the models on a subset of the data instead of the whole dataset. The full data will be used in the evaluation section (Section 5).
Here we have 4 datasets for label Check1:
# Full features which are not preprocessed
X1_train = traintest_before_prep[0][0]
# Selected features which are not preprocessed
X1_train_FS = traintest_before_prep[0][0][topMI_check1.columns]
# Target corresponding to features not preprocessed
y1_train = traintest_before_prep[0][2]
# Full features which are preprocessed
X2_train = traintest_after_prep[0][0]
# Selected features which are preprocessed
X2_train_FS = traintest_after_prep[0][0][topMI_check1.columns]
# Target corresponding to preprocessed features
y2_train = traintest_after_prep[0][2]
ALGORITHMS
Here we will train model 1 (label Check1) using a rule-learning algorithm (Bayesian Rule Set) and a tree-based algorithm (Decision Tree).
1. Complexity
2. Interpretation
3. Handling imbalanced datasets
Imbalanced data is a very frequent problem in classification, not just for Bayesian Rule Set and Decision Tree but for virtually all classification algorithms: the prediction tends to be biased towards the more frequent class.
4. Handling both numerical and categorical
Bayesian Rule Set and Decision Tree can handle both numerical and categorical variables.
5. Handling missing values
Here we train the subset data using Bayesian Rule Set.
The Bayesian Rule Set consists of a set of rules, where each rule is a conjunction of conditions. Rule set models predict that an observation is in the positive class when at least one of the rules is satisfied. Otherwise, the observation is classified to be in the negative class.
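The prediction logic described above (a disjunction of conjunctions) can be sketched in a few lines. This is not the implementation of the ruleset package; the function `rule_set_predict`, the two rules and their thresholds are all hypothetical and only illustrate the idea.

```python
def rule_set_predict(row, rules):
    """Return True (anomaly) if at least one rule has all its conditions satisfied."""
    return any(all(condition(row) for condition in rule) for rule in rules)

# Hypothetical rule set: each rule is a list of conditions joined by AND
rules = [
    [lambda r: r['CPU'] >= 90],                                # rule 1
    [lambda r: r['PhysMEM'] >= 80, lambda r: r['Dumps'] > 0],  # rule 2
]

# Made-up observations
rows = [
    {'CPU': 95, 'PhysMEM': 40, 'Dumps': 0},   # satisfies rule 1
    {'CPU': 30, 'PhysMEM': 85, 'Dumps': 2},   # satisfies rule 2
    {'CPU': 30, 'PhysMEM': 85, 'Dumps': 0},   # only half of rule 2
]

predictions = [rule_set_predict(row, rules) for row in rows]
print(predictions)
```

The third row shows why each rule is a conjunction: meeting one condition of a rule is not enough to be classified as positive.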
brs = BayesianRuleSet(method='forest')
_, X_subset, _, y_subset = train_test_split(X1_train, y1_train, test_size=0.1, random_state=2019)
_ = cross_validate_KFold(X_subset, y_subset, 3, 42, brs)
_, X_subset, _, y_subset = train_test_split(X1_train_FS, y1_train, test_size=0.1, random_state=2019)
_ = cross_validate_KFold(X_subset, y_subset, 3, 42, brs)
_, X_subset, _, y_subset = train_test_split(X2_train, y2_train, test_size=0.1, random_state=2019)
_ = cross_validate_KFold(X_subset, y_subset, 3, 42, brs)
_, X_subset, _, y_subset = train_test_split(X2_train_FS, y2_train, test_size=0.1, random_state=2019)
_ = cross_validate_KFold(X_subset, y_subset, 3, 42, brs)
COMMENT.
A. Cross-validation results
| Subset data | Fold (CV=3) | F1 'macro' | Training time (s) | 
|---|---|---|---|
| Full features, not preprocessed data | #1 | 0.4980 | 174.6350 | 
| Full features, not preprocessed data | #2 | 0.9678 | 178.1014 | 
| Full features, not preprocessed data | #3 | 0.9736 | 187.4181 | 
| Full features, not preprocessed data | Avg | 0.813 | 180.052 | 
| ----------------------------------------- | ------ | ------------- | --------------- | 
| Selected features, not preprocessed data | #1 | 0.4980 | 103.3495 | 
| Selected features, not preprocessed data | #2 | 0.9678 | 110.5208 | 
| Selected features, not preprocessed data | #3 | 0.9915 | 116.4722 | 
| Selected features, not preprocessed data | Avg | 0.819 | 110.114 | 
| ----------------------------------------- | ------ | ------------- | --------------- | 
| Full features, preprocessed data | #1 | 0.4986 | 131.3369 | 
| Full features, preprocessed data | #2 | 0.4981 | 134.2788 | 
| Full features, preprocessed data | #3 | 0.4981 | 133.1967 | 
| Full features, preprocessed data | Avg | 0.498 | 132.938 | 
| ----------------------------------------- | ------ | ------------- | --------------- | 
| Selected features, preprocessed data | #1 | 0.4985 | 76.5023 | 
| Selected features, preprocessed data | #2 | 0.4981 | 79.0241 | 
| Selected features, preprocessed data | #3 | 0.4981 | 67.2948 | 
| Selected features, preprocessed data | Avg | 0.498 | 74.274 | 
| ----------------------------------------- | ------ | ------------- | --------------- | 
B. REMARK
The cross-validation outputs for model of Label Check1 shows that:
Bayesian Rule Set model works better on the raw data (f1 macro ~0.81) than on the preprocessed data (f1 macro ~0.50). This result is consistent with the description of the algorithm mentioned before ("robust to outliers and naturally handle missing data, with no imputation needed for missing attribute values").
The features selected using Mutual Information are really useful for the Bayesian Rule Set model. They reduce the training time considerably while preserving the f1 score. Training on the full features (31 features) takes ~180 seconds, while training on the selected features (11 features) takes ~110 seconds, a reduction of ~40%.
The predictive performance of Bayesian Rule Set is not stable across folds. For example, on the not-preprocessed data, f1 'macro' on CV-#1 is ~0.50, while on CV-#2 and CV-#3 it is ~0.97 in both cases.
Compared to the Decision Tree model in Section 3.2., the Bayesian Rule Set model runs slower and performs worse.
About the rules of the model:
We run 3-fold cross-validation on two datasets with two sets of features. Let's take a look at one of these results. For CV-#2, the output rules are ['90.2<=CPU', 'CPU<4602.49', 'SystemID_238_neg']. That means each instance which satisfies at least one of these three conditions {CPU usage greater than or equal to 90.2; CPU usage less than 4602.49; SystemID is '238'} will be classified as Anomaly. Otherwise, it will be classified as non-Anomaly.
In general, when training the model on the cross-validation folds of the subsets, the number of rules ranges from 3 to 5, with about 3 rules on average.
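To help read F1 'macro' values close to 0.5 on imbalanced data, the sketch below computes the score of a constant predictor that always answers "non-anomaly". The 0.8% anomaly rate is an assumption for illustration, and the example does not claim that this is what the Bayesian Rule Set model did on those folds.

```python
def f1_for_class(y_true, y_pred, positive):
    """F1 score when `positive` is treated as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def f1_macro(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)

# Hypothetical fold: 992 normal rows and 8 anomalies (0.8% anomalies)
y_true = [False] * 992 + [True] * 8
# Constant predictor: everything is classified as non-anomaly
y_pred = [False] * 1000

score = f1_macro(y_true, y_pred)
print('F1 macro of the constant predictor: %.4f' % score)
```

The majority class gets an F1 close to 1 and the anomaly class gets 0, so the macro average lands just below 0.5.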
Decision Trees Classifier is a non-parametric supervised learning method. The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features.
dtc = DecisionTreeClassifier(random_state=0)
_, X_subset, _, y_subset = train_test_split(X2_train, y2_train, test_size=0.1, random_state=2019)
_ = cross_validate_KFold(X_subset, y_subset, 3, 42, dtc)
_, X_subset, _, y_subset = train_test_split(X2_train_FS, y2_train, test_size=0.1, random_state=2019)
_ = cross_validate_KFold(X_subset, y_subset, 3, 42, dtc)
A. Cross-validation results
| Subset data | Fold (CV=3) | F1 'macro' | Training time (s) | 
|---|---|---|---|
| Full features, preprocessed data | #1 | 0.9901 | 0.0394 | 
| Full features, preprocessed data | #2 | 1.0000 | 0.0406 | 
| Full features, preprocessed data | #3 | 0.9856 | 0.0402 | 
| Full features, preprocessed data | Avg | 0.992 | 0.04 | 
| ----------------------------------------- | ------ | ------------- | --------------- | 
| Selected features, preprocessed data | #1 | 0.9692 | 0.0152 | 
| Selected features, preprocessed data | #2 | 1.0000 | 0.0197 | 
| Selected features, preprocessed data | #3 | 0.9856 | 0.0202 | 
| Selected features, preprocessed data | Avg | 0.985 | 0.018 | 
| ----------------------------------------- | ------ | ------------- | --------------- | 
B. REMARK
The cross-validation outputs for model of Label Check1 shows that:
Although the datasets are extremely imbalanced, the predictive performance of the Decision Tree model is very high (f1 macro of ~0.97 to 1.0). That is because of the features in this data: some of them are highly informative for detecting anomalies, and the minority class is concentrated in one area of the feature space.
The features selected using Mutual Information help the model only slightly (they reduce the training time from 0.04 seconds to 0.018 seconds). The training time of the Decision Tree model is already so small that the improvement is hard to see.
Compared to the Bayesian Rule Set model in Section 3.1., the Decision Tree model is much better in terms of predictive performance and especially in terms of training time. The Decision Tree model loops through the features and tries to find locally optimal splits; therefore, it is faster than searching for a globally optimal solution over the full space of rules.
We will use the Decision Tree Classifier model and full features for detecting anomalies.
Here we try to find the best hyperparameters for each model (models 1 to 8) using GridSearchCV. The parameters to optimize are max_depth, max_features, min_samples_leaf and min_samples_split.
parameters = {'max_depth': [3, 4, 5, 6, 7],
              'min_samples_split': [2, 3, 4, 5],
              'min_samples_leaf': [1, 2, 3, 4, 5, 6, 7],
              'max_features': [10, 15, 20, 25, 30]}
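Before running the search, it helps to know how many fits this grid implies. The sketch below counts the candidates for the grid above with 3 folds; the variable names are ours, and the extra fit on the whole training set assumes the default `refit=True` of GridSearchCV.

```python
from math import prod

# Same grid as above
parameters = {'max_depth': [3, 4, 5, 6, 7],
              'min_samples_split': [2, 3, 4, 5],
              'min_samples_leaf': [1, 2, 3, 4, 5, 6, 7],
              'max_features': [10, 15, 20, 25, 30]}
n_folds = 3

# Number of hyperparameter combinations in the grid
n_candidates = prod(len(values) for values in parameters.values())

# Each candidate is fitted once per fold
n_cv_fits = n_candidates * n_folds

print('Candidates: %d, cross-validation fits: %d' % (n_candidates, n_cv_fits))
```

With the default `refit=True`, one more fit on the whole training set follows once the best combination is known.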
OPTIMIZING FOR MODEL 1 (Check1)
dtc = DecisionTreeClassifier(random_state=0)
grid = GridSearchCV(dtc, parameters, cv=3, scoring='f1_macro')
grid.fit(traintest_after_prep[0][0], traintest_after_prep[0][2])
print('Best score: ', grid.best_score_)
print('Best params: ', grid.best_params_)
OPTIMIZING FOR MODEL 2 (Check2)
dtc = DecisionTreeClassifier(random_state=0)
grid = GridSearchCV(dtc, parameters, cv=3, scoring='f1_macro')
grid.fit(traintest_after_prep[1][0], traintest_after_prep[1][2])
print('Best score: ', grid.best_score_)
print('Best params: ', grid.best_params_)
OPTIMIZING FOR MODEL 3 (Check3)
dtc = DecisionTreeClassifier(random_state=0)
grid = GridSearchCV(dtc, parameters, cv=3, scoring='f1_macro')
grid.fit(traintest_after_prep[2][0], traintest_after_prep[2][2])
print('Best score: ', grid.best_score_)
print('Best params: ', grid.best_params_)
OPTIMIZING FOR MODEL 4 (Check4)
dtc = DecisionTreeClassifier(random_state=0)
grid = GridSearchCV(dtc, parameters, cv=3, scoring='f1_macro')
grid.fit(traintest_after_prep[3][0], traintest_after_prep[3][2])
print('Best score: ', grid.best_score_)
print('Best params: ', grid.best_params_)
OPTIMIZING FOR MODEL 5 (Check5)
dtc = DecisionTreeClassifier(random_state=0)
grid = GridSearchCV(dtc, parameters, cv=3, scoring='f1_macro')
grid.fit(traintest_after_prep[4][0], traintest_after_prep[4][2])
print('Best score: ', grid.best_score_)
print('Best params: ', grid.best_params_)
OPTIMIZING FOR MODEL 6 (Check6)
dtc = DecisionTreeClassifier(random_state=0)
grid = GridSearchCV(dtc, parameters, cv=3, scoring='f1_macro')
grid.fit(traintest_after_prep[5][0], traintest_after_prep[5][2])
print('Best score: ', grid.best_score_)
print('Best params: ', grid.best_params_)
OPTIMIZING FOR MODEL 7 (Check7)
parameters = {'max_depth': [8, 9, 10, 12, 13, 14],
              'min_samples_leaf': [1, 2, 3, 4, 5, 6, 7],
              'max_features': [22, 23, 24, 25, 26, 27]}
dtc = DecisionTreeClassifier(random_state=0)
grid = GridSearchCV(dtc, parameters, cv=3, scoring='f1_macro')
grid.fit(traintest_after_prep[6][0], traintest_after_prep[6][2])
print('Best score: ', grid.best_score_)
print('Best params: ', grid.best_params_)
OPTIMIZING FOR MODEL 8 (Check8)
dtc = DecisionTreeClassifier(random_state=0)
grid = GridSearchCV(dtc, parameters, cv=3, scoring='f1_macro')
grid.fit(traintest_after_prep[7][0], traintest_after_prep[7][2])
print('Best score: ', grid.best_score_)
print('Best params: ', grid.best_params_)
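The eight searches above repeat the same steps, so they could also be written as a loop. The following is only a minimal sketch under stated assumptions: `tune_tree`, `param_grid`, and `datasets` are hypothetical names, the grid is deliberately tiny, and the data is synthetic (from `make_classification`) rather than the notebook's `traintest_after_prep`.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Small illustrative grid (hypothetical values, far smaller than the grids above)
param_grid = {'max_depth': [3, 5], 'min_samples_leaf': [1, 4]}


def tune_tree(X, y, grid, cv=3):
    """Run a macro-F1 grid search for one target and return the fitted search."""
    search = GridSearchCV(DecisionTreeClassifier(random_state=0), grid,
                          cv=cv, scoring='f1_macro')
    search.fit(X, y)
    return search


# Synthetic stand-ins for the per-target training sets (assumption: not the real data)
datasets = [make_classification(n_samples=200, n_features=10, random_state=seed)
            for seed in range(3)]

results = {}
for i, (X, y) in enumerate(datasets, start=1):
    search = tune_tree(X, y, param_grid)
    results[f'Check{i}'] = (search.best_score_, search.best_params_)
    print(f'Check{i}: best score = {search.best_score_:.4f}, '
          f'best params = {search.best_params_}')
```

With the real data, the loop would presumably iterate over the entries of `traintest_after_prep` and pass a per-target grid where one is needed, as for Model 7.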
MODEL 1
X_train, X_test, y_train, y_test = traintest_after_prep[0][0], traintest_after_prep[0][1], \
                                    traintest_after_prep[0][2], traintest_after_prep[0][3]
model1 = DecisionTreeClassifier(random_state=0, max_features=25, max_depth=5, min_samples_leaf=4, min_samples_split=2)
_, _, _ = evaluate(X_train, X_test, y_train, y_test, model1)
tree.export_graphviz(model1, out_file="graph_model1.dot", feature_names=X_train.columns)
MODEL 2
X_train, X_test, y_train, y_test = traintest_after_prep[1][0], traintest_after_prep[1][1], \
                                    traintest_after_prep[1][2], traintest_after_prep[1][3]
model2 = DecisionTreeClassifier(random_state=0, max_features=30, max_depth=6, min_samples_leaf=1, min_samples_split=5)
_, _, _ = evaluate(X_train, X_test, y_train, y_test, model2)
tree.export_graphviz(model2, out_file="graph_model2.dot", feature_names=X_train.columns)
MODEL 3
X_train, X_test, y_train, y_test = traintest_after_prep[2][0], traintest_after_prep[2][1], \
                                    traintest_after_prep[2][2], traintest_after_prep[2][3]
model3 = DecisionTreeClassifier(random_state=0, max_features=30, max_depth=6, min_samples_leaf=5, min_samples_split=2)
_, _, _ = evaluate(X_train, X_test, y_train, y_test, model3)
tree.export_graphviz(model3, out_file="graph_model3.dot", feature_names=X_train.columns)
MODEL 4
X_train, X_test, y_train, y_test = traintest_after_prep[3][0], traintest_after_prep[3][1], \
                                    traintest_after_prep[3][2], traintest_after_prep[3][3]
model4 = DecisionTreeClassifier(random_state=0, max_features=10, max_depth=3, min_samples_leaf=1, min_samples_split=2)
_, _, _ = evaluate(X_train, X_test, y_train, y_test, model4)
tree.export_graphviz(model4, out_file="graph_model4.dot", feature_names=X_train.columns)
MODEL 5
X_train, X_test, y_train, y_test = traintest_after_prep[4][0], traintest_after_prep[4][1], \
                                    traintest_after_prep[4][2], traintest_after_prep[4][3]
model5 = DecisionTreeClassifier(random_state=0, max_features=30, max_depth=7, min_samples_leaf=1, min_samples_split=2)
_, _, _ = evaluate(X_train, X_test, y_train, y_test, model5)
tree.export_graphviz(model5, out_file="graph_model5.dot", feature_names=X_train.columns)
MODEL 6
X_train, X_test, y_train, y_test = traintest_after_prep[5][0], traintest_after_prep[5][1], \
                                    traintest_after_prep[5][2], traintest_after_prep[5][3]
model6 = DecisionTreeClassifier(random_state=0, max_features=10, max_depth=3, min_samples_leaf=1, min_samples_split=2)
_, _, _ = evaluate(X_train, X_test, y_train, y_test, model6)
tree.export_graphviz(model6, out_file="graph_model6.dot", feature_names=X_train.columns)
MODEL 7
X_train, X_test, y_train, y_test = traintest_after_prep[6][0], traintest_after_prep[6][1], \
                                    traintest_after_prep[6][2], traintest_after_prep[6][3]
model7 = DecisionTreeClassifier(random_state=0, max_features=24, max_depth=14, min_samples_leaf=1)
_, _, _ = evaluate(X_train, X_test, y_train, y_test, model7)
tree.export_graphviz(model7, out_file="graph_model7.dot", feature_names=X_train.columns)
MODEL 8
X_train, X_test, y_train, y_test = traintest_after_prep[7][0], traintest_after_prep[7][1], \
                                    traintest_after_prep[7][2], traintest_after_prep[7][3]
model8 = DecisionTreeClassifier(random_state=0, max_features=25, max_depth=5, min_samples_leaf=1, min_samples_split=2)
_, _, _ = evaluate(X_train, X_test, y_train, y_test, model8)
tree.export_graphviz(model8, out_file="graph_model8.dot", feature_names=X_train.columns)
RESULTS
| Model | Macro F1 | Training time (s) |
|---|---|---|
| Model 1 | 0.9929 | 0.6590 | 
| Model 2 | 0.9985 | 0.7570 | 
| Model 3 | 0.9971 | 0.7884 | 
| Model 4 | 0.9998 | 0.4666 | 
| Model 5 | 0.9964 | 1.1264 | 
| Model 6 | 1.0000 | 0.4698 | 
| Model 7 | 0.8044 | 2.5990 | 
| Model 8 | 1.0000 | 1.1837 | 
The predictive performance of these models is very high, especially for models 6 and 8 (macro F1 ≈ 1.0000). Model 7 performs worst (macro F1 ≈ 0.8044).
Training times are short, ranging from 0.4666 seconds to 2.5990 seconds.
Training the models on smaller subsets of the data still preserves this performance (see Section 3.2).
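As a rough illustration of how macro F1 and training time can be measured for different training-set sizes, here is a minimal sketch. It does not reproduce the notebook's `evaluate` helper. The data is synthetic, and the fractions, tree depth, and the `scores` dictionary are assumptions chosen for the example.

```python
import time

from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (assumption: stands in for one CheckN target)
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

scores = {}
for fraction in (0.25, 0.5, 1.0):
    # train_test_split already shuffled the rows, so a prefix is a random subset
    n = int(len(X_train) * fraction)
    model = DecisionTreeClassifier(random_state=0, max_depth=5)

    start = time.perf_counter()
    model.fit(X_train[:n], y_train[:n])
    elapsed = time.perf_counter() - start

    macro_f1 = f1_score(y_test, model.predict(X_test), average='macro')
    scores[fraction] = (macro_f1, elapsed)
    print(f'{fraction:.0%} of training data: macro F1 = {macro_f1:.4f}, '
          f'fit time = {elapsed:.4f} s')
```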
We display the trees to interpret the models. Each classification rule is a path from the root of the tree to a leaf, together with the conditions along that path.
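The root-to-leaf rules can also be read as plain text with scikit-learn's `export_text`. The sketch below uses the Iris dataset and a shallow tree purely for illustration; these are assumptions, not the notebook's models or data.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Illustrative data and model (assumption: not one of the eight anomaly models)
iris = load_iris()
clf = DecisionTreeClassifier(random_state=0, max_depth=2)
clf.fit(iris.data, iris.target)

# Each indented branch is one condition; following the indentation from the
# outermost level down to a "class:" line gives one complete rule.
rules = export_text(clf, feature_names=list(iris.feature_names))
print(rules)
```

For one of the fitted models above, the equivalent call would presumably be `export_text(model1, feature_names=list(X_train.columns))`.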
Because the graphviz package could not be installed in this environment, we save the tree structure of each model, plot it on another system, and embed the resulting image in this Jupyter notebook.
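A possible alternative that avoids graphviz altogether is `sklearn.tree.plot_tree`, which draws the tree with matplotlib. The following is only a sketch on the Iris dataset; the output file name `tree_sketch.png` and the figure size are arbitrary assumptions.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend for scripts; drop this line inside a notebook
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, plot_tree

# Illustrative data and model (assumption: not one of the eight anomaly models)
iris = load_iris()
clf = DecisionTreeClassifier(random_state=0, max_depth=3)
clf.fit(iris.data, iris.target)

# Draw the fitted tree onto a matplotlib axis and save it as an image
fig, ax = plt.subplots(figsize=(12, 6))
annotations = plot_tree(clf, feature_names=list(iris.feature_names),
                        filled=True, ax=ax)
fig.savefig('tree_sketch.png', dpi=100)
```

Deep trees, such as one with `max_depth=14`, may be hard to read in a single figure; the `max_depth` argument of `plot_tree` limits how many levels are drawn.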
As can be seen from the trees, each anomaly has different causes in the system, such as CPU usage, memory usage, memory allocation, system backups, and system restarts.
MODEL 1
Image(url= "graph1.PNG", width=1000, height=1000)
MODEL 2
Image(url= "graph2.PNG", width=1000, height=1000)
MODEL 3
Image(url= "graph3.PNG", width=1000, height=1000)
MODEL 4
Image(url= "graph4.PNG", width=600, height=600)
MODEL 5
Image(url= "graph5.PNG", width=1000, height=1000)
MODEL 6
Image(url= "graph6.PNG", width=600, height=600)
MODEL 8
Image(url= "graph8.PNG", width=600, height=600)