
Explainable AI

Deep Learning in Practice - CentraleSupélec

Deep learning is known for being a black box. So powerful, and yet so hard to comprehend.

In the project below we explore techniques used in computer vision to interpret a model's decision-making process. In particular, we cover two methods:

  • Grad-CAM (the stronger of the two)

  • CAM (a simpler precursor)


Grad-CAM

Grad-CAM, short for Gradient-weighted Class Activation Mapping, is an approach in explainable AI used to visualize and understand the decisions made by convolutional neural networks (CNNs) in image classification tasks. It helps identify which parts of an input image are crucial for the CNN's prediction of a particular class. Let's use the image below as an example.


ResNet 34

The images above show the output of a ResNet 34 model, which predicted "Egyptian cat" as the most likely class. Crucially, the heatmap confirms that the prediction is based on the correct part of the image, which makes the model's performance more trustworthy.

Let's now take a deeper dive into the different steps of the decision-making process. Before doing so, here is a quick overview of the model architecture.


This model comprises four stages of residual feature layers, followed by a Global Average Pooling layer, a fully connected layer, and a Softmax activation function for the final predictions.
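As a quick sanity check, we can list the top-level modules of torchvision's ResNet 34 to see these stages (a minimal sketch; note that the model itself outputs raw logits, and the Softmax is applied afterwards):

```python
from torchvision import models

model = models.resnet34()
# Top-level modules: a convolutional stem, four residual stages,
# then Global Average Pooling and the fully connected classifier.
print([name for name, _ in model.named_children()])
# ['conv1', 'bn1', 'relu', 'maxpool', 'layer1', 'layer2', 'layer3', 'layer4', 'avgpool', 'fc']
```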

Grad-CAM works by:

  1. computing the gradient of the predicted class score with respect to the feature maps of a chosen convolutional layer;

  2. global-average-pooling these gradients to obtain an importance weight for each feature map. These weights indicate how much each feature map contributes to the prediction: higher gradients suggest that the corresponding feature maps contribute more significantly;

  3. linearly combining the feature maps using these weights and passing the result through a ReLU, which keeps only the regions with a positive influence on the target class. The resulting weighted feature map highlights the regions of the input image that drive the CNN's decision for the predicted class.

Let's now upsample the weighted feature map to the input resolution and project it onto the input image to generate a heatmap, as in the sketch below. This allows us to see which regions of the input image contributed the most to the CNN's prediction of the target class.
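Here is a minimal PyTorch sketch of these steps, assuming a recent torchvision and hooks on the last convolutional stage (layer4) of a pretrained ResNet 34; the variable names are illustrative, not from the original project code:

```python
import torch
import torch.nn.functional as F
from torchvision import models

model = models.resnet34(weights=models.ResNet34_Weights.IMAGENET1K_V1)
model.eval()

# Hooks capture the feature maps and their gradients at layer4.
activations, gradients = {}, {}
model.layer4.register_forward_hook(
    lambda m, i, o: activations.update(value=o.detach()))
model.layer4.register_full_backward_hook(
    lambda m, gi, go: gradients.update(value=go[0].detach()))

def grad_cam(image):  # image: (1, 3, H, W), already normalized
    logits = model(image)
    class_idx = logits.argmax(dim=1).item()
    model.zero_grad()
    logits[0, class_idx].backward()                # step 1: gradient of the class score
    weights = gradients["value"].mean(dim=(2, 3))  # step 2: global-average-pool the gradients
    cam = (weights[:, :, None, None] * activations["value"]).sum(dim=1)  # step 3: weighted sum
    cam = F.relu(cam)                              # keep only positive influence
    cam = F.interpolate(cam[None], size=image.shape[2:],
                        mode="bilinear", align_corners=False)
    return (cam / cam.max()).squeeze()             # heatmap in [0, 1], ready to overlay
```

The returned map can then be colorized and alpha-blended with the input image to produce the overlays shown on this page.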

[Figure: Grad-CAM heatmaps computed at successive layers of the network, applied to an image of a fox]

Initial Layers:

  • Grad-CAM applied to initial layers of a CNN tends to focus on low-level features such as edges, corners, and textures. These layers capture basic visual patterns in the input image.

  • The resulting heatmaps from these layers might highlight fine-grained details that are essential for localizing specific objects or parts of objects. For example, in the case above the network is classifying a fox, and these initial layers might highlight the outline of the fox's ears, nose, and fur texture.

 

Final Layers:

  • Grad-CAM applied to final layers of a CNN captures high-level semantic information relevant to the target class. These layers integrate information from lower layers to make the final classification decision.

  • The resulting heatmaps from these layers tend to highlight more global and holistic features that are discriminative for the target class. For instance, if the network is classifying a car, the final layers might focus on the overall shape, wheels, and other distinctive features characteristic of cars.

 

In summary, applying Grad-CAM to different layers of a CNN provides insights into how the network processes information at different levels of abstraction, from low-level features to high-level semantics.
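In the hook-based sketch above, this comparison is a one-line change: registering the hooks on model.layer1 (or layer2, layer3) instead of model.layer4 produces the low-level heatmaps, while layer4 yields the most class-discriminative ones.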

CAM

CAM, short for Class Activation Mapping, computes a weighted combination of the feature maps produced by the final convolutional layer, where the weights are derived from the importance of each feature map in predicting a particular class. However, CAM doesn't consider the gradient information flowing into the final convolutional layer, which limits its ability to precisely localize object boundaries and intricate details.
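Because ResNet 34 already ends in GAP followed by a single fully connected layer, CAM can be read directly off the FC weights, with no backward pass. A minimal sketch under the same assumptions as before:

```python
import torch
import torch.nn.functional as F
from torchvision import models

model = models.resnet34(weights=models.ResNet34_Weights.IMAGENET1K_V1)
model.eval()

# Capture the feature maps of the last convolutional stage on the forward pass.
features = {}
model.layer4.register_forward_hook(lambda m, i, o: features.update(value=o.detach()))

def cam(image):  # image: (1, 3, H, W), already normalized
    with torch.no_grad():
        logits = model(image)
    class_idx = logits.argmax(dim=1).item()
    fc_weights = model.fc.weight[class_idx]        # (512,): one weight per feature map
    fmap = features["value"]                       # (1, 512, 7, 7)
    heatmap = (fc_weights[None, :, None, None] * fmap).sum(dim=1)  # weighted combination
    heatmap = F.relu(heatmap)                      # clip negative evidence for visualization
    heatmap = F.interpolate(heatmap[None], size=image.shape[2:],
                            mode="bilinear", align_corners=False)
    return (heatmap / heatmap.max()).squeeze()
```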

[Figure: CAM heatmap from ResNet 34]

VGG 16

The result above is from a ResNet 34 model, but I also applied CAM to a VGG 16 model. Since CAM relies on Global Average Pooling (GAP) in place of fully connected (FC) layers, we start by removing the FC classifier of VGG 16.

As we have modified the architecture of VGG 16, we need to retrain it, but only the new classification layer that follows the GAP, as sketched below. To do so, I used the Hymenoptera dataset, a collection of images commonly used for image classification tasks in machine learning, consisting of two classes: ants and bees. It is often used to train and evaluate algorithms for distinguishing between these insect species based on visual characteristics.
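Here is a hypothetical sketch of that modification, assuming torchvision's VGG 16 and the two Hymenoptera classes from the text above; only the new linear layer is left trainable:

```python
import torch.nn as nn
from torchvision import models

vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
for p in vgg.features.parameters():
    p.requires_grad = False                  # freeze the convolutional backbone

# Replace the FC classifier with GAP + a single linear layer.
model = nn.Sequential(
    vgg.features,                            # -> (512, 7, 7) feature maps
    nn.AdaptiveAvgPool2d(1),                 # Global Average Pooling -> (512, 1, 1)
    nn.Flatten(),                            # -> (512,)
    nn.Linear(512, 2),                       # ants vs. bees; the only trainable layer
)
```

After fine-tuning, the CAM for a class is again the convolutional feature maps weighted by that class's row of the new linear layer's weight matrix.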

[Figures: modified VGG 16 architecture and CAM heatmaps on images of ants and bees]

The results above are decent: the model focuses on the correct area of the image to make its decision. However, the output is not as sharp as Grad-CAM's.

Skills acquired/improved and tools used in the project

Throughout this project I focused on backpropagation and gradient computation in different CNN models. The goal was to better understand how deep learning models work and to make the famous "black box" slightly less black. I used PyTorch to carry out the analysis and became more confident with this machine learning library.

  • Python

  • PyTorch

  • NumPy
