Abstract [eng]
Integrating computer vision systems into modern manufacturing for quality control offers substantial benefits. These systems, particularly those powered by deep learning, can surpass human inspectors in both the speed and the accuracy of visual inspection. Nevertheless, significant challenges remain, notably the need for task-specific data and the labor-intensive process of precisely annotating datasets for methods such as segmentation or object detection. Explainability maps of neural networks can help address this issue by visually highlighting the regions that the network considers essential for its predictions. This allows a classifier network to output both a class label and an associated explainability map, functioning as a preliminary form of object detection that requires only image-level class labels, which are far easier to obtain. This study conducts a comparative analysis of explainability methods across neural network architectures, focusing on Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs). For CNNs, a Class Activation Mapping (CAM)-based technique generates an attention map for defect localization, while ViTs employ an Attention Rollout computation for the same purpose. Both architectures are adapted to identify defects in a single forward pass, eliminating the need for detailed pixel-wise annotations without substantially increasing model complexity. The methods are evaluated and compared by applying segmentation metrics to the generated explainability maps on two datasets: Printed Circuit Boards (PCB) and Gear Defect Inspection (GID). Across the two architectures, the CNNs' explainability output is more precise, improving the F1 score by up to 19%, the Jaccard index by up to 22%, recall by up to 48%, and the pointing game metric by up to 13%.
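To make the CAM-based localization concrete, the following is a minimal sketch (not the study's exact implementation) of how a class activation map can be computed for a CNN classifier whose final convolutional features are followed by global average pooling and a linear layer; the function name, tensor shapes, and threshold are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def class_activation_map(features: torch.Tensor,
                         fc_weights: torch.Tensor,
                         class_idx: int) -> torch.Tensor:
    """Classic CAM: weight the final conv feature maps by the classifier
    weights of the target class and sum over channels.

    features   : (C, H, W) activations from the last convolutional layer
    fc_weights : (num_classes, C) weights of the linear layer after GAP
    class_idx  : index of the predicted (or chosen) class
    """
    weights = fc_weights[class_idx]                      # (C,)
    cam = torch.einsum("c,chw->hw", weights, features)   # (H, W) class evidence map
    cam = F.relu(cam)                                     # keep positive evidence only
    cam = cam - cam.min()
    cam = cam / cam.max().clamp(min=1e-8)                # normalise to [0, 1]
    return cam
```

In this sketch, the low-resolution map would then be upsampled (e.g., bilinearly) to the input image size and thresholded to obtain a pseudo-segmentation of the defect, which is what the segmentation metrics are computed against.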
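Similarly, a hedged sketch of Attention Rollout for the ViT branch, following the general recipe of multiplying head-averaged attention matrices through the layers with the identity added for residual connections; the list format of the inputs and the assumption that the CLS token sits at index 0 are illustrative, not necessarily the exact setup used here.

```python
import torch

def attention_rollout(attentions: list[torch.Tensor]) -> torch.Tensor:
    """Attention Rollout: propagate attention through all transformer layers
    by chaining per-layer attention matrices, adding the identity to account
    for residual connections.

    attentions : list of (num_heads, tokens, tokens) matrices, one per layer
    returns    : (tokens - 1,) relevance of each patch token w.r.t. the CLS token
    """
    rollout = None
    for attn in attentions:
        attn = attn.mean(dim=0)                          # fuse heads by averaging
        attn = attn + torch.eye(attn.size(-1))           # residual connection
        attn = attn / attn.sum(dim=-1, keepdim=True)     # re-normalise rows
        rollout = attn if rollout is None else attn @ rollout
    return rollout[0, 1:]                                # CLS row, drop the CLS column
```

The resulting vector would be reshaped into the patch grid (for example, 14x14 for a ViT with 16x16 patches on 224x224 inputs) and upsampled to image resolution, yielding an explainability map directly comparable to the CAM output above.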