In this thesis, we propose two models for weakly supervised object localization (WSOL). Many existing WSOL models impose various training burdens, e.g., the non-negligible cost of hyperparameter search for the loss function. We therefore first propose a model called SFPN that reduces this cost. SFPN enriches the information in the feature maps by exploiting the structure of a feature pyramid network, and the enriched feature maps are then used to predict the bounding box.
This process allows us to train with only the cross-entropy loss while also improving performance. Furthermore, we propose a second model, named A2E Net, that uses fewer parameters. A2E Net consists of a ‘spatial attention branch’ and a ‘refinement branch’. The spatial attention branch enhances spatial information using few parameters. The refinement branch is composed of an ‘attention module’ and an ‘erasing module’, neither of which has trainable parameters.
Given the output feature map of the spatial attention branch, the attention module produces a feature map with more accurate information by exploiting the relationships between pixels. The erasing module erases the most discriminative region so that the network also takes the less discriminative regions into account. Moreover, we boost performance by erasing at multiple sizes. Finally, we sum the two output feature maps from the attention module and the erasing module to combine the information from both. Extensive experiments on CUB-200-2011 and ILSVRC show that SFPN and A2E Net achieve strong performance compared to existing WSOL models.
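
As a rough illustration of how a parameter-free refinement branch of this kind might operate, the following NumPy sketch combines a pixel-affinity attention step with multi-size erasing and sums the two outputs. It is not the implementation used in this thesis: the function names, the softmax affinity, the channel-mean attention map, the threshold values, and the reading of "multiple sizes of erasing" as several erasing thresholds are all assumptions made for illustration.

```python
import numpy as np


def pixel_affinity_attention(features):
    """Parameter-free attention over pixels (assumed form, not the thesis's).

    features: array of shape (C, H, W).
    Each output pixel is a softmax-weighted sum of all input pixels,
    weighted by the similarity of their channel vectors.
    """
    c, h, w = features.shape
    flat = features.reshape(c, h * w)              # (C, HW)
    scores = flat.T @ flat / np.sqrt(c)            # (HW, HW) pixel similarities
    scores -= scores.max(axis=1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)  # each row sums to 1
    out = flat @ weights.T                         # (C, HW)
    return out.reshape(c, h, w)


def erase_top_region(features, threshold):
    """Zero out pixels whose normalized attention reaches the threshold."""
    attention = features.mean(axis=0)              # (H, W) channel-mean map
    lo, hi = attention.min(), attention.max()
    normalized = (attention - lo) / (hi - lo + 1e-8)
    keep_mask = (normalized < threshold).astype(features.dtype)
    return features * keep_mask[None, :, :]


def multi_size_erase(features, thresholds=(0.6, 0.7, 0.8)):
    """Average several erased maps; a lower threshold erases a larger region.

    The threshold values are hypothetical, chosen only for this sketch.
    """
    erased = [erase_top_region(features, t) for t in thresholds]
    return np.mean(erased, axis=0)


def refinement_branch(features):
    """Sum the attention output and the erasing output."""
    return pixel_affinity_attention(features) + multi_size_erase(features)


if __name__ == "__main__":
    x = np.ones((2, 4, 4))
    x[:, 1, 1] = 5.0                               # one highly discriminative pixel
    print(multi_size_erase(x)[:, 1, 1])            # the peak is erased
    print(refinement_branch(x).shape)
```

None of the functions above holds trainable weights, which matches the description of the attention and erasing modules; how the thesis actually normalizes the attention map or selects erasing sizes may differ.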