Abstract: In recent years, changing climatic conditions and the introduction of cultivation techniques have gradually expanded the planting area of Lycium, which has become one of the important economic crops in Ningxia and across the wider northwestern region. Lycium hosts many insect species and has poor resistance to them, so it is highly susceptible to infestation, which severely reduces yield and quality and causes serious economic losses. For the development of the industry, it is therefore important to retrieve information about Lycium pests quickly and accurately so that control can be timely and precise. Existing retrieval systems for crop pests support only a single modality. To address this problem, cross-modal retrieval between images and texts was introduced on a self-built Lycium pest dataset covering 17 kinds of common pests, and a cross-modal image-text retrieval method with an attention mechanism was proposed. First, a transformer and an LSTM were used to obtain fine-grained text and image feature sequences with context information, respectively. Then, the attention mechanism was used to aggregate the feature sequences and capture the salient semantic information in texts and images. Finally, to explore the semantic correlation between the modalities, a cross-media joint loss was used to constrain the proposed model. Experiments showed that the proposed method achieved an average mean average precision (MAP) of 0.458 on the self-built Lycium pest dataset. Compared with eight existing methods, this is an improvement of 0.011 to 0.195 in average MAP, outperforming all of them. The proposed method can provide technical support and an algorithmic reference for the diversified retrieval requirements of crop pests.
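For readers unfamiliar with the reported metric, the following is a minimal sketch of how MAP is commonly computed for cross-modal retrieval. It is not the authors' evaluation code. It assumes that a retrieved item counts as relevant when it shares the query's pest category, and that the "average MAP" is the mean of the image-to-text and text-to-image MAP values; both are assumptions, and the function names and pest labels are hypothetical.

```python
from typing import Sequence


def average_precision(ranked_labels: Sequence[str], query_label: str) -> float:
    """AP for one query: mean of precision@k over the ranks k holding a relevant item.

    Assumption: an item is relevant when its category label equals the query's label.
    """
    hits = 0
    precision_sum = 0.0
    for rank, label in enumerate(ranked_labels, start=1):
        if label == query_label:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(
    rankings: Sequence[Sequence[str]], query_labels: Sequence[str]
) -> float:
    """MAP: the mean of the per-query AP values for one retrieval direction."""
    aps = [average_precision(r, q) for r, q in zip(rankings, query_labels)]
    return sum(aps) / len(aps) if aps else 0.0


# Hypothetical toy data: each inner list holds the category labels of the
# retrieved items, ordered from most to least similar to the query.
image_query_labels = ["aphid", "psyllid"]
image_to_text_rankings = [
    ["aphid", "psyllid", "aphid"],
    ["aphid", "psyllid", "psyllid"],
]
text_query_labels = ["aphid", "psyllid"]
text_to_image_rankings = [
    ["psyllid", "aphid"],
    ["psyllid", "aphid"],
]

map_i2t = mean_average_precision(image_to_text_rankings, image_query_labels)
map_t2i = mean_average_precision(text_to_image_rankings, text_query_labels)

# Assumption: the reported "average MAP" averages the two retrieval directions.
average_map = (map_i2t + map_t2i) / 2
print(f"image->text MAP: {map_i2t:.3f}")
print(f"text->image MAP: {map_t2i:.3f}")
print(f"average MAP:     {average_map:.3f}")
```

Under these assumptions, a score of 1.0 would mean that every relevant item is ranked ahead of every irrelevant one for every query in both directions.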