Video Moment Localization Network Based on Text Multi-semantic Clues Guidance
Wu, G., Xu, T.
Author keywords
information retrieval, machine learning, computer vision, natural language processing, pattern matching
About this article
Date of Publication: 2023-08-31
Volume 23, Issue 3, Year 2023, On page(s): 85 - 92
ISSN: 1582-7445, e-ISSN: 1844-7600
Digital Object Identifier: 10.4316/AECE.2023.03010
Web of Science Accession Number: 001062641900010
SCOPUS ID: 85172352256
Abstract
With the rapid development of the Internet and information technology, people can create multimedia data such as pictures and videos anytime and anywhere, so efficient processing tools are needed for this vast amount of video data. The video moment localization task aims to locate, within an untrimmed video, the moment that best matches a given query. Existing text-guided methods consider only single-scale text features, which cannot fully represent the semantics of the text, and they overlook the fact that text information can mask crucial information in the video when text is used to guide the extraction of video features. To address these problems, we propose a video moment localization network based on text multi-semantic clue guidance. Specifically, we first design a fusion-gate-based text encoder that better captures the semantic information in the text through multi-semantic clues composed of word embeddings, local features, and global features. A text guidance module then uses the text semantic features to guide the extraction of video features, highlighting the video features related to the text semantics. Experimental results on two datasets, Charades-STA and ActivityNet Captions, show that our approach provides significant improvements over state-of-the-art methods.
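The fusion-gate encoder and the text-guidance step described in the abstract can be sketched roughly as follows. This is a minimal NumPy illustration under our own assumptions (random weights, invented names such as `gated_fuse`, mean-pooled sentence feature, softmax re-weighting of clips), not the authors' actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fuse(a, b, W):
    """Fusion gate: g = sigmoid(W [a; b]); output = g * a + (1 - g) * b."""
    g = sigmoid(np.concatenate([a, b], axis=-1) @ W)
    return g * a + (1.0 - g) * b

T, d, n_clips = 6, 8, 10                     # words, feature dim, video clips
word_emb    = rng.standard_normal((T, d))    # per-word embeddings (e.g. GloVe)
local_feat  = rng.standard_normal((T, d))    # contextual per-word features
global_feat = np.tile(local_feat.mean(0), (T, 1))  # sentence-level clue, broadcast

W1 = rng.standard_normal((2 * d, d)) * 0.1
W2 = rng.standard_normal((2 * d, d)) * 0.1

# Two-stage gated fusion of the three semantic clues into one text representation.
text_feat = gated_fuse(gated_fuse(word_emb, local_feat, W1), global_feat, W2)
sentence = text_feat.mean(axis=0)            # pooled query representation, shape (d,)

# Text guidance: re-weight each video clip by its similarity to the query,
# highlighting clip features related to the text semantics.
video = rng.standard_normal((n_clips, d))
scores = video @ sentence
attn = np.exp(scores - scores.max())
attn /= attn.sum()                           # softmax over clips
guided_video = video * attn[:, None]         # (n_clips, d)
```

The two-stage gate mirrors the common fusion-gate pattern (one gate per pair of inputs); the paper's actual gating and guidance mechanisms may differ in detail.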
Faculty of Electrical Engineering and Computer Science
Stefan cel Mare University of Suceava, Romania
All rights reserved: Advances in Electrical and Computer Engineering is a registered trademark of the Stefan cel Mare University of Suceava. No part of this publication may be reproduced, stored in a retrieval system, photocopied, recorded or archived, without the written permission from the Editor. When authors submit their papers for publication, they agree that the copyright for their article be transferred to the Faculty of Electrical Engineering and Computer Science, Stefan cel Mare University of Suceava, Romania, if and only if the articles are accepted for publication. The copyright covers the exclusive rights to reproduce and distribute the article, including reprints and translations.
Permission for other use: The copyright owner's consent does not extend to copying for general distribution, for promotion, for creating new works, or for resale. Specific written permission must be obtained from the Editor for such copying. Direct linking to files hosted on this website is strictly prohibited.
Disclaimer: Whilst every effort is made by the publishers and editorial board to see that no inaccurate or misleading data, opinions or statements appear in this journal, they wish to make it clear that all information and opinions formulated in the articles, as well as linguistic accuracy, are the sole responsibility of the author.