3/2023 - 10
Video Moment Localization Network Based on Text Multi-semantic Clues Guidance
WU, G.
Author keywords
information retrieval, machine learning, computer vision, natural language processing, pattern matching
References keywords
vision(24), video(16), cvpr(16), temporal(14), moment(14), localization(14), recognition(12), language(12), iccv(12), videos(11)
About this article
Date of Publication: 2023-08-31
Volume 23, Issue 3, Year 2023, On page(s): 85 - 92
ISSN: 1582-7445, e-ISSN: 1844-7600
Digital Object Identifier: 10.4316/AECE.2023.03010
Web of Science Accession Number: 001062641900010
Abstract
With the rapid development of the Internet and information technology, people can create multimedia data such as pictures and videos anytime and anywhere, and efficient processing tools are needed for the resulting volume of video data. The video moment localization task aims to locate the moment in an untrimmed video that best matches a text query. Existing text-guided methods consider only single-scale text features, which cannot fully represent the semantics of the text, and they do not account for text information masking crucial video information when text is used to guide video feature extraction. To address these problems, we propose a video moment localization network guided by multi-semantic text clues. Specifically, we first design a text encoder based on a fusion gate that better captures the semantic information in the text through multi-semantic clues composed of word embeddings, local features, and global features. A text guidance module then uses the text semantic features to guide video feature extraction, highlighting the video features related to the text semantics. Experimental results on two datasets, Charades-STA and ActivityNet Captions, show that our approach provides significant improvements over state-of-the-art methods.
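The abstract names two components, a fusion-gate text encoder and a text guidance module, but gives no equations for either. The block below is therefore only an illustrative sketch and not the paper's formulation. It assumes a common element-wise sigmoid gate for the fusion step and a softmax over scaled dot products for the guidance step. All function names (`fusion_gate`, `text_guided_weights`, `guide_video`), the gate parameters, and the toy feature vectors are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def fusion_gate(a: Vector, b: Vector, w_a: Vector, w_b: Vector, bias: Vector) -> Vector:
    """Assumed element-wise gate: g = sigmoid(w_a*a + w_b*b + bias), out = g*a + (1-g)*b."""
    fused = []
    for ai, bi, wa, wb, c in zip(a, b, w_a, w_b, bias):
        g = sigmoid(wa * ai + wb * bi + c)
        # Convex combination, so each output stays between the two inputs.
        fused.append(g * ai + (1.0 - g) * bi)
    return fused


def text_guided_weights(clips: List[Vector], text: Vector) -> List[float]:
    """Softmax over scaled dot products between each clip feature and the text feature."""
    scale = math.sqrt(len(text))
    scores = [sum(c * t for c, t in zip(clip, text)) / scale for clip in clips]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def guide_video(clips: List[Vector], text: Vector) -> List[Vector]:
    """Scale each clip by (1 + weight): an assumed residual form that
    emphasises text-related clips without zeroing out the others."""
    weights = text_guided_weights(clips, text)
    return [[x * (1.0 + w) for x in clip] for clip, w in zip(clips, weights)]


# Toy 4-dimensional features (hypothetical values, untrained gate parameters).
word_embedding = [0.2, -0.1, 0.5, 0.0]
local_feature = [0.4, 0.3, -0.2, 0.1]
global_feature = [0.1, 0.1, 0.1, 0.1]
ones, zeros = [1.0] * 4, [0.0] * 4

# Fuse word-level and local clues first, then fold in the global clue.
word_local = fusion_gate(word_embedding, local_feature, ones, ones, zeros)
text_feature = fusion_gate(word_local, global_feature, ones, ones, zeros)

clips = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.5, 0.5, 0.5, 0.5]]
guided = guide_video(clips, text_feature)
print(text_feature)
print(guided)
```

The residual scaling in `guide_video` is one simple way to avoid suppressing video content that the text does not mention, which is the masking concern the abstract raises. Whether the paper handles it this way is not stated here.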
Faculty of Electrical Engineering and Computer Science
Stefan cel Mare University of Suceava, Romania
All rights reserved: Advances in Electrical and Computer Engineering is a registered trademark of the Stefan cel Mare University of Suceava.