Video Moment Localization Network Based on Text Multi-semantic Clues Guidance
WU, G., XU, T.
Author keywords
information retrieval, machine learning, computer vision, natural language processing, pattern matching
About this article
Date of Publication: 2023-08-31
Volume 23, Issue 3, Year 2023, On page(s): 85 - 92
ISSN: 1582-7445, e-ISSN: 1844-7600
Digital Object Identifier: 10.4316/AECE.2023.03010
Web of Science Accession Number: 001062641900010
SCOPUS ID: 85172352256
Abstract
With the rapid development of the Internet and information technology, people can create multimedia data such as pictures and videos anytime and anywhere, so efficient processing tools are needed for this vast amount of video data. The video moment localization task aims to locate the moment in an untrimmed video that best matches a natural-language query. Existing text-guided methods consider only single-scale text features, which cannot fully represent the semantics of the text, and they do not account for crucial visual information being masked by text information when text is used to guide the extraction of video features. To address these problems, we propose a video moment localization network based on text multi-semantic clues guidance. Specifically, we first design a text encoder based on a fusion gate that better captures the semantic information in the text through multi-semantic clues composed of word embeddings, local features, and global features. A text guidance module then uses the text semantic features to guide the extraction of video features, highlighting the video features related to the text semantics. Experimental results on two datasets, Charades-STA and ActivityNet Captions, show that our approach provides significant improvements over state-of-the-art methods.
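The two components described in the abstract can be illustrated with a minimal NumPy sketch. This is a hypothetical illustration, not the authors' implementation: the gate matrix `W`, the feature dimensions, and the mean-pooling of word features into a single query vector are all assumptions made for the example. The fusion gate mixes local and global text clues per dimension, with word embeddings as a residual signal; the text-guidance step re-weights video clip features by their similarity to the fused text feature.

```python
# Hypothetical sketch (not the paper's code): fusion-gate text encoding
# plus text-guided re-weighting of video clip features.
import numpy as np

rng = np.random.default_rng(0)
d = 8          # feature dimension (illustrative)
n_words = 5    # words in the query
n_clips = 10   # clips in the untrimmed video

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fusion_gate(word_emb, local_feat, global_feat, W):
    # The gate decides, per dimension, how much of the local vs. global
    # clue to keep; word embeddings are added back as a residual.
    concat = np.concatenate([word_emb, local_feat, global_feat], axis=-1)
    g = sigmoid(concat @ W)                        # (n_words, d)
    return g * local_feat + (1.0 - g) * global_feat + word_emb

def text_guidance(video_feat, text_feat):
    # Highlight video clips whose features align with the text semantics:
    # softmax similarity scores act as per-clip attention weights.
    scores = video_feat @ text_feat                # (n_clips,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                       # softmax over clips
    return video_feat * weights[:, None]           # re-weighted clips

word_emb = rng.normal(size=(n_words, d))
local_feat = rng.normal(size=(n_words, d))
global_feat = np.tile(rng.normal(size=(1, d)), (n_words, 1))
W = rng.normal(size=(3 * d, d)) * 0.1

# Pool word-level fused features into one query vector (an assumption).
text_feat = fusion_gate(word_emb, local_feat, global_feat, W).mean(axis=0)
video_feat = rng.normal(size=(n_clips, d))
guided = text_guidance(video_feat, text_feat)
print(guided.shape)  # (10, 8)
```

In a real model the gate weights would be learned end-to-end and the similarity scores would come from a trained cross-modal attention layer; the sketch only shows the data flow the abstract describes.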
Faculty of Electrical Engineering and Computer Science
Stefan cel Mare University of Suceava, Romania
All rights reserved: Advances in Electrical and Computer Engineering is a registered trademark of the Stefan cel Mare University of Suceava. No part of this publication may be reproduced, stored in a retrieval system, photocopied, recorded or archived, without the written permission from the Editor. When authors submit their papers for publication, they agree that the copyright for their article be transferred to the Faculty of Electrical Engineering and Computer Science, Stefan cel Mare University of Suceava, Romania, if and only if the articles are accepted for publication. The copyright covers the exclusive rights to reproduce and distribute the article, including reprints and translations.
Permission for other use: The copyright owner's consent does not extend to copying for general distribution, for promotion, for creating new works, or for resale. Specific written permission must be obtained from the Editor for such copying. Direct linking to files hosted on this website is strictly prohibited.
Disclaimer: Whilst every effort is made by the publishers and editorial board to see that no inaccurate or misleading data, opinions or statements appear in this journal, they wish to make it clear that all information and opinions formulated in the articles, as well as linguistic accuracy, are the sole responsibility of the author.