Video Moment Localization Network Based on Text Multi-semantic Clues Guidance

doi:10.4316/AECE.2023.03010

3/2023 - 10

WU, G. , XU, T.

Author keywords
information retrieval, machine learning, computer vision, natural language processing, pattern matching

References keywords
vision(24), video(16), cvpr(16), temporal(14), moment(14), localization(14), recognition(12), language(12), iccv(12), videos(11)

About this article
Date of Publication: 2023-08-31
Volume 23, Issue 3, Year 2023, On page(s): 85 - 92
ISSN: 1582-7445, e-ISSN: 1844-7600
Digital Object Identifier: 10.4316/AECE.2023.03010
Web of Science Accession Number: 001062641900010
SCOPUS ID: 85172352256

Abstract

With the rapid development of the Internet and information technology, people can create multimedia data such as pictures and videos anytime and anywhere, and efficient multimedia processing tools are needed to handle the resulting volume of video data. The video moment localization task aims to locate the moment in an untrimmed video that best matches a query. Existing text-guided methods consider only single-scale text features, which cannot fully represent the semantics of the text, and they do not account for text information masking crucial information in the video when text is used to guide video feature extraction. To solve these problems, we propose a video moment localization network based on text multi-semantic clues guidance. Specifically, we first design a text encoder based on a fusion gate that better captures the semantic information in the text through multi-semantic clues composed of word embeddings, local features, and global features. A text guidance module then uses the text semantic features to guide the extraction of video features, highlighting the video features related to the text semantics. Experimental results on two datasets, Charades-STA and ActivityNet Captions, show that our approach provides significant improvements over state-of-the-art methods.
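
The abstract gives no equations, so the following is only a minimal sketch of what a gated fusion of three text clues and a text-guided reweighting of video features could look like. Every name here (`fuse_cues`, `guide_clips`, `gate_weights`), the softmax form of the gate, and the residual term in the guidance step are assumptions made for illustration. They are not the authors' implementation.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of scalars."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def fuse_cues(word, local, global_, gate_weights):
    """Fuse three clue vectors for one word with a softmax gate.

    Assumption: each clue gets one scalar score (its dot product with a
    hypothetical learned vector in gate_weights), and the fused feature
    is the gate-weighted sum of the clues.
    """
    cues = [word, local, global_]
    scores = [dot(w, c) for w, c in zip(gate_weights, cues)]
    gates = softmax(scores)
    fused = [sum(g * c[i] for g, c in zip(gates, cues))
             for i in range(len(word))]
    return fused, gates


def guide_clips(clips, query):
    """Scale each clip feature by its relevance to the text query.

    Assumption: relevance is sigmoid(clip . query), and the factor
    (1 + relevance) acts as a residual path, so clips that match the
    text poorly are kept rather than zeroed out.
    """
    guided = []
    for clip in clips:
        relevance = sigmoid(dot(clip, query))
        guided.append([x * (1.0 + relevance) for x in clip])
    return guided


if __name__ == "__main__":
    # Toy 2-dimensional features; real features would be far larger.
    fused, gates = fuse_cues([1.0, 0.0], [0.0, 1.0], [1.0, 1.0],
                             [[0.5, 0.0], [0.0, 0.5], [0.1, 0.1]])
    print("gates:", gates)
    print("guided:", guide_clips([[1.0, 0.0], [0.0, 1.0]], fused))
```

The residual factor in `guide_clips` is one simple way to emphasize text-related clips without discarding the rest, which is the concern the abstract raises about text masking crucial video information. The paper may handle this differently.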

Copyright ©2001-2024
Faculty of Electrical Engineering and Computer Science
Stefan cel Mare University of Suceava, Romania

All rights reserved: Advances in Electrical and Computer Engineering is a registered trademark of the Stefan cel Mare University of Suceava. No part of this publication may be reproduced, stored in a retrieval system, photocopied, recorded or archived, without the written permission from the Editor. When authors submit their papers for publication, they agree that the copyright for their article be transferred to the Faculty of Electrical Engineering and Computer Science, Stefan cel Mare University of Suceava, Romania, if and only if the articles are accepted for publication. The copyright covers the exclusive rights to reproduce and distribute the article, including reprints and translations.
