2/2024 - 3
Workflow Detection with Improved Phase Discriminability
ZHANG, M., HU, H., LI, Z.
Author keywords
intelligent manufacturing, workflow detection, self-attention mechanism, graph relation reasoning, transformer
About this article
Date of Publication: 2024-05-31
Volume 24, Issue 2, Year 2024, On page(s): 21 - 30
ISSN: 1582-7445, e-ISSN: 1844-7600
Digital Object Identifier: 10.4316/AECE.2024.02003
Web of Science Accession Number: 001242091800003
SCOPUS ID: 85195645436
Abstract
Workflow detection is a challenging problem in Industry 4.0 and plays a crucial role in intelligent production. However, it suffers from inaccurate phase classification and unclear boundary localization, which previous works have not resolved well. To address these problems, this paper develops a temporal-aware workflow detection framework (TransGAN) that exploits the complementarity between the Transformer and the graph attention network to improve phase discriminability. Specifically, temporal self-attention is first designed to learn the relationships between different positions of the feature sequence. Then, a multi-scale Transformer is introduced to encode pyramid features, fusing multiple context cues into a discriminative feature representation. Finally, contextual and surrounding relations are learned in a graph attention network for refined phase classification and boundary localization. Comprehensive experiments are performed to verify the effectiveness of the method. Compared with the advanced AFSD, accuracy improves by 2.3% and 2.1% at tIoU = 0.5 on the POTFD and THUMOS-14 datasets, respectively. An empirical study of running speed indicates that the proposed TransGAN can be deployed in real-world industrial environments for workflow detection.
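The abstract reports results at tIoU = 0.5. As a rough illustration of what that threshold means, the sketch below computes the temporal intersection over union (tIoU) between a predicted phase segment and a ground-truth segment. The function names, the `(start, end)` tuple layout, and the example timestamps are assumptions made for this illustration and are not taken from the paper; full detection benchmarks also require the predicted class label to match, which is omitted here.

```python
def temporal_iou(pred, gt):
    """Temporal IoU of two segments, each given as (start, end) in seconds.

    Hypothetical helper for illustration; not the authors' evaluation code.
    """
    p_start, p_end = pred
    g_start, g_end = gt
    # Length of the overlap between the two segments (zero if disjoint).
    inter = max(0.0, min(p_end, g_end) - max(p_start, g_start))
    # Union = sum of both lengths minus the overlap counted twice.
    union = (p_end - p_start) + (g_end - g_start) - inter
    return inter / union if union > 0 else 0.0


def matches_at_threshold(pred, gt, threshold=0.5):
    """True if the prediction overlaps the ground truth by at least `threshold`."""
    return temporal_iou(pred, gt) >= threshold


if __name__ == "__main__":
    predicted = (12.0, 30.0)     # assumed example values
    ground_truth = (10.0, 28.0)  # assumed example values
    print(temporal_iou(predicted, ground_truth))          # 16 / 20 = 0.8
    print(matches_at_threshold(predicted, ground_truth))  # True
```

A higher threshold demands tighter boundary localization, which is why boundary refinement of the kind described in the abstract would be expected to show up in results at a fixed tIoU.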
Faculty of Electrical Engineering and Computer Science
Stefan cel Mare University of Suceava, Romania
All rights reserved: Advances in Electrical and Computer Engineering is a registered trademark of the Stefan cel Mare University of Suceava. No part of this publication may be reproduced, stored in a retrieval system, photocopied, recorded or archived, without the written permission from the Editor. When authors submit their papers for publication, they agree that the copyright for their article be transferred to the Faculty of Electrical Engineering and Computer Science, Stefan cel Mare University of Suceava, Romania, if and only if the articles are accepted for publication. The copyright covers the exclusive rights to reproduce and distribute the article, including reprints and translations.