2/2024 - 3
Workflow Detection with Improved Phase Discriminability
ZHANG, M.
Author keywords
intelligent manufacturing, workflow detection, self-attention mechanism, graph relation reasoning, transformer
About this article
Date of Publication: 2024-05-31
Volume 24, Issue 2, Year 2024, On page(s): 21 - 30
ISSN: 1582-7445, e-ISSN: 1844-7600
Digital Object Identifier: 10.4316/AECE.2024.02003
Web of Science Accession Number: 001242091800003
SCOPUS ID: 85195645436
Abstract
Workflow detection is a challenging problem in Industry 4.0 and plays a crucial role in intelligent production. However, it suffers from inaccurate phase classification and unclear boundary localization, which previous works have not resolved well. To address these problems, this paper develops a temporal-aware workflow detection framework (TransGAN) that exploits the complementarity between the Transformer and the graph attention network to improve phase discriminability. Specifically, temporal self-attention is first designed to learn the relationships between different positions of the feature sequence. Then, a multi-scale Transformer is introduced to encode pyramid features, fusing multiple context cues into a discriminative feature representation. Finally, contextual and surrounding relations are learned in a graph attention network for refined phase classification and boundary localization. Comprehensive experiments verify the effectiveness of the method. Compared with the advanced AFSD, accuracy improves by 2.3% and 2.1% at tIoU=0.5 on the POTFD and THUMOS-14 datasets, respectively. An empirical study of running speed indicates that the proposed TransGAN can be deployed in real-world industrial environments for workflow detection.
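The tIoU=0.5 setting quoted in the abstract refers to the temporal intersection over union between a predicted phase segment and a ground-truth segment. The following is a minimal sketch of that measure, not the authors' evaluation code. It assumes segments are given as (start, end) pairs in seconds; the function names `temporal_iou` and `is_correct_detection` and the example values are illustrative only.

```python
def temporal_iou(pred, gt):
    """Return the temporal IoU of two (start, end) segments.

    Both segments are assumed to satisfy start <= end and to use
    the same time unit (for example, seconds).
    """
    p_start, p_end = pred
    g_start, g_end = gt
    # Overlap length is zero when the segments are disjoint.
    inter = max(0.0, min(p_end, g_end) - max(p_start, g_start))
    # Union = sum of both lengths minus the shared overlap.
    union = (p_end - p_start) + (g_end - g_start) - inter
    return inter / union if union > 0 else 0.0


def is_correct_detection(pred, gt, threshold=0.5):
    """Treat a prediction as a match when its tIoU reaches the threshold.

    Hypothetical helper: a full evaluation would also require the
    predicted phase label to equal the ground-truth label.
    """
    return temporal_iou(pred, gt) >= threshold


if __name__ == "__main__":
    # Illustrative segments, not data from the paper.
    print(temporal_iou((2.0, 10.0), (0.0, 10.0)))
    print(is_correct_detection((0.0, 10.0), (5.0, 15.0)))
```

Under this measure, a stricter threshold rewards tighter boundary localization, which is why results are reported at a stated tIoU value.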
Faculty of Electrical Engineering and Computer Science
Stefan cel Mare University of Suceava, Romania
All rights reserved: Advances in Electrical and Computer Engineering is a registered trademark of the Stefan cel Mare University of Suceava. No part of this publication may be reproduced, stored in a retrieval system, photocopied, recorded or archived, without the written permission from the Editor. When authors submit their papers for publication, they agree that the copyright for their article be transferred to the Faculty of Electrical Engineering and Computer Science, Stefan cel Mare University of Suceava, Romania, if and only if the articles are accepted for publication. The copyright covers the exclusive rights to reproduce and distribute the article, including reprints and translations.
Permission for other use: The copyright owner's consent does not extend to copying for general distribution, for promotion, for creating new works, or for resale. Specific written permission must be obtained from the Editor for such copying. Direct linking to files hosted on this website is strictly prohibited.
Disclaimer: Whilst every effort is made by the publishers and editorial board to see that no inaccurate or misleading data, opinions or statements appear in this journal, they wish to make it clear that all information and opinions formulated in the articles, as well as linguistic accuracy, are the sole responsibility of the author.