Workflow Detection with Improved Phase Discriminability

doi:10.4316/AECE.2024.02003

ZHANG, M., HU, H., LI, Z.

Author keywords
intelligent manufacturing, workflow detection, self-attention mechanism, graph relation reasoning, transformer

About this article
Date of Publication: 2024-05-31
Volume 24, Issue 2, Year 2024, On page(s): 21 - 30
ISSN: 1582-7445, e-ISSN: 1844-7600
Digital Object Identifier: 10.4316/AECE.2024.02003
Web of Science Accession Number: 001242091800003
SCOPUS ID: 85195645436

Abstract

Workflow detection is a challenging problem in Industry 4.0 and plays a crucial role in intelligent production. However, it suffers from inaccurate phase classification and imprecise boundary localization, which previous work has not resolved well. To address both problems, this paper develops a temporal-aware workflow detection framework (TransGAN) that exploits the complementarity between the Transformer and the graph attention network to improve phase discriminability. Specifically, temporal self-attention is first designed to learn the relationships between different positions of the feature sequence. Then, a multi-scale Transformer is introduced to encode pyramid features, fusing multiple context cues into a discriminative feature representation. Finally, contextual and surrounding relations are learned in a graph attention network for refined phase classification and boundary localization. Comprehensive experiments verify the effectiveness of the method. Compared with the advanced AFSD, accuracy improves by 2.3% and 2.1% at tIoU = 0.5 on the POTFD and THUMOS-14 datasets, respectively. An empirical study of running speed indicates that the proposed TransGAN can be deployed in real-world industrial environments for workflow detection.
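
The results above are reported at a temporal Intersection over Union (tIoU) threshold of 0.5. As a minimal sketch of what that threshold means, the Python below computes tIoU between a predicted and a ground-truth phase segment. The function names `temporal_iou` and `is_match`, and the representation of a segment as a `(start, end)` pair in seconds, are assumptions made for illustration; the abstract does not describe the paper's evaluation code, which may also require a correct class label and aggregate over many detections.

```python
def temporal_iou(pred, gt):
    """Return the temporal IoU of two (start, end) segments.

    Segments are assumed to be (start, end) pairs in seconds with
    start <= end; this representation is an illustrative assumption.
    """
    pred_start, pred_end = pred
    gt_start, gt_end = gt
    # Overlap is zero when the segments do not intersect.
    intersection = max(0.0, min(pred_end, gt_end) - max(pred_start, gt_start))
    union = (pred_end - pred_start) + (gt_end - gt_start) - intersection
    return intersection / union if union > 0 else 0.0


def is_match(pred, gt, threshold=0.5):
    """Treat a prediction as matching when its tIoU reaches the threshold."""
    return temporal_iou(pred, gt) >= threshold


if __name__ == "__main__":
    # Overlap of 8 s over a union of 12 s: counts as a match at 0.5.
    print(is_match((10.0, 20.0), (12.0, 22.0)))
    # Overlap of 2 s over a union of 20 s: boundary too imprecise to match.
    print(is_match((10.0, 20.0), (18.0, 30.0)))
```

Under this definition, a prediction with the right phase label can still be rejected if its boundaries drift too far, which is why boundary localization matters alongside phase classification.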

Copyright ©2001-2025
Faculty of Electrical Engineering and Computer Science
Stefan cel Mare University of Suceava, Romania

All rights reserved: Advances in Electrical and Computer Engineering is a registered trademark of the Stefan cel Mare University of Suceava. No part of this publication may be reproduced, stored in a retrieval system, photocopied, recorded or archived, without the written permission from the Editor. When authors submit their papers for publication, they agree that the copyright for their article be transferred to the Faculty of Electrical Engineering and Computer Science, Stefan cel Mare University of Suceava, Romania, if and only if the articles are accepted for publication. The copyright covers the exclusive rights to reproduce and distribute the article, including reprints and translations.
