Post-processing of Deep Web Information Extraction Based on Domain Ontology

doi:10.4316/AECE.2013.04005

4/2013 - 5


HIGHLY CITED PAPER


LIU, L., PENG, T.


Author keywords
knowledge based systems, machine learning, semantic web, web mining, World Wide Web


About this article
Date of Publication: 2013-11-30
Volume 13, Issue 4, Year 2013, On page(s): 25 - 32
ISSN: 1582-7445, e-ISSN: 1844-7600
Digital Object Identifier: 10.4316/AECE.2013.04005
Web of Science Accession Number: 000331461300005
SCOPUS ID: 84890180257

Abstract


Many methods for extracting and processing deep Web query results rely on the differing structures of Web pages and the various design patterns of the underlying databases. However, these methods ignore some semantic meanings and relations. In this paper, we present an approach for post-processing deep Web query results based on a domain ontology, which makes use of those semantic meanings and relations. A block identification model (BIM) based on node similarity is defined to extract the data blocks that are relevant to a specific domain after noisy nodes have been removed. A feature vector for the book domain is obtained by a result set extraction model (RSEM) based on the vector space model (VSM). RSEM, in combination with BIM, builds the domain ontology for books, which not only removes the dependence on Web page structure when extracting data but also exploits the semantic meanings of the domain ontology. After the basic information has been extracted from the Web pages, a ranking algorithm is applied to offer users an ordered list of data records. Experimental results show that BIM and RSEM extract data blocks and build the domain ontology accurately, and that the relevant data records and basic information are extracted and ranked. The precision and recall results show that the proposed method is feasible and efficient.
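
The abstract names the vector space model, a ranking step, and precision and recall, but it gives no formulas. As a rough illustration only, the sketch below shows one common way these pieces fit together: term-frequency vectors, cosine similarity for ranking, and set-based precision and recall. All function names, the whitespace tokenizer, and the choice of cosine similarity are assumptions made for this example; they are not taken from the paper and do not reproduce BIM or RSEM.

```python
import math
from collections import Counter


def term_vector(text):
    """Build a bag-of-words term-frequency vector (assumed tokenizer: lowercase, split on whitespace)."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(weight * v[term] for term, weight in u.items() if term in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty vector is treated as unrelated to everything
    return dot / (norm_u * norm_v)


def rank_records(domain_vector, records):
    """Order text records by descending similarity to a domain feature vector (hypothetical ranking rule)."""
    scored = [(cosine(domain_vector, term_vector(r)), r) for r in records]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [record for _, record in scored]


def precision_recall(extracted, relevant):
    """Precision = hits / extracted, recall = hits / relevant, computed over sets."""
    extracted, relevant = set(extracted), set(relevant)
    hits = len(extracted & relevant)
    precision = hits / len(extracted) if extracted else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```

For instance, calling rank_records with term_vector("title author publisher price isbn") as the domain vector would place a record that mentions those terms ahead of a navigation string such as "sign in to your account", which shares no terms with the domain vector.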



Copyright ©2001-2024
Faculty of Electrical Engineering and Computer Science
Stefan cel Mare University of Suceava, Romania

All rights reserved: Advances in Electrical and Computer Engineering is a registered trademark of the Stefan cel Mare University of Suceava. No part of this publication may be reproduced, stored in a retrieval system, photocopied, recorded or archived, without the written permission from the Editor. When authors submit their papers for publication, they agree that the copyright for their article be transferred to the Faculty of Electrical Engineering and Computer Science, Stefan cel Mare University of Suceava, Romania, if and only if the articles are accepted for publication. The copyright covers the exclusive rights to reproduce and distribute the article, including reprints and translations.

