Automatic Table Recognition and Extraction from Heterogeneous Documents


This paper examines the automatic recognition and extraction of tables from a large collection of heterogeneous documents. The documents are first pre-processed and converted to HTML, after which an algorithm recognises the table portions of each document. A Hidden Markov Model (HMM) is then applied to the HTML code to extract the tables. The model was trained and tested with 526 self-generated tables (321 for training and 205 for testing). The Viterbi algorithm was implemented for the testing phase. The system was evaluated in terms of accuracy, precision, recall and F-measure. The overall evaluation results show 88.8% accuracy, 96.8% precision, 91.7% recall and 88.8% F-measure, indicating that the method is effective for table extraction.
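As a rough illustration of the decoding step mentioned above, the following minimal sketch runs the Viterbi algorithm over a short sequence of HTML-derived tokens. It is not the authors' implementation: the two states (`OTHER`, `TABLE`), the token vocabulary and every probability below are hypothetical values chosen only for this example, whereas the paper's model would estimate such parameters from its training tables.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most likely hidden-state sequence for the observations."""

    def log(p):
        # Work in log space to avoid underflow on long token sequences.
        return math.log(p) if p > 0 else float("-inf")

    # best[t][s]: best log-score of any path that ends in state s at step t.
    best = [{s: log(start_p[s]) + log(emit_p[s].get(observations[0], 0.0))
             for s in states}]
    back = [{}]  # back[t][s]: predecessor of s on that best path

    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((p, best[t - 1][p] + log(trans_p[p][s])) for p in states),
                key=lambda pair: pair[1],
            )
            best[t][s] = score + log(emit_p[s].get(observations[t], 0.0))
            back[t][s] = prev

    # Trace back from the best final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path


# Hypothetical toy model: all numbers are assumptions for illustration only.
states = ["OTHER", "TABLE"]
start_p = {"OTHER": 0.8, "TABLE": 0.2}
trans_p = {
    "OTHER": {"OTHER": 0.7, "TABLE": 0.3},
    "TABLE": {"OTHER": 0.2, "TABLE": 0.8},
}
emit_p = {
    "OTHER": {"p": 0.6, "tr": 0.05, "td": 0.05, "text": 0.3},
    "TABLE": {"p": 0.05, "tr": 0.3, "td": 0.4, "text": 0.25},
}

tokens = ["p", "text", "tr", "td", "text", "td", "p"]
labels = viterbi(tokens, states, start_p, trans_p, emit_p)
print(list(zip(tokens, labels)))
```

With these assumed parameters, the contiguous run of tokens labelled `TABLE` marks the span that would be extracted as a table; a trained model would differ only in where its probabilities come from.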

Share and Cite:

Babatunde, F., Ojokoh, B. and Oluwadare, S. (2015) Automatic Table Recognition and Extraction from Heterogeneous Documents. Journal of Computer and Communications, 3, 100-110. doi: 10.4236/jcc.2015.312009.

Conflicts of Interest

The authors declare no conflicts of interest.



Copyright © 2021 by authors and Scientific Research Publishing Inc.

This work and the related PDF file are licensed under a Creative Commons Attribution 4.0 International License.