Semantic Recognition of a Data Structure in Big-Data


Data governance is becoming increasingly important in business and government. Good data governance improves interactions between employees within and across organizations. Data quality is a major challenge because the cost of poor quality can be very high, so managing data quality becomes an absolute necessity within an organization. To improve data quality in a Big-Data source, our purpose in this paper is to add semantics to the data and help the user recognize the Big-Data schema. The originality of this approach lies in the semantic aspect it offers: it detects issues in the data and proposes a data schema by applying semantic data profiling.

Share and Cite:

Salem, A. , Boufares, F. and Correia, S. (2014) Semantic Recognition of a Data Structure in Big-Data. Journal of Computer and Communications, 2, 93-102. doi: 10.4236/jcc.2014.29013.

Conflicts of Interest

The authors declare no conflicts of interest.



Copyright © 2024 by authors and Scientific Research Publishing Inc.

This work and the related PDF file are licensed under a Creative Commons Attribution 4.0 International License.