Semantic Recognition of a Data Structure in Big-Data

Aïcha Ben Salem; Faouzi Boufares; Sebastiao Correia

doi:10.4236/jcc.2014.29013

Journal of Computer and Communications > Vol.2 No.9, July 2014

1 Laboratory LIPN-UMR 7030-CNRS, University Paris 13, Sorbonne Paris Cité, Villetaneuse, France
2 Company Talend, Suresnes, France

Abstract

Data governance is becoming increasingly important in business and government. Good data governance improves interactions between the employees of one or more organizations. Data quality is a major challenge because the cost of poor quality can be very high, so managing data quality becomes an absolute necessity within an organization. To improve data quality in a Big-Data source, our purpose in this paper is to add semantics to the data and to help the user recognize the Big-Data schema. The originality of this approach lies in the semantic aspect it offers: it detects issues in the data and proposes a data schema by applying semantic data profiling.

Keywords

Data Quality, Big-Data, Semantic Data Profiling, Data Dictionary, Regular Expressions, Ontology

Share and Cite:

Salem, A., Boufares, F. and Correia, S. (2014) Semantic Recognition of a Data Structure in Big-Data. Journal of Computer and Communications, 2, 93-102. doi: 10.4236/jcc.2014.29013.

Conflicts of Interest

The authors declare no conflicts of interest.

