Creating Bengali Freebase Using Wikidata

Freebase is a large collaborative knowledge base and database of general, structured information for public use. Its structured data was harvested from many sources, including individual, user-submitted wiki contributions. Its aim is to create a global resource so that people (and machines) can access common information more effectively; that information is mostly available in English. In this research work, we have tried to build a technique for creating a Freebase for the Bengali language. The number of Bengali articles on the internet is growing day by day, so it has become necessary to have a structured data store in Bengali. It consists of different types of concepts (topics) and relationships between those topics. These cover areas such as popular culture (e.g. films, music, books, sports, television), location information (restaurants, geolocations, businesses), scholarly information (linguistics, biology, astronomy), birthplaces (of poets, politicians, actors, actresses) and general knowledge (Wikipedia). Such a resource will be helpful for relation extraction and other Natural Language Processing (NLP) work on the Bengali language. In this work, we identified a technique for creating the Bengali Freebase and made a collection of Bengali data. We applied the SPARQL query language to extract information from sources such as Wikidata, whose data is typically in RDF (Resource Description Framework) triple format.


Introduction
In reality, we can see that the World Wide Web does not provide all of its knowledge in a consistent and uniformly structured way. For that reason, access to the Web is difficult for the purpose of sophisticated data analysis, search and organization [1]. Some of the specific problems are mentioned below:
• Structured information is "trapped" in unstructured documents which cannot be easily read by automated processes.
• Multiple, unconnected representations are available of the same real-world entity.
• Often, there is a lack of separation of meaning from presentation. Most unstructured and many structured data sources mix semantics with display information.
• Often, a heterogeneous representation of information exists across data sources, and even within a single data source across time.
• Sometimes Web-based information is presented without the explicit context of other, potentially helpful information sources. This is a condition arising from the one-way nature of Web links.
To solve the above problems, a useful knowledge base, Freebase, has been built in English. We can create a Bengali Freebase by querying Wikidata. Wikidata is a document-oriented database, focused on items which represent topics, concepts or objects, and its data is in RDF triple format. A mapping has been built from Freebase properties to Wikidata. Wikidata is a collaboratively edited knowledge base hosted by the Wikimedia Foundation. Using the Wikidata query service, we can obtain the required seed tuples automatically rather than entering them manually [2]. In this way, large-scale information integration, entity extraction and data reconciliation problems can be solved by automatically performing structuring, extraction and reconciliation tasks over large, messy, already existing datasets [1].
Most existing works in Bangla language processing have focused on research areas such as machine translation [3] [4] [5], reading comprehension [6] [7], and sentiment analysis. Few research works have focused on information retrieval (such as relation extraction [8] [9]), where the authors tried to use a knowledge base.

Related Work
The value of Wikipedia's data has long been obvious, with many efforts to use it.
The Wikidata approach is to crowd-source data acquisition, allowing a global community to edit the data. This extends the traditional wiki approach of allowing users to edit a website. Wiki is a Hawaiian word for fast; Ward Cunningham, who created the first wiki in 1995, used it to emphasize that his website could be changed quickly [10]. One popular existing system of this kind is Semantic MediaWiki, or SMW [11], which extends MediaWiki, the software used to run Wikipedia [12], with data-management capabilities. SMW was originally proposed for Wikipedia but was quickly used on hundreds of other websites as well. Unlike Wikidata, SMW manages data as part of its textual content, thus hindering the creation of a multilingual, single knowledge base supporting all Wikimedia projects. Moreover, the data model of Wikidata is more elaborate than that of SMW, allowing users to capture more complex information. In spite of these differences, SMW has had a great influence on Wikidata, and the two projects share code for common tasks. Other examples of free knowledge base projects are OpenCyc and Freebase.
OpenCyc is the free part of Cyc [13], which aims for a much more comprehensive and expressive representation of knowledge than Wikidata. OpenCyc is released under a free license and available to the public, but, unlike Wikidata, is not editable by the public. Freebase, acquired by Google in 2010, is an online platform that allows communities to manage structured data [14]. Objects in Freebase are classified by types that prescribe what kind of data an object can have; for example, Freebase classifies Einstein as a "musical artist" since it would otherwise not be possible to refer to recordings of his speeches. Wikidata, by contrast, supports the use of arbitrary properties on all objects. Other differences from Wikidata are related to multi-language support, source information, and the proprietary software used to run the site. The latter is critical for Wikipedia, which is committed to running on a fully open source software stack so that anyone can fork, or copy and create their own version of, the project.

Freebase
Freebase is designed to facilitate high "collaborative density" among its users in the organization, representation and integration of large, diverse data sets. Freebase has its own characteristics, given below.

A Huge Data Store
This is a scalable tuple store with built-in query planning and optimization capabilities, which allow deep, naively constructed queries to be satisfied quickly. This assists users in building high-performing systems without manual query optimization.

A Large Data Object Store (LOB)
This is a store of large data objects such as text documents.LOB objects are indexed and annotated in the store.

A Substantial Seed Data Set
An emphasis has been placed on the early seeding of Freebase with data sets of interest to the general population rather than those that are highly esoteric and specialized. This will hopefully result in a greater heterogeneity of structure and content that is more representative of the world's sum of general knowledge. It consists of different types of concepts (topics) and relationships between those topics. Topics include:
• popular culture (e.g. films, music, books, sports, television);
• location information (restaurants, geolocations, businesses);
• scholarly information (linguistics, biology, astronomy);
• general knowledge (Wikipedia).
While this data is already useful, we are making efforts for it to grow quickly over time in both quantity and density of relationships.

Wikidata Service
Wikidata is a website that belongs to the Wikimedia family of websites. The most famous site in that family is Wikipedia. Data from Wikidata is available in RDF dumps. RDF stands for Resource Description Framework, a general method for describing data by defining relationships between data objects; it allows data integration from multiple sources. RDF has a triple format: a set of three entities that codifies a statement about semantic data in the form of subject-predicate-object expressions [15].
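The triple model described above can be sketched in a few lines of Python. Each statement is a (subject, predicate, object) tuple; the prefixed identifiers stand in for full IRIs, using the Tokyo example discussed later in this paper.

```python
# RDF statements as (subject, predicate, object) tuples.
triples = [
    ("wd:Q1490", "wdt:P17", "wd:Q17"),        # Tokyo -- country --> Japan
    ("wd:Q1490", "rdfs:label", '"Tokyo"@en'),  # Tokyo -- label --> "Tokyo"@en
]

def objects_of(data, subject, predicate):
    """Return every object matching a (subject, predicate, ?object) pattern."""
    return [o for s, p, o in data if s == subject and p == predicate]

print(objects_of(triples, "wd:Q1490", "wdt:P17"))  # ['wd:Q17']
```

This pattern-matching step is exactly what a SPARQL engine does at scale over the full Wikidata dump.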
Wikidata is a place to store structured data in many languages. The basic entity in Wikidata is an item. An item can be a thing, a place, a person, an idea or anything else. The subject of each Wikipedia article corresponds to a Wikidata item, but the definition of a Wikidata item is more flexible and inclusive, and there are many items about which there are no Wikipedia articles. Wikidata has identifier numbers for entities and properties.

Entity Identifier Number
As Wikidata treats all languages in the same way, items do not have names, but generic identifiers. Each identifier is the letter Q followed by a number.
For example, the item about the capital of Japan is not called "Tokyo" or any other language-specific name, but Q1490. To give it a human-readable name, each item has a list of labels, one per language, associated with it. So the English (en) label of Q1490 is "Tokyo", and it likewise has corresponding words for the Japanese (ja) label, the Bengali (bn) label and so on.
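A minimal sketch of this label scheme: one language-neutral identifier keyed to per-language labels. Q1490 and the English label come from the text above; the Japanese and Bengali labels are the actual Wikidata labels, added here for illustration.

```python
# One item, one Q-identifier, many language-specific labels.
labels = {
    "Q1490": {"en": "Tokyo", "ja": "東京", "bn": "টোকিও"},
}

# Looking up the same item in different languages:
print(labels["Q1490"]["en"])  # Tokyo
print(labels["Q1490"]["bn"])  # টোকিও
```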

Property Identifier Number
Every item has a list of statements associated with it. Each statement has a "property" and a "value". There is a long list of possible properties. Like items, properties have generic identifiers, but they begin with the letter P instead of Q.
For example, the property to indicate the country is P17, and it has the label "country" in English. The value of P17 (country) for Q1490 (Tokyo) is Q17 (Japan). There are many other statements about Tokyo: flag (Q20900820, which points to an image at File:Flag of Tokyo), population (13,686,371), mayor (Q389617) and many others.
This way of organizing the information in a structured way makes it easy for computers to process. For example, consider the statement: Rukaiya is a citizen of Bangladesh.
• Here, Rukaiya is a person, an entity which is called "human". In Wikidata, the human entity has the identifier number Q5.
• Here, citizen of is the relation between the two entities, which is called a property; this property's name is "country of citizenship" and it has the identifier number P27.
• Here, Bangladesh is a country, which is an entity. In Wikidata, the Bangladesh entity has the identifier number Q902.
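The statement above can be encoded as Wikidata-style triples. Note that "ex:Rukaiya" is a hypothetical identifier used only for this sketch (the person in the example has no Wikidata item); P31 ("instance of") is the standard Wikidata property for typing an item, an assumption not spelled out in the text.

```python
# "Rukaiya is a citizen of Bangladesh" as subject-predicate-object triples.
statements = [
    ("ex:Rukaiya", "wdt:P31", "wd:Q5"),    # instance of: human (Q5)
    ("ex:Rukaiya", "wdt:P27", "wd:Q902"),  # country of citizenship: Bangladesh (Q902)
]

# Extract the citizenship value by matching on the P27 predicate:
citizenship = [o for s, p, o in statements if p == "wdt:P27"]
print(citizenship)  # ['wd:Q902']
```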

SPARQL Query Process
For many companies nowadays, it is necessary to extract information from complaints, either scraped from the Web or received directly from the client. The aim is to find actionable knowledge inside them. To this purpose, verbal phrases must be analyzed, as many complaints refer to actions improperly performed. The semantic roles of the actions (who did what to whom) and the named entities involved need to be extracted. Moreover, for the correct interpretation of the claims, the software should be able to deal with some background knowledge (for example, a product ontology). Although there are already many libraries and off-the-shelf tools that allow tackling these problems singly, it may be hard to find one that includes all the needed tasks. SPARQL is a query language that can extract information from natural language documents pre-annotated with NLP information, and a query language is much easier to use.
Moreover, the adoption of the SPARQL syntax allows one to seamlessly mix, inside the same query, NLP patterns with traditional RDF, simplifying integration with Semantic Web technologies [16].
SPARQL stands for SPARQL Protocol and RDF Query Language. It is an RDF query language able to retrieve and manipulate data stored in Resource Description Framework (RDF) format. It was made a standard by the RDF Data Access Working Group (DAWG) of the World Wide Web Consortium. SPARQL allows a query to consist of triple patterns, letting users write queries against what can loosely be called "key-value" data [17]. Some important points are mentioned below.
• A "?" sign is used before a variable.
• In a SPARQL query, the SELECT clause lists the variables whose values will be returned in the result.
• The WHERE clause contains the main query code. Here, the query is in triple format.
• For a property identifier we use the "wdt" prefix, as in wdt:P19 (place of birth), and for an entity identifier we use the "wd" prefix, as in wd:Q902 (Bangladesh).
• We can find an identifier number by typing the required entity or property into the Wikidata search box. Alternatively, while using the Wikidata online query service, we can press Control and the spacebar simultaneously and type it there.
• In SPARQL, a property has the same identifier number for all languages, and the same is true for an entity. The result is then labeled in the required language.
• Here, we return our query output in Bengali. Any result for which Bengali Wikipedia content is available will be returned in Bengali.
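Putting these conventions together, the query that produces output like Figure 3 (poets with Bangladeshi citizenship, labeled in Bengali) can be sketched as a query string in Python. The identifiers P106 ("occupation") and Q49757 ("poet") are assumptions not spelled out in the bullets above.

```python
# SPARQL query following the wd/wdt conventions listed above.
query = """
SELECT ?poet ?poetLabel WHERE {
  ?poet wdt:P106 wd:Q49757 .   # occupation: poet
  ?poet wdt:P27  wd:Q902  .    # country of citizenship: Bangladesh
  SERVICE wikibase:label { bd:serviceParam wikibase:language "bn" . }
}
"""

# Posting this string to the public endpoint https://query.wikidata.org/sparql
# returns the matching items; here we only check the query is well formed.
print("wd:Q902" in query)  # True
```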
So we can say that Wikidata is a scalable tuple store with query planning and optimization capabilities. We can find results on many different topics, which will be highly beneficial.

Conclusion
Freebase is a large collection of structured data, available in English. In this work, we have tried to find a technique for creating a Freebase for the Bengali language, which will be helpful for further research in Bengali language processing, where researchers need to obtain seed tuples. This research work demonstrated how to make queries for the required results using the Wikidata service rather than building a database by manual input, which is a very time-consuming task. Researchers in areas such as entity extraction and reconciliation, data mining, the Semantic Web, information retrieval, ontology creation and analysis can use this technique to support their work.

R. Habib et al. DOI: 10.4236/jcc.2023.115011, Journal of Computer and Communications.

Figure 2. Sample SPARQL query to the Wikidata service, with sample outcome.

Figure 3. A portion of the output: poets who have the citizenship of Bangladesh.