Development of Answer Validation System Using Responders’ Attributes and Crowd Ranking

Crowdsourcing has found a wide range of applications in Community Question Answering (CQA). However, one of its biggest challenges is ensuring the quality of crowd-contributed answers. This work therefore proposes a system that validates answers to questions using responders' attributes and a crowd ranking technique. Weights were assigned to respondents' answers based on their academic records, experience and understanding of the question in order to obtain valid answers. Thereafter, valid answers were ranked by the crowd using the Borda count algorithm. The proposed system was evaluated using Usability and User Experience (UX) measurements. The results obtained demonstrate the effectiveness of the applied technique.


Introduction
The new information era provides readily available access to information, especially with the advent of the internet. Different questions requiring correct answers are uploaded to the internet on a daily basis, which has led to the development of question answering (QA) systems that aim to provide accurate answers to explicit questions, in contrast to document retrieval (Ojokoh & Adebisi, 2019; Toba et al., 2014). Schofield and Thielscher (2019) defined community QA as a website or service comprising a method to display pieces of information in the form of a question in natural language, a medium for communal response, and a community in which questions and answers are rooted based on the level of participation; the answers provided were found to be of higher quality than those of other types of online QA services (Harper et al., 2008). Answers to questions from users form the pillar of a successful CQA service, in which better answers may be provided than by automatic systems. However, since the attitude and reliability of users on the web vary, the answers provided may not be of high quality, and this prompted the idea of answer validation: assessing the correctness of answers provided by respondents using different techniques (Magnini et al., 2002, 2005).
Validation of answers became essential because crowdsourcing tasks providers have restricted control over the selection of crowd workers and little insight into the level of know-how and dependability of the workers that provide answers.
Crowdsourcing, as defined by Howe (2006), is the act of farming out a job ordinarily performed by a selected employee to an open-ended large group of people, usually in the form of an open call. The performance of these crowd workers largely determines the worth of the result obtained from a task. Hung et al. (2015) described five types of crowd workers: reliable workers, who have profound knowledge of specific fields and respond to questions with very high reliability, in that all the answers they give are correct; normal workers, who have general knowledge with which to respond to questions but occasionally make mistakes, such that three out of four of their answers are correct; sloppy workers, who have very little knowledge and thereby provide erroneous answers, however unintentionally; uniform spammers, who intentionally give the same answer to all questions; and random spammers, who give casual and worthless answers to all questions.
Several studies have been carried out on how to improve the quality of the answers provided by QA systems, focusing on textual entailment, question type analysis, answer ranking by crowd workers and domain experts, and personal and community features (past history) of the answerer to determine the quality of the answers (Ríos-Gaona et al., 2012; Su et al., 2007; Ishikawa et al., 2011; Ojokoh & Ayokunle, 2012; Anderson et al., 2012; Schofield & Thielscher, 2019).
Since past history alone may not be sufficient to determine the quality of an answer, the level of confidence in the answer provided is introduced in order to obtain credible answers from respondents. The proposed system aims to use community presence interaction as one of the bases for quality answer selection; capture crowd specialty as part of the personal features used to validate answers; model the evaluation criteria automatically; and prevent biased crowd ranking of answers by enabling rankers to specify their preference schedule, using a Naïve Bayes spam filter and the Borda count ranking algorithm.
The remaining part of this paper is structured as follows: Section 2 presents related works, and the subsequent sections describe the proposed system, the data and tools, the experimental setup, the evaluation, the results and discussion, and the conclusion.

Related Works
Question Answering (QA), according to Chandra et al. (2017), is a computer science discipline concerned with developing systems that automatically provide answers to questions posed by humans in natural language. QA research attempts to deal with wide-ranging question types that include facts, lists, definitions, how, why, putative, semantically constrained, and cross-lingual questions (Cimiano et al., 2014). Ishikawa et al. (2011) manually chose questions and answers at random from Yahoo archives, which were evaluated by four assessors to identify evaluation criteria; these criteria were later used to construct a model to identify high-quality answers. Šimko et al. (2013) presented a method for validating question-answer learning objects involving interactive exercises for learners, employing students' accuracy estimations of answers provided by other students during learning. The method was deployed within an adaptive learning framework, and they were able to show that aggregate student crowd estimations are to a great extent analogous to teachers' assessments of the provided answers. Related work on identifying high-quality answers has also indicated that personal and community-based features have more predictive power in assessing answer quality.
In this paper, we leverage the fact that the performance of the crowd workers determines the quality of the result of a crowdsourcing task, and hence the need to develop an effective and reliable question answering system that is capable of validating and evaluating the answers provided by the crowd, given their varying reliability as established in past works (Hung et al., 2017; Savenkov et al., 2016). All these are important issues to be addressed in Artificial Intelligence.

The Proposed System
The architectural overview of the proposed system is presented in Figure 1. The subsections that follow describe each of the segments.

User Interface
The user interface module consists of four (4)

Database
The database is the component of the Answer Validation model that stores information about the system and its users. It stores both legitimate questions and answers from web users and, most importantly, answerers' personal information, obtained the first time a respondent uses the system, for the purpose of validating their answers.

Naïve Bayes Spam Filter
The Naïve Bayes (NB) spam filter, a machine learning algorithm and one of the powerful tools of Artificial Intelligence, was used in this work to filter inconsequential and redundant messages from the collection of messages or information provided by the crowd. Every incoming text (both questions and answers) passes through the trained Naïve Bayes spam filter to determine the probability of the message being legitimate or spam. The NB spam filter is trained with commonly used online spam words and a spam dataset downloaded from kaggle.com. A sample is shown in Figure 2.
From Bayes' theorem, the probability that a message with feature vector $\vec{x}$ belongs in category $c$ is:

$$p(c \mid \vec{x}) = \frac{p(\vec{x} \mid c)\, p(c)}{p(\vec{x})} \quad (1)$$

Using the Naïve Bayes spam filter, a message is classified as spam whenever

$$P = \frac{p(c_s)\, p(x \mid c_s)}{p(c_s)\, p(x \mid c_s) + p(c_h)\, p(x \mid c_h)} > T$$

and as not spam otherwise, where $c_s$ is the spam category; $c_h$ is the ham (legitimate) category; $p(c_s)$ is the probability that a response $x$ belongs to the spam category $c_s$; $p(c_h)$ is the probability that a response $x$ belongs to the ham category $c_h$; $p(x \mid c_s)$ is the likelihood of response $x$ given the spam category; $p(x \mid c_h)$ is the likelihood of response $x$ given the ham category; and $T$ is a threshold value. If $P$ is greater than $T$, the incoming message is classified as spam and discarded; if $P$ is less than or equal to $T$, the message is accepted by the system and presented as a question or recorded as an incoming answer.
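As a concrete illustration, the classification rule above can be sketched as a minimal word-count Naïve Bayes filter. This is a sketch under stated assumptions, not the paper's implementation: the tiny training set and the Laplace (add-one) smoothing choice are illustrative, standing in for the Kaggle spam dataset the paper actually uses.

```python
import math
from collections import Counter

def train(messages):
    # messages: list of (text, label) pairs, label in {"spam", "ham"}
    counts = {"spam": Counter(), "ham": Counter()}
    n_docs = Counter(label for _, label in messages)
    for text, label in messages:
        counts[label].update(text.lower().split())
    vocab = set(counts["spam"]) | set(counts["ham"])
    return counts, n_docs, vocab

def p_spam(text, counts, n_docs, vocab):
    # Normalised posterior P(spam | x), with add-one smoothing so unseen
    # words do not zero out the likelihood.
    total_docs = sum(n_docs.values())
    log_post = {}
    for c in ("spam", "ham"):
        logp = math.log(n_docs[c] / total_docs)           # log prior p(c)
        denom = sum(counts[c].values()) + len(vocab)
        for w in text.lower().split():
            logp += math.log((counts[c][w] + 1) / denom)  # log likelihood
        log_post[c] = logp
    m = max(log_post.values())
    exp = {c: math.exp(v - m) for c, v in log_post.items()}
    return exp["spam"] / (exp["spam"] + exp["ham"])

T = 0.5  # threshold value, as used in the paper's conclusion
model = train([
    ("win a free prize now", "spam"),
    ("claim your free cash prize", "spam"),
    ("how do I configure a router", "ham"),
    ("what is a python list", "ham"),
])
P = p_spam("free prize cash", *model)
print("spam" if P > T else "ham")  # this message is discarded as spam
```

In the system, a message whose posterior exceeds the threshold is discarded, while any other message proceeds to the question/answer separation component.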

Separate Question from Answer
This is the component of the system where a legitimate message from the user is identified as either a question or an answer. If the incoming message is a question, the system displays it on the interface for the answerers to provide answers; otherwise, the system passes it to the next component, where the criteria for quality answers are applied.

Criteria for Quality Answers
The quality of the result of a question answering system rests on the source of the answers provided by the system. Since the aim of the question answering system is to provide a precise answer in natural language, it is important to provide quality assurance on every answer obtained from web users, as these users vary in reliability. The criteria employed for validation and used to ensure quality answers in this work are user attributes, area of specialization, understandability and confidence (displayed in Table 1).

Weighted Voting System
A game-playing situation is applied for ranking answers, using a collection of weighted players $P_i$ together with a quota $q$, which is the total number of votes required to pass a motion. This is used to determine the level of reliability of the users that provide answers. A player is a user attribute used to allot points to answerers. In a weighted voting system, a player's weight $w_i$ refers to the number of points allotted to that player and is always a positive integer. A weighted voting system is described by specifying the voting weights $w_1, w_2, \ldots, w_n$ of the players $P_1, P_2, \ldots, P_n$, and the quota $q$. A coalition is called winning if the sum of the players' weights is greater than or equal to the quota, and losing otherwise. The coalitions, which are the criteria used in this work to ensure quality answers from the web users, are user attributes, area of specialization, understandability and confidence. The user attributes used comprise the user's course of study, grade point, number of years of experience in computing and general level of computing knowledge. Points are added to the weight of the responder based on their selections from the range of values of each attribute. A user is also allowed to choose an area of specialization, such as networking, cyber security, or hardware and repairs. Users' understandability of the given question is measured on a five-level rating scale, as is confidence, which indicates how much the system can trust the answer provided. Combining these with the weighted voting system, this phase of the system is represented by $[q : P_1, P_2, P_3, P_4]$, where $P_1$ is the user's personal attributes, $P_2$ is specialization, $P_3$ is understandability and $P_4$ is confidence.
The total weight $T_w$ per answerer is computed as:

$$T_w = \sum_i w_i$$

where $w_i$ is the weight corresponding to each player $P_i$. The maximum weight $N$ obtainable by an answerer, with $q$ being the minimum weight required for an acceptable (valid) answer, is expressed as:

$$N = w_1 + w_2 + w_3 + w_4 \quad (5)$$

then

$$\frac{N}{2} < q \le N \quad (6)$$

In this work, $q$ was obtained by calculating 70% of $N$:

$$q = 70\% \times N$$

From Equation (6), $q$ is less than or equal to $N$ but greater than $N/2$; with $N = 35$, this means $17.5 < q \le 35$. Since this work is based on quality answer validation, 70% of $N$ was used as the quota, giving $q = 0.7 \times 35 = 24.5 \approx 25$. Table 2 depicts the different criteria considered in this work with the respective maximum weight obtainable.
Depending on the points obtained from each criterion by the responders (answerers), these points are aggregated based on their selections. The total weight of the answer is calculated to check whether it meets the quota. If the total weight of the answer is greater than or equal to the quota, the answer is considered valid and is passed to the next phase, the ranking phase; if not, the answer is discarded.
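The validity check above can be sketched as follows, assuming the paper's values of $N = 35$ and $q = 25$ (70% of $N$, rounded up); the sample point totals are illustrative, since the exact per-criterion weights come from Table 2.

```python
import math

N = 35                   # maximum total weight obtainable by an answerer
q = math.ceil(0.70 * N)  # quota: 70% of N, i.e. 24.5 rounded up to 25

def is_valid(points):
    """points: the answerer's scores on the four players
    (user attributes, specialization, understandability, confidence).
    The answer is valid if the total weight meets the quota."""
    return sum(points) >= q

print(is_valid((12, 8, 4, 5)))  # total 29 >= 25 -> valid answer
print(is_valid((6, 5, 3, 3)))   # total 17 < 25 -> answer is discarded
```

Valid answers proceed to the crowd-ranking phase; invalid ones are dropped before any voter sees them.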

Crowd Ranking
The last phase employs a crowdsourcing ranking algorithm called Borda count.
The algorithm ranks all the valid answers from phase two using a preference schedule. It awards points to candidates based on the preference schedule, and the candidate with the highest points is declared the winner. For instance, given $M$ candidate answers, each first-place, second-place and third-place vote is worth $M$, $M-1$ and $M-2$ points respectively; consequently, each $M$th-place (that is, last-place) vote is worth 1 point. Now suppose there are $n$ voters; every voter ranks the $M$ candidates according to his preference, and a candidate answer obtains an average rank score:

$$s_n = \frac{1}{n} \sum_{i=1}^{n} r_i$$

where $r_i$ is the point assigned by the $i$-th crowd ranker. The candidate answers are then ranked according to their scores, from the best at the top of the list (the answer with the highest points) to the worst (the answer with the lowest points).
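The Borda count step above can be sketched as follows; the ballots and answer labels are illustrative, not the paper's data.

```python
def borda_rank(ballots):
    """ballots: each voter's full preference order, best answer first.
    With M candidates, the answer in place k (0-indexed) on a ballot
    earns M - k points, so first place is worth M points and the
    last place is worth 1 point."""
    M = len(ballots[0])
    scores = {}
    for ballot in ballots:
        for place, candidate in enumerate(ballot):
            scores[candidate] = scores.get(candidate, 0) + (M - place)
    # highest total points first
    return sorted(scores, key=scores.get, reverse=True)

ballots = [["A", "B", "C"], ["A", "C", "B"], ["B", "A", "C"]]
print(borda_rank(ballots))  # A totals 3 + 3 + 2 = 8 points and ranks first
```

Summing points over all ballots rather than averaging gives the same ordering, since every answer is ranked by the same number of voters.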

Data and Tools
A dataset consisting of 185 spam messages was downloaded from Kaggle.com and used to train the Naïve Bayes filter to distinguish between legitimate and inconsequential information provided by the crowd. The system was implemented using HTML, Python script and the Django web framework.

Experimental Setup
Experiments were conducted to verify the system performance and to determine how useful and precise the answers provided were. Users of the system are allowed to post questions, which are answered by responders who are well versed in the field of the question being asked. However, before responders are allowed to provide answers, they are required to sign in or sign up as the case may be, supplying their course of study, area of specialization, grade point, number of years of experience in computing, general level of computing knowledge and level of understanding of the question. The confidence level of the responder is also confirmed before the answer is posted. In cases where a minimum of five different answers are provided to a particular question, they are ranked by the crowd from the most correct to the least correct answer. A sample of asked questions and the answers provided is shown in Figure 3.

Evaluation
The method of evaluation used in this work is based on ISO/IEC 9126 standard metrics and the Usability and User Experience (UX) measurement instruments adopted in Tan et al. (2010). The model consists of 21 subcharacteristics distributed over six main characteristics of software measurement metrics. Using the common Goal Question Metric (GQM) approach, a nomenclature for usability and UX attributes was defined, and an extensive set of questions and measures was identified for each attribute. The metrics used for this work are shown in Table 3.
From the above stated metrics, twenty (20) questions were formed in order for users to evaluate the Answer Validation system. Eighty-five users out of a sample size of one hundred evaluated the system, with each question (Q1, …, Q20) answered on a four-level rating scale: Very High, High, Medium and Low. Ratings obtained from the users were analyzed using the weighted mean technique, in which weights (Very High = 4, High = 3, Medium = 2 and Low = 1) are assigned to user feedback. A sample of the questionnaire is shown in Table 4.
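The weighted mean analysis can be sketched as follows, assuming the stated scale (Very High = 4 down to Low = 1) and a percentage taken against the maximum possible score; the frequency counts for the sample question are illustrative, not the paper's measured data.

```python
WEIGHTS = {"Very High": 4, "High": 3, "Medium": 2, "Low": 1}

def weighted_mean_pct(freq):
    """freq: rating label -> number of users who chose it.
    Returns the weighted mean score as a percentage of the maximum
    possible score (every user rating Very High = 4)."""
    users = sum(freq.values())
    score = sum(WEIGHTS[r] * n for r, n in freq.items())
    return 100.0 * score / (4 * users)

# e.g. a hypothetical distribution of the 85 respondents on one question
freq = {"Very High": 50, "High": 30, "Medium": 5, "Low": 0}
print(round(weighted_mean_pct(freq), 2))  # 88.24
```

Computing this percentage per question yields the continuous scores reported in the results section.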

Results and Discussion
The ratings were analyzed and the frequency of each rating was obtained. The metrics were measured and analyzed to form a continuous score in percentage (%). Table 5 illustrates the number of users out of eighty-five (85) that rated the system Very High, High, Medium or Low on the given questionnaire. Figure 4 and Figure 5 show the graphical representation of the obtained results. Table 6 shows the combination of Very High and High ratings used to classify the user ratings as High, Medium or Low. Figure 6 and Figure 7 show the combined Very High and High ratings for Usability and User Experience respectively.

Table 4. Questionnaire for answer validation system evaluation.
| S/N | Usability metrics | User Experience (UX) metrics |
|-----|-------------------|------------------------------|
| 1 | What is the rate at which you understand the system? | What is the rate at which the answers provided by the system are correct? |
| 2 | What is the rate at which you think the system is efficient? | What is the rate at which you are satisfied with the answers provided by the system? |
| 3 | What is the rate at which the system tolerates error and corrects you when you make mistakes? | What is the rate at which the language used by the system is simple? |
| 4 | What is the rate at which the system is easy to use? | What is the rate at which the answers provided by the system, corresponding to their questions, are valid? |
| 5 | What is the rate at which the system design is attractive? | What is the rate at which the system is effective enough in providing valid answers to questions? |
| 6 | What is the rate at which you are satisfied with the time response of the system? | What is the rate at which the system can provide high quality answers? |
| 7 | What is the rate at which you are satisfied with the visual content of the system? | What is the rate at which the system is reliable in providing answers to computing related questions? |
| 8 | What is the rate at which you find it easy to navigate through the system? | What is the rate at which the system is consistent in performing its functions? |
| 9 | What is the rate at which you would like to use the system next time? | What is the rate at which the system is accessible from your end? |
| 10 | What is the rate at which you are satisfied with the system feedback? | What is the rate at which you prefer the system to others? |

The overall results show that the user experience evaluations of the system based on the given metrics are excellent: for most of the metrics used, "Very High" and "High" (good scales for measuring a superior or improved opinion) together are rated at 90% and above, while "Medium" is rated at less than 10%.
The Relevance of the system is calculated thus:

Conclusion and Future Works
An answer validation system using answerers' attributes and crowd ranking has been developed. For the effectiveness of the system, illegitimate questions and answers were filtered out using a trained Naïve Bayes spam filter with a threshold of 0.5. Answerers' personal attributes (grade point, area of specialization, years of experience in computing, level of computing knowledge, course of study, question understandability and the answer confidence level (trustworthiness)) were used to ensure high-quality answers, by employing a weighted system that assigns weights to individual attributes in order to determine the weight of each answer for validation. Answers are ranked by the crowd using the Borda count ranking algorithm to obtain the best four of the candidate answers obtained from the answerers, and the lowest-ranked answer is discarded. The system correctness is 96.47%, answer satisfaction 100%, answer validation 97.65%, system simplicity 97.6%, system feedback 88.23% and system efficiency 96.47%. Future work could include more user attributes, such as age and qualification, and improve the system feedback so that users can receive instant live answers to their questions. There should also be a way to motivate answerers for the tasks performed in order to enhance their performance. In addition, the system should be generalized to accommodate questions from other science-related domains.

Conflicts of Interest
The authors declare no conflicts of interest regarding the publication of this paper.