FFCDH: Solution to Enable Face-to-Face Conversation between Deaf and Hearing People
1. Introduction
In deaf education, the most formidable challenge is to find the best means of communication. Since the 1500s, it has been acknowledged that deaf people can be taught: they can learn and understand written symbols by pairing them with the depicted objects. However, many deaf people have been isolated in society and poorly taught how to communicate, even though sign languages are available and bilingual-bicultural education is growing widely [1] [2] [3]. Moreover, some deaf students have had a difficult experience attending regular schools, where verbal or written language is the dominant means of communication and instruction. When deaf people study or join a meeting with hearing people, communication is generally accomplished through one or more sign language interpreters [4] [5]. The presence of a sign language interpreter aims at bringing equality to both deaf and hearing people. However, many hearing people know little about sign language because verbal language dominates communication in the hearing world. In the absence of a sign language interpreter, deaf and hearing people cannot communicate directly; as a result, deaf students are compelled to attend special schools, where they can find a community with whom they can use sign language in their daily conversations. The majority of deaf individuals use sign language as their first language and learn verbal or written languages later in their lifetime. This makes sign language an essential tool of instruction for deaf students, as they can benefit more from sign language than from text in a learning environment [6], and sign language has shown better performance in supporting children at the kindergarten level [7]. Due to the crucial role played by sign language in supporting deaf people, many studies have focused on the utilization of sign language in communication systems [8] [9].
The number of deaf and hard-of-hearing people has increased to over five percent of the global population [10], and more than ninety percent of deaf children are born to hearing parents [11]. Unfortunately, hearing parents of a deaf child commonly have neither enough time nor enough support to learn a full sign language. Consequently, their deaf children often begin to learn sign language only at school. The work in [12] shows that students identified with hearing loss score significantly lower than their hearing peers. In contrast, deaf children born to deaf parents have early access to sign language. Deaf parents are also proficient at managing their children's visual gaze, especially when sharing books during the period of language and cognitive development [13] [14]. Therefore, it is necessary to equip hearing parents with a means of communication so that they can communicate with and support their children during these critical periods. This means of communication should allow any hearing parent to become familiar with sign language. Additionally, it should convey a consistent meaning of each sign to both hearing parents and their deaf children. This is important since sign language varies not only among separate deaf institutions but also among teachers within the same institution [15].
In this paper, a solution to support real-time communication between deaf and hearing individuals is introduced. As an extension of our previous work [16], the proposed solution takes into account dialectal and sign language variation, speech utterances, and the coherence of the speech-to-sign conversion process. The current implementation targets English and Japanese; however, this generic solution can be extended to any other target sign language. The obtained results show that the system can support real-time communication between deaf individuals and their hearing counterparts.
The rest of the paper is organized as follows. In Section 2, the theoretical background, related work, and state of the art are presented. In Section 3, the requirements and the design of the proposed speech-to-sign system architecture are explained. Section 4 describes the evaluation of the system's performance. In Section 5, the discussion and implications are presented. Finally, the paper is concluded in Section 6.
2. Background and Related Works
Attempts to provide deaf people with a means to communicate with hearing people date back to 1964, when Robert H. Weitbrecht, who was born deaf, invented the first teletypewriter (TTY). The TTY allows typed messages to be sent over a telephone line; in telephone relay centers, messages between deaf and hearing people are relayed by specially trained operators. The European eSIGN project also made a major contribution to the synthesis of signs in communication systems. With advances in technology, the number of communication means available to deaf people has increased, and several studies now focus on speech-to-sign translation systems that sustain real-time communication between deaf and hearing people.
Nguyen-Duc et al. [16] proposed a local smart network for speech visualization, shown in Figure 1. The network supports speech-to-text conversion on wearable devices and uses an automatic speech recognition (ASR) engine to convert speech to text. The network was evaluated by measuring the accuracy of the speech-to-text conversion, i.e., the number of spoken words correctly converted into text, and the response time. The solution paves the way for real-time speech-to-sign conversation. However, it does not make use of sign language, and the evaluation considers neither the coherence of the speech-to-sign conversion process nor the handling of dialectal and sign language variation. This makes the solution unfeasible for real-time communication [16].
Figure 1. A local smart network for visualizing spoken language.
San-Segundo et al. [17] proposed a system architecture for translating speech into Spanish Sign Language in real time. The proposed architecture consists of three modules: a speech recognition module, a natural language translation module, and an animation module. The natural language translation module is based on statistical translation and a rule-based approach. The translation works well in a restricted domain; however, it introduces a time delay that makes it unfeasible for real-time communication [17].
López-Ludeña et al. [18] proposed a user-centred methodology for developing a communication system for deaf people that consists of four steps: requirements analysis, parallel corpus generation, technology adaptation, and system evaluation. However, the methodology relies on parallel corpus generation, which can cause delays and makes it unfeasible for real-time communication [18].
Zhao et al. [19] proposed a machine translation system from English to American Sign Language (ASL). The proposed system uses input text to derive semantic and morphological information. However, the proposed solution lacks an evaluation [19].
Unlike previous approaches, the system architecture proposed in this study is designed to handle dialectal and sign language variation as well as speech utterances in real-time communication. The architecture adopts an automatic speech recognition (ASR) engine and uses direct translation, which enables it to display signs in real time using a finger-spelling approach.
3. The Proposed FFCDH Solution
Taking into account dialectal and sign language variation, FFCDH is proposed to overcome the shortcomings of our previous work [16]. FFCDH is designed to enable face-to-face conversation between deaf and hearing people in real time without a sign language interpreter.
In this work, the deaf and hearing people are assumed to understand at least the hand arrangements, or sign alphabet, corresponding to the alphabet letters of the hearing person's native language. Additionally, they are assumed to use only one language, i.e., English or Japanese, during their conversation. The conversation can follow a turn-taking mechanism [20] [21], or several people can talk in a group simultaneously. In the latter case, an array of microphones is used to record the speech of each hearing person individually [22]. When the deaf and hearing people do not want to use the finger-spelling mechanism [23], they can switch the proposed system to the advanced mode, in which animated signs from a dictionary [24] are used, as sketched below. In this case, a sign language dictionary must be installed in the proposed system, and the dictionary is assumed to have enough vocabulary to support the conversation.
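The following minimal sketch illustrates the difference between the two assumed modes: in the simple mode every word is finger-spelled character by character, whereas in the advanced mode words are looked up in an installed sign dictionary and finger-spelled only when missing. The function names, file paths, and the tiny demonstration dictionary are hypothetical illustrations, not part of an existing FFCDH implementation.

```python
# Hypothetical sketch of the simple (finger-spelling) and advanced (dictionary)
# modes; glyph codes and GIF paths are illustrative placeholders only.

def finger_spell(word: str) -> list[str]:
    """Simple mode: one sign-alphabet glyph per letter of the word."""
    return [ch.upper() for ch in word if ch.isalpha()]

def advanced_mode(sentence: str, sign_dict: dict[str, str]) -> list[str]:
    """Advanced mode: animated sign per word, finger-spelling as a fallback."""
    output: list[str] = []
    for word in sentence.lower().split():
        if word in sign_dict:
            output.append(sign_dict[word])      # e.g. a GIF showing the sign
        else:
            output.extend(finger_spell(word))   # unknown word: spell it out
    return output

# Example with a tiny, made-up dictionary:
demo_dict = {"good": "signs/good.gif", "morning": "signs/morning.gif"}
print(advanced_mode("Good morning everyone", demo_dict))
# ['signs/good.gif', 'signs/morning.gif', 'E', 'V', 'E', 'R', 'Y', 'O', 'N', 'E']
```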
3.1. Design Requirements
The proposed system aims to let deaf people live on nearly equal terms with others. The key requirements aim at allowing deaf people to communicate as hearing people do, to develop intellectually, and to become happy and productive members of society. Therefore, the proposed system needs to meet the following requirements:
1) Enable the hearing and deaf people to use sign language
Deaf people, with or without the support of technology, still need support from hearing people. Especially at an early age, the support of hearing parents plays an important role in their development. The proposed system must therefore enable hearing parents to support their deaf children. The system needs to have at least two modes, namely a simple mode and an advanced mode. In the simple mode, finger-spelling is used to represent words, while the advanced mode uses a sign language vocabulary to allow deaf and hearing people to communicate using natural sign language.
2) Support real time conversation
The proposed system must enable deaf and hearing people to talk to each other directly in real time. When a hearing person speaks, their speech is recorded and converted into signs for the deaf person in real time. When finger-spelling is used, the speed of the conversation can be low; however, when the sign vocabulary is used, the conversion must be fast enough to support normal communication. In addition, the proposed system must support deaf people even when they do not understand the native language of the hearing person they are talking to. In such cases, the gestures produced by the deaf person are captured and converted into sound.
3) Eliminate the fear of being labeled “impaired” for the deaf
Deaf people have been overlooked and labeled “impaired” or “handicapped”; thus, almost all of them refuse to use hearing-support devices in their daily life [25]. Therefore, the proposed system must take this matter into account not only in its technical solution but also in its design. Technically speaking, the proposed system must support deaf people inconspicuously in their daily activities, particularly at school and in the community at large.
3.2. Proposed System
Figure 2 shows the diagram of the proposed system architecture. In order for the deaf person to see the signs corresponding to the speech of the hearing person, the speech is recorded by a microphone embedded in a wearable device. The wearable device, worn by the deaf person, then streams the recorded speech to a mobile device. Although the voice stream does not require high bandwidth, the audio is still encoded for faster transmission and energy saving. An audio decoder (AD) engine on the mobile device decodes the received audio, and an automatic speech recognition (ASR) engine recognizes the spoken words from the decoded audio stream. If the accuracy of the recognition process is low, an audio editor (AE) is used to reduce the tempo of the speech; the audio is then replayed as if the speaker were speaking at a lower speed. If the ASR engine is still unable to recognize the received audio, it queries a cloud-based service. In this paper, the proposed system operates only in one mode, in which the recognized words are sent back to the wearable device in text format. When the text messages reach the wearable device, the device displays each character in sign-alphabet format on a hands-free display using the Gallaudet fonts developed by David Rakowski [23]. For example, an Optical Head-Mounted Display (OHMD), as used in existing wearable glasses such as the Epson, Vuzix, and Google glasses, is considered. However, this work can be extended to an advanced mode in which a location detector (LD) engine detects the appropriate sign database to be used. The signs in the advanced mode are not sign alphabets; they are sign vocabulary items in GIF format that show the animation of the hands. Finally, the matched signs are sent to the wearable device to be displayed on the hands-free OHMD.
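The recognition flow described above can be summarized by the following minimal sketch. It only illustrates the described control flow: the AD, ASR, AE, and cloud components are passed in as generic callables because the paper does not prescribe concrete libraries for them, and the 0.8 confidence threshold is an assumed value for illustration.

```python
# Sketch of the simple-mode recognition flow on the mobile device.
# All component implementations are supplied by the caller; the 0.8
# confidence threshold is an assumed value for illustration.
from typing import Callable, Tuple

Recognizer = Callable[[bytes], Tuple[str, float]]  # audio -> (text, confidence)

def recognize_chunk(encoded_chunk: bytes,
                    decode: Callable[[bytes], bytes],         # AD engine
                    local_asr: Recognizer,                    # offline ASR
                    reduce_tempo: Callable[[bytes], bytes],   # AE engine
                    cloud_asr: Callable[[bytes], str],        # cloud-based service
                    threshold: float = 0.8) -> str:
    """Return the text that is sent back to the wearable device."""
    audio = decode(encoded_chunk)

    text, confidence = local_asr(audio)               # try the offline database first
    if confidence < threshold:
        # Slow the speech down and try the offline recognizer again.
        text, confidence = local_asr(reduce_tempo(audio))

    if confidence < threshold:
        text = cloud_asr(audio)                       # fall back to the cloud service

    return text  # rendered character by character as fingerspelling on the OHMD
```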
To let the deaf person respond to their hearing counterparts at a suitable volume, the volume of the deaf person's voice is recorded, turned into an animation, and displayed on the OHMD. This is because deaf people can talk but cannot hear their own voice, so its volume is commonly very high. The deaf person can adjust the volume of his or her voice by observing the animation, which shows how loud the voice is.
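As a rough illustration of this feedback loop, the loudness of each recorded frame can be reduced to a single bar level that the OHMD animation draws. The 16-bit sample range and the five-level scale below are assumptions; the paper does not specify how loudness is measured.

```python
# Illustrative loudness meter for the voice-volume animation.
# Assumes signed 16-bit PCM samples and a five-level volume bar.
import math

def rms(samples: list[int]) -> float:
    """Root-mean-square amplitude of one frame of PCM samples."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def volume_level(samples: list[int], full_scale: int = 32767, bars: int = 5) -> int:
    """Map the frame loudness to 0..bars for a simple on-screen volume bar."""
    level = rms(samples) / full_scale      # 0.0 (silence) .. ~1.0 (clipping)
    return min(bars, round(level * bars))

# Example: a quiet frame vs. a very loud frame
print(volume_level([200, -180, 150, -220]))          # -> 0, the voice is quiet
print(volume_level([25000, -27000, 26000, -24000]))  # -> 4, the speaker should lower their voice
```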
Motivated by face-to-face real-time speech-to-sign conversion, the algorithm for the proposed system was designed as shown in Figure 3. The system operates based on this algorithm. The conversion process goes through two core stages: speech to text and text to sign. The audio data received by the system is first converted into text and then into signs. To preserve the coherence of the speech and handle dialectal variations, context awareness is applied: the received audio data is matched against existing contexts in the cloud server database, and the best match is converted into text. To ensure common understanding and control over sign language variations, direct translation is applied: the sign corresponding to the converted text is displayed. The system thus enables the end user to see the sign corresponding to the converted text in real time. To improve the response time, a WebSocket connection is used to sustain quick interactive communication between the end user's mobile device and the cloud server. The feasibility of the proposed system has been evaluated and is demonstrated in the following section.
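The following sketch shows one way such a persistent WebSocket connection could cover both stages: encoded audio chunks are streamed to the cloud server, the recognized text is returned over the same connection, and the text is then translated directly into signs on the client. The endpoint URL, the send-audio/receive-text protocol, and the use of the Python `websockets` package are illustrative assumptions, not details fixed by the paper.

```python
# Illustrative client-side loop over a persistent WebSocket connection.
# The server URI and message protocol are hypothetical.
import asyncio
import websockets

SERVER_URI = "wss://example.org/ffcdh/recognize"  # hypothetical cloud endpoint

async def stream_utterances(audio_chunks, display_sign):
    """Stage 1 on the server (speech to text), stage 2 on the client (text to sign)."""
    async with websockets.connect(SERVER_URI) as ws:
        for chunk in audio_chunks:
            await ws.send(chunk)        # encoded audio goes to the cloud server
            text = await ws.recv()      # recognized text comes back on the same socket
            display_sign(text)          # direct translation: show the matching signs

# Example usage with stand-in arguments:
# asyncio.run(stream_utterances(recorded_chunks, show_fingerspelling))
```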
4. Evaluation
To evaluate the performance of the proposed system, the speech-to-text accuracy and the coherence of the recognized text are observed. The proposed system is evaluated using English and Japanese, and the evaluation includes subjects of various nationalities to show the system's ability to handle dialectal differences. The performance of the proposed system is compared between the case in which a limited database, i.e., an offline database, is used as in [16] and the case in which a cloud-based database [26] [27], i.e., Google Cloud speech recognition, is used.
The speech-to-text accuracy is defined here as the number of correctly recognized words divided by the total number of spoken words. To calculate the coherence of the recognized content, we first split the original content into simple sentences. A simple sentence is the simplest grammatical form of a sentence; for example, it must consist of a subject and a verb. The coherence is then defined as the percentage of correctly recognized simple sentences out of the total number of simple sentences.
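These two definitions can be computed as follows. Whitespace tokenization and position-wise exact matching are simplifying assumptions made only for this sketch; the paper does not specify how recognized words are aligned with the reference transcript.

```python
# Illustrative computation of the two evaluation metrics.
# Exact, position-wise matching is an assumption of this sketch.

def speech_to_text_accuracy(reference: str, recognized: str) -> float:
    """Fraction of spoken words that were recognized correctly."""
    ref_words = reference.split()
    hyp_words = recognized.split()
    correct = sum(1 for r, h in zip(ref_words, hyp_words) if r == h)
    return correct / len(ref_words)

def coherence(reference_sentences: list[str], recognized_sentences: list[str]) -> float:
    """Percentage of simple sentences that were recognized exactly."""
    correct = sum(1 for r, h in zip(reference_sentences, recognized_sentences) if r == h)
    return 100.0 * correct / len(reference_sentences)

# Example:
print(speech_to_text_accuracy("how are you today", "how are you to day"))  # -> 0.75
print(coherence(["how are you"], ["how are you"]))                         # -> 100.0
```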
4.1. Experiment Setup
The evaluation is performed on a test bed as illustrated in Figure 4. For the sake of simplicity, the functions of the wearable device and the mobile device are implemented on two conventional computers (Core i5-3437U @ 2.4 GHz processor, 4 GB RAM, Windows 7), called WD and MD, respectively. On the WD, an embedded microphone is used to record the speech of the hearing person as well as the deaf person. To ensure that the signs can be displayed on any hands-free display, the area of the WD screen used to display the signs has the same size as a hands-free display. The WD connects to the MD using Bluetooth 4.0. On the MD, a web-based application was developed that allows the ASR engine to recognize speech using either the limited database, i.e., the offline database of [16], or the cloud-based database (Google Cloud speech recognition engine). This also allows a fair comparison between the previous study [16] and the proposed system in this paper.
4.2. Evaluation Using English
The experiments were performed by twenty-two participants from fourteen different countries: America, England, Japan, China, Nigeria, Saudi Arabia, Malaysia, Cape Verde, Thailand, Mozambique, Sierra Leone, Barbados, Tanzania, and Kenya. The participants were aged between 22 and 38, and six of them were female. The participants were invited to join two tests. In the first test, each participant read the simple content shown in Table 1, which consists of simple daily-life conversation words. In the second test, a complex content was used, consisting of scientific English words taken from a computer science lecture. The complex content contains two hundred words in fourteen long and complex sentences.
The procedure of the experiment was as follows: each participant is asked to read the content at normal speed. The speech is recorded by the audio handling component of the wearable device and streamed to the mobile device. The ASR engine on the mobile device recognizes the spoken words from the received audio, and the recognized words are sent back to the wearable device. On the wearable device, each recognized word is shown as a sequence of sign characters. The participant also sees the corresponding English alphabet characters to validate the results.
The accuracy of the recognition process and the coherence of the recognized content when either the limited database or the cloud-based database is used are given in Figure 5 and Figure 6, respectively. The average accuracy when the
Figure 4. System evaluation environment.
Figure 5. Performance of the proposed system when the limited database is used to recognize English content.
Figure 6. Performance of the proposed system when the cloud-based database is used to recognize English content.
Table 1. English contents used in the evaluation.
limited database was used is lower, at 86 percent, as shown in Figure 5. Figure 6 shows that the simple content can be recognized with 91 percent accuracy on average when the cloud-based database was used. Similarly, the obtained results show that the average coherence of the recognized simple content in the two cases is 91 percent and 95 percent, respectively. When the complex content was used, unknown scientific terms were the main reason for the reduced accuracy of the recognition process. With the limited database, the average recognition accuracy is as low as 57 percent and the coherence is 76 percent. With the cloud-based database, the average recognition accuracy improved to 76 percent, raising the average coherence to 87 percent.
4.3. Evaluation Using Japanese
The experiments were performed by twenty-one native Japanese speakers. The participants were aged between 20 and 44, and two of them were female. The same procedure used in Section 4.2 was repeated for this evaluation, and the same contents were reused after being translated into Japanese, as shown in Table 2.
The accuracy of the recognition process and the coherence of the recognized content when the limited database or the cloud-based database is used are given in Figure 7 and Figure 8, respectively. Figure 7 shows that the simple and complex contents can be recognized with a similar average accuracy when the limited database is used. The average accuracy improved in all cases when the cloud-based database was used, as illustrated in Figure 8. Similarly, the obtained results show that the average coherence of the recognized simple content also improved with the cloud-based database. The average accuracy in both cases is similar, and the average coherence is high, because all participants are native Japanese speakers. The coherence is lower than the recognition accuracy because subjects and verbs in Japanese can be formed from several words; therefore, if the whole phrase that represents a subject or verb is not recognized, the coherence of the recognized content drops.
5. Discussion and Implications
In the obtained results, coherence and accuracy were observed to be high, which is interpreted as the ability of the proposed system (FFCDH) to be resilient to dialectal differences and sign variation. Thus, FFCDH can be
Figure 7. Performance of the proposed system when the limited database is used to recognize Japanese content.
Figure 8. Performance of the proposed system when the cloud-based database is used to recognize Japanese content.
Table 2. Japanese contents used in the evaluation.
used to sustain short and long conversations without altering the integrity of the spoken content. Coherence and accuracy were slightly lower for long and complex phrases than for short phrases, which indicates that the proposed system performs better in short daily conversations than in long speech.
The study also implies that real-time speech-to-sign conversion is feasible and can support daily-life conversations between deaf and hearing individuals, and that cloud-based databases are comprehensive enough to support speech-to-sign conversion.
6. Conclusion
In this paper, a solution to enable real-time face-to-face communication between deaf and hearing people has been introduced. Experimental results have confirmed that, when the database is large enough to support real-time speech-to-sign conversion, English and Japanese speech can be recognized with more than 90 percent accuracy on average, and the average coherence of the recognized content is also around 90 percent. The results have confirmed that, using the proposed system, deaf people can understand almost all of the spoken content. In addition, hearing people can start learning sign language through finger-spelling.