Recently, several deep learning models have been successfully proposed and applied to solve different Natural Language Processing (NLP) tasks. However, these models solve each problem with single-task supervised learning and do not consider the correlation between the tasks. Based on this observation, in this paper, we implemented a multi-task learning model to jointly learn two related NLP tasks simultaneously and conducted experiments to evaluate whether learning these tasks jointly can improve the system performance compared with learning them individually. In addition, a comparison of our model with state-of-the-art learning models, including multi-task learning, transfer learning, unsupervised learning, and feature-based traditional machine learning models, is presented. This paper aims to 1) show the advantage of multi-task learning over single-task learning in training related NLP tasks, 2) illustrate the influence of various encoding structures on the proposed single- and multi-task learning models, and 3) compare the performance between multi-task learning and other learning models in the literature on the textual entailment and semantic relatedness tasks.

Traditional deep learning models typically optimize a single metric. We generally train a model for a specific task and then fine-tune the model until the system reaches its best performance [

In this paper, we implemented a multi-task learning model to jointly learn two related NLP tasks, semantic relatedness and textual entailment, simultaneously. The proposed model contains two parts: a shared representation structure and an output structure. Following the previous research [

Semantic relatedness (a.k.a. semantic textual similarity) and textual entailment are two related semantic-level NLP tasks. The first task measures the semantic equivalence between two sentences. The output is a similarity score scaling from 0 to 5, where higher scores indicate higher similarity between the sentences. The second task also requires two input sentences, a premise sentence and a hypothesis sentence. It measures whether the meaning of the hypothesis sentence can be determined from the premise sentence. There are typically three kinds of results: entailment, contradiction, and neutral, indicating that the meaning of the hypothesis sentence is entailed by, contradicts, or has nothing to do with the meaning of the premise sentence, respectively.

[

· Unlike the above-mentioned paper that only evaluates the unidirectional influence from semantic relatedness to textual entailment, our work demonstrates the mutual influence between semantic relatedness task and textual entailment task.

· Compared with previous work that jointly learned the tasks solely with a multi-layer Bi-LSTM structure, our work implemented and evaluated the multi-task learning model on a variety of structures with different encoding architectures, encoding contexts, and encoding directions, and analyzed the impact of different encoding methods on the proposed single- and multi-task learning models.

· Our system achieved results competitive with state-of-the-art multi-task learning and transfer learning models and outperformed state-of-the-art unsupervised and feature-based supervised machine learning models on the proposed tasks.

The next section gives a brief mathematical background of the deep neural structures as well as some preliminary knowledge of multi-task learning. After that, we illustrate the main structure of our system and discuss the training process. The experimental details and results are described in Section 4. In Section 5, we show the results, including feature ablation, comparative studies between the single- and multi-task learning models, and between our model and other state-of-the-art learning models. At the end, we offer some conclusions and discuss future work.

This section describes the background knowledge of this paper, including an introduction to different encoding structures (CNNs and RNNs), encoding contexts (attention layer, max pooling layer, and projection layer), and encoding directions (left-to-right or bidirectional), as well as the preliminaries of multi-task learning.

Recurrent neural network [

A regular LSTM unit contains five components: an input gate $i_t$, a forget gate $f_t$, an output gate $o_t$, a new memory cell $\tilde{c}_t$, and a final memory cell $c_t$. The three adaptive gates $i_t$, $f_t$, $o_t$ and the new memory cell $\tilde{c}_t$ are computed based on the previous state $h_{t-1}$, the current input $x_t$, and a bias term $b$. The final memory cell $c_t$ is a combination of the previous cell content $c_{t-1}$ and the new memory cell $\tilde{c}_t$, weighted by the forget gate $f_t$ and the input gate $i_t$. The final output of the LSTM hidden state $h_t$ is computed using the output gate $o_t$ and the final memory cell $c_t$. The mathematical representations of the input gate $i_t$, forget gate $f_t$, output gate $o_t$, new memory cell $\tilde{c}_t$, final memory cell $c_t$, and the final LSTM hidden state $h_t$ are shown in Equations (1) to (6).

$i_t = \sigma(W^{(i)} x_t + U^{(i)} h_{t-1})$ (1)

$f_t = \sigma(W^{(f)} x_t + U^{(f)} h_{t-1})$ (2)

$o_t = \sigma(W^{(o)} x_t + U^{(o)} h_{t-1})$ (3)

$\tilde{c}_t = \tanh(W^{(c)} x_t + U^{(c)} h_{t-1})$ (4)

$c_t = f_t \otimes c_{t-1} + i_t \otimes \tilde{c}_t$ (5)

$h_t = o_t \otimes \tanh(c_t)$ (6)
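As a concrete illustration, the update rules in Equations (1) to (6) can be sketched in NumPy as follows. Bias terms are omitted to match the equations as printed, and all shapes and variable names are illustrative choices rather than the exact settings used in our system:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM time step following Equations (1)-(6);
    p holds the weight matrices W^(g), U^(g) for g in {i, f, o, c}."""
    i_t = sigmoid(p["W_i"] @ x_t + p["U_i"] @ h_prev)        # (1) input gate
    f_t = sigmoid(p["W_f"] @ x_t + p["U_f"] @ h_prev)        # (2) forget gate
    o_t = sigmoid(p["W_o"] @ x_t + p["U_o"] @ h_prev)        # (3) output gate
    c_tilde = np.tanh(p["W_c"] @ x_t + p["U_c"] @ h_prev)    # (4) new memory cell
    c_t = f_t * c_prev + i_t * c_tilde                       # (5) final memory cell
    h_t = o_t * np.tanh(c_t)                                 # (6) hidden state
    return h_t, c_t

# Toy run: input (embedding) size 4, hidden size 3, a 5-step sequence.
rng = np.random.default_rng(0)
p = {}
for g in "ifoc":
    p[f"W_{g}"] = 0.1 * rng.standard_normal((3, 4))
    p[f"U_{g}"] = 0.1 * rng.standard_normal((3, 3))
h, c = np.zeros(3), np.zeros(3)
for x_t in rng.standard_normal((5, 4)):
    h, c = lstm_step(x_t, h, c, p)
```

Since the hidden state is the product of a sigmoid gate and a tanh, each entry of $h_t$ stays within (-1, 1).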

Dependencies in sentences do not just run from left to right: a word can also depend on another word that appears after it. In this case, Bidirectional LSTM (Bi-LSTM) [

A Bi-LSTM network can be viewed as a network that maintains two hidden LSTM layers together, one for the forward propagation $\overrightarrow{h}_t$ and another for the backward propagation $\overleftarrow{h}_t$ at each time step $t$. The final prediction $\hat{y}_t$ is generated through the combination of the score results produced by both hidden layers $\overrightarrow{h}_t$ and $\overleftarrow{h}_t$. Equations (7) to (9) illustrate the mathematical representation of a Bi-LSTM:

$\overrightarrow{h}_t = f(\overrightarrow{W} x_t + \overrightarrow{V}\,\overrightarrow{h}_{t-1} + \overrightarrow{b})$ (7)

$\overleftarrow{h}_t = f(\overleftarrow{W} x_t + \overleftarrow{V}\,\overleftarrow{h}_{t+1} + \overleftarrow{b})$ (8)

$\hat{y}_t = g(U h_t + c) = g(U[\overrightarrow{h}_t; \overleftarrow{h}_t] + c)$ (9)

Here, $\hat{y}_t$ is the prediction of the Bi-LSTM system. The arrows $\rightarrow$ and $\leftarrow$ indicate the forward and backward directions. $W$ and $V$ are weight matrices associated with the input $x_t$ and the hidden states $h_t$, $U$ is used to combine the two hidden LSTM layers, $b$ and $c$ are bias terms, and $g(x)$ and $f(x)$ are activation functions.
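The forward/backward scheme of Equations (7) to (9) can be sketched as follows. For brevity, a plain tanh recurrence stands in for the LSTM cell, and only the concatenation $[\overrightarrow{h}; \overleftarrow{h}]$ before the output layer $g$ is shown; all shapes and names are illustrative:

```python
import numpy as np

def rnn_step(x_t, h_prev, W, V, b):
    """One recurrent step h_t = f(W x_t + V h_prev + b), with f = tanh."""
    return np.tanh(W @ x_t + V @ h_prev + b)

def bidirectional_encode(X, p):
    """Run one left-to-right pass (Eq. (7)) and one right-to-left pass
    (Eq. (8)), then concatenate the two final states (as in Eq. (9))."""
    h_fwd = np.zeros(p["W_f"].shape[0])
    for x_t in X:                                  # forward direction
        h_fwd = rnn_step(x_t, h_fwd, p["W_f"], p["V_f"], p["b_f"])
    h_bwd = np.zeros(p["W_b"].shape[0])
    for x_t in X[::-1]:                            # backward direction
        h_bwd = rnn_step(x_t, h_bwd, p["W_b"], p["V_b"], p["b_b"])
    return np.concatenate([h_fwd, h_bwd])          # [h-forward ; h-backward]

rng = np.random.default_rng(4)
p = {k: 0.1 * rng.standard_normal(s) for k, s in
     [("W_f", (3, 4)), ("V_f", (3, 3)), ("b_f", (3,)),
      ("W_b", (3, 4)), ("V_b", (3, 3)), ("b_b", (3,))]}
X = rng.standard_normal((5, 4))                    # 5-step sequence, input size 4
rep = bidirectional_encode(X, p)                   # shape (6,): both directions
```

The resulting representation is twice the hidden size, which is why the fully connected layer after a Bi-LSTM is wider than after a unidirectional LSTM (see the parameter settings in Section 4).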

Different parts of an input sentence carry different levels of significance. For instance, in the sentence "the ball is on the field", the primary information is carried by the words "ball", "on", and "field". An LSTM network, although able to handle the gradient vanishing issue, still has a bias toward the last few words over words appearing at the beginning or middle of a sentence. This is clearly not the natural way we understand sentences. Attention mechanism [

The attention mechanism is calculated in three steps. First, we feed the hidden state $h_t$ through a one-layer perceptron to get $u_t$, which can be viewed as a hidden representation of $h_t$. We then multiply $u_t$ with a context vector $u_w$ and normalize the results through a softmax function to get the weight $a_t$ of each hidden state $h_t$. The context vector can be viewed as a high-level vector that selects informative hidden states and is jointly learned during the training process. The final sentence representation is computed as a weighted sum of the hidden states $h_t$ with their weights $a_t$. The calculation steps of $u_t$ and $a_t$ are shown in Equations (10) and (11). The mathematical representation that leads to the final sentence representation $S$ is shown in Equation (12):

$u_t = \tanh(W h_t + b)$ (10)

$a_t = \dfrac{\exp(u_t^{\top} u_w)}{\sum_t \exp(u_t^{\top} u_w)}$ (11)

$S = \sum_t a_t h_t$ (12)
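A minimal NumPy sketch of Equations (10) to (12) follows; all shapes are illustrative, and in practice $W$, $b$, and the context vector $u_w$ are learned during training:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())    # shift for numerical stability
    return e / e.sum()

def attention_pool(H, W, b, u_w):
    """Attention pooling over hidden states, following Equations (10)-(12).
    H is a (T, d) matrix whose rows are the hidden states h_t."""
    U = np.tanh(H @ W.T + b)   # (10): u_t = tanh(W h_t + b), one row per step
    a = softmax(U @ u_w)       # (11): normalized attention weights a_t
    return a @ H               # (12): S = sum_t a_t h_t

rng = np.random.default_rng(1)
H = rng.standard_normal((6, 8))      # 6 time steps, hidden size 8
W = rng.standard_normal((8, 8))
b = np.zeros(8)
u_w = rng.standard_normal(8)         # context vector
S = attention_pool(H, W, b, u_w)     # sentence representation, shape (8,)
```

Because the weights $a_t$ sum to one, $S$ is a convex combination of the hidden states, i.e., a soft selection of the most informative time steps.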

A projection layer is another optimization layer that connects the hidden states of the LSTM units to the output layers. It is usually used to reduce the dimensionality of the representation (the LSTM output) without reducing its resolution. There are several implementations of such layers; in this paper, we select a simple one: a feed-forward neural network with one hidden layer.

Convolutional Neural Network [ ] takes word embeddings $w_i \in \mathbb{R}^{d}$ as input, where $w_i$ corresponds to the $i^{th}$ word in the sentence and $d$ is the dimension of the word embedding. Given a sentence with $n$ words, the sentence can thus be represented as an embedding matrix $W \in \mathbb{R}^{n \times d}$.

In the convolutional layer, several filters, also known as kernels, $k \in \mathbb{R}^{hd}$, slide over the embedding matrix $W$ and perform convolution operations to generate features $c_i$. The convolutional operation is calculated as:

$c_i = f(w_{i:i+h-1} \cdot k^{\top} + b)$ (13)

where $b \in \mathbb{R}$ is the bias term and $f$ is the activation function (for instance, a sigmoid function). $w_{i:i+h-1}$ refers to the concatenation of the vectors $w_i, \cdots, w_{i+h-1}$, and $h$ is the number of words that a filter is applied to. Usually there are three filters with $h$ equal to one, two, or three to simulate the uni-gram, bi-gram, and tri-gram models, respectively. Applying a filter to every possible window of words in the sentence produces a feature map:

$c = [c_1, c_2, \cdots, c_{n-h+1}]$ (14)

A convolutional layer is usually followed by a max-pooling layer to select the most significant n-gram feature across the whole sentence by applying a max operation $\hat{c} = \max\{c\}$ on each filter.
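The convolution of Equation (13), the feature map of Equation (14), and the max-pooling step can be sketched together as follows; the ReLU activation, the random filters, and all shapes are illustrative assumptions:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def conv_feature_map(W_emb, k, b=0.0, f=relu):
    """Slide a filter k (shape (h, d)) over the embedding matrix W_emb
    (shape (n, d)). Each entry follows Eq. (13); the resulting vector is
    the feature map c of Eq. (14), of length n - h + 1."""
    h = k.shape[0]
    n = W_emb.shape[0]
    return np.array([f(np.sum(W_emb[i:i + h] * k) + b)
                     for i in range(n - h + 1)])

rng = np.random.default_rng(2)
W_emb = rng.standard_normal((7, 5))     # 7 words, embedding size 5
feats = []
for h in (1, 2, 3):                     # uni-, bi-, and tri-gram filters
    k = 0.1 * rng.standard_normal((h, 5))
    c = conv_feature_map(W_emb, k)      # feature map, shape (7 - h + 1,)
    feats.append(c.max())               # max-pooling: c_hat = max{c}
```

Each filter thus contributes a single pooled feature regardless of sentence length, which is what makes the representation fixed-size.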

Inspired by [

Multi-task learning is a learning mechanism that improves performance on the current task after having learned a different but related concept or skill on a previous task. It can be performed by learning tasks in parallel while using a shared representation, such that what is learned for each task can help the other tasks be learned better. This idea can be traced back to 1998, when [

In natural language processing, [

In order to formulate the problem, we first give the definition of Multi-task Learning from [

Definition 1. (Multi-Task Learning) Given $m$ learning tasks $\{T_i\}_{i=1}^{m}$ where all the tasks or a subset of them are related, multi-task learning aims to improve the learning of a model for $T_i$ by using the knowledge contained in all or some of the $m$ tasks.

Based on the definition of multi-task learning, we can formulate our problem as $\{T_i\}_{i=1}^{2}$, where $m = 2$ corresponds to the relatedness task ($T_1$) and the entailment task ($T_2$). Both are supervised learning tasks accompanied by a training dataset $D_i$ consisting of $n_i$ training samples, i.e., $D_i = \{x_j^i, y_j^i\}_{j=1}^{n_i}$, where $x_j^i \in \mathbb{R}^{d_i}$ is the $j^{th}$ training instance in $T_i$ and $y_j^i$ is its label. We denote by $X_i$ the training data matrix for $T_i$, $X_i = (x_1^i, \cdots, x_{n_i}^i)$, and by $Y_i$ its labels. In our case, the two tasks share the same training instances but with different labels ($X_1 = X_2$ and $Y_1 \neq Y_2$). Our objective is to design and train a neural network structure to learn a mapping $F: \{x_j^1 \rightarrow (y_j^1, y_j^2)\}_{j=1}^{n_i}$ or $F: \{x_j^2 \rightarrow (y_j^1, y_j^2)\}_{j=1}^{n_i}$.

Following the hard parameter sharing approach, we implemented a feed-forward neural network. The main structure of our system is illustrated in

In the input layer, two sentence embedding layers first transform the input sentences into semantic vectors, which represent the semantic meanings of these sentences, using a variety of encoding structures. Parts (a) and (b) of the

Except for the two examples shown in

The concatenation layer aims to create a vector that can combine the information of the two sentence vectors. Following the previous research [

The input layer and the concatenation layer are shared by both tasks. During the training process, the input sentence pairs of both tasks will be processed by these shared layers and the parameters in these shared layers will be affected by both tasks simultaneously.

On top of the shared structure, we build two output layers, one for each task, to generate task-specific outputs for the two given tasks. In terms of machine learning, the semantic relatedness task is a regression task, so a linear function is used as the activation function to generate the relatedness scores between sentence pairs. The textual entailment task is a classification problem, so a softmax function is selected as the activation function to generate a probabilistic distribution over the entailment labels of the sentence pairs.
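A minimal sketch of this hard-parameter-sharing layout follows, with a single linear layer standing in for the shared encoder (the actual system uses the LSTM/CNN encoders and the concatenation layer described above); all names and shapes are illustrative:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

class SharedMultiTaskModel:
    """Hard parameter sharing: one shared representation feeding
    a linear regression head (relatedness) and a 3-way softmax
    head (entailment)."""

    def __init__(self, in_dim, hid_dim, rng):
        self.W_shared = 0.1 * rng.standard_normal((hid_dim, in_dim))
        self.w_rel = 0.1 * rng.standard_normal(hid_dim)       # regression head
        self.W_ent = 0.1 * rng.standard_normal((3, hid_dim))  # classification head

    def forward(self, x):
        z = np.tanh(self.W_shared @ x)     # shared representation (both tasks)
        score = float(self.w_rel @ z)      # relatedness: linear output
        probs = softmax(self.W_ent @ z)    # entailment: distribution over 3 labels
        return score, probs

rng = np.random.default_rng(3)
model = SharedMultiTaskModel(in_dim=16, hid_dim=8, rng=rng)
x = rng.standard_normal(16)                # concatenated sentence-pair vector
score, probs = model.forward(x)
```

Gradients from both task heads flow back into `W_shared`, which is how the shared layers are affected by both tasks simultaneously during training.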

The system can be learned by jointly optimizing the two task-specific loss functions simultaneously. For the relatedness task ($T_1$), the mean squared error loss between the system output $\hat{y}$ and the ground-truth score $y$ labeled in the corpora is used as the training loss function. The mathematical formula is:

$L_{mse} = \dfrac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$ (15)

where $n$ is the number of training samples and $i \in [1, n]$ is the index of a training sample.

For the entailment task ($T_2$), the cross-entropy loss between the system output $\hat{y}$ and the ground-truth label $y$ is used as the loss function. The mathematical formula can be described as:

$L_{ce} = -\sum_{i=1}^{n} \sum_{j=1}^{3} y_{ij} \log \hat{y}_{ij}$ (16)

where $n$ is the number of training samples, $i \in [1, n]$ is the index of a training sample, and $j \in [1, 3]$ is the index of a class label.

The joint loss function is obtained by taking a weighted sum of the loss functions of each of the two tasks, which is written as:

$Loss = \lambda_1 L_{mse} + \lambda_2 L_{ce}$ (17)

where $\lambda_1$ and $\lambda_2$ are the weights of the loss functions of the similarity and entailment tasks; they are added as hyperparameters during the training process. During the experiments, we first fine-tuned $\lambda$ over a large range, $\in [0, 10000]$, and then found that the system achieves its best performance when $\lambda$ is narrowed down to $\in [1, 2]$.
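The loss functions of Equations (15) to (17) can be computed directly; the toy labels and predictions below are illustrative:

```python
import numpy as np

def mse_loss(y, y_hat):
    """Eq. (15): mean squared error for the relatedness task."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return np.mean((y - y_hat) ** 2)

def cross_entropy_loss(Y, Y_hat, eps=1e-12):
    """Eq. (16): cross-entropy for the 3-class entailment task.
    Y holds one-hot gold labels, Y_hat predicted distributions."""
    Y, Y_hat = np.asarray(Y, float), np.asarray(Y_hat, float)
    return -np.sum(Y * np.log(Y_hat + eps))

def joint_loss(y_rel, y_rel_hat, Y_ent, Y_ent_hat, lam1=1.0, lam2=1.0):
    """Eq. (17): weighted sum of the two task-specific losses."""
    return (lam1 * mse_loss(y_rel, y_rel_hat)
            + lam2 * cross_entropy_loss(Y_ent, Y_ent_hat))

# Toy batch of two sentence pairs.
y_rel, y_rel_hat = [3.0, 4.5], [2.5, 4.0]
Y_ent = [[1, 0, 0], [0, 1, 0]]                    # one-hot gold labels
Y_ent_hat = [[0.7, 0.2, 0.1], [0.2, 0.6, 0.2]]    # predicted distributions
loss = joint_loss(y_rel, y_rel_hat, Y_ent, Y_ent_hat)
```

With $\lambda_1 = \lambda_2 = 1$, the joint loss here is simply $0.25 - \ln 0.7 - \ln 0.6 \approx 1.12$.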

This section shows the experimental results of the proposed model. The details of the experiments, including the use of the corpus, the evaluation metrics and the parameter settings, will be discussed first and the experimental results of the RNN and CNN based models will be shown afterwards.

The Sentences Involving Compositional Knowledge (SICK) benchmark [

We followed the standard split of the corpus into training, development, and testing sets. Accuracy is used as the evaluation metric for the entailment task. The mathematical representation of accuracy is:

$\text{Accuracy} = \dfrac{N_{correct}}{N_{total}}$ (18)

where $N_{correct}$ is the number of examples with correct entailment labels and $N_{total}$ is the total number of examples. The Pearson correlation coefficient (Pearson's $r$) is used as the evaluation metric for the relatedness task. The mathematical representation of Pearson's $r$ is:

$\rho_{X,Y} = \dfrac{\mathrm{cov}(X, Y)}{\sigma_X \sigma_Y}$ (19)

where $X$ and $Y$ are the predicted and ground-truth relatedness scores of the testing examples, $\mathrm{cov}$ is the covariance, and $\sigma_X$, $\sigma_Y$ are the standard deviations of $X$ and $Y$.
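Both evaluation metrics, Equations (18) and (19), can be sketched as follows; the toy gold and predicted values are illustrative:

```python
import numpy as np

def accuracy(gold, pred):
    """Eq. (18): fraction of examples with the correct entailment label."""
    gold, pred = np.asarray(gold), np.asarray(pred)
    return float(np.mean(gold == pred))

def pearson_r(X, Y):
    """Eq. (19): sample covariance over the product of sample
    standard deviations (ddof=1 throughout for consistency)."""
    X, Y = np.asarray(X, float), np.asarray(Y, float)
    return np.cov(X, Y)[0, 1] / (np.std(X, ddof=1) * np.std(Y, ddof=1))

gold_labels = ["entailment", "neutral", "contradiction", "neutral"]
pred_labels = ["entailment", "neutral", "neutral", "neutral"]
acc = accuracy(gold_labels, pred_labels)       # 3 of 4 labels correct

scores_gold = [1.0, 2.5, 3.3, 4.8]             # annotated relatedness scores
scores_pred = [1.2, 2.2, 3.6, 4.5]             # system outputs
r = pearson_r(scores_gold, scores_pred)        # close to 1 for good systems
```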

The neural network model was trained using the gradient-based optimization method Adam [

For the RNN models, the hidden layer size of the LSTM is 128, and the hidden layer size of the first fully connected layer is 128 and 256 for the LSTM and Bi-LSTM models, respectively. The hidden layer size of the second fully connected layer is 64.

For the CNN models, the parameters of the filters are length = 128, stride = 1, and padding = 1, and the number of layers of the Hierarchical ConvNet ranges from 1 to 4. The hidden layer sizes of the fully connected layers are the same as in the RNN models. We trained for a maximum of 20 epochs with a mini-batch size of 64. All the experiments were performed using PyTorch [

For each RNN model, we compared the single- and multi-task learning models and illustrated the influence of different encoding methods (directions and contexts) on these models. The sentence pairs below are examples of the corpus annotations:

Sentence pair | Relatedness | Entailment
---|---|---
A: A player is running with the ball. B: Two teams are competing in a football match. | 2.6 | Neutral
A: A woman is dancing and singing in the rain. B: A woman is performing in the rain. | 4.4 | Entailment
A: Two dogs are wrestling and hugging. B: There is no dog wrestling and hugging. | 3.3 | Contradiction

For the CNN models, we showed the performance of a Hierarchical ConvNet with different numbers of convolutional layers and filters.

In this section, we will analyze the results of our experiments, including the comparisons 1) between the proposed single- and multi-task learning models on the given tasks, 2) among various encoding methods of the proposed RNN and CNN models, and 3) between our multi-task learning model and other state-of-the-art learning models in the literature.

From the experiments, it is clear that multi-task learning achieves better results than single-task learning on both tasks. In addition, we can observe that the performance improvement is biased toward the textual entailment task over the semantic relatedness task. This observation can be explained by the task hierarchy theory in multi-task learning: the common features learned from multiple tasks are usually more sensitive to the high-level tasks than to the low-level tasks. In [

Observing that Bi-LSTM performs consistently better than LSTM under every scenario from

Among these encoding contexts, the max pooling layer and the projection layer achieve approximately the same performance, and both surpass the attention layer. This is because the limited amount of training data is insufficient to fully train the proposed model, so the model starts to overfit the training data after the first several iterations of training. The projection layer and max pooling layer can mitigate overfitting by reducing the dimensionality of the sentence representation. In contrast, the attention layer is designed to select important components of sentences and has no ability to counteract overfitting. As a result, the projection layer and max pooling layer show relatively strong performance over the attention layer.

We observe from

We also observe that increasing the number of CNN layers in the Hierarchical ConvNet can hardly improve the system performance. The reason is again overfitting. Even though increasing the number of CNN layers can improve the representational ability of the system, it also increases the complexity of the system and raises the risk of overfitting.

Comparisons can also be made between our system and some of the recent state-of-the-art learning models on the same benchmark, including the best supervised learning model Dependency-tree LSTM [

From the results, we can observe that our system outperforms the best unsupervised and feature-engineered systems in the literature on the textual entailment task and achieves very competitive results compared to the transfer learning and multi-task learning models. In addition, the performance of our model on the semantic relatedness task is comparable to other models in the literature.

Category | Model | Relatedness | Entailment
---|---|---|---
Unsupervised Model | FastText | 0.815 | 78.3
 | SkipThought | 0.858 | 79.5
Feature Engineered Model | Dependency-Tree LSTM | 0.868 | --
 | Illinois-LH | -- | 84.5
Transfer Learning Model | InferSent | 0.885 | 86.3
Multi-task Learning Model | Joint | -- | 86.8
 | Ours-RNN | 0.848 | 85.6
 | Ours-CNN | 0.849 | 85.4

The reason that transfer learning outperforms our models is that a transfer learning model takes advantage of knowledge learned from external tasks. For instance, the InferSent system is pre-trained on the SNLI dataset, which contains 520K training instances for textual entailment. When applied to the SICK benchmark, the knowledge learned from the previous task can be directly transferred to the new task and improve its learning. In contrast, our models do not rely on previously learned knowledge and were trained entirely from scratch.

The reason that the state-of-the-art multi-task learning model can outperform our models is that it used a hierarchical architecture. Research [

In this paper, we explored multi-task learning mechanisms for training related NLP tasks. We performed single- and multi-task learning on textual entailment and semantic relatedness with a variety of deep learning structures. Experimental results showed that learning these tasks jointly leads to a significant performance improvement compared with learning them individually.

We believe that this work only scratches the surface of multi-task learning for related NLP tasks. Larger datasets, better architecture engineering, and possibly incorporating pre-trained knowledge into the training process could bring the system performance to the next level.

The authors declare no conflicts of interest regarding the publication of this paper.

Zhang, L.R. and Moldovan, D. (2019) Multi-Task Learning for Semantic Relatedness and Textual Entailment. Journal of Software Engineering and Applications, 12, 199-214. https://doi.org/10.4236/jsea.2019.126012