A Deep Look into Extractive Text Summarization

This investigation presents an approach to Extractive Automatic Text Summarization (EATS). A framework for summarizing a single document was developed using the Tf-Idf method (Term Frequency, Inverse Document Frequency) as a reference. The document is divided into a subset of documents and a value is generated for each of the words contained in each document. Documents whose Tf-Idf is equal to or higher than the threshold are the most important ones; they can therefore be weighted to generate a text summary according to the user's request. This document represents a model derived from text mining applications in today's world. We demonstrate how the summarization is performed. Randomly chosen threshold values were used to check its performance. The experimental results show satisfactory and understandable summaries that are produced efficiently and quickly, showing which text sentences are the most important according to the threshold selected by the user.


Introduction
With the creation of the internet, social networks, forums and information technologies, information has spread rapidly and has become increasingly difficult to understand, create, save, develop and store. An entire document still has to be read completely to decide whether the information it contains is relevant, which is a slow and overwhelming activity. But what if the information could be summarized so as to obtain keywords that help reduce time and effort in the decision-making process? Automatic text summaries (ATS) are the solution to this problem.
Before talking about text summaries, we must first clarify what a text summary (TS) is. A typical TS is a text that contains the most important information of one or more texts in a simplified form. The common stages of a TS are: identification of the most relevant text, interpretation of the information, and production of a summary from the interpreted information.
The objective of ATS is to reduce the amount of text while preserving the main idea of the original document, allowing the reader to interpret the information faster. ATS has gained popularity due to the need to analyze large amounts of textual information, for example to generate summaries of books, comics, reviews, news, scientific articles, internet content and social networks, among others; any type of textual information can be summarized. The excessive growth of information has forced researchers to seek different ways to obtain text summaries, and they have even created tools that summarize content consisting of illustrated text. ATS can be applied to a single document or to multiple documents, depending on the specifications required.
The paper is organized as follows: first, the Automatic Text Summarization techniques are described; then Extractive Automatic Text Summarization is described and analyzed; next, the different approaches to generating ATS are presented; then the experiment performing EATS is demonstrated; and finally, the experimental results are shown.

ATS Techniques
There are two approaches. The first is Abstractive Automatic Text Summarization (AATS): [1] determines that such methods aim to concisely paraphrase the information content of documents by creating new information. The second is Extractive Automatic Text Summarization (EATS): [2] concludes that such methods choose the most noticeable sentences in the documents and concatenate them to form a summary. [3] mentions that the most successful systems use EATS approaches, as they cut and join parts of the text to produce a reduced version; [4] therefore determines that EATS results in summaries containing information from the original text without any changes. [5] proposes that a typical EATS consists of two phases. The first phase is preprocessing; [6] determines that its objective is to transform textual data into clear elements, eliminating inconsistencies for future interpretation. In addition, they can attach new sentences that are not contained in the original document. The second phase, as [7] indicates, is the use of an approach whose objective is to reduce the length and detail of a document, preserving the sense and the most important points without changing its meaning. Figure 1 shows the diagram of a typical EATS.

Approaches to ATS
1) Graph-based: [8] demonstrates that each sentence is treated as a node, and two sentences are connected with an edge if they have some similarity. These methods are essentially a way of deciding the importance of a vertex within a graph, based on the information extracted from the graph structure.
2) Machine Learning (ML) based: [9] states that the selection of important text is represented as a binary classification problem that divides the sentences of the input into summary and non-summary sentences; the probability that a sentence should be in the summary is the sentence score. The classifier therefore takes the sentence representation as input and produces the sentence score as output. The sentences with the highest scores are selected to form the summary.
3) Neural networks: [10] states that these are networks adapted to discover which sentences are important and unimportant, in order to determine the summary value of each sentence in a document.
4) Grouping or clustering: [11] proposes analyzing documents by grouping similar information for later comparison. Based on how often the information is repeated, a summary can be constructed using a sequence of sentences related to the calculated clusters.
5) Tf-Idf method: [12] deduces that it is named after term frequency (Tf) and inverse document frequency (Idf); it is a statistical method that shows the importance of a token in a document. Tf(term) is the number of times the term appears in the document, and Idf is the logarithm of the total number of documents divided by the number of documents containing the term. It is calculated with the formula $\text{Tf-Idf} = \mathrm{Tf} \times \log\left(\frac{dn}{df}\right)$, where df is the number of documents in which the token appears and dn is the total number of documents.

Related Work
EATS has gained interest from researchers over the years, implementing mul- [13] analyze the advantages and disadvantages of using a fuzzy logic algorithm in EATS. [14] [15] propose a semantic, graph-based method for multi-document EATS using statistical methods based on machine learning. [16] [17] present a method based on genetic algorithms for obtaining EATS. [1] develop a neural network based on a sequential model that offers visualization of the predictions with respect to the content information. [18] propose a model that extracts individual sentences by modeling the relationship between sentences. [19] extract sub-sentences based on decision trees, using a neural model. [20] describe an approach oriented to EATS of news through a hybrid algorithm combining semantic analysis and random fields, producing coherent and detailed information.

Development
This research details the procedure for EATS development using the Tf-Idf method. Figure 2 shows the proposed diagram for the EATS with the Tf-Idf method.

Preprocesses Application
Literature [6] shows that applying preprocesses ensures that the document information has been filtered. The idea is to standardize the text for subsequent analysis; preprocessing is also a means of structuring the text efficiently later on.
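The text does not list the individual preprocesses, so the following is only a minimal sketch of what such a standardization step could look like, assuming lowercasing, accent and punctuation removal, tokenization and stop-word filtering. The function name `preprocess` and the tiny `STOP_WORDS` set are hypothetical and are not taken from the original implementation.

```python
import re
import unicodedata

# Hypothetical, deliberately tiny stop-word list; a real run would use a fuller Spanish list.
STOP_WORDS = {"el", "la", "los", "las", "de", "en", "y", "a", "que", "un", "una"}

def preprocess(text: str) -> list[str]:
    """Lowercase the text, strip accents and punctuation, tokenize, drop stop words."""
    text = text.lower()
    # Decompose accented characters and drop the combining marks (for example, á becomes a).
    text = "".join(
        ch for ch in unicodedata.normalize("NFD", text)
        if unicodedata.category(ch) != "Mn"
    )
    # Keep only runs of ASCII letters and digits as tokens.
    tokens = re.findall(r"[a-z0-9]+", text)
    return [t for t in tokens if t not in STOP_WORDS]

print(preprocess("El resumen automático de texto reduce la información."))
# prints ['resumen', 'automatico', 'texto', 'reduce', 'informacion']
```

Stripping accents is a simplification chosen to keep the sketch short; for Spanish it also folds ñ into n, which a production pipeline may want to avoid.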

Division of Documents
The division of the document into subsets of documents follows [21], who determine that, with a vector space representation, it is necessary to create a jagged array: because its rows may have unequal numbers of columns, it can store each of the vectors coming from the documents. The Tf-Idf method is then applied to each vector. [22] concludes that it is a statistical method that reflects the importance of a token in a given document. It combines two values: [23] confirms that Tf is the term frequency, which measures the number of times a token appears in a document, while Idf shows how important the token is in the document. [24] point out that it is essentially a way of assigning a weight to a word or token (term) within a document; therefore, the higher the Tf, the more representative the token is of the document. Idf is calculated with the formula $\mathrm{Idf}(w) = \log\left(\frac{d}{df(w)}\right)$, where df(w) is the number of documents in which the token appears and d is the total number of documents. If the token appears in many documents, Idf(w) is low and the token has low discrimination power; if the token appears in few documents, Idf(w) is high and the token has high discrimination power across the whole set of documents.
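The division and the Idf calculation can be sketched as follows. This is an illustration under stated assumptions, not the original implementation: the text does not say how the document is cut, so splitting on sentence-ending punctuation is an assumption, as is the use of the natural logarithm. The helper names `split_into_documents` and `inverse_document_frequency` are hypothetical.

```python
import math
import re

def split_into_documents(text):
    """Split a text into sentence-level documents, each a list of lowercase tokens."""
    # Assumption: a new document starts after '.', '!' or '?'.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    # Jagged array: one row per document, and rows may have different lengths.
    docs = [re.findall(r"\w+", sentence.lower()) for sentence in sentences]
    return [doc for doc in docs if doc]

def inverse_document_frequency(docs):
    """Return Idf(w) = log(d / df(w)) for every token w in the collection."""
    d = len(docs)
    df = {}
    for doc in docs:
        for token in set(doc):  # count each document at most once per token
            df[token] = df.get(token, 0) + 1
    return {token: math.log(d / count) for token, count in df.items()}

docs = split_into_documents(
    "Text mining is useful. Text summaries save time. Summaries help readers."
)
idf = inverse_document_frequency(docs)
print(len(docs))                      # 3 documents
print(idf["text"] < idf["mining"])    # True: 'text' occurs in more documents
```

A list of lists plays the role of the jagged array here, since each inner list can hold a different number of tokens.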

Values of Tf-Idf
The value of Tf-Idf increases proportionally with the number of times a token appears in the document. Tf-Idf is calculated with the modified formula proposed by [25], in which Tf(w, s) is the number of times token w appears in the document and Idf(w) is the inverse document frequency of the token, obtained from the number of documents in which the token appears. The Tf-Idf values must then be averaged for each document: the values Idf(i, d) obtained from the calculation for each token are summed and divided by w(s), the total number of tokens in the document.
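The modified formula of [25] is not reproduced in the text, so the sketch below substitutes the standard product Tf(w, s) × Idf(w) and should be read as an assumption rather than as the method actually used. The averaging step sums one score per token position and divides by the total number of tokens w(s), which is also an interpretation. The function name `average_tf_idf` is hypothetical.

```python
import math
from collections import Counter

def average_tf_idf(docs):
    """Return the average Tf-Idf score of each document in a list of token lists."""
    d = len(docs)
    # df(w): number of documents that contain token w.
    df = Counter(token for doc in docs for token in set(doc))
    averages = []
    for doc in docs:
        if not doc:
            averages.append(0.0)
            continue
        tf = Counter(doc)  # Tf(w, s): occurrences of w in this document
        # Assumption: standard Tf x Idf product, one score per token position.
        scores = [tf[token] * math.log(d / df[token]) for token in doc]
        averages.append(sum(scores) / len(doc))  # divide by w(s)
    return averages

averages = average_tf_idf([["a", "b", "a"], ["b", "c"]])
print(averages)
```

In the toy input, token `b` appears in both documents, so its Idf is zero and it contributes nothing to either average.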

Threshold Value
The Threshold is the summary percentage level requested by the user; basically, it is the amount of text to be summarized. The Threshold is applied to the maximum average value of Tf-Idf to obtain the comparison value, $\text{Threshold}(\text{Tf-Idf}) = \max(\text{Tf-Idf}) \times \text{Threshold}$, with the Threshold expressed as a fraction (for example, 0.90 for 90%). The EATS is obtained from those documents whose average Tf-Idf value is equal to or higher than Threshold(Tf-Idf).
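The selection step can be sketched as below, assuming the comparison value is the requested percentage applied to the highest document average, which is consistent with the values reported in the Results. The function name `select_documents` and the sample averages are hypothetical and purely illustrative.

```python
def select_documents(averages, threshold_pct):
    """Return the cut-off value and the indices of documents at or above it."""
    if not averages:
        raise ValueError("at least one document average is required")
    # Assumption: cut-off = maximum average Tf-Idf x requested percentage.
    cutoff = max(averages) * threshold_pct / 100
    selected = [i for i, avg in enumerate(averages) if avg >= cutoff]
    return cutoff, selected

# Illustrative averages only, not values from the experiment.
cutoff, selected = select_documents([10.0, 40.0, 5.0, 37.0], 90)
print(cutoff)    # 36.0
print(selected)  # [1, 3]
```

Returning indices rather than text keeps the original sentence order available, so the summary can be assembled by joining the selected documents in the order in which they appear.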

Results
This research was developed in Visual Studio Community 2019. A Spanish-language document with a total of 2431 tokens was used to evaluate the performance of the EATS. Figure 3 shows an example of the analyzed document.
Two Threshold levels, 90% and 35%, were randomly selected to verify the operation of the program, and one test was performed with each.

Execution Time
The total execution time was 37.962 seconds for both the 90% and the 30% Threshold. This is because the calculation is performed on the same set of documents, so there is no variation between the two runs. Figure 4 shows the execution time.

Number of Tokens per Document
By dividing the document into smaller documents, a total of 47 independent documents were obtained. A given number of tokens were obtained for each document. Figure 5 shows the result of tokens per document.

Tf per Token in the Document
Obtaining the Tf of each token within each document was essential, as it allows the calculation of Tf-Idf. Figure 6 shows an example of the Tf calculation for document number 1.

Frequency of Token per Document
It was necessary to count the number of documents in which a token appears in order to perform the operations required for the Tf-Idf calculation. It is ob-

Idf per Document
Idf determines the token hierarchy within the document: the higher the Idf value, the higher the relevance. The highest values correspond to tokens number 11, 13, 15 and 16, each with a value of 3.87. Figure 8 shows the result of the Idf calculation for document number 1.

Tf-Idf Calculation Result
The results of the Tf-Idf calculation for document number 1 show that the lowest value was that of token number 16, with a value of 2.77, and that the highest values correspond to tokens number 2 and 17, with a value of 99.81. Figure 9 shows an example of the results of the Tf-Idf calculation for document number 1. To check the performance of the EATS, a test was performed with a 90% Threshold value for the 47 documents; the results are shown below.

Results of Calculation of Terms with a Threshold Value of 90%
This type of Extractive Automatic Text Summarization calculation makes it possible to extract the subset of the text with the greatest importance. Figure 10 shows the calculation results with a Threshold of 90%. The lowest value was that of document number 25, with 4.62, and the maximum value was that of document number 12, with 35.41.

Results of Values Equal to or Greater than a Threshold of 90%
When performing the calculation with a Threshold of 90%, a value of 31.87 was obtained. This value was compared with the values in Figure 10: document number 12 has a value of 35.41, which is above it, and is therefore the most representative document. Figure 11 shows the result of the comparison with a Threshold of 90%.
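The reported comparison value can be checked by hand, assuming it is the Threshold applied to the maximum document average from Figure 10. Under that assumption the arithmetic matches the reported 31.87 after rounding:

```latex
\[
\text{Threshold}(\text{Tf-Idf}) = 0.90 \times 35.41 = 31.869 \approx 31.87,
\qquad 35.41 \ge 31.87 .
\]
```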

Final Result with a Value Equal to or Greater than a Threshold of 90%
Figure 12 shows the final result with a value equal to or greater than a Threshold of 90%.

Hardware Specifications
Laptop with an Intel(R) Core(TM) i7-8750H CPU @ 2.20 GHz, an NVIDIA GeForce GTX 1650 graphics card and 32 GB of RAM.

Conclusions
The application and execution of text mining (TM) preprocesses are fundamental for the optimal functioning of EATS. Applying a spell checker reduces the execution time of the overall TM preprocessing.
Preprocessing provides the elements needed to structure the text; in addition, it facilitates counting similar elements to obtain the Tf of the tokens.
With the application of the Tf-Idf method, statistical mathematical operations are generated whose results need to be stored, product of the divi- The Tf-Idf method allows weighting the vectors, and the approach used proves to be effective and functional. The characteristics of the text depend on the language used; in this research the method was adapted for Spanish and worked correctly. The value of Idf demonstrates the importance of the token within the documents.
The EATS process is laborious, but when it is performed carefully the results are satisfactory and clear. It can also be applied to a single document or to several documents.
The values obtained through the Tf-Idf method allow results to be compared; the comparison value depends on the Threshold level requested by the user.
EATS aims to show the most relevant text according to the Threshold requested by the reader. This investigation verifies that it works correctly: the experimental results presented show that texts can be summarized efficiently with the Tf-Idf method.
The main idea of EATS is to help reduce the total reading time. The final results are statistical in nature, but they are presented by showing the most important text. The performance of this model is not affected on computers with low hardware resources; it does not consume large amounts of memory, so it can be applied on any computer.

All the EATS examples carried out in this research were applied to the Spanish language and the results are satisfactory; therefore, the method works correctly.