Finding Seminal Papers Using Data Mining Techniques

The aim of this contribution is to show how seminal papers can be detected using data mining techniques. To this end, RapidMiner Studio and its data mining tools are applied to data built from information extracted from Google Scholar and Scopus in three different areas of knowledge. Other software, such as Microsoft Excel and Publish or Perish, is also used in the process. Comparing the results of the searches for Knowledge Management, Entrepreneurship and Marketing shows no marked similarity between the sets of articles obtained in Google Scholar and Scopus. The values of the Similarity Index remained below 0.52%, similar for Knowledge Management and Entrepreneurship but lower for Marketing. Outlier detection with data mining techniques, and in particular with RapidMiner, made it possible to determine the seminal papers for the three search terms analyzed and to characterize them in the space VY = f(C), in Google Scholar and Scopus. It was shown that the seminal articles can differ depending on whether Google Scholar or Scopus is used. The results suggest checking, for other search terms, whether the trend found is maintained.


Introduction
Knowing the articles that laid the foundations of a specialty or a specific research topic has long been regarded as one of the essential objectives of a literature review (Hart, 1998). The literature review, necessary in any investigation, has been aptly defined by Webster & Watson (2002) as the analysis ... One of the main objectives in preparing a state of the art is to identify those articles that have laid both the conceptual and the methodological foundations of a discipline, that is, those contributions that in fact "do not age" (Singer, 2009). It is therefore common in the specialized literature to find studies that determine the seminal articles published in a given journal (Parkinson et al., 2013), the role of one of these contributions in a particular discipline (Dolman, Miralles, & de Jeu, 2014) or in a specific technique (Nash, Walker, Gidwani, & Ajuied, 2015; Nash, Walker, Lucas, & Ajuied, 2016), or the most important contributions in a given branch of science (Riordon, Zubritsky, & Newman, 2000).
The importance of identifying the so-called seminal articles has been recognized as a de facto standard when preparing a state of the art in the most diverse disciplines. To identify these articles of unquestionable significance in an investigation (Berkani, Hanifi, & Dahmani, 2020; Silva, Villa, & Cabrera, 2020), different alternatives have been proposed, such as collaborative models (Wang & Blei, 2011) and personalized systems for recommending the most relevant articles (Pera & Ng, 2011). Less studied has been how to identify these articles and their possible genealogy (Bae, Hwang, Kim, & Faloutsos, 2011, 2014). The fact is that today's researcher faces a volume of information that makes it far from simple to find the most relevant works, and doing so requires considerable time and effort (Alonso, Perez, & Hidalgo, 2016; Bravo Hidalgo & León González, 2018).
Within this problem area, this contribution started from the research idea that seminal articles are recognized as such, and do not age, for two reasons:
1) They have been cited in a significant way, that is, they are recognized by the scientific community.
2) They remain valid for several years.
These two simple reasons should lead them to stand out as outliers in the space VY = f(C), where VY is the Validity in Years of a given article, that is, the time elapsed from the publication of the article until the current date, and C is the number of citations received by the article during that period.
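As a minimal sketch of how a single article could be mapped to a point in this space, the following Python fragment computes VY from the publication year and pairs it with C. The `Article` class, its field names and the helper functions are hypothetical and only illustrate the definitions above; they are not part of the tooling used in this study.

```python
from dataclasses import dataclass


@dataclass
class Article:
    """Hypothetical record for one article returned by a search."""
    title: str
    year: int   # publication year
    cites: int  # C: citations received since publication


def validity_in_years(article: Article, current_year: int) -> int:
    """VY: years elapsed from publication until the current date."""
    return current_year - article.year


def to_point(article: Article, current_year: int) -> tuple[int, int]:
    """Map an article to the pair (C, VY) in the space VY = f(C)."""
    return (article.cites, validity_in_years(article, current_year))
```

For instance, an article published in 1984 with 500 citations, evaluated in 2019, would map to the point (500, 35).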
Data mining offers different possibilities for data analysis (Berkhin, 2006), including different techniques (Bakar, Mohemad, Ahmad, & Deris, 2006; Buthong, Luangsodsai, & Sinapiromsaran, 2013) and algorithms for the detection of atypical values (Ramaswamy, Rastogi, & Shim, 2000). At the same time, different applications have been developed (Rangra & Bansal, 2014) that facilitate the use of data mining. Among these, RapidMiner offers a whole set of possibilities (Amer & Goldstein, 2012; Jungermann, 2009), in particular for the detection of outliers (Buthong et al., 2013). The outlier has long been defined (Barnett & Lewis, 1974) as an observation, or set of observations, that seems to be inconsistent with the data set under analysis.
Starting from these considerations, this contribution set out to determine whether seminal articles can be distinguished as outliers in the space VY = f(C), using the possibility offered by RapidMiner (https://rapidminer.com/) to classify them in that space. Another aspect that cannot be ignored is how the articles are determined and the number of citations received by each one. For this purpose, this research also explored the degree of coincidence among the articles considered seminal when using Google Scholar (Mar-

Material and Methods
To form the space VY = f(C), both Scopus and Google Scholar were searched for the following terms in English, in the title of the articles and for the period 1960-2019: 1) Knowledge management 2) Entrepreneurship 3) Marketing.
For each of the search terms, the 990 most-cited articles were selected. These ...
In order to compare the similarity between the two sets of articles determined for each term, a Similarity Index (SI) was calculated from: ...
The process that can be configured in RapidMiner is shown in Figure 1.
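To illustrate the kind of comparison a similarity index between two article sets involves, the sketch below uses the Jaccard index (shared articles divided by all distinct articles). This is an assumption made purely for illustration and is not necessarily the definition of SI used in this study; the function name and the title normalization are likewise hypothetical.

```python
def jaccard_similarity(titles_a, titles_b):
    """Jaccard overlap between two collections of article titles.

    Assumption: two records are the same article when their titles match
    after lowercasing and collapsing whitespace.
    """
    def normalize(title):
        return " ".join(title.lower().split())

    set_a = {normalize(t) for t in titles_a}
    set_b = {normalize(t) for t in titles_b}
    union = set_a | set_b
    if not union:
        return 0.0  # two empty sets: define the overlap as zero
    return len(set_a & set_b) / len(union)
```

In practice, matching records between Google Scholar and Scopus by title alone can miss articles whose titles are recorded differently, so any value obtained this way should be read as a lower bound on the true overlap.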
The first operator reads the Excel file and processes the Cites and Validity fields; this was done for each search term and for each of the databases used (Google Scholar and Scopus). The second operator identifies the outliers in the data set and allows both the number of neighbors (k) and the number of outliers (n) to be specified. To be able to compare the different search terms, these parameters were fixed, after some preliminary tests, at the same values for all searches. The distances between each point and its neighbors were calculated as Euclidean distances. In practical terms, an attempt was made to answer the question: How can the 10 articles that can be considered seminal be determined for each of the search terms analyzed?
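The role of the two parameters k and n can be sketched in a few lines of Python. The fragment below follows the distance-based idea of Ramaswamy, Rastogi, & Shim (2000), cited above: each point is scored by the Euclidean distance to its k-th nearest neighbor, and the n points with the largest scores are reported as outliers. It is a simplified illustration under that assumption, not the RapidMiner implementation, and the function name is hypothetical.

```python
import math


def knn_distance_outliers(points, k, n):
    """Return the indices of the n points farthest from their k-th nearest neighbor.

    points: list of (C, VY) pairs; k: number of neighbors; n: number of outliers.
    """
    if k < 1 or k >= len(points):
        raise ValueError("k must satisfy 1 <= k < number of points")
    scores = []
    for i, p in enumerate(points):
        # Euclidean distances from p to every other point, smallest first
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append((dists[k - 1], i))  # distance to the k-th nearest neighbor
    # Largest score first; ties are broken by the original index
    scores.sort(key=lambda s: (-s[0], s[1]))
    return [i for _, i in scores[:n]]
```

With n = 10 this would return ten candidate seminal articles per search term. Note that citation counts and years live on very different scales, so raw Euclidean distance is dominated by C; whether and how the fields were normalized is not stated here, and the sketch applies no normalization.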

Analysis of Seminal Articles
The results obtained for the three search terms used are shown below in Table 2. Table 2 presents the results for the SI for the articles determined as outliers, which can therefore be categorized as seminal, using Google Scholar and Scopus and for the three search criteria used. In other words, this table identifies each of the documents detected as outliers.

Figure 2. Outliers in the space VY Scopus = f(C Scopus); knowledge management case.

Conclusion
When comparing the results obtained for searches in Scopus and Google Scholar for Knowledge Management, Entrepreneurship and Marketing, no marked similarity was found between the sets of articles that were obtained. Outlier detection with data mining techniques, and in particular with RapidMiner, made it possible to determine the seminal papers for the three search terms analyzed and to characterize them in the space VY = f(C) in Google Scholar and Scopus. It was shown that the seminal articles can differ depending on whether Google Scholar or Scopus is used. The results suggest checking, for other search terms, whether the trend found is maintained.