The Application of Epidemiology for Categorising DNS Cyber Risk Factors

Epidemiology can be applied to cybersecurity as a novel approach for analysing and detecting cyber threats and their risks. It provides a systematic model for the analysis of likelihood, consequence, management, and prevention measures, examining malicious behaviours as if they were a disease. There are a few research studies on discrete cybersecurity risk factors; however, there is a significant research gap in the analysis of collective cyber risk factors and the measurement of their cyber risk impacts. Effective cybersecurity risk management requires the identification and estimation of the probability of infection, based on a comprehensive range of historical and environmental factors, including human behaviour and technology characteristics. This paper explores how an epidemiological principle can be applied to identify cybersecurity risk factors, which comprise both human and machine behaviours. It conducts a preliminary analysis of the relationships between these risk factors utilising Domain Name System (DNS) data sources. The experimental results indicated that the epidemiological principle can effectively examine and estimate cyber risk factors. The proposed principle has great potential for enhancing new machine learning-enabled intrusion detection solutions, where it can serve as a risk assessment module.


Introduction
The cyber terrain continues to expand at a rapid pace. From vehicles to fridges, against IoT networks, which denotes a set of linked computers cooperating to implement suspicious and repetitive events to corrupt the resources of a victim such as DNS amplification attacks.
Cybersecurity systems, especially the intrusion detection and prevention variants used in industry, mostly discover abnormal behaviours using anomaly-based, signature-based, heuristics-based, or hybrid methods [4]. These methods are effective at discovering well-known malicious activities and known botnets, but they often fail to recognise new variants of attacks and new botnet families. They demand domain experts' knowledge to cope with new types of botnets [5]. The existing methods need manual updates to their blacklists; therefore, they need more computational power and cannot discover new attack families. One recent approach is machine learning-based intrusion detection, which attempts to learn abnormal behaviours from data and classify them [6].
Machine learning methods are also vulnerable to adversarial attacks that exploit the learning process [7]. We attempt to develop a new methodology that enhances the detection of botnets and new attack families by assessing the progression of cyber threats. The new methodology draws on epidemiology, the study of disease distribution and progression [8]. We propose a novel epidemiology-based cyber-risk detection approach for understanding cybersecurity risk. This paper examines current literature on discrete cybersecurity risk factors and identifies the research gap in the analysis of collective cybersecurity risk factors. It explores how epidemiological principles can be applied to determine a range of factors, comprising both human and machine behaviours and characteristics profiled as risk factors. It then conducts a preliminary analysis of the relationships between these risk factors utilising DNS data. DNS data contains strong indicators of human and machine behaviour. Hence, it is a relevant data type for exploring the relationship between people, devices, data and process, all fundamental elements of IoT.

Epidemiology and Cyber Security
Over recent decades humans have become increasingly connected, both in physical communities and in technology. The concept of epidemiology first emerged around 300 BCE out of the Hippocratic philosophy which began to shift public health from "mysticism to patient-oriented empiricism" [9]. Epidemiology is contemporarily defined as "the study (scientific, systematic, data-driven) of the distribution (frequency, pattern) and determinants (causes, risk factors) of health-related states and events (not just diseases) in specified populations (patient is community, individuals viewed collectively), and the application of (since epidemiology is a discipline within public health) this study to the control of health problem" [8]. It is a relationship and pattern-driven discipline, aimed at "comparisons to establish cause-effect relationships, evaluate information and make good decisions that will improve outcomes" [10]. The author illustrates that "human disease does not occur at random; there are factors or determinants which can increase or decrease the likelihood of developing disease". Therefore, infection can be assessed through a calculation of risk, where risk is likelihood multiplied by consequence.
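The risk calculation described above can be illustrated with a minimal sketch. The function name, the host names and the likelihood and consequence values below are hypothetical and chosen only for illustration; they are not taken from the dataset analysed in this paper.

```python
def risk_score(likelihood: float, consequence: float) -> float:
    """Return risk as likelihood multiplied by consequence.

    likelihood is treated as a probability in [0, 1]; consequence is a
    non-negative impact score on an arbitrary (assumed) scale.
    """
    if not 0.0 <= likelihood <= 1.0:
        raise ValueError("likelihood must be a probability in [0, 1]")
    if consequence < 0:
        raise ValueError("consequence must be non-negative")
    return likelihood * consequence


# Hypothetical hosts: (likelihood of compromise, consequence of compromise).
hosts = {"workstation": (0.30, 2.0), "mail-server": (0.05, 9.0)}

# Rank hosts from highest to lowest risk.
ranked = sorted(hosts, key=lambda name: risk_score(*hosts[name]), reverse=True)
print(ranked)
```

Under these assumed values, a frequently exposed workstation can outrank a more critical but better protected server, which is the kind of trade-off the risk calculation is meant to expose.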
Epidemiologists study root cause, community burden, history, impact, prevention, and management for diseases. To determine the risk of a particular disease or diseases on a person or community, epidemiologists study a range of "risk factors" [11]. These risk factors include genetic profiles, environmental factors [8], behaviours and health status including nutrition and inoculation history. These concepts can be directly applied to cybersecurity, where the human disease is equivalent to computer compromise, and human communities are equivalent to networks or the internet of everything (IoE). Cybersecurity is becoming more critical by the day. The internet of everything is evolving, and cybersecurity attacks are becoming both more likely and more devastating in their consequences. Cybersecurity attacks are a contemporary and human-led "disaster". Attacks can destroy individual livelihoods, businesses and whole economies, as well as weaken nation states economically and militarily [12].
They can take down critical infrastructure, from communications to water and power [13]. David Heyman, the director of Homeland Security at the Center for Strategic and International Studies, summarised cybersecurity risk as follows: "we have a great sense of vulnerability, but no sense of what it takes to be prepared" [14].
Applications of epidemiology to cybersecurity have been prevalent for decades. In 1983, the technical computer virus was defined as "a program that can 'infect' other programs by modifying them to include a possibly evolved copy of itself. With the infection property, a virus can spread throughout a computer system or network using the authorizations of every user using it to infect their programs. Every program that gets infected may also act as a virus and thus the infection grows" [15]. This naming convention initiated the biological theme which has expanded to other forms of malware including "worms". The impact of cybersecurity incidents can be measured using epidemiological terminology, through "prevalence" and "cost of illness", where prevalence is the "number of existing cases of a disease in a population at a given time" [16] and cost of illness is likened to the cost of remedy, including lost productivity, the costs of replacing hardware and software, potential reputational damage, etc.
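As a minimal sketch of these two measures, prevalence can be expressed as the proportion of hosts that are infected at a given time, and cost of illness as a sum of remediation costs. The function names, the cost categories as parameters and all figures below are illustrative assumptions, not values reported in this paper.

```python
def prevalence(existing_cases: int, population: int) -> float:
    """Proportion of the population (hosts) infected at a given time."""
    if population <= 0:
        raise ValueError("population must be positive")
    if not 0 <= existing_cases <= population:
        raise ValueError("existing_cases must lie between 0 and population")
    return existing_cases / population


def cost_of_illness(lost_productivity: float, hardware: float,
                    software: float, reputational: float = 0.0) -> float:
    """Total cost of remedy as a simple sum of the assumed cost components."""
    return lost_productivity + hardware + software + reputational


# Hypothetical example: 40 infected hosts on a network of 4000 hosts.
print(prevalence(40, 4000))                       # proportion infected
print(cost_of_illness(12000.0, 3500.0, 800.0))    # arbitrary currency units
```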
These elements can be applied to "provide a systematic framework for the application and analysis of disease causes, spread and consequence, which can then be assessed to inform effective prevention and management methodologies" [17]. Cybersecurity experts are faced with constantly evolving threats. Actor tools, techniques, strategies and targets change by the day, resulting in a significant range of "risk factors" for consideration. These risk factors range from individual hardware and software attributes to configurations, networks, environments and behaviours. Hence, epidemiology provides a novel approach for the systematic analysis of these numerous risk factors. As epidemiologists monitor and prepare for public health disasters such as COVID-19, cybersecurity professionals can apply equivalent principles for planning, prevention and response to security disasters [18]. Epidemiological approaches are also highly effective for "allocating limited resources to obtain maximal benefits in disaster situations" [19]. Given the resourcing issues that cybersecurity is currently facing, this is highly pertinent for supporting efficient cybersecurity incident response. Figure 1 illustrates the linkages between epidemiology and cybersecurity.
The internet was first considered in similarity to biological systems in 1999 [20]. Since then, the immense likeness between the "propagation of pathogens (viruses and worms) on computer networks and the proliferation of pathogens in cellular organisms (organisms with genetic material contained within a membrane-encased nucleus)" [21] has inspired researchers to apply concepts of epidemiology to information and communications technology (ICT). The vast majority of research into epidemiological applications to cybersecurity is focused on the spread of malware across technology elements. Epidemiology has also been applied as a mathematical modelling technique for virus propagation analysis through fully connected networks. This is very limited as it will not determine "the effect of the topological structure of the Internet on the spread of computer viruses" [22]. Endpoint computers are the only risk factor elements considered in this method. Researchers have explored propagation inspired by biological paradigms over standard computer networks [24], peer-to-peer networks [23] [24] [25], IoT devices [26] and WiFi routers [27].

Extant literature focuses on intrusion detection and prevention methodologies for technical machine vulnerabilities (hardware and software). The literature is largely focused on the vulnerabilities of individual network nodes and links. As such there is "limited systematic understanding of the factors that determine the likelihood that a node (computer) is compromised", in aggregate [28]. Scholars are noting a critical gap in "cyber epidemiology" research, which "treats individuals as highly distinct, independent, and important agents within a socio-technical system", and "advocate an approach to understanding how cybercrime thrives due to a failure to develop the understanding needed for effective behavioural control measures that are presented at the right place and the right time" [29].
Research into macro risk determination methods based on a range of factors is scarce [28]. This research is essential for the detection and prevention of, response to and recovery from cyber attacks in an increasingly complex and interconnected world.  [29], plus elements of human behaviour which utilise artificial intelligence to evolve. Comprehensive and reactive models will "provide a scalable, resilient, and cost-effective mechanism that may keep pace with constantly evolving security needs" [25].
The closest form of truly comprehensive aggregate risk factor analysis can be seen in research on cognitive modelling of "dynamic simulations involving attacker, defender, and user models to enhance studies of cyber epidemiology and cyber hygiene" [30]. The researchers emphasise the importance of "wargaming" and simulation in both health care and cybersecurity, highlighting that "just as simulations in healthcare predict how an epidemic can spread and how it can be contained, such simulations may be used in the field of cyber-security as a means of progress in the study of cyber-epidemiology" [30]. The authors argue that epidemiology can be applied for the simulation of a pandemic or disease outbreak, through prediction models based on existing behavioural data of threats.
These simulations provide "realistic synthetic users for full-scale training/wargame scenarios", which will "enable much-needed research in cybersecurity and cyber-epidemiology" [30]. Such simulations need to include methods to inject non-deterministic behaviours [31].

Technical Applications of Epidemiology to Cybersecurity-DNS
Epidemiological concepts can be applied to extant research findings to form an aggregate risk profile. In recent research, a DNS Anomaly Detection tool (BotDAD) was proposed to detect a bot-infected machine in a network using DNS fingerprinting [32]. This technique analyses host DNS fingerprints on an hourly basis and identifies anomalous behaviour that diverges from standard machine behaviour [32]. Panza et al. [16] used clustering to group DNS domains based on the similarity between their users' activity, then compared these groups by

DNS Risk Factors
DNS data contains strong indicators of human and machine behaviour. Hence, it is a relevant data type for exploring the relationship between people, devices, data and process, which form the fundamental elements of the Internet of Everything (IoE). It offers a high volume of data and a variety of user types and host machine configurations, and it is encryption free. DNS is simply the machine-aided mechanism for resolving a word-based domain to an internet protocol (IP) address for any host on the internet [28]. It acts as a "yellow pages" for the internet: humans understand and can remember English-worded domain names (e.g., google.com), while computers understand numbers (IP addresses). DNS queries are generated when someone sends an email or visits a website. The DNS system leverages the DNS recursor, root name server, TLD name server and authoritative name server to identify the IP address that supports the requested domain and route the end-user to it. The standard DNS function is characterised as normal machine behaviour.
Human behaviour often initiates the DNS query. This is most frequently through an action such as opening a browser, clicking a link, or typing a domain address or search query. These actions have an associated risk factor. For example, a user looking to access "http://google.com" will usually have a lower risk of compromise than a user looking to access a Dark Web host such as "http://hss3uro2hsxfogfq.onion/". This behaviour is classified as normal human behaviour, with a sub-classification of risk based on normal and anomalous activity type. These activity types can be intentional or unintentional. DNS has inherent security weaknesses and vulnerabilities. These can be described in two categories: protocol attacks and server attacks. Protocol attacks compromise the DNS function. The first form of protocol attack is DNS cache poisoning, which allows malicious actors to "poison" the records and trick the DNS into re-directing queries and resolving malicious domains.
The second form of protocol attack is often referred to as "DNS spoofing", conducted in conjunction with "DNS ID hacking", where a malicious actor "spoofs" the packet's source address and ID fields, to answer a legitimate DNS request meant for a legitimate DNS server and impersonate the DNS reply. This  [32]. Malware also utilises Domain Generation Algorithms (DGA) to periodically generate several domain names that can be used for command and control servers [31]. These behavioural patterns can be characterised as malicious human behaviour, as a human is most often required to undertake these exploits, as shown in Figure 2.
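The periodic, seed-driven nature of a DGA can be illustrated with a toy generator. This is a minimal sketch only: the hashing scheme, the seed string, the label length and the reserved ".example" suffix are all assumptions made for illustration and do not reproduce the algorithm of any malware family observed in the dataset.

```python
import hashlib
from datetime import date
from typing import List


def toy_dga(seed: str, day: date, count: int = 5,
            length: int = 12, tld: str = ".example") -> List[str]:
    """Derive `count` pseudo-random domain names from a seed and a date.

    Hypothetical scheme: hash "seed|date|index" with SHA-256 and map the
    first `length` bytes of the digest onto the letters a-z.
    """
    if not 1 <= length <= 32:
        raise ValueError("length must be between 1 and 32 (SHA-256 digest size)")
    domains = []
    for index in range(count):
        material = f"{seed}|{day.isoformat()}|{index}".encode("utf-8")
        digest = hashlib.sha256(material).digest()
        label = "".join(chr(ord("a") + byte % 26) for byte in digest[:length])
        domains.append(label + tld)
    return domains


# The same seed and date always yield the same candidate domains, so both
# the malware and its operator can compute the list independently.
todays_domains = toy_dga("demo-seed", date(2016, 4, 24))
print(todays_domains)
```

Because the output is deterministic for a given seed and date, a defender who recovers the seed could in principle pre-compute the same candidate domains and watch for them in DNS queries.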

Epidemiological Approaches to DNS Attack Analysis
The dataset used for this analysis was sourced from [32]. The authors utilised the dataset to create a DNS anomaly detection tool, called BotDAD, designed to use DNS fingerprinting to detect botnet-infected machines [32]. BotDAD is an enterprise approach to anomaly detection and aims to build on extant approaches, which based their analysis on failed queries. Similar DNS datasets have also   [36] and classification of IP flows [24]. This dataset has been used as an exemplar for rich DNS data. The data comprises a campus' DNS network traffic, consisting of more than 4000 active users (in peak load hours), for random days in April and May 2016. The set comprises DNS data (.pcap) from 23 April 2016 to 9 May 2016. Ten days of data were sufficient for a proof of concept and contain enough data for a relational analysis and profiling of behaviour. A preliminary analysis was conducted on this dataset to identify the categories of risk factors and the relationships between these risk factors.
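To make the idea of a per-host DNS fingerprint concrete, the following minimal sketch groups already-parsed DNS records by host and hour and counts total queries, distinct domains and failed queries. The record layout (timestamp, host address, query name, response code) and the three features are assumptions for illustration; they are not BotDAD's actual feature set, and parsing of the .pcap files is not shown.

```python
from collections import defaultdict
from datetime import datetime


def hourly_fingerprints(records):
    """Aggregate DNS records into simple per-host, per-hour features.

    Each record is an assumed tuple: (timestamp, host_ip, query_name, rcode),
    where rcode 0 means NOERROR and any other value is counted as a failure.
    """
    stats = defaultdict(lambda: {"queries": 0, "domains": set(), "failed": 0})
    for timestamp, host, query_name, rcode in records:
        hour = timestamp.replace(minute=0, second=0, microsecond=0)
        entry = stats[(host, hour)]
        entry["queries"] += 1
        # Normalise case and any trailing root dot before counting domains.
        entry["domains"].add(query_name.lower().rstrip("."))
        if rcode != 0:
            entry["failed"] += 1
    return {
        key: {
            "queries": value["queries"],
            "distinct_domains": len(value["domains"]),
            "failed": value["failed"],
        }
        for key, value in stats.items()
    }


# Hypothetical records for one host (rcode 3 is NXDOMAIN).
records = [
    (datetime(2016, 4, 24, 10, 5), "10.0.0.5", "example.com.", 0),
    (datetime(2016, 4, 24, 10, 20), "10.0.0.5", "Example.com", 0),
    (datetime(2016, 4, 24, 10, 45), "10.0.0.5", "qzxkvjw.example", 3),
    (datetime(2016, 4, 24, 11, 2), "10.0.0.5", "example.org", 0),
]
fingerprints = hourly_fingerprints(records)
for key, features in sorted(fingerprints.items()):
    print(key, features)
```

Hosts whose hourly features diverge strongly from their own history or from their peers would then be candidates for closer inspection.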

Data Features
Feature engineering underpins Artificial Intelligence (AI) and Machine Learning

Machine behaviour: normal
Normal machine behaviour comprises processes, queries and patterns that are expected elements based on system configuration. In the context of DNS, this is the true resolution of legitimate queries to the corresponding legitimate IP address. This also includes legitimate queries sent by machine services to support user applications.

Machine behaviour: anomalous or malicious
Anomalous machine behaviour is when a process, query or pattern error occurs. Machines do not make "mistakes". In DNS data, this occurs when a packet is lost, or a DNS server is unable to resolve the query. This can be caused by a loss of confidentiality (machine query compromised due to leaked password), availability (server is down) or integrity (protocol attacks). Malicious behaviour includes illegitimate queries sent by malware to command and control servers. This behaviour can indicate that the compromised machine is being utilised as an infrastructure to conduct further malicious activity.

Human behaviour: normal
Normal human behaviour comprises non-malicious queries to support legitimate internet browsing. In DNS data, this is when legitimate users are using the DNS service as it is designed. The majority of DNS query activity in the dataset was used for website browsing and email transmission.

Human behaviour: anomalous or malicious
Anomalous human behaviour is most often caused by human error. An example of this in the DNS dataset is where the text query has contained a typographical error that is still a recognised domain name, resulting in the DNS resolver pointing to a domain that is inconsistent with the user's intent. This domain could be high-risk. Malicious human behaviour is where an actor intentionally compromises the DNS service. This occurs initially through protocol and server attacks. Malicious human behaviour is also seen in the command and control of malware to spread infection and commandeer additional infrastructure within the network for malicious use.
values which differ by default e.g. MacOS is 64 for UDP, Windows is 128 for UDP, Linux can be 255 or 64 for UDP [37].
Operating System (software) is a key machine-type risk factor. The preliminary analysis of the DNS dataset demonstrates that there is a relationship between operating system type and risk of malware infection. All forms of malware identified were variants known to target and infect Windows Operating Systems only [31]. There is no evidence of macOS or Linux malware, or that macOS and Linux operating systems were compromised in this dataset.
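Operating system type can be approximated from packet headers. The sketch below assumes that the default values quoted above (64, 128 and 255) are initial IP Time To Live (TTL) values, which the surviving text does not state explicitly, and that each router hop decrements the TTL by one. The mapping and function name are hypothetical, and the result is only a coarse guess, since defaults can be reconfigured.

```python
# Assumed initial TTL defaults, taken from the values quoted in the text.
INITIAL_TTL_TO_OS = {
    64: ("macOS", "Linux"),
    128: ("Windows",),
    255: ("Linux",),
}


def guess_os_from_ttl(observed_ttl: int):
    """Guess the sender's operating system family from an observed TTL.

    The observed TTL is rounded up to the nearest known initial value, on
    the assumption that it has only been decremented by intermediate hops.
    Returns (assumed_initial_ttl, candidate_os_names).
    """
    if not 1 <= observed_ttl <= 255:
        raise ValueError("observed_ttl must be between 1 and 255")
    initial = min(ttl for ttl in INITIAL_TTL_TO_OS if ttl >= observed_ttl)
    return initial, INITIAL_TTL_TO_OS[initial]


print(guess_os_from_ttl(121))  # a packet that has crossed a few hops
print(guess_os_from_ttl(64))   # a packet seen on the local segment
```

A guess of this kind is ambiguous where defaults overlap (64 here maps to two candidates), so it is best treated as one risk-factor feature among several, not as a definitive identification.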
There is a relationship between browsing high-risk websites and high-
These relationships prove the concept of risk factor categorisation and contextual analysis to determine risk of compromise. This is explored in Table 1. Figure 5 illustrates the spread of malware through the university network, over time and by category. It shows that Modpack was the first instance of malware, sighted on day 1 (24 April 2016), as shown in Table 2. Conficker was evidenced next, followed by a surge in Modpack, which was followed by the Necurs, Nymaim, Pitou and Suppobox malware variants. From Figure 5, Table 2 and Figure 6, it appears that Conficker, Suppobox, Tofsee and UD3 could potentially have been defeated, as they have no further occurrences. This cannot be determined with certainty due to the limited dataset. Modpack increases over time to a peak on day 9, then starts to reduce on day 10. Necurs, Virut and Pitou remain reasonably constant over time, while Nymaim starts strong and decreases slightly over time.

Epidemic Parameters
The DNS dataset contains data features that map directly to epidemiological dynamics, as illustrated in Table 3.
Time (days) in which the malware is undertaking actions, and/or has the potential to undertake actions to infect other hosts.
Fatality rate: Percentage of hosts that have been irreversibly compromised, are unable to perform their function and/or are damaged beyond repair.
Time from end of incubation to death: Time (days) from when the malware commences an action to the host becoming irreversibly compromised ("bricked").
Recovery time: Time (days) from when the malware commences an action to host recovery back to normal operations and a status of "not infected".
Length of hospital stay: Time (days) in which the malware is undertaking actions.
Hospitalisation rate: Rate (count as a percentage) of hosts demonstrating indicators of compromise (IOCs) that are confirmed infected for some time during which the malware is undertaking action.
Time to hospitalisation: Time (days) from when hosts demonstrate indicators of compromise (IOCs) to when they are confirmed infected due to evidence of malware action.
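The parameter definitions above can be sketched as simple calculations over per-host timelines. The `HostTimeline` structure, its field names and the sample values are hypothetical. Two further assumptions are made: the day a host is confirmed infected is taken as the day malware action begins, and the fatality rate is computed over confirmed-infected hosts, since the definition does not fix a denominator.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List, Optional


@dataclass
class HostTimeline:
    """Assumed per-host timeline, in days since the start of observation."""
    first_ioc_day: int                   # first indicator of compromise
    confirmed_day: Optional[int] = None  # confirmed infected (malware action)
    recovered_day: Optional[int] = None  # back to "not infected"
    bricked_day: Optional[int] = None    # irreversibly compromised


def _average(values):
    values = list(values)
    return mean(values) if values else None


def epidemic_parameters(hosts: List[HostTimeline]) -> dict:
    """Estimate epidemic parameters from hosts that showed IOCs."""
    confirmed = [h for h in hosts if h.confirmed_day is not None]
    recovered = [h for h in confirmed if h.recovered_day is not None]
    bricked = [h for h in confirmed if h.bricked_day is not None]
    return {
        "hospitalisation_rate_pct":
            100.0 * len(confirmed) / len(hosts) if hosts else None,
        "fatality_rate_pct":
            100.0 * len(bricked) / len(confirmed) if confirmed else None,
        "time_to_hospitalisation_days":
            _average(h.confirmed_day - h.first_ioc_day for h in confirmed),
        "recovery_time_days":
            _average(h.recovered_day - h.confirmed_day for h in recovered),
        "time_to_death_days":
            _average(h.bricked_day - h.confirmed_day for h in bricked),
    }


# Hypothetical timelines for four hosts that showed IOCs.
hosts = [
    HostTimeline(first_ioc_day=1, confirmed_day=2, recovered_day=6),
    HostTimeline(first_ioc_day=1, confirmed_day=3, bricked_day=8),
    HostTimeline(first_ioc_day=2),
    HostTimeline(first_ioc_day=3, confirmed_day=4, recovered_day=7),
]
params = epidemic_parameters(hosts)
for name, value in params.items():
    print(name, value)
```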

Conclusions and Future Work
This paper has discussed the applications of epidemiology to cybersecurity. This research is expected to yield a range of contributions to cybersecurity. Further research is underway to utilise Artificial Intelligence and Machine Learning models to monitor and automate the analysis of these risk factors in aggregate. Further research is also underway to utilise DNS data features related to epidemic dynamics and apply them to epidemiological models, in order to analyse the spread patterns of different malware variants, including the reproduction number. This work will also be extended by developing the epidemiology principle into a risk assessment model for enhancing the performance of machine learning-based intrusion detection systems.
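As an indication of what such modelling could look like, the sketch below uses the standard susceptible-infected-recovered (SIR) compartmental model with a daily time step. SIR is chosen here purely for illustration; this paper does not specify which epidemiological model will be used, and the parameter values (transmission rate beta, recovery rate gamma, population size) are hypothetical.

```python
def basic_reproduction_number(beta: float, gamma: float) -> float:
    """R0 for the SIR model: transmission rate divided by recovery rate."""
    if gamma <= 0:
        raise ValueError("gamma must be positive")
    return beta / gamma


def simulate_sir(population: int, initial_infected: int,
                 beta: float, gamma: float, days: int):
    """Simulate a discrete-time SIR model over `days` daily steps.

    Returns a list of (susceptible, infected, recovered) host counts,
    one tuple per day, starting with the initial state.
    """
    s = float(population - initial_infected)
    i = float(initial_infected)
    r = 0.0
    history = [(s, i, r)]
    for _ in range(days):
        # New infections depend on contact between susceptible and infected hosts.
        new_infections = min(beta * s * i / population, s)
        new_recoveries = min(gamma * i, i)
        s -= new_infections
        i += new_infections - new_recoveries
        r += new_recoveries
        history.append((s, i, r))
    return history


# Hypothetical scenario: one infected host on a network of 4000 hosts.
history = simulate_sir(population=4000, initial_infected=1,
                       beta=0.5, gamma=0.25, days=10)
print(basic_reproduction_number(0.5, 0.25))
print(history[-1])
```

When the reproduction number is above 1, the number of infected hosts grows at first; when it is below 1, the outbreak dies out. Estimating beta and gamma per malware variant from DNS-derived features is the open research step.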