The Computational Theory of Intelligence: Information Entropy

This paper presents an information theoretic approach to the concept of intelligence in the computational sense. We introduce a probabilistic framework from which computational intelligence is shown to be an entropy minimizing process at the local level. Using this new scheme, we develop a simple data driven clustering example and discuss its applications.


Introduction
This paper attempts to introduce a computational approach to the study of intelligence that the researcher has accumulated over years of study. This approach takes into account data from psychology, neurology, artificial intelligence, machine learning, and mathematics.
Central to this framework is the fact that the goal of any intelligent agent is to reduce the randomness in its environment in some meaningful way. Of course, formal definitions in the context of this paper for terms like "intelligence", "environment", and "agent" will follow.
The approach draws from multidisciplinary research and has many applications. We will utilize the construct in discussions at the end of the paper. Other applications will follow in future works. Implementations of this framework can apply to many fields of study including general artificial intelligence (GAI), machine learning, optimization, information gathering, clustering, and big data, and extend outside of the applied mathematics and computer science realm to even more areas including sociology, psychology, and neurology, and even philosophy.

Definitions
One cannot begin a discussion about the philosophy of artificial intelligence without a definition of the word "intelligence" in the first place. With the panopoly of definitions available, it is understandable that there may be some disagreement, but typically each school of thought generally shares a common thread. The following are three different definitions of intelligence from respectable sources: 1. The aggregate or global capacity of the individual to act purposefully, to think rationally, and to deal effectively with his environment [19].
2. A process that entails a set of skills of problem solving -enabling the individual to resolve genuine problems or difficulties that he or she encounters and, when appropriate, to create an effective product -and must also entail the potential for finding or creating problems -and thereby providing the foundation for the acquisition of new knowledge [5].
Vernon's hierarchical model of intelligence from the 1950's [1], and Hawkins' On Intelligence in 2004 [7] are some other great resources on this topic. Consider the following working definition of this paper, with regard to information theory and computation: computational intelligence (CI) is an information processing algorithm that 1. Records data or events into some type of store, or memory.
2. Draws from the events recorded in memory, to make assertions, or predictions about future events.
3. Using the disparity between the predicted and events and the new incoming events, the memory structure in step 1 can be updated such that the predictions of step 2 are optimized.
The mapping in 3 is called learning, and is endemic to CI. Any entity that is facilitating the CI process we will refer to as an agent, in particular when the connotation is that the entity is autonomous. The surrounding infrastructure that encapsulates the agent together with the ensuing events is called the environment.

Structure
The paper is organized as follows. In section 2 we provide a brief summary of the concept of information entropy as it is used for our purposes. In section 3, we provide a mathematical framework for intelligence and show discuss its relation to entropy. Section 4 discusses the global ramifications of local entropy minimization. In section 5 we present a simple application of the framework to data analytics, which is available for free download. Sections 6 and 7 discuss relavent related research, and future work.

Entropy
A key concept of information theory is that of entropy, which amounts to the uncertainty in a given random variable, [8]. It is essentially, a measure of unpredictability (among other interpretations). The concept of entropy is a much deeper principal of nature that penetrates to the deepest core of physical reality and is central to physics and cosmological models, [17,15,6].

Mathematical Representation
Although terms like Shannon entropy are pervasive in the field of information theory, it will be insightful to review the formulation in our context. To arrive at the definition of entropy, we must first recall what is meant by information content. The information content of a random variable, X denoted I [X], is given by where P [X]is the probability of X. The entropy of X , denoted E [X], is then defined as the expecation value of the information content.
Expanding by the definition of the expectation value, we have where {x i }is the set of possible values X can take.

Relationship of Shannon to Thermodynamic Entropy
The concept of entropy is deeply rooted at the heart of physical reality. It is a central concept in thermodynamics, governing everything from chemical reactions to engines and refrigerators. The relationship of entropy as it is known in information theory, however, is not mapped so straightforwardly to its use in thermodynamics.
In statistical thermodynamics, the entropy S, of a system is given by where p i denote the probability of each microstate, or configuration of the system, and k b is the Boltzmann constant which serves to map the value of the summation to physical reality in terms of quantity and units.
The connection between the thermodynamic and information theoretic versions of entropy relate to the information needed to detail the exact state of the system, specifically, the amount of further Shannon information needed to define the microscopic state of the system that remains ambiguous when given its macroscopic definition in terms of the variables of classical thermodynamics. The Boltzmann constant serves as a constant of proportionality.

Renyi Entropy
We can extend the logic of the beginning of this section to a more general formulation called the Renyi entropy of order α, where α ≥ 0 and α = 1 defined as Under this convention we can apply the concept of entropy more generally to extend the utility of the concept to a variety of applications. It is important to note that this formulation approches 1 in the limit as α → 1. Although the discussions of this paper were inspired by Shannon entropy, we wish to present a much more general definition and a bolder proposition.

Intelligence: Definitions and Assumptions
Consider the sets S and O and let I ∈ I be the following mapping I : S → O. The function I represents the intelligence process, a member of I , the set of all such functions. It maps input from set S to O. First, let reflect the fact that I is mapping one element from S to one element in O, each tagged by the identifier i ∈ N, which is bounded by the cardinality of the input set. The cardinality of these two sets need not match, nor does the mapping between I need to be bijective, or even surjective. This is an iterative process, as denoted by the index, t. Finally, let O t represent the collection of o i t .
Over time, the mapping should converge to the intended element, o i ∈ O, as is reflected by Introduce the function: which in practice is usually some type of error or fitness measuring function. Assuming that L is continuous and differentiable, let the updating mechanism at some particular t for I be In other words, I iteratively updates itself with respect to the gradient of some function, L. Additionally, L must satisfy the following partial differential equation where the function d is some measure of the distance between O and O t , assuming such a function exists, and ρ is called the learning rate. Applications of this process to abstract topological spaces where such a distance function is a commodity is an open question. For this to qualify as an intelligence process, we must have

Unsupervised and Supervised Learning
Some consideration should be given to the sets S and O. If O = P (S) where P (S) is the power set of S, then we will say that the mapping I is an unsupervised mapping. Otherwise, the mapping is supervised. The ramifications of this distinction is as follows. In supervised learning, the agent is given two distinct sets and trained to form a mapping between them explicitly. With unsupervised learning, the agent is tasked with learning subtle relationships in a single data set or, put more succinctly, to develop the mapping between S and its power set discussed above [9,16].

Overtraining
Further, we should note that just because we have some function I : S → O satisfying the definitions and assumptions of this section does not mean that this mapping be necessarily meaningful. After all, we could make a completely arbitrary but consistent mapping via the prescription above, and although this would satisfy all the definitions and assumptions, it would be complete memorization on the part of the agent. But this, in fact is exactly the definition of overtraining a common pitfall in the training stage of machine learning and about which one must be very diligent to avoid.

Entropy Minimization
One final part of the framework remains, and that is to show that entropy is minimized, as was stated at the beginning of this section. To show that, we consider I as a probabilistic mapping, with indicating the probability that I maps s i j ∈ S to some o j ∈ O. From this, we can calculate the entropy in the mapping from S toO, at each iteration t. If the projection I s i has N possible outcomes, then the Shannon entropy of eachs i ∈ S is given by If |S| = M , then the total entropy is simply the sum of E i t [S] , i ∈ 1, 2, ..., N . But for the purposes of standardization across varying cardinalities, it is useful to speak of the normalized entropy, As t → ∞, the mapping from each s i to its corresponding o i converges, and we have lim t→∞ P t s i → 1, i ∈ 1, 2, ..., N and therefore Further, using the definition for Renyi entropy in 5 for each t and i To show that the Renyi entropy is also minimized, we can use an identity involving the p-norm and show that the log function is maximized t → ∞ for α > 1, and minimized for α < 1. The case where α → 1 was shown above when we used the definition of Shannon entropy. To continue, note that P t s i = 1 (19) where the sum is taken over all possible states o i ∈ O in the range ofI t s i . But from 15, we have for finite t and thus the log function is minimized only as t → ∞. To show that the Renyi entropy is also minimized for α ∈ (0, 1), we repeat the above logic but note that the with the sign reversal of α 1−α , we need to show that P t s i α is minimized as t → ∞.

Entropic Self Organization
In section 3 we talked about the definitions of intelligence via the mapping I : S → O. Here, we seek to apply the entropy minimization concept to P (S) itself, rather than a mapping. Explicitly, let σ ⊂ P (S) where for every s ∈ S, there is a unique s ∈ σ such that s ∈ s. That is, every element of S has one and only one element of σ containing it. The term entropic self organization refers to finding the Σ ⊂ P (S) such that H α [σ] is minimized over all σ satisfying 21:

Global Effects
In nature, whenever a system is taken from a state of higher entropy to a state of lower entropy, there is always some amount of energy involved in this transition, and an increase in the entropy of the rest of the environment greater than or equal to that of the entropy loss [17]. In other words, consider a system S composed of two subsystems, s 1 and s 2 . Then Now, consider that system in equilibrium at times t = 1, and t = 2, denoted S 1 and S 2 . Due to the second law of thermodynamics. and Now, suppose one of the subsystems, say, s 1 decreases in entropy by some amount, ∆s during the transition by time t = 2, i.e. s 2 1 = s 1 1 − ∆s Then what can be said of s 1 2 , the entropy of the rest of the system that So the entropy of the rest of the system has to increase by an amount greater than or equal to the loss of entropy in s 1 . This will require some amount of energy, ∆E.
While the second law of thermodynamics has been verified time and again in virtually all areas of physics, few have extended it as a more general principal in the context of information theory. In fact, we will conclude this paper with a postulate about intelligence: • Computational intelligence is a process that locally minimizes and globally maximizes Renyi entropy.
It should be stressed that although the above is necessary of intelligence, it is not sufficient in the justification of an algorithm or process as being intelligent.

Application
Here, we implement the discussions of this paper to practical examples. First, we consider a simple example of unsupervised learning; a clustering algorithm based on Shannon entropy minimization. Next we look at some simple behavior of an intelligent agent as it acts to maximize global entropy in its environment.

Clustering by Entropy Minimization
Consider a data set consisting of a number of elements organized into rows. Take for example the data that can be found at [11]. This particular example contains 300 samples, each a vector from R 3 . This simple proof of concept will group the data into like neighborhoods by minimizing the entropy across all elements at each respective index in the data set. This is a data driven example, so essentially we use a genetic algorithm to perturb the juxtaposition of members of each neighborhood until the global entropy reaches a minimum (entropic self organization), while at the same time avoiding trivial cases such as a neighborhood with only one element. A prerequisite for running this code is that one must have the Python framework installed, which is also freely available for many operating systems at [3].
The clustering source code is freely available at [10]. To run, simply download it and enter > chmod 777 e n t r o p y c l u s t e r . py > python e n t r o p y c l u s t e r . py −f e n t r o p y c l u s t e r d a t a . c s v Please note that this is a simple prototype, a proof of concept used to exemplify a concept. It is not optimized for latency, memory utilization, and it has not been optimized or performance tested against other algorithms in its comparative class, although dramatic improvements could be easily acheived by integrating the information content of the elements into the algorithm. Specifically, we would move elements with a high information content to clusters where that element would otherwise have a low information content. Furthermore, observe that for further efficacy, a preprocessing layer may be beneficial, especially with topological data sets like the iris data set. Nevertheless, applications of this concept applied to clustering on small and large scales will be discussed in a future work.
We can visualize the progression of the algorithm and the final results, respectively, in the graphs pictured below. For simplicity, only the first two (non noise) dimensions are plotted. The accuracy of the clustering algorithm was 8.3% error rate in 10000 iterations, with an average simulation time: 480.1  seconds. Observe that although there are a few 'blemishes' in the final clustering results, with a proper choice of parameters including the maximum computational epochs the clustering algorithm will eventually succeed with 100% accuracy. Also pictured in figure 2 are the results of the clustering algorithm applied to a data set containing four additional fields of pseudo-randomly generated noise, each in the interval [−1, 1]. The performance of this trial was worse than the last in terms of speed, but was had about the same classification accuracy. The accuracy of the clustering algorithm was 6.0% error rate in 10000 iterations, with an average simulation time: 1013.1 seconds.

Global Entropy Maximization
In our next set of examples consider a virtual agent confined to move about a 'terrain', represented by a three-dimensional surface, given by one of the two following equations, respectively: We will confine x, y such that (x, y) ∈ ([x min , x max ], [y min , y max ]) and note that the range of each respective surface is z ∈ [0, 1]. The algorithm proceeds as follows. First, the agent is initialized with a starting position, s = (x 0 , y 0 ). It updates s by incrementing or decrementing its coordinates by some small value, = ( x , y ). As the agent meanders about the surface, data is collected as to its position on the z−axis.
If we partition the range of each surface into equally spaced intervals, we can form a histogram H of the agent's positional information. From this H we can construct a discrete probability function, P H and thus calculate the Renyi entropy. The agent can then use feedback from the entropy determined using H to calculate an appropriate from which it upates its position, and the cycle continues. The overall goal is to maximize its entropy, or timeout after a predetermined number of iterations.
In this particular simulation, the agent is initialized using a 'random walk', in which is is chosen at random. Next, it is updated using feedback from the entropy function.
From the simple set of rules, we see emergent desire for parsimony with respect to position on the surface, even in the less probable partitions of z, (asz → 1). As our simulation continues to run, so tends P H to a uniform distribution.
To run the simulation and obtain the above data, simply download the source code freely available at [12] and enter:

Related Work
Although there are many approaches to intelligence from the angle of cognitive science, few have been proposed from the computational side. However, as of late, some great work in this area is underway.
Many sources claim to have computational theories of intelligence, but for the most part these "theories" merely act to describe certain aspects of intelligence [2]. For example, Meyer in [14] suggests that performance on multiple tasks is dependent on adaptive executive control, but makes no claim on the emergence of such characteristics. Others discuss how data is aggregated. This type of analysis is especially relevant in computer vision and image recognition [13].
The efforts in this paper seek to introduce a much broader theory of emergence of autonomous goal directed behavior. Similar efforts are currently under way.
Inspired by physics and cosmology, Wissner-Gross asserts autonomous agents act to maximize the entropy in their environment [20]. Specifically he proposes a path integral formulation from which he derives a gradient which can be analogized as a causal force propelling a system along a gradient of maximum entropy over time. Using this idea, he created a startup called entropica that applies this principal in ingenious ways in a variety of different applications, ranging from anything to teaching a robot to walk upright, to maximizing profit potential in the stock market.
Essentially, what Wissner-Gross did was start with a global principal and worked backwards. What we did in this paper was to was to arrive at a similar result from a different perspective, namely, entropy minimization.

Conclusion
The purpose of this paper was to lay the groundwork for a generalization of the concept of intelligence in the computational sense. We discussed how entropy minimization can be utilized to facilitate the intelligence process, and how the disparities between the agent's prediction and the reality of the training set can be used to optimize the agents performance. We also showed how such a concept could be used to produce a meaningful, albeit simplified, practical demonstration.
Some future work includes applying the principals of this paper to data analysis, specifically in the presence of noise or sparse data. We will discuss some of these applications in the next paper.
More future work includes discussing the underlying principals under which data can be collected hierarchically, discussing how computational processes can implement the discussions in this paper to evolve and work together to form processes of greater complexity, and discussing the relevance of these contributions to abstract concepts like consciousness and self awareness.
In the following paper we will examine how information can aggregate together to form more complicated structures, the roles these structures can play.