
The big problem of Big Data is the lack of a machine learning process that scales and finds meaningful features. Humans fill in for the insufficient automation, but the complexity of the tasks outpaces the human mind's capacity to comprehend the data. Heuristic partition methods may help but still need humans to adjust the parameters. The same problems exist in many other disciplines and technologies that depend on Big Data or Machine Learning. Proposed here is a fractal groupoid-theoretical method that recursively partitions the problem and requires no heuristics or human intervention. It takes two steps. First, make explicit the fundamental causal nature of information in the physical world by encoding it as a causal set. Second, construct a functor F: C → C′ on the category of causal sets that morphs causal set C into a smaller causal set C′ by partitioning C into a set of invariant groupoid-theoretical blocks. Repeating the construction yields a sequence of progressively smaller causal sets C, C′, C″, … The sequence defines a fractal hierarchy of features, the features being invariant and hence endowed with physical meaning, and the hierarchy being scale-free and hence ensuring proper scaling at all granularities. Fractals exist in nature nearly everywhere and at all physical scales, and invariants have long been known to be meaningful to us. The theory is also of interest for NP-hard combinatorial problems that can be expressed as a causal set, such as the Traveling Salesman problem. The recursive groupoid partition effected by functor F works against their combinatorial complexity and appears to allow a low-order polynomial solution. A true test of this property requires special hardware, not yet available. However, as a proof of concept, a suite of sequential, non-heuristic algorithms was developed and used to solve a real-world 120-city TSP instance on a personal computer. The results are reported.

This paper is part and extension of the new causal theory, which was previously introduced (see for example [

The main goal of the causal theory is to derive structure from symmetries in the causal set. Every causal set has a permutation groupoid. The groupoid induces a block system B on S, consisting of a set

The original motivation for the development of this theory was the author’s interest in the binding problem, and more specifically the automation of the refactoring of computer software. A good introduction to the binding problem can be found in Sections 1 - 3 of [

It turns out that refactoring is a problem of discrete optimization. The functional to be optimized is the intrinsic metric of causal sets. The functional is positive definite, homogeneous, and linear, and satisfies the property of locality. Because of locality, the process of optimization is fully parallelizable. The functional is a sum of positive terms independent of each other (some restrictions apply), and each term can be optimized by a separate processor, hence solving the problem in (nearly) constant time. The properties of recursion, scalability, compression, locality, and parallelism are today among the most coveted and sought-after in disciplines such as Computer Science, Big Data, Supercomputing, Artificial Intelligence, and others placed at the very forefront of technology. But Turing computers lack these properties, because function M separates S from w, leaving S in the computer memory and w in the hands of human developers, inaccessible to the processor. Then, as the volume of data grows and exceeds the capacity of the human mind to comprehend it, development slows and may soon come to a standstill. These ideas are further expanded in Section 2.

Self-programming, defined as the automatic conversion of causal information acquired by sensors into refactored, object-oriented, human readable code, is also a problem of discrete optimization, and falls in the same category as refactoring [

The deep connection between the causal theory and the collection of natural, observable, and measurable phenomena we know as emergence, self-organization, adaptation, and intelligence cannot be overlooked [

This is, no doubt, a vast panorama. It has to be addressed one piece at a time. In the present paper, the theory is discussed first, and then the focus shifts sharply to the Traveling Salesman problem (TSP) only. However, even with such a narrow focus, the human connection is still present. It turns out that humans are at least as good as computers at solving some TSP problems, and in some cases may even outperform algorithms by an order of magnitude, in spite of the fact that hardware runs at least 1,000,000 times faster than wetware. The prospect is thus raised that humans may be using better algorithms than those used today in computers [

It is not possible to reach the theoretical limit for performance using sequential algorithms. Supercomputers would not help, either. The limit stems from the properties of locality and parallelism that causal information naturally possesses, but that supercomputers do not exploit. The functional to be optimized is a sum of positive terms, each one of which must be separately minimized, as discussed in more detail in Section 3.4 and Equation (8) below. However, in order to provide a proof of concept, a suite of sequential algorithms was developed. The algorithms were designed for research, not production. The implementation emphasizes compliance with the theory, maximum flexibility, and no heuristics or adjustable parameters. The algorithms are applied here to the solution of TSP problems, but they can easily be extended to the general case of causal information. They run in the debugger and contain numerous internal tests for early bug detection, needed to support constant changes, as always happens with research work. Benchmarks were not taken because performance is not a consideration.

Unlike other approaches such as quantum computation, the theoretical limit is easy to achieve in the causal theory. All it requires is a multicore processor, such as a GPU or a neuromorphic processor [

This paper presents the symmetric TSP problem as an application of the causal theory. TSP is one instance of the P = NP problem, the famous problem of theoretical computer science asking whether every problem whose solution can be verified in polynomial time can also be solved in polynomial time. It has been selected as one of the Clay Millennium problems, and is considered by many to be the most important unsolved problem in Mathematics. The expectation is that, by systematically partitioning S into a progressively smaller number of more and more comprehensive blocks, it should be possible to compensate for the NP-hardness of TSP and solve it deterministically in low-order polynomial time and without the use of heuristics. Presented here is a case study of gr120, a symmetric TSP problem with 120 cities in Germany, with integer distances, solved in polynomial time by causal techniques.

The paper begins with an overview of the causal theory in Section 2. Matters such as the formulation of TSP as a causal set, block systems, permutation groupoids, the functor F, macrostates and their properties, and the grounding of the causal theory as a fundamental theory of Physics are discussed there.

The scope of this paper is limited to the solution of TSP problems. However, the causal theory needs to be introduced because it is what makes the solution possible. The causal theory proposes information as a fundamental entity with properties of its own. It also proposes that information should be encoded as a collection w of (cause, effect) pairs over a set S of symbols, instead of the more traditional encoding as a string of symbols. Such an encoding is known as a causal set

Actually information is always encoded as causal information, except for the addition of a function

Function M has obvious advantages in engineering: it is what makes computation and communication possible. But M also has a serious disadvantage. For the purposes of the causal theory, function M is not needed. The theory treats information as an entity and focuses on its properties, specifically on the properties of the mathematical object C that represents information, irrespective of any map to any particular system. This gives the causal theory a degree of autonomy and generality that the traditional theories of information and computation lack, and that lack is the serious disadvantage of function M.

The limitations of M are hardly new. A solid awareness of these limitations exists today, and a huge effort is being conducted in disciplines such as Artificial Intelligence, Computational Intelligence, Supercomputation, the Internet of Things, Big Data, and others in an attempt to overcome the limitations by eliminating M altogether. In Big Data, the volume of data has grown beyond the capacity of the human mind to comprehend it [

The abstraction from M in the causal theory has another, much more fundamental effect: it makes the theory independent of space, time, and matter (substrate), and as such more fundamental than other theories of Physics and a causal predecessor to them. In the causal theory, information is the fundamental entity, and space, time, and matter are perceptions.

This section presents an introduction to the causal theory. Additional details and references are available in other publications [

Where do causal sets come from? This question is addressed in Section 6 of [

Sometimes we know the causes and want to find their effects. That’s prediction. Sometimes we know the effects, and want to find their causes. That’s research, investigation, measurement, observation. Sometimes we know causes and effects at a coarse scale and need more detail. Then we use instruments, telescopes, MRI and fMRI machines, medical tests, surveillance, to find the details we need. We are always after causal sets. Causality is the root of prediction, and our ability to predict empowers us. The Traveling Salesman problem is easily expressible as a causal set, see Section 3.1 below.

When a geometric object such as a chair has automorphisms, we say it has a symmetry. When a mathematical object such as a set has automorphisms, we also say it has a symmetry. In both cases, the automorphisms map “parts” of the object to similar parts, and we call these parts structure. Groups and groupoids are tools used to study the symmetries and the structures.

Let S be a finite set with N elements. If S is represented as a sequence (string), then the automorphisms of S are the permutations

To represent the permutations, consider two-line notation. Arbitrarily select some reference sequence for the elements of S from

be a causal set. A causal set is a partially ordered set where S is a finite set and partial order w is irreflexive

With C given, let

Let now B be a partition of S that is compact over the reference sequence

In general, the property of compactness of B is not preserved for other sequences in

・ a symmetry has been found and structure has arisen; the “parts” that map to similar parts are the compact subsets of the partition;

・ B is called a block system, the

・ compression of the information takes place, because the block system represents the same information as the causal set, but with a coarser granularity, and the blocks contain references to the finer granularity; and

・ it becomes possible to separate the permutations in

The block system of

Algorithms for finding block systems use the property of invariance. They start with a seed of 2 elements of

Causal sets have many more natural properties that are important and actively sought after in a variety of disciplines, including Big Data, supercomputation, and discrete optimization, and that can be calculated quantitatively directly from the causal set by Mathematics. A survey of the properties would be difficult to prepare and somewhat off topic in this paper, but a few more have been discussed in [

As just explained in Section 2.3, when causal set

is a causal set smaller than the original C. As a causal set,

A functor F in the category of causal sets can be defined based on the construction of

Functor F is an endofunctor that morphs C into
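The object map of this construction can be sketched in a few lines. The sketch below assumes the block partition has already been found; the function name `coarsen` and the toy causal set are illustrative, not taken from the paper's code:

```python
def coarsen(S, w, blocks):
    """Object map of functor F, sketched: given a causal set (S, w) and
    a partition of S into blocks, build the smaller causal set C' whose
    elements are the blocks and whose causal pairs are those induced by
    w across distinct blocks.  Finding the blocks themselves (the
    groupoid computation) is not shown here."""
    owner = {x: i for i, b in enumerate(blocks) for x in b}  # element -> block index
    S_prime = list(range(len(blocks)))
    w_prime = {(owner[a], owner[b]) for (a, b) in w if owner[a] != owner[b]}
    return S_prime, w_prime

# Toy causal set: a 4-element chain 0 -> 1 -> 2 -> 3, coarsened by the
# partition {0, 1}, {2, 3}.
S, w = [0, 1, 2, 3], {(0, 1), (1, 2), (2, 3)}
print(coarsen(S, w, [{0, 1}, {2, 3}]))  # ([0, 1], {(0, 1)})
```

Applying `coarsen` repeatedly, with a fresh partition at each level, produces the sequence C, C′, C″, … described in the text.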

The condition for F to be associative under union is that

In the context of the jungle of hierarchies discussed in the preceding paragraph, it follows that the causal sets in the overlapping parts of the hierarchies are not disjoint and hence not associative under F, whereas the causal sets in the distinct tops are disjoint and hence associative. The property of associativity supports the task of comprehending a large body of data, i.e. finding meaningful features and grasping their significance, which still stands as the most critical unsolved problem in machine learning and Big Data. The link between functor associativity and comprehension is too large a topic to cover here and will be discussed in a future paper.

The hierarchy of blocks has, from the start, several mathematical properties that are crucial for the causal theory and for discrete optimization and computation in general. Each level of the hierarchy represents the same information, but at a coarser granularity than the preceding level, and includes references to the finer granularity of the preceding level, hence compression of the information takes place. The hierarchy is a mathematical fractal, because every level of the hierarchy has the same mathematical properties: its blocks are made of blocks of the lower level, and serve as components for blocks of the next higher level. And the blocks in each level have a partial order inherited from the partial order in the lower level, which is in turn inherited by the blocks of the next higher level. The levels of the hierarchy become progressively smaller at higher levels. At the top of the hierarchy there is a single block, or a set of trivial blocks that cannot be compressed any further. As examples, consider Maxwell's equations, which are a set of 4 equations, and the equation of General Relativity, which is the following single equation:

G_{μν} + Λg_{μν} = (8πG/c⁴) T_{μν}

All of which means that information compression takes place. It has been conjectured, but not yet proved, that the compression reaches the Kolmogorov limit. Being a fractal, the hierarchy is scalable and supports power laws. Fractals and power laws are well known to exist nearly everywhere in nature. And here is the core property of causal sets:

Causal sets are naturally recursive, and give rise to scale-free hierarchies, mathematical fractals, power laws, and information compression.

In a more general case, a large causal set can generate not just one hierarchy but a large collection of inter-connected hierarchies, an entire network of invariant blocks. An example of one such network in nature is the network of cognits that exists in the brain, where blocks are referred to as “cognits” and proposed to exist not just in the brain but also in the mind [

Logic, meaning, and semantics are clearly associated with the hierarchical structures of invariants that causal recursion generates. The hierarchies are “new facts,” derived from previously existing facts, the previously existing facts being the causal pairs in w. This perfectly fits the definition of mathematical logic, and this author has called it Causal Mathematical Logic (CML) [

This section introduces the important notion of macrostate, and the action, population and entropy of macrostates. The causal theory is a theory of Physics. Some terms and concepts used in the theory that come from Physics and Mathematics are briefly introduced here in that context. The causal set is a mathematical model of a physical system. The symbols in S correspond to variables in the phase space of the system, and partial order w is the dynamic law. The dynamics of the system is a search of the directed graph of the causal set. A state is the collection of variables already visited by the search at some point. A transition corresponds to the search following an edge of the graph from a visited vertex and visiting an unvisited one. A trajectory is a path of causal edges in the graph that starts from some initial state and ends at some final state, and can potentially be followed by the search. A macrostate is the set of all paths that start from a given initial state and end at a given final state. The population of a macrostate is the number of trajectories in that macrostate. In TSP, a state is the set of visited cities, a path in the graph is a tour, and a macrostate is the set of all tours of the same length. In the causal theory, there is no time. Time is causality itself.
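For a toy TSP instance, the macrostates and their populations can be enumerated by brute force. The code below is only an illustration of the definitions (the distance matrix is made up); it fixes the start city and counts how many tours share each total length:

```python
from itertools import permutations
from collections import Counter

def macrostate_populations(dist):
    """Enumerate every closed tour of a small symmetric TSP (start city
    fixed at 0) and count how many tours fall into each macrostate,
    i.e. share the same total length.  Feasible only for a handful of
    cities, which is exactly why TSP is hard."""
    n = len(dist)
    pops = Counter()
    for perm in permutations(range(1, n)):
        tour = (0,) + perm + (0,)
        length = sum(dist[a][b] for a, b in zip(tour, tour[1:]))
        pops[length] += 1
    return pops

# Illustrative symmetric distance matrix for 5 cities.
dist = [
    [0, 2, 9, 10, 7],
    [2, 0, 6, 4, 3],
    [9, 6, 0, 8, 5],
    [10, 4, 8, 0, 6],
    [7, 3, 5, 6, 0],
]
pops = macrostate_populations(dist)
for length in sorted(pops):
    print(length, pops[length])
```

Even at this scale the pattern described below is visible: middling lengths have the largest populations, while the shortest and longest macrostates are sparsely populated.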

The theory applies to information because information is about physical systems. It proposes that information is fundamental, and that spacetime and mass are perceptions. Information has observable properties of its own, such as symmetry and structure, and entropy, all of which follow directly from the fundamental principle by way of Mathematics. Information also has energy [

Because the theory is scale free, it applies to physical systems at all scales and can represent them at any desired granularity. It can also “learn” more detail and smoothly transition to a finer granularity when more detail becomes available from observation or experiment. When it does, then the hierarchies of invariants change and new behaviors emerge.

The importance and physical meaning of macrostates stem from the following property: the causal theory predicts a general relation between the action and the population of a macrostate (see for example figure 1 in [

The action vs. entropy function is again confirmed in this paper from observed TSP results. It keeps re-appearing in completely unrelated circumstances. Macrostates containing long tours have very large populations, while macrostates with short tours have very small populations. Randomly generated tours are almost always very long, around 50,000 km for gr120, simply because there are so many more long tours than short tours. During optimization, as tour lengths approach the optimum, the populations per macrostate dwindle to 0 or 1, or 2 in some cases. Population also drops dramatically towards the other end, the end with the most action and very long tours, where the populations and entropy again become small. The “knee” on the left of the population curve separates the large-population regime from the small-population regime, and may play a useful role for monitoring the approach to the optimum. Here are two TSP examples, where the notation

・ 1265 (7175 - 7359). Starts from 1 tour per macrostate, begins to grow near 7254, peaks to 25 - 29 in range 7279 - 7289, then drops to 1 - 3 to the right of 7350. Roughly places (left knee - peak - right knee) as (7254 - 7285 - 7350) for this set.

・ 2908 (7049 - 7223). Starts from 1 tour per macrostate, begins to grow near 7063, peaks at 26 - 33 in range 7088 - 7156, then drops to 1 - 4 near 7216. Roughly places (left knee - peak - right knee) as (7063 - 7110 - 7216) for this set.

It would seem that the causal Gauss-like population curve is quite general, suggesting that the Gauss distribution itself may actually follow from the causal theory. The Gauss distribution is known to be ubiquitous in nature and the natural sciences. But the causal theory is even more ubiquitous. These properties have not been used for the present studies, but they may find an application to predict which search techniques are better in each regime.

The notions discussed in this section may also one day help settle the question of the Kolmogorov limit for the compression of information, causal information in this case. I have not studied this issue, but a recursive implementation of a “decompressor” for a hierarchy of blocks should be quite simple, perhaps provably simple, and it does not even have to involve the causal set or the groupoid. For now, I can only conjecture about it.

The direct application of groupoid-theoretical techniques to discrete optimization problems, within the context of the causal theory, is discussed in this section.

In this section, an expression for the causal set for the symmetric closed-tour TSP is introduced. In TSP, the elements of S are the cities, and the cause-effect pairs in w correspond to the problem statement, that the warehouse city―the city where the salesman’s journey begins―is first in the tour, and the destination city, in this case the same warehouse city, is last in the tour. One peculiarity of causal modeling is that causality replaces time. There is no time or space in a causal set, only causality. Time, if needed, is treated as a perception. The immediate consequence is that the destination city, which is the “same” warehouse city that the salesman departs from, is not really “the same city” but only “the same city at a later time” ―a different city in causal lingo―and needs to be explicitly included as another city in the causal model, which is the causal successor of all other cities including the initial warehouse city. Hence, for a TSP with N cities:

where S is the set of symbols representing cities, set w defines a partial order on S,
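As a hedged illustration of this construction (Equation (6) itself is not reproduced here, so the exact pair set below is one plausible reading of the text): the warehouse city precedes every other city, and the added terminal city succeeds every city including the warehouse:

```python
def tsp_causal_set(n):
    """Build S and w for an n-city TSP under one plausible reading of
    the construction above: cities are 0 .. n-1, with city 0 the
    warehouse; a terminal city n (the warehouse 'at a later time') is
    added as causal successor of every other city, and the warehouse is
    causal predecessor of every other city.  The pairs in w encode only
    the problem statement, not the tour itself."""
    S = list(range(n + 1))        # n cities plus the terminal copy
    w = set()
    for c in range(1, n):         # warehouse precedes every other city
        w.add((0, c))
    for c in range(0, n):         # terminal city succeeds every city
        w.add((c, n))
    return S, w

S, w = tsp_causal_set(4)
print(sorted(S))   # [0, 1, 2, 3, 4]
print(sorted(w))
```

Note that w says nothing about the order of the intermediate cities; that freedom is exactly the symmetry that optimization later breaks.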

The TSP problem is a particular case of a much more general class of problems that can be solved with the group-theoretical method known as Causal Mathematical Logic (CML) [

A set, such as the set S of cities in the causal set of Equation (6) by itself, has symmetry. The symmetry is represented by the collection of allowed tours, which is simply “all tours” as far as set S is concerned. If S has N cities, then S has exactly N! tours, and each one of them represents exactly the same set S, because the elements of a set have no order. By definition, a mathematical object such as a causal set has symmetry when it can be represented in more than one way. This set of N! permutations is called the symmetric group of S, and also happens to be the groupoid of S. However, there is no non-trivial structure associated with this symmetry: the groupoid only induces a trivial block system on S, and no non-trivial hierarchy of blocks is generated. The symmetry is simply too strong to be relevant.

However, when partial order w is added and causal set

The process of optimization itself also breaks the symmetry of S even further because it progressively eliminates longer tours. As optimization proceeds, shorter and shorter tours are found and longer tours are discarded. Optimization works as an additional constraint placed on S, and it too breaks the symmetry and generates additional structure. The more we optimize, the more non-trivial structure is generated, and the structures are at their maximum when approaching the optimum.

Locality, the property that an object or structure is influenced only by its immediate environment, is the cornerstone of scalability and parallelism. Consider the minimization of a function F defined as the sum of N non-negative functions, where N can be very large:

F = f_1 + f_2 + ⋯ + f_N, with f_i ≥ 0 for all i,

where the functions

For the hardware architecture of Section 7, the functions

The functional of TSP, the total length of the tour, is positive-definite and satisfies the conditions for F provided the transformations to which F is subjected are local. Transformations such as swapping cities or shifting or flipping blocks, are local, and the synchronization among them―blocking neighboring cities from participating in more than one transformation at a time―is local too. The rest of the tour remains unaffected and capable of engaging in numerous simultaneous and asynchronous transformations. Under such conditions, it is technically possible to assign one core to each

The total execution time for such a network will depend on the average total shift of a city or block subject to repeated swapping. If the shift depends on N, for example if it grows linearly with N, then the execution time will be of

The average city shift tends to be shorter when the tour is good and some blocks have formed, because cities tend to stay within their blocks, which provide a natural upper limit to the shifts. Hence it may help to start from a good seed, perhaps a good Dijkstra tour. The blocks themselves can shift, but as the optimization progresses a hierarchy is built and blocks become parts of bigger blocks, and so on, which become less and less likely to shift. Actually, the complexity of group-theoretical swapping algorithms improves closer to the optimum, rather than deteriorating.

One more consideration is needed: pre-calculated inter-city distances are a global parameter and would violate the requirements for locality. For an implementation of a small problem, it may be possible to store one entire row of the matrix of distances in each core's memory, but probably not for larger problems. However, larger problems remain strictly local for a Euclidean TSP, where relative Cartesian coordinates for each city are stored locally in that city's core and used to calculate distances on demand. It is known that humans and some animals can estimate distances from visual perception and use them, via relative positions, to solve TSP problems more efficiently.
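A minimal sketch of the local-distance idea for the Euclidean case: each core would hold only its own city's coordinates and compute any needed distance on demand, so no global distance matrix is required (function name and data are illustrative):

```python
import math

def local_distance(coord_a, coord_b):
    """Distance computed on demand from per-city coordinates, removing
    the need for a global, pre-calculated distance matrix; only the two
    neighboring cities' data are touched."""
    return math.hypot(coord_a[0] - coord_b[0], coord_a[1] - coord_b[1])

print(local_distance((0.0, 0.0), (3.0, 4.0)))  # 5.0
```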

Block system formation in the course of an optimization process is essential for the process to succeed. Optimization can be viewed as a process that starts from an initial seed tour and repeatedly transforms it in various ways so that it becomes progressively shorter. The transformations are chosen in such a way that they favor the formation of shorter tours. However, due to irregularities in the inter-city distances and the complexity of the combinations they generate, this process is not uniform over the entire tour. It tends to create clusters of strongly bound cities, interspersed among weakly bound or not yet bound cities. City and block swapping causes the cities to compete with each other until they find stronger bonds, stabilize, and form the first clusters. The clusters keep competing among themselves and with the remaining cities, until they further stabilize and form stable blocks that remain invariant under the transformations, simply because their bonds are strong enough. Similarly, as the process continues, blocks form blocks of blocks, and an entire hierarchy arises.
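One simple, illustrative way to spot candidate blocks (not the groupoid-theoretic block-system computation itself) is to look for city adjacencies that survive across successive tours; chains of surviving bonds are block candidates:

```python
def common_blocks(tour_a, tour_b):
    """Adjacencies (unordered city pairs) present in both closed tours
    are bonds the optimization has not broken; chains of surviving
    bonds, read off in tour_a's order, are candidate blocks."""
    def edges(t):
        return {frozenset(p) for p in zip(t, t[1:] + t[:1])}
    shared = edges(tour_a) & edges(tour_b)
    blocks, current = [], [tour_a[0]]
    for a, b in zip(tour_a, tour_a[1:]):
        if frozenset((a, b)) in shared:
            current.append(b)        # bond survived: extend the chain
        else:
            blocks.append(current)   # bond broken: close the chain
            current = [b]
    blocks.append(current)
    return [blk for blk in blocks if len(blk) > 1]

a = [0, 1, 2, 3, 4, 5]
b = [0, 2, 1, 3, 4, 5]
print(common_blocks(a, b))  # [[1, 2], [3, 4, 5]]
```

Here the bond 0-1 was broken between the two tours, while the chains 1-2 and 3-4-5 survived and show up as block candidates.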

The causal theory proposes a functional to be minimized, defined by the basic mathematical metric for sets. In TSP, the functional to be optimized is the length of the tour, which is the sum of the distances between pairs of adjacent cities in the tour. For a tour

where
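In code, the TSP functional is straightforward; note that each term touches only one pair of neighboring cities, which is the locality property exploited above (the distance matrix is illustrative):

```python
def tour_length(tour, dist):
    """Total length of a closed tour: the sum of distances between
    adjacent cities, including the closing edge back to the start.
    Each term depends only on one pair of neighbors."""
    return sum(dist[a][b] for a, b in zip(tour, tour[1:] + tour[:1]))

dist = [
    [0, 2, 9, 10],
    [2, 0, 6, 4],
    [9, 6, 0, 8],
    [10, 4, 8, 0],
]
print(tour_length([0, 1, 3, 2], dist))  # 2 + 4 + 8 + 9 = 23
```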

Machine learning is a vast field of research, and must be mentioned here because it is a very important and direct application of the causal theory. The intent in this section is not to review the field but only to place that research in the context of the causal theory. Machine learning has been evolving slowly but steadily in the course of decades, and this evolution has brought it from simple approaches that rely heavily on hand engineering and computer simulation to more sophisticated approaches that come closer and closer to the causal theory.

At the forefront of those developments is Deep Learning (see for example this video [

Ng’s working hypothesis about the single learning algorithm is correct. A decomposition in terms of a heuristic orthogonal base is not. The base is a feature, and pre-specifying it would amount to feature engineering and would result in scaling problems and unphysical results. What is needed is a fractal base, one that is orthogonal across fractal levels. The causal pairs and their block system hierarchies play this role. Orthogonality alone is not enough.

The single learning algorithm is already known, and it is Causal Mathematical Logic (CML) [

To prove that P = NP for TSP, one would have to prove that all cases of TSP, or more generally of CML, can be solved quickly. There is one case per causal set, and the number of causal sets of all sizes is countably infinite, so a proof by induction seems a natural choice.

Chaos theory pioneer Edward N. Lorenz has defined chaos as a case where “the present determines the future, but the approximate present does not approximately determine the future”. If a given causal set is considered as the past and the corresponding hierarchies as the future, then any process that changes the causal set, such as machine learning, will be chaotic. The resulting hierarchies are very sensitive to details in the causal set (butterfly effect). This effect was first observed in a study of small causal sets (about 12 elements or less) [

A proof by exhaustion would work for problems of a given subset, for example all problems of a prescribed size or range of sizes, but it would fall short of proving the statement. A direct proof is highly unlikely to exist, because, as argued in [

On the other hand, the physical world does provide numerous examples of CML in action. The existence of fractals and power laws, the property of scaling in nature, the ability of humans to solve TSP, all the observed phenomena of emergence and self-organization, are examples of CML in action.

This section is addressed to the reader who is interested in programming details and practical hints. It reports explicit results obtained from computation for TSP gr120. Many theoretical details of practical importance are also included. All results were obtained by application of the tunneling algorithms described below, and all computations were performed in debug mode on a personal desktop computer with a 2-core 2.70 GHz Intel Pentium processor, 4 GB of memory, and a 64-bit operating system.

City tunneling (ct) is the workhorse for the computation. The code is very light. A tour with 120 cities can be fully surveyed for all 120 cities and the shortest tour found in about 0.042 sec. This time interval grows linearly with the number of cities, so it is about 350 μsec per city. The code is experimental and has not been optimized for production. It is designed for ease of change and experimentation. It has not yet reached full automation, and there still remain some operations that I do manually, such as copying the results from a ct run to a file that is used as input for the run that follows. There is no user interface.

I will now report numerical results obtained in the course of the study in a more or less logical order, not necessarily chronological, to make it easier for the reader to follow the argument. Here I report only summary tables, the tours themselves are found in the Supplementary Material [

In this section, I introduce the suite of sequential algorithms based on the causal theory and specifically developed for the TSP problem. The algorithms were developed only as a proof of concept and a guide for future developments, and do not represent the long-term goal, which is to develop a hardware prototype that can solve large problems, including TSP, hopefully in microseconds, using a specially designed GPU-style chip. I call them tunneling algorithms because they are designed to tunnel through the walls of local-minimum basins. The purpose of tunneling is to evaluate a given seed tour with regard to its capacity to generate shorter tours, and to do so by using only local information.

Many depth-first algorithms create a collection of local minima different from the global minima that exist in the search space. The local minima are determined by the search algorithm, not by the search space. To every local minimum there corresponds a catchment basin, also determined by the algorithm, from which a simple descent search cannot break out. The algorithms are designed to “tunnel” through or over the walls of the basin and search for tours outside it that could allow the depth-first process to continue. The term is taken from Quantum Mechanics, where a charged particle such as an electron trapped in a potential well can free itself by tunneling through the walls of the well (thus making the existence of transistors possible). The combination of locality, efficiency, and the ability to tunnel makes tunneling algorithms a very powerful tool.

These algorithms bear some resemblance to the classical simulated annealing approach, particularly for planning-type problems, but make their optimization decisions based on group theory and not on heuristics. A comparison with simulated annealing is given in Section 5.3. The most basic tunneling algorithms are city tunneling and block tunneling, but there are several others. The algorithms start from a given seed tour and find only one tour, the shortest that the algorithm can find for the given seed. City tunneling is discussed first.

The most basic tunneling algorithm is city tunneling, or ct. It starts from any given tour, such as a random tour or a Dijkstra tour, or any other tour obtained in the course of optimization. Cities are selected from set S in a random order. Each selected city is (virtually) removed from its current location in the tour, inserted back into the tour at the leftmost location, and then repeatedly swapped with the neighboring city to the right until the end of the tour. After each swap, the (signed) increment of tour length is locally calculated. After completion of the tunnel, the city is re-inserted at the best location, which could be its original position, and left there. The next city in the random order is selected, and the process continues until all cities have been used. ct is very light code, and is the workhorse among the tunneling algorithms. Let:

… A B C D …

be the part of a tour where a swap is about to happen between adjacent cities B and C. After the swap, this piece of tour becomes:

… A C B D …

and the resulting (signed) increment of length is

Δℓ = [d(A, C) + d(B, D)] − [d(A, B) + d(C, D)],

where d(X, Y) is the distance between cities X and Y. The calculation of Δℓ is local: it involves only the four cities A, B, C, D, regardless of the size of the tour.
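As an illustration, the local increment can be computed with four distance lookups. The following sketch is mine, not the paper's code; the list-based tour representation and the distance function `d` are assumptions of the sketch:

```python
import math

def swap_increment(tour, i, d):
    """Signed tour-length change from swapping adjacent cities at
    positions i and i+1 (requires 0 < i < len(tour) - 2).
    Only the four cities A, B, C, D around the swap are consulted,
    so the computation is local and O(1)."""
    A, B, C, D = tour[i - 1], tour[i], tour[i + 1], tour[i + 2]
    return (d(A, C) + d(B, D)) - (d(A, B) + d(C, D))

# Illustrative usage: four cities at the corners of a unit square.
coords = {0: (0, 0), 1: (1, 0), 2: (1, 1), 3: (0, 1)}
d = lambda x, y: math.dist(coords[x], coords[y])
delta = swap_increment([0, 1, 2, 3], 1, d)   # swap cities 1 and 2
# delta = 2*sqrt(2) - 2 > 0: the swap would lengthen this piece of tour
```

Because only the four cities around the swap enter the formula, the cost per swap is constant, which is what makes ct so light.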
As explained in Section 3, group-theoretical blocks of cities form when two or more cities bind together. The process of optimization forces the cities to compete for places in the tour and to bind together and create a block when the bond is sufficiently strong to resist further transformations imposed by the optimization. As optimization proceeds, more and more blocks are formed, and fewer and fewer free cities are left. A city-stabilized tour bears blocks, and may still have free cities, and any further improvement in tour length can be achieved only by tunneling blocks, not cities. The algorithm can be described in the following steps:

Step 1. Arbitrarily select a city c that is not the warehouse from the seed tour.

Step 2. Using successive transpositions of adjacent cities, and advancing in one direction, say left to right, move the city to each one of the slots in the tour. For each slot, calculate the increment of tour length using Equation (12).

Step 3. Leave the city in the slot where the increment is the most negative.

Step 4. Arbitrarily select another city and repeat until all cities are exhausted.
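The four steps can be sketched as follows. This is a minimal, unoptimized illustration under assumed conventions (list-based tours whose first and last cities are fixed, and a distance function `d`); it recomputes full tour lengths for clarity, whereas the ct described above uses the local increment of Equation (12):

```python
import random

def tour_length(tour, d):
    """Total length of an open tour given a pairwise distance function d."""
    return sum(d(tour[k], tour[k + 1]) for k in range(len(tour) - 1))

def city_tunneling(tour, d, rng=random):
    """One ct pass: each interior city is tunneled through every slot
    and left where the resulting tour is shortest (Steps 1-4)."""
    tour = list(tour)
    interior = tour[1:-1]        # warehouse and its replica stay fixed
    rng.shuffle(interior)        # Step 1/4: cities in random order
    for city in interior:
        rest = [c for c in tour if c != city]
        # Step 2/3: try every interior slot, keep the best placement
        # (which may be the original one).
        best = min(range(1, len(rest)),
                   key=lambda j: tour_length(rest[:j] + [city] + rest[j:], d))
        tour = rest[:best] + [city] + rest[best:]
    return tour
```

A production version would accumulate the local increments instead of re-evaluating whole tours, but the accept/reject decisions are the same.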

The (signed) increment

Sets

But when only the city-stabilized tour is available, there is no block and no transformation: the blocks are not visible, and the transformation is not defined. With only one tour, the algorithm on pages 40-41 in [

1) Select a block size in the range 2 to

2) Select a block of that size from the tour and repeat what follows for all such blocks.

3) Tunnel the block over the tour and leave it where the tour length is the shortest.

For each block, the shift of the block from its original position to its new position is the transformation, and the block is clearly invariant under this transformation because it exists in both t and
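The three numbered steps can be sketched as follows, under the same assumed conventions as before (list-based tours with fixed endpoints and a distance function `d`); this is an illustration, not the paper's implementation:

```python
def tour_length(tour, d):
    """Total length of an open tour given a pairwise distance function d."""
    return sum(d(tour[k], tour[k + 1]) for k in range(len(tour) - 1))

def block_tunneling(tour, size, d):
    """One bt pass for a given block size: slide each contiguous block of
    `size` cities to every legal slot and keep the best placement.
    Endpoints (positions 0 and -1) are kept fixed."""
    tour = list(tour)
    for start in range(1, len(tour) - size):
        block = tour[start:start + size]
        rest = tour[:start] + tour[start + size:]
        best = min(range(1, len(rest)),
                   key=lambda j: tour_length(rest[:j] + block + rest[j:], d))
        tour = rest[:best] + block + rest[best:]
    return tour
```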

There are some complications. One of them is that a block of cities is a set, not a sequence. The cities in the block can be permuted in any order, which may be computationally expensive. Another complication is that the action of shifting a block from one place in a tour to another upsets the city balance, resulting in a tour that may not be city-stabilized and could be improved by restoring the balance. To address the problem, several categories of permutations were defined and graded by cost, and a different algorithm was created for each category. This way, it becomes possible to adjust the power of the search at each step to the degree of difficulty of that step, and to do so automatically. This can be done by starting each step with a low-graded algorithm, and escalating only if necessary. The idea proved to be very effective in the case study, where several threads reached the optimum using only the least cost algorithms. The categories are:

1) ct―City tunneling. Scales as

2) bt―Block tunneling as-is. Scales as

3) bipi―Block in-place inversion. A block is inverted in-place (flipped over) in a tour. The computational effort is tiny. Scales as

4) bti―Block tunneling with inversion. In each tunnel, the block is considered both as-is and inverted. Because the additional computational cost is so small, this category is regularly included with as-is.

5) bct―Block-city tunneling. Block tunneling with block as-is and block inversion followed with city tunneling to restore the city balance before a decision is made regarding which tours to keep and which to discard. Scales as

6) fbp―In-place full block permutations.

7) fbpt―Full block permutations followed with block tunneling before a decision is made.

8) fbpbct―Full block permutations followed with block tunneling and city tunneling before a decision is made.

Other categories with various degrees of difficulty can be defined, for example block rotation, either in place or with tunneling, but they have not been implemented. Also, the categories can be subdivided by block size; for example, bt2 would be block tunneling, as-is and inverted, with blocks of size 2. Of course, with blocks of size 2, this is the same as full block permutation. The sequential algorithms have not been optimized or benchmarked, because they are intended only for proof of concept and not production. However, the hardware proposed in Section 7 should work within

There is value in blocks. A groupoid-theoretical block is a strongly coupled collection of cities that is weakly coupled to the rest of the cities. A block is not necessarily stable all the way to the optimal tour, but chances are that it will remain stable at least within a certain range of tour lengths, hopefully sufficient for our purposes. Hence, we prefer to use groupoid blocks for the optimization whenever possible.

The transformation of block inversion, or inverting a block in place, simply consists of reversing the order of the cities in the block. It is advantageous to combine block tunneling with block inversion in a single algorithm. The block is tunneled into a slot, the increment is calculated, locally as usual, and then the block is inverted in place and the new increment is calculated, also locally, because in both cases the internal length of the block remains the same for a symmetric problem.
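A sketch of the in-place inversion test (bipi) for a symmetric problem. Because the internal length of the block is unchanged by inversion, only the two boundary edges need to be compared; the function name and list-based representation are assumptions of this sketch:

```python
def best_block_orientation(tour, start, size, d):
    """Compare a block as-is vs inverted in place (bipi). For a symmetric
    problem only the two boundary edges change, so the test is O(1).
    Inverts the block in place when that shortens the tour; returns the
    signed increment of the inversion."""
    L, R = tour[start - 1], tour[start + size]      # boundary neighbors
    B0, B1 = tour[start], tour[start + size - 1]    # block endpoints
    delta = (d(L, B1) + d(B0, R)) - (d(L, B0) + d(B1, R))
    if delta < 0:                                   # inversion shortens the tour
        tour[start:start + size] = reversed(tour[start:start + size])
    return delta
```

The constant cost per test is why combining inversion with every block tunnel (bti) adds almost nothing to the computation.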

Heuristic versions of block tunneling have been proposed.

Depth-first search algorithms, even tunneling algorithms, can cease to be effective at a local minimum they can’t tunnel through. At that point, it may be necessary to switch to a more powerful algorithm, even if it is less efficient. The tunneling arsenal includes several more types of tunneling, such as full permutation of the cities in a block, or permutations of the sub-blocks in a block. It is not known if these types are needed in general. The case study was solved without them.

A worst-case scenario for a problem with N cities would be one where full permutation of the cities in a large block turns out to be necessary. The largest possible block size is

There are obvious similarities between the city-swapping algorithm ct and the well known method of Simulated Annealing (SA). However, ct is actually used in a very different manner. In comparison with SA, ct considers group-theoretical blocks as valuable entities, with their own identity and a value that deserves to be preserved. A direct comparison at the level of benchmarks is not possible at this time because ct, and the other sequential algorithms, are intended for research, and have been optimized for ease of change and early bug detection and diagnostic, but not for performance. Production code has not been developed because it needs special hardware, which does not exist yet. The following lists some of the most important conceptual differences between ct and SA.

・ SA does not use scale-free fractals, as ct does, and as a result scaling in SA is limited and places limitations on the size of the solvable problems. Further parameter adjustments may be possible, but are not likely to result in major improvements. With ct, it should not be too difficult to come close to the theoretical limit of O(1), meaning constant execution time independent of N, but this has not been demonstrated yet.

・ In SA, the assumed heuristic transition probability between states depends not only on the states but also on the temperature, which is a global parameter and compromises the locality of TSP.

・ In SA, blocks of cities1 are viewed as mere computational artifacts that can be swapped and benefit the annealing process by increasing the acceptance rate (see for example [

・ ct considers the problem as a physical system and takes advantage of the lessons learned from Physics. SA does not. Nature has solved the optimization problem; SA is not there yet.

・ SA can guarantee an acceptable solution in a fixed amount of time, but not the optimum solution. ct guarantees either the optimum solution or a partition of the problem by half or more.

・ SA is a probabilistic heuristic process, ct is a deterministic group-theoretical process.

・ SA works by finding better neighbors of a given seed state, but is sometimes forced to accept worse neighbors when the search gets trapped in a local minimum of the algorithm. To do this, a heuristic probability of acceptance is defined that prefers better neighbors but also allows worse neighbors at a lower acceptance rate. This is where temperature comes into play. ct uses well-defined techniques to tunnel out of traps.

・ SA cannot tell a local minimum from a global one. ct can.
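For reference, the heuristic acceptance rule that SA relies on is the standard Metropolis criterion, reproduced here as a minimal sketch to make the contrast concrete; the global temperature T is the heuristic parameter the comparison refers to:

```python
import math, random

def sa_accept(delta, T, rng=random):
    """Metropolis rule used by SA: always accept an improvement
    (delta <= 0); accept a worsening move of size delta > 0 with
    probability exp(-delta / T). The temperature T is a global,
    heuristic parameter; by contrast, ct's acceptance is deterministic
    and purely local."""
    return delta <= 0 or rng.random() < math.exp(-delta / T)
```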

A case study of the gr120 TSP example was undertaken in order to demonstrate the applicability of the causal methods to NP-hard problems and estimate performance and other features. The problem is defined in the TSP Library [

Some results of the study are summarized in

Problem gr120 has 120 cities. This is a closed, symmetric problem, meaning that the distances between cities are independent of the direction of travel. The city where the tour starts and ends is called the warehouse city. In order to express the TSP problem in causal notation, and as discussed in Section 3.1, it is necessary to add an extra city, which is a replica of the warehouse city at a later time. The 121 cities are numbered 0 to 120, with 0 being the warehouse city and 120 being the causal replica. In the causal theory, Equation (6) requires city 0 to be first and city 120 to be last in any tour. All other cities remain mobile and are allowed to bind in any order. There are two optima, labeled 6942A and 6942B. The vertical scale in the plot represents tour length. The scale is not uniform, but marks are provided at levels 0.5%, 1%, and 2% above the optimum. Many interesting features can be observed in the optimization graph. They are discussed in the sections that follow.

The graph of

Three major sections can be distinguished in the graph, and are labeled as (a), (b), and (c). Section (b) is where the algorithms found the optimum for the first time, and also where I invested most of my research work. I worked with Section (b) for a long time while designing the algorithms in terms of the causal theory and trying to reduce manual intervention and automate procedures to bring them closer to a production level, all of that while rigorously avoiding the introduction of any heuristics. The use of heuristics would have considerably reduced the value of the theory and also the scope of its application. In addition, I had to dispel the possibility that the initial success could have been due to pure luck, and not to the systematic effectiveness of the search algorithms. And this is why I invested so much research in Section (b), particularly below the 2% level. Section (b) alone has yielded half of the 8 paths to optimality, suggesting that the approach to the optimum is robust and the optimum should not be too hard to find because it can be reached in many different ways.

Sections (a) and (c) represent preliminary production work, intended to test the algorithms and highlight several of their features, although I never intended to bring them to full production capability. These algorithms, as explained above, are only for proof of concept, not full production. Sections (a) and (c) also help to dispel the notion that random seed tour 56081A has some hidden property or is in some sense unique in its ability to generate a good path. In fact, it is remarkable that this single seed, tour 56081A, has generated two completely different paths, one leading to optimum 6942A, and the other leading to the other optimum 6942B. This feature attests to the power and generality of causal optimization. Sections (b) and (c) were also instrumental for obtaining an estimate of the execution time, which is close to 30 minutes. This estimate leaves plenty of room for improvement―perhaps by a factor of 10 or more―if the algorithms were to be brought to a full production level.

Nearly all segments of search in the graph are either steady descent or steepest descent. The only exceptions are in Section (b), and are the result of my intervention as part of the research. However, segments (7319A, 7408A, 7251A) and (6989A, 7044C, 6942A) are not exceptions: both are steepest descent. Segment (7319A, 7408A, 7251A) is actually a single bct2 step, directly from 7319A to 7251A, which consisted of an initial bt2 step that generated 7408A, followed by the ct step to 7251A. The second segment, (6989A, 7044C, 6942A), is an fbpt6 to 7044C, followed by a ct to 6942A, as it should be, except that I had not yet implemented what should have been the combination fbpct6 and had no choice but to perform both steps manually and plot the intermediate results. In other words, an fbpct6 step would be a maximum descent directly from 6989A to 6942A. The ct step from 7044C also happened to generate tour 6997A, which leads to 6942A by way of two light bct2 steps. I plotted this path as well.

The only two ascending arrows in the graph that I have chosen to plot teach us another important lesson. The first arrow ascends from 7319A to 7408A, for a total ascent of 89 km. The second arrow ascends from 6989A to 7044C, for an ascent of 55 km. The two ascents are different, and both are automatically determined by the algorithm and not by an arbitrary heuristic. No heuristics and no probabilities are necessary, as they are, for example, in SA.

There are two segments in the graph, fbp26 in Section (b), and fbp33 in Section (c), that need further discussion. It is not possible to calculate all permutations for blocks of these sizes using a desktop computer. But the initial 120-city TSP has been reduced to a 26-city TSP in the first case, or to a 33-city TSP in the second case, and the problem can be further pursued if needed. Alternatively, finding close-to-optimal solutions when computer power or time are limited is also of major importance. Tour 6989A is 0.6% above the optimum, and tour 6951B is only 0.13% above the optimum (if an estimate of the position of the optimum is available), or they may simply be short enough for the purpose at hand.

However, in the first case, three different bypasses have been found. The first one is a single bct step directly from 7025A to the optimum 6942A, shown as an initial bt step to 7019A followed with a ct step to 6942A. The second bypass starts with an fbpct6 step passing through 7044C and then ct to the optimum, and the third bypass is an fbpt6 step to 7044C followed with 3 light steps. The lesson we learn is that, whenever an excessive degree of difficulty is encountered, it may be worth trying alternatives.

In the last case, the fbp33 segment at the end of Section (c) in the graph, it may be possible to find a bypass, but I have not tried, because the intention was to present a clean production thread that aims directly at the optimum. However, this paper is for research, and all questions that help understanding are legitimate. But then, the question remains: how do we know that the final step requires fbp33 if we can’t even get there? In this case, and for research purposes only, I compared tour 6951B with the optimum tour 6942B. Of course this answer is of no help for a search, but it is a valid action if the purpose is to gain understanding.

The comparison between 6951B and 6942B is shown at the bottom of TheTours.txt. It starts with a block of size 19 made of 19 trivial sub-blocks of size 1, identical in both tours. Then there follows a block of size 25 with the 4 cycles (5 47 72 61 83 101 56 98), (35 100 82 103 109 113 36 34 111 105 66 9), (79 38 26 62), and (4), of sizes 8, 12, 4, and 1, respectively. After this, there is a block of size 8 with the cycle (76 52 108 96 94 63 87 11), and finally a block of size 69 with 69 trivial sub-blocks, each of size 1, identical in both tours. The 4 cycles of the block of size 25 are inter-tangled, and not bound. The block of size 8 has only 1 cycle, hence it is bound, however bipi8 does not yield any improvement. The level of difficulty here can be as high as fbp33, as no kind of block tunneling can untangle the 5 cycles together. In this case, the initial TSP of 120 cities has been reduced to one of size not higher than 33 cities.
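The kind of comparison described here, decomposing the permutation that maps one tour onto the other into cycles, can be sketched as follows (a hypothetical helper for illustration, not the code used in the study):

```python
def cycles_between(tour_a, tour_b):
    """Decompose the permutation that maps tour_a onto tour_b into cycles.
    Trivial (size-1) cycles mark cities occupying the same position in
    both tours; larger cycles mark the entangled regions discussed above.
    Assumes both tours visit the same cities exactly once."""
    pos_b = {city: k for k, city in enumerate(tour_b)}
    # Position in tour_b of the city at each position of tour_a.
    perm = [pos_b[city] for city in tour_a]
    seen, cycles = set(), []
    for k in range(len(perm)):
        if k in seen:
            continue
        cyc, j = [], k
        while j not in seen:
            seen.add(j)
            cyc.append(tour_a[j])
            j = perm[j]
        cycles.append(tuple(cyc))
    return cycles
```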

By contrast, all bipi steps, as the one near the end of Section (c), are very light regardless of the size of the block.

Section (a) of the optimization graph, with the only exceptions of the initial ct and the bt2 segment from 7106A to 7063A, is entirely made of bct2 segments. Section (c), again with the exceptions of the initial ct and the 2 last segments bipi30 and fbp33, is also made of bct2. If segment (6944A, 6942A) is counted twice, because it is part of two different optimization paths, then of the 7 segments that reach the global optimum 6942A, 4 are bct2. What gives bct2 such power? And why does it work so well with blocks of size 2, but not with other sizes?

I believe bct2 is so powerful because it allows the search process not only to tunnel out of a local minimum basin through its walls, as bt does, but also to overcome the basin entirely by jumping over its edge. After cities were stabilized by ct to form initial blocks and bt was invoked for every block size but failed to find a tunnel, we know that both ct and bt are trapped in a basin. This is the precise point where the steepest descent prescription must be abandoned, and the next logical step is to try to go over the edge of the basin. bct2 tunnels the same as bt2 does, except that, after each move, and even if the resulting tour is longer, it uses ct to stabilize cities and only then decides whether to accept or reject the resulting tour. If the resulting tour is shorter than the seed, it means that the resulting tour is outside the ct, bt basin.

Going over the edge of a basin without knowing where the edge is located, is a new type of search. The bct algorithm effectively tries to jump over the edge or find a place where the wall is thinner and tunnel through it. In the process, it senses the height and shape of the wall as it crawls its way upwards. The indicator of escape is a tour shorter than the current seed, which is outside the basin. There are many ways to implement this algorithm and make it more efficient, but I implemented only the most basic form at this time.
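A minimal sketch of the bct idea: tunnel a block, stabilize with ct even when the intermediate tour is longer, and accept only a result shorter than the seed. The helper signatures (`ct` for city tunneling, `tour_length`) are assumptions of this sketch, passed in as parameters:

```python
def bct_step(seed, size, d, ct, tour_length):
    """One bct move: tunnel each block of `size` to each slot, stabilize
    cities with ct, and accept the first result strictly shorter than the
    seed, even if the intermediate (pre-ct) tour was longer. Endpoints
    at positions 0 and -1 are kept fixed."""
    seed_len = tour_length(seed, d)
    for start in range(1, len(seed) - size):
        block = seed[start:start + size]
        rest = seed[:start] + seed[start + size:]
        for j in range(1, len(rest)):
            # The intermediate tour may climb above seed_len here;
            # ct may then bring it back down outside the basin.
            candidate = ct(rest[:j] + block + rest[j:], d)
            if tour_length(candidate, d) < seed_len:
                return candidate      # escaped the ct/bt basin
    return seed                       # no escape found at this block size
```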

A preliminary concept for a hardware architecture that would approximate the theoretical limit for group-theoretical optimization methods was presented in Section 4 of [

As explained in the Introduction, the prediction based on the causal theory, later confirmed by neuroscientific research, that dendritic trees must be optimally short, is an early success of the causal theory. The proposed architecture attempts to do what the brain does: save biological resources by making short dendrites, and in the process optimize the functional of Equation (8) and generate the patterns or hierarchies of information discussed in Section 2.4.

The crucial issue that needs to be resolved is that dendrites, in either the brain or the causal machine, must satisfy two completely different conditions. They must be (or become) optimally short, and at the same time they must correspond to the (cause, effect) pairs in the partial order w. In humans, there are 86 billion neurons and about 10,000 dendrites per neuron. Hebbian learning explains how dendrites are formed, but it is not clear why there are so many of them or how they are made (or become) optimally short. A possible explanation might be the following. Neurons in the brain do not have access to each other except by way of their dendrites. There is no bus. Hence, dendrites play the role of sensors: they sense the environment around a neuron, and provide the neuron with the information it needs to select the best connection, one that corresponds to a (cause, effect) pair and at the same time is optimally short. This hypothesis appears to be confirmed by the pyramidal cells in layer 6 of the human neocortex, which may have as many as 200,000 dendrites each.

It may appear from the discussion that, once the “best” connections have been found for a neuron, only a few of the dendrites would remain in the circuit and the rest would be disconnected. This may be true for a static environment, such as the one we are discussing here. In a real-world scenario, however, the process of learning by way of a stream of causal information constantly coming into the brain from the senses is anything but static. Set S and partial order w constantly change as the result of new information, and a dendrite that is optimally short at one point may not be at another. To support the changes, the structure of dendrites must remain flexible, and this requires many dendrites per neuron, at least for short-term and mid-term memory.

In the case of long-term memory, the very slow process of methylation “freezes” the functioning dendrites (and may eliminate the rest, but this is not confirmed). This is how people keep childhood memories for their entire lives. The same mechanisms are proposed for the causal machine.

The cores in the causal machine are of course stationary in the architecture. However, it is easier to explain how the hardware works if one assumes that the neurons actually move, or virtually move, along the linear array, constantly swapping their locations between adjacent neurons and physically rearranging themselves in such a way that dendrites become optimally short. They know how to do this by counting the number of active dendrites that “pull” from them in each direction. If the differential pull acting on a neuron is positive (to the right), but negative for the neuron at its right, then the two neurons will swap their positions. All neurons constantly do this, hence they work in parallel, and their collective action quickly optimizes the overall length of the dendrites. In the brain, neurons do this to save biological resources, not to optimize anything or to create the structures of thought. Intelligence is a side effect of resource optimization, and was not intended by evolution. Note that this process is driven by resource preservation, and acts in the direction of minimizing the entropy and action in the system, towards point
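The swapping process can be sketched as follows. For simplicity, this sketch accepts a swap when it shortens the total dendrite length, which plays the role of the differential pull comparison described above; the array and edge-list representations are assumptions, and a hardware version would evaluate all adjacent pairs in parallel:

```python
def total_length(array, edges):
    """Total dendrite length: sum of position distances over all
    (neuron, neuron) connections in the linear array."""
    index = {n: k for k, n in enumerate(array)}
    return sum(abs(index[a] - index[b]) for a, b in edges)

def settle(array, edges, sweeps=10):
    """Each sweep lets adjacent neurons swap whenever the swap shortens
    the total dendrite length; repeated sweeps pull connected neurons
    together into clusters."""
    array = list(array)
    for _ in range(sweeps):
        for k in range(len(array) - 1):
            trial = array[:]
            trial[k], trial[k + 1] = trial[k + 1], trial[k]
            if total_length(trial, edges) < total_length(array, edges):
                array = trial
    return array
```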

As the neurons adjust their locations in the array, they form clusters. These clusters are group-theoretical blocks. Why? They are clusters because they “resist” (i.e., are invariant under) the transformations of the swapping process: the internal binding in a cluster is stronger than the external bindings that try to pull the cluster apart, and this is precisely the group-theoretical requirement for blocks to form.

Once the first clusters have formed, they tend to travel together. They do this simply because they are stronger than the rest of the neurons. Hence, now we have blocks of neurons swapping their locations with single neurons or other blocks of neurons. This swapping corresponds to the processing of the second-level causal set

There are a number of mathematical properties that are being feverishly sought after in front-line disciplines and technologies such as Big Data, Supercomputation, Discrete Optimization, Neuroscience, Complexity, the Internet of Things, Machine Learning, Robotics, Surveillance, and Computer Science. The properties include scalability, partition, recursion, structure, compression, granularity, determinism, certainty, prediction, fractals, power laws, meaning, semantics, and others. It turns out that these properties are precisely the mathematical properties of causal sets, and the reason they are lacking is that the mathematics of causal sets―the causal theory―has not yet been applied.

Causal sets are information in its most crude form. They come from all sources that information comes from. The properties are theoretical, hence they can be directly and quantitatively applied in engineering and development. There is no need to approximate them with heuristics or engineer them into the technologies, they are already there.

Many of the ideas described here are not just new but also unique in the literature. All the properties emerge from three factors: the universality of the principle of causality, the unification of information as a representation of the principle, and the power and generality of causal groupoids and their symmetries over more traditional groups. The methods are here described in the context of TSP, but they are general and can be extended to CML in general.

This paper has focused on explaining the theory and the properties of interest, and has provided supporting evidence for the working hypothesis that a unified encoding of information as a causal set, combined with the recursive groupoid-theoretical partition of NP-hard problems, can compensate for their combinatorial complexity and lead to a deterministic solution in low-order polynomial time. To illustrate the concept, sequential but non-heuristic algorithms were developed and applied to a case study of the Traveling Salesman problem, presented in this paper. All experiments were successful. The theoretical limit for the algorithmic complexity may be as low as

Evidence from computations can verify a theory and make it more believable and more reliable. I hope this goal is achieved and will help to open new fields of research and new possibilities in technology.

Felix Lanzalaco called my attention to literature about TSP in humans published in journals of Cognitive Psychology that I don’t normally read. He also connected TSP with the phase offset in the human hippocampus [