1. Introduction
A formal language is set of the words over some alphabet [1]. The standard set operations, like union and interception, can be performed under formal languages [2]. Another simple basic operation of formal languages is their concatenation. However, the complexity of the inverse operation of decomposing a formal language into a nontrivial concatenation of factor languages is more sophisticated and has a long history of studying [3]. A concatenation is trivial if one of the languages consists exactly of the empty string {ε}. A non-empty language is said to be prime if it cannot be written as a catenation of two languages neither one of which is the singleton language consisting of the empty word.
We investigate decomposition problems for the class of finite languages. The finite language is finite set of words. This class contends the codes, the codes are languages recognized by special kind of automata, so-called flower automata. The representation of the finite language in special graph of automata makes it possible to decompose into a product of prime languages.
Section 2 contains the basic definition and notation of formal languages, codes, graphs, finite automata and formal serials over the semiring.
Section 3 discusses the algorithm of decomposition prefix codes and finite language and a practical example of application of this algorithm.
Section 4 applies computer discrete algebra system GAP to decompose prefix codes and finite language.
2. Definitions and Notation
In this section, we remember the definition and notation about formal languages, codes, free monoid, free algebra, automata and graphs. The following definitions taken from [4] [5] [6] will be used.
An alphabet Σ is finite set letters
. A word or string w is finite length sequence of letters over alphabet Σ. We denote as
the set of all finite words. The set of
with respect to the concatenation operation forms a free monoid. Language L is subset of monoid
. Free monoid
contains the empty word ε. Semigroup
is monoid
without empty word ε.
A basic operation of free monoid
is concatenation of two words
. The concatenation can be expanded to the formal languages
. The result of concatenation is the language
.
This operation is like integer multiplication. The inverse operation that is factorization is more complicated. There is problem to find factors/divisors for big integer numbers and prime integer numbers.
The complexity of the inverse operation of decomposing a language L into a nontrivial concatenation
is also complicated.
A concatenation is obvious if one of languages
consists exactly of the empty string. A language L is prime language if it cannot be decomposed to a non-trivial concatenation of two languages L1∙and L2. The prime factorization of language is the decomposition into prime factors, prime languages.
The problem of prime factorization is undecidable for context-free languages [6]. A set X is a code if any word in
can be written uniquely as a product of words in X. To say other words, word
has a unique factorization in words from X. A code X never contains the empty word ε, because word
has different presentation. Any subset words from a code X is a code too.
The word u is a prefix of a word v, denoted as
, if
, for some
. We say that u and v are prefix comparable if either
, or
. The set X is a prefix code if no element of X is a proper prefix of another element in X. This is equivalent to say that there are no comparable words
of the relation
in the set X. For all words
, if
, then
.
We use standard graph theory notions, as contained in [7], so we only fix the notation and give a few definitions and example below.
A digraph (directed graph)
consists of a finite set of vertices V and a set of edges
. For a subset of vertices
, let
denote the induced sub(di)graph
, which is obtained by restricting the vertex set V of G to U and redefining the edge set E appropriately.
The graph
has vertices
and edges
(Figure 1).
An automaton A [2] [4] [8] over alphabet Σ consists of a set of states Q, the initial states
, the final/terminal states
, and a set
called the set of edges. The automaton is denoted by:
The automaton is finite when the set Q is finite. The language L is recognized by A, denoted L(A), is the set of words in
which are labels of paths from I to T.
Figure 2 shows the automaton A with four states, the set of initial states
, the set of terminal states
, the set of edges
. The finite language
is recognized by automaton A.
Recall that an algebra [9] over a field K is a K-vector space A with a binary operation (multiplication)
,
specified on it, satisfying the following requirements:
1)
,
for any
;
2)
for any
,
.
We will additionally assume that:
3) There is a unit in A, i.e. an element 1 such that
for any
;
4) Algebra A is associative, i.e.
for any
.
Figure 1. Graph G with V = 5 vertices and E = 9 edges.
Throughout the following, we will additionally assume that the field K is the field of rational numbers or the field Z/2Z. We can embed monoid
over alphabet
into free algebra of polynomials
with homomorphism
by definition on letters of alphabet Σ:
.
We need to use the formal series over the alphabet Σ with coefficients in semiring K. We recall the definition of semiring [10].
Let K be a semiring.
A semiring K is a set equipped with two operations denoted + and
satisfying the following axioms:
1) The set K is a commutative monoid for addition + with a neutral element denoted by 0;
2) The set K is a monoid for multiplication
with a neutral element denoted by 1;
3) Multiplication is distributive on addition;
4) For all
,
.
A formal series over alphabet Σ with coefficients in K is a mapping:
We denote the set of formal series σ over
by
or
. The value of σ on
is denoted
and we can formally denote:
The support of a series
is the set:
If the set
is finite, then we denote that set of formal series σ by
.
Another words the set of formal series
such that
for all but a finite number of
is denoted
and then σ is called a polynomial.
We define the formal series
,
, and
by:
For example, let K is Boolean semiring with addition and multiplication table:
alphabet
,
and series
is defined
,
,
,
:
Then,
It is easy to see that in this case,
is a free semiring of noncommutative polynomials and for finite languages
,
.
3. Decomposition of Codes and Finite Languages
At the first, we consider decomposition problems for the class of finite prefix codes. Then, subset X is a prefix code if no element of X is a proper prefix of another element in X. This is equivalent to the fact that there are no comparable words
of the relation
in the set X. That is for all words,
, if
, then
. For example, the set
is prefix code. A convenient representation for the prefix code is a tree view.
Figure 3 shows the presentation of prefix code
.
The bold lines present the words of code, the dotted lines present the words
. A given code X can be associated a subtree of the literal representation of this code X. The infinite tree may present the free monoid
. In this case the root of tree of relation
over
is drawing in Figure 3 as node 1.
Regular prefix codes are languages recognized by finite automata and such that no word is a prefix of another. The representation of the code is a deterministic finite automaton. Let two automaton
present two prefix codes
appropriately, then we can draw finite set of words
same kind like Figure 3. In this case, tree presents code X consists of the words of prefix code X1 and then words of prefix code X2. It is easy to see that X is prefix code
Figure 3. Presentation of prefix code X.
too. The word of code X is the leaf of this tree. There is the only (unambiguous) path from root of this tree to the leafs for each word of code X. Then, deterministic automaton A, which presents code X, has a graph like in Figure 4. The words
are words of codes
and
, where n, m are the number of words in codes X1 and X2.
We denote the vertex
of graph
of minimal deterministic automaton A and call divisor-border vertex
of automaton A. The DB-vertex of complete minimal deterministic automaton A [11] of prefix code X is the vertex that we have to divided graph G by this vertex on several automaton
. The automaton
is automaton appropriately factors
of code
.
Theorem
A prefix code
is divided on k prefix code
if and only if a prefix code X has graph G of its minimal deterministic automaton A with
DB-vertex.
Proof
If prefix code
is divided on k prefix code
, then the words of this code X present the unambiguous paths from root of the tree for prefix code X to the leaves of the tree. So, we can factorize this tree after the end of words each codes
as the result of this operation is like Figure 4. Then, graph G of its minimal deterministic automaton A has
DB-vertex.
If the graph G has
DB-vertex, we can prove by mathematical induction. The theorem is obvious for =2. Suppose that it is true for
, then prove it for
. We can divide the graph G on last DB-vertex Vn on two graph Gn and G1. For graph Gn, we have presentation
and just concatenate the words of automaton for graph G1. The results
.
Example
For prime prefix code
,
.
We construct the automaton from the prefix code X.
Display(A);
| 1 2 3 4 5 6 7
---------------------------------------
a | 5 4 1 6 5 1 4
b | 5 7 1 3 5 5 4
Initial state: [ 2 ]
Accepting state: [ 1 ]
Figure 5 shows the automaton A with DB-vertex (state) V = 4.
The state V = 5 is a “dead” state of automaton A.
Figure 6 shows Graph
and
for automata A.
We decompose the prefix code
to the product of two prefix codes
.
Now, we can formulate Theorem of decomposing for finite language L.
Theorem
A finite language
is divided on k finite languages
if and only if a finite language L has graph G of its minimal deterministic automaton A with
DB-vertex and the number of final state automaton A is equal one.
Proof
It is the same as the Theorem for prefix codex.
Algorithm for factorization prefixes code and finite language.
Input
An finite prefix set
.
Step 1
Built the minimal deterministic automaton A of prefix set X.
Step 2
Define the number n of DB-vertex in the graph of automaton A.
It can be do with complexity of
.
Step 3
Cut the graph of automaton A and find
factors of prefix set
.
There is algebraic approach to decomposing the finite language. The main idea is the embedding language
into free algebra of polynomials
or set of formal series σ over semiring K. We can embed language
over alphabet
with homomorphism
by definition on letters of alphabet Σ:
The homomorphism
maps the word
into monomials ˚
.
Figure 5. Automaton A with DB-vertex V = 4.
Figure 6. Graph
and
for automata A.
The set of monomials generate the ideal
. Then, we can try to find the divisors of this ideal
. We use an exhaustive search for the divisors just added monomials
with less power to the ideal
and apply Buchberger algorithm for this set to find the noncommutative Groebner basis. If the Groebner basis will be equals the added monomials
, then we find this divisors for finite language L. This algorithm has exponential time, but it will be ended because finite alphabet S and finite language L.
4. Apply System GAP for Decomposition Prefix Codes and Finite Language
The computer discrete algebra system GAP is the system for Groups, Algorithms and Programming. The name GAP reflects the original aim of the system, but now GAP has many packages for wide field investigation of modern algebra. The discrete algebra system GAP has become somewhat broader, and contained the information about algorithms and programming for other algebraic structures, such as semigroups, algebras, discrete automata and many other things.
Below describes the data and structures used in packages “automata” for finite deterministic and nondeterministic automata [12] [13] and some functions to determine property about them.
We can easy create free monoid over alphabet
.
gap> m1:=FreeMonoid(["a","b"]);
gap> gm1:=GeneratorsOfMonoid(m1);
[ a, b ]
gap> a:=gm1[1];
a
gap> b:=gm1[2];
b
alphabet
.
It is easy to form the list of elements free monoid (associative words) and product of two lists:
gap> s1:=[a,b*a,b*b];
[ a, b*a, b^2 ]
gap> s2:=[a*a,b*a,b*b];
[ a^2, b*a, b^2 ]
gap> s3:=Mult_Sp(sp1,s2);
[ a^3, a*b*a, a*b^2, b*a^3, (b*a)^2, b*a*b^2, b^2*a^2, b^3*a, b^4 ]
Now, we transform the list of associative words to list of words:
gap> aw1:=AssocWord_Sp(s3, gm1);
[ "aaa", "aba", "abb", "baaa", "baba", "babb", "bbaa", "bbba", "bbbb" ]
Build the minimal deterministic automaton new3 from list of words:
new3:=ListOfWordsToAutomaton("ab",aw1);
gap> Display(new3);
| 1 2 3 4 5 6 7
------------------------------------------
a | 5 4 1 6 5 1 4
b | 5 7 1 3 5 5 4
Initial state: [ 2 ]
Accepting state: [ 1 ]
Then, we can draw the graph of automaton new3:
newS3:=DotStringForDrawingAutomaton(new3);
At the last, we discuss algebraic algorithm decomposing the finite set of words free monoid Σ*.
We can embedding the finite set of words set
in free semiring of polynomials as described earlier
and find the noncommutative Groebner bases in this ideal generate by polynomials from words
[14]. In our example, K is field of rational numbers.
gap> A1:=FreeAssociativeAlgebraWithOne(Rationals, "a", "b");;
gap> gA1:=GeneratorsOfAlgebraWithOne(A1);
[ (1)*a, (1)*b ]
gap> e:=One(A1);
(1)*
gap> a:=gA1[1];
(1)*a
gap> b:=gA1[2];
(1)*b
We defined the generators of free associative algebra with one.
Now, we use the list of associative words as the prefix code. The prefix code is the same as above example with automaton new3.
gap> s2:=[a*a,b*a,b*b];
[ a^2, b*a, b^2 ]
gap> s3:=[ a^3, a*b*a, a*b^2, b*a^3, (b*a)^2, b*a*b^2, b^2*a^2, b^3*a, b^4 ];;
gap> pL2:=[ (1)*a^3, (1)*a*b*a, (1)*a*b^2, (1)*b*a^3, (1)*(b*a)^2, (1)*b*a*b^2, (1)*b^2*a^2, (1)*b^3*a, (1)*b^4,(1)*a^2, (1)*b*a, (1)*b^2 ];;
gL2:=GP2NPList(pL2);
[ [ [ [ 1, 1, 1 ] ], [ 1 ] ], [ [ [ 1, 2, 1 ] ], [ 1 ] ], [ [ [ 1, 2, 2 ] ], [ 1 ] ], [ [ [ 2, 1, 1, 1 ] ], [ 1 ] ],
[ [ [ 2, 1, 2, 1 ] ], [ 1 ] ], [ [ [ 2, 1, 2, 2 ] ], [ 1 ] ], [ [ [ 2, 2, 1, 1 ] ], [ 1 ] ],
[ [ [ 2, 2, 2, 1 ] ], [ 1 ] ], [ [ [ 2, 2, 2, 2 ] ], [ 1 ] ], [ [ [ 1, 1 ] ], [ 1 ] ], [ [ [ 2, 1 ] ], [ 1 ] ],
[ [ [ 2, 2 ] ], [ 1 ] ] ][ [ [ 1, 2, 2 ], [ 1, 2, 1 ], [ 1, 1, 1 ], [ 1 ] ], [ 1, 1, 1, 1 ] ]
Function GP2NPList() transforms list of polynomials into internal presentation polynomials in the package gbnp . Now, we use the list gL2 for
gap> GB := Grobner(gL2);
#I number of entered polynomials is 12
#I number of polynomials after reduction is 3
#I End of phase I
#I End of phase II
#I List of todo lengths is [ 0 ]
#I End of phase III
#I The computation took 2 msecs.
[ [ [ [ 1, 1 ] ], [ 1 ] ], [ [ [ 2, 1 ] ], [ 1 ] ], [ [ [ 2, 2 ] ], [ 1 ] ] ]
gap> PrintNPList(GB);
a^2
ba
b^2
The result polynomials
consists a basis of divisor for polynomials list
. Thus, we obtain the decomposition of the prefix code into products of codes using the Buchberger algorithm. Similarly, factorization can be obtained for a finite language.
5. Conclusions
The linear time algorithm is designed, which finds the prime decomposition for prefix codes. It can be applied to some modifications for finite languages.
The algorithm for factorization of finite languages has exponential complexity. From the point of view of abstract algebra, we can apply Buchberger algorithm for an exhaustive search of the factors of the finite language.
The study was conducted using system for computational Discrete Algebra GAP. The results of decomposition of languages are obtained when considering modified graphs of this automaton.
There is a big gap between complexity factorization of prefix codes (lineal complexity) and finite languages (exponential complexity). Further research will related to the search for a polynomial algorithm to determine the class of languages for which such an algorithm is applicable.
Founding
This work is supported by a grant from the research program of Chinese universities “Higher Education Stability Support Program” (Section “Shenzhen 2022―Science, Technology and Innovation Commission of Shenzhen Municipality”).