Supporting Information Extraction from Visual Documents

Visual Information Extraction (VIE) is a technique that enables users to perform information extraction from visual documents driven by the visual appearance and the spatial relations occurring among the elements in the document. In particular, the extractions are expressed through a query language similar to the well known SQL. To further reduce the human effort in the extraction task, in this paper we present a fully formalized assistance mechanism that helps users in the interactive formulation of the queries.


Introduction
The process of knowledge acquisition from generic information domains has its central phase in the Information Extraction (IE), which aims to extract from the located documents relevant information that appear in certain semantic or syntactic relationships [1].In particular, IE tries to process the relevant information found on the documents in order to make it available to structured queries.Most often, information extraction systems are customised for specific application domains, and require manual or semi-automatic training sessions.
In order to apply IE to visual documents, we have developed a technique called Visual Information Extraction (VIE) that is an IE approach based on the visual appearance of the information, conceived as its user-perceived rendering.This allows one to shift the IE problem from the low level of code (e.g., raster graphics, vector drawing, word processor formatted text, web page, etc.) to the higher level of visual features, providing a paradigm of the kind "what you see drives your search" that supports a natural query formulation.In particular, the extractions are based on the spatial arrangement occurring among the objects of a document which is modelled by common qualitative spatial relations like topological (i.e., overlapping, adjacency, containment) and direction (i.e., left, up, etc.) relations.Then, the focus of our approach is on the spatial feature, which best characterises it and also represents the main novelty with respect to the current IE literature.
The origin of the VIE technique can be found in [2] [3].In those works, the approach is inspired by the box spatial relation theory [4], such that graphical objects are syntactically described and manipulated through their bounding boxes whatever the shape of the object is.However, the box formalism has some intrinsic limitations that make no possible to use it in contexts where graphical objects are represented by complex figures.The major shortcoming of such an early proposal is that the Visual Information Extraction technique is strongly customised to specific application domains.Indeed, in [2] [3] the VIE framework is ad-hoc adapted to perform extractions on web pages, PDF documents, respectively, and the technique presented in each of these works is ad-hoc customised for the corresponding domain.Finally, in [5] the VIE theory is refined, allowing to work directly on the real shapes, and generalised in order to be applied to a wide range of interesting visual domains, including geospatial data.
The Visual Information Extraction technique has been accomplished thanks to an extraction language, namely the Spatial Relation Query Language (SRQL), which allows users to write SQL-like queries based on the visual arrangement of the information.Moreover, both the spatial relation theory and the SRQL language are implemented in a graphical software system, the Spatial Relation Query (SRQ) tool, which also allows the use of further semantic information to refine the queries, thus providing an environment where the information extraction can be performed integrating spatial relations, visual attributes, textual content, etc.
The main purpose of the research presented in this paper is to further reduce the human effort needed to compose the extraction queries through an assistance mechanism that may be used to interactively formulate the queries.To this aim, we have developed a spatial relation-based assistance technique whose theoretical issues are described in this paper.In particular, we reused the spatial relation formalism of [5] (which has been extended with connection-based relations in order to model also interconnections among objects) and we extensively developed the information extraction task theory based upon the new notion of pictorial view, providing a more refined specification of the queries that has been integrated in the algorithm for the assisted formulation of the queries.
The paper is organised as follows.Related work is summarised in Section 2. In Section 3 we illustrate the spatial relation formalism underlying the VIE technique, whereas in Section 4 we introduce the notion of Pictorial View and give an in-depth overview of the formalisation of the query process.Section 5 contains the algorithm for the interactive formulation of the queries together with some experimentation examples.Finally, some concluding remarks are outlined in Section 6.

Related Work
Information extraction is a very active research area that has received a growing attention from different communities, such as the Artificial Intelligence, Information Retrieval and Processing and Web communities.
In particular, the enormous amount of information present in the web makes it the most appealing domain for developing new information extraction techniques.In the last decade, many approaches have been proposed in this field, and they are usually classified in the following categories (for an extended survey of these approaches we refer the reader to [6] [7]): • HTML-aware tools (e.g., LIXTO [8] and SCRAP [9]), that make use of the HTML parse trees to create extraction rules.• Wrapper induction tools (e.g., DEPTA [10]), where the extraction rules are derived from a given set of training examples or using pattern discovery techniques.• NLP-based tools (e.g., WHISK [11] or the technique in [12]), where the rules are derived by considering phrases and sentences and applying either syntactic analysis, filtering, part-of-speech tagging or lexical semantic tagging.• Modelling-based tools (e.g., DEByE [13] or the Bayesian learning framework in [14]), which rely on locating in the web page portions of data conforming to a predefined structure.• ontology-based tools (e.g., WeDaX [15] or [16]), that are based on the data and not on the structure of the source documents, and ontologies are used to locate constants in the page and construct objects with them.Furthermore, a number of languages for wrappers development have been designed (e.g., Minerva [17] or TSIMMIS [18]) as well as specific-targeted approaches tools like WiNTs [19], VENTex [20], GRAPE [21] or SRES [22].
What differentiates our approach from the cited works is that we directly exploit the visual appearance of the information, rather than relying on the structural and textual information.Furthermore, we do not require any predefined model (as it is the case for the approaches based on training sets, ontologies or wrappers).
Information Extraction literature does include works dealing with the visual appearance of information (see, e.g., [19] [20] [23] [24]), but they are based on visual web page analysis and are targeted to specific tasks (like table extraction or similarities detection).Finally, in [25] the authors propose a visual information extraction algorithm which is specifically targeted to extraction from PDF documents.Some more recent works are presented in [26]- [28].The former presents an algebra for expressing spatial and textual rules that can be defined directly at a layout level.It is an interesting approach that, however, is restricted to the web-page domain.On the other hand, the work in [27] deals with the domain of charts, and proposes an approach for improving their accessibility.Finally, the work in [28] presents an hybrid approach based on decision tree learning and extracting rules.

Spatial Relation Theory
In this section we provide the basic notions of the spatial relation theory underlying our VIE technique.

Graphical Objects
A graphical object is formally defined as pair ( ) , where C denotes the set of all the points 2 p ∈  forming the external contour of O (which is disjoint from its internal area), and A is the set of the object attributes.Each attribute is a pair ( ) , a v , where a is a property name and v its value.More in detail, A models the object semantic properties (e.g., name, colour, font or latitude), and never includes attributes depending on C (such as the object width, area, etc.), whereas C is used essentially for its syntactic manipulation.
We consider only graphical objects which do not contain holes and have a contour that can be modelled as a closed curve without self loops (simple curve).We do this since this restriction still defines a wide and significant domain, excluding only graphical objects with very irregular shapes, which would be hard to handle and, often, of little interest.
In the following, we shall write p C ∈ to indicate a generic point of C and x p , y p to indicate the x and y-coordinate, respectively, of p.Moreover, we shall use the expression O x to denote ( ) val a .Figure 1 shows a fragment of a geographic map containing five nations (a, c, e, f, g), a lake (b) and a river (d).All these map symbols can be naturally modelled as graphical objects.As an example, let us give a possible definition of the graphical object e O corresponding to the nation e on the map.In this case, the component C of e O would be the set of points forming the nation contour highlighted in the figure, whereas the graphical object attributes A may be, e.g., the set name, , latitude,150 , longitude, 20 , population, 20000 , map_colour, green e

Spatial Relations
In this section, we briefly illustrate the spatial relation formalism underlying the VIE technique.
So far [5], we have considered only position-based spatial relations, describing where a graphical object is placed relative to one other.Below we give a brief summary of such relations.

Position-Based Relations
The position-based relations can be subdivided in three classes: • disjoint relations, holding between objects with disjoint contours and areas (e.g., "left of", "above", "close to"); • overlap relations, holding between objects with intersecting areas and possibly intersecting contours (i.e., "full inclusion" or "partial intersection"); • attach relations, holding between objects with intersecting contours but not areas (i.e., adjacency in one or more points).The four distinct disjoint spatial relations are UP, DOWN, LEFT and RIGHT, and are sufficient to model any disjoint spatial arrangement between two graphical objects.Actually, since of course UP and DOWN are inverse of each other, as well as LEFT and RIGHT, we could restrict the set of disjoint spatial relations to just one of the possible pairs, e.g., LEFT and DOWN.However, for sake of completeness, in the following we will present the complete set of disjoint spatial relations.
In order to define them, we need to introduce the following terminology for a generic graphical object ( ) Note that, in our formalism, we refer to the canonical orientation of the cartesian axes (i.e., the x coordinate increases rightwards, and the y one increases upwards).Then, given two graphical objects ( ) ( ) , the four disjoint spatial relations are defined as in Table 1.Intuitively, the UP relation models a spatial arrangement between a and b whenever the graphical object b is completely (both internal points and external contour) below the graphical object a.For example, Figure 2(a) shows a spatial arrangement where the UP relation holds between an hexagon a and a circle b.Note that, in the figure, the round callouts are not part of the graphical objects, but are only used to indicate their names.
The two overlap spatial relations INCLUDE and INTERSECT are defined as shown in Table 2.
In particular, the INCLUDE relation models a spatial arrangement between a and b whenever the graphical object b is properly contained in a, and then their external contours do not intersect.
) On the other hand, the INTERSECT relation models a spatial arrangement between a and b whenever there is a partial or full overlapping of their internal points, and there is an intersection between their external contours.
For example, Finally, the attach spatial relation TOUCH is defined as shown in Table 3.
As a matter of fact, the TOUCH relation models a spatial arrangement between a and b whenever there is no overlapping of their internal points, but there is an intersectionn between their external contours.
For example, Figure 2(c) shows a spatial arrangement where the TOUCH relation holds between the objects a and b.

Connection-Based Relations
In this section, let us introduce the new connection-based spatial relation LINK(h,k), which models the inter connections among graphical objects, and is defined as shown in Table 4.Note that LINK(h,k) is an explicit relation, therefore it is visualised through a polyline.
In other words, LINK(h,k) is a relation that models a spatial arrangement between graphical objects a and b whenever they are disjoint and there is a connection between a point p h C ∈ ⊆ and a point p k C ′ ′ ∈ ⊆ .For example, Figure 3 shows a spatial arrangement where the LINK(h,k) relation holds between the objects a and b.It is worth noting that, in the definition of LINK(h,k) given in Table 4, h and k are predefined (both in number and position) subsets of C and C′ , respectively.
In particular, in general graphs (e.g., state-transition diagrams, see Figure 4(a)), where the specific connection point is not relevant for the relation semantics, these subsets coincide with the overall contours, whereas in "puzzle" like languages (e.g., tiles, see Figure 4(b)) they coincide with predefined contiguous segments of the contours.In the latter case, the specific connection point is still not relevant, as long as it belongs to one of the predefined segments.Finally, in plex ( [29]) languages (e.g., flow charts, see Figure 4(c)) h and k contain specific predefined points of the contours, which determine the relation semantics.For example, in Figure 4(c)), the flowchart diamond D has exactly three connection points, with a precise position on the contour of the object, )  corresponding to the input flow, the "true" output flow and the "false" output flow, respectively.

VIE Query Theory
In this section we provide the theoretical foundations of the VIE query task showing how to formalise queries to extract information from visual documents, i.e., generic documents composed of graphical objects with attributes arranged in a two-dimensional space.
To begin, in order to formally represent a visual document let us introduce the notion of pictorial view, which is defined as a set of triples , , , , , , O O′ are graphical objects and i R is a spatial relation.The triple ( ) O R O′ means that the spatial relation i R holds between the objects i O and i O′ in the given visual representation.In other words, we represent visual documents through the notion of pictorial view, conceived as a collection of graphical objects and a set of spatial relations arranging them in a two dimensional space.Pictorial views allow one to characterise the essential content of visual representations, including only the knowledge strictly necessary to manipulate its graphical objects.Indeed, a visual document can have many different pictorial views, each one containing only the part of the overall information which is strictly necessary for that specific view.Therefore, a pictorial view is not required to refer to all the graphical objects present in a visual document or list all the relations holding among them.
Then, we can define in detail the structure a query, modelling the task of information extraction from a pictorial view.In the following, we use P to denote a generic pictorial view, to denote the set of graphical objects and spatial relations included in P, respectively, and a set of attribute names relative to the graphical objects in  .Definition 1 (Query) A VIE query Q can be: 1) a simple query, i.e., a sequence 2) a combined query, i.e. the combination of two queries through the set operations ∪ (union), ∩ (intersection), \(difference), e.g.
. Definition 2 (Query step) A query step s can be one of the following: 4) an attribute filter step, written as a sequence ( ) , where i a ∈  , i v is a constant value and { } , , , , , i op ∈ < > = <> <= >= ; 5) a order step, written as ( ) ,$ UM DM LM RM are "internal attributes" denoting the uppermost and downmost y-coordinate and the leftmost and rightmost x-coordinate of a point in a object contour, respectively; 6) a extract step, written as

( )
First or

( )
Last ; 7) a subquery step, written as a valid query Q′ .Definition 3 (Query Context) A query context is an ordered list of graphical objects from P, written as , returns a list of graphical objects denoted by ( ) , , Apply Q P C , which is recursively defined as follows.
For simple queries: (note that the actual order of objects in the result of these steps is left unspecified).5) if ( ) Apply , , Apply , , Apply , , Apply , , Apply , , , , For combined queries, let 1 2 , Q Q be two (sub) queries, with ( ) Apply , , , , Apply , , , , Apply , , , , , , , Apply , , , , Apply , , , , To better understand how a VIE query extracts graphical objects from a pictorial view following the rules introduced in the above query results definition, let us consider the visual document depicted in Figure 5.For sake of simplicity, we avoid to introduce specific pictorial views of this document, and in the queries reported below we consider only spatial (e.g., Left or Right) and attributes (e.g., the shape) which can be clearly from that document.In other words, the following queries will be based on an overall pictorial view containing all the ten graphical objects together with all the disjoint spatial relations holding among them.• ( ) (this query extracts all the objects having a square on the left, i.e.T1, T3, T4, C2, C3, S3) • ( ) (this query extracts all the objects having a circle on the upright, i.e., T2, S2, T3, T4) this query combines the previous 1 Q and 2 Q and allows to extract all the objects having a square on the left and a circle on the upright, i.e., T3, T4) • ( ) Q as a subquery in order to extract the first object in y-order having a square on the left and a circle on the upright, i.e., T3)

Assisted Query Formulation
In the Introduction, we the origin and of the VIE research, whereas in the previous section we presented the new VIE query theory.In this section, we show how the VIE framework has been upgraded by developing an incremental algorithm to assist the user in the interactive formulation of extraction queries.The algorithm follows the query formalisation illustrated in Section 4, which as been implemented in a prototypal software application written in Java.The overall query definition algorithm is shown in Figure 6.
Starting form a pictorial view (which is shown in the application interface) and an empty query (which conventionally selects all the graphical objects of the pictorial view), in each iteration the application highlights the objects returned by the current query and allows the user to choose the next step to add to the query.Then, the objects resulting from the application of this step to the current query are previewed on the interface.Once the user confirms the step, the query is updated to the current results.This process is repeated until the user confirms the query, which is then stored in the application query library.
More in detail, Figure 7 show the specific algorithm fragments underlying the Get_Step function.As an example, as shown in Figure 7(a) after selecting the "Relation Step" mode, the user should choose a feasible graphical object from the highlighted ones.Then, the interface opens a menu displaying the feasible spatial relations holding between the selected object and the other ones, and the user can select one of these relations.This process is repeated so that the user can select more objects and relative relations, which are combined through the AND operator.Finally, the corresponding direct relation step (which is the only relation step currently supported by the application) is built and returned to the overall algorithm.application graphical interface.In particular, the user builds a query which selects the graphical objects having some other object below them (Figure 9) and a weight attribute greater than or equal to 34 (Figure 10).Then, this query is stored in the library (Figure 11

Conclusions
In this paper we have developed a general query theory which has been integrated in the VIE technique, allowing to formalise how graphical objects can be extracted from visual documents.This theory has been used to define an incremental algorithm implemented in a software application that allows to formulate queries through a user-friendly graphical interface, directly interacting with the visual document.
Our main further research aims to integrate this application within the SRQ tool, providing it with an assisted visual editor to perform VIE without the need to write complex SRQL statements.To this aim, the library of queries created by the application can be exported (see the XML fragment in Figure 12 as an example) in a format compatible with the SRQ internal query representation.
Once accomplished, this integration would allow us to experiment our assisted query mechanism also on the different, meaningful domains which are currently supported by the SRQ tool, like web pages or geospatial data.

Inside
O to indicate the set of points enclosed by the contour C of the graphical object O, that we shall call the internal points of O. Finally, we shall write a A ∈ to indicate a generic attribute in A and shall use ( ) attr a and ( ) val a to refer to its property name and value, respectively, and, for any attribute a with ( ) attr a x = , we shall use the common objectoriented notation .
represents the highest y coordinate of the points in the contour C, that is the y coordinate of each point in the set of upmost points of the contour C. y coordinate of the points in the contour C, that is the y coordinate of each point in x coordinate of the points in the contour C, that is the x coordinate of each point in x coordinate of the points in the contour C, that is the x coordinate of each point in the set of rightmost points of the contour C.

Figure 2 (
b) shows a spatial arrangement where the INTERSECT relations holds between the objects a and b, respectively.

Figure 8 .
Figure 8.The application main interface.
(a)) and reused to build a second query which adds an extraction step (Figure 11(b)) in order to select only the first of such objects, in extraction order.

Figure 12 .
Figure 12.Query library in XML format.

Table 1 .
Definition of disjoint spatial relations.

Table 3 .
Definition of the TOUCH spatial relation.

Table 4 .
Definition of the LINK(h,k) spatial relation. LINK(h,k) )