This paper provides an implementation of a novel signal processing co-processor using a Geometric Algebra technique tailored for fast and complex geometric calculations in multiple dimensions. This is the first hardware implementation of Geometric Algebra to specifically address the issue of scalability to multiple (1 - 8) dimensions. This paper presents a detailed description of the implementation, with a particular focus on the techniques of optimization used to improve performance. Results are presented which demonstrate at least 3x performance improvements compared to previously published work.
Geometric Algebra (GA) is a relatively new area of mathematics which finds applications in many different fields of research, particularly in computer graphics and robotics. Traditional matrix-based methods of defining geometrical objects using vectors to characterize constructions are described in [
In this paper, we present an alternative design of a Geometric Algebra co-processor and its implementation on a Field Programmable Gate Array (FPGA). To the best of the authors' knowledge, only four GA designs in hardware exist in the literature [
This paper is organized as follows. Section 2 discusses the fundamental concepts of GA. Section 3 introduces the top-level GA co-processor architecture, and Section 4 extends the discussion to the GA co-processor design. The key building blocks of the GA co-processor (the blade logic, the register file and the memory write sequencer) are discussed in this section. The simulation and synthesis results, together with a comparison to the state of the art, are discussed in Section 5. Finally, Section 6 concludes our findings.
Geometric Algebra (GA) is a coordinate-free approach to geometry based on the algebras of Grassmann [
The concept of a blade is fundamental to all objects defined within a GA framework. Let the orthonormal basis vectors be
The blade signifies the subspace i.e. scalar, vector or bivector. Hence the linear subspace consists of a scalar (1) a 0 blade element, vectors
In GA, the outer product is also known as the wedge product denoted by
The outer product works in all dimensions. If a 2-dimensional subspace
In GA it is possible to add elements of different grades to form a multivector, which combines them in a single object. This is possible because of the fundamental building block of GA, the geometric product given by (1), which consists of the dot product and the outer product of the vectors a and b. The geometric product carries information about both the magnitude and the orientation of the vectors.
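As an illustration of the decomposition into a dot product and an outer product, the following minimal sketch computes the geometric product of two 3D vectors as a scalar part plus a bivector part. It assumes a Euclidean signature (each basis vector squares to +1); the function name and the dictionary keys are illustrative and are not taken from the co-processor design.

```python
def geometric_product_3d(a, b):
    """Geometric product of two 3D vectors, assuming a Euclidean metric.

    Returns (scalar, bivector): the scalar is the dot product and the
    bivector holds the outer (wedge) product components on e12, e13, e23.
    """
    scalar = sum(x * y for x, y in zip(a, b))
    bivector = {
        "e12": a[0] * b[1] - a[1] * b[0],
        "e13": a[0] * b[2] - a[2] * b[0],
        "e23": a[1] * b[2] - a[2] * b[1],
    }
    return scalar, bivector


# Orthogonal vectors: no scalar part, pure bivector.
print(geometric_product_3d((1, 0, 0), (0, 1, 0)))
# Parallel vectors: pure scalar, the bivector part vanishes.
print(geometric_product_3d((1, 2, 3), (1, 2, 3)))
```

Swapping the two operands leaves the scalar part unchanged and negates the bivector part, which is how the product encodes orientation.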
Basis blade | Total | |
---|---|---|
1 | 1 | |
2 | ||
4 | ||
8 |
For the orthonormal basis vectors defined by
For
Generalizing, the geometric product of basis vectors
The element
Previous work that has implemented GA in hardware has taken a variety of approaches. The design in [
1 | ||||
---|---|---|---|---|
1 | 1 | |||
1 | ||||
1 | ||||
ALU to write evaluation results to the appropriate addresses of the on-board RAM on completion of an operation.
Due to resource limitations on the hardware, only a single basis blade pipeline was implemented for geometric products. The FPGA implementation ran at 20 MHz and, in real terms, was found to be much slower than software packages like Gaigen [
In [
In the next section of this paper, we present the proposed architecture, which is scalable to n (up to a maximum of 8) dimensions.
Before discussing the architecture, some basic GA definitions need to be made. If a and b are considered to be multivectors in the
The geometric product is given by:
If all the coefficients of a and b are non-zero, the simplest implementation of GA-based computations uses, for each multiplication between basis blades, a separate multiplier hardwired to the appropriate multivector elements. Following the multiplication, an adder collects the outputs of the multiplications that result in the same blade. In
The top level module for the GA co-processor contains instances of the GA Core, Memory, Memory Write Sequencer and the Register File. In addition, it contains the logic for the state machine that makes the GA co-processor function correctly and the logic for converting between 40-bit and 16-bit data.
In the following subsections we discuss the register file, the memory and the memory write sequencer; a detailed discussion of the GA co-processor is provided in the following section.
The register file is dual-ported with separate read and write selects. In bypass mode, data can be written directly from the sequencer or the multiplier to the register file. Each register in the register file comprises 41 bits. The structure of a single register is shown in
The number of registers
register file are cleared. The 41st bit of every register is the register valid bit (V), and it is set to “1” or “0” to indicate that the data it holds is valid or invalid respectively.
When the register file receives an active Write signal, WSelect is compared with bits 39 to 32 of every register, which represent the blade information (B). If this matches for a register, then its corresponding equal signal becomes high. This is then used as the load signal to write to the register. If none of the registers match, then the data is written to the register pointed to by the counter. The counter is then incremented by one. Whenever data is written to a register, its valid bit is set to 1, indicating that the register holds valid data.
For reading from the register file, RSelect is again compared with bits 39 to 32 of every register. If this matches for a register, then its corresponding “en” signal becomes high. This is then used to enable that register’s tristate using the signal enable, and the contents of that register are put on the Data_out bus. If none of the registers match, then zeroes are driven on the Data_out bus, except in the memory write phase, when the bus is in a high impedance state.
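The write and read behaviour described above can be summarised in a small behavioural model. This is a sketch under stated assumptions, not the VHDL of the design: the coefficient is assumed to occupy bits 31 to 0, the blade comparison is assumed to be qualified by the valid bit so that cleared registers never match, and the high impedance state of Data_out is not modelled. The class and method names are illustrative.

```python
class RegisterFile:
    """Behavioural model of a blade-addressed register file (41-bit registers)."""

    def __init__(self, num_registers):
        self.regs = [0] * num_registers  # bit 40 = V, bits 39..32 = B, bits 31..0 = data (assumed)
        self.counter = 0                 # next free register

    def clear(self):
        self.regs = [0] * len(self.regs)
        self.counter = 0

    @staticmethod
    def _matches(reg, select):
        valid = (reg >> 40) & 1
        blade = (reg >> 32) & 0xFF
        return bool(valid) and blade == (select & 0xFF)

    def write(self, wselect, data):
        """Write data for blade wselect; return the index of the register used."""
        word = (1 << 40) | ((wselect & 0xFF) << 32) | (data & 0xFFFFFFFF)
        for i, reg in enumerate(self.regs):
            if self._matches(reg, wselect):   # blade already present: overwrite
                self.regs[i] = word
                return i
        i = self.counter                      # no match: use the counter
        self.regs[i] = word
        self.counter += 1
        return i

    def read(self, rselect):
        """Return the 32-bit data for blade rselect, or zero when no register matches."""
        for reg in self.regs:
            if self._matches(reg, rselect):
                return reg & 0xFFFFFFFF
        return 0
```

For example, writing blade `0b101` twice reuses the same register, while a read of a blade that was never written returns zero.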
The memory can be located inside or outside of the GA co-processor. In the proposed design the memory is located within the GA co-processor and each memory location is 40 bits wide, which allows maximum data bandwidth to be obtained during processing.
The proposed memory model can be scaled using the generic parameters AddressWidth and DataWidth, which are set according to the dimension of the vector space. For example, in an 8D vector space at least 3 × 256 = 768 memory locations are required to store the two input multivectors and the resultant multivector, and hence the address width is 10 bits.
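The sizing rule in the 8D example generalises directly. The short sketch below computes the number of memory locations and the minimum address width for a given dimension, assuming three multivectors of 2^n elements each; the function name is illustrative.

```python
def memory_requirements(dimension):
    """Return (locations, address_width) for two inputs plus one result multivector."""
    elements = 2 ** dimension              # coefficients per multivector
    locations = 3 * elements               # two operands and one result
    address_width = max(1, (locations - 1).bit_length())
    return locations, address_width


for n in (3, 8):
    print(n, memory_requirements(n))
```

For `dimension = 8` this reproduces the 768 locations and 10-bit address width quoted in the text.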
A Memory Write Sequencer is required to read data from the register file and write it to memory. After the processing phase the resultant multivector will be available in the register file. This is transferred to memory in the Memory Write phase.
In GA, often certain applications [
For example, in the hardwired version as shown in
Non Hardwired | Hardwired | ||
---|---|---|---|
Basis Vectors | Valid Bit | Basis Vectors | Valid Bit |
1 | 1 | 1 | 1 |
e2 | 1 | - | 0 |
e13 | 1 | e2 | 1 |
e123 | 1 | - | 0 |
- | 0 | - | 0 |
- | 0 | - | 0 |
- | 0 | e13 | 1 |
- | 0 | e123 | 1 |
becomes more apparent for higher order vector processing. This will be discussed in detail in the experimental results section of this paper.
While the top level architecture of the basic n-dimensional GA co-processor has now been defined, in order to maximize the performance benefits it is essential to optimize each key element in the design. This section now describes each of the modules within the Geometric Algebra Co-processor in detail. Pipelining the multiplier and adder is crucial to achieve parallelization in the hardware.
The proposed GA co-processor core (shown in
The signals MC1, MC2, MB1, MB2 and Valid are registered as CoeffA, CoeffB (C1 and C2), BladeA, BladeB and Valid respectively as shown in
In the proposed design, the fixed point multiplier and adder are pipelined to improve the instruction throughput. The multiplier pipeline has 5 stages and the adder pipeline has 6 stages; these depths were chosen arbitrarily. However, the use of pipelined processing elements adds to the processing time and introduces the problem of data hazards.
Data hazards occur when the pipeline changes the order of read/write accesses to operands. For example, in the proposed design a Read after Write (RAW) hazard can occur, i.e. the register file is modified and read soon after: the first instruction may not have finished writing to the register file when the second instruction reads it, so the second instruction may use incorrect data.
Consider that two instructions
The design uses a standard fixed point signed 5-stage multiplier that takes two 32-bit input operands A and B and returns a 32-bit product. The Most Significant Bit (MSB) in signed arithmetic indicates the sign.
In normal operation, CoeffA and CoeffB become the inputs A and B of the multiplier respectively. Registered values of Blade and Valid enter the multiplier blade pipelines, which are 5 stages deep. Together they form the multiplier valid pipeline. Data in every stage of the multiplier has its blade and valid bit in the corresponding stage of the blade and valid pipeline. As shown in
As shown in
Similar to the multiplier, the 6-stage adder blade pipeline and a 6-stage adder valid pipeline exist for the adder. The 8-bit signals in the adder blade pipeline are called BA(1), BA(2), BA(3), BA(4), BA(5) and WSelect_t. The signals in the adder valid pipeline are called VA(1), VA(2), VA(3), VA(4), VA(5) and Write_t. As can be seen in
The sequencer reads data from the memory and feeds it to the multiplier, adder or directly to the register file depending on the mode of operation. As shown in
That means the address which was sent to the memory just before the stall occurred has to be sent again for a fetch. Depending on when the stall occurs, different scenarios can occur. For example, in one scenario as shown in
The blade logic which explains the blade relationships within the algebra is central to the architecture and is discussed in the following.
The summary of blade index relationships has already been explained in
Furthermore, the sign due to the blade index arises due to the invertible nature of the geometric operation. The sign is calculated from two evaluations, one arising from the swapping of the blade elements and the other due to the signature of the blades. For example, the sign due to swapping the blade index multiplication of
The XOR gates account for the swapping of the blade elements and the AND gates for the blades. The signature for which the GA is defined also contributes to the sign element. For example in certain cases [
If the algebra is defined such that
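To make the blade index and sign computation concrete, the following sketch uses a common software formulation in which a basis blade is stored as a bitmap (bit i set when basis vector e(i+1) is present). The resulting blade is the XOR of the two bitmaps, the sign due to swapping is obtained by counting the required transpositions with AND operations, and the signature contributes a factor for every basis vector shared by both blades. This is an assumed, widely used scheme offered for illustration; it is not claimed to be the exact gate-level circuit of the design, and the function names and the `metric` parameter are hypothetical.

```python
def reorder_sign(a, b):
    """Sign (+1 or -1) from reordering the basis vectors of a*b into canonical order."""
    a >>= 1
    swaps = 0
    while a:
        swaps += bin(a & b).count("1")   # vectors of b that each vector of a must pass
        a >>= 1
    return -1 if swaps & 1 else 1


def blade_product(a, b, metric=None):
    """Geometric product of two basis blades given as bitmaps.

    metric[i] is the square of basis vector e(i+1); None means all +1.
    Returns (sign, blade_bitmap).
    """
    sign = reorder_sign(a, b)
    common, i = a & b, 0
    while common:
        if common & 1 and metric is not None:
            sign *= metric[i]            # shared vector contracts to its square
        common >>= 1
        i += 1
    return sign, a ^ b


print(blade_product(0b001, 0b010))   # e1 * e2
print(blade_product(0b010, 0b001))   # e2 * e1
print(blade_product(0b011, 0b011))   # e12 * e12
```

With the default Euclidean signature, e1 e2 gives +e12, e2 e1 gives -e12 and e12 e12 gives -1; passing, for example, `metric=[1, 1, -1]` makes e3 square to -1.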
The stalls in the architecture originate from the GA core and are required to avoid Read after Write (RAW) data hazards in the adder. The stall signal is the same as the match signal (as shown in
At reset, all the registers in the adder blade pipeline and adder valid pipeline are set to zero. This means that all data contained in the adder pipeline are invalid. The first valid data in the adder pipeline will be the first one with VM(5) set to high. If none of the blades in the adder pipeline match BM(5), then the signal match remains low. If the first valid blade is “00000000”, then it will match all the blade adder signals i.e. BA(1) to WSelect_t.
Therefore, the match logic for each stage of the adder needs to be further qualified with its corresponding valid bits, i.e. signals VA(1) to Write_t. That means the stall signal will be made high only if there is valid data in BM(5) and valid data in the adder with the same blade. The stall freezes the multiplier pipeline, and the values of Product, BM(5) and VM(5) will be held. When the stall is high, the GA core forces “00000000” into the adder blade pipeline and “0” into the adder valid pipeline. Therefore, the data which enters the adder when the stall is high is all invalid. When the data which would have caused the hazard has been written to the register file, the stall signal goes low and allows the multiplier to push new data into the adder.
blades b1 to b6. The adder valid pipeline indicates that the data in the adder corresponding to blade b2 is invalid. Also, BM(5) contains a blade of value b7 which is valid. Since b7 does not match any of the blades in the blade pipeline, the match signal remains low. In the next cycle, the data corresponding to blade b1 will be written to the register file. The rest of the data, blades and valid bits move forward through the pipeline. Now the value in BM(5) is b3, which is valid. As data of blade b3 is already being processed in the adder pipeline and it is in stage 5, the match_d5 signal goes high, which in turn generates a stall. The stall freezes all the pipelines before the adder, and so the data, blade and valid bit will be held at the output of the multiplier. This can be seen in the following cycle, as the values of BM(5) and VM(5) remain the same.
Although the pipelines before the adder have frozen, the adder pipeline continues forward. A value of “00000000”, which is represented by b0, is fed into the adder blade pipeline. The corresponding valid bit for this is set to zero, indicating that this data is invalid and is used as a filler to fill up the pipeline. Now the data corresponding to b3 which was in stage 5 is in stage 6, so match_d6 becomes high, generating another stall. In the next cycle another invalid data of blade b0 is fed in, so now b3 is written to the register file and match remains low. In the following cycle, the new data with blade b3 is fed into the adder pipeline. It uses the correct, updated value of b3 from the register file. So, in this example a two cycle stall is required to avoid the data hazard.
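The valid-qualified match logic and the resulting stall length can be reproduced with a short behavioural sketch. The adder pipeline is modelled as a list of (blade, valid) pairs, index 0 being stage 1 and index 5 being stage 6 (WSelect_t / Write_t); the function names and the integer blade labels are illustrative assumptions.

```python
FILLER = (0, 0)  # blade "00000000" with valid bit 0, forced in while stalled


def stall(bm5, vm5, adder):
    """High when valid data at the multiplier output matches a valid blade in the adder."""
    return bool(vm5) and any(valid and blade == bm5 for blade, valid in adder)


def stall_cycles(bm5, vm5, adder):
    """Count the cycles the multiplier output is held before it may enter the adder."""
    adder = list(adder)
    cycles = 0
    while stall(bm5, vm5, adder):
        adder = [FILLER] + adder[:-1]   # adder keeps advancing; stage 6 is written back
        cycles += 1
    return cycles


# Stage 1 .. stage 6; blade 3 sits in stage 5, blade 2 is marked invalid.
adder = [(7, 1), (6, 1), (5, 1), (4, 1), (3, 1), (2, 0)]
print(stall_cycles(3, 1, adder))
```

With blade 3 in stage 5, the model yields the two-cycle stall of the worked example, while an invalid entry or the all-zero filler never raises a stall.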
The block labeled “LOGIC” on
During the load state, data is loaded into the memory. Once the coefficients of multivectors 1 and 2 (C1 and C2) are loaded in memory (specified by A1 and A2), the Load_End signal goes high. This causes the state machine to move to the processing state. In the processing state, the GA co-processor operates (as specified by Cfg_Bits) on the data loaded in memory. A high on the Process_End signal indicates the end of processing, and the state machine moves to the writesetup state. In the writesetup state, CReset is asserted, thus clearing the counter in the Register File. In the writeback state that follows the writesetup state, the Memory Write Sequencer writes the resultant multivector to memory.
There are two ways by which the state machine can leave the writeback state. The first is to detect a “0” in the 41st bit of a register, which indicates that there is no more valid data in the Register File. The second is to check whether the counter in the Register File exceeds the value specified by C3, a register that stores the number of elements of the vector space. The first condition reduces the effective cycles by writing only the non-zero coefficients to memory, while the second condition saves cycles by fixing an upper limit on the number of registers to be written to memory, based on the dimension of the vector space in which the operation is performed. After the writeback state, the state machine moves to the memorydump state. In the memorydump state, the addresses on Load_Address are loaded and the Load_En signal is asserted.
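The state sequence can be summarised as a transition function. The sketch below follows the states named in the text (load, processing, writesetup, writeback, memorydump) together with idle and clearall; the single-cycle clearall step, the return to idle and the `Dump_End` signal are assumptions introduced only to close the loop, and `Start` is treated as a logical request regardless of its electrical polarity.

```python
def next_state(state, sig):
    """Return the next controller state; sig maps signal names to integers."""
    if state == "idle":
        return "clearall" if sig.get("Start", 0) else "idle"
    if state == "clearall":
        return "load"                      # assumed: clearing takes one cycle
    if state == "load":
        return "processing" if sig.get("Load_End", 0) else "load"
    if state == "processing":
        return "writesetup" if sig.get("Process_End", 0) else "processing"
    if state == "writesetup":
        return "writeback"                 # CReset is asserted in this state
    if state == "writeback":
        no_valid_data = sig.get("V", 1) == 0                   # 41st bit is "0"
        limit_reached = sig.get("counter", 0) > sig.get("C3", 0)
        return "memorydump" if (no_valid_data or limit_reached) else "writeback"
    if state == "memorydump":
        return "idle" if sig.get("Dump_End", 0) else "memorydump"  # Dump_End is hypothetical
    raise ValueError(f"unknown state: {state}")


state = "idle"
for sig in ({"Start": 1}, {}, {"Load_End": 1}, {"Process_End": 1}, {}, {"V": 0}):
    state = next_state(state, sig)
print(state)
```

Either exit condition of the writeback state is sufficient on its own, which is why the two checks are combined with a logical OR.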
The data read from the memory is driven on the Dump_Data bus after converting each 40-bit word into 16-bit words. For converting the data from 16-bit to 40-bit format, two 16-bit registers are required.
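A 40-bit word spans three 16-bit transfers, which is consistent with two holding registers being needed while the third word arrives. The sketch below shows one possible packing; the most-significant-first word order and the use of only the low 8 bits of the first word are assumptions, since the text does not specify the layout.

```python
def pack40(words):
    """Combine three 16-bit words (most significant first, assumed) into a 40-bit value."""
    w0, w1, w2 = words
    return ((w0 & 0xFF) << 32) | ((w1 & 0xFFFF) << 16) | (w2 & 0xFFFF)


def unpack40(value):
    """Split a 40-bit value into three 16-bit words for a 16-bit bus."""
    return [(value >> 32) & 0xFF, (value >> 16) & 0xFFFF, value & 0xFFFF]


word = pack40([0x05, 0x1234, 0xABCD])
print(hex(word), [hex(w) for w in unpack40(word)])
```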
The signal Process_End indicates the end of the processing phase. This signal informs the state machine in the GA module that the processing has been completed and the GA co-processor can move on to the next state. If the end_of_read signal is high and all the data in the pipeline is invalid, then we can be sure that the last valid data fed into the pipeline from the sequencer has been written to register file, and the processing is complete.
There are 12 levels of registers in the pipeline between the output of the sequencer and the register file. So if the signals Valid, VM(1) to VM(5), VA(1) to VA(5) and Write_t are all zero and end_of_read is high, Process_End can go high. If the multiplier or the adder in the design is to be replaced with a version that has a different number of stages, then the multiplier blade pipeline, the multiplier valid pipeline, the adder blade pipeline and the adder valid pipeline should also be changed to match the number of stages in their respective units. The Process_End logic should also be changed to reflect the change in the number of pipeline stages.
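The Process_End condition can be written as a single predicate over the twelve valid bits and end_of_read. In this minimal sketch the valid pipelines are passed as lists, so a multiplier or adder with a different number of stages only changes the list lengths; the function and argument names are illustrative.

```python
def process_end(end_of_read, valid, vm, va, write_t):
    """High once the sequencer has finished reading and no valid data remains in flight.

    vm holds VM(1)..VM(5) and va holds VA(1)..VA(5) in the described design.
    """
    pipeline_empty = not (valid or any(vm) or any(va) or write_t)
    return bool(end_of_read) and pipeline_empty


print(process_end(1, 0, [0] * 5, [0] * 5, 0))
print(process_end(1, 0, [0, 0, 1, 0, 0], [0] * 5, 0))
```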
This section describes the experimental verification results. To demonstrate the performance of the proposed design, we implemented the design in VHDL and targeted it to an FPGA platform. The first experiment was executed to compare the processing cycles and overhead required for the geometric operations in different dimensions, and the results are shown in
The extra cycles account for the stalls and for the cycles introduced by the sequencer. The number of cycles for the ideal case varies as $2^{2n}$, where n is the dimension of the vector space. To load m data elements into memory (m counts elements, not dimensions), 3m + 1 cycles are required, where the +1 is the one-cycle overhead of the load operation. To write m data elements of the resultant multivector to memory, m + 2 cycles are required; of the two overhead cycles, one is used to reset the counter in the Register File. As can be seen from
For example, let us consider the number of cycles in each case for 3D vector space (row 4 in
The second experiment was performed to target the proposed design on an FPGA platform. To verify these results from simulations, a 3D GA co-processor was implemented on a Xilinx Virtex-II FPGA. The logic generates the Clock, n Reset and Start signals. The Start signal is an active-low signal, which means a “0” on Start makes the state machine move from idle to clearall. This is necessary because Start is mapped to the user push button switch on the FPGA, which generates an active-low signal. The state machine in the GA module is modified as follows. The memory dump state is removed and a new state called complete is added. The signal Finish goes high when the state machine is in the complete state.
The synthesis results show that dimensions 1 to 3 use dual-ported select RAM memories, whereas higher dimensions use block select RAM memories. Furthermore, the design in the 8th dimension cannot be supported on this FPGA. For dimensions 1 to 6, there exists a critical path from the stall logic in the GA core to the sequencer and then to memory. It is also observed that the design frequency decreases when it is scaled from 1D to 6D. This is because the address widths and counter widths are scaled and, as a result, more logic is required. To compare the performance of our proposed core to the state of the art, the frequency of operation is chosen at the lower bound, i.e. 65 MHz.
Dimension | Ideal | Load | Processing | Memory Write | Effective | Extra | Result Dump | Actual Effective |
---|---|---|---|---|---|---|---|---|
1 | 4 | 13 | 26 | 4 | 43 | 39 | 6 | 49 |
2 | 16 | 25 | 46 | 6 | 77 | 61 | 12 | 89 |
3 | 64 | 49 | 94 | 10 | 153 | 89 | 24 | 177 |
4 | 256 | 97 | 292 | 18 | 407 | 151 | 48 | 455 |
5 | 1024 | 193 | 1076 | 34 | 1303 | 279 | 96 | 1399 |
6 | 4096 | 385 | 4180 | 66 | 4631 | 535 | 192 | 4823 |
7 | 16,384 | 769 | 16,532 | 130 | 17,431 | 1047 | 384 | 17,815 |
8 | 65,536 | 1537 | 65,812 | 258 | 67,607 | 1047 | 384 | 17,815 |
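The columns of the cycle-count table are related by simple arithmetic, which the following sketch makes explicit. The processing cycle count is taken as a measured input; the load and memory write formulas follow the 3m + 1 and m + 2 rules stated in the text, with m equal to the two input multivectors and the result multivector respectively, while the result dump of three cycles per element is inferred from the table rather than stated. The function name and dictionary keys are illustrative.

```python
def cycle_model(dimension, processing_cycles):
    """Derive the remaining cycle-count columns from the dimension and measured processing cycles."""
    elements = 2 ** dimension                   # coefficients per multivector
    ideal = elements ** 2                       # 2^(2n) blade multiplications
    load = 3 * (2 * elements) + 1               # two input multivectors, 3 cycles each, +1 overhead
    memory_write = elements + 2                 # result multivector, +2 overhead
    effective = load + processing_cycles + memory_write
    result_dump = 3 * elements                  # inferred: three 16-bit transfers per element
    return {
        "ideal": ideal,
        "load": load,
        "memory_write": memory_write,
        "effective": effective,
        "extra": effective - ideal,
        "result_dump": result_dump,
        "actual_effective": effective + result_dump,
    }


print(cycle_model(3, 94))
```

For the 3D case with 94 processing cycles this gives 153 effective and 177 actual effective cycles, in agreement with the table.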
When calculating GA operations, the GOPS figure (Equation (7)) is particularly important because the designer can then determine whether the timing constraint imposed by the clock cycles, and the GOPS provided by that particular implementation, are relevant.
where no of cycles is the number of processing clock cycles (column 4,
Reference | Frequency (MHz) | HW Resources | No of Processing Cycles | GOPS (In Thousands) | GOPS (In Thousands, Normalized‡) |
---|---|---|---|---|---|
Perwass et al. [ | 20.0 | 1 M, 1 A (2) | 176¹, 544², 2044³ | 112.52¹, 36.75², 9.79³ | 56.26, 18.37, 4.89 |
Mishra & Wilson [ | 68.0 | 2 M, 3 A (5) | 84¹, 224², 704³ | 809.6¹, 303.7², 96.62³ | 161.92, 60.74, 19.32 |
Gentile et al. [ | 20.0 | 24 M, 16 A (40) | 56² | 357.15 | 8.9 |
Lange et al. [ | 170.0 | 74† | 366³ | 464.5 | 6.28 |
Franchini et al. [ | 50 | 24 M | 56¹ | 892.8 | 37.2 |
Franchini et al. [ | 100 | 64 M | 405² | 246.9 | 3.8 |
| | | | 1180³ | 84.7 | 1.3 |
This Work | 65.0 | 1 M, 1 A (2) | 177¹ | 367.2 | 183.6 |
| | | | 455² | 142.8 | 71.4 |
| | | | 1399³ | 46.4 | 23.2 |
¹GA in 3D; ²GA in 4D; ³GA in 5D; M = Multiplier, A = Adder; †Number of DSP48Es in Xilinx (these are Multiply and Add units); ‡Normalized to hardware resources used (GOPS/Hardware).
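The "This Work" rows of the comparison table are consistent with GOPS being the clock frequency divided by the number of cycles, and with the normalized figure being GOPS divided by the number of hardware units. The sketch below assumes that relation, which is inferred from the tabulated values rather than quoted from Equation (7); the function names are illustrative.

```python
def gops_thousands(frequency_mhz, cycles):
    """Geometric operations per second, in thousands (assumed: frequency / cycles)."""
    return frequency_mhz * 1e6 / cycles / 1e3


def normalized_gops(gops, hardware_units):
    """GOPS divided by the number of multiplier and adder units used."""
    return gops / hardware_units


for cycles in (177, 455, 1399):          # 3D, 4D and 5D at 65 MHz
    g = gops_thousands(65.0, cycles)
    print(cycles, round(g, 2), round(normalized_gops(g, 2), 2))
```

The computed values agree with the tabulated 367.2, 142.8 and 46.4 to within 0.1; the table appears to truncate rather than round the last digit.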
for 4D and 5D respectively. As can be seen from
In the current design the resultant multivector is first written from the register file to the memory and is later transferred from the memory to the output port. This is clearly not required, as the result could be transferred from the register file to the output directly. In that case the states writesetup and writeback can be removed from the state machine, and a new state registerdump replaces memorydump. In the registerdump state the contents of the register file are put out on the output bus. As the output bus has a smaller bandwidth (i.e. 16 bits as compared to 41 bits in the register file), 3 times the clock cycles will be required. The actual effective cycles for the design can then be obtained by subtracting the cycles for the memory write phase and accounting for the cycles required for transferring the contents of the register file directly to the output bus. For example, in 3D the actual effective cycles could be 167 (177 - 10). Instruction-level parallelism exists when instructions in a sequence are independent and thus can be executed in parallel by overlapping [
It was observed that, to reach a higher frequency of operation, the timing-critical paths should be broken. The timing-critical path in the register file arising from the match logic can be broken at the cost of more resources in the register file. For example, in the 8th dimension, if there are another two sets of comparators in the register file, then part of the match signal can be generated one stage earlier. This means that 4 × 256 = 1024 8-bit comparators are required in the register file. As the logic is split between two stages, the number of levels of logic between each stage can be reduced, resulting in an increase in frequency.
For improving the effective number of cycles for operation, more than one GA Core, each with its multiplier, adder and independent register files can be used. In such a case, a common sequencer feeds data to the different cores. The complexity in this multiple GA Core design lies in the sequencer which should sequence data such that each pipeline works efficiently without many stalls. The sequencer has to intelligently feed data into the different GA Core pipelines to eliminate dependencies. This would also mean the use of buffering and additional memory.
Furthermore, the GA co-processor is envisaged to interface with a standard processor. This can be achieved in two ways, either by mapping the GA co-processor into the physical memory and giving it a real address on the address bus (memory mapped I/O) or by mapping it into a special area of input/output memory (isolated I/O) [
The VHDL code is provided for general download and use in the repository [
An alternative design of a Geometric Algebra co-processor was successfully created. The design was successfully synthesized and functionally tested on an FPGA. We present a comparison of different FPGA and ASIC platforms in terms of performance and modularity of the architecture. On the FPGA, we developed a custom implementation of the GA geometric product, which can be easily extended to other products within GA. We have introduced a modular architecture for the GA core, together with a register file that depends linearly on the dimension and can support up to an 8-dimensional GA, with a normalized performance that is better than previously reported results.
The GA implementation consists of pipelined arithmetic blocks to compute the geometric product and support for several parallel pipelines to obtain high throughput. Performance results for a single core show that using an FPGA results in superior performance when compared to state-of-the-art platforms. Furthermore, the FPGA architecture was able to provide a flexible platform that could handle a variety of GA products without performance degradation, in particular by supporting multiple dimensions in a single implementation. Finally, the GA platform provides the opportunity to expand the GA instruction set and to accommodate multiple cores, increasing the computational capacity and making it an even more compelling GA platform for various applications.