Floorplanning is the process of identifying structures that should be placed close together, and allocating space for them in such a manner as to meet the sometimes conflicting goals of available space (cost of the chip), required performance, and the desire to have everything close to everything else.
Within the Xilinx chips it is often the case that the smallest area design is also the highest performance design. This flies in the face of many design methodologies, where area and speed are considered to be things that should be traded off against each other.
This is probably because routing resources are limited, and the more of them a design uses, the slower it operates. Optimizing for minimum area allows the design to use fewer resources, and also allows the sections of the design to be closer together. This leads to shorter interconnect distances, fewer routing resources used, faster end-to-end signal paths, and even faster and more consistent place and route times. Done correctly, there are no negatives to Floorplanning.
What negatives could there be? Well, if the Floorplanning is done with no regard for the architecture of the chip, then it is possible to actually do a worse job than the Xilinx placer section of the place and route software. It is also possible that there are constraints that are not well understood until placement is complete and routing commences. So the issue then is what constitutes "done correctly".
As a general rule, data-path sections benefit most from Floorplanning, and random logic, state machines, and other non-structured logic can safely be left to the placer section of the place and route software.
Data paths are typically the areas of your design where multiple bits are processed in parallel with each bit being modified the same way with maybe some influence from adjacent bits. Example structures that make up data paths are Adders, Subtractors, Counters, Registers, and Muxes.
How to Floorplan a design
Although there are no hard and fast rules to Floorplanning, this section outlines the basic structure for a Floorplanned design, and highlights the issues you need to consider when Floorplanning a design. As described above, Floorplanning has its greatest return when applied to data path elements. The Xilinx XC4000 devices, and all of the derivative families (the A, D, E, EX, H, L, XL, Spartan, and SpartanXL families) all have the following basic structure:
To support these characteristics, consistently implement all data path elements with a bit pitch of two bits per row, and always build them as vertical structures of one or more columns.
The Xilinx FPGAs are biased to have data flow along horizontal interconnect, and to have arithmetic functions operate in vertical columns. The bias comes from the horizontal long lines with tri-stateable buffers, and the vertical pre-built and routed carry logic.
The carry logic is also used to build fast counters, so although you may not initially think of a counter as an arithmetic function, it falls into the same pattern as adders, subtractors, and arithmetic comparisons, because of its use of the carry chain. This view can be clarified by thinking of a counter as an incrementor, followed by a holding register.
The bit pitch of two bits per row is driven primarily by the structure of the carry logic, but is also the bit pitch that the tri-stateable buffers implement. What this means is that the natural structure of arithmetic functions in these devices implements 2 bits of a function (a two bit slice) in one row of CLBs, and for simple functions, in one column. A simple function such as a ten bit synchronous up-counter will therefore take 5 rows and 1 column, a total of 5 CLBs.
Although the XC4000 devices and the A, D, E, H, and L derivatives allow the carry signal between CLBs to interconnect in both an up and down direction within a column, the more recent XC4000EX, XC4000XL, Spartan and SpartanXL devices only support the carry signals being routed up a column. For all devices, within a CLB, the carry routing is up, with regard to the two function generators. It is expected that this up-only bias will exist in future products from Xilinx. To be compatible with all these products, you should use only the up direction for carry, and this bias then affects all other functions that are generated. For the example 10 bit counter described in the previous paragraph, the Floorplan will have bits 0 and 1 in the CLB at the bottom of the column of 5 CLBs, and the top CLB will have bits 8 and 9.
Following Xilinx's standard, the two main function generators are shown on the left of diagrams, and are labeled F and G, and the two flip-flops are shown on the right and are labeled X and Y.
For the example counter, in the CLB at the bottom of the five CLB group (the one with the RLOC=R4C0 attribute), the F function generator will be used to implement the logic that feeds the D pin of the X flip-flop, the output of which is the least significant bit of the counter, Q0. The G and Y sections of the same CLB implement bit 1 of the counter. The next CLB above (the one with the RLOC=R3C0 attribute) implements bits 2 and 3. This continues up the column, through to the top CLB which implements bits 8 and 9.
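The bit-to-CLB mapping described above can be sketched in a few lines of Python. This is only an illustration of the two-bits-per-row rule with the carry running up the column; the function name `counter_bit_location` and its arguments are hypothetical and are not part of any Xilinx tool.

```python
from math import ceil

def counter_bit_location(bit, width, column=0):
    """Return (rloc, function_generator, flip_flop) for one counter bit.

    Assumes two bits per CLB row, rows numbered from the top of the
    column, and bit 0 in the bottom CLB so the carry runs upward.
    """
    if not 0 <= bit < width:
        raise ValueError("bit index out of range")
    rows = ceil(width / 2)          # number of CLBs in the column
    row = rows - 1 - bit // 2       # bit 0 lands in the last (bottom) row
    # Even bits use F and X, odd bits use G and Y, as in the text.
    fg, ff = ("F", "X") if bit % 2 == 0 else ("G", "Y")
    return f"R{row}C{column}", fg, ff

# The ten bit counter from the text: 5 CLBs, bits 0 and 1 at R4C0.
for bit in range(10):
    print(bit, *counter_bit_location(bit, 10))
```

For the ten bit counter this places bits 0 and 1 at R4C0 and bits 8 and 9 at R0C0, matching the description above.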
When two or more functions of your design are Floorplanned in this way and placed side by side, with the signals that flow from one function to the next aligned on the same row, and in near or adjacent columns, the design will place and route much faster, and the resulting design will perform faster than a design that is not Floorplanned and relies on the Xilinx place and route software to decide on placement. Of course, custom building each function section of your design with detailed Floorplanning for each function generator and flip-flop can be a complex, time-consuming, and potentially error-prone process.
The Xilinx Place and Route software uses a hierarchical placement constraint system called relative location attributes. Each level of the hierarchy has an origin in the top left corner that has a relative location of row zero and column zero. As a constraint this is represented as R0C0. Rows are numbered from top to bottom, and columns are numbered from left to right. When a relative location attribute (RLOC) is assigned to a part of the hierarchy that is not a single CLB, then the underlying RLOCs are added to the attached attribute to calculate the RLOC value for each of the underlying RLOCs. This process continues throughout the hierarchy, resolving each CLB RLOC to a value that is relative to the RLOC at the top of the hierarchy. This process, and other issues related to how RLOCs are processed are discussed in full in the Xilinx "Libraries Guide" document, in the "Attributes, Constraints, and Carry Logic" chapter, in the "Relative Location (RLOC) Constraints" section. Although this section of Xilinx's documentation is quite complex, it is recommended that you review it to better understand how the RLOCs in the modules support Floorplanning.
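As a rough sketch of the addition described above, the following Python resolves a CLB's RLOC by summing the row and column offsets at each level of the hierarchy. It models only the simple additive case from this paragraph; the helper names `parse_rloc` and `resolve_rloc` are hypothetical, and the full rules are those given in the Xilinx Libraries Guide.

```python
import re

def parse_rloc(rloc):
    """Split a string such as 'R4C0' into (row, column) integers."""
    match = re.fullmatch(r"R(-?\d+)C(-?\d+)", rloc)
    if match is None:
        raise ValueError(f"not an RLOC value: {rloc!r}")
    return int(match.group(1)), int(match.group(2))

def resolve_rloc(*levels):
    """Add the RLOCs from the outermost level down to the CLB."""
    rows_cols = [parse_rloc(level) for level in levels]
    row = sum(r for r, _ in rows_cols)
    col = sum(c for _, c in rows_cols)
    return f"R{row}C{col}"

# A counter module attached at R2C3 whose bottom CLB carries R4C0:
print(resolve_rloc("R2C3", "R4C0"))  # R6C3
```

The module offset R2C3 in the usage line is an invented example value; only the R4C0 value comes from the counter described earlier.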
An Example design, with various levels of Floorplanning
This section examines the results of Floorplanning, and compares the resulting structure, the place and route time, and the design performance. The example, while contrived, is typical of the types of logic that benefit from Floorplanning. The example design comprises four sixteen-bit binary up counters that all feed into a selection multiplexer. The output of the selection multiplexer is registered, and the output of this register is connected to the FPGA pins.
There are two basic timing path categories that need to be analyzed. The first is the maximum delay in any of the counters, and the second is the maximum delay from any of the counters to the multiplexer output register. For the counter, the maximum delay will be from the clock-to-out time of the LSB flip-flop, through the logic that establishes the next counter value, to the D input of the MSB flip-flop, including its setup time. The reciprocal of this maximum internal delay within the counter is the maximum clock rate at which the counter will count reliably.
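The reciprocal relationship can be written out as a small Python helper. The function name `max_clock_mhz` is hypothetical; the only assumption is the unit conversion, where a delay in nanoseconds gives a frequency in MHz as 1000 divided by the delay.

```python
def max_clock_mhz(delay_ns):
    """Maximum clock rate in MHz for a worst-case path delay in ns."""
    if delay_ns <= 0:
        raise ValueError("delay must be positive")
    return 1000.0 / delay_ns  # 1 / (delay_ns * 1e-9 s), expressed in MHz

# A 13.1 ns counter delay, as reported in several of the result tables:
print(round(max_clock_mhz(13.1), 1))  # 76.3
```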
Seven different levels of Floorplanning are applied to this simple design, using the XC4005E, XC4010E, and XC4010XL as targets. The '-2' speed grade is used for all examples, and place and route programs used are as follows:
The combination of running the XC4010E devices with both place and route programs allows comparison of these programs on the XC4000E families. Running both the XC4010E and XC4010XL on the M1.4 program allows comparison of these two product families. While the goal is to show the value of Floorplanning, the program and product comparisons are interesting.
The same seven levels of Floorplanning were applied to each of these four product/program combinations. The seven design styles have the following characteristics:
To understand the differences in the results for these design styles, the following descriptions of the behavior of the place and route software, as well as an analysis of the device resources should be helpful.
Style 1 uses no Floorplanning or guidance on using the carry logic that is available in these products. The results are consistently the poorest. Style 2 changes the structure of the counters to use carry logic, and from this style through to style 7, the performance and size of the counters do not change much. There is no direct Floorplanning of the counters with regard to their relative placement. While this does not affect the counters, it may not be optimal for the routing from the counters to the multiplexer. As can be seen in the following diagrams, the style 2 designs have placed the counters near each other, but they are not aligned.
Style 3 adds Floorplanning to the counters, and by aligning the counters, the routing to the multiplexer should be more straightforward. This should improve the delays from the counters through the multiplexer to the output register. As can be seen in the diagrams, the multiplexer logic is placed somewhat randomly around the core of the 4 counters.
Style 4 places the output register in the next column to the right of the four counters, and the flip-flops of this register are aligned with the counter bits. Although this should help significantly, it does not, because the 8 logic blocks that hold the 16 flip-flops of the output register do not have sufficient gate resources to implement the 16 four-input multiplexers. Some of the multiplexers are placed with the flip-flops, and some are placed nearby.
Style 5 attempts to alleviate the problems with style 4, by moving the output register to the next column to the right, leaving room for the 8 multiplexers that couldn't fit in with the flip-flops. None of the place and route programs take full advantage of this opportunity for improvement.
Style 6 resolves the performance issue of the multiplexer, by replacing it with a Floorplanned multiplexer with output register. This multiplexer performs an additional optimization of not placing all the flip-flops in the same column, but rather, placing the flip-flops with the multiplexers. A four-to-one multiplexer requires all the gate resources of a CLB, so building a 16 bit wide multiplexer with four inputs requires 16 CLBs. Strictly maintain a Floorplanning structure of two bits of data path implemented per row of structure. The 16 CLBs are Floorplanned to use two columns by eight rows, with bits 0 and 1 on the row at the bottom, and bits 14 and 15 at the top. This exactly matches the bit position of the counters, except the counters have an additional block at the top, for the TC and CEO outputs. This is resolved by placing the counters with RLOC_ORIGIN constraints on row 1, and the multiplexer on row 2.
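The one-row offset between the counter and multiplexer origins can be checked with a short Python sketch. It assumes, as described above, a nine-row counter (one extra CLB at the top for TC and CEO) and an eight-row multiplexer, both with two bits per row and bit 0 at the bottom; the function names and the absolute row numbers it prints are illustrative only.

```python
def counter_row(bit, origin_row):
    """Device row of a counter bit: one extra CLB sits above the 8 data rows."""
    return origin_row + 1 + (7 - bit // 2)

def mux_row(bit, origin_row):
    """Device row of a multiplexer bit: 8 data rows, no extra block."""
    return origin_row + (7 - bit // 2)

COUNTER_ORIGIN_ROW = 1  # counters placed with their origin on row 1
MUX_ORIGIN_ROW = 2      # multiplexer placed with its origin on row 2

# Every bit of the counter lines up with the same bit of the multiplexer.
for bit in range(16):
    assert counter_row(bit, COUNTER_ORIGIN_ROW) == mux_row(bit, MUX_ORIGIN_ROW)

print(counter_row(0, COUNTER_ORIGIN_ROW), counter_row(15, COUNTER_ORIGIN_ROW))  # 9 2
```

If both origins were placed on the same row, every multiplexer bit would sit one row above the counter bit that feeds it, which is the misalignment the row 2 origin avoids.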
At this point you may wonder what additional improvement could be made to style 6. Consider the routing from the left most counter to the multiplexer. It must pass through the other three counters to get to the multiplexer. Similarly, the output of counters two and three must also pass through the fourth counter to get to the multiplexer. Therefore, there is more routing congestion around counter four, although it has the shortest path to the multiplexer. The output of the first counter must traverse the furthest distance to get to the multiplexer. In synchronous designs like this, the slowest path out of a group of paths will be the limiting factor. For the counters to run at their fastest, they need to have their routing congestion minimized. For the paths from the four counters to the multiplexer to be minimized, the multiplexer and the four counters need to be placed so as to minimize the worst-case distance. Both of these goals are achieved in style 7 by placing the multiplexer and its output register in the middle of the structure, with two counters to its left, and two counters to its right.
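The argument for style 7 can be made concrete with a crude column-distance model in Python. This is a sketch under stated assumptions, not a timing analysis: each counter is taken to occupy one column, the multiplexer two adjacent columns, columns are numbered from zero starting at the leftmost counter, and distance is counted in columns only. The function names are hypothetical.

```python
def worst_case_columns(counter_cols, mux_cols):
    """Largest column distance from any counter to any multiplexer column."""
    return max(abs(c - m) for c in counter_cols for m in mux_cols)

def counters_crossed(counter_col, counter_cols, mux_cols):
    """Other counter columns lying between a counter and its nearest mux column."""
    nearest = min(mux_cols, key=lambda m: abs(m - counter_col))
    lo, hi = sorted((counter_col, nearest))
    return sum(1 for c in counter_cols if lo < c < hi)

# Assumed column layouts for the two styles described in the text.
STYLES = {
    "style 6": {"counters": [0, 1, 2, 3], "mux": [4, 5]},
    "style 7": {"counters": [0, 1, 4, 5], "mux": [2, 3]},
}

for name, plan in STYLES.items():
    worst = worst_case_columns(plan["counters"], plan["mux"])
    crossed = max(counters_crossed(c, plan["counters"], plan["mux"])
                  for c in plan["counters"])
    print(name, "worst distance:", worst, "max counters crossed:", crossed)
```

Under this model the worst-case distance drops from 5 columns to 3, and the most counters any signal must cross drops from 3 to 1, when the multiplexer moves to the middle.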
As can be seen from the following tables and diagrams, style 7 delivers the fastest counters, the fastest counter to multiplexer output register time, the fastest placement time, and the fastest routing time. Comparing the schematics for design styles 1 and 7 shows that almost no additional effort is needed to create style 7's result. Selecting counters and multiplexers that are pre-Floorplanned, together with five placement attributes, is all that is required. (Some thought as to what the placement constraints should be is obviously also needed.)
XC4005EPC84-2 Processed with PPR V5.2.1c

| Design Style | Counter Delay (ns) | Max Frequency (MHz) | Counter to MUX REG Delay (ns) | Partition + Placement Time (s) | Routing Time (s) | CLBs Used |
|---|---|---|---|---|---|---|
| 1 | 17.1 | 58.4 | 11.8 | 4+28 | 12 | 72 |
| 2 | 13.1 | 76.3 | 10.8 | 6+15 | 13 | 48 |
| 3 | 13.4 | 74.6 | 11.7 | 6+14 | 17 | 48 |
| 4 | 13.1 | 76.3 | 14.4 | 7+12 | 17 | 48 |
| 5 | 14.3 | 69.9 | 14.5 | 6+12 | 16 | 48 |
| 6 | 13.3 | 75.1 | 9.4 | 3+11 | 16 | 48 |
| 7 | 13.1 | 76.3 | 8.9 | 3+11 | 14 | 48 |
XC4010EPC84-2 Processed with PPR V5.2.1c

| Design Style | Counter Delay (ns) | Max Frequency (MHz) | Counter to MUX REG Delay (ns) | Partition + Placement Time (s) | Routing Time (s) | CLBs Used |
|---|---|---|---|---|---|---|
| 1 | 17.5 | 57.1 | 12.9 | 7+53 | 32 | 88 |
| 2 | 13.3 | 75.1 | 11.2 | 4+13 | 12 | 48 |
| 3 | 13.5 | 74.0 | 12.6 | 4+11 | 15 | 48 |
| 4 | 13.1 | 76.3 | 14.6 | 4+11 | 17 | 48 |
| 5 | 13.2 | 75.7 | 14.2 | 3+11 | 14 | 48 |
| 6 | 13.3 | 75.1 | 10.2 | 2+10 | 16 | 48 |
| 7 | 13.1 | 76.3 | 8.9 | 1+10 | 15 | 48 |
XC4010EPC84-2 Processed with M1.3.7 (PAR -L4 -D5) (A)

| Design Style | Counter Delay (ns) | Max Frequency (MHz) | Counter to MUX REG Delay (ns) | Placement Time (s) | Routing Time (s) | CLBs Used |
|---|---|---|---|---|---|---|
| 1 | 21.9 | 45.6 | 19.4 | 65-7=58 | 574-65=509 | 55 |
| 2 | 13.7 | 72.9 | 10.0 | 47-7=40 | 142-47=95 | 48 |
| 3 | 13.8 | 72.4 | 10.3 | 38-8=30 | 170-38=132 | 48 |
| 4 | 13.8 | 72.4 | 12.7 | 28-8=20 | 132-28=104 | 56 |
| 5 | 13.7 | 72.9 | 13.1 | 28-8=20 | 128-28=100 | 56 |
| 6 | 13.7 | 72.9 | 9.4 | 15-8=7 | 80-15=65 | 48 |
| 7 | 13.7 | 72.9 | 8.9 | 14-8=6 | 75-14=61 | 48 |
XC4010XLPC84-2 Processed with M1.3.7 (PAR -L4 -D5) (B)

| Design Style | Counter Delay (ns) | Max Frequency (MHz) | Counter to MUX REG Delay (ns) | Placement Time (s) | Routing Time (s) | CLBs Used |
|---|---|---|---|---|---|---|
| 1 | 18.5 | 54.0 | 8.8 | 68-20=48 | 147-68=79 | 55 |
| 2 | 11.6 | 86.2 | 7.0 | 53-21=32 | 134-53=81 | 48 |
| 3 | 11.9 | 84.0 | 6.9 | 46-21=25 | 128-46=82 | 48 |
| 4 | 12.1 | 82.6 | 10.6 | 34-22=12 | 95-34=61 | 56 |
| 5 | 11.7 | 85.4 | 10.7 | 33-21=12 | 91-33=58 | 56 |
| 6 | 11.9 | 84.0 | 6.8 | 25-20=5 | 64-25=39 | 48 |
| 7 | 11.7 | 85.4 | 6.1 | 26-21=5 | 69-26=43 | 48 |
XC4010XLPC84-2 Processed with M1.4.12 (MAP -K, PAR -L4 -D5)

| Design Style | Counter Delay (ns) | Max Frequency (MHz) | Counter to MUX REG Delay (ns) | Placement Time (s) | Routing Time (s) | CLBs Used |
|---|---|---|---|---|---|---|
| 1 | 18.2 | 54.9 | 11.3 | 64-20=44 | 185-64=121 | 83 |
| 2 | 11.3 | 88.5 | 9.8 | 39-21=18 | 183-39=144 | 72 |
| 3 | 11.8 | 84.7 | 10.6 | 33-20=13 | 108-33=75 | 72 |
| 4 | 11.6 | 86.2 | 10.8 | 32-21=11 | 128-32=96 | 72 |
| 5 | 11.7 | 85.4 | 11.0 | 32-21=11 | 116-32=84 | 72 |
| 6 | 11.6 | 86.2 | 6.8 | 24-21=3 | 59-24=35 | 48 |
| 7 | 11.7 | 85.4 | 6.1 | 24-20=4 | 61-24=37 | 48 |
XC4010XLPC84-2 Processed with M1.4.12 (MAP -K, PAR -L5 -D5)

| Design Style | Counter Delay (ns) | Max Frequency (MHz) | Counter to MUX REG Delay (ns) | Placement Time (s) | Routing Time (s) | CLBs Used |
|---|---|---|---|---|---|---|
| 1 | 17.3 | 57.8 | 11.3 | 99-20=79 | 224-99=125 | 83 |
| 2 | 11.7 | 85.4 | 9.9 | 58-21=37 | 229-58=171 | 72 |
| 3 | 12.1 | 82.6 | 10.5 | 46-20=26 | 140-46=94 | 72 |
| 4 | 11.6 | 86.2 | 11.1 | 44-21=23 | 117-44=73 | 72 |
| 5 | 11.7 | 85.4 | 10.9 | 44-21=23 | 134-44=90 | 72 |
| 6 | 12.1 | 82.6 | 6.7 | 27-21=6 | 60-27=33 | 48 |
| 7 | 11.7 | 85.4 | 6.1 | 27-21=6 | 66-27=39 | 48 |
XC4010XLPC84-2 Processed with M1.4.12 (PAR -L4 -D5)

| Design Style | Counter Delay (ns) | Max Frequency (MHz) | Counter to MUX REG Delay (ns) | Placement Time (s) | Routing Time (s) | CLBs Used |
|---|---|---|---|---|---|---|
| 1 | 18.8 | 53.2 | 9.1 | 63-20=43 | 199-63=136 | 55 |
| 2 | 12.0 | 83.3 | 7.7 | 45-20=25 | 132-45=87 | 48 |
| 3 | 12.2 | 81.9 | 6.7 | 36-21=15 | 116-36=80 | 48 |
| 4 | 11.9 | 84.0 | 10.3 | 30-20=10 | 97-30=67 | 56 |
| 5 | 12.0 | 83.3 | 10.5 | 31-21=10 | 103-31=72 | 56 |
| 6 | 11.6 | 86.2 | 6.8 | 24-20=4 | 58-24=34 | 48 |
| 7 | 11.7 | 85.4 | 6.1 | 24-20=4 | 61-24=37 | 48 |
XC4010XLPC84-2 Processed with M1.4.12 (PAR -L5 -D5)

| Design Style | Counter Delay (ns) | Max Frequency (MHz) | Counter to MUX REG Delay (ns) | Placement Time (s) | Routing Time (s) | CLBs Used |
|---|---|---|---|---|---|---|
| 1 | 18.1 | 55.2 | 7.7 | 105-21=84 | 257-105=152 | 55 |
| 2 | 12.0 | 83.3 | 6.7 | 72-21=51 | 199-72=127 | 48 |
| 3 | 11.8 | 84.7 | 6.8 | 55-21=34 | 138-55=83 | 48 |
| 4 | 12.1 | 82.6 | 10.5 | 40-21=19 | 148-40=108 | 56 |
| 5 | 12.1 | 82.6 | 10.6 | 40-20=20 | 102-40=62 | 56 |
| 6 | 12.1 | 82.6 | 6.7 | 29-22=7 | 61-29=32 | 48 |
| 7 | 11.7 | 85.4 | 6.1 | 27-21=6 | 66-27=39 | 48 |
Interpreting the Floorplan Pictures
The full manual has the pictures for all 8 of the above tables of data. This page only has the pictures for the last table, which is for M1 PAR V1.4.12 with -L 5 and -D 5, representing high effort in both the placer and the router. At the time of writing this page, the XC4000XL is Xilinx's leading FPGA family, and M1 PAR version 1.4.12 is the current version of the place and route software.
The color coding of the following Floorplans is as follows:
XC4010XL-S1-F: The 4 counters are binary ripple counters (CB16CE) from the Xilinx unified library XC4000E; the multiplexer and output register are also taken from this library. There is no Floorplanning in this style, and the ripple counter, while available in the library, is a poor choice.
XC4010XL-S2-F: The 4 counters are binary counters that use the built-in carry logic (CC16CE) from the Xilinx unified library XC4000E; the multiplexer and output register are also taken from this library. While there is no explicit Floorplanning in this style, the counters include internal Floorplanning, because the carry logic imposes a column structure on the counters.
XC4010XL-S3-F: This style adds four RLOC_ORIGIN Floorplanning constraints to the style 2 design, placing the four counters in adjacent columns, and aligning the MSBs of the counters (and all other bits).
XC4010XL-S4-F: This style replaces the un-Floorplanned output register of the previous styles with a Floorplanned register, and places it in the column to the right of the fourth counter. It is also aligned with regard to bit positions.
XC4010XL-S5-F: This style is like style 4, except the output register is placed in the column to the right of the column used for the register in style 4.
XC4010XL-S6-F: This style uses a Floorplanned multiplexer and output register built by the FlibGen module generator, and places it in the two columns to the right of the fourth counter. The odd bit multiplexers and output register flip-flops are in one of these two columns, and the even bits are in the other column.
XC4010XL-S7-F: This style uses the same components as style 6, but the Floorplan has been changed. The first two columns contain the first two counters, the next two columns are the multiplexer and output register, and the last two columns contain the third and fourth counters.
If you have read this page and found it useful, please send an email to philip@fliptronics.com