Hardware anti-aliasing

Traditional methods of hardware anti-aliasing are memory and processor intensive. However, a new algorithm might offer a solution.

This paper describes algorithms for accelerating anti-aliasing in 3D graphics through low-cost custom hardware. The rendering architecture employs a multiple-pass algorithm to perform front-to-back hidden surface removal and shading. Coverage mask evaluation is used to composite objects in 3D. The key advantage of this approach is that anti-aliasing requires no additional memory and decreases rendering performance by only 30-40 per cent for typical images. The system is image partition-based and is scalable to satisfy a wide range of performance and cost constraints.

An A-buffer implementation does not require several passes of the object data, but does require sorting objects by depth before compositing them. The amount of memory required to store the sorted layers is limited to the number of subpixel samples, but it is significant since the colour, opacity and mask data are needed for each layer. The compositing operation uses a blending function that is based on three possible subpixel coverage components and is more computationally intensive than the accumulation buffer blending function.

The A-buffer hardware implementation described in this paper maintains the high performance of the A-buffer using a limited amount of memory. Multiple passes of the object data are sometimes required to composite the data from front-to-back even when anti-aliasing is disabled.

The number of passes required to rasterise a partition increases when anti-aliasing is used. However, only in the worst case is the number of passes equal to the number of subpixel samples. It is possible to enhance the algorithm to correctly render intersecting objects. The current implementation does not include that enhancement. Furthermore, the algorithm correctly renders images of moderate complexity which have overlapping transparent objects without imposing any constraints on the order in which transparent objects are submitted.

The hardware accelerator is a single ASIC which performs the 3D rendering and triangle setup. It provides a low-cost solution for high performance 3D acceleration in a personal computer. A second ASIC is used to interface to the system bus or PCI/AGP. The rasteriser uses a screen-partitioning algorithm with a partition size of 16x32 pixels. Screen partitioning reduces the memory required for depth sorting and image compositing to a size that can be accommodated inexpensively on-chip. No off-chip memory is needed for the Z-buffer and dedicated image buffer. The high bandwidth, low latency path between the rasteriser and the on-chip buffers improves performance.
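The screen partitioning described above can be sketched in a few lines. This is only an illustration: the article gives the partition size as 16x32 pixels but not its orientation, so the sketch assumes 16 pixels wide by 32 tall, and the function names, the bits-per-pixel argument and the two-layer default are hypothetical.

```python
import math

PARTITION_W = 16  # assumption: 16 pixels wide
PARTITION_H = 32  # assumption: 32 pixels tall


def partition_grid(screen_w, screen_h, part_w=PARTITION_W, part_h=PARTITION_H):
    """Return (columns, rows) of partitions needed to tile the whole screen."""
    return math.ceil(screen_w / part_w), math.ceil(screen_h / part_h)


def partition_of(x, y, part_w=PARTITION_W, part_h=PARTITION_H):
    """Return the (column, row) of the partition that contains pixel (x, y)."""
    return x // part_w, y // part_h


def buffer_bytes(bits_per_pixel, layers=2, part_w=PARTITION_W, part_h=PARTITION_H):
    """Bytes needed to hold `layers` copies of one per-pixel buffer for a partition."""
    return part_w * part_h * bits_per_pixel * layers // 8
```

For example, partition_grid(640, 480) gives (40, 15), that is 600 partitions per frame. The point of the small partition is visible in buffer_bytes: the storage depends on the partition size, not on the screen resolution.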

Several hardware algorithms have been developed which maintain either high quality or performance while reducing or eliminating the large memory requirement of supersampling. An accumulation buffer requires only a fraction of the memory of supersampling, but requires several passes of the object data. The hardware accelerator implements a triangle setup to reduce required system bandwidth and balance the computational load between the accelerator and the host processors. Multiple rendering ASICs can operate in parallel to match CPU performance.

The anti-aliasing algorithm is distributed among three of the major functional blocks of the ASIC: the Plane Equation Setup, Hidden Surface Removal and Composite Blocks.

The hardware accelerator only interrupts the processor when it has finished processing a frame. This leaves the CPU free to perform geometry, clipping and shading operations for the next frame while the ASIC is rasterising the current frame.

Hidden surface removal

The partition size is 16x32 pixels so that a double-buffered Z-buffer and image buffer can be stored on-chip. This reduces cost and required memory bandwidth while improving performance. External memory is required for texture map storage, so texture map rendering performance scales with the memory speed and bandwidth. In addition to these three design principles, another goal was to provide hardware support for anti-aliased rendering. Two types of anti-aliasing quality were desired: a fast mode for interactive rendering and a slower, high quality mode for producing final images.

For high quality anti-aliasing, the ASIC uses a traditional accumulation buffer method to anti-alias each partition by rendering the partition at every subpixel offset and accumulating the results in an off-chip buffer. Because this algorithm is well known, this high-quality anti-aliasing mode is not discussed in this paper.
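Since the accumulation buffer method is not described further, a minimal software sketch of the general idea may help. It is not the ASIC's implementation: the scene is reduced to a hypothetical inside(x, y) test, the image is greyscale, and the subpixel offsets are arbitrary illustrative values.

```python
def render_pass(width, height, offset_x, offset_y, inside):
    """Point-sample the scene once, shifting every sample by one subpixel offset."""
    return [[1.0 if inside(x + offset_x, y + offset_y) else 0.0
             for x in range(width)]
            for y in range(height)]


def accumulate(width, height, offsets, inside):
    """Render once per subpixel offset and average the passes."""
    accum = [[0.0] * width for _ in range(height)]
    for offset_x, offset_y in offsets:
        image = render_pass(width, height, offset_x, offset_y, inside)
        for y in range(height):
            for x in range(width):
                accum[y][x] += image[y][x]
    count = len(offsets)
    return [[value / count for value in row] for row in accum]


# Hypothetical scene: everything left of x = 1.5 is covered.
OFFSETS = [(0.125, 0.125), (0.375, 0.625), (0.625, 0.375), (0.875, 0.875)]
result = accumulate(3, 1, OFFSETS, lambda sx, sy: sx < 1.5)
```

Here result is [[1.0, 0.5, 0.0]]: the pixel crossed by the edge receives half intensity. The cost is one full rendering pass per offset, which is why this mode is reserved for final images.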

The more challenging goal was to also provide high quality anti-aliasing for interactive rendering in less than double the time needed to render a non-anti-aliased image. We assumed that this type of anti-aliasing would only be used for playback or previewing, so it could only consume a small portion of the die area. Therefore, the challenge in implementing anti-aliasing was how to properly anti-alias without maintaining the per-pixel coverage and opacity data for each of the layers individually.

The solution to this problem involves having the ASIC perform Z-ordered shading using a multiple pass algorithm. This permits an unlimited number of layers to be rendered for each pixel as in the appropriate architecture.

However, because the architecture performs anti-aliasing by integrating area samples in multiple passes to successively anti-alias the image, the number of passes is equal to the number of subpixel positions in the filter kernel. For example, rendering an anti-aliased image using a typical filter kernel of eight samples would require eight times as long as rendering it without anti-aliasing. Obviously, this is too high a performance penalty for use in interactive rendering.

With our modified A-buffer algorithm, the number of passes required to anti-alias an image is a function of image complexity (opacity and subpixel coverage) in each partition, not the number of subpixel samples. The worst case arises when there are at least eight layers which have eight different coverage masks which each cover only one subpixel. This rarely, if ever, occurs in practice. In fact, an average of only 1.4 passes is required when rendering with a 16x32 partition and an 8-bit mask.
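The dependence of the pass count on coverage rather than on sample count can be illustrated with a deliberately simplified model. It assumes opaque fragments already ordered front to back, one composited fragment per pixel per pass, and no fragment merging; the hardware algorithm is more involved, and the function names are hypothetical.

```python
FULL_MASK = 0xFF  # eight subpixel samples, one bit each


def passes_for_pixel(masks_front_to_back):
    """Count passes for one pixel: one per fragment that adds visible coverage."""
    covered = 0
    passes = 0
    for mask in masks_front_to_back:
        if covered == FULL_MASK:
            break  # pixel fully resolved, deeper fragments are hidden
        visible = mask & ~covered & FULL_MASK
        if visible:
            covered |= visible
            passes += 1
    return max(passes, 1)  # a partition is always rasterised at least once


def passes_for_partition(pixels):
    """A partition needs as many passes as its most demanding pixel."""
    return max((passes_for_pixel(masks) for masks in pixels), default=1)
```

In this model a fully covered opaque pixel, passes_for_pixel([0xFF, 0x0F]), takes one pass, while eight fragments that each cover a different single subpixel, passes_for_pixel([1 << i for i in range(8)]), take eight, which corresponds to the worst case described above.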

The Plane Equation Setup calculates plane equation parameters for each triangle and stores them for later evaluation in the relevant processing blocks. The Scan Conversion generates the subpixel coverage masks for each pixel fragment and outputs them to the rendering pipeline. During the Hidden Surface Removal, fragments of tessellated objects are flagged for specific blend operations during shading. The Composite Block shades pixels by merging the coverage masks and alpha values.

Each pixel is divided into 16 subpixels, but only half of the samples are used. The mask is stored as an 8-bit value using the bit assignments. Pixels which are completely covered by opaque objects are resolved in a single pass. When a pixel contains portions of two or more triangles, it is desirable to merge the pixel fragments so that the pixel can be fully composited in one pass.
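A sketch of how such an 8-bit mask might be built and merged follows. The actual sample positions and bit assignments are not reproduced in this article, so the checkerboard layout and bit order below are assumptions made purely for illustration, as are the function names.

```python
# Assumed layout: 8 of the 16 cells of a 4x4 subpixel grid, in a checkerboard.
# Bit i of the mask corresponds to SAMPLE_POSITIONS[i].
SAMPLE_POSITIONS = [(col, row) for row in range(4) for col in range(4)
                    if (col + row) % 2 == 0]


def coverage_mask(inside):
    """Build the 8-bit mask by testing the centre of each sampled subpixel."""
    mask = 0
    for bit, (col, row) in enumerate(SAMPLE_POSITIONS):
        if inside((col + 0.5) / 4, (row + 0.5) / 4):
            mask |= 1 << bit
    return mask


def coverage_fraction(mask):
    """Fraction of the eight samples that are covered."""
    return bin(mask & 0xFF).count("1") / 8


def try_merge(mask_a, mask_b):
    """Merge two fragments whose masks do not overlap; return None otherwise."""
    if mask_a & mask_b:
        return None
    return mask_a | mask_b


left = coverage_mask(lambda x, y: x < 0.5)    # triangle covering the left half
right = coverage_mask(lambda x, y: x >= 0.5)  # neighbour covering the right half
```

With this layout left is 0x55 and right is 0xAA, so try_merge(left, right) returns 0xFF: two abutting fragments combine into a fully covered pixel that can be composited in one pass.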

Shading is implemented in the Composite Block, which, like the Hidden Surface Removal block, has a two-layer buffer for storing pixel colours and masks. Since the buffers occupy on-chip memory, it is necessary to minimise the state information stored in them. Consequently, the pixel colour, alpha and mask for the previously composited data are stored as a single 40-bit value. In addition, a single bit controls the blending functions for colour, alpha and mask.
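One way to picture the 40-bit value is as five 8-bit fields. The article states only the total width, so the split into 8 bits each for red, green, blue, alpha and mask, and the field order, are assumptions; the helper names are hypothetical.

```python
FIELD_MAX = 0xFF  # assumption: every field is 8 bits wide


def pack_pixel(red, green, blue, alpha, mask):
    """Pack five 8-bit fields into one 40-bit integer (assumed field order)."""
    for value in (red, green, blue, alpha, mask):
        if not 0 <= value <= FIELD_MAX:
            raise ValueError("each field must fit in 8 bits")
    return (red << 32) | (green << 24) | (blue << 16) | (alpha << 8) | mask


def unpack_pixel(word):
    """Recover (red, green, blue, alpha, mask) from a packed 40-bit value."""
    return tuple((word >> shift) & FIELD_MAX for shift in (32, 24, 16, 8, 0))


word = pack_pixel(0x12, 0x34, 0x56, 0x78, 0x9A)
```

Here word equals 0x123456789A and unpack_pixel(word) returns the original five fields. Keeping only this composited state per pixel, rather than one entry per layer, is what lets the buffers fit on-chip.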

Compiled by Ajith Ram

(c) 1998 Stephanie Winner, Mike Kelley, Brent Pease, Bill Rivard, and Alex Yen, Apple Computer, 3Dfx, SGI
