MMX Technology explained

Although not very easy to incorporate into the 3D rendering pipeline, MMX can offer a significant performance increase


The 3D graphics rendering pipeline is controlled by an application, usually through calls to an API. Scene management may be integrated in the application or may be part of the 3D pipeline beneath the API. The 3D pipe processes a geometric database of points or vertices handed to it by the scene manager. These vertices are linked into polygons, usually triangles.

Typical scenes contain 300 to 30,000 polygons, so rendering at 30 frames/sec requires processing roughly 10,000 to 1 million polygons/sec. The final output is pixels, usually at rates of 5 to 100 million per second. We limit the scope of our discussion to pixel processing, also known as "rendering" or "rasterisation".

A typical 3D rasterisation dataflow receives transformed, lit vertices and linking structures such as discrete triangles (three vertices each) or triangle meshes. In a mesh, each new vertex defines a new triangle whose other two vertices are the preceding two in the mesh list. For final output, the pipeline draws images on a CRT screen. Each vertex consists of:

X, Y and Z coordinates in "screen space"

R, G and B colour components, and possibly an "A" (alpha) transparency factor

U and V texture coordinates (for polygons which are to be textured)

Other features of vertices are typically maintained as "state" variables of the rendering pipeline. These variables must be changed explicitly each time a new texture, shading, blending or lighting model is desired.

Typical state variables include:

Texture: a pointer to the current bitmap to be used for texturing, if any

Texture Filtering Mode: point sampling, bilinear filtering, mip-mapped and so forth

Texture Application Mode: wrap, modulate, decal, blend and so forth

Shading Mode: flat, Gouraud (smooth), pseudo-specular or Phong (highlighted)

Material: determines light reflectivity

Z-buffering: off or on, and type of Z-compare

Clipping: off or on, with various types

Blending Mode: none, source, destination, stipple (screen door) and so forth

Anti-aliasing: off or on

Dithering: off or on

Fog: off or on, and variance with depth (Z)
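The vertex fields and state variables listed above can be pictured as plain C types. The following is a minimal sketch only: every type, field and function name is an illustrative assumption rather than part of any real API, and only a few of the state variables are shown. The two helper functions illustrate the mesh rule that each new vertex forms a triangle with the preceding two.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One transformed, lit vertex as handed to the rasteriser.
   Field names and types are assumptions made for illustration. */
typedef struct {
    float x, y, z;        /* coordinates in screen space */
    uint8_t r, g, b, a;   /* colour components plus optional alpha */
    float u, v;           /* texture coordinates, used only when textured */
} Vertex;

typedef enum { SHADE_FLAT, SHADE_GOURAUD, SHADE_PHONG } ShadingMode;
typedef enum { FILTER_POINT, FILTER_BILINEAR, FILTER_MIPMAP } FilterMode;

/* Renderer state: set explicitly by the application, not stored per vertex. */
typedef struct {
    const uint8_t *texture;   /* current texture bitmap, NULL if none */
    FilterMode filter;
    ShadingMode shading;
    int z_buffering;          /* 0 = off, 1 = on */
    int dithering;
    int fog;
} RenderState;

/* A mesh list of n vertices holds n - 2 triangles:
   every vertex after the first two adds one triangle. */
size_t mesh_triangle_count(size_t vertex_count) {
    return vertex_count < 3 ? 0 : vertex_count - 2;
}

/* Vertex indices of triangle t: the new vertex (t + 2)
   and the two that precede it in the mesh list. */
void mesh_triangle(size_t t, size_t idx[3]) {
    assert(idx != NULL);
    idx[0] = t;
    idx[1] = t + 1;
    idx[2] = t + 2;
}
```

Keeping the state in one structure, separate from the vertices, mirrors the point made above: the per-vertex data stays small, and a change of texture or shading model is an explicit state change.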

The sequence of operations on vertices can vary, but it usually follows a pattern like:

Clip and Cull (Remove) Polygons: those which are outside of the viewport or are backfacing, on the "backside" of the object. From partially-clipped objects, new polygons must be created which exclude the clipped-out areas.

Setup Scan Lines: a scan line is a "horizontal" or constant Y-value sequence of pixels. By calculating the slopes of the edges of each triangle, we determine the start and end pixels of each scan line, the number of pixels on the scan line and the change in r, g, b, u, v and Z per pixel. These deltas are often called dr, dg, db, du, dv, and dZ.

Setup Texture Perspective Correction: for texturing, additional parameters for each scan line, or even smaller spans of scan lines, are required to correct for the foreshortening imposed by a perspective view. Without this correction factor, textures distort and show discontinuities along polygon edges.

Perfect perspective correction requires two divides PER PIXEL or a divide and two multiplications. Perspective correction can be approximated without divides, by adding second-order (quadratic) differentials to the du and dv values at each pixel. These differentials are labelled ddu and ddv. Alternatively, the algorithm can do the divide once per 8 or 16 pixels, and linearly interpolate for the intervening 7 or 15 pixels.

Draw each pixel on each scan line
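The scan-line setup and the "divide once per 16 pixels" approximation from the steps above can be sketched in C as follows. This is an illustration under stated assumptions, not a production rasteriser: the function names, the 16.16 fixed-point format for the colour deltas and the use of floating point for the texture terms are all choices made for this example. It relies on the standard observation that u/z and 1/z vary linearly across the screen, so the true u is their quotient.

```c
#include <assert.h>
#include <stdint.h>

#define SPAN 16   /* pixels between true perspective divides (assumed) */

/* Per-pixel delta (for example dr or dZ) of a linearly interpolated
   value, given in 16.16 fixed point at both ends of the scan line. */
int32_t span_delta(int32_t start, int32_t end, int pixel_count) {
    assert(pixel_count > 0);
    return (end - start) / pixel_count;
}

/* Approximate perspective-correct u along one scan line.
   uz = u/z and iz = 1/z at the first pixel; duz and diz are their
   per-pixel deltas. The exact u is computed with one divide at each
   span boundary and interpolated linearly in between. */
void perspective_u(float uz, float duz, float iz, float diz,
                   int count, float *u_out) {
    float u_left = uz / iz;               /* exact u at the first pixel */
    int x = 0;
    while (x < count) {
        int len = count - x < SPAN ? count - x : SPAN;
        uz += duz * len;                  /* step to the end of the span */
        iz += diz * len;
        float u_right = uz / iz;          /* the one perspective divide */
        float du = (u_right - u_left) / len;  /* a shift when len is 16 */
        float u = u_left;
        for (int i = 0; i < len; i++) {
            u_out[x + i] = u;             /* linear inside the span */
            u += du;
        }
        u_left = u_right;
        x += len;
    }
}
```

The same routine would be run for v. The interpolated values are exact only at span boundaries, so the visible error grows with the span length and with how steeply depth changes along the scan line.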

While the above setup operations can benefit from the parallelism of MMX technology, the number of calculations is limited and they often use floating point arithmetic. Much more benefit accrues in the subsequent steps of drawing each pixel on each scanline. To draw each pixel, a variety of operations occur depending on the state variables of the renderer discussed earlier. The order of operations may vary and may not include all of the following, but generally they are:

Hidden surface removal

Determines whether the pixel is hidden by others in the scene nearer the viewer. That is, if the Z-value (depth in the 3D co-ordinate system) of the pixel is greater than the one already drawn for a given X,Y location, don't draw it. This can be achieved by Z-buffering or by pre-sorting polygons or scan-line spans. For Z-buffering, the Z-values for pixels already drawn in the image must be refetched from the buffer, compared to new values and conditionally updated.

Colour

Calculate the pixel colour via flat shading or Gouraud shading and/or texture mapping. MMX technology can Gouraud-shade two or four pixels concurrently, at rates of three to eight clocks per pixel.

Texture

Look up the texture element(s) or texel(s) which are to be mapped to the current pixel. Multiply/add them together if linear or bilinear filtering or trilinear filtering is specified. Texturing may require looking up a software-based palette for each texel if the source bitmap is palettised (the usual case). It may also involve decompressing the texture via some statistical algorithm. And it may involve Gouraud shading to combine the underlying material colour and/or light colour with the texture.
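The three per-pixel steps above (Z-compare, colour interpolation and a palettised texture lookup) can be combined into one inner loop. The sketch below is plain scalar C, not MMX code, and makes several assumptions for illustration: the function and type names are hypothetical, Z values are 16-bit, shade colours are 8.8 fixed point, the palette holds 0x00RRGGBB entries, and texture and shade are combined by a simple multiply with a shift by 8 standing in for a division by 255.

```c
#include <assert.h>
#include <stdint.h>

/* Shade colour with each channel held as 8.8 fixed point. */
typedef struct { uint16_t r, g, b; } Rgb88;

/* Draw one span of a scan line; returns the number of pixels written.
   u and du are 16.16 fixed-point texel positions along one texture row. */
int draw_span(uint16_t *zbuf, uint32_t *frame, int count,
              uint16_t z, int16_t dz, Rgb88 c, Rgb88 dc,
              const uint8_t *texels, uint32_t u, uint32_t du,
              const uint32_t *palette) {
    int drawn = 0;
    assert(count >= 0);
    for (int i = 0; i < count; i++) {
        if (z < zbuf[i]) {                      /* nearer than what is there */
            zbuf[i] = z;                        /* conditional Z update */
            uint32_t texel = palette[texels[u >> 16]];  /* palette lookup */
            /* Modulate each texel channel by the integer part of the shade. */
            uint32_t r = (((texel >> 16) & 0xFF) * (c.r >> 8)) >> 8;
            uint32_t g = (((texel >> 8) & 0xFF) * (c.g >> 8)) >> 8;
            uint32_t b = ((texel & 0xFF) * (c.b >> 8)) >> 8;
            frame[i] = (r << 16) | (g << 8) | b;
            drawn++;
        }
        /* Step the interpolated values by their per-pixel deltas. */
        z = (uint16_t)(z + dz);
        c.r = (uint16_t)(c.r + dc.r);
        c.g = (uint16_t)(c.g + dc.g);
        c.b = (uint16_t)(c.b + dc.b);
        u += du;
    }
    return drawn;
}
```

Each pixel costs a Z read, a texel read, a palette read and up to two writes, which is why the memory behaviour of this loop matters as much as its arithmetic.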

MMX technology does best on 16-bit hi colour and 24-bit true colour data. The packed add, multiply and logical operations actually make 24-bit the only format attractive for calculations, handled either as three 8-bit R, G and B components or as three 8.8 fixed-point values (48 bits in total for RGB). After completing calculations (like Gouraud shading, alpha and so forth), the algorithm can convert to RGB16 (555 or 565) for updating the display buffer. MMX technology has neither built-in arithmetic nor packing for 5-bit chunks.
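The final conversion from 8.8 fixed-point channels to RGB16 is a matter of shifts and ORs. A minimal scalar sketch, with hypothetical function names, keeps only the top 5 or 6 bits of each channel's integer part:

```c
#include <assert.h>
#include <stdint.h>

/* Pack three 8.8 fixed-point channels into RGB 565.
   Shifting by 11 (or 10) discards the 8 fraction bits and the
   low 3 (or 2) bits of the integer part. */
uint16_t pack_rgb565(uint16_t r88, uint16_t g88, uint16_t b88) {
    uint16_t r = r88 >> 11;   /* 5 bits */
    uint16_t g = g88 >> 10;   /* 6 bits */
    uint16_t b = b88 >> 11;   /* 5 bits */
    return (uint16_t)((r << 11) | (g << 5) | b);
}

/* Same idea for RGB 555, where every channel keeps 5 bits. */
uint16_t pack_rgb555(uint16_t r88, uint16_t g88, uint16_t b88) {
    uint16_t r = r88 >> 11;
    uint16_t g = g88 >> 11;
    uint16_t b = b88 >> 11;
    return (uint16_t)((r << 10) | (g << 5) | b);
}
```

Doing all shading in the wider format and truncating only at this last step is what keeps the intermediate arithmetic free of 5-bit fields.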

24-bit internal operations allow high-quality features which are not possible with 8-bit palettised colour, including:

Alpha blending for transparency, fog, haze

True RGB (multi-coloured) lights for orange firelight and so forth

Pseudo-specular highlighting

More than 256 colours on the screen at once

Truly smooth shading, free of banding artifacts

Fewer (or no) dithering artifacts

No palette flashes or clashes with other application palettes

Compatibility with hi colour content (textures) and features, similar to HW accelerators

Of course, the amount of graphics memory in the system determines whether the application can use 16-bit or 24-bit colour. With a 1MB graphics card, the application will probably stay with 8-bit palettised colour (640 x 480 x 2 bytes requires 600KB, which fits in 1MB only if the display is not double-buffered or if the back buffer is in system memory). With 2MB or 4MB, 16-bit is attractive, as the 1.2MB requirement of 640 x 480 double-buffered still leaves about 800KB of a 2MB card for texture caching and/or Z-buffering.
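The memory figures above follow from a single multiplication, shown here as a small helper with a hypothetical name:

```c
#include <assert.h>

/* Bytes of graphics memory needed for the display buffers. */
unsigned long framebuffer_bytes(unsigned width, unsigned height,
                                unsigned bytes_per_pixel, unsigned buffers) {
    assert(bytes_per_pixel > 0 && buffers > 0);
    return (unsigned long)width * height * bytes_per_pixel * buffers;
}
```

For 640 x 480 at 2 bytes per pixel this gives 614,400 bytes (600KB) for one buffer and 1,228,800 bytes (1,200KB) for two. Counting a 2MB card as 2,048KB, that leaves 848KB, in line with the rounded figure of about 800KB.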

8-bit colour does permit moving eight pixels simultaneously with a single MOVQ instruction. It also permits 16 simultaneous compares and/or merges, given the dual-pipeline MMX technology execution units and instruction "pairing". Unfortunately, 8-bit palettised rendering requires reading one byte at a time from a texture palette.

Traditionally, 8-bit rendering code for older processors played "speed tricks" which actually degrade performance on newer CPUs. These should be avoided for high speed on the Pentium processor and Dynamic Execution processors. Examples include:

Self-modifying code, which overwrites part of the instruction stream with a new address or immediate value. This avoids the use of another register to contain data. On newer CPUs, every modified instruction can waste dozens of clocks, because deep pipelines must be emptied and cache lines invalidated and refetched.

Random (non-sequential) accesses to memory locations, which result from table lookups for multiplication, colour conversion, shading or dithering.

Frequent data-dependent branches, especially if their outcome is irregular.

Access of 32-bit registers (EAX, EBX, ECX, EDX) as 8-bit or 16-bit registers (AL, AH, AX and BL), to combine multiple 8-bit pixels in single 32-bit writes or to zero-out individual bytes.
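One common way to avoid a data-dependent branch is to turn the comparison into a bit mask and select with logical operations. The scalar sketch below, with a hypothetical function name, applies this to a Z-compare. It follows the same compare-then-mask pattern that the MMX instructions PCMPGTW, PAND, PANDN and POR allow on several pixels at once. Whether a given compiler emits a branch for the comparison itself is not guaranteed.

```c
#include <assert.h>
#include <stdint.h>

/* Return new_px if new_z is nearer (smaller) than old_z, else old_px.
   The comparison result (0 or 1) is widened into a mask of all
   zeros or all ones, so the selection itself needs no branch. */
uint32_t select_nearer(uint32_t new_px, uint32_t old_px,
                       uint16_t new_z, uint16_t old_z) {
    uint32_t mask = (uint32_t)0 - (uint32_t)(new_z < old_z);
    return (new_px & mask) | (old_px & ~mask);
}
```

The pixel is always computed and always written, which trades a little extra work for a predictable instruction stream.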

3D rasterisation is data-bandwidth intensive and performance problems often occur due to an inability to move data between main memory, CPU caches, CPU registers, graphics memory and buses. As shown in Figure 4, peak bandwidths vary widely in the various pieces of the system.

Most PC systems have two caches, one on the CPU chip called "L1" (Level One) and a larger external cache called "L2" (Level Two). Software optimised to utilise the caches can have 2x to 10x higher performance, as the caches run 2x to 10x as fast as main memory.

Application SW cannot explicitly control the caches. It cannot command them to hold a specific data item as the hardware attempts to keep the most useful data cache-resident.

For best performance, all accesses should be to cached data in the order that the HW brings it into the cache. Accesses should be on the natural boundaries of the cache HW (quadwords and 32-byte cache lines). Applications should try to re-use data as much as possible once it is present in the cache. Unfortunately, SW cannot explicitly test whether particular data items are cached, and testing would cut the performance anyway. However, SW sequences can be carefully crafted to get data efficiently into and out of the caches, using techniques such as the following.

Pre-fetching data several clocks before the instructions that actually use it. By putting one MOV reg, memory instruction in a loop and incrementing the memory address by 32 every time, entire 32-byte cache lines will be loaded into the cache. Then subsequent uses of that data within the loop or in the next code sequence get cache hits.

Of course, this technique of data "pre-touching" works well only if the algorithm can find something useful for the CPU to do during the prefetching time, which does not involve writes to memory, such as shifts or multiplies. One experiment showed that a simple loop of loads and stores (using MOVQ) got only 58MB/s in a naïve implementation. With pre-touching, it increased to 96MB/s.
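The pre-touching loop can be sketched in portable C as below. This illustrates the access pattern only, under several assumptions: the function names are hypothetical, the 32-byte line size is the one quoted in this article, and the 4KB block size is an arbitrary choice. Any speedup depends on the processor and memory system and is not guaranteed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 32   /* line size quoted in this article */

/* Read one byte from every cache line of a buffer so that the whole
   line is loaded. The volatile pointer stops the compiler from
   removing the otherwise unused loads. */
uint32_t pretouch(const void *buf, size_t bytes) {
    const volatile uint8_t *p = buf;
    uint32_t sink = 0;
    assert(buf != NULL);
    for (size_t off = 0; off < bytes; off += CACHE_LINE)
        sink += p[off];
    return sink;
}

/* Copy quadwords block by block, touching each source block first. */
void copy_pretouched(uint64_t *dst, const uint64_t *src, size_t qwords) {
    const size_t block = 512;   /* 512 quadwords = 4KB (assumed size) */
    for (size_t base = 0; base < qwords; base += block) {
        size_t n = qwords - base < block ? qwords - base : block;
        (void)pretouch(src + base, n * sizeof *src);
        for (size_t i = 0; i < n; i++)
            dst[base + i] = src[base + i];
    }
}
```

In a real inner loop the touch of the next block would be interleaved with arithmetic on the current one, so that the CPU has useful work to do while the line-fills complete.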

Align large data structures on 4K byte boundaries. Such alignment increases the likelihood of DRAM "page hits", yielding faster main memory loads into the caches.

Access 8 bytes per instruction, or at least 4 bytes. Avoid single-byte or word moves. Access data sequentially, at increasing and consecutive addresses. That way, when a cache line-fill occurs, the hardware can probably continue satisfying fetch requests from the temporary line-fill buffer, rather than forcing the program to wait until the cache itself has been updated.

Arrange data to fit within cache lines. For example, it may be beneficial to rearrange texture maps in rectangular blocks or strips or tiles so that each cache line contains a rectangular area of the texture. If done carefully, the overhead in the texture mapping instruction loop will be small and cache locality of texturing will improve.
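One way to arrange a texture in tiles is sketched below. It assumes 8-bit texels and a tile of 8 x 4 texels, so that each tile occupies exactly one 32-byte cache line. The tile shape and the function name are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>

#define TILE_W 8   /* texels per tile row */
#define TILE_H 4   /* tile rows: 8 x 4 x 1 byte = one 32-byte line */

/* Byte offset of texel (u, v) in a texture stored tile by tile.
   The texture width must be a multiple of TILE_W. */
size_t tiled_offset(unsigned u, unsigned v, unsigned width) {
    assert(width % TILE_W == 0);
    unsigned tiles_per_row = width / TILE_W;
    unsigned tile = (v / TILE_H) * tiles_per_row + (u / TILE_W);
    unsigned within = (v % TILE_H) * TILE_W + (u % TILE_W);
    return (size_t)tile * (TILE_W * TILE_H) + within;
}
```

In a row-major 64-texel-wide texture, texels (0,0) and (0,3) lie 192 bytes apart and so fall in different cache lines. With the tiled layout they share one. When the tile dimensions are powers of two, the divisions and remainders reduce to shifts and masks.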

Access data on aligned boundaries. For example, MOVQ should always address a location that is a multiple of 8 (0, 8, 16, 24 and so on). If the code makes an 8-byte access to an address that is only 4-byte aligned, the access takes at least three extra clocks (the cache is accessed twice), and possibly more if it crosses a cache line boundary, as an 8-byte access to address 28 does at address 32.
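Both conditions in the alignment rule can be checked with simple address arithmetic. The helpers below are a sketch with hypothetical names, using the 32-byte cache line quoted in this article:

```c
#include <assert.h>
#include <stdint.h>

#define CACHE_LINE 32

/* True if an access of the given size starts on a multiple of that size. */
int is_aligned(uintptr_t addr, unsigned size) {
    assert(size > 0);
    return addr % size == 0;
}

/* True if the first and last bytes of the access lie in different lines. */
int crosses_cache_line(uintptr_t addr, unsigned size) {
    assert(size > 0);
    return addr / CACHE_LINE != (addr + size - 1) / CACHE_LINE;
}
```

An 8-byte access at address 24 is aligned and stays inside one line. At address 20 it is misaligned but still inside one line. At address 28 it is misaligned and also spans the boundary at address 32.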

Even for non-cached memory locations, such as graphics card RAM or uncacheable main-memory regions, SW optimization of accesses can offer speedup.

Avoid "ping-pong" accesses to two (or more) memory arrays, either reads or writes. Most DRAMs run a lot faster when "page hits" occur, meaning successive reads or writes to the same general area. So, if two arrays must be accessed, group the accesses into multiple reads to one array followed by multiple reads to the other. Ping-ponging also hampers the ability of the P6-family internal CPU write buffers to burst streams of writes to memory efficiently.

Access data sequentially, at increasing and consecutive addresses (again, to ensure DRAM page hits).

Access data aligned to its elemental size, usually 8 bytes. Regroup data types in individual arrays rather than in heterogeneous structures, or vice versa. The programmer should do whatever gives the most locality and sequentiality of access.
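The choice between heterogeneous structures and individual arrays can be illustrated with two layouts of the same vertex data. The type and function names, and the array size of 256, are assumptions made for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_VERTS 256

/* Heterogeneous structure: all fields of one vertex sit together. */
typedef struct {
    float x, y, z;
    uint8_t r, g, b, a;
} VertexAoS;

/* Individual arrays: all values of one field sit together. */
typedef struct {
    float x[MAX_VERTS], y[MAX_VERTS], z[MAX_VERTS];
    uint8_t r[MAX_VERTS], g[MAX_VERTS], b[MAX_VERTS], a[MAX_VERTS];
} VerticesSoA;

/* Reads only z, but steps over a whole record per vertex. */
float sum_z_aos(const VertexAoS *v, size_t n) {
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++)
        sum += v[i].z;
    return sum;
}

/* Reads z values from consecutive addresses. */
float sum_z_soa(const VerticesSoA *v, size_t n) {
    float sum = 0.0f;
    assert(n <= MAX_VERTS);
    for (size_t i = 0; i < n; i++)
        sum += v->z[i];
    return sum;
}
```

A loop that needs only one field, such as a depth pass, reads consecutive addresses with the individual arrays. A loop that needs every field of each vertex may be better served by the structure, which is why the advice above cuts both ways.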

Depending on how carefully the data is arranged and accessed, there can be huge variations in memory and graphics bandwidth. The measurements are sustained rates for very small code loops on a Pentium processor 166MHz system with:

Windows 95

32MB EDO DRAM main memory

The 82430HX PCIset

PBSRAM (Pipelined Burst Static RAM) 256KB L2 cache

An S3 Virge* graphics chip with 2MB EDO graphics RAM

Performance benefits from MMX technology in 3D rendering vary widely. Naive implementations may yield no speedup at all. Intel's VTune performance-tuning tool can help significantly in analysing MMX technology code. Generally, speedup requires:

Inner loops which do lots of computation for every memory access. For example, a Gouraud-shaded, specular-highlighted, fog-blended renderer should benefit. But if the same loop added Z-buffering and destination-alpha blending, it would require at least three more memory accesses and would probably slow down significantly.

Creating 3D applications and content for MMX technology is very much like creating for 3D hardware accelerators, as both use 16-bit or 24-bit RGB formats. In some cases, software rendering runs faster, and certainly more flexibly, than current 3D hardware. In other cases, it makes sense to combine foreground objects and special effects, rendered in software, with hardware-rendered backgrounds. Hardware imposes certain overheads in setting up registers and in synchronisation but, of course, has higher bandwidth to graphics memory. So MMX technology does not make 3D hardware unnecessary, nor does HW make MMX technology unnecessary.

Compiled by Ajith Ram

(c) 1999 Intel Corporation

