Overview of the MPEG-4 Standard

MPEG-4 is an ISO/IEC standard developed by MPEG (Moving Picture Experts Group), the committee that also developed the Emmy Award-winning standards known as MPEG-1 and MPEG-2. MPEG-4 is the result of another international effort involving hundreds of researchers and engineers worldwide. MPEG-4, whose formal designation is ISO/IEC 14496, was finalised in October 1998 and will become an International Standard in 1999. Currently, MPEG is working on fully backward-compatible extensions under the title of MPEG-4 Version 2.

MPEG-4 builds on the proven success of three fields:
- Digital television
- Interactive graphics applications (synthetic content)
- Interactive multimedia (World Wide Web, distribution of and access to content)

MPEG-4 provides the standardised technological elements enabling the integration of the production, distribution and content access paradigms of the three fields. This document provides an overview of the MPEG-4 standard, explaining which pieces of technology it includes and what sorts of applications this technology supports.

Scope and features of the MPEG-4 Standard

The MPEG-4 standard provides a set of technologies to satisfy the needs of authors, service providers and end users alike.

For authors, MPEG-4 enables the production of content that is far more reusable and flexible than is possible today with individual technologies such as digital television, animated graphics, World Wide Web pages and their extensions. It also makes it possible to better manage and protect content owner rights.

For network service providers, MPEG-4 offers transparent information that can be interpreted and translated into the appropriate native signalling messages of each network with the help of relevant standards bodies. The foregoing, however, excludes Quality of Service (QoS) considerations, for which MPEG-4 provides a generic QoS descriptor for different MPEG-4 media.
The exact translations from the QoS parameters set for each medium to the network QoS are beyond the scope of MPEG-4 and are left to network providers. Signalling of the MPEG-4 media QoS descriptors end-to-end enables transport optimisation in heterogeneous networks.

For end users, MPEG-4 brings higher levels of interaction with content, within the limits set by the author. It also brings multimedia to new networks, including relatively low bit-rate and mobile ones. An MPEG-4 application document on the MPEG home page describes many end-user applications, including interactive multimedia broadcast and mobile communications.

For all parties involved, MPEG seeks to avoid a multitude of proprietary, non-interworking formats and players.

MPEG-4 achieves these goals by providing standardised ways to:
- Represent units of aural, visual or audio-visual content called "media objects"
- Describe the composition of these objects to create compound media objects that form audio-visual scenes
- Multiplex and synchronise the data associated with media objects so that they can be transported over network channels providing a QoS appropriate for the nature of the specific media objects
- Interact with the audio-visual scene generated at the receiver's end

Coded representation of media objects

MPEG-4 audio-visual scenes are composed of several media objects organised in a hierarchical fashion. At the leaves of the hierarchy we find primitive media objects such as:
- Still images (e.g. as a fixed background)
- Video objects (e.g. a talking person, without the background)
- Audio objects (e.g. the voice associated with that person)

MPEG-4 standardises a number of such primitive media objects, capable of representing both natural and synthetic content types, which can be either two- or three-dimensional.
In addition to the media objects mentioned above, MPEG-4 defines the coded representation of objects such as:
- Text and graphics
- Talking synthetic heads and associated text used to synthesise the speech and animate the head
- Synthetic sound

A media object in its coded form consists of descriptive elements that allow the object to be handled in an audio-visual scene, together with associated streaming data if needed. It is important to note that, in its coded form, each media object can be represented independently of its surroundings or background.

The coded representation of media objects is as efficient as possible while taking into account the desired functionalities. Examples of such functionalities are error robustness, easy extraction and editing of an object, or having an object available in a scalable form.

Composition of media objects

An audio-visual scene in MPEG-4 is described as being composed of individual objects. The figure contains compound media objects that group primitive media objects together. Primitive media objects correspond to leaves in the descriptive tree, while compound media objects encompass entire sub-trees. For example, the visual object corresponding to the talking person and the corresponding voice are tied together to form a new compound media object, containing both the aural and visual components of that talking person.

Such grouping allows authors to construct complex scenes and enables consumers to manipulate meaningful (sets of) objects.

More generally, MPEG-4 provides a standardised way to describe a scene, allowing the user to:
- Place media objects anywhere in a given co-ordinate system
- Apply transforms to change the geometrical or acoustical appearance of a media object
- Group primitive media objects in order to form compound media objects
- Apply streamed data to media objects in order to modify their attributes (e.g. a sound, a moving texture belonging to an object; animation parameters driving a synthetic face)
- Change, interactively, the user's viewing and listening points anywhere in the scene

The scene description builds on several concepts from the Virtual Reality Modelling Language (VRML), in terms of both its structure and the functionality of object composition nodes, and extends it to fully enable the aforementioned features.

Description and synchronisation of streaming data for media objects

Media objects may need streaming data that is conveyed in one or more elementary streams. An object descriptor identifies all streams associated with one media object. This allows the handling of hierarchically encoded data as well as the association of meta-information about the content (called "object content information") and the intellectual property rights associated with it.

Each stream is characterised by a set of descriptors for configuration information, e.g. to determine the required decoder resources and the precision of encoded timing information. Furthermore, the descriptors may carry hints about the Quality of Service (QoS) the stream requests for transmission (e.g. maximum bit rate, bit error rate, priority).

Synchronisation of elementary streams is achieved through time stamping of individual access units within elementary streams. The synchronisation layer manages the identification of such access units and the time stamping. Independent of the media type, this layer allows identification of the type of access unit (e.g. video or audio frames, scene description commands) in elementary streams and recovery of the media object's or scene description's time base, and it enables synchronisation among them. The syntax of this layer is configurable in a large number of ways, allowing use in a broad spectrum of systems.
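
The time-stamping idea described above can be illustrated with a minimal Python sketch that merges access units from several elementary streams into a single timeline ordered by time stamp. The `AccessUnit` class, its field names and the `merge_streams` function are hypothetical and chosen only for illustration; they are not the synchronisation layer syntax defined by MPEG-4, and the sketch assumes all streams already share one time base.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class AccessUnit:
    # Hypothetical model of a time-stamped access unit (not the normative syntax).
    # Only the time stamp takes part in ordering comparisons.
    timestamp: float                       # seconds on an assumed shared time base
    stream_id: int = field(compare=False)  # which elementary stream it belongs to
    kind: str = field(compare=False)       # e.g. "video", "audio", "scene"


def merge_streams(*streams):
    """Merge per-stream lists, each already sorted by time stamp, into one
    list ordered by time stamp across all streams."""
    return list(heapq.merge(*streams))


video = [AccessUnit(0.00, 1, "video"), AccessUnit(0.04, 1, "video")]
audio = [AccessUnit(0.00, 2, "audio"), AccessUnit(0.02, 2, "audio")]

for unit in merge_streams(video, audio):
    print(f"{unit.timestamp:.2f}s stream={unit.stream_id} {unit.kind}")
```

A real receiver would additionally recover each stream's time base from clock reference information before comparing time stamps; that step is left out of this sketch.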

Delivery of streaming data

The synchronised delivery of streaming information from source to destination, exploiting different QoS as available from the network, is specified in terms of the aforementioned synchronisation layer and a delivery layer containing a two-layer multiplexer.

The first multiplexing layer is managed according to the DMIF specification, part 6 of the MPEG-4 standard (DMIF stands for Delivery Multimedia Integration Framework). This multiplex may be embodied by the MPEG-defined FlexMux tool, which allows grouping of Elementary Streams (ESs) with a low multiplexing overhead. Multiplexing at this layer may be used, for example, to group ESs with similar QoS requirements, or to reduce the number of network connections or the end-to-end delay.

The "TransMux" (Transport Multiplexing) layer models the layer that offers transport services matching the requested QoS. Only the interface to this layer is specified by MPEG-4, while the concrete mapping of the data packets and control signalling must be done in collaboration with the bodies that have jurisdiction over the respective transport protocol. Any suitable existing transport protocol stack, such as (RTP)/UDP/IP, (AAL5)/ATM, or MPEG-2's Transport Stream over a suitable link layer, may become a specific TransMux instance. The choice is left to the end user or service provider and allows MPEG-4 to be used in a wide variety of operating environments.

Use of the FlexMux multiplexing tool is optional, and this layer may be empty if the underlying TransMux instance provides all the required functionality. The synchronisation layer, however, is always present.
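
As a rough illustration of what grouping several elementary streams into one multiplexed stream involves, the Python sketch below interleaves packets from different streams and recovers them again. The framing used here (one byte for a channel index, one byte for the payload length, then the payload) and the names `mux` and `demux` are assumptions made for this example only; they should not be read as the normative FlexMux syntax.

```python
from itertools import zip_longest


def mux(streams):
    """Interleave packets from several elementary streams into one byte string.

    `streams` maps a channel index (0-255) to a list of payloads of at most
    255 bytes each. The framing (channel byte, length byte, payload) is an
    illustrative assumption, not the normative FlexMux packet format.
    """
    out = bytearray()
    queues = [[(ch, p) for p in payloads] for ch, payloads in streams.items()]
    for round_ in zip_longest(*queues):    # take one packet per stream in turn
        for item in round_:
            if item is None:               # this stream has run out of packets
                continue
            ch, payload = item
            if not 0 <= ch <= 255 or len(payload) > 255:
                raise ValueError("channel or payload does not fit in one byte")
            out += bytes([ch, len(payload)]) + payload
    return bytes(out)


def demux(data):
    """Split a multiplexed byte string back into per-channel packet lists."""
    streams = {}
    pos = 0
    while pos < len(data):
        ch, length = data[pos], data[pos + 1]
        streams.setdefault(ch, []).append(data[pos + 2:pos + 2 + length])
        pos += 2 + length
    return streams


streams = {1: [b"video-0", b"video-1"], 2: [b"audio-0"]}
muxed = mux(streams)
print(demux(muxed) == streams)
```

The two-byte header per packet is what keeps the multiplexing overhead low in this toy framing; a single underlying connection then carries both streams.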

It is possible to:
- Identify access units, transport timestamps and clock reference information and identify data loss
- Optionally interleave data from different elementary streams into FlexMux streams
- Convey control information to:
  - Indicate the required QoS for each elementary stream and FlexMux stream
  - Translate such QoS requirements into actual network resources
  - Associate elementary streams to media objects
  - Convey the mapping of elementary streams to FlexMux and TransMux channels

Parts of the control functionalities are available only in conjunction with a transport control entity like the DMIF framework.

Interaction with media objects

In general, the user observes a scene that is composed following the design of the scene's author. Depending on the degree of freedom allowed by the author, however, the user has the possibility to interact with the scene. Operations a user may be allowed to perform include:
- Changing the viewing/listening point of the scene, e.g. by navigation through a scene
- Dragging objects in the scene to a different position
- Triggering a cascade of events by clicking on a specific object, e.g. starting or stopping a video stream
- Selecting the desired language when multiple language tracks are available

More complex kinds of behaviour can also be triggered, e.g. a virtual phone rings, the user answers and a communication link is established.

Management and identification of intellectual property

It is important to have the possibility to identify intellectual property in MPEG-4 media objects. Therefore, MPEG has worked with representatives of different creative industries in the definition of syntax and tools to support this. A full elaboration of the requirements for the identification of intellectual property can be found in "Management and Protection of Intellectual Property in MPEG-4", which is publicly available from the MPEG home page.
MPEG-4 incorporates identification of intellectual property by storing unique identifiers that are issued by international numbering systems (e.g. ISAN, the International Standard Audiovisual Number, or ISRC, the International Standard Recording Code). These numbers can be applied to identify the current rights holder of a media object. Since not all content is identified by such a number, MPEG-4 Version 1 offers the possibility to identify intellectual property by a key-value pair (e.g. "composer"/"John Smith").

For those who want to use systems that control access to intellectual property, MPEG-4 also offers a standardised interface that is tightly integrated into the Systems layer. With this interface, proprietary control systems can be easily combined with the standardised part of the decoder.

The full text of this White Paper, edited by Rob Koenen, can be found on the home page of the Moving Picture Experts Group. Compiled by Richard Pitt.