Reacting to the advent of digital television and interactive video,
the Moving Picture Experts Group has announced a new format:
MPEG-4.
Overview of the MPEG-4 Standard
MPEG-4 is an ISO/IEC standard developed by MPEG (Moving Picture
Experts Group), the committee that also developed the Emmy
Award-winning standards known as MPEG-1 and MPEG-2. MPEG-4 is the
result of another international effort involving hundreds of
researchers and engineers worldwide. MPEG-4, whose formal ISO/IEC
designation is ISO/IEC 14496, was finalised in October 1998 and
will be an International Standard in 1999. Currently, MPEG is
working on fully backward compatible extensions under the title of
MPEG-4 Version 2.

MPEG-4 builds on the proven success of three fields:

• Digital television
• Interactive graphics applications (synthetic content)
• Interactive multimedia (World Wide Web, distribution of and
  access to content)

MPEG-4 provides the
standardised technological elements enabling the integration of the
production, distribution and content access paradigms of the three
fields. This document provides an overview of the MPEG-4 standard,
explaining which pieces of technology it includes and what sorts of
applications are supported by this technology.

Scope and features of the MPEG-4 Standard
The MPEG-4 standard provides a set of
technologies to satisfy the needs of authors, service providers and
end users alike.

For authors, MPEG-4 enables the production of content that is far
more reusable and flexible than is possible today with individual
technologies such as digital television, animated graphics, World
Wide Web pages and their extensions. It also makes it possible to
better manage and protect content owners' rights.

For network service providers, MPEG-4 offers
transparent information that can be interpreted and translated into
the appropriate native signalling messages of each network with the
help of relevant standards bodies. The foregoing, however, excludes
Quality of Service (QoS) considerations, for which MPEG-4 provides a
generic QoS descriptor for different MPEG-4 media. The exact
translations from the QoS parameters set for each medium to the
network QoS are beyond the scope of MPEG-4 and are left to network
providers. Signalling of the MPEG-4 media QoS descriptors
end-to-end enables transport optimisation in heterogeneous
networks.

For end users, MPEG-4 brings higher levels of interaction
with content within the limits set by the author. It also brings
multimedia to new networks, including relatively low bit-rate and
mobile ones. An MPEG-4 application document on the MPEG home page
describes many end-user applications, including interactive
multimedia broadcast and mobile communications.

For all parties involved, MPEG seeks to avoid a
multitude of proprietary, non-interworking formats and players.
MPEG-4 achieves these goals by providing standardised ways to:

• Represent units of aural, visual or audio-visual content called
  "media objects"
• Describe the composition of these objects to create compound
  media objects that form audio-visual scenes
• Multiplex and synchronise the data associated with media objects
  so that they can be transported over network channels providing a
  QoS appropriate for the nature of the specific media objects
• Interact with the audio-visual scene generated at the receiver's
  end

Coded representation of media objects
MPEG-4 audio-visual scenes are
composed of several media objects organised in a hierarchical
fashion. At the leaves of the hierarchy we find primitive media
objects such as:

• Still images (e.g. as a fixed background)
• Video objects (e.g. a talking person, without the background)
• Audio objects (e.g. the voice associated with that person)

MPEG-4
standardises a number of such primitive media objects, capable of
representing both natural and synthetic content types, which can be
either two or three dimensional. In addition to the media objects
mentioned above, MPEG-4 defines the coded representation of objects
such as:

• Text and graphics
• Talking synthetic heads and associated text used to synthesise
  the speech and animate the head
• Synthetic sound

A media object in its coded form consists of
descriptive elements that allow handling the object in an
audio-visual scene as well as of associated streaming data, if
needed. It is important to note that in its coded form, each media
object can be represented independent of its surroundings or
background. The coded representation of media objects is as
efficient as possible while taking into account the desired
functionalities. Examples of such functionalities are error
robustness, easy extraction and editing of an object or having an
object available in a scaleable form.

Composition of media objects
An audio-visual scene in MPEG-4 is described as being
composed of individual objects. The figure contains compound media
objects that group primitive media objects together. Primitive
media objects correspond to leaves in the descriptive tree while
compound media objects encompass entire sub-trees. For example, the
visual object corresponding to the talking person and the
corresponding voice are tied together to form a new compound media
object, containing both the aural and visual components of that
talking person. Such grouping allows authors to construct complex
scenes and enables consumers to manipulate meaningful (sets of)
objects.

More generally, MPEG-4 provides a standardised way to describe a
scene, allowing the user to:

• Place media objects anywhere in a given co-ordinate system
• Apply transforms to change the geometrical or acoustical
  appearance of a media object
• Group primitive media objects in order to form compound media
  objects
• Apply streamed data to media objects in order to modify their
  attributes (e.g. a sound, a moving texture belonging to an object;
  animation parameters driving a synthetic face)
• Change, interactively, the user's viewing and listening points
  anywhere in the scene

The scene description builds on several concepts from the
Virtual Reality Modelling Language (VRML), in terms of both its
structure and the functionality of object composition nodes, and
extends it to fully enable the aforementioned features.
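
The hierarchy described above can be illustrated with a small
sketch. It is not the MPEG-4 scene description syntax: the class
name MediaObject, the offset field and the place function are
hypothetical, and a two-dimensional translation is assumed to stand
in for the general transforms the standard allows. It shows only
how a compound object groups primitive objects, so that moving the
compound object moves everything beneath it.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple

Point = Tuple[float, float]


@dataclass
class MediaObject:
    """A node in the scene tree (hypothetical name, not an MPEG-4 type)."""
    name: str
    offset: Point = (0.0, 0.0)  # translation relative to the parent node
    children: List["MediaObject"] = field(default_factory=list)

    @property
    def is_primitive(self) -> bool:
        # Primitive media objects are the leaves of the tree.
        return not self.children


def place(node: MediaObject, origin: Point = (0.0, 0.0)) -> Iterator[Tuple[str, Point]]:
    """Yield (name, absolute position) for every primitive object."""
    x = origin[0] + node.offset[0]
    y = origin[1] + node.offset[1]
    if node.is_primitive:
        yield node.name, (x, y)
    for child in node.children:
        yield from place(child, (x, y))


# The talking person is a compound object: video plus the matching voice.
person = MediaObject("person", offset=(10.0, 5.0), children=[
    MediaObject("video"),
    MediaObject("voice", offset=(0.0, 1.0)),
])
scene = MediaObject("scene", children=[MediaObject("background"), person])

positions = dict(place(scene))
```

Changing person.offset and calling place(scene) again repositions
both the video and the voice, which is the practical benefit of
grouping them into one compound object.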

Description and synchronisation of streaming data for media objects
Media objects may need streaming data that is conveyed
in one or more elementary streams. An object descriptor identifies
all streams associated with one media object. This allows handling
hierarchically encoded data as well as the association of
meta-information about the content (called 'object content
information') and the intellectual property rights associated with
it. Each stream is characterised by a set of descriptors for
configuration information, e.g., to determine the required decoder
resources and the precision of encoded timing information.
Furthermore, the descriptors may carry hints about the Quality of
Service (QoS) the stream requests for transmission (e.g. maximum bit
rate, bit error rate, priority, etc.).

Synchronisation of elementary
streams is achieved through time stamping of individual access
units within elementary streams. The synchronisation layer manages
the identification of such access units and the time stamping.
Independent of the media type, this layer allows identification of
the type of access unit (e.g. video or audio frames, scene
description commands) in elementary streams and recovery of the
media object's or scene description's time base, and it enables
synchronisation among them. The syntax of this layer is
configurable in a large number of ways allowing use in a broad
spectrum of systems.
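
As a rough illustration of synchronisation through time stamping,
the sketch below merges access units from two elementary streams
into a single timestamp-ordered sequence. It does not reproduce the
synchronisation layer's packet syntax. The AccessUnit class, its
field names and the millisecond time base are assumptions made for
the example, and both streams are assumed to share one time base
already.

```python
import heapq
from dataclasses import dataclass
from typing import Iterable, Iterator, List


@dataclass(frozen=True)
class AccessUnit:
    """One time-stamped unit of an elementary stream (hypothetical type)."""
    timestamp_ms: int  # time stamp on a shared time base (assumed)
    stream_id: int     # elementary stream the unit belongs to
    kind: str          # e.g. "video", "audio" or "scene"


def presentation_order(streams: Iterable[List[AccessUnit]]) -> Iterator[AccessUnit]:
    """Merge streams that are each sorted by time stamp into one sequence."""
    return heapq.merge(*streams, key=lambda au: au.timestamp_ms)


video = [AccessUnit(t, 1, "video") for t in (0, 40, 80)]
audio = [AccessUnit(t, 2, "audio") for t in (0, 24, 48)]

ordered = [(au.timestamp_ms, au.kind) for au in presentation_order([video, audio])]
```

Because each access unit carries its own time stamp, the receiver
can interleave units correctly regardless of the order in which the
streams arrive.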

Delivery of streaming data
The
synchronised delivery of streaming information from source to
destination, exploiting different QoS as available from the
network, is specified in terms of the aforementioned
synchronisation layer and a delivery layer containing a two-layer
multiplexer.

The first multiplexing layer is managed according to
the DMIF specification, part 6 of the MPEG-4 standard (DMIF stands
for Delivery Multimedia Integration Framework). This multiplex may
be embodied by the MPEG-defined FlexMux tool that allows grouping
of Elementary Streams (ESs) with a low multiplexing overhead.
Multiplexing at this layer may be used, for example, to group ESs
with similar QoS requirements or to reduce the number of network
connections or the end-to-end delay.

The "TransMux" (Transport Multiplexing)
layer models the layer that offers transport services matching the
requested QoS. Only the interface to this layer is specified by
MPEG-4 while the concrete mapping of the data packets and control
signalling must be done in collaboration with the bodies that have
jurisdiction over the respective transport protocol. Any suitable
existing transport protocol stack such as (RTP)/UDP/IP, (AAL5)/ATM,
or MPEG-2's Transport Stream over a suitable link layer, may become
a specific TransMux instance. The choice is left to the end
user/service provider and allows MPEG-4 to be used in a wide
variety of operation environments. Use of the FlexMux multiplexing
tool is optional and this layer may be empty if the underlying
TransMux instance provides all the required functionality. The
synchronisation layer, however, is always present.

It is possible to:

• Identify access units, transport timestamps and clock reference
  information and identify data loss
• Optionally interleave data from different elementary streams into
  FlexMux streams
• Convey control information to:
    • Indicate the required QoS for each elementary stream and
      FlexMux stream
    • Translate such QoS requirements into actual network resources
    • Associate elementary streams with media objects
    • Convey the mapping of elementary streams to FlexMux and
      TransMux channels

Parts of the control functionalities are available only in
conjunction with a transport control entity like the DMIF framework.
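
The idea of grouping elementary streams with similar QoS
requirements can be sketched as follows. This is an illustration
only and not the FlexMux bitstream format: the ElementaryStream
class, the integer priority used as the sole QoS criterion and the
group_by_qos function are all hypothetical. The sketch shows how
grouping reduces the number of channels that must be opened on the
underlying network.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class ElementaryStream:
    """An elementary stream with simplified QoS hints (hypothetical type)."""
    es_id: int
    max_bitrate_kbps: int
    priority: int  # assumed QoS hint; streams with equal priority share a channel


def group_by_qos(streams: Iterable[ElementaryStream]) -> Dict[int, List[int]]:
    """Map each priority level to the IDs of the streams multiplexed together."""
    channels: Dict[int, List[int]] = defaultdict(list)
    for es in streams:
        channels[es.priority].append(es.es_id)
    return dict(channels)


streams = [
    ElementaryStream(1, 384, priority=2),  # e.g. a video object
    ElementaryStream(2, 64, priority=2),   # e.g. its audio
    ElementaryStream(3, 8, priority=1),    # e.g. scene description
    ElementaryStream(4, 16, priority=2),   # e.g. a second audio track
]

channels = group_by_qos(streams)
```

Here four elementary streams are carried in two multiplexed
channels instead of four separate network connections.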

Interaction with media objects
In
general, the user observes a scene that is composed following the
design of the scene's author. Depending on the degree of freedom
allowed by the author, however, the user has the possibility to
interact with the scene. Operations a user may be allowed to
perform include:

• Changing the viewing/listening point of the scene, e.g. by
  navigation through a scene
• Dragging objects in the scene to a different position
• Triggering a cascade of events by clicking on a specific object,
  e.g. starting or stopping a video stream
• Selecting the desired language when multiple language tracks are
  available

More complex kinds of behaviour can also be
triggered, e.g. a virtual phone rings, the user answers and a
communication link is established.

Management and identification of intellectual property
It is important to have the possibility
to identify intellectual property in MPEG-4 media objects.
Therefore, MPEG has worked with representatives of different
creative industries in the definition of syntax and tools to
support this. A full elaboration of the requirements for the
identification of intellectual property can be found in "Management
and Protection of Intellectual Property in MPEG-4" which is
publicly available from the MPEG home page. MPEG-4 incorporates
identification of the intellectual property by storing unique
identifiers that are issued by international numbering systems
(e.g. ISAN, ISRC, etc. [ISAN: International Audio-Visual Number,
ISRC: International Standard Recording Code]). These numbers can
be applied to identify a current rights holder of a media object.
Since not all content is identified by such a number, MPEG-4
Version 1 offers the possibility to identify intellectual property
by a key-value pair (e.g. "composer"/"John Smith"). MPEG-4 also
offers a standardised interface, tightly integrated into the
Systems layer, for those who want to use systems that control
access to intellectual property. With this interface, proprietary
control systems can be easily amalgamated with the standardised
part of the decoder.

The full text of this White Paper, edited by Rob Koenen, can be
found on the home page of the Moving Picture Experts Group.
Compiled by Richard Pitt.