Pixvana Open Projection Format

XR Lab

A new open file specification for 360/VR video streaming.

Streaming great-looking VR video experiences is a big challenge due to the large bandwidth requirements. While many apps and services such as YouTube let you stream 360 videos to VR headsets, the results are often soft and full of artifacts.

Our field of view adaptive streaming (FOVAS) technique solves many of these challenges. FOVAS maximizes quality while cutting bandwidth by breaking each frame into individual tiled views. Each tile delivers full resolution where you are looking while decimating the pixels outside your view. As you turn your head, we switch to the tiled stream that matches the new head angle. The projections, tiles, and angles are defined and indexed in a simple descriptor file.
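
To make the switching concrete, here is a minimal sketch in Python (not Pixvana's player code; the tile names and angles are illustrative) of picking the tiled stream closest to the viewer's current head angle:

# Hypothetical tile table: each tiled stream is centered at a yaw angle in degrees.
TILES = {"tile.00": 45, "tile.01": 105, "tile.02": 165, "tile.03": 225}

def angular_distance(a, b):
    # Smallest absolute difference between two angles, in degrees.
    d = abs(a - b) % 360
    return min(d, 360 - d)

def select_tile(head_yaw):
    # Choose the tile whose center is nearest the current head yaw.
    return min(TILES, key=lambda t: angular_distance(TILES[t], head_yaw))

print(select_tile(100))  # -> tile.01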

Today, we are opening up our descriptor file, the Open Projection Format (OPF), for adaptive streaming of VR video. We will also show how this file format works hand in hand with standard formats like MPEG-DASH and H.264-encoded video to create a 360 streaming system that can be deployed using existing standards and tools.

Process Chain

Creating a FOVAS stream requires a multi-step process. Our system operates on AWS using S3 storage and GPU instances, but the same process could run on a single desktop (slowly) or on any cloud infrastructure that supports GPU instances. The process:

  • Retrieves a master file in cube map or equirectangular format
  • Splits the media file into small segments, typically 30-60 frames, for parallel processing
  • Renders a projection for each desired media segment
  • Gathers the projected segments and sends them to an H.264 video encoder
  • Encodes the projected media as an H.264 video stream with short GOPs (a sketch of this encode step follows the list)

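As a rough sketch of that encode step: the projection render is our custom GPU stage and is not shown here, but the short-GOP H.264 encode can be driven with standard FFmpeg options (file names are hypothetical):

import subprocess

SEGMENT_FRAMES = 30  # typical segment length, per the list above

def encode_segment(projected_input, output_mp4):
    # Encode one already-projected segment as H.264 with short, fixed GOPs
    # so the player can switch tiled streams at every segment boundary.
    subprocess.run([
        "ffmpeg", "-y",
        "-i", projected_input,
        "-c:v", "libx264",
        "-g", str(SEGMENT_FRAMES),        # GOP length matches the segment length
        "-keyint_min", str(SEGMENT_FRAMES),
        "-sc_threshold", "0",             # no extra keyframes at scene cuts
        "-an",                            # audio travels in its own stream
        output_mp4,
    ], check=True)
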
This process happens in parallel for each output projection, and then again for each adaptive quality level. For a spherical or cubic projection, there is one output at three, four, or five quality levels. For an offset cube, frustum, or pyramid, you could have a few dozen or hundreds of streams that describe the full adaptation set. The final step of the process is to build an index of the video streams into a single MPEG-DASH Media Presentation Description (MPD) file.
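
To get a feel for the fan-out, here is a short sketch of the job list for a frustum projection (the tile count and bitrates are illustrative, not prescribed by the format):

from itertools import product

tiles = ["tile.%02d" % i for i in range(32)]   # e.g. 32 view angles
quality_levels = [1000, 2500, 5000, 10000]     # bitrates in kbps

# Each (tile, quality) pair is one independent encode job and one
# stream indexed by the final MPD.
jobs = list(product(tiles, quality_levels))
print(len(jobs))  # 128 streams for this configuration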

Our processing system delivers high-speed parallel encoding of segments and streams to efficiently output projected and encoded .mp4 files. The final output consists of three elements: a set of .mp4 encoded audio/video streams, an MPD file describing the adaptation for tiles and quality, and an Open Projection Format (OPF) file to map the streams to a head location and the projection. The combination of descriptors and video streams means that the switching logic on the playback side is trivial.

Benefits

Our approach to view adaptation and tiling is one that others have discussed but not opened up to the community at large. Our guiding principle was to make a system that could be deployed across mobile and desktop VR headsets with high performance and low latency. That means:

  • Wring as much quality as possible out of today’s codecs; we settled on H.264 because it is a widely optimized delivery codec that works everywhere.
  • Employ off-the-shelf encoding libraries such as FFMPEG and MP4Box to drive the encoding process and validate files.
  • Adhere 100% to the current MPEG-DASH standard for deployment so that network operators and CDNs can easily manage the files.
  • Be tolerant of network conditions where the latency might be as high as several hundred milliseconds.

Other researchers and companies are exploring new approaches built around the newer HEVC (H.265) standard, dividing the encoded streams into multiple spatial tiles or slices. Unfortunately, the licensing costs for HEVC are still unknown. Our system could be augmented with these approaches and newer codecs as they become widely available.

Open Projection Format

The format itself is simple and extensible. We employed JSON as our descriptor format since it is trivial to parse and easy to read. The object notation is broken down into elements:

  • Projection format: The type of 3D shape and the attributes describing the shape and central head position. Projections include spherical, cubic, frustum, pyramid, icosahedron, and others.
  • Adaptation index: Matches each tile to a head angle and to its corresponding stream in the MPEG-DASH manifest
  • Stereo format: A boolean to specify stereo or mono encoding, plus ancillary data describing the spatial mapping for the stereo data (top-bottom, left-right, differential)

Here is an example of OPF-formatted data:
{
  "url" : "http://host.com/manifest.mpd",
  "format" : "frustum",
  "formatInfo" : {
    "tiles" : [
      {
        "id" : "tile.00",
        "yawDegrees" : 45,
        "pitchDegrees" : 90
      },
      {
        "id" : "tile.01",
        "yawDegrees" : 105,
        "pitchDegrees" : 90
      },
      {
        "id" : "tile.02",
        "yawDegrees" : 165,
        "pitchDegrees" : 90
      },
      {
        "id" : "tile.03",
        "yawDegrees" : 225,
        "pitchDegrees" : 90
      },
      ...
    ]
  }
}
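
On the playback side, reading the descriptor takes only a few lines. Here is a minimal sketch in Python, with field names following the example above and error handling omitted:

import json

def load_opf(text):
    # Parse an OPF descriptor into the MPD url and a tile-to-angle index.
    opf = json.loads(text)
    tiles = opf["formatInfo"]["tiles"]
    index = {t["id"]: (t["yawDegrees"], t["pitchDegrees"]) for t in tiles}
    return opf["url"], index

The returned index is exactly what the head-angle selection sketched earlier needs: the player resolves a head pose to a tile id, then switches to that tile's stream in the MPD.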


Conclusion

Pixvana engineers are working with industry partners at Valve to finalize the OPF format for public release. We hope that this format can move the VR streaming industry forward using today’s tools and encoders. We will be posting the final specification by the end of 2016, after we have incorporated partner feedback. Pixvana will provide example OPF files, MPD files, .mp4 data streams and example models that should allow others to implement our encoding system.

We welcome your contributions. Please contact us at opfspec@pixvana.com if you have ideas or feedback on this new format or would like to contribute.

XR Lab