YouTube 360 video format

Written by Paul Bourke
January 2020

See the similar, but more complex conversion of frames from the GoPro MAX 360 Action Camera.

The following explains the internal format YouTube uses to store 360 video content. As with many documents in the technology space, this may become out of date, since YouTube may choose to change the way it stores 360 videos.

When 360 video, monoscopic or stereoscopic, is uploaded to YouTube it is generally in the equirectangular format. This is the default format created by the vast majority of software provided by the camera manufacturers, and others. YouTube does not retain this format but instead remaps the footage. If you subsequently download the footage then it appears in that new remapped format. For example, here is a single frame from a downloaded 4K YouTube video.

Scene from "Elephants on the Brink", YouTube Discovery channel.

While one might be tempted to think this is two partial panoramas, it is in fact the 6 faces of the conventional cube map. The layout is slightly cunning in that it forms two strips, upper and lower half of the image. The upper strip contains faces left-front-right and the bottom strip contains faces bottom-back-top, noting that the face names can vary depending on conventions. This is essentially splitting the cube into two halves and laying each flat.
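The layout described above can be sketched as a small shell function that returns the crop rectangle of each face. This is only an illustration: the function name face_geometry and the single-letter face names (l, f, r, d, b, t) are assumptions made here, and the 3840x2048 frame size is the "4K" case.

```shell
#!/bin/sh
# face_geometry WIDTH HEIGHT FACE
# Prints the crop rectangle (WxH+X+Y) of one cube face in a 3x2 frame.
# FACE is one of l f r (upper strip) or d b t (lower strip). The letters
# are a naming convention assumed for this sketch.
face_geometry() {
    fw=$(( $1 / 3 ))   # three faces across each strip
    fh=$(( $2 / 2 ))   # two strips, upper and lower
    case "$3" in
        l) col=0; row=0 ;;
        f) col=1; row=0 ;;
        r) col=2; row=0 ;;
        d) col=0; row=1 ;;
        b) col=1; row=1 ;;
        t) col=2; row=1 ;;
        *) echo "unknown face: $3" >&2; return 1 ;;
    esac
    echo "${fw}x${fh}+$(( col * fw ))+$(( row * fh ))"
}

# List the rectangles for a "4K" frame.
for face in l f r d b t; do
    echo "$face $(face_geometry 3840 2048 "$face")"
done
```

The rectangles are in the WxH+X+Y form that ImageMagick accepts for cropping, so the same function could feed a crop step for other frame sizes.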

The discussion here is for the YouTube "4K" format; the other aspect ratios are just variations on this theme. Similarly, knowing how this works should make it straightforward to work out what is happening in the stereoscopic case.

In any pipeline to reconstruct the equirectangular image, one is likely to extract each face, rotate according to local conventions (especially the orientation of the top and bottom faces), scale to create square images, and then run the result through a cube to equirectangular converter. The faces extracted, rotated and scaled are shown below. The reader should be able to determine which face came from where.

Converting downloaded YouTube movies back to equirectangular can be readily scripted. The process might be to use ffmpeg to extract the frames, ImageMagick "convert" to extract the 6 cube faces, something like cube2sphere to turn the cube map faces into an equirectangular image, and finally ffmpeg again to rebuild the movie, reattaching the audio track. Of course the result will not be as good as the original, due to the multiple image manipulation steps, multiple encodings and the extreme compression YouTube performs.
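A minimal sketch of how such a script might be organised is given below. It is an outline under stated assumptions rather than a tested pipeline: the directory names, the frame rate argument and the helper script faces_to_equirect.sh (standing in for the per-frame face extraction and cube to equirectangular conversion) are all hypothetical. By default it only prints the commands; setting DRY_RUN=0 would execute them.

```shell
#!/bin/sh
# Print commands by default; set DRY_RUN=0 to really run them.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = 1 ]; then printf '%s\n' "$*"; else "$@"; fi
}

# youtube_to_equirect SRC DST FPS
# FPS should match the frame rate of the source movie.
youtube_to_equirect() {
    src=$1; dst=$2; fps=$3
    run mkdir -p frames equi
    # 1. Extract every frame of the downloaded movie as a numbered image.
    run ffmpeg -i "$src" frames/%06d.png
    # 2. Per frame: cut out the 6 faces and convert them to equirectangular.
    #    faces_to_equirect.sh is a hypothetical helper, not an existing tool.
    run sh faces_to_equirect.sh frames equi
    # 3. Rebuild the movie, copying the audio (if any) from the source.
    run ffmpeg -framerate "$fps" -i equi/%06d.png -i "$src" \
        -map 0:v -map "1:a?" -c:v libx264 -pix_fmt yuv420p -c:a copy "$dst"
}

youtube_to_equirect input.mp4 output.mp4 30
```

Keeping the intermediate frames in a lossless format such as PNG or TGA avoids adding a further lossy step to material that has already been heavily compressed.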

The final reconstructed equirectangular image is shown below, noting that the equi-angular version of the cube map projection is used.

For example, the ImageMagick "convert" command lines for macOS or Linux to extract the 6 cube faces from a YouTube frame might be as follows, where $1 is the name of the frame image:

# Crop each 1280x1024 face from the 3840x2048 frame, flip it, rotate where needed, then stretch to a square.
convert "$1" -crop 1280x1024+0+0       +repage -flip -resize 1280x1280\!             frame_l.tga
convert "$1" -crop 1280x1024+1280+0    +repage -flip -resize 1280x1280\!             frame_f.tga
convert "$1" -crop 1280x1024+2560+0    +repage -flip -resize 1280x1280\!             frame_r.tga
convert "$1" -crop 1280x1024+0+1024    +repage -flip -rotate -90 -resize 1280x1280\! frame_d.tga
convert "$1" -crop 1280x1024+1280+1024 +repage -flip -rotate  90 -resize 1280x1280\! frame_b.tga
convert "$1" -crop 1280x1024+2560+1024 +repage -flip -rotate -90 -resize 1280x1280\! frame_t.tga

Notes

Why create a format based on cube maps where the horizontal and vertical resolution isn't the same? In the 4K case, the frame size is 3840x2048, so each cube face is 1280x1024.
What is so special about an aspect ratio of 1.875 (3840x2048), which is the likely explanation for the uneven horizontal and vertical resolution? For square faces, each being 1280 pixels, the frame size would be 3840x2560. Or for a 1K square cube map the overall frame size would be 3072x2048, an aspect of 1.5. Why were these seemingly more sensible aspect ratios not chosen?
The resolution of "2K" footage is 2560x1440, an aspect ratio of 16:9. In this case there is not even an integer number of pixels for each face: 2560/3 = 853.3333... A strange choice. Fortunately it is a rather moot point: who would want 2K 360 video anyway? Even 4K with the extreme YouTube compression is questionable, doubly so for stereoscopic 360 video.
Update Sept 2024. It seems there are other sizes and aspects in use. The following are known to exist, and there may be others: 3840x1920, 1920x1080, 3840x2160, 7680x3840, 2048x1080. Again, the last one is particularly strange because 2048 is not an integer multiple of 3.
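The arithmetic behind these notes is easy to check for any frame size. The following sketch (the function name face_report is an assumption made here) assumes the same 3x2 face layout for every size, prints the face size using integer division, and flags widths that do not divide evenly by 3.

```shell
#!/bin/sh
# face_report WIDTH HEIGHT
# Prints the per-face size of a 3x2 cube map frame and notes whether the
# frame width splits evenly into three faces.
face_report() {
    w=$1; h=$2
    fw=$(( w / 3 )); fh=$(( h / 2 ))   # integer division, remainder dropped
    if [ $(( w % 3 )) -eq 0 ]; then
        note="exact"
    else
        note="width not divisible by 3"
    fi
    echo "${w}x${h}: face ${fw}x${fh} ($note)"
}

# The frame sizes mentioned in the notes above.
for size in 3840x2048 2560x1440 3840x1920 1920x1080 3840x2160 7680x3840 2048x1080; do
    face_report "${size%x*}" "${size#*x}"
done
```

Of the sizes listed, only 2560x1440 and 2048x1080 are flagged, which matches the two cases singled out as strange.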

Processing video using ffmpeg

An ffmpeg command line, based on one contributed by Rodrigo Polo, is given below. It uses the versatile "v360" filter to convert the frames from a YouTube 360 video directly to an equirectangular movie. Of course other ffmpeg switches can be used at the same time to suit particular needs; for example, replace the third line with the desired compression options, time subset, and so on.

   ffmpeg -y -hide_banner -i sourcemoviename \
      -vf "v360=c3x2:e:cubic:in_forder='lfrdbu':in_frot='000313',scale=3840:1920,setsar=1:1" \
      -pix_fmt yuv420p -c:v libx264 -crf 18 \
      -c:a copy destinationmoviename

With regard to the various options: "in_forder" specifies the order in which the faces appear, each selected from left, right, up, down, front, back. "in_frot" specifies the rotation, in multiples of 90 degrees, to be applied to each cube face. "scale" sets the dimensions of the destination movie. In other words, this is an extremely general converter and not just for the particular layout YouTube uses. With regard to the different movie sizes and aspect ratios discussed above, it appears that this ffmpeg converter deals with them correctly, presumably calculating the cube face size from the overall dimensions but also treating each cube face in normalised coordinates such that the aspect doesn't matter.
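Since only the face order, face rotation and output size vary, the filter string from the command above can be generated rather than typed. The sketch below is one possible way to do that. The function name v360_filter is an assumption made here, and the output height is taken to be half the width, as for any full equirectangular frame.

```shell
#!/bin/sh
# v360_filter OUT_WIDTH [FORDER] [FROT]
# Prints a -vf argument for ffmpeg. FORDER and FROT default to the values
# used for the YouTube layout in the command above.
v360_filter() {
    width=$1
    forder=${2:-lfrdbu}
    frot=${3:-000313}
    height=$(( width / 2 ))   # equirectangular frames are 2:1
    echo "v360=c3x2:e:cubic:in_forder='${forder}':in_frot='${frot}',scale=${width}:${height},setsar=1:1"
}

v360_filter 3840
```

It might then be used as ffmpeg -i sourcemoviename -vf "$(v360_filter 3840)" followed by the encoding options and destination name, as in the command above.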