GPU Accelerated Compositing in Chrome

The compositor is implemented on top of the GL ES 2.0 client library which proxies the graphics calls to the GPU process (using the method explained above). When a page renders via the compositor, all of its pixels are drawn (remember, drawing != painting) directly into the window’s backbuffer via the GPU process.
The compositor’s architecture has evolved over time: initially it lived in the Renderer’s main thread, was then moved to its own thread (the so-called compositor thread), then took on additional responsibility orchestrating when paints occur (so-called impl-side painting). This document will focus on the latest version; see the GPU architecture roadmap for where older versions may still be in use.
In theory, the threaded compositor is fundamentally tasked with taking enough information from the main thread to produce frames independently in response to future user input, even if the main thread is busy and can’t be asked for additional data. In practice this currently means it makes a copy of the cc layer tree and SkPicture recordings for layer regions within in an area around the viewport’s current location.
Recording: Painting from Blink’s Perspective
The interest area is the region around the viewport for which SkPictures are recorded. When the DOM has changed, e.g. because the style of some elements are now different from the previous main-thread frame and have been invalidated, Blink paints the regions of invalidated layers within the interest area into an SkPicture-backed GraphicsContext. This doesn’t actually produce new pixels, rather it produces a display list of the Skia commands necessary to produce those new pixels. This display list will be used later to generate new pixels at the compositor’s discretion.
The Commit: Handoff to the Compositor Thread
The threaded compositor’s key property is operation on a copy of main thread state so it can produce frames without needing to ask the main thread for anything. There are accordingly two sides to the threaded compositor: the main-thread side, and the (poorly named) “impl” side, which is the compositor thread’s half. The main thread has a LayerTreeHost, which is its copy of the layer tree, and the impl thread has a LayerTreeHostImpl, which is its copy of the layer tree. Similar naming conventions are followed throughout.
Conceptually these two layer trees are completely separate, and the compositor (impl) thread’s copy can be used to produce frames without any interaction with the main thread. This means the main thread can be busy running JavaScript and the compositor can still be redraw previously-committed content on the GPU without interruption.
In order to produce interesting new frames, the compositor thread needs to know how it should modify its state (e.g. updating layer transforms in response to an event like a scroll). Hence, some input events (like scrolls) are forwarded from the Browser process to the compositor first and from there to the Renderer main thread. With input and output under its control, the threaded compositor can guarantee visual responsiveness to user input. In addition to scrolls, the compositor can perform any other page updates that don’t require asking Blink to repaint anything. Thus far CSS animations and CSS filters are the only other major compositor-driven page updates.
The two layer trees are kept synchronized by a series of messages known as the commit, mediated by the compositor’s scheduler (in cc/trees/thread_proxy.cc). The commit transfers the main thread’s state of the world to the compositor thread (including an updated layer tree, any new SkPicture recordings, etc), blocking the main thread so this synchronization can occur. It’s the final step during which the main thread is involved in production of a particular frame.
Running the compositor in its own thread allows the compositor’s copy of the layer tree to update the layer transform hierarchy without involving the main thread, but the main thread eventually needs e.g. the scroll offset information as well (so JavaScript can know where the viewport scrolled to, for example). Hence the commit is also responsible for applying any compositor-thread layer tree updates to the main thread’s tree and a few other tasks.
As an interesting aside, this architecture is the reason JavaScript touch event handlers prevent composited scrolls but scroll event handlers do not. JavaScript can call preventDefault() on a touch event, but not on a scroll event. So the compositor cannot scroll the page without first asking JavaScript (which runs on the main thread) if it would like to cancel an incoming touch event. Scroll events, on the other hand, can’t be prevented and are asynchronously delivered to JavaScript; hence the compositor thread can begin scrolling immediately regardless of whether or not the main thread processes the scroll event immediately.
Tree Activation
When the compositor thread gets a new layer tree from the main thread it examines the new tree to see what areas are invalid and re-rasterizes those layers. During this time the active tree remains the old layer tree the compositor thread had previously, and the pending tree is the new layer tree whose content is being rasterized.
To maintain consistency of the displayed content, the pending tree is only activated when its visible (i.e. within the viewport) high-resolution content is fully rasterized. Swapping from the current active tree to a now-ready pending tree is called activation. The net effect of waiting for the rastered content to be ready means the user can usually see at least some content, but that content might be stale. If no content is available Chrome displays a blank or checkerboard pattern with a GL shader instead.
It’s important to note that it’s possible to scroll past the rastered area of even the active tree, since Chrome only records SkPictures for layer regions within the interest area. If the user is scrolling toward an unrecorded area the compositor will ask the main thread to record and commit additional content, but if that new content can’t be recorded, committed, and rasterized to activate in time the user will scroll into a checkerboard zone.
To mitigate checkerboarding, Chrome can also quickly raster low-resolution content for the pending tree before high-resolution. Pending trees with only low-resolution content available for the viewport are activated if its better than what’s currently on screen (e.g. the outgoing active tree had no content at all rasterized for the current viewport). The tile manager (explained in the next section) decides what content to raster when.
This architecture isolates rasterization from the rest of the frame production flow. It enables a variety of technologies that improve the responsiveness of the graphics system. Image decode and resize operations are performed asynchronously, which were previously expensive main-thread operations performed during painting. The asynchronous texture upload system mentioned earlier in this document was introduced with impl-side painting as well.
Tiling
Rasterizing the entirety of every layer on the page is a waste of CPU time (for the paint operations) and memory (RAM for any software bitmaps the layer needs; VRAM for the texture storage). Instead of rasterizing the entire page, the compositor breaks up most web content layers into tiles and rasterizes layers on a per-tile basis.
Web content layer tiles are prioritized heuristically by a number of factors including the tile’s proximity to the viewport and its estimated time to being on-screen. GPU memory is then allocated to tiles based on their priority, and tiles are rastered from the SkPicture recordings to fill the available memory budget in priority order. Currently (May 2014) the specific approach to tile prioritization being reworked; see the Tile Prioritization Design Doc for more info.
Note that tiling isn’t necessary for layer types whose contents are already resident on the GPU, such as accelerated video or WebGL (for the curious, layer types are implemented in the cc/layers directory).
Rasterization: Painting from cc/Skia’s perspective
SkPicture records on the compositor thread get turned into bitmaps on the GPU in one of two ways: either painted by Skia’s software rasterizer into a bitmap and uploaded to the GPU as a texture, or painted by Skia’s OpenGL backend (Ganesh) directly into textures on the GPU.
For Ganesh-rasterized layers the SkPicture is played back with Ganesh and the resulting GL command stream gets handed to the GPU process via the command buffer. The GL command generation happens immediately when the compositor decides to rasterize any tiles, and tiles are bundled together to avoid prohibitive overhead of tiled rasterization on the GPU. See the GPU accelerated rasterization design doc for more information on the approach.
For software-rasterized layers the paint targets a bitmap in memory shared between the Renderer process and the GPU process. Bitmaps are handed to the GPU process via the resource transfer machinery described above. Because software rasterization can be very expensive, this rasterization doesn’t happen in the compositor thread itself (where it could block drawing a new frame for the active tree), but rather in a compositor raster worker thread. Multiple raster worker threads can be used to speed up software rasterization; each worker pulls from the front of the prioritized tile queue. Completed tiles are uploaded to the GPU as textures.
Texture uploads of bitmaps are a non-trivial bottleneck on memory-bandwidth-constrained platforms. This has handicapped the performance of software-rasterized layers, and continues to handicap uploads of bitmaps necessary for the hardware rasterizer (e.g. for image data or CPU-rendered masks). Chrome has had various different texture upload mechanisms in the past, but the most successful has been an asynchronous uploader that performs the upload in a worker thread in the GPU process (or an additional thread in the Browser process, in the case of Android. This prevents other operations from having to block on potentially-lengthy texture uploads.
One approach to removing the texture upload problem entirely would be to use zero-copy buffers shared between the the CPU and GPU on unified memory architecture devices exposing such primitives. Chrome does not currently use this construct, but could in the future; for more information see the GpuMemoryBuffer design doc.
Also note it’s possible to take third approach to how content is painted when using the GPU to rasterize: rasterize the content of each layer directly into the backbuffer at draw time, rather than into a texture beforehand. This has the advantage of memory savings (no intermediate texture) and some performance improvements (save a copy of the texture to the backbuffer when drawing), but has the downside of performance loss when the texture is effectively caching layer content (since now it needs to be re-painted every frame). This “direct to backbuffer” or “direct Ganesh” mode is not implemented as of May 2014, but see the GPU rasterization design doc for additional relevant considerations.
Compositing with the GPU process
Drawing on the GPU, Tiling, and Quads
Once all the textures are populated, rendering the contents of a page is simply a matter of doing a depth first traversal of the layer hierarchy and issuing a GL command to draw a texture for each layer into the frame buffer.
Drawing a layer on screen is really a matter of drawing each of its tiles. Tiles are represented as quads (simple 4-gons i.e. rectangles; see cc/quads) drawn filled with a subregion of the given layer’s content. The compositor generates quads and a set of render passes (render passes are simple data structures that hold a list of quads). The actual GL commands for drawing are generated separately from the quads (see cc/output/gl_renderer.cc). This is abstracted away from the quad implementation so it is possible to write non-GL backends for the compositor (the only significant non-GL implementation is the software compositor, covered later). Drawing the quads more or less amounts to setting up the viewport for each render pass, then setting up the transform for and drawing each quad in the render pass’s quad list.
Note that doing the traversal depth-first ensures proper z-ordering of cc layers, and the z-ordering of the potentially-multiple RenderLayers associated with that cc layer is guaranteed earlier by the ordering of the RenderObject tree traversal when the RenderObjects for a layer are painted.
Varied Scale Factors
One significant advantage of impl-side painting is that the compositor can reraster existing SkPictures at arbitrary scale factors. This comes in useful in two main contexts: pinch-to-zoom and producing low-resolution tiles during fast flings.
The compositor will intercept input events for pinch/zoom and scale the already-rastered tiles appropriately on the GPU, but it also rerasters at more suitable target resolutions while this is happening. Whenever the new tiles are ready (rastered and uploaded) they can be swapped in by activating the pending tree, improving the resolution of the pinch/zoom’d screen (even if the pinch isn’t yet complete).
When rasterizing in software the compositor also attempts to quickly produce low-resolution tiles (which are typically much cheaper to paint) and display them during a scroll if the high-res tiles aren’t yet ready. This is why some pages can look blurry during a fast scroll -- the compositor displays the low-res tiles on screen while the high-res tiles raster.