The Nvidia GTX 1080 is based on the Pascal GPU architecture that was first introduced in the high end, datacenter class GP100 GPU. Nvidia’s new Pascal GPU (GP104) is designed to deal with upcoming Direct X 12 games and Vulkan graphics. Nvidia will be placing a lot of focus on VR going forward as well.
Nvidia have been at the forefront of power efficiency for some time now, I was a huge fan of the Maxwell architecture and I still remember the first GTX750 ti I reviewed way back in February 2014.
This was the first Maxwell card we tested on launch day and it delivered solid 1080p performance, all without even requiring a single PCIe power connector – and best of all it required only 60 watts when gaming. When the higher end Maxwell cards were released later in the year, Nvidia dominated the market.
The GTX 1080 is comprised of a whopping 7.2 billion transistors, including 2560 single precision CUDA Cores. It is built on the 16nm FinFET manufacturing process which will drive higher levels of performance while increasing power efficiency – critical when operating at very high clock speeds.
Nvidia Gameworks is a key focus for Nvidia although it would be fair to say that it has been the centre point of plenty of controversy over the last year. Nvidia state that the GTX 1080 performance with GameWorks libraries enable developers to ‘readily implement more interactive and cinematic experiences.’ Nvidia claim that the Pascal architecture, 16nm FinFET manufacturing process and GDDR5X memory give the GTX1080 a total 70% lead over the previous GTX980 flagship.
Perspective Surround is a new technology. It uses SMP to render a wider field of view and the correct image perspective, across all three monitors, with a single geometry pass.
Nvidia have also developed a new Simultaneous Multi Projection technology, which for the first time allows the GPU to simultaneously map a single primitive onto up to sixteen different projections from the same viewpoint.
Each of these projections can be either mono or stereo and it will allow the GTX 1080 to accurately match the curved projection required for VR displays as well as the multiple projection angles required for surround display configurations.
Pascal graphics cards are based around different configurations of Graphics Processing Clusters (GPCs), Streaming Multiprocessors (SMs) and memory controllers. Each Streaming multiprocessor is paired up with a PolyMorph Engine that handles tessellation, vertex fetch, viewport transformation, perspective correction and vertex attribute setup. The GP104 PolyMorph Engine includes the Simultaneous Multi Projection unit discussed above. The combination of a SM plus one PolyMorph Engine is referred to as a TPC.
The GP104 powered GTX 1080 consists of four GPCs, twenty Pascal Streaming Multiprocessors and eight memory controllers. Inside the GTX 1080 each GPC ships with a dedicated raster engine and five SMs. Each of these SM’s contains 128 CUDA Cores, 256kb of register file capacity, a 96KB shared memory unit, 48kb of total L1 cache storage and eight texture units.
The SM is critical, it is a highly parallel multiprocessor which can schedule warps (groups of 32 threads) to CUDA cores and other execution units within the SM. Almost all operations flow through the SM during the rendering pipeline. With 20 SM’s listed, the GTX 1080 comprises 2560 CUDA cores and 160 texture units.
The GTX 1080 has eight 32 bit memory controllers (256 bit in total). Connected to each 32 bit memory controller are eight ROP units and 256kb of Level 2 cache. The full GP104 chip in the GTX 1080 has a total of 64 ROPs and 2,048kb of Level 2 cache.
|GPU||Geforce GTX 980
|Geforce GTX 1080
|GPU Boost Clock||1216mhz||1733mhz|
|Texture fill-rate||155.6 Gigatexels/Sec||277.3 Gigatexels/Sec|
|Memory Clock (Data Rate)||7,000 mhz||10,000 mhz|
|Memory Bandwidth||224 GB/sec||320 GB/s|
|L2 Cache size||2048 KB||2048KB|
|TDP||165 watts||180 watts|
|Transistors||5.2 billion||7.2 billion|
|Die Size||398 mm2||314 mm2|
GDDR5X is a critical performance improvement that Nvidia wanted to bring with Pascal. It is a faster and more advanced interface standard, achieving 10 Gbps transfer rates, or roughly 100 picoseconds (ps) between data bits. Putting this into context, light travels at only an inch in a 100 ps time interval; the GDDR5X IO circuit has less than half that time available to sample a bit as it arrives, or the data will be lost as the bus transitions to a new set of values.
Coping with such a tremendous speed of operation required the development of a new IO circuit architecture. It took them years to implement. Nvidia have been focusing on reducing power consumption since early 2014, and with these new circuit developments and lower 1.35V GDDR5X standards, combined with process technologies this allowed for a 43% higher operating frequency without increasing power demand.
If you were paying attention on the first page of this review, you will see that the memory interface on the GTX 1080 is 256 bit. But the earlier GTX980 ti and GTX Titan X were both 384 bit. All of these cards use a system of enhanced memory compression, utilising lossless memory compression techniques to reduce DRAM bandwidth demands. Memory compression helps in three key areas; it:
- Reduces the amount of data written out to memory.
- Reduces the amount of data transferred from memory to L2 cache, effectively providing a capacity increase for the L2 cache, as a compressed tile (block of frame buffer pixels or samples) has a smaller memory footprint than an uncompressed tile.
- Reduces the amount of data transferred between clients such as the texture unit and the frame buffer.
Of all the algorithms the GPU compression pipeline has to handle, one of the most important is delta colour compression. The GPU will calculate the differences between pixels in a block and then store that block as a set of reference pixels as well as the delta values from the reference. If the deltas are small then only a few bits per pixel are required. If the packed result of reference values and the delta values is less than half the uncompressed storage size, then delta colour compression is successful and the data is stored at 2:1 compression, which is half size.
The GTX 1080 benefits from a significantly enhanced delta colour compression system.
- 2:1 compression has been improved to be more effective than before.
- A new 4:1 delta colour compression mode is added to cover cases where the per pixel deltas are very small and are possible to pack in 25% of the original storage.
- A new 8:1 delta colour compression mode is added to combine 4:1 constant colour compression of 2×2 pixel blocks with 2:1 compression of the deltas between those blocks.
The latest games place a dramatic workload on a graphics card, often running with multiple independent or asynchronous workloads that ultimately work together to create the final rendered image. For overlapping workloads, such as when a GPU is dealing with physics and audio processing Pascal has new ‘dynamic load balancing’ support.
Maxwell architecture dealt with overlapping workloads with static partitioning of the GPU into a subset that runs graphics and a subset that runs compute. This was only efficient provided that the balance of work between the two loads was roughly matching the partitioning ratio. New hardware dynamic load balancing solves the problem by allowing either workflow to adapt to use idle resources.
Time critical workloads are another important asynchronous compute scenario. An asynchronous timewarp operation for instance must complete before scanout starts, or a frame is dropped. The GPU therefore in this scenario needs to support very fast, low latency preemption to move the less critical workload from the GPU, so the more critical workload can run as soon as possible.
A single rendering command from a game engine can possibly call for hundreds of draw calls, with each draw call asking for hundreds of triangles, with each triangle containing hundreds of pixels which each have to be shaded and rendered. Traditionally a GPU cycle that implements preemption at a high level in the graphics pipeline would have to complete all of this work before switching tasks, which could theoretically result in a very long delay.
Nvidia wanted to improve this situation so they implemented Pixel Level Preemption in the GPU architecture of Pascal. The graphics units in Pascal are superior in that they can now keep track of their intermediate progress on rendering work, so when preemption is requested they can stop at that point, save off context information to start up later, and preempt quickly.