This blog post details both the Tesla P100 accelerator and the Pascal GP100 GPU architectures.

The TESLA P100 accelerator integrates the Pascal GP100 GPU and HBM2 memory stacks in a new board design that also provides NVLink and PCIe connectivity.

The P100 includes two 400-pin high-speed connectors. One of these connectors carries the NVLink signals on and off the module; the other supplies power, control signals and PCIe I/O. The Tesla P100 accelerator can be installed into a larger GPU carrier or system board. The GPU carrier makes the appropriate connections to other P100 accelerators or PCIe controllers.

fig.1 - TESLA P100 accelerator front view

fig.2 - TESLA P100 accelerator back view

P100 declared performance

The P100 comes in two different versions: the P100 PCIe and the P100 NVLink. The first is targeted at Intel x86 systems (based on Xeon processors), while the second is targeted at HPC servers that support the CPU-GPU NVLink connection, such as the OpenPOWER servers based on the IBM POWER8 CPU.

The GP100 GPU

The GP100 GPU is the heart of the P100 accelerator, but the Pascal architecture is also produced in other variants that equip the other NVIDIA Pascal accelerators. In particular, the GP104 is used in the GeForce GTX 1080 and in the TESLA P4, while the GP102 is used in the TITAN X, the TESLA P40 and the future P80 accelerators. In the last part of this post you'll find more details on the GP104 and GP102 variants.

Key features of the GP100 GPU:

  • Mixed-Precision Computing
  • Unified Memory
  • NVLink: high speed, high bandwidth interconnect
  • HBM2: CoWoS stacked memory architecture (only for the P100 accelerator)
  • 16nm FinFET: enables more features, higher performance, and improved power efficiency

Mixed-Precision Computing

The combined use of different numerical precisions in a computational method is known as mixed precision. The NVIDIA Pascal architecture aims at even higher performance for applications that can utilise lower-precision computation, by adding vector instructions that pack multiple operations into a 32-bit data path. Specifically, these instructions operate on 16-bit floating-point data ("half" or FP16) and 8- and 16-bit integer data (INT8 and INT16).

Research in deep learning has found that DNN architectures have a natural resilience to errors due to the backpropagation algorithm used in training them, and in most cases 16-bit floating point (half precision, or FP16) is sufficient for training neural networks. Moreover, for many networks deep learning inference can be performed using 8-bit integer computations without significant impact on accuracy.

In the GP100 GPU, two FP16 operations can be performed using a single paired-operation instruction.
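As a hedged illustration of what this paired-operation path looks like from CUDA (kernel and variable names here are just examples, not NVIDIA's), a `__half2` holds two FP16 values and a single `__hfma2` call issues two fused multiply-adds at once:

```
// Minimal sketch of packed FP16 arithmetic on GP100-class hardware.
// Each __half2 element holds two FP16 values, so one __hfma2 performs
// two fused multiply-adds per instruction. Names are illustrative.
#include <cuda_fp16.h>

__global__ void fma_fp16x2(const __half2 *a, const __half2 *b,
                           __half2 *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // c[i] = a[i] * b[i] + c[i], computed on two FP16 lanes at once
        c[i] = __hfma2(a[i], b[i], c[i]);
    }
}
```

Compiling this for the P100 requires targeting compute capability 6.0 (for example `nvcc -arch=sm_60`) so that the FP16 intrinsics are available.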

Compute Preemption

Compute Preemption is another important new hardware and software feature added to GP100 that allows compute tasks to be preempted at instruction-level granularity. Compute Preemption prevents long-running applications from either monopolising the system (preventing other applications from running) or timing out.

Unified Memory

Unified Memory provides a single, seamless unified virtual address space for CPU and GPU memory. In a typical PC or cluster node today, the memories of the CPU and GPU are physically distinct and separated by the PCI-Express bus. Before CUDA 6, that is exactly how the programmer had to view things: data shared between the CPU and GPU had to be allocated in both memories and explicitly copied between them by the program.

Unified Memory creates a pool of managed memory that is shared between the CPU and GPU, bridging the CPU-GPU divide. Managed memory is accessible to both the CPU and GPU using a single pointer. The key is that the system automatically migrates data allocated in Unified Memory between host and device so that it looks like CPU memory to code running on the CPU, and like GPU memory to code running on the GPU.
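A minimal sketch of what this looks like in code (array and kernel names are illustrative): a single `cudaMallocManaged` allocation is written by the CPU, processed by a kernel, and read back by the CPU through the same pointer, with no explicit copies:

```
// Minimal Unified Memory sketch: one allocation, one pointer,
// usable from both host and device code. Names are illustrative.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float *data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main()
{
    const int n = 1 << 20;
    float *data;
    cudaMallocManaged(&data, n * sizeof(float));  // managed (unified) allocation

    for (int i = 0; i < n; ++i) data[i] = 1.0f;   // CPU writes through the same pointer

    scale<<<(n + 255) / 256, 256>>>(data, n, 2.0f);
    cudaDeviceSynchronize();                      // wait before the CPU reads again

    printf("data[0] = %f\n", data[0]);            // prints 2.000000
    cudaFree(data);
    return 0;
}
```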

GP100 extends GPU addressing capabilities to enable 49-bit virtual addressing. This is large enough to cover the 48-bit virtual address spaces of modern CPUs. Therefore, GP100 Unified Memory allows programs to access the full address spaces of all CPUs and GPUs in the system as a single virtual address space, unlimited by the physical memory size of any one processor.

Memory page faulting support in GP100 is a crucial new feature that provides more seamless Unified Memory functionality.

Page faulting means that the CUDA system software doesn’t need to synchronise all managed memory allocations to the GPU before each kernel launch: if a kernel running on the GPU accesses a page that is not resident in its memory, it faults, allowing the page to be automatically migrated to the GPU memory on-demand. Alternatively, the page may be mapped into the GPU address space for access over the PCIe or NVLink interconnects (mapping on access can sometimes be faster than migration).

With the new page fault mechanism, global data coherency is guaranteed with Unified Memory. This means that with GP100, the CPUs and GPUs can access Unified Memory allocations simultaneously.
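Whether a given device supports this simultaneous access can be queried at runtime; here is a short sketch (device index 0 is an assumption) using the `cudaDevAttrConcurrentManagedAccess` attribute introduced alongside Pascal in CUDA 8:

```
// Check whether the device supports concurrent CPU/GPU access to
// managed memory (true on GP100-class hardware with page faulting).
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int concurrent = 0;
    cudaDeviceGetAttribute(&concurrent,
                           cudaDevAttrConcurrentManagedAccess, /*device=*/0);
    printf("Concurrent managed access: %s\n", concurrent ? "yes" : "no");
    return 0;
}
```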

NVLink

Today, two or more GPUs are more commonly being paired per CPU as developers increasingly expose and leverage the available parallelism provided by GPUs in their applications. As this trend continues, PCIe bandwidth at the multi-GPU system level becomes a bigger bottleneck.

NVLink provides GPU-to-GPU data transfers at up to 160 Gigabytes/second of bidirectional bandwidth - 5x the bandwidth of PCIe Gen 3 x16.

While NVLink primarily focuses on connecting multiple NVIDIA Tesla P100 accelerators together, it can also be used as a CPU-to-GPU interconnect. For example, Tesla P100 accelerators can connect to IBM's POWER8 with NVIDIA NVLink technology; POWER8 with NVLink supports four NVLinks.
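To give an idea of how these GPU-to-GPU links are used from software, here is a hedged sketch of peer-to-peer access between two GPUs (device indices 0 and 1 and the buffer size are assumptions, and error checking is omitted); when the devices are connected by NVLink, peer copies can travel over it instead of PCIe:

```
// Peer-to-peer sketch between two GPUs (e.g. two P100s in one node).
#include <cuda_runtime.h>

int main()
{
    int canAccess = 0;
    cudaDeviceCanAccessPeer(&canAccess, 0, 1);   // can device 0 reach device 1?
    if (canAccess) {
        cudaSetDevice(0);
        cudaDeviceEnablePeerAccess(1, 0);        // allow direct access to device 1
    }

    // Explicit copy between the two devices.
    const size_t bytes = 1 << 20;
    void *src, *dst;
    cudaSetDevice(1); cudaMalloc(&src, bytes);   // source buffer on device 1
    cudaSetDevice(0); cudaMalloc(&dst, bytes);   // destination buffer on device 0
    cudaMemcpyPeer(dst, 0, src, 1, bytes);       // device 1 -> device 0

    cudaFree(dst);
    cudaSetDevice(1); cudaFree(src);
    return 0;
}
```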

HBM2 (TESLA P100 only)

High Bandwidth Memory 2 (HBM2) offers three times (3x) the memory bandwidth of GDDR5 memory. Because HBM2 is stacked memory located on the same physical package as the GPU, it also provides considerable space savings compared to traditional GDDR5.

The TESLA P100 accelerator is the first GPU accelerator to use HBM2. Rather than requiring numerous discrete memory chips surrounding the GPU as in traditional GDDR5 GPU board designs, HBM2 includes one or more vertical stacks of multiple memory dies. The memory dies are linked using microscopic wires that are created with through-silicon vias and microbumps. The combination of HBM2 stacks, GPU die and silicon interposer is packaged in a single 55mm x 55mm BGA package.

Another benefit of HBM2 memory is native support for error correcting code (ECC) functionality. ECC provides higher reliability for compute applications that are sensitive to data corruption. ECC technology detects and corrects single-bit soft errors before they affect the system.

16nm FinFET

FinFETs are 3D structures that rise above the substrate and resemble a fin, hence the name. The 'fins' form the source and drain, effectively providing more volume than a planar transistor for the same area. This, in turn, enables the use of lower threshold voltages and results in better performance and power.

FinFET devices can operate from a lower supply voltage than planar transistors since they have a lower threshold voltage. This drop in supply voltage can improve dynamic power consumption significantly.

Many semiconductor design companies are moving rapidly to manufacturing their devices on the advanced 16nm and 14nm FinFET based process geometries, simply because the performance and power benefits are compelling.

The GP100 GPU structure

The full GP100 is composed of an array of Graphics Processing Clusters (GPCs), Texture Processing Clusters (TPCs), Streaming Multiprocessors (SMs), and memory controllers.

A full GP100 consists of six Graphics Processing Clusters; each GPC contains five Texture Processing Clusters and ten Streaming Multiprocessors. Each Streaming Multiprocessor contains 64 single-precision CUDA Cores, for a total of 64 x 60 = 3840 FP32 CUDA Cores.

The GP100 used in the TESLA P100 has 56 SM units enabled (3584 FP32 CUDA Cores).
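These counts can be checked directly on a running system; the following short sketch (device index 0 assumed) reads the properties reported by the CUDA runtime - on a Tesla P100 `multiProcessorCount` is 56 and the compute capability is 6.0:

```
// Query the SM count and compute capability of the first GPU.
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("%s: %d SMs, compute capability %d.%d\n",
           prop.name, prop.multiProcessorCount, prop.major, prop.minor);
    return 0;
}
```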

The Streaming Multiprocessor (SM)

The GP100 SM is partitioned into two processing blocks, each having 32 single-precision CUDA Cores (64 FP32 CUDA Cores in total), an instruction buffer, a warp scheduler, and two dispatch units.

Each SM in GP100 features 32 double precision (FP64) CUDA Cores, which is one-half the number of FP32 single precision CUDA Cores. A full GP100 GPU has 1920 FP64 CUDA Cores. GP100 supports full IEEE 754‐2008 compliant single precision and double precision arithmetic, including support for the fused multiply‐add (FMA) operation and full speed support for denormalised values.
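In CUDA C/C++ this fused multiply-add path is exposed through the standard `fma()` math function; a tiny illustrative sketch (kernel and buffer names are mine, not from the post):

```
// Double-precision fused multiply-add: a*b + c with a single rounding,
// mapped to the FP64 FMA instruction on the GP100's DP cores.
__global__ void fma_fp64(const double *a, const double *b,
                         double *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        c[i] = fma(a[i], b[i], c[i]);  // single-rounding a[i]*b[i] + c[i]
    }
}
```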

For reference, the Kepler GK110 had a 3:1 ratio of SP units to DP units, instead of the 2:1 ratio of the GP100.

The GP102 GPU variant

The GP102 GPU is derived from the GP100 but is targeted at lower-cost HPC applications. This is the chip employed in the TITAN X (Pascal) and TESLA P40 accelerators.

Like the GP104, the GP102 has 128 FP32 CUDA Cores per Streaming Multiprocessor, but on the TITAN X (the top consumer product based on the Pascal architecture) only 28 SMs are enabled, for a total of 3584 FP32 CUDA Cores, while the TESLA P40 has 30 SMs available, for a total of 3840 FP32 CUDA Cores.

The new NVIDIA TESLA P40 accelerator is engineered to deliver the highest throughput for scale-up servers, where performance matters most. The Tesla P40 has 3840 CUDA cores with a peak FP32 throughput of 12 TeraFLOP/s and, like its little brother the P4, it also accelerates INT8 vector dot products (IDP2A/IDP4A instructions), with a peak throughput of 47.0 INT8 TOP/s.
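These dot-product instructions are exposed in CUDA as the `__dp4a` intrinsic, available on GP102/GP104-class GPUs (compute capability 6.1 and up); a hedged sketch, with kernel and buffer names chosen only for illustration:

```
// INT8 dot-product sketch: __dp4a multiplies four packed signed 8-bit
// values from each operand and adds the results to a 32-bit accumulator.
__global__ void dot_int8(const int *a, const int *b, int *acc, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // a[i] and b[i] each pack four signed 8-bit values
        acc[i] = __dp4a(a[i], b[i], acc[i]);
    }
}
```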

The GP104 GPU variant

The GP104 GPU is derived from the GP100 but is targeted at gaming. This is the chip employed on the GeForce GTX 1080.

It has excellent single-precision performance (nearly on par with the vastly more expensive Tesla P100), but much lower double-precision performance. The GP100 is targeted at workloads where both single-precision and double-precision floating-point calculations need to be carried out: the so-called "mixed workloads". For workloads in which double-precision calculations aren't done very often, however, its advantages are no longer significant.

Basically, the main differences between the GP100 and the GP104 GPUs can be summarised as follows:

  • GP100 has 64 FP32 cores per SM, GP104 has 128 FP32 cores per SM
  • GP100 has 2 processing blocks per SM, GP104 has 4 processing blocks per SM
  • GP100 has 64KB of shared memory per SM, GP104 has 96KB of shared memory per SM
  • GP100 has 32k registers per processing block, GP104 has 16k registers per processing block
  • GP100 has 32 FP64 cores per SM, GP104 has 4 FP64 cores per SM

This is a comparison of the two Streaming Multiprocessor block diagrams:


Published on 02 November 2016 by Matteo Zola
