Jun 21, 2008

Tesla v2.0

Sunnyvale (CA) - Nvidia today announced its second generation of Tesla floating point accelerators based on the GT200 series of graphics processors. It is the first big upgrade for the company’s supercomputing product portfolio - streamlining the offering and introducing double precision support as well as much more performance than the original 8-series, which was introduced one year ago.

High-performance computing (HPC) applications are likely to see several new technologies this week. In the hardware arena, AMD already announced its 1+ TFlop GPU earlier today and Nvidia is following with a GT200-based GPGPU, which the company also claims can hit 1 TFlop per processing unit in single-precision applications. Compared to the first generation, the floating point performance is up from 518 GFlops.

The new T10P processing unit represents a massive die, integrating 1.4 billion transistors and 240 processing cores, which is up from 128 cores in the 8-series of GPUs.
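As a rough sanity check on the generational jump, the quoted figures can be compared directly. The sketch below uses only numbers from this article (518 GFlops and 128 cores for the first generation, 1 TFlop and 240 cores for the T10P); the variable names are illustrative, and the "per-core gain" is simply whatever the core count alone does not explain.

```python
# Figures quoted in the article; names are illustrative only.
GEN1_GFLOPS = 518.0    # first-generation single-precision peak
GEN2_GFLOPS = 1000.0   # "1 TFlop" claimed for the GT200-based part
GEN1_CORES = 128
GEN2_CORES = 240

flops_ratio = GEN2_GFLOPS / GEN1_GFLOPS
core_ratio = GEN2_CORES / GEN1_CORES
# Any gain not explained by core count would have to come from
# clock speed or per-core throughput.
per_core_gain = flops_ratio / core_ratio

print(f"peak throughput: {flops_ratio:.2f}x")
print(f"core count:      {core_ratio:.2f}x")
print(f"per-core gain:   {per_core_gain:.2f}x")
```

On these numbers, nearly all of the claimed improvement tracks the larger core count.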

Nvidia has cut the deskside supercomputer (D870), responding to a trend of customers purchasing workstation graphics cards rather than an expensive external add-on, and is now limiting the product portfolio to a 4-GPU 1U blade and a Tesla add-in card. The S1070 blade integrates GPUs clocked at 1.5 GHz, a total of 960 processing cores, 4 GB of GDDR3 800 memory per GPU for a 16 GB total, 408 GB/s memory bandwidth and a total processing capability of 4 TFlops. Power consumption is up from 550 watts in the first generation to 700 watts.
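The blade totals follow from the per-GPU figures, which a short calculation can confirm. This is a minimal sketch using the numbers quoted above. The last step rests on two assumptions that are not from Nvidia's announcement as reported here: that "GDDR3 800" means an 800 MHz double-data-rate memory clock, and that each GPU has a 512-bit memory bus.

```python
GPUS_PER_BLADE = 4
CORES_PER_GPU = 240
MEM_PER_GPU_GB = 4
TFLOPS_PER_GPU = 1.0
BLADE_BANDWIDTH_GBS = 408.0

total_cores = GPUS_PER_BLADE * CORES_PER_GPU            # 960
total_mem_gb = GPUS_PER_BLADE * MEM_PER_GPU_GB          # 16
total_tflops = GPUS_PER_BLADE * TFLOPS_PER_GPU          # 4.0
bandwidth_per_gpu = BLADE_BANDWIDTH_GBS / GPUS_PER_BLADE  # 102.0

# ASSUMPTIONS (not stated in the article): 800 MHz DDR memory clock
# and a 512-bit memory bus per GPU.
MEM_CLOCK_HZ = 800e6
TRANSFERS_PER_CLOCK = 2   # double data rate
BUS_WIDTH_BITS = 512
derived_bandwidth = MEM_CLOCK_HZ * TRANSFERS_PER_CLOCK * BUS_WIDTH_BITS / 8 / 1e9

# Power figures quoted for the first- and second-generation blades.
power_increase = (700 - 550) / 550

print(total_cores, total_mem_gb, total_tflops, bandwidth_per_gpu)
print(f"derived per-GPU bandwidth: {derived_bandwidth:.1f} GB/s")
print(f"power increase: {power_increase:.0%}")
```

Under those assumptions the derived per-GPU bandwidth lands at about 102 GB/s, in line with a quarter of the blade's 408 GB/s.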

The blade will be offered with either 2 PCIe interfaces ($7995) or one PCIe connection ($8295), both of which are slightly more expensive than the S870 blade, which sold for $7500 at introduction.

The entry-level Tesla product remains an add-in card, in this case the C1060, which essentially represents a Quadro graphics card on steroids. The card includes one T10P processor, 102 GB/s memory bandwidth and a power consumption rating of 160 watts, down from 170 watts in the previous generation. Nvidia said that thermal restrictions forced the company to clock the C1060 GPU at 1.33 GHz instead of the 1.5 GHz in the blade. As a result, the C1060 will not hit 1 TFlops and is estimated to check in at about 900 GFlops.
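The "about 900 GFlops" estimate is consistent with simple clock scaling. The sketch below assumes that peak throughput scales linearly with clock speed at a fixed core count; that is our simplification, not a figure from Nvidia, and the helper name `scale_peak` is hypothetical.

```python
BLADE_CLOCK_GHZ = 1.5
CARD_CLOCK_GHZ = 1.33
BLADE_PEAK_GFLOPS = 1000.0  # "1 TFlop" per T10P at the blade clock


def scale_peak(peak_gflops, from_clock_ghz, to_clock_ghz):
    """Scale a peak-throughput figure linearly with clock speed.

    ASSUMPTION: same core count and same work per clock, so only
    the clock ratio matters.
    """
    return peak_gflops * to_clock_ghz / from_clock_ghz


card_peak = scale_peak(BLADE_PEAK_GFLOPS, BLADE_CLOCK_GHZ, CARD_CLOCK_GHZ)

# Throughput per watt, using the power ratings quoted in the article.
card_gflops_per_watt = card_peak / 160
prev_gflops_per_watt = 518.0 / 170

print(f"estimated C1060 peak: {card_peak:.0f} GFlops")
print(f"GFlops per watt: {card_gflops_per_watt:.2f} vs {prev_gflops_per_watt:.2f}")
```

The scaled figure comes out just under 900 GFlops, and on paper the new card delivers noticeably more throughput per watt than its predecessor.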

The C1060 will be offered for $1699 MSRP, up from the $1500 price tag of the original C870.
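Despite the higher sticker price, the cost per unit of peak throughput falls. A minimal sketch, using the list prices and single-precision peak figures quoted in this article (about 900 GFlops for the C1060, 518 GFlops for the first generation); the names are illustrative.

```python
# (price in USD, peak single-precision GFlops) as quoted in the article
C1060 = (1699, 900.0)
C870 = (1500, 518.0)


def dollars_per_gflop(price_usd, peak_gflops):
    """Return list price divided by peak throughput."""
    return price_usd / peak_gflops


c1060_cost = dollars_per_gflop(*C1060)
c870_cost = dollars_per_gflop(*C870)

print(f"C1060: ${c1060_cost:.2f} per GFlop")
print(f"C870:  ${c870_cost:.2f} per GFlop")
```

Peak figures are theoretical maximums, so this comparison says nothing about delivered performance in real applications.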

Besides performance improvements, the T10P also delivers 64-bit or double-precision capability, which is required for most fluid dynamics and financial stream processing applications. Double-precision calculations are substantially more intensive than single-precision calculations and will decrease the performance of the card dramatically. Nvidia told us that double-precision calculations will result in a 90% speed penalty and deliver only 100 GFlops per T10P processor.
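To illustrate why some workloads accept that penalty, the sketch below emulates 32-bit arithmetic in plain Python and shows a small increment being lost in single precision but kept in double precision. It also recomputes the 90% penalty from the quoted 1 TFlop and 100 GFlops figures. The helper `to_float32` is our own and has nothing to do with CUDA.

```python
import struct


def to_float32(x):
    """Round a Python float (64-bit) to the nearest 32-bit float."""
    return struct.unpack("f", struct.pack("f", x))[0]


# float32 has a 24-bit significand, so 2**24 + 1 is the first
# positive integer it cannot represent exactly.
big = 2.0 ** 24
single_sum = to_float32(to_float32(big) + 1.0)  # emulated 32-bit add
double_sum = big + 1.0                          # native 64-bit add

print(f"single precision: {single_sum:.1f}")
print(f"double precision: {double_sum:.1f}")

# Penalty implied by the quoted figures.
SINGLE_GFLOPS = 1000.0
DOUBLE_GFLOPS = 100.0
penalty = 1.0 - DOUBLE_GFLOPS / SINGLE_GFLOPS
print(f"double-precision penalty: {penalty:.0%}")
```

In single precision the added 1.0 vanishes entirely. Errors of this kind can accumulate over long-running simulations.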

There is also news surrounding the CUDA application platform, which Nvidia says can be used for any multi-core processing environment out there: This summer, the company will release a beta version of CUDA that developers can apply to multi-core CPUs. Nvidia claims that CUDA has been downloaded 60,000 times so far, but it is safe to say that there aren’t 60,000 developers working on HPC applications - and even Nvidia admits that most of those 60,000 developers are "playing" with CUDA trying to create "consumer applications." The expansion into the CPU area could help the company reach a far greater developer base than it is able to attract with a GPU-only software foundation.

Technically, there is nothing that prevents CUDA from also being used for ATI’s GPU products, Nvidia told us. However, not surprisingly, Nvidia said that it won’t be offering CUDA for ATI products and stated that "someone else can do that." ATI offers its own high-level development tools called Brook+.


