Friday, August 7, 2009

Technology:SEC to Ban Flash Trades of U.S. Stocks, Schumer Says

Thursday, August 6, 2009

Technology:Don Becker On The State Of HPC | Linux Magazine

  • A Beowulf pioneer provides insights and experience from the HPC trenches.
  • Linux Magazine HPC Editor Douglas Eadline recently had a chance to discuss the current state of HPC clusters with Beowulf pioneer Don Becker.
  • DB: Everyone is calling everything they have a cloud strategy. From renting computing time (time-sharing) to managing virtual machines and up-time with completely transparent assist.
  • What we have seen, if you have a lightweight compute node, is that the very first job, if it is a big memory job like a large matrix computation, can get a 40-50% performance improvement because we don’t have a dirty virtual to physical mapping. We have a clean set of pages in the virtual to physical mapping because we have done essentially nothing at boot time. That is only one of the ways to accomplish performance, but even with our system that advantage is there only for that very first job.
  • As soon as the kernel starts putting objects in memory, they are not completely immovable, but as soon as the kernel grabs a page of memory for itself, you can’t shuffle pages around that allocation. So you get that bump only on the first run.
  • With virtualization you have just made this problem one level more difficult.
  • Now you have to reconcile that with the virtualization people who claim there is no performance impact.
  • In HPC, there is a large performance impact because of page table entry thrashing (in the translation lookaside buffer, or TLB, and the caches).
  • That is a big, critical, and relatively new issue in HPC. It was important before, but now we are seeing regular deploys of 32 to 128GB of memory per physical machine. With the standard 4K page size, that might be 32 million pages you have to manage. If you are stepping through memory, that is 32 million mappings that might be pushed in and out of the cache.
  • If you want guaranteed performance you are renting a machine not a cloud.
  • It is a big step forward from grid though. The computing community version of grid was we have libraries to make all these different installs communicate rather than doing machine virtualization. Think of it as library level virtualization. For every service they could think of they provide a library function. For every service they did not think of they spent years writing library functions making different and potentially disparate operating systems, distributions, and versions interoperate. I think that turned out to be, in my opinion, a huge failure.
  • They could not guarantee consistency, that is they could not guarantee you could run any executable anywhere, and never guaranteed that by running the same executable you would get the same results.
  • That is one of the fundamental assumptions you have to make in HPC. If I run a program over there, I have to know what executable is running, what libraries it is linking to and in what order. I need to reproduce that exact same result everywhere in my run.
  • Cloud computing provides virtualization at the machine level. You have to do more work, and it is more of a synchronization than a guaranteed consistency, but it is a step better than what grids were.
  • There are both large pages (2 or 4 MByte pages) and giant pages (4 GByte pages). That is an exciting area. I think 4MB pages are sufficient for right now.
  • But if we could do 4GB pages, just a handful of Gigabyte pages would solve the problem for large memory jobs and provide predictable execution time and minimal traffic to memory, that is traffic to help manage memory rather than user code memory.
  • Another area where everything is going to change for HPC I/O is flash disk (Solid State Disk or SSD). Right now they are abysmally bad; Intel has recently updated the firmware on their SSD to solve some of the worst anomalous conditions. They will get better.
  • But it changes the I/O expectations from being this very slowly growing curve. We went from 50 MB/s to 70 MB/s to 90 MB/s sustained write rate over a period of 8-10 years. We did not even get a factor of 2 in sustained write rate on the best drives. So now we are going to see a semiconductor curve instead of the disk drive curve for write rates. That will change everything.
  • But it will change how we have to deal with these rates in the file system structure, because the previous models do not apply. We can’t throw it all away because it took several decades to get file systems right, but some dramatic changes are needed.
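
The page counts Becker quotes in the interview can be checked with a little arithmetic. This is a minimal sketch, assuming binary units (1 GB = 1024^3 bytes) and using only the memory and page sizes named in the excerpts above; the helper name `mappings_needed` is made up for illustration.

```python
# Binary units are an assumption; the interview does not say which it uses.
KB, MB, GB = 1024, 1024**2, 1024**3

def mappings_needed(memory_bytes, page_bytes):
    """Number of pages (and so page mappings) needed to cover the memory,
    rounded up to a whole page."""
    return -(-memory_bytes // page_bytes)

# Page sizes as named in the interview: standard, large, and giant pages.
PAGE_SIZES = {"4 KB": 4 * KB, "2 MB": 2 * MB, "4 MB": 4 * MB, "4 GB": 4 * GB}

if __name__ == "__main__":
    for mem_gb in (32, 128):
        for label, page_bytes in PAGE_SIZES.items():
            count = mappings_needed(mem_gb * GB, page_bytes)
            print(f"{mem_gb:>3} GB with {label} pages: {count:,} mappings")
```

With 4 KB pages, 128 GB works out to 33,554,432 mappings, which is the "32 million" figure; with 4 GB pages the same memory needs only 32, the "handful" of giant pages mentioned above.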

Tuesday, August 4, 2009

Technology:Windows 7 - enabling massive parallelism to the masses

  • ...the first Windows operating system to treat the graphics processing unit (GPU) as a real peer to the CPU...

  • The model for the Windows 7 PC is to use a CPU and GPU together in a heterogeneous computing platform. Previously, GPUs were almost exclusively limited to rendering and accelerating graphics and video. With the introduction of Windows 7, the GPU and CPU will exist in a co-processing environment where each can handle the computing tasks it is best suited for. The CPU is exceptionally good at performing sequential calculations, I/O, and program flow, whereas the GPU is perfectly suited for performing massively parallel calculations. With the introduction of DirectX Compute in Windows 7, Microsoft is really opening up the immense parallel computing horsepower of the GPU natively right in the operating system.

  • Parallel programming is the next big thing for the world of computing – it has started already. DirectX Compute will accelerate this discontinuity by enabling massive parallelism to the masses. What we are talking about is co-processing, essentially using the right tool for the job.

Technology:DirectX Compute Support on NVIDIA’s CUDA Architecture GPUs

  • Microsoft’s DirectX Compute is a new GPU Computing API that runs on NVIDIA’s current CUDA architecture under both Windows Vista and Windows 7. DirectX Compute is supported on current DX10-class GPUs and future DX11 GPUs. It allows developers to harness the massive parallel computing power of NVIDIA GPUs to create compelling computing applications in consumer and professional markets.

  • Compute Shader
    The Compute Shader is an additional stage independent of the Direct3D 11 pipeline that enables general purpose computing on the GPU.

    In addition to all shader features provided by the unified shader core, the Compute Shader also supports scattered reads and writes to resources through Unordered Access Views, a shared memory pool within a group of executing threads, synchronization primitives, atomic operators, and many other advanced data-parallel features. A variant of the Direct3D 11 Compute Shader has been enabled that can operate on Direct3D 10-class hardware. It is therefore possible to develop Compute Shaders on actual hardware, but an updated driver is required. The full functionality of the Direct3D 11 Compute Shader is intended for Direct3D 11-class hardware, so in order to evaluate the full functionality, developers will need to use the Reference Rasterizer until such hardware is available.


  • Multithreaded Rendering
    The key API difference from Direct3D 10 in Direct3D 11 is the addition of deferred contexts, which enable scalable execution of Direct3D commands distributed over multiple cores. A Deferred Context captures and assembles actions like state changes and draw submissions that can be executed on the actual device at a later time. By utilizing Deferred Contexts on multiple threads, an application can distribute the CPU overhead needed in the Direct3D 11 runtime and the driver to multiple cores, enabling better use of an end-user's machine configuration. This feature is available for use on current Direct3D 10 hardware as well as the reference rasterizer.

  • Dynamic Shader Linkage
    In order to address the combinatorial explosion problem seen in specializing shaders for performance, Direct3D 11 introduces a limited form of runtime shader linkage that allows for near-optimal shader specialization during execution of an application. This is achieved by specifying the implementations of specific functions in shader code when the shader is assigned to the pipeline, allowing the driver to inline native shader instructions quickly rather than forcing the driver to recompile the intermediate language into native instructions with the new configuration. Shader development is exposed through the introduction of classes and interfaces to HLSL.
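
The record-now, execute-later pattern behind Deferred Contexts can be sketched outside Direct3D. The Python below is only an analogy, not the Direct3D 11 API: the classes `DeferredContext` and `ImmediateContext` and the functions `record_scene_part` and `render_frame` are hypothetical stand-ins. Each worker thread records state changes and draws into its own command list, and a single thread then replays the lists in a fixed order.

```python
import threading

class DeferredContext:
    """Hypothetical stand-in: records commands instead of executing them."""
    def __init__(self):
        self._commands = []

    def set_state(self, name, value):
        self._commands.append(("set_state", name, value))

    def draw(self, mesh):
        self._commands.append(("draw", mesh))

    def finish_command_list(self):
        # Hand back what was recorded and start a fresh recording.
        commands, self._commands = self._commands, []
        return commands

class ImmediateContext:
    """Hypothetical stand-in for the one context that talks to the device."""
    def __init__(self):
        self.executed = []

    def execute_command_list(self, commands):
        self.executed.extend(commands)

def record_scene_part(meshes, command_lists, slot):
    # Runs on a worker thread; touches only its own context and its own slot.
    ctx = DeferredContext()
    ctx.set_state("shader", "opaque")
    for mesh in meshes:
        ctx.draw(mesh)
    command_lists[slot] = ctx.finish_command_list()

def render_frame(scene_parts):
    command_lists = [None] * len(scene_parts)
    workers = [
        threading.Thread(target=record_scene_part, args=(part, command_lists, i))
        for i, part in enumerate(scene_parts)
    ]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    # Submission stays on one thread and in slot order, so the result is
    # deterministic no matter which worker finished recording first.
    immediate = ImmediateContext()
    for commands in command_lists:
        immediate.execute_command_list(commands)
    return immediate.executed

if __name__ == "__main__":
    for command in render_frame([["terrain", "trees"], ["player"]]):
        print(command)
```

Python threads will not show the multi-core speedup the excerpt describes; the sketch only illustrates how recording is split across threads while submission stays serialized.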


  • DirectX Compute Shader
    • New shader type supported in D3D11
    • Designed for general purpose processing
    • Doesn’t require a separate API; integrated with D3D
    • Shares memory resources with graphics shaders
    • Thread invocation is decoupled from input or output domains
    • Single thread can process one or many data elements
    • Can share data between threads
    • Supports random access memory writes
  • Compute Shaders on D3D10 Hardware
    • Subset of the D3D11 compute shader functionality that runs on current D3D10.x hardware
    • Drivers available now from NVIDIA and AMD
    • You can start experimenting with compute shaders today.
  • Compute Shader 4.0
    • New shader models: CS4.0/CS4.1
  • What’s Missing in CS4.0 Compared to CS5.0?
    • Atomic operations
    • Append/consume
    • Typed UAV (unordered access view)
    • Double precision
    • DispatchIndirect()
    • Only a single output UAV allowed (Not a huge restriction in practice)
    • Thread group grid dimensions limited to 65535
    • Thread group size is restricted to maximum of 768 threads total (1024 on D3D11 hardware)
    • Thread group shared memory restricted to 16KB total (32KB on D3D11 hardware)

      Still a lot you can do.

  • So What Does CS4.x Give Me?
    • Scattered writes
    • Thread Group Shared Memory
      - Allows sharing data between threads
      - Much faster than texture or buffer reads, saves bandwidth
      - Fast reductions, prefix sum (scan)
      - Efficient interoperability with D3D graphics
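
The CS4.x limits and the "fast reductions" point above can be illustrated with a CPU-side simulation. This is a sketch in Python rather than HLSL, so it shows only the access pattern: `fits_cs4` and `group_reduce_sum` are hypothetical helper names, the constants are the limits listed in the slides above, and the power-of-two group size is an assumption of this particular reduction, not a Direct3D requirement.

```python
# CS4.x limits as listed in the slides above.
CS4_MAX_THREADS_PER_GROUP = 768
CS4_MAX_SHARED_BYTES = 16 * 1024
MAX_GROUPS_PER_DIMENSION = 65535

def fits_cs4(threads_per_group, shared_bytes, groups_per_dimension):
    """Check a dispatch configuration against the CS4.x limits above."""
    return (
        threads_per_group <= CS4_MAX_THREADS_PER_GROUP
        and shared_bytes <= CS4_MAX_SHARED_BYTES
        and all(n <= MAX_GROUPS_PER_DIMENSION for n in groups_per_dimension)
    )

def group_reduce_sum(values):
    """Simulate a tree reduction inside one thread group.

    `shared` plays the role of thread group shared memory; the inner loop
    stands in for all threads with index < stride running one step, and the
    end of each loop pass stands in for a group-wide barrier.
    """
    n = len(values)
    # Assumption of this sketch: the group size is a power of two.
    assert n > 0 and n & (n - 1) == 0
    shared = list(values)  # each "thread" loads one element
    stride = n // 2
    while stride > 0:
        for tid in range(stride):
            # Reads index tid + stride, which no thread writes in this
            # step, so the parallel version has no read/write hazard.
            shared[tid] += shared[tid + stride]
        stride //= 2
    return shared[0]

if __name__ == "__main__":
    group_size = 512
    shared_bytes = group_size * 4  # one 32-bit value per thread
    print(fits_cs4(group_size, shared_bytes, (64, 1, 1)))
    print(group_reduce_sum(list(range(group_size))))
```

A 512-thread group with one 32-bit value per thread uses 2 KB of shared memory, well inside the 16 KB limit, and finishes in 9 halving steps instead of 511 sequential additions.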


Monday, August 3, 2009

Technology:Windows 7: XNA Framework Math Libraries, HLSL intrinsic functions

  • Depending on the hardware platform your game is targeting, you may have multiple processors available to perform floating-point calculations. For example, on the Xbox 360 platform and on many Windows-based computers, you have a CPU, on which your game code is running, but you also have a GPU, on which your shader code is running. Therefore, you can also perform floating-point calculations on the GPU using shaders and HLSL intrinsic functions. GPUs are very fast at performing large numbers of floating-point operations. This can be very useful, for example, if you need to perform a large number of floating-point operations to animate a particle system.
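
The particle system example suits the GPU because every particle can be updated independently. The sketch below shows that shape in plain Python rather than HLSL; the names `step_particle` and `step_all`, the gravity constant, and the choice of a semi-implicit Euler step are all assumptions made for illustration.

```python
def step_particle(particle, dt, gravity=-9.81):
    """Advance one particle by one time step.

    A particle is a tuple (x, y, vx, vy). The update reads only its own
    particle, so on a GPU each call could run as a separate thread.
    """
    x, y, vx, vy = particle
    vy = vy + gravity * dt  # update velocity first (semi-implicit Euler)
    return (x + vx * dt, y + vy * dt, vx, vy)

def step_all(particles, dt):
    # A sequential stand-in for what a shader would do in parallel.
    return [step_particle(p, dt) for p in particles]

if __name__ == "__main__":
    particles = [(0.0, 0.0, 1.0, 0.0), (2.0, 5.0, -1.0, 3.0)]
    for particle in step_all(particles, dt=0.5):
        print(particle)
```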


Technology:The Economic Impact of Microsoft's Windows 7

  • By the end of 2010, more than 7 million people worldwide in the IT industry and at IT-using organizations will be working with Windows 7, or 19% of the global IT workforce. The 350,000-plus IT companies that produce, sell, or distribute products or services running on Windows 7 will employ 3 million; another 4 million will be employed at IT-using firms.
  • For every dollar of Microsoft revenue from Windows 7 between its launch in October 2009 and the end of 2010, the ecosystem beyond Microsoft will reap $18.52. During that period, this ecosystem will sell more than $320 billion in products and services revolving around Windows 7.
  • To achieve those revenues, companies in the Microsoft global ecosystem working with Windows 7 are expected to invest nearly $115 billion by the end of 2010 developing, marketing, and supporting products and services built around Windows 7.