# Mappingtand to Physics Bupletern portentage's the Acottilocteas





ORNL is managed by UT-Battelle for the US Department of Energy

**Bronson Messer** 

Scientific Computing & Theoretical Astrophysics Groups

Oak Ridge National Laboratory

&

Department of Physics & Astronomy

University of Tennessee





### Strange things are afoot...

- We've done some parallel programming so far this week.
- Mostly confined to domain decomposition, implemented with MPI.
- This is 20th-century computing.



## Nuclear astrophysics INCITE allocations from 2010 - present

- Average number of cpu-hours/yr ~152 M
- In aggregate, just less than 10% of the total available each year from 2010 - 2014
  - Allocations at NERSC are also aboveaverage in size
  - PRAC is dominated by astrophysics, including considerable allocations for stellar astrophysics
- This excellent record is due to
  - the formulation of large, important problems
  - demonstrated ability to efficiently exploit the largest computational platforms
- Will this trend continue to the exascale?
   Can we continue to solve big problems efficiently?





## The Effects of Moore's Law and Slacking <sup>1</sup> on Large Computations

Chris Gottbrath, Jeremy Bailin, Casey Meakin, Todd Thompson, J.J. Charfman

Steward Observatory, University of Arizona

<sup>1</sup>This paper took 2 days to write

### **Abstract**

We show that, in the context of Moore's Law, overall productivity can be increased for large enough computations by 'slacking' or waiting for some period of time before purchasing a computer and beginning the calculation.

work and slack in the context of moores law



astro-ph/9912202





### Google-stalk...

DETECTION OF INTERSTELLAR BS IN THE CIRRUS DARK CLOUD OF THE NUMBBUM ASSOCIATION

AN INTUITIVE MODEL AND ITS SUBSEQUENT OBSERVATION

J.J. Charfman

deutsch english

suchen

On the Utter Irrelevance of LPL Graduate Students: An Unbiased Survey by Steward Observatory Graduate Students (2002)

Charfman, J. J., Bsc, J. B., Eriksen, K. A., Knierman, K., Leistra, A., Mamajek, E., Monkiewicz, J., Moustakas, J., Murphy, J., Rigby, J., Young, P. A.

### Abstract

We present a new analysis of the irrelevance of Lunar and Planetary Laboratory (LPL) graduate students at the University of Arizona. Based on extensive Monte Carlo simulations we find that the actual number of useful results from LPL graduate students is \$0\pm0.01 (5\sigma)\$. Their irrelevance quotient far surpasses that of string theorists.

### Details der Publikation

Download

http://arxiv.org/abs/astro-ph/0204041

Archiv

arXiv (United States)

Keywords

Astrophysics

Typ

text

1981) with broad-band alphabetical filters at the front of this book.

Alternate theories of BS formation on solid surfaces by Mayonnaise Hamburger (1977) have been considered, but only gas-produced BS can account for the amount reported here. However the yellow stuff reported by Hamburger (this volume) may well be mustard.

BS emission was generally observed in mornings and evenings, but was finally detected on Friday afternoon at a 1.8 $\sigma$  level. Its lifetime was estimated to be approximately 5 days, but the level of emission was generally variable. The detection of this molecule has considerable impact on our understanding of Giant Gas Clouds which will be discussed in a later paper, but we want to note the most important of these at

645

B. H. Andrew (ed.), Interstellar Molecules, 645-648.

Copyright © 1980 by the IAU.



### J. J. Charfman - The Snarky Nicolas Bourbaki of Arizona's Steward Observatory

Ø

2





dangerous bends, indeed, are ahead...



### **Architectural Trends - No more free lunch**

- CPU clock rates quit increasing in 2003
- P = CV<sup>2</sup>f
   Power consumed is proportional to the frequency and to the square of the voltage
- Voltage can't go any lower, 1,000 so frequency can't go higher without increasing 100 power
- Power is capped by heat dissipation and \$\$\$
- Performance increases have been coming through increased parallelism



Herb Sutter: Dr. Dobb's Journal: <a href="http://www.gotw.ca/publications/concurrency-ddj.htm">http://www.gotw.ca/publications/concurrency-ddj.htm</a>



## Why are machines slowing down?

7 MegaWatts

- 1 MW-hour =  $3.6 \times 10^9$  joules
- 1 gram TNT = 4184 joules (NIST)



1 jaguar-hour ~ 7 MW-hours ~ 2.5x10<sup>10</sup> joules ~ 6 tons TNT

- 1 supernova simulation ~ 0.6 jaguar-months ~ 2.6 kton
- Trinity ~ 19 kton

But, you should care even if you don't use this much electricity...

\*OAK RIDGE National Laborat

## What Burning Hot Laptops Can Do to Your Legs

5:10 AM - October 6, 2010 - By Jane McEntegart -Source : Tom's Guide US



Hot laptops can lead to permanent skin discoloration from 'Toasted Skin Syndrome.'

Most laptop owners are familiar with what happens when they use their computers on their laps. Things can get pretty warm and it doesn't take long for it to get uncomfortable. While most of us would move to the coffee table or stick a cushion under the laptop when things start to get sweaty, it seems not everyone takes the same measures and it's resulting in laptop-induced injuries.

The journal Pediatrics this week warned of the dangers of resting a hot laptop on your legs for extended periods of time and highlighted a couple of cases where it had resulted in erythema abigne, a skin condition the journal describes as "a reticular, pigmented, sometimes telangiectatic dermatosisthat is caused by prolonged exposure to a heat or infrared source." Laptop-induced rythema abigne is now being referred to as "Toasted Skin Syndrome."







### **ORNL's "Titan" Hybrid System:** Cray XK7 with AMD Opteron and NVIDIA Tesla processors





SYSTEM SPECIFICATIONS:

>10x Peak performance of 27.1 PF

24.5 GPU + 2.6 CPU

18,688 Compute Nodes each with:

16-Core AMD Opteron CPU

NVIDIA Tesla "K20x" GPU

32 + 6 GB memory

512 Service and I/O nodes

200 Cabinets

710 TB total system memory

Cray Gemini 3D Torus Interconnect

8.8 MW peak power ~2x





## What will the exascale look like?

- "Node architectures are expected to change dramatically in the next decade, becoming more hierarchical and heterogeneous."
- "... computer companies are dramatically increasing on-chip parallelism to improve performance. The traditional doubling of clock speeds every 18 to 24 months is being replaced by a doubling of cores or other parallelism mechanisms."
- "Systems will consist of one hundred thousand to one million nodes and perhaps as many as a billion cores."

Architectures and Technology for Extreme Scale Computing, Workshop Report, 2009; <a href="http://www.er.doe.gov/ascr/ProgramDocuments/Docs/Arch-TechGrandChallengesReport.pdf">http://www.er.doe.gov/ascr/ProgramDocuments/Docs/Arch-TechGrandChallengesReport.pdf</a>



**Scientific Grand Challenges** 

### **Hierarchical Parallelism**

- MPI parallelism between nodes (or PGAS)
- On-node, SMP-like parallelism via threads (or subcommunicators, or...)
- Vector parallelism
  - SSE/AVX/etc on CPUs
  - GPU threaded parallelism



- Exposure of unrealized parallelism is essential to exploit all near-future architectures.
- Uncovering unrealized parallelism and improving data locality improves the performance of even CPU-only code.
- Experience with vanguard codes at OLCF suggests 1-2 personyears is required to "port" extant codes to GPU platforms.
  - Likely less if begun today, due to better tools/compilers



# So how should we program for these new systems?

- What to do Good Threading (OpenMP)
  - –Must do high level threading
  - Thread must access close shared memory rather than distant shared memory
  - –Load Balancing
- What to do Good Vectorization
  - Vectorization advantage allows for introducing overhead to vectorize
  - –Vectorization doesn't rely on having HW that is described as "vector"



## Good news! Stellar astrophysics tends to have a lot of unrealized parallelism at present

- Simulation codes for stellar evolution and explosions
  - Exemplars of "multiphysics application codes"
  - Typically many degrees-of-freedom per spatial grid point
    - o radiation transport
    - o nuclear burning
  - Spatial domains typically parallelized via domain decomposition



- high-density physics
- nuclear structure and reactions







- A common directive programming model for today's GPUs
  - Announced at SC11 conference
  - Offers portability between compilers
    - Drawn up by: NVIDIA, Cray, PGI, CAPS
    - Multiple compilers offer portability, debugging,
      - permanence
  - Works for Fortran, C, C++
    - Standard available at <u>www.OpenACC-standard.org</u>
    - Initially implementations targeted at NVIDIA GPUs
- Current version: 1.0 (November 2011)
- Compiler support:
  - Cray CCE: complete 1.0
  - PGI Accelerator: complete 1.0 + extensions
  - CAPS: complete 1.0 + extensions





### Why use OpenACC Directives?

- Productivity
  - -Higher level programming model
  - -a la OpenMP
- Portability
  - -ignore directives, portable to the host
  - -portable across different accelerators
  - -performance portability
- Performance feedback



### 29 0, workSize[i] \* sizeof(float) \* WA, 0, NULL, NULL); 30 31 // create OpenCL buffer on device that will be initiatlize from the host memory on first use 32 33 d B[i] = clCreateBuffer(cxGPUContext, CL MEM READ ONLY | CL MEM COPY HOST PTR, mem size B, h B data, NULL); // Output buffer OpenCL CUDA C d C[i] = clCreateBuffer(cxGPUContext, CL MEM WRITE ONLY, workSize[i] \* WC \* sizeof(float), NULL, NULL); // set the args values clSetKernelArg(multiplicationKernel[i], 0, sizeof(cl mem), (void \*) &d C[i]); clSetKernelArg(multiplicationKernel[i], 1, sizeof(cl mem), (void \*) &d A[i]); clSetKerneilarg(multiplicationKernel[i], 2, sizeof(cl\_mem), (void \*) &d\_B[i]); clsetKernellAxQ(multipixicationKernelfligat\*3, C/sifzeof(float)1 cathlock istzevAblock istzevAblock istzevAblock clSetKernelArg(multiplicationKernel[i], 4, sizeof(float) \* BLOCK\_SIZE \*BLOCK\_SIZE, 0 ); int by = blockIdx.y; 47 48 // Execute Multiplicateon on BIOCK SIZE in parallel 49 50 size t localWorkSitzenStep {BLOOK STZET,ZBLOOK SIZE}; size\_t glopalWorkSize [Sub EngreundUp (BLOCK\_SIZE, WC), shrRoundUp (BLOCK\_SIZE, workSize[0])}; 51 52 // Launchl kernels on deviaces= aEnd; for (unsigned int i = 0; +1 a & temperature but a limit by the limit b 53 54 Multiplication - whon blocking execution glorations; = B [h + wb \* ty + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b + x = b 55 56 57 配EnqueueNDRangeKernell(GommandQueue)[i], multiplicationKernel[i], 2, 0, globalWorkSize, localWorkSize, Csub += AS(ty, k) \* BS(0, WULL, &GPUExecution[i]); syncthreads(); 58 59 for (wasigned into if = wB, \* iBLOCKIDEVICeCount; BLOCK\_SIZE \* bx; C[c + wB \* ty + tx] = Csub;60 61 62 clFinish(commandQueue[i]); 1 void 63 } 30 Vold 31 domatmul(float\* C, float\* A, float\* B, unsigned int hA, unsigned int wA, unsigned wB) 2 computeMMO\_saxpy(float 64][WB]fortwasing (whit in at hat int hat int hat int hat int wa, int wa A, int wa, int wah unsigned int mem size A = sizeof(float) \* size A; 66 // Non-blocking copyundignesulin from devicemeto host 4 { 67 5 #pragma acc region 68 6 { 7 #pragma acc for paralle 9 vector (16) ungoll (4) float \*d\_A, \*d\_B, \*d\_C; for (int i = 0; i <70 A/; +CPU aync with GPU (int i = 0; i <70 / cuda sync with GPU for (int j = 0; 71 < wmclWait For Events (addersice Counts) GPUD one ); cudaMalloc((void\*\*) &d C, mem size A); cudaMalloc((void\*\*) &d C, mem size C); cudaMemcpy(d A, h A, mem size A, cudaMemcpyHostToDevice); } Release mem and events mem size B, cudaMemcpyHostToDevice);</pre> (cudaMemcpyCd A, h B, mem size B, cudaMemcpyHostToDevice); 9 10 11 12 13 14 15 78 clReleaseMemObjectmcbB(lil),; d C, mem size\_C, cudaMemcpyDeviceToHost); 16 79 clReleaseEvent(GPUExecution[i]); 17 80 clReleaseEvent (GPUDone [1]'); 18 } 81 cudaFree(d-C); 19 } 82 } 83 kernel void 84 matrixMul( \_\_global float\* C, \_\_global float\* A, \_\_global float\* B, \_local float\* As, \_local float\* Bs) 85 86 int bx = get group id(0), tx = get local id(0); int by = get\_group\_id(1), ty = get\_local\_id(1); int aEnd = WA \* BLOCK SIZE \* by + WA - 1; OpenAc float Csub = 0.0f; for (int a = WA\*BLOCK\_SIZE\*by , b = BLOCK\_SIZE \* bx; a <= aEnd; a += BLOCK SIZE, b += BLOCK SIZE\*WB) { 95 As[tx + ty \* BLOCK SIZE] = A[a + WA \* ty + tx];96 Bs[tx + ty \* BLOCK SIZE] = B[b + WB \* ty + tx];97 barrier(CLK LOCAL MEM FENCE); 98 for (int k = 0; k < BLOCK SIZE; ++k) 99 Csub += As[k + ty \* BLOCK SIZE]\*Bs[tx + k \* BLOCK SIZE] ; 101 barrier(CLK LOCAL MEM FENCE); 102 103 C[get\_global\_id(1) \* get\_global\_size(0) + get\_global\_id(0)] = Csub; 104 National Laboratory | COMPUTING FACILITY 105 }

LEADERSHIP

### **Risk factors**

- Will there be machines to run my OpenACC code on?
  - Now? Lots of Nvidia GPU accelerated systems
    - Cray XK7s: CSCS tödi, HLRS hermit, ORNL titan...
    - Lots of other GPU machines in Top100 (OpenACC is multi-vendor)
  - Future? OpenACC can be targeted at other accelerators
    - PGI and CAPS already target Intel Xeon Phi, AMD GPUs
  - Plus you can always run on CPUs using same codebase
- Will OpenACC continue?
  - Support? Cray and PGI (at least) are committed to support OpenACC
    - Lots of big customer pressure to continue to run OpenACC
  - Develop? OpenACC v2.0 standard out
  - Lots of new partners joined committee at end of last year
- Will OpenACC be superseded by something else?
  - Auto-accelerating compilers? If only!
    - Never really managed it for threading real HPC applications on the CPU
    - Data locality adds to the challenge
  - OpenMP accelerator directives? OpenACC work not wasted
    - Very similar programming model; can transition easily
    - Cray (co-chair), PGI very active in OpenMP accelerator subcommittee



### **Posited Exascale Specs**

| System attributes          | 2010     | 2018<br>"2015" |          | 2023?<br>"2018" |           |
|----------------------------|----------|----------------|----------|-----------------|-----------|
| System peak                | 2 PF     | 200 PF/s       |          | 1 Exaflop/s     |           |
| Power                      | 6 MW     | 15 MW          |          | 20 MW           |           |
| System memory              | 0.3 PB   | 5 PB           |          | 32–64 PB        |           |
| Node performance           | 125 GF   | 0.5 TF         | 7 TF     | 1 TF            | 10 TF     |
| Node memory BW             | 25 GB/s  | 0.1 TB/s       | 1 TB/s   | 0.4 TB/s        | 4 TB/s    |
| Node concurrency           | 12       | O(100)         | O(1,000) | O(1,000)        | O(10,000) |
| System size (nodes)        | 18,700   | 50,000         | 5,000    | 1,000,000       | 100,000   |
| Total node interconnect BW | 1.5 GB/s | 150 GB/s       | 1 TB/s   | 250 GB/s        | 2 TB/s    |
| MTTI                       | day      | O(1 day)       |          | O(1 day)        |           |



## Example goals/highlights from NP Exascale Workshop report (2009)



All of these goals are attainable, but will require new algorithms and implementations to bridge the gap to the posited architectures.



## Achieving high spatial (or phase-space, etc.) resolution will be very difficult.



Total memory on the entire exascale system will be O(10 PB)

# Locally bulk-synchronous programming model is not a viable path for maximum performance on these new platforms

- FLOP/s are cheap and moving data is expensive.
- Even perfect knowledge of resource capabilities at every moment and perfect load balancers will not rescue billion-thread SPMD implementations of PDE simulations.
- Cost of rebalancing frequently is too large, but the Amdahl penalty of failing to rebalance is fatal.
- To take full advantage of asynchronous algorithms, we need to develop greater expressiveness in scientific programming.
  - Create separate threads for logically separate tasks, whose priority is a function of algorithmic state, not unlike the way a time-sharing OS works.
  - Join priority threads in a directed acyclic graph (DAG), a task graph showing the flow of input dependencies; fill idleness with noncritical work or steal work.

Comments taken directly from keynote address by David Keyes at EU-US HPC Summer School, June 2012



## Asynchronous execution models via task scheduling

- Examples exist already in other domains
  - –MAGMA (linear algebra)
  - -MADNESS (DFT)
  - Uintah (terrestrial combustion)
- Operator-split physics modules become "tasks" associated with execution threads





# Will the exascale (or before) machine be primarily a "strong-scaling" platform?

- Memory constraints provide a hard ceiling for spatial resolution and number of unknowns.
  - -bytes/FLOP goes down by an order of magnitude
- Simulations will be certainly be larger, but likely not as large as one would expect if scaling with FLOPs is assumed.
  - -no more than ~10x the number of MPI ranks?
  - —this connotes no more than factors of ~2 in resolution in each dimension for 3D
- OK: considerable understanding can be realized by fully exploring parameter space.
  - progenitor mass, rotation, metallicity
  - -transport approximations, additional physics



## Simulation, code, and data management become even harder



Revision control, regression testing, viz, workflow...



### Summary

- The future is now! Computers are not getting faster from the perspective of a nuclear astrophysicist. They are only getting "wider." This is true on the world's largest supercomputers and on your laptop!!!!
- The Xeon(Phi)/GPU/BG\Q choice is no choice at all. They are all versions of a single narrative.
- Stellar astrophysics is rife with unrealized parallelism, but architectural details and memory (i.e. cost, power) constraints will present considerable challenges.
- Bulk-synchronous execution is a terrible way to try to exploit near-future architectures. A new programming model will require considerably more effort than a simple multi/many-core port.
- Managing large simulations is something we can barely do now, but how about managing 1000's, 10's of thousands, or 100's of thousands of simulations? We should not expect to rely on solutions to be thrown over the fence from developers in other communities.

