Lesson 1: Introduction
· Motivation (why parallelize?)
· Types of parallelism:
- Implicit vs Explicit
- Flynn's Taxonomy
· Example architectures:
- SMP/multicore; Cell BE; GPGPU
- clusters; grids; clouds
· Problem bottlenecks/classification:
- compute-bound
- I/O-bound (or memory-bound)
· Programming models:
- shared memory (e.g. threads)
- message-passing (e.g. MPI)
- vectorization
· Introduce systems used in class:
- Intel Core 2 Duo E8400 @3.00GHz (duck/swan labs)
- Intel Xeon E3113 @3.00GHz (ps3 cluster head node)
- NVIDIA GeForce 9400 GT (duck/swan labs)
- Cell BE (ps3 cluster)
Lesson 2: Threading
· Thread basics
- What are threads?
- Threads and processes
- Advantages of thread programming
- Common models for programming with threads
· POSIX threads (sketch below)
- Creation and termination
- Synchronization primitives
- Joining and detaching
- Mutexes
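A minimal POSIX threads sketch tying these topics together: two threads created with pthread_create() increment a shared counter under a mutex, and main() joins both before printing the result (the names worker and counter are illustrative; compile with gcc -pthread).

    #include <pthread.h>
    #include <stdio.h>

    static long counter = 0;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *worker(void *arg)        /* thread start routine */
    {
        int i;
        for (i = 0; i < 100000; i++) {
            pthread_mutex_lock(&lock);    /* enter critical section */
            counter++;
            pthread_mutex_unlock(&lock);  /* leave critical section */
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, worker, NULL);
        pthread_create(&t2, NULL, worker, NULL);
        pthread_join(t1, NULL);           /* wait for both workers to finish */
        pthread_join(t2, NULL);
        printf("counter = %ld\n", counter);   /* 200000, thanks to the mutex */
        return 0;
    }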
Lesson 3: SIMD Extensions
· Introducing SIMD Extensions
- Vector vs Scalar processing
- Vector data types
- Byte order
- Programming interfaces
· SIMD Programming with GCC (sketch below)
- Rudimentary example
- Vector reorganization
- Data type conversion
- Elimination of conditional branches
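A rudimentary sketch of the GCC vector extensions this lesson builds on: a 16-byte vector type declared with __attribute__((vector_size)) is added element-wise in a single expression; the union used for element access is a common idiom for older GCC releases and is purely illustrative.

    #include <stdio.h>

    typedef int v4si __attribute__((vector_size(16)));   /* four 32-bit ints */
    union vec4 { v4si v; int e[4]; };                     /* for element access */

    int main(void)
    {
        union vec4 a = { { 1, 2, 3, 4 } };
        union vec4 b = { { 10, 20, 30, 40 } };
        union vec4 c;
        int i;

        c.v = a.v + b.v;            /* one SIMD addition instead of four scalar adds */

        for (i = 0; i < 4; i++)
            printf("%d ", c.e[i]);  /* prints: 11 22 33 44 */
        printf("\n");
        return 0;
    }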
Lesson 4: Cell BE Programming (part 1)
· The Cell Broadband Engine
- Cell BE Architecture
- Cell BE Programming Model
· SPE Programming (sketch below)
- A simple SPE program
- Using DMA transfer
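A hedged SPE-side sketch of this lesson's material, assuming the IBM Cell SDK headers (spu_mfcio.h) and a PPE-side loader that passes the effective address of a 128-byte, 128-byte-aligned main-memory buffer in argp; the buffer name and tag value are arbitrary.

    #include <spu_mfcio.h>

    #define TAG 1

    /* Local-store buffer; DMA targets should be 128-byte aligned. */
    static volatile char buf[128] __attribute__((aligned(128)));

    int main(unsigned long long speid, unsigned long long argp,
             unsigned long long envp)
    {
        /* Pull 128 bytes from main memory (effective address passed in argp)
           into the SPE local store via DMA. */
        mfc_get(buf, argp, sizeof(buf), TAG, 0, 0);

        /* Block until all DMA transfers with this tag have completed. */
        mfc_write_tag_mask(1 << TAG);
        mfc_read_tag_status_all();

        /* ... work on buf in local store, then mfc_put() the results back ... */
        return 0;
    }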
Lesson 5: Cell BE Programming (part 2)
· Parallel SPE Programming
- SIMD programming on the SPE
- Using Multiple SPEs
· Advanced Cell Programming
- Communication between PPE and SPE (mailboxes, signal notification registers; sketch below)
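A small SPE-side sketch of mailbox communication, again assuming spu_mfcio.h; the value 42 is arbitrary, and the matching PPE side (not shown) would use libspe2 calls such as spe_out_mbox_read() and spe_in_mbox_write().

    #include <spu_mfcio.h>

    int main(unsigned long long speid, unsigned long long argp,
             unsigned long long envp)
    {
        unsigned int reply;

        spu_write_out_mbox(42);        /* post a value for the PPE; stalls if full */
        reply = spu_read_in_mbox();    /* stall until the PPE writes a reply */

        (void)reply;
        return 0;
    }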
Lesson 6: Cell BE Programming (part 3)
· Advanced Cell Programming
- Effective Utilization of DMA Transfer
- Differences in PPE and SPE Data Representation
- Vector Data Alignment
- Scalar Operations on SPE (sketch below)
- SPE Program Embedding
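A short sketch for the scalar-operations topic, assuming spu_intrinsics.h: the SPU has no separate scalar registers, so scalar work is done in a vector register's preferred slot by splatting the operands, operating, and extracting element 0 (the function name add_scalar is illustrative).

    #include <spu_intrinsics.h>

    float add_scalar(float x, float y)
    {
        vector float vx = spu_splats(x);          /* replicate x across a vector */
        vector float vy = spu_splats(y);
        return spu_extract(spu_add(vx, vy), 0);   /* take the preferred slot */
    }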
Lesson 7: GPU Programming (part 1)
· Introduction to GPU Programming
- GPUs and GPU Architecture
- Introduction to CUDA
· GPU Programming with CUDA (sketch below)
- A simple CUDA example
- Blocks and thread hierarchy
- Memory hierarchy
- Device and host separation
- Putting everything together
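A simple CUDA sketch along the lines of this lesson: a vadd kernel indexed by block and thread, with cudaMalloc/cudaMemcpy marking the device/host separation; the array size and the 256-thread block size are arbitrary choices.

    #include <stdio.h>
    #include <cuda_runtime.h>

    __global__ void vadd(const float *a, const float *b, float *c, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   /* global thread index */
        if (i < n)
            c[i] = a[i] + b[i];
    }

    int main(void)
    {
        const int n = 1024;
        size_t bytes = n * sizeof(float);
        float h_a[1024], h_b[1024], h_c[1024];
        float *d_a, *d_b, *d_c;

        for (int i = 0; i < n; i++) { h_a[i] = i; h_b[i] = 2.0f * i; }

        cudaMalloc((void **)&d_a, bytes);                 /* device allocations */
        cudaMalloc((void **)&d_b, bytes);
        cudaMalloc((void **)&d_c, bytes);
        cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

        vadd<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n); /* 4 blocks x 256 threads */

        cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
        printf("c[10] = %f\n", h_c[10]);                  /* expect 30.0 */

        cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
        return 0;
    }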
Lesson 8: GPU Programming (part 2)
· Optimizing Performance
· Advanced CUDA Facilities
- Shared Memory (sketch below)
- Texture Memory
- Page-Locked Host Memory
- Concurrent Execution
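A shared-memory sketch for this lesson: a per-block reduction kernel stages 256 elements in __shared__ memory and halves the number of active threads after each __syncthreads(); it assumes a launch with 256 threads per block, e.g. block_sum<<<numBlocks, 256>>>(d_in, d_partial), and the kernel name is illustrative.

    __global__ void block_sum(const float *in, float *block_out)
    {
        __shared__ float s[256];                      /* one tile per block */
        int tid = threadIdx.x;

        s[tid] = in[blockIdx.x * blockDim.x + tid];   /* stage data in shared memory */
        __syncthreads();

        for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
            if (tid < stride)
                s[tid] += s[tid + stride];            /* pairwise partial sums */
            __syncthreads();
        }

        if (tid == 0)
            block_out[blockIdx.x] = s[0];             /* one partial result per block */
    }

The per-block partial sums can then be reduced on the host or by a second, smaller kernel.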
Lesson 9: Cluster Programming with MPI
· Introduction to Clusters
· Message-Passing Interface API (sketch below)
- MPI Environment Management
- Point-to-Point Communications
- Collective Communications
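A minimal MPI sketch covering environment management and one collective call: every rank contributes its rank number and MPI_Reduce sums them on rank 0 (compile with mpicc, run with mpirun).

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        int rank, size, sum;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's rank */
        MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of processes */

        /* Collective: sum all ranks onto rank 0. */
        MPI_Reduce(&rank, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("sum of ranks 0..%d = %d\n", size - 1, sum);

        MPI_Finalize();
        return 0;
    }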
Lesson 10: Exploiting Parallelism
· Introduction to Grids & Public Computing
· Parallelism at different levels
- Granularity and level
- Which way(s) to use?
- System-level concerns
· Closing remarks