University of Surrey - Guildford

2010/1 Module Catalogue
 Module Code: COM3012 Module Title: PARALLEL ARCHITECTURES
Module Provider: Computing Short Name: COM3012
Level: HE3 Module Co-ordinator: BRIFFA JA Dr (Computing)
Number of credits: 15 Number of ECTS credits: 7.5
Module Availability


Assessment Pattern


Unit(s) of Assessment | Weighting Towards Module Mark (%)
4-hour practical examination |
Coursework (individual): 1 piece of assigned coursework, due in week 6 |
Coursework (individual): 1 piece of assigned coursework, due in week 10 |

Qualifying Condition(s)

A weighted aggregate of 40% is required to pass the module.


Module Overview


The course introduces the concepts of parallel computing by considering the different architectures that support it, and by working through different categories of examples. Implementing such solutions and analysing them gives practical experience and an understanding of the difficulties involved.



Good programming skills are assumed; prior knowledge of C/C++ is helpful, as these are the languages used throughout the module. Students who do not have a C/C++ background are strongly encouraged to take a prior course on C++.

Module Aims

The module aims to develop the student’s ability to think clearly about the relationship between a problem abstraction and architectural implementation details. It focuses on techniques for developing solutions to scientific computing problems on parallel architectures, with a number of case studies illustrating facets of the subject. Students will gain experience in building parallel solutions for scientific computing problems, mostly using C and C++ to access the parallel computing libraries of the various architectures.

Learning Outcomes


By the end of the course the students will be able to:


1. explain the major benefits and limitations of parallel computing;

2. identify and explain the differences between available parallel architectures;

3. develop parallel solutions for scientific computing problems on various architectures;

4. analyse the performance of a parallel solution.


Module Content


The lesson schedule and contents below are subject to change, following revisions based on earlier delivery.


Lesson 1: Introduction


·          Motivation (why parallelize?)


·          Types of parallelism:


-          Implicit vs Explicit


-          Flynn's Taxonomy


·          Example architectures:


-          SMP/multicore; Cell BE; GPGPU;


-          clusters; grids; clouds;


·          Problem bottlenecks/classification:


-          compute-bound


-          I/O-bound (or memory-bound)


·          Programming models:


-          shared memory (e.g. threads)


-          message-passing (e.g. MPI)


-          vectorization


·          Introduce systems used in class:


-          Intel Core 2 Duo E8400 @3.00GHz (duck/swan labs)


-          Intel Xeon E3113 @3.00GHz (ps3 cluster head node)


-          nVidia 9400GT (duck/swan labs)


-          Cell BE (ps3 cluster)


Lesson 2: Threading


·          Thread basics


-          What are threads?


-          Threads and processes


-          Advantages of thread programming


-          Common models for programming with threads


·          POSIX threads


-          Creation and termination


-          Synchronization primitives


-          Joining and detaching


-          Mutexes


Lesson 3: SIMD Extensions


·          Introducing SIMD Extensions


-          Vector vs Scalar processing


-          Vector data types


-          Byte order


-          Programming interfaces


·          SIMD Programming with GCC


-          Rudimentary example


-          Vector reorganization


-          Data type conversion


-          Elimination of conditional branches


Lesson 4: Cell BE Programming (part 1)


·          The Cell Broadband Engine


-          Cell BE Architecture


-          Cell BE Programming Model


·          SPE Programming


-          A simple SPE program


-          Using DMA transfer


Lesson 5: Cell BE Programming (part 2)


·          Parallel SPE Programming


-          SIMD programming on the SPE


-          Using Multiple SPEs


·          Advanced Cell Programming


-          Communication between PPE and SPE (mailboxes, signal notification registers)


Lesson 6: Cell BE Programming (part 3)


·          Advanced Cell Programming


-          Effective Utilization of DMA Transfer


-          Differences in PPE and SPE Data Representation


-          Vector Data Alignment


-          Scalar Operations on SPE


-          SPE Program Embedding


Lesson 7: GPU Programming (part 1)


·          Introduction to GPU Programming


-          GPUs and GPU Architecture


-          Introduction to CUDA


·          GPU Programming with CUDA


-          A simple CUDA example


-          Blocks and thread hierarchy


-          Memory hierarchy


-          Device and host separation


-          Putting everything together


Lesson 8: GPU Programming (part 2)


·          Optimizing Performance


·          Advanced CUDA Facilities


-          Shared Memory


-          Texture Memory


-          Page-Locked Host Memory


-          Concurrent Execution


Lesson 9: Cluster Programming with MPI


·          Introduction to Clusters


·          Message-Passing Interface API


-          MPI Environment Management


-          Point-to-Point Communications


-          Collective Communications


Lesson 10: Exploiting Parallelism


·          Introduction to Grids & Public Computing


·          Parallelism at different levels


-          Granularity and level


-          Which way(s) to use?


-          System-level concerns


·          Closure



Methods of Teaching/Learning


The delivery pattern will consist of:


·          10 one-hour lectures in weeks 1-10 (1 per week)


·          Labs will be scheduled after class in weeks 1-10, initially based on exercises and subsequently to support coursework


Coursework and labs will address learning outcome 3; each assignment and lab exercise will deal with a specific architecture. The labs require specialized equipment, and the students will have 24-hour access to the facilities.


Selected Texts/Journals


There is no single core text that covers the whole course. The following are recommendations for reading.


Online Texts:


·          Arevalo et al., “Programming the Cell Broadband Engine™ Architecture: Examples and Best Practices”, IBM Redbooks, Aug 2008. ISBN 0738485942. Available online.


·          “Cell Broadband Engine Programming Handbook”, Version 1.12, April 2009. Available online.


·          “Cell Programming Primer”, Version 1.6, Feb 2008. Available online.


·          “nVidia CUDA™ Programming Guide”, Version 2.3, July 2009. Available online.


·          “nVidia CUDA™ C Programming Best Practices Guide”, Version 2.3, July 2009. Available online.


Recommended Texts:


·          Grama et al., “Introduction to Parallel Computing”, Second edition, Addison-Wesley, 2003.


·          Mattson et al., “Patterns for Parallel Programming”, Addison-Wesley, 2005.


·          Scarpino, "Programming the Cell Processor", Prentice Hall, 2009.


·          David B. Kirk and Wen-mei W. Hwu, “Programming Massively Parallel Processors: A Hands-on Approach”, Morgan Kaufmann, 2010. ISBN 0123814723


Background / Reference:


·          Stroustrup, “The C++ Programming Language”, Special Edition, Addison-Wesley, 2000.


·          Eckel, “Thinking in C++”, Vol. 1-2, Second edition, Prentice-Hall, 2000.



Last Updated

April 2011