Pre-Symposium Tutorials
Tuesday, August 1, 2000
Full-Day Tutorial (9:00 a.m. - 4:30 p.m.)
Tutorial 1:
Access Grid Tutorial - Building and operating an Access Grid Node (Canceled)
Rick Stevens, Argonne National Laboratory and University of Chicago
This full day tutorial will cover all aspects of creating and operating
an Access Grid node. The tutorial begins with an overview, our
philosophy of the Access Grid, background material, a list of current
sites and a historical timeline of significant AG events. In section two
we discuss Access Grid architectural issues including hardware and
software choices and room considerations. In that section we go over
assembly, wiring diagrams, software installation and room layout,
lighting and gear arrangement. In section three we show how to operate
the AG, including how to manage sound and video for an optimal
experience, how to use the software included with the AG and how to
manage network problems. The last section shows how to use the AG in
different operation modes, from Lectures and Q&A sessions to Site Visits
to a Distributed Panel Session.
Selected portions of the tutorial will be supplemented or delivered by
others from remote AG nodes.
Morning Half-Day Tutorials (8:30 a.m. - 12:00 p.m.)
Tutorial 2:
Legion – The Grid Operating System (Canceled)
Andrew Grimshaw, University of Virginia
Legion is an integrated Grid operating system that has been deployed at commercial,
government, and academic sites around the world. Legion
· eliminates the need to move and install binaries manually on multiple platforms,
· supports single sign-on with strong PKI-based authentication and flexible access
control for users,
· provides a shared, secure virtual file system that spans all the machines in a Legion
system,
· supports remote execution of legacy codes, and their use in parameter space studies,
· provides transparent remote execution of sequential and parallel jobs on remote
resources using the “native” MPI implementation, and
· supports cross-site and cross-platform MPI execution of applications.
This tutorial will provide background on the Legion system and teach how
to run existing parallel codes within the Legion environment. The target
audience is supercomputer users who are already familiar with parallel
processing tools such as MPI, or who have the need to execute the same
application hundreds or thousands of times. The tutorial will consist of
an introduction to the Legion system, architecture, and object model;
followed by an in-depth presentation of the users’ view of Legion. We
will address issues such as logging on to the system, compiling and
registering binaries, and using MPI.
Selected Legion features include:
· Security: Security was built into Legion from its inception. Legion’s
security model supports strong PKI-based authentication with a single
sign-on for users, data integrity both on the wire and on disk, and
flexible access control for users. The result is a complete security
environment that protects all of the stakeholders in a Grid, from users
to resource owners.
· Distributed file system: Legion provides a transparent virtual file
system that spans all the machines in a Legion system. Input and output
files can be seen by all the parts of a computation, even when the
computation is split over multiple machines that don't share a common
file system. Different users can also use the virtual file system to
collaborate, sharing data files and even accessing the same running
computations. The Legion file system can be accessed via library
functions based on the Unix stdio and stream calls, via command line
tools such as legion_cat and legion_ls, or via NFS when the Legion-NFS
binding is used. The Legion-NFS binding provides completely transparent
access to the Legion file system for applications.
· Next generation applications: Legion's object-based architecture
dramatically simplifies building new applications and add-on tools for
tasks such as visualization, application steering, load monitoring, and
job migration.
· Transparent remote execution: Legion allows users to execute programs
throughout a Legion system without needing to know or specify where
they will be executed. (Of course the user can specify where they will
be executed, or the necessary characteristics of a host.) Legion will
take care of moving data as needed, dealing with security and access
control, etc. This is possible even with legacy codes where the sources
are not available. This capability is particularly powerful when used to
execute large numbers of jobs, for example to execute a parameter space
study.
· Binary management: Legion eliminates the need to move and install
binaries manually on multiple platforms. After Legion schedules a set of
tasks over multiple remote machines, it automatically transfers the
appropriate binaries to each host. A single job can run on multiple
heterogeneous architectures simultaneously; Legion will ensure that the
right binaries go to each, and that it only schedules onto architectures
for which it has binaries. Legion also provides legion_make, a utility
that remotely compiles applications on different architectures,
eliminating the need to log onto multiple platforms to build binaries.
· Fault-tolerance: Legion has a powerful distributed event/exception
management system that facilitates the construction of
application-specific failure detection and recovery schemes. The default
behavior is that when the Legion libraries detect an object failure the
object is automatically restarted. The event management system has been
used to construct a variety of application-specific fault-tolerance
libraries, including an MPI two-phase distributed consistent checkpoint
library that automatically restarts an application on detection of
failure. The same mechanism can be used to suspend the application for
later execution, or to migrate the application to a different set of
resources.
· Parallel computing: Legion supports MPI, PVM, a parallel C++, and a
parallel object-based Fortran. Legion-MPI applications can execute
across sites and across platforms. In addition, “native” MPI jobs can be
started remotely. These features also make Legion attractive to
administrators looking for ways to increase and simplify the use of
shared high-performance machines. The Legion implementation emphasizes
extensibility, and multiple policies for resource use can be embedded in
a single Legion system that spans multiple resources or even
administrative domains.
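For orientation, the sketch below is a minimal, standard MPI program in C of the kind the tutorial assumes attendees already have. Nothing in it is Legion-specific: the point of Legion's MPI support is that an unmodified code like this can be registered and run across sites and platforms. The Legion commands for registering binaries and launching jobs are covered in the tutorial itself and are not reproduced here.

/* A plain MPI "hello" skeleton; compile with an ordinary MPI compiler
   wrapper such as mpicc.  No Legion calls appear in the source. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Each task reports its share of the work; under Legion the tasks may
       be placed on different hosts or even on different architectures
       without any change to this source. */
    printf("task %d of %d starting\n", rank, size);

    MPI_Finalize();
    return 0;
}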
Biography:
Andrew S. Grimshaw is an Associate Professor of Computer Science and
Director of the Institute of Parallel Computation at the University of
Virginia. His research interests include high-performance parallel
computing, heterogeneous parallel computing, compilers for parallel
systems, operating systems, and high-performance parallel I/O. He is the
chief designer and architect of Mentat and Legion. Grimshaw received his
M.S. and Ph.D. from the University of Illinois at Urbana-Champaign in
1986 and 1988 respectively.
Andrew Grimshaw
Department of Computer Science
University of Virginia
Charlottesville, VA 22903
(804) 982-2204
fax: (804) 982-2214
grimshaw@Virginia.edu
Tutorial 3:
Java and High Performance Computing: The Past, Present and Future (Frick room)
Rajkumar Buyya
Monash University, Melbourne, Australia
Mark Baker
University of Portsmouth, UK
Java is potentially an excellent platform for developing large-scale
science and engineering applications. Java has several advantages: it
is a descendant of C++, comes with built-in multithreading and inherent
portability, and offers good support for aspects such as visualisation
and user interfaces.
The tutorial is divided into two parts. In order to encourage
participation of those not familiar with Java programming, the first part
covers an introduction to Java programming with emphasis on its key
features such as networking, concurrency, and graphics programming. The
second part covers issues related to parallel and distributed computing
using Java. In this tutorial we will look not only at how Java can be
used to develop high performance applications, but also at its role in the
computing infrastructure that enables these Java applications to run. We
will consider not only localised platforms, such as clusters of
computers, but also ones on a more global scale, such as the emerging
computational Grid systems.
The tutorial will cover the following areas:
o Java programming and core features
o Java constructs
o Multithreading
o Graphic programming
o Network programming
o Distributed programming
o Web building and social issues
o An overview of Java's potential as a platform for high-performance applications.
o A brief review of international efforts in this area.
o A discussion about the Java Grande Forum and their work on Java numerics, concurrency and applications.
o MPJ - the MPI-like interface to Java.
o A Jini-based infrastructure for supporting MPJ applications.
o A summary, where we discuss the lessons that have been learnt so far as well as the likely future trends of Java as a platform for high-performance computing.
Biography:
Rajkumar Buyya
Monash University, Australia
Rajkumar Buyya is a Research Scholar at the School of Computer Science
and Software Engineering, Monash University, Melbourne, Australia. He
was awarded the Dharma Ratnakara Memorial Trust Gold Medal for academic
excellence in 1992 by Kuvempu/Mysore University. He is co-author of the
books Mastering C++ and Microprocessor x86 Programming, and he recently
edited the two-volume book High Performance Cluster Computing:
Architectures and Systems (Vol. 1) and Programming and Application (Vol. 2),
published by Prentice Hall, USA. He has served as Guest Editor for
special issues of the international journals Parallel and Distributed
Computing Practices, Informatica: An International Journal of Computing
and Informatics, and the Journal of Supercomputing.
Rajkumar is a speaker in the IEEE Computer Society Chapter Tutorials
Program. Along with Mark Baker, he co-chairs the IEEE Computer Society
Task Force on Cluster Computing. He has contributed to the development
of the HPCC system software environment for the PARAM supercomputer
developed by the Centre for Development of Advanced Computing, India.
Rajkumar has conducted tutorials on advanced technologies such as Parallel,
Distributed and Multithreaded Computing, Client/Server Computing,
Internet and Java, Cluster Computing, and Java and High Performance
Computing at international conferences. He has organised and chaired
workshops, symposia, and conferences at the international level in the
areas of Cluster Computing and Grid Computing. He also serves as a
reporter for Asian Technology Information Program, Japan/USA. His
research papers have appeared in international conferences and journals.
His research interests include Programming Paradigms and Operating
Environments for Parallel and Distributed Computing.
Dr Mark Baker
University of Portsmouth, UK
Mark Baker started working in the field of High Performance Computing at
Edinburgh University (UK) in 1988. In Edinburgh he was involved in the
development of parallel linear solvers on large Transputer systems
using Occam. From 1990 until 1995 Mark was a project leader of a group
at the University of Southampton (UK). This group was involved in
developing and supporting environments and tools for a range of parallel
and distributed systems. It was whilst at Southampton that Mark started
to actively investigate and research software for managing and
monitoring distributed environments. In 1995 Mark took up a post as
Senior Research Scientist at NPAC, Syracuse University (USA). Whilst at
NPAC Mark researched and wrote a widely cited critical review of
Cluster Management Systems. At Syracuse Mark worked on a range of projects
involving the major HPC groups and labs in the US. It was during this
period that he worked closely with Prof. Geoffrey Fox on a variety of
cluster and metacomputing related projects.
Since 1996, Mark has been a Senior Lecturer in the Division of Computer
Science at the University of Portsmouth. At Portsmouth Mark lectures on
network architectures, client/server programming and open distributed
systems. Mark's current research is focused on the development of tools
and services for PC-based distributed systems. Mark also tracks
international metacomputing efforts and is involved with Java Grande and
the definition of a Java interface to MPI.
Mark has recently contributed a number of articles on cluster computing,
including a chapter for the Encyclopaedia of Microcomputers and a paper for
Software Practice and Experience, and was the editor of and a contributor to
a white paper on cluster computing. Mark is co-chair of the recently
established IEEE Computer Society Task Force on Cluster Computing (TFCC)
and is currently a visiting Senior Research Scientist at Oak Ridge
National Lab., USA.
Mark is on the international editorial board of the Wiley Journal,
Concurrency: Practice and Experience and regularly reviews papers for
many journals in his field, including IEEE Computer and Concurrency.
Mark gave the Cluster Computing tutorial at HPDC in Los Angeles in 1999.
A full list of Mark's activities can be found on his Web site.
Tutorial 4:
The Cactus Code: A framework for parallel scientific computing (Phipps room)
Gabrielle Allen
Gerd Lanfermann
Max-Planck-Institut fuer Gravitationsphysik
Albert Einstein Institut
Cactus is an open source parallel programming environment designed for
scientists and engineers. It has a modular structure enabling parallel
computation across different architectures and large-scale collaborative
code development between different groups. Users add their own application
modules, written in Fortran or C/C++, to complement the provided toolkit
modules, which provide access to computational features such as parallel
I/O, checkpointing and interpolation.
This tutorial will give a practical introduction to the Cactus Code,
describing its design requirements and their realization, as well as
the architecture of Cactus and the tools and capabilities it provides.
A worked example will demonstrate the implementation of a simple but
illustrative application, focusing in particular on the few steps required
to introduce parallelism. Finally, we will illustrate how Cactus can
provide easy access to many of the cutting edge software technologies
being developed in the academic research community, such as the Globus
Metacomputing Toolkit, HDF5 parallel I/O, adaptive mesh refinement, and
remote steering and visualization.
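As background for the parallelism discussion, the sketch below shows, in plain C and MPI, the kind of hand-written domain decomposition and ghost-zone exchange that a framework such as Cactus is designed to handle on the user's behalf. It is a generic illustration and does not use the Cactus API; with Cactus, the application module would essentially contain only the local update loop.

/* 1-D domain decomposition with ghost-zone exchange, written by hand.
   Each task owns NLOCAL interior points plus two ghost points. */
#include <mpi.h>

#define NLOCAL 100                        /* interior points per task */

int main(int argc, char **argv)
{
    double u[NLOCAL + 2], unew[NLOCAL + 2];   /* +2 for ghost points */
    int rank, size, left, right, i, step;
    MPI_Status st;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    left  = (rank == 0)        ? MPI_PROC_NULL : rank - 1;
    right = (rank == size - 1) ? MPI_PROC_NULL : rank + 1;

    for (i = 0; i < NLOCAL + 2; i++)
        u[i] = (double) rank;             /* arbitrary initial data */

    for (step = 0; step < 10; step++) {
        /* Ghost-zone exchange with the neighbouring tasks. */
        MPI_Sendrecv(&u[1], 1, MPI_DOUBLE, left, 0,
                     &u[NLOCAL + 1], 1, MPI_DOUBLE, right, 0,
                     MPI_COMM_WORLD, &st);
        MPI_Sendrecv(&u[NLOCAL], 1, MPI_DOUBLE, right, 1,
                     &u[0], 1, MPI_DOUBLE, left, 1,
                     MPI_COMM_WORLD, &st);

        /* Local update loop -- the part the application author would
           still write; the exchange above is what the framework hides. */
        for (i = 1; i <= NLOCAL; i++)
            unew[i] = 0.5 * (u[i - 1] + u[i + 1]);
        for (i = 1; i <= NLOCAL; i++)
            u[i] = unew[i];
    }

    MPI_Finalize();
    return 0;
}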
This course is targeted at scientists and engineers who want easy access to
parallel computational techniques, to computational scientists who wish
to make their tools or techniques available to a wide community of scientific
users, and to anyone interested in making high performance computing more
accessible for the average user.
Biography:
Gabrielle Allen is a research programmer at the Albert Einstein Institute
(Max Planck Institute for Gravitational Physics) where she has been a key
member of the Cactus Team for the past two years. Her research interests
include numerical relativity, scientific and high performance computing and
software development. Gabrielle received her PhD in computational astrophysics
from the University of Wales in 1993.
Gerd Lanfermann is a research programmer at the Albert Einstein
Institute (Max Planck Institute for Gravitational Physics) where he
has been a member of the Cactus Team and has been developing Cactus
for the past three years.
Gerd received his Diploma degree in theoretical physics
from the Free University of Berlin in 1999. He is especially interested in
the HPC aspects of numerical simulations, such as numerical relativity.
Contact Info:
Gabrielle Allen
Max-Planck-Institut fuer Gravitationsphysik
Albert Einstein Institut
Am Muehlenberg 5, D-14476 Golm, Germany
Email: allen@aei-potsdam.mpg.de
Phone: +49 331 5677471 (or 56770)
Mobile: +49 0177 6333909
Fax: +49 331 5677298
Afternoon Half-Day Tutorials (1:30 p.m. - 5:00 p.m.)
Tutorial 5:
Software Configuration for Clusters in a Production HPC Environment (Monongahela room)
Doug Johnson, Troy Baer and Jim Giuliani
Ohio Supercomputer Center
With the increases in performance of commodity hardware, and with more
exotic hardware now priced near commodity levels, clusters have become
a viable multi-user, high performance computing platform. At the same
time, cluster software environments have become more full-featured, but
also more complex. In this tutorial we will present
an overview of what we feel are the necessary software components and
implementation details for a viable computational science
platform. The topics covered will include development environment,
application performance analysis, and system management.
Software tools available on a cluster have increased to include many
different language and programming model choices. We will present a
survey of the compilers and languages available. The use of these
languages with shared memory, distributed shared memory and hybrid
programming models will be introduced along with libraries and
parallel application development frameworks.
Application performance analysis will be presented with a three-tier
approach to performance characterization: timing, profiling and
hardware utilization. This will include an introduction to tools
developed at OSC for analysis of hardware utilization.
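As a simple illustration of the first tier, timing, the sketch below wall-clock-times a small compute kernel with the standard gettimeofday() call. It is a generic example and does not use the OSC-developed hardware-utilization tools mentioned above.

/* Wall-clock timing of a kernel using gettimeofday(). */
#include <stdio.h>
#include <sys/time.h>

static double wall_seconds(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + 1.0e-6 * tv.tv_usec;
}

int main(void)
{
    const int n = 1000000;
    double sum = 0.0;
    double t0, t1;
    int i;

    t0 = wall_seconds();
    for (i = 0; i < n; i++)               /* the kernel being measured */
        sum += (double) i * 0.5;
    t1 = wall_seconds();

    printf("kernel result %g, elapsed %.6f seconds\n", sum, t1 - t0);
    return 0;
}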
Clusters offer a wide range of choices for system management.
System-wide monitoring of performance and resource availability will be
covered. Resource management will be presented, including scheduling,
parallel program execution, job environment modifications, interactive
use in a job-scheduled environment, accounting, and internal network
topology. Methods for remote administration, including access to
hardware-level resources, will be covered.
Biography:
Doug Johnson is a Systems Developer/Engineer at the Ohio Supercomputer
Center and is the technical lead for the center's clustering project.
Troy Baer has been a systems developer/engineer in the Science and Technology
Support Group at the Ohio Supercomputer Center (OSC) in Columbus, Ohio,
since 1998. Before working at OSC, Mr. Baer worked as
a graduate research associate at the Ohio State University Gas Turbine
Laboratory, and as an intern with
the Ohio Aerospace Institute at NASA Glenn Research Center in Cleveland,
Ohio. Mr. Baer holds bachelor's and master's degrees from the Ohio State
University in aeronautical and astronautical engineering, specializing in
computational fluid dynamics.
Jim Giuliani has been a systems developer/engineer in the Science and Technology
Support Group at the Ohio Supercomputer Center (OSC) since 1998. At OSC,
Jim leads training workshops, provides consultation, and helps convert
computer codes so researchers can efficiently use the Center's
supercomputers and licensed software. Prior to joining OSC, Jim served as
Operations Manager for The Ohio State University (OSU) Department of
Computer and Information Science. He has also held R&D positions in industry
and at The Ohio Aerospace Institute in Cleveland, Ohio.
Jim is the recipient of the NASA Lewis Awareness Award and the NASA
Certificate of Recognition for Software Development. He received a
Bachelor's degree in Aeronautical Engineering, with a minor in Computer
and Information Science, and a Master's degree in Mechanical
Engineering, both from OSU.
Tutorial 6:
Network-centric Computing with PUNCH: Learn How to Design and Implement a Computing Portal (Canceled)
Nirav H. Kapadia
José A. B. Fortes
Purdue University
Network-centric computing promises to revolutionize the way in which
computing services are delivered to the end-user. Analogous to the power
grids that distribute electricity today, computational grids will distribute
and deliver computing services to users anytime, anywhere. Corporations and
universities will be able to out-source their computing needs, and individual
users will be able to access and use specialized software via Web-based
computing portals.
This tutorial will 1) describe key issues that must be addressed in the
course of designing a wide-area network-computing infrastructure that
supports the service-based computing paradigm outlined above, 2) discuss
solutions in the context of the Purdue University Network Computing Hubs
(PUNCH), and 3) show attendees how to configure and bring up a computing
portal using PUNCH technologies. PUNCH is a computing portal that has been
operational for the past five years, and is used on a regular basis by about
850 users from 10 countries.
The goal of the tutorial is to provide insight into the following questions.
What is a network-computing system, and what parameters does one use to
characterize such systems? What factors determine the architecture of a
wide-area network-computing system? What are the technical implications of
crossing administrative boundaries, and how does one address the associated
problems? What does it take to reuse the World Wide Web as a general-purpose
interface to a wide-area network-computer? What are the types of operating
system services that are required to allow remote collaborators to access and
run legacy software applications via the Web? What are the implications of
wide-area computing from a resource management perspective, and how can one
reuse existing scheduling mechanisms? What does it take to manage and run a
computing portal? How does one quantify the benefits of such a service, and
what do the users think of it?
In addition to discussing the issues outlined above, the tutorial will
briefly touch on four advanced topics: 1) using virtual filesystems to access
remote data in an application-transparent manner, 2) a "system of systems"
approach to adaptive resource management in a computational grid, 3) using
predictive application-performance modeling to automate cost and performance
tradeoff decisions, and 4) performance and interoperability issues in
incorporating cluster management systems within a wide-area network-computing
environment.
The tutorial will start with a brief introduction to network-computing, and
will cover the issues outlined above at enough depth to allow the audience to
understand the associated implications --- without overwhelming them with
implementation details. Prerequisites for the tutorial are as follows:
1) basic knowledge of programming in a language such as C, 2) a general
idea of Unix- or Linux-based system operation, and 3) basic understanding of
the concept of distributed computing. Additional information on the tutorial,
including a tentative lecture outline, can be found at
www.ece.purdue.edu/~kapadia/Tutorials.
The concepts described in this tutorial have been implemented and tested in
the PUNCH infrastructure. PUNCH is a computing portal that has been
operational for five years. To date, it has been utilized by more than 3,000
users, who have logged over 3,000,000 hits and have initiated more than
200,000 runs. Today, PUNCH is used on a regular basis by approximately 850
users from 10 countries; it provides access to 50 engineering software
packages developed by 13 universities and 6 vendors. PUNCH is the enabling
technology for NETCARE (NETwork-computer for Computer Architecture Research
and Education; a NSF project involving Purdue, Northwestern, and U. of
Wisconsin-Madison), DesCArtES (Distributed Center for Advanced Electronics
Simulations; a NSF project involving U. of Illinois at Urbana-Champaign,
Arizona State Univ., Stanford, and Purdue), iPUNCH (a statewide
network-computer linking Purdue's campuses and technology centers), and the
eDA Hub (Electronic Design Automation Hub; in cooperation with SIGDA). PUNCH
can be accessed at www.ece.purdue.edu/punch.
Biography:
Nirav H. Kapadia is a senior research scientist in the School of Electrical
and Computer Engineering at Purdue University. His research interests are in
the areas of network-based and wide-area distributed computing, Web-based
computing portals, predictive application-performance modeling, and resource
management across institutional boundaries. He conceived, designed, and
developed the PUNCH network-computing infrastructure. Kapadia received the
B.E. degree in Electronics and Telecommunications from Maharashtra Institute
of Technology (India) in 1990, the M.S. degree in Electrical Engineering from
Purdue University in 1994, and the Ph.D. degree in Computational Engineering
from Purdue University in 1999. He is a member of Phi Beta Delta, an honor
society for international scholars. Additional information is available at
www.ece.purdue.edu/~kapadia.
José A. B. Fortes is a professor and assistant head for education in the
School of Electrical and Computer Engineering at Purdue University. His
research interests are in the areas of parallel processing, computer
architecture, network-computing, and fault-tolerant computing. He received
the B.S. degree in Electrical Engineering (Licenciatura em Engenharia
Electrotécnica) from the Universidade de Angola in 1978, the M.S. degree in
Electrical Engineering from the Colorado State University, Fort Collins in
1981, and the Ph.D. degree in Electrical Engineering from the University of
Southern California, Los Angeles in 1984. He is a Fellow of the Institute of
Electrical and Electronics Engineers (IEEE) professional society, and was a
Distinguished Visitor of the IEEE Computer Society from 1991 to 1995.
Additional information is available at www.ece.purdue.edu/~fortes.
Tutorial 7:
Programmable Networks (CANCELED)
Andrew T. Campbell
Center for Telecommunications Research
Columbia University
This tutorial had to be canceled because of a conflict. We encourage you to
attend another tutorial instead, or to participate in one of the two workshops
that take place before HPDC, the 4th Globus Retreat or the Active
Middleware Services workshop.
Recent advances in active network technology, open signaling and
control, distributed systems, service creation, resource
allocation and transportable software are driving a reexamination
of existing network architectures, middleware and the evolution
of control and management systems away from traditional
constrained solutions. The ability to dynamically create,
deploy and manage new network architectures, protocols and
services in response to user demands is creating a paradigm shift
in telecommunications. Network researchers are exploring new ways
in which network switches, routers and base stations can be
dynamically programmed by network applications, users, operators
and third parties to accelerate network innovation.
This trend reflects the acceptance of computing and middleware paradigms
in telecommunication networks. Programmable networks seek to exploit
advanced software techniques and technologies in order to make network
infrastructure more flexible, thereby allowing users and service
providers to customize network elements to meet their own specific
needs.
Customizing routing, signaling, resource allocation and accelerating
information processing in this manner raises a number of significant
security, reliability and performance issues. In this tutorial we will
discuss the state of the art in programmable networks. We will discuss
a number of important innovations that are creating a paradigm shift
in networking, leading to higher levels of network programmability. These
include:
- the separation between transmission hardware and control software,
- the availability of open programmable network interfaces,
- the accelerated virtualization of networking infrastructure,
- the rapid creation and deployment of new network services and
architectures, and
- environments for resource partitioning and coexistence of multiple
distinct network architectures.
Topics covered in this tutorial will include:
Open and innovative signaling systems
Active networks
Programming abstractions and interfaces for networks
Service creation platforms
Programming for mobility
Experimental architectures and implementations
Programming for QOS
Enabling technologies, platforms and languages
Support of multiple control planes
Control and resource APIs and object representations
Programmability support for virtual networks
The role of standards
Biography
Andrew T. Campbell is an Assistant Professor in
the Department of Electrical Engineering and a member of
the COMET Group at the Center for Telecommunications Research,
Columbia University, New York. His areas of interest include
open programmable networks, mobile networking, distributed systems
and QOS research. He is a past co-chair of the 5th IFIP/IEEE
International Workshop on Quality of Service (IWQOS97) and the
6th IEEE International Workshop on Mobile Multimedia Communications
(MOMUC99) and is currently the co-chair of the 4th IEEE Conference on
Open Architecture and Network Programming (OPENARCH 2001). Andrew
has been involved in building a number of programmable networks for
ATM (called xbind), mobile (called Mobiware) and IP (called Genesis)
networks. He is a guest editor for the IEEE Journal on Selected Areas in
Communications issue on Active and Programmable Networks and has been a
member of OPENSIG, the international working group on programmable
networks, since its creation. Andrew received his Ph.D. in Computer
Science in 1996, an IBM Faculty Award in 1998, and the NSF CAREER Award
for his research in programmable mobile networking in 1999.