Pre-Symposium Tutorials
Tuesday, August 1, 2000

Full Day Tutorial: (9:00 a.m. - 4:30 p.m.)

Tutorial 1: Access Grid Tutorial - Building and operating an Access Grid Node (Canceled)
Rick Stevens, Argonne National Laboratory and University of Chicago

This full day tutorial will cover all aspects of creating and operating an Access Grid node. The tutorial begins with an overview: our philosophy of the Access Grid, background material, a list of current sites and a historical timeline of significant AG events. In section two we discuss Access Grid architectural issues, including hardware and software choices and room considerations, and go over assembly, wiring diagrams, software installation, room layout, lighting and gear arrangement. In section three we show how to operate the AG, including how to manage sound and video for an optimal experience, how to use the software included with the AG and how to manage network problems. The last section shows how to use the AG in different operation modes, from lectures and Q&A sessions to site visits and distributed panel sessions.

Selected portions of the tutorial will be supplemented or delivered by others from remote AG nodes.

Morning Half-Day Tutorials (8:30 a.m. - 12:00 p.m.)

Tutorial 2: Legion – The Grid Operating System (Canceled)
Andrew Grimshaw, University of Virginia

Legion is an integrated Grid operating system that has been deployed at commercial, government, and academic sites around the world. Legion
· eliminates the need to move and install binaries manually on multiple platforms,
· supports single sign-on with strong PKI-based authentication and flexible access control for users,
· provides a shared, secure virtual file system that spans all the machines in a Legion system,
· supports remote execution of legacy codes and their use in parameter space studies,
· provides transparent remote execution of sequential and parallel jobs on remote resources using the “native” MPI implementation, and
· supports cross-site and cross-platform MPI execution of applications.

This tutorial will provide background on the Legion system and teach attendees how to run existing parallel codes within the Legion environment. The target audience is supercomputer users who are already familiar with parallel processing tools such as MPI, or who have the need to execute the same application hundreds or thousands of times. The tutorial will consist of an introduction to the Legion system, architecture, and object model, followed by an in-depth presentation of the users' view of Legion. We will address issues such as logging on to the system, compiling and registering binaries, and using MPI.

Selected Legion features include:

· Security: Security was built into Legion from its inception. Legion’s security model supports strong PKI-based authentication with a single sign-on for users, data integrity both on the wire and on disk, and flexible access control for users. The result is a complete security environment that protects all of the stakeholders in a Grid, from users to resource owners.

· Distributed file system: Legion provides a transparent virtual file system that spans all the machines in a Legion system. Input and output files can be seen by all the parts of a computation, even when the computation is split over multiple machines that don't share a common file system. Different users can also use the virtual file system to collaborate, sharing data files and even accessing the same running computations. The Legion file system can be accessed via library functions based on the Unix stdio and stream calls, via command line tools such as legion_cat and legion_ls, or via NFS when the Legion-NFS binding is used. The Legion-NFS binding gives applications completely transparent access to the Legion file system.

· Next generation applications: Legion's object-based architecture dramatically simplifies building new applications and add-on tools for tasks such as visualization, application steering, load monitoring, and job migration.

· Transparent remote execution: Legion allows users to execute programs throughout a Legion system without needing to know or specify where they will be executed. (Of course the user can specify where they will be executed, or the necessary characteristics of a host.) Legion will take care of moving data as needed, dealing with security and access control, etc. This is possible even with legacy codes where the sources are not available. This capability is particularly powerful when used to execute large numbers of jobs, for example in a parameter space study.

· Binary management: Legion eliminates the need to move and install binaries manually on multiple platforms. After Legion schedules a set of tasks over multiple remote machines, it automatically transfers the appropriate binaries to each host. A single job can run on multiple heterogeneous architectures simultaneously; Legion will ensure that the right binaries go to each, and that it only schedules onto architectures for which it has binaries. Legion also provides legion_make, a utility that remotely compiles applications on different architectures, eliminating the need to log onto multiple platforms to build binaries.

· Fault-tolerance: Legion has a powerful distributed event/exception management system that facilitates the construction of application-specific failure detection and recovery schemes. The default behavior is that when the Legion libraries detect an object failure the object is automatically restarted. The event management system has been used to construct a variety of application-specific fault-tolerance libraries, including an MPI two-phase distributed consistent checkpoint library that automatically restarts an application on detection of failure. The same mechanism can be used to suspend the application for later execution, or to migrate the application to a different set of resources.

· Parallel computing: Legion supports MPI, PVM, a parallel C++, and a parallel object-based Fortran. Legion-MPI applications can execute across sites and across platforms. In addition, “native” MPI jobs can be started remotely. These features also make Legion attractive to administrators looking for ways to increase and simplify the use of shared high-performance machines. The Legion implementation emphasizes extensibility, and multiple policies for resource use can be embedded in a single Legion system that spans multiple resources or even administrative domains.
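
For readers unfamiliar with MPI, the sketch below shows the kind of ordinary MPI program, written here in C, that Legion aims to schedule and run on remote resources without source changes; it is a generic illustration only and does not use any Legion-specific calls or commands.

    /* A minimal, generic MPI program in C: each process learns its rank,
     * and the ranks are summed onto process 0. Under a system such as
     * Legion, the same unmodified source could be compiled for several
     * architectures and scheduled onto remote hosts. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size, sum = 0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* A trivial global operation standing in for real communication. */
        MPI_Reduce(&rank, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("%d processes, sum of ranks = %d\n", size, sum);

        MPI_Finalize();
        return 0;
    }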

Biography:

Andrew S. Grimshaw is an Associate Professor of Computer Science and Director of the Institute of Parallel Computation at the University of Virginia. His research interests include high-performance parallel computing, heterogeneous parallel computing, compilers for parallel systems, operating systems, and high-performance parallel I/O. He is the chief designer and architect of Mentat and Legion. Grimshaw received his M.S. and Ph.D. from the University of Illinois at Urbana-Champaign in 1986 and 1988 respectively.

Andrew Grimshaw
Department of Computer Science
University of Virginia
Charlottesville, VA 22903

(804) 982-2204
fax: (804) 982-2214
grimshaw@Virginia.edu


Tutorial 3: Java and High Performance Computing: The Past, Present and Future (Frick room)
Rajkumar Buyya
Monash University, Melbourne, Australia

Mark Baker
University of Portsmouth, UK

Java is potentially an excellent platform for developing large-scale science and engineering applications. Java has several advantages: it is a descendant of C++, it provides built-in multithreading and inherent portability, and it offers good support for aspects such as visualisation and user interfaces.

The tutorial is divided into two parts. In order to encourage participation by those not familiar with Java programming, the first part covers an introduction to Java programming with emphasis on its key features such as networking, concurrency, and graphics programming. The second part covers issues related to parallel and distributed computing using Java. In this tutorial we will look not only at how Java can be used to develop high performance applications, but also at its role in the computing infrastructure that enables these Java applications to run. We will consider not only localised platforms, such as clusters of computers, but also ones on a more global scale, such as the emerging computational Grid systems.

The tutorial will cover the following areas:

o Java programming and core features
o Java constructs
o Multithreading
o Graphics programming
o Network programming
o Distributed programming
o Web building and social issues
o An overview of Java's potential as a platform for high-performance applications.
o A brief review of international efforts in this area.
o A discussion about the Java Grande Forum and their work on Java numerics, concurrency and applications.
o MPJ - the MPI-like interface to Java.
o A Jini-based infrastructure for supporting MPJ applications.
o A summary, where we discuss the lessons that have been learnt so far as well as the likely future trends of Java as a platform for high-performance computing.

Biography:

Rajkumar Buyya
Monash University, Australia

Rajkumar Buyya is a Research Scholar at the School of Computer Science and Software Engineering, Monash University, Melbourne, Australia. He was awarded the Dharma Ratnakara Memorial Trust Gold Medal for academic excellence in 1992 by Kuvempu/Mysore University. He is a co-author of the books Mastering C++ and Microprocessor x86 Programming, and has recently edited a two-volume book, High Performance Cluster Computing: Architectures and Systems (Vol. 1) and Programming and Applications (Vol. 2), published by Prentice Hall, USA. He has served as guest editor for special issues of the international journals Parallel and Distributed Computing Practices, Informatica: An International Journal of Computing and Informatics, and Journal of Supercomputing.

Rajkumar is a speaker in the IEEE Computer Society Chapter Tutorials Program. Along with Mark Baker, he co-chairs the IEEE Computer Society Task Force on Cluster Computing. He has contributed to the development of the HPCC system software environment for the PARAM supercomputer developed by the Centre for Development of Advanced Computing, India.

Rajkumar has conducted tutorials on advanced technologies such as Parallel, Distributed and Multithreaded Computing, Client/Server Computing, Internet and Java, Cluster Computing, and Java and High Performance Computing at international conferences. He has organised and chaired workshops, symposia, and conferences at the international level in the areas of Cluster Computing and Grid Computing. He also serves as a reporter for the Asian Technology Information Program, Japan/USA. His research papers have appeared in international conferences and journals. His research interests include Programming Paradigms and Operating Environments for Parallel and Distributed Computing.

Dr Mark Baker, University of Portsmouth, UK

Mark Baker started working in the field of High Performance Computing at Edinburgh University (UK) in 1988. In Edinburgh he was involved in the development of parallel linear solvers on large Transputer systems using Occam. From 1990 until 1995 Mark was a project leader of a group at the University of Southampton (UK). This group was involved in developing and supporting environments and tools for a range of parallel and distributed systems. It was whilst at Southampton that Mark started to actively investigate and research software for managing and monitoring distributed environments. In 1995 Mark took up a post as Senior Research Scientist at NPAC, Syracuse University (USA). Whilst at NPAC Mark researched and wrote the widely cited critical review of cluster management systems. At Syracuse Mark worked on a range of projects involving the major HPC groups and labs in the US. It was during this period that he worked closely with Prof. Geoffrey Fox on a variety of cluster and metacomputing related projects.

Since 1996, Mark has been a Senior Lecturer in the Division of Computer Science at the University of Portsmouth. At Portsmouth Mark lectures on network architectures, client/server programming and open distributed systems. Mark's current research is focused on the development of tools and services for PC-based distributed systems. Mark also tracks international metacomputing efforts and is involved with Java Grande and the definition of a Java interface to MPI.

Mark has recently contributed a number of articles on cluster computing, including a chapter for the Encyclopaedia of Microcomputers and a paper for Software Practice and Experience, and was the editor of and a contributor to a white paper on cluster computing. Mark is co-chair of the recently established IEEE Computer Society Task Force on Cluster Computing (TFCC) and is currently a visiting Senior Research Scientist at Oak Ridge National Lab., USA.

Mark is on the international editorial board of the Wiley journal Concurrency: Practice and Experience and regularly reviews papers for many journals in his field, including IEEE Computer and Concurrency. Mark gave the Cluster Computing tutorial at HPDC in Los Angeles in 1999. A full list of Mark's activities can be found on his Web site.

Tutorial 4: The Cactus Code: A framework for parallel scientific computing (Phipps room)
Gabrielle Allen
Gerd Lanfermann
Max-Planck-Institut fuer Gravitationsphysik
Albert Einstein Institut

Cactus is an open source parallel programming environment designed for scientists and engineers. It has a modular structure enabling parallel computation across different architectures and large-scale collaborative code development between different groups. Users add their own application modules, written in Fortran or C/C++, to complement the provided toolkit modules, which provide access to computational features such as parallel I/O, checkpointing and interpolation.

This tutorial will give a practical introduction to the Cactus Code, describing its design requirements and their realization, as well as the architecture of Cactus and the tools and capabilities it provides. A worked example will demonstrate the implementation of a simple but illustrative application, focusing in particular on the few steps required to introduce parallelism. Finally, we will illustrate how Cactus can provide easy access to many of the cutting edge software technologies being developed in the academic research community, such as the Globus Metacomputing Toolkit, HDF5 parallel I/O, adaptive mesh refinement, and remote steering and visualization.
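
For orientation, the fragment below is a minimal plain-MPI sketch in C of the ghost-zone (halo) exchange that a framework such as Cactus performs on the application writer's behalf for a distributed grid; it deliberately does not use the Cactus API, and the names, grid size and update rule in it are illustrative assumptions.

    /* Illustrative plain-MPI sketch (not the Cactus API): the ghost-zone
     * exchange that a parallel driver hides from the module writer.
     * Each process owns N interior points of a 1-D grid plus two ghosts. */
    #include <mpi.h>

    #define N 100   /* interior points per process (illustrative) */

    static void exchange_ghosts(double *u, int rank, int size)
    {
        MPI_Status st;
        int left  = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
        int right = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

        /* Send my first interior point left, receive my right ghost zone. */
        MPI_Sendrecv(&u[1], 1, MPI_DOUBLE, left, 0,
                     &u[N + 1], 1, MPI_DOUBLE, right, 0,
                     MPI_COMM_WORLD, &st);
        /* Send my last interior point right, receive my left ghost zone. */
        MPI_Sendrecv(&u[N], 1, MPI_DOUBLE, right, 1,
                     &u[0], 1, MPI_DOUBLE, left, 1,
                     MPI_COMM_WORLD, &st);
    }

    int main(int argc, char **argv)
    {
        double u[N + 2], unew[N + 2];
        int rank, size, i, step;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        for (i = 0; i < N + 2; i++)
            u[i] = (rank == 0 && i == 1) ? 1.0 : 0.0;   /* toy initial data */

        for (step = 0; step < 10; step++) {
            exchange_ghosts(u, rank, size);             /* parallel boundary update */
            for (i = 1; i <= N; i++)                    /* purely local computation */
                unew[i] = u[i] + 0.25 * (u[i - 1] - 2.0 * u[i] + u[i + 1]);
            for (i = 1; i <= N; i++)
                u[i] = unew[i];
        }

        MPI_Finalize();
        return 0;
    }

In Cactus, the equivalent of the ghost-zone exchange is supplied by the framework's parallel driver, so an application module is left with essentially the local update loop.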

This course is targeted at scientists and engineers who want easy access to parallel computational techniques, at computational scientists who wish to make their tools or techniques available to a wide community of scientific users, and at anyone interested in making high performance computing more accessible for the average user.



Biography:

Gabrielle Allen is a research programmer at the Albert Einstein Institute (Max Planck Institute for Gravitational Physics), where she has been a key member of the Cactus Team for the past two years. Her research interests include numerical relativity, scientific and high performance computing, and software development. Gabrielle received her PhD in computational astrophysics from the University of Wales in 1993.

Gerd Lanfermann is a research programmer at the Albert Einstein Institute (Max Planck Institute for Gravitational Physics), where he has been a member of the Cactus Team and has been developing Cactus for the past three years. Gerd received his Diploma degree in theoretical physics from the Free University of Berlin in 1999. He is especially interested in the HPC aspects of numerical simulations, such as numerical relativity.

Contact Info:

Gabrielle Allen
Max-Planck-Institut fuer Gravitationsphysik
Albert Einstein Institut
Am Muehlenberg 5, D-14476 Golm, Germany
Email: allen@aei-potsdam.mpg.de
Phone: +49 331 5677471 (or 56770)
Mobile: +49 0177 6333909
Fax: +49 331 5677298


Afternoon Half-Day Tutorials (1:30 p.m. - 5:00 p.m.)

Tutorial 5: Software Configuration for Clusters in a Production HPC Environment (Monongahela room)
Doug Johnson, Troy Baer and Jim Giuliani
Ohio Supercomputer Center

With the increases in performance of commodity hardware, and with more exotic hardware now priced near what could be considered commodity levels, the viability of clusters as a multi-user, high performance computing platform has become more concrete. At the same time, the software environments have become more full-featured, but also more complex. In this tutorial we will present an overview of what we feel are the necessary software components and implementation details for a viable computational science platform. The topics covered will include the development environment, application performance analysis, and system management.

The software tools available on clusters have grown to include many different language and programming model choices. We will present a survey of the available compilers and languages. The use of these languages with shared memory, distributed shared memory and hybrid programming models will be introduced, along with libraries and parallel application development frameworks.
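
As one concrete illustration of the hybrid model (message passing between nodes, threads within a node), here is a minimal sketch in C combining MPI and OpenMP; the kernel, loop bounds and build command are illustrative assumptions rather than material from the tutorial itself.

    /* Illustrative hybrid MPI + OpenMP kernel: MPI processes across nodes,
     * OpenMP threads within each node. A typical build command would look
     * something like: mpicc -fopenmp hybrid.c -o hybrid (compiler dependent). */
    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank;
        double local = 0.0, global = 0.0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Each MPI process uses OpenMP threads for its share of the work. */
        #pragma omp parallel for reduction(+:local)
        for (int i = 0; i < 1000000; i++)
            local += 1.0 / (1.0 + i);              /* stand-in for a real kernel */

        /* Combine per-process partial results across the cluster. */
        MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("global sum = %f (up to %d threads per process)\n",
                   global, omp_get_max_threads());

        MPI_Finalize();
        return 0;
    }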

Application performance analysis will be presented with a three-tier approach to performance characterization: timing, profiling and hardware utilization. This will include an introduction to tools developed at OSC for the analysis of hardware utilization.
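
By way of example, the first tier of that approach can be as simple as bracketing a kernel with wall-clock timers, as in the C sketch below; the kernel and loop count are placeholders, and the profiling flag mentioned in the comment is the conventional one for gprof rather than anything OSC-specific.

    /* Illustrative first-tier timing: measure a kernel with wall-clock timers.
     * For the profiling tier, the same code can typically be rebuilt with a
     * flag such as -pg and examined with gprof. */
    #include <stdio.h>
    #include <sys/time.h>

    static double wall_seconds(void)
    {
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return tv.tv_sec + tv.tv_usec * 1.0e-6;
    }

    int main(void)
    {
        double t0, t1, sum = 0.0;
        int i;

        t0 = wall_seconds();
        for (i = 0; i < 10000000; i++)      /* placeholder for the kernel under study */
            sum += 1.0 / (1.0 + i);
        t1 = wall_seconds();

        printf("kernel took %.3f seconds (result %f)\n", t1 - t0, sum);
        return 0;
    }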

Clusters offer a wide range of choices for system management. System-wide monitoring of performance and resource availability will be covered. Resource management, including scheduling, parallel program execution, job environment modification, interactive use in a job-scheduled environment, accounting and internal network topology, will be presented. Methods for remote administration, including access to hardware level resources, will be covered.


Biography:

Doug Johnson is a Systems Developer/Engineer at the Ohio Supercomputer Center and is the technical lead for the center's clustering project.

Troy Baer has been a systems developer/engineer in the Science and Technology Support Group at the Ohio Supercomputer Center (OSC) in Columbus, Ohio, since 1998. Before working at OSC, Mr. Baer worked as a graduate research associate at the Ohio State University Gas Turbine Laboratory, and as an intern with the Ohio Aerospace Institute at NASA Glenn Research Center in Cleveland, Ohio. Mr. Baer holds bachelor's and master's degrees from the Ohio State University in aeronautical and astronautical engineering, specializing in computational fluid dynamics.

Jim Giuliani has been a systems developer/engineer in the Science and Technology Support Group at the Ohio Supercomputer Center (OSC) since 1998. At OSC, Jim leads training workshops, provides consultation, and helps convert computer codes so researchers can efficiently use the Center's supercomputers and licensed software. Prior to joining OSC, Jim served as Operations Manager for The Ohio State University (OSU) Department of Computer and Information Science. He has also held R&D positions in industry and at The Ohio Aerospace Institute in Cleveland, Ohio.
Jim is the recipient of the NASA Lewis Awareness Award and the NASA Certificate of Recognition for Software Development. He received a Bachelor's degree in Aeronautical Engineering, with a minor in Computer and Information Science, and a Master's degree in Mechanical Engineering, both from OSU.

Tutorial 6: Network-centric Computing with PUNCH: Learn How to Design and Implement a Computing Portal (Canceled)
Nirav H. Kapadia
Jose' A. B. Fortes
Purdue University

Network-centric computing promises to revolutionize the way in which computing services are delivered to the end-user. Analogous to the power grids that distribute electricity today, computational grids will distribute and deliver computing services to users anytime, anywhere. Corporations and universities will be able to out-source their computing needs, and individual users will be able to access and use specialized software via Web-based computing portals.

This tutorial will 1) describe key issues that must be addressed in the course of designing a wide-area network-computing infrastructure that supports the service-based computing paradigm outlined above, 2) discuss solutions in the context of the Purdue University Network Computing Hubs (PUNCH), and 3) show attendees how to configure and bring up a computing portal using PUNCH technologies. PUNCH is a computing portal that has been operational for the past five years, and is used on a regular basis by about 850 users from 10 countries.

The goal of the tutorial is to provide insight into the following questions. What is a network-computing system, and what parameters does one use to characterize such systems? What factors determine the architecture of a wide-area network-computing system? What are the technical implications of crossing administrative boundaries, and how does one address the associated problems? What does it take to reuse the World Wide Web as a general-purpose interface to a wide-area network-computer? What are the types of operating system services that are required to allow remote collaborators to access and run legacy software applications via the Web? What are the implications of wide-area computing from a resource management perspective, and how can one reuse existing scheduling mechanisms? What does it take to manage and run a computing portal? How does one quantify the benefits of such a service, and what do the users think of it?

In addition to discussing the issues outlined above, the tutorial will briefly touch on four advanced topics: 1) using virtual filesystems to access remote data in an application-transparent manner, 2) a "system of systems" approach to adaptive resource management in a computational grid, 3) using predictive application-performance modeling to automate cost and performance tradeoff decisions, and 4) performance and interoperability issues in incorporating cluster management systems within a wide-area network-computing environment.

The tutorial will start with a brief introduction to network-computing, and will cover the issues outlined above at enough depth to allow the audience to understand the associated implications --- without overwhelming them with implementation details. Prerequisites for the tutorial are as follows: 1) basic knowledge of programming in a language such as `C', 2) a general idea of Unix- or Linux-based system operation, and 3) basic understanding of the concept of distributed computing. Additional information on the tutorial, including a tentative lecture outline, can be found at www.ece.purdue.edu/~kapadia/Tutorials.

The concepts described in this tutorial have been implemented and tested in the PUNCH infrastructure. PUNCH is a computing portal that has been operational for five years. To date, it has been utilized by more than 3,000 users, who have logged over 3,000,000 hits and have initiated more than 200,000 runs. Today, PUNCH is used on a regular basis by approximately 850 users from 10 countries; it provides access to 50 engineering software packages developed by 13 universities and 6 vendors. PUNCH is the enabling technology for NETCARE (NETwork-computer for Computer Architecture Research and Education; an NSF project involving Purdue, Northwestern, and U. of Wisconsin-Madison), DesCArtES (Distributed Center for Advanced Electronics Simulations; an NSF project involving U. of Illinois at Urbana-Champaign, Arizona State Univ., Stanford, and Purdue), iPUNCH (a statewide network-computer linking Purdue's campuses and technology centers), and the eDA Hub (Electronic Design Automation Hub; in cooperation with SIGDA). PUNCH can be accessed at www.ece.purdue.edu/punch.

Biography:
Nirav H. Kapadia is a senior research scientist in the School of Electrical and Computer Engineering at Purdue University. His research interests are in the areas of network-based and wide-area distributed computing, Web-based computing portals, predictive application-performance modeling, and resource management across institutional boundaries. He conceived, designed, and developed the PUNCH network-computing infrastructure. Kapadia received the B.E. degree in Electronics and Telecommunications from Maharashtra Institute of Technology (India) in 1990, the M.S. degree in Electrical Engineering from Purdue University in 1994, and the Ph.D. degree in Computational Engineering from Purdue University in 1999. He is a member of Phi Beta Delta, an honor society for international scholars. Additional information is available at www.ece.purdue.edu/~kapadia.

Jose' A. B. Fortes is a professor and assistant head for education in the School of Electrical and Computer Engineering at Purdue University. His research interests are in the areas of parallel processing, computer architecture, network-computing, and fault-tolerant computing. He received the B.S. degree in Electrical Engineering (Licenciatura em Engenharia Electrote'cnica) from the Universidade de Angola in 1978, the M.S. degree in Electrical Engineering from Colorado State University, Fort Collins in 1981, and the Ph.D. degree in Electrical Engineering from the University of Southern California, Los Angeles in 1984. He is a Fellow of the Institute of Electrical and Electronics Engineers (IEEE) and was a Distinguished Visitor of the IEEE Computer Society from 1991 to 1995. Additional information is available at www.ece.purdue.edu/~fortes.

Tutorial 7: Programmable Networks (CANCELED)
Andrew T. Campbell
Center for Telecommunications Research
Columbia University

This tutorial had to be canceled because of a conflict. We encourage you to attend another tutorial instead, or to participate in one of the two workshops that take place before HPDC: the 4th Globus Retreat or the Active Middleware Services workshop.

Recent advances in active network technology, open signaling and control, distributed systems, service creation, resource allocation and transportable software are driving a reexamination of existing network architectures, middleware and the evolution of control and management systems away from traditional constrained solutions. The ability to dynamically create, deploy and manage new network architectures, protocols and services in response to user demands is creating a paradigm shift in telecommunications. Network researchers are exploring new ways in which network switches, routers and base stations can be dynamically programmed by network applications, users, operators and third parties to accelerate network innovation.

This trend reflects the acceptance of computing and middleware paradigms in telecommunication networks. Programmable networks seek to exploit advanced software techniques and technologies in order to make network infrastructure more flexible, thereby allowing users and service providers to customize network elements to meet their own specific needs. Customizing routing, signaling, resource allocation and accelerating information processing in this manner raises a number of significant security, reliability and performance issues. In this tutorial we will discuss the state of the art in programmable networks. We will discuss a number of important innovations that are creating a paradigm shift in networking, leading to higher levels of network programmability. These include:

-Separation between transmission hardware and control software,
-Availability of open programmable network interfaces,
-Accelerated virtualization of networking infrastructure,
-Rapid creation and deployment of new network services and architectures, and
-Environments for resource partitioning and coexistence of multiple distinct network architectures.

Topics covered in this tutorial will include:

Open and innovative signaling systems
Active networks
Programming abstractions and interfaces for networks
Service creation platforms
Programming for mobility
Experimental architectures and implementations
Programming for QOS
Enabling technologies, platforms and languages
Support of multiple control planes
Control and resource APIs and object representations
Programmability support for virtual networks
The role of standards


Biography

Andrew T. Campbell is an Assistant Professor in the Department of Electrical Engineering and a member of the COMET Group at the Center for Telecommunications Research, Columbia University, New York. His areas of interest include open programmable networks, mobile networking, distributed systems and QOS research. He is a past co-chair of the 5th IFIP/IEEE International Workshop on Quality of Service (IWQOS97) and the 6th IEEE International Workshop on Mobile Multimedia Communications (MOMUC99), and is currently the co-chair of the 4th IEEE Conference on Open Architecture and Network Programming (OPENARCH 2001). Andrew has been involved in building a number of programmable networks for ATM (called xbind), mobile (called Mobiware) and IP (called Genesis) networks. He is a guest editor for the IEEE Journal on Selected Areas in Communications issue on Active and Programmable Networks and has been a member of OPENSIG, the international working group on programmable networks, since its creation. Andrew received his Ph.D. in Computer Science in 1996, an IBM Faculty Award in 1998, and an NSF CAREER Award in 1999 for his research in programmable mobile networking.