This book addresses the question of how system software should be designed to account for faults, and which fault tolerance features it should provide for the highest reliability. This third edition of Software Design for Resilient Computer Systems has been thoroughly updated with the newest advice on software resilience. With a new introductory chapter, it is ideal for researchers and industry professionals.
In the book, the authors first show how system software interacts with the hardware to tolerate faults. They analyze and further develop the theory of fault tolerance to understand the diverse ways to increase the reliability of a system, with special attention to the role of system software in this process. They introduce the theory of redundancy and its use in constructing a subsystem through the generalised algorithm of fault tolerance (GAFT), and apply it to distributed systems. The book's approach is applied to various hardware subsystems (different structures of RAM and processor cores) and demonstrates exceptional performance, reliability, and energy efficiency. This third edition devotes substantial attention to system software for modern computers, including run-time systems, supporting algorithms of recovery and their analysis, language aspects, and ways to improve reconfigurable and parallel computing.
Because of its wide-reaching content, this book is relevant to a host of industries and research areas, including military, aviation, intensive health care, industrial control, and space exploration.