Dependable Computing Systems
ECE/CS 4434/6434
Fall 2025
Computing systems are used in critical domains like aerospace, energy, transportation, manufacturing, healthcare, and commerce. However, in practice, various unexpected faults, accidental or adversarial perturbations, human errors, or cyber-attacks can compromise the computing systems’ dependability and security, leading to catastrophic consequences such as injury, loss of life, damage to equipment, or financial loss.
This course focuses on the principles and emerging practices in designing and assessing dependable computing systems that can continue to operate correctly in the presence of faults and attacks. We will learn what can go wrong from the hardware, software, and algorithmic perspectives, how to predict, prevent, and detect faults/attacks, and how to design systems that can tolerate faults/attacks and recover from failures.
This year, we will have a special focus on the dependability and security of ML-enabled computing systems. Through paper discussions, class activities, and hands-on projects, we will explore the threats to the dependability, security, and safety of ML-enabled systems and examine the state-of-the-art techniques for detecting and mitigating such threats.
Topics:
- Introduction to Dependable Computing
  - Basic terminology, concepts, and attributes
  - Dependability evaluation techniques
- Quantitative Dependability Modeling and Evaluation
  - Probabilistic measures of dependability attributes
  - Combinatorial modeling for static redundancy
  - State-space modeling for dynamic redundancy
- Fault Tolerant Computing
  - Hardware fault tolerance
  - Information redundancy
  - Software fault tolerance
  - Checkpointing and recovery
  - Reliable networked systems
- Machine Learning Dependability and Security
  - Threats to ML reliability and security
  - ML verification and certification
  - ML recovery and repair
Time: Mon/Wed 11:00 AM - 12:15 PM
Location: Thornton Hall D115
Instructor: Homa Alemzadeh, ha4d@virginia.edu - Office: 259 Olsson Hall (Link Lab)
Teaching Assistants: TBD
Instructor Office Hours: Mon/Wed 12:30 - 1:00 PM or by appointment
TA Office Hours: TBD or by appointment
UVA Canvas Site (for lecture notes, homework submission, and grading)
Piazza (for questions, discussions, and polls)
Prerequisites: This course is intended for graduate and senior-level undergraduate students. Basic knowledge of probability and computer systems/architecture is required. A working knowledge of programming is required for the homework and final project.
Grading Policy:
| Component | Undergraduate | Graduate |
|---|---|---|
| Class Participation/Activity | 5% | 5% |
| Presentations: | | |
| -- Short presentations on real-world reliability/safety/security incidents/issues | 10% | 10% |
| -- Paper Presentations | 10% | 20% |
| Homework | 25% | 15% |
| Final Project | 30% | 30% |
| Midterm Exam | 20% | 20% |
Late homework incurs a 10% penalty per school day. You also have a grace period of three late days for the whole course to cover unexpected events such as sickness, travel, other deadlines, or interviews. This means you can always submit assignments late (except the presentation slides and the final exam), and if you use no more than three late days in total, you will not be penalized in any way. Late penalties and the grace period will be tallied at the end of the semester. Also, your lowest homework and class activity grades will be dropped.
References:
- I. Koren and C. Mani Krishna, Fault-Tolerant Systems, 1st edition, 2007, Morgan Kaufmann (available through the UVA Library).
- J. Knight, Fundamentals of Dependable Computing for Software Engineers, 2012, CRC Press (available through the UVA Library).
- K. Trivedi, Probability and Statistics with Reliability, Queuing and Computer Science Applications, 2nd edition, 2001, John Wiley & Sons (available through the UVA Library).
- D. K. Pradhan, Fault Tolerant Computer System Design, 1st edition, 1996, Prentice-Hall.
- B. W. Johnson, Design and Analysis of Fault Tolerant Digital Systems, 1988, Addison-Wesley Longman Publishing Co.