Definition (in the context of systems and data centers):

Fault tolerance is the ability of a computer system or digital infrastructure to continue functioning properly despite the failure of some of its components.
How it works:
- The system is designed to detect errors and automatically compensate for them.
- It uses mechanisms such as hardware or software redundancy, data replication, and recovery protocols.
- In many cases, the end user does not notice any interruption, as the system “hides” the failure.
Example:
In a data center, if a hard drive fails, another redundant drive (RAID) can immediately take over the load, preventing data loss and keeping the application running.
Main objective:
Ensure service continuity and minimize the impact of unavoidable failures in critical environments.
See also: Redundancy, High Availability (HA).
« Back to Glossary Index