Reliability and Availability Features
The AlphaServer ES40 system achieves a high level
of reliability and availability through the careful application of
technologies that balance redundancy, error correction, and
fault management. Reliability and availability features are
built into the CPU, memory, and I/O, and implemented at the
system level.
Processor Features
• CPU data cache provides error correction code (ECC)
  protection.
• Parity protection on CPU cache tag store.
• Multi-tiered power-up diagnostics to verify the
  functionality of the hardware.
When you power up or reset the system, each CPU, in parallel,
runs a set of diagnostic tests. If any tests fail, the failing CPU
is configured out of the system. Responsibility for initializing
memory and booting the console firmware is transferred to
another CPU, and the boot process continues. This feature
ensures that a system can still power up and boot the operating
system in case of a CPU failure. Messages on the operator
control panel power-up/diagnostic display indicate the test
status and component failure information.
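The failover behavior described above can be sketched as follows. This is only an illustration, not the ES40 firmware: the function name select_primary_cpu, the dictionary of test results, and the rule of choosing the lowest-numbered passing CPU are all assumptions made for the example.

```python
def select_primary_cpu(test_results):
    """Pick a boot CPU from power-up diagnostic results.

    test_results maps a CPU number to True (passed) or False (failed).
    Hypothetical policy: the lowest-numbered passing CPU becomes the
    primary, and every failing CPU is configured out of the system.
    """
    passing = sorted(cpu for cpu, ok in test_results.items() if ok)
    configured_out = sorted(cpu for cpu, ok in test_results.items() if not ok)
    if not passing:
        # No CPU passed its diagnostics, so the boot cannot continue.
        raise RuntimeError("no CPU passed power-up diagnostics")
    return passing[0], configured_out


# CPU 0 fails its tests, so responsibility for booting moves to CPU 1.
primary, removed = select_primary_cpu({0: False, 1: True, 2: True, 3: True})
print(primary, removed)
```

In this sketch the system still boots as long as at least one CPU passes its tests, which is the property the power-up diagnostics are meant to guarantee.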
Memory Features
• The memory ECC scheme is designed to provide maximum
  protection for user data. The memory scheme
  corrects single-bit errors and detects double-bit errors and
  total DRAM failure.
• Memory failover. The power-up diagnostics are designed
  to provide the largest amount of usable memory,
  configuring around errors.
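The single-bit correction and double-bit detection mentioned in the list above is the behavior of a SECDED (single error correct, double error detect) code. The sketch below shows the idea on a toy 4-bit data word using an extended Hamming code. It is an assumption-laden illustration only: the actual ES40 ECC word width and bit layout are not given here, the function names encode and decode are hypothetical, and detection of total DRAM failure is not modeled.

```python
def encode(nibble):
    """Encode 4 data bits into an 8-bit SECDED codeword (list of bits).

    Index 0 holds an overall parity bit; indexes 1..7 form a
    Hamming(7,4) code with check bits at positions 1, 2, and 4.
    """
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2
    return code


def decode(code):
    """Return (status, nibble); status is 'ok', 'corrected', or 'uncorrectable'."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos          # XOR of the positions of all set bits
    overall = sum(code) % 2          # 0 when the overall parity is consistent
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Odd number of flipped bits: treat as a single-bit error.
        # The syndrome is its position (0 means the parity bit itself).
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Nonzero syndrome with consistent overall parity: double-bit error.
        return "uncorrectable", None
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, nibble


word = encode(0b1011)
one_bad = list(word)
one_bad[5] ^= 1                      # inject a single-bit error
two_bad = list(word)
two_bad[2] ^= 1                      # inject a double-bit error
two_bad[6] ^= 1
print(decode(word), decode(one_bad), decode(two_bad))
```

A single flipped bit is repaired and the original data returned, while two flipped bits are reported as an error rather than silently miscorrected.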
I/O Features
• ECC protection on the switch interconnect and parity
  protection on the PCI and SCSI buses.
• Extensive error correction built into disk drives.
• Optional internal RAID improves reliability and data
  security.
• Disk hot swap.
System Features
Auto reboot. On systems running Tru64 UNIX or OpenVMS,
a firmware environment variable lets you set the default action
the system takes on power-up, reset, or after an operating
system crash. For maximum system availability, the variable
can be set to cause the system to automatically reboot the
operating system after most system failures. Windows NT
auto reboots by default, but lets you specify a countdown value
so you can stop the system from booting if you need to carry
out other tasks from the console firmware.
Software installation. The operating systems are factory
installed. Factory installed software (FIS) allows you to boot
and use your system in a shorter time than if you install the
software from a distribution kit.
Diagnostics. During the power-up process, diagnostics are run
to achieve several goals:
• Provide a robust hardware platform for the operating
  system by ensuring that any faulty hardware does not
  participate in the operating system session. This
  maximizes system uptime by reducing the risk of system
  failure.
• Enable efficient, timely repair.
Audible beep codes report the status of diagnostic testing.
The system has a firmware update utility (LFU) that provides
update capability for console and PCI I/O adapter firmware. A
fail-safe loader provides a means of reloading the console in
the event of corrupted firmware.
Thermal management. The air temperature and fan operation
are monitored to protect against overheating and possible
hardware destruction. Six fans provide front-to-back cooling,
and the power supplies, in the rear, have their own fans. If the
temperature rises, the system fans speed up; or if necessary to
prevent damage, the system shuts down. If the main fan,
which cools the system card cage, fails, a redundant fan takes
over.
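The thermal policy described above can be summarized as a small decision function. This is a sketch under stated assumptions: the function name thermal_actions and both temperature thresholds are hypothetical placeholders, not ES40 specifications.

```python
def thermal_actions(temp_c, main_fan_ok, speedup_c=40.0, shutdown_c=50.0):
    """Return the list of actions the system takes for the given readings.

    speedup_c and shutdown_c are illustrative thresholds only; the real
    limits are set by the system and are not stated in this document.
    """
    actions = []
    if not main_fan_ok:
        # The redundant fan takes over when the main fan fails.
        actions.append("start redundant fan")
    if temp_c >= shutdown_c:
        # Shut down to prevent hardware damage.
        actions.append("shut down system")
    elif temp_c >= speedup_c:
        # Rising temperature: run the system fans faster.
        actions.append("increase fan speed")
    return actions


print(thermal_actions(25.0, True))
print(thermal_actions(45.0, True))
print(thermal_actions(55.0, False))
```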
Error handling. Parity and other error conditions are detected
on the PCI buses. The memory checking scheme corrects
single-bit errors and detects double-bit errors. Repeated ECC
corrections of single-bit errors, as detected by the operating
system, help determine where in the system the error
originated. Errors are logged for analysis.
Disk hot swap. The hardware is designed to enable hot swap
of disks. Hot swap is the removal of a disk or disks from any
of the storage compartments while the rest of the system
remains powered on and continues to operate. This feature
contributes significantly to system availability. Since many
disk problems can be fixed without shutting down the entire
system, users lose access only to the disks that are removed.
N+1 power redundancy. A second or third power supply can
be added to provide redundant power to the chassis. A second
power supply is needed for more than two CPUs or if a second
disk cage is installed. In this case the third supply provides
redundancy. Power supplies are 735 watts (DC). Each has
two LEDs to indicate the state of power to the system.
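The supply-count rule above can be written out as a short calculation. The function name power_supplies_needed and its parameters are hypothetical; the rule itself follows the text: one supply for a base configuration, a second for more than two CPUs or a second disk cage, plus one more for N+1 redundancy.

```python
def power_supplies_needed(cpus, disk_cages, redundant=True):
    """Return how many power supplies a configuration calls for."""
    # A second supply is required for more than two CPUs
    # or when a second disk cage is installed.
    required = 2 if (cpus > 2 or disk_cages >= 2) else 1
    # N+1 redundancy adds one supply beyond the required count.
    return required + 1 if redundant else required


print(power_supplies_needed(cpus=2, disk_cages=1))                   # 2
print(power_supplies_needed(cpus=4, disk_cages=1))                   # 3
print(power_supplies_needed(cpus=2, disk_cages=2, redundant=False))  # 2
```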
An external UPS can be purchased to support critical customer
configurations. Because power is maintained for the entire
system (CPU, memory, and I/O), power interruptions are
completely transparent to users.