$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$
%$%                                                                     %$%
$%$                 Electronic Switching System Faults                  $%$
%$%                                                                     %$%
$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$


"Notes from No 2 ESS Administration and Maintenance Plan,"
"BSTJ Vol 48, 1969"

"Data Maintenance"

Memory mutilation results from hardware faults and program bugs.

During nonsynchronous operation mismatch detection is not available, so
there may be a long period of time during which mutilation occurs.

Mismatch detection is useless in finding data mutilation caused by program
bugs.

Data maintenance is aided by ease of communication among programs, absence
of linked lists, and per call memory allocation (call processing program
addressing is relative to the allocated memory, reducing the scope of data
accesses).

Defensive programming techniques:
  Range check table indexes.
  Zero check derived transfer-to addresses.
  Distinct program and data errors prevent programs being read as data.

Audit programs detect bad data.
Audits run periodically or as requested from tty.
Separate audits for different memory blocks.
Audits correct by idling memory blocks containing bad data.

System recovery initiated by control unit switch during simplex operation;
control unit switch can be caused by bad data or bugs that cause sanity
time out.

System recovery functions:
  Make call store consistent with state of periphery.
  Clear memory associated with program in control at time of recovery.
  Run audits.
  Repeat the above with widening scope of memory initialization until
  sanity obtained.


"Notes from Design of Recovery Strategies for A Fault Tolerant No. 4 ESS"
"by R. J. Willet - BSTJ vol 61, no 10, 4-13-82"

"Objectives"

616,000 call attempts/hour
100,000 active terminations
Downtime less than 2 hours in 40 years
Not cost-effective (or possible) to remove all software errors - minimize
number of service affecting errors and analyze data for cause.

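The objectives above can be restated in more familiar units with a little
arithmetic. What follows is only a sketch: the function names and the
365.25 day average year are assumptions made for illustration and do not
come from the BSTJ article.

```python
# Rough arithmetic on the No. 4 ESS objectives quoted above.
# Assumption: an average year of 365.25 days (leap years included).
HOURS_PER_YEAR = 365.25 * 24


def unavailability(downtime_hours, years):
    """Fraction of total time the system is allowed to be down."""
    return downtime_hours / (years * HOURS_PER_YEAR)


def downtime_minutes_per_year(downtime_hours, years):
    """Allowed downtime expressed as minutes per year of service."""
    return downtime_hours * 60 / years


def attempts_per_second(attempts_per_hour):
    """Call attempt rate converted from per hour to per second."""
    return attempts_per_hour / 3600


if __name__ == "__main__":
    # Objective: downtime less than 2 hours in 40 years.
    print("unavailability      :", unavailability(2, 40))
    print("downtime min/year   :", downtime_minutes_per_year(2, 40))
    # Objective: 616,000 call attempts per hour.
    print("call attempts/second:", attempts_per_second(616_000))
```

Under these assumptions the downtime objective works out to 3 minutes per
year, and the call attempt objective to roughly 171 attempts per second.
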
"Software Recovery" Reconstruct data from associated information - slow, disturbs few calls. Reinitialize memory structure - fast, disturbs many calls. "Audit Programs" Provide for integrity of system memory Structured into mutilation detection and correction modules Detection modules run continiously in background Detection modules augmented by defensive checks in operational programs Call correction modules to correct errors found by background audits or defensive checks. "System Integrity Programs" Provide for integrity of programs Monitor job scheduling and sequencing for frequency and execution times Use sanity timers Call audits or reinitialize system to correct errors. "Recovery from software problems" Software problems caused by program errors or bad data Out-of-range accesses trigger hardware interrupt, recovery requires correction of data, or killing of call and return of control to a safe point. Inhibit (pest) interrupts while audits are correcting problem, risky, but assumes single software fault. In cases where the out-of-range error can be isolated to a single unit can use frame level pesting, otherwise use system level pesting. Software recovery does not consider the possibility of a hardware fault. Recovery cannot fix a program bug. Running pested may allows the system to operate in a degraded fashion while maintenance personnel analyze data and correct program. The buffer overflow problem - may be caused by program error. Buffers protected by hardware overflow interrupts. Recovery runs the buffer unloader program to unload the buffer and audits the task dispenser program to ensure the unloader is scheduled properly. The overflow interrupt is pested. If problem continues, hardware is suspect. "No. 4 ESS: Maintenance Software" "by M. N. Meyers, W. A. Routt and K. W. Yoder," "BSTJ Vol. 56, No. 
7, September 1977" "Software Error Recovery" Since system operation is dependent on data in memories, and memories can be written, there is a possibility the memory will be in a state that precludes operation. System must be as error-free as possibile. Since system cannot be completely error-free, it must be error tolerant. "Classification of software errors" Errors in interfaces between software modules. Non-conformity to systems rules. KsO$