|
We're very excited about this: we can check every time a program uses a pointer or array and ensure that only valid references are allowed. This isn't new: what's new is that checked code can interwork with unchecked modules, libraries and system calls. We're still working on some rough edges and on improving the performance.
This is a short overview; for a full report (and the code), see here.
C is unusual among programming languages in providing the programmer with the full power of pointers. Languages in the Pascal/Algol family have arrays and pointers, with the restriction that arithmetic on pointers is disallowed. Languages like BCPL allow arbitrary operations on pointers, but lack types and so require clumsy scaling by object sizes.
An advantage of the Pascal/Algol approach is that array references can be checked at run-time fairly efficiently, in fact so efficiently that there is a good case for bounds-checking in production code. Bounds checking is easy for arrays because the array subscript syntax specifies both the address calculation and the array within which the resulting pointer should point.
With pointers in C, a pointer can be used in a context divorced from the name of the storage region for which it is valid.
One response to this analysis is to discard C, since this lack of efficient checkability is responsible for many software failures.
A second approach is to extend the language to make checking easier. There are various proposals for doing this, and it is an opportunity to add other features such as assertion checking.
A third more-or-less workable scheme is to modify the representation of pointers to include three items: the pointer itself, and the lower and upper bounds of the object to which it is supposed to point. Experience with this has shown the benefits of bounds checking (e.g. see the bcc and rtcc compilers cited below), but there are difficulties:
Even if the checking code can be optimised away, there remains the cost of passing triples for every pointer - which essentially prevents their being allocated to registers.
Some automatic support for interfacing checked code with normal code can be given, but this only works for straightforward cases. GUI code with call-backs, for example, is tricky.
Our technique provides full checking without changing the representation of pointers. We therefore avoid most of the problems noted above. Some efficiency problems remain, but bounds checking need not be used in all of the files which make up a program, so trusted, performance-critical code can run at full speed.
The key idea is this:
For example, in "p+2*k+1" we derive a new pointer from "p".
By contrast, in "p+q" or "p-q", we derive an integer from two pointers. The integer is nonsense as a pointer.
We call this unique original pointer the expression's "base" pointer.
An allocated storage region may be a global, static, automatic or heap-allocated variable, structure or array.
This is so that a pointer can be incremented and then tested for the loop exit condition.
To prevent false alarms, we pad the storage layout of arrays to that A[100] is a valid pointer (we still check it when it is used).
We made some small modifications to the C front-end of gcc, the Gnu C compiler, to add code to check pointer arithmetic and use, and to maintain a table of known allocated storage regions.
We went to some trouble to ensure that gcc's optimiser could handle the added code, and employed modest inlining for efficiency.
The table of known allocated storage regions has to handle insertions, deletions and range lookups extremely fast, but since programs display a high degree of locality the access pattern is highly skewed. For these reasons a splay tree was used, in which objects are migrated to the root when accessed.
The software is distributed free under GNU copyleft, in the form of a patch to the gcc 2.7.0 source distribution ( here).