By Simson L. Garfinkel, Ph.D.
ACM Queue vol. 5, no. 7
A computer used by Al Qaeda ends up in the hands of a Wall Street
Journal reporter. A laptop from Iran is discovered that contains details
of that country's nuclear weapons program. Photographs and videos are
downloaded from terrorist Web sites.
As evidenced by these and countless other cases, digital documents and
storage devices hold the key to many ongoing military and criminal
investigations. The most straightforward approach to using these media
and documents is to explore them with ordinary tools - open the Word
files with Microsoft Word, view the Web pages with Internet Explorer,
and so on.
Although this straightforward approach is easy to understand, it can
miss a lot. Deleted and invisible files can be made visible using basic
forensic tools. Programs called carvers can locate information that
isn't even a complete file and turn it into a form that can be readily
processed. Detailed examination of e-mail headers and log files can
reveal where a computer was used and other computers with which it came
into contact. Linguistic tools can discover multiple documents that
refer to the same individuals, even though names in the different
documents have different spellings and are in different human languages.
Data-mining techniques such as cross-drive analysis can reconstruct
social networks - automatically determining, for example, if the
computer's previous user was in contact with known terrorists. This sort
of advanced analysis is the stuff of DOMEX, the little-known
intelligence practice of document and media exploitation.
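To make the idea of a carver concrete, here is a minimal sketch in Python. It scans a raw byte buffer (for example, the contents of a disk image) for the standard JPEG start-of-image and end-of-image markers and returns whatever lies between them. The function name carve_jpegs and the max_size cutoff are assumptions made for illustration; this is not the method of any particular forensic tool, and a production carver must cope with fragmentation, embedded thumbnails, and false matches.

```python
from typing import List

JPEG_SOI = b"\xff\xd8\xff"  # start-of-image marker plus the first byte of the next marker
JPEG_EOI = b"\xff\xd9"      # end-of-image marker


def carve_jpegs(raw: bytes, max_size: int = 10 * 1024 * 1024) -> List[bytes]:
    """Return candidate JPEG byte strings found in a raw buffer.

    Hypothetical helper: it ignores the file system entirely and looks
    only at byte patterns, which is the essence of carving.
    """
    found = []
    pos = 0
    while True:
        start = raw.find(JPEG_SOI, pos)
        if start == -1:
            break
        end = raw.find(JPEG_EOI, start + len(JPEG_SOI))
        if end == -1:
            break
        end += len(JPEG_EOI)
        if end - start <= max_size:
            found.append(raw[start:end])
            pos = end            # continue after this candidate
        else:
            pos = start + 1      # implausibly large; try the next header
    return found
```

Because the scan never consults a directory or allocation table, it works equally well on deleted files, unallocated space, and partially overwritten media, which is why carving can recover data that ordinary tools never see.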
The U.S. intelligence community defines DOMEX as "the processing,
translation, analysis, and dissemination of collected hard-copy
documents and electronic media, which are under the U.S. government's
physical control and are not publicly available."1 That definition goes
on to exclude "the handling of documents and media during the
collection, initial review, and inventory process." DOMEX is not about
being a digital librarian; it's about being a digital detective.
Although very little has been disclosed about the government's DOMEX
activities, in recent years academic researchers - particularly those
concerned with electronic privacy - have learned a great deal about the
general process of electronic document and media exploitation. My
interest in DOMEX started while studying data left on hard drives and
memory sticks after files had been deleted or the media had been
"formatted." I built a system to automatically copy the data off the
hard drives, store it on a server, and search for confidential
information. In the process I built a rudimentary DOMEX system. Other
recent academic research in the fields of computer forensics, data
recovery, machine translation, and data mining is also directly
applicable to DOMEX.
This article introduces electronic document and media exploitation from
that academic perspective. It presents a model for performing this kind
of exploitation and discusses some of the relevant academic research.
Properly done, DOMEX goes far beyond recovering documents from hard
drives and storing them in searchable archives. Understanding this
engineering problem gives insight that will be useful for designing any
system that works with large amounts of unstructured, heterogeneous
data.
When researchers say that their work is centered on information or
document "exploitation," eyebrows are invariably raised. The word
exploitation is provocative, attracting unwarranted attention to a
process that could just as easily be classified as "computer forensics"
or even "data recovery." But, in fact, the word is apropos.
The words exploit and exploitation imply using something in a manner
that's "unfair or selfish."2 And it's true. People who are in the
business of document and media exploitation really do seek to make
unfair use of computer documents and electronic storage devices. Fair,
after all, means following the rules. The "rules" of a computer system
are the APIs, the data-storage standards, the file permissions, and
other interfaces that were intended to be used by the file's creator.
When a file in the computer's electronic trash is deleted by "emptying
the trash," the rules say that the file's contents should no longer be
accessible. The "undelete" command that is part of every forensic
toolkit takes advantage of the fact that computer systems generally do
not overwrite the contents of deleted files. This is a common problem in
computer systems, affecting not only deleted files in file systems but
also deleted paragraphs in word processors and even unallocated pages in
virtual memory systems.
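The following toy model, written in Python, illustrates why "undelete" works. It assumes a deliberately simplified file system in which a directory maps file names to block numbers; the class ToyDisk and its methods are hypothetical and do not correspond to any real file system's on-disk format. Deleting a file removes the directory entry and marks the block as free, but leaves the block's contents in place.

```python
class ToyDisk:
    """A toy file system: a directory maps names to single blocks."""

    def __init__(self, num_blocks=8):
        self.blocks = [b""] * num_blocks     # raw storage
        self.directory = {}                  # file name -> block index
        self.free = list(range(num_blocks))  # indices of unallocated blocks

    def write(self, name, data):
        idx = self.free.pop(0)
        self.blocks[idx] = data
        self.directory[name] = idx

    def delete(self, name):
        # Only the bookkeeping changes; the data itself is not overwritten.
        idx = self.directory.pop(name)
        self.free.append(idx)

    def scan_unallocated(self):
        # What a forensic tool does: read the blocks the "rules" say are gone.
        return [self.blocks[i] for i in self.free if self.blocks[i]]


disk = ToyDisk()
disk.write("memo.txt", b"secret plan")
disk.delete("memo.txt")
recovered = disk.scan_unallocated()
```

After the delete, the name memo.txt is no longer in the directory, yet the scan of unallocated blocks still returns the original contents. The data remains recoverable until some later write happens to reuse that block.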
Computer forensic practitioners working for police departments and
litigation support firms also make their living by recovering
intentionally deleted data, but even these processes follow rules -
though those involved in exploitation might choose to ignore them. The
goal of computer forensics is to assist in some kind of investigation,
which usually begins because a crime was committed and, hopefully, ends
with the perpetrator being convicted in a court of law. With conviction
as a goal, forensic practitioners must be concerned with evidentiary
integrity and chain of custody - and they need to limit their search to
information that is relevant to that investigation. In many cases the
evidence will have been obtained under a search warrant or discovery
procedure, the terms of which may limit the forensic examiner's actions
or even which kinds of files may be examined. Evidence obtained by
breaking the rules may even be suppressed.
For example, in the case of U.S. v. Carey, an investigator executing a
narcotics warrant discovered files with a JPG extension that
contained child pornography. Carey was indicted and convicted for
possession of child pornography, but the appellate court reversed the
ruling and remanded the case to the trial court, holding that "the
seizure of evidence was beyond the scope of the warrant."3 The evidence
should have been suppressed.
Unlike the investigators in the Carey case, those engaged in document
and media exploitation are not bound by any rules other than the laws of
physics and nature. The goal of information exploitation is to get and
use the data - the ends justify the means. It's OK if these results
aren't good enough for a conviction. Exploitation rarely seeks to prove
or disprove the details of a case; instead, it seeks to make the fullest
use of all the data that has been obtained. The standard of success is
the usefulness of the result, not the reliability of the process.
If you find the preceding paragraph alarming, remember that DOMEX is
about exploiting data, not people. "Exploitation" is precisely the
attitude that you want when you take a crashed hard drive to a
data-recovery firm. If you've just lost the only copy of a 400-page
manuscript, it's probably OK with you if the firm is able to recover the
first 200 pages of the September 20 version and the last 180 pages of
the August 19 version. Although a good defense attorney might be able to
suppress a document that was made by stitching together those two
halves, you probably don't care about that if you are the author and the
alternative is rewriting the 400 pages from memory. Likewise, if you are
using some kind of desktop search system to index the files on your hard
drive, you don't mind if the product makes a mistake or two and shows
you files that you aren't "allowed" to see - just as long as you find
what you're searching for.