Document & Media Exploitation: The DOMEX challenge is to turn digital bits into actionable intelligence.

By Simson L. Garfinkel, Ph.D.
ACM Queue vol. 5, no. 7 
November/December 2007 

A computer used by Al Qaeda ends up in the hands of a Wall Street 
Journal reporter. A laptop from Iran is discovered that contains details 
of that country's nuclear weapons program. Photographs and videos are 
downloaded from terrorist Web sites.

As evidenced by these and countless other cases, digital documents and 
storage devices hold the key to many ongoing military and criminal 
investigations. The most straightforward approach to using these media 
and documents is to explore them with ordinary tools - open the Word 
files with Microsoft Word, view the Web pages with Internet Explorer, 
and so on.

Although this straightforward approach is easy to understand, it can 
miss a lot. Deleted and invisible files can be made visible using basic 
forensic tools. Programs called carvers can locate information that 
isn't even a complete file and turn it into a form that can be readily 
processed. Detailed examination of e-mail headers and log files can 
reveal where a computer was used and which other computers it came 
into contact with. Linguistic tools can discover multiple documents that 
refer to the same individuals, even when the names are spelled 
differently and the documents are written in different human languages. 
Data-mining techniques such as cross-drive analysis can reconstruct 
social networks - automatically determining, for example, if the 
computer's previous user was in contact with known terrorists. This sort 
of advanced analysis is the stuff of DOMEX, the little-known 
intelligence practice of document and media exploitation.
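
To make the idea of a carver concrete, here is a minimal sketch of 
header/footer carving for JPEG images. It is an illustration only, not 
the method of any particular forensic tool: the function name 
carve_jpegs and the max_size limit are assumptions made for this 
example, and real carvers also validate the recovered data and cope 
with fragmented files.

```python
# Minimal header/footer carver: a sketch, not a production tool.
JPEG_SOI = b"\xff\xd8\xff"  # start-of-image marker plus the first byte of the next marker
JPEG_EOI = b"\xff\xd9"      # end-of-image marker


def carve_jpegs(raw: bytes, max_size: int = 10_000_000) -> list[bytes]:
    """Return byte runs in `raw` that begin with a JPEG header and end with a JPEG footer.

    `max_size` (an assumed limit) bounds how far past a header we look for a footer.
    """
    carved = []
    pos = 0
    while True:
        start = raw.find(JPEG_SOI, pos)
        if start == -1:
            break  # no more headers
        end = raw.find(JPEG_EOI, start + len(JPEG_SOI), start + max_size)
        if end == -1:
            pos = start + 1  # header without a footer in range: skip it
            continue
        carved.append(raw[start:end + len(JPEG_EOI)])
        pos = end + len(JPEG_EOI)
    return carved
```

Because the carver works on raw bytes, it never consults the file 
system; it would be run over an image of the whole device, including 
space that the file system considers unallocated.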

The U.S. intelligence community defines DOMEX as "the processing, 
translation, analysis, and dissemination of collected hard-copy 
documents and electronic media, which are under the U.S. government's 
physical control and are not publicly available."1 That definition goes 
on to exclude "the handling of documents and media during the 
collection, initial review, and inventory process." DOMEX is not about 
being a digital librarian; it's about being a digital detective.

Although very little has been disclosed about the government's DOMEX 
activities, in recent years academic researchers - particularly those 
concerned with electronic privacy - have learned a great deal about the 
general process of electronic document and media exploitation. My 
interest in DOMEX started while studying data left on hard drives and 
memory sticks after files had been deleted or the media had been 
"formatted." I built a system to automatically copy the data off the 
hard drives, store it on a server, and search for confidential 
information. In the process I built a rudimentary DOMEX system. Other 
recent academic research in the fields of computer forensics, data 
recovery, machine translation, and data mining is also directly 
applicable to DOMEX.
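
The "search" step of such a system can be sketched as a scan over the 
raw bytes of a drive image. The example below is not the system 
described above; it is a minimal illustration that assumes e-mail 
addresses are the kind of information being sought, and the pattern 
EMAIL_RE and the function find_emails are names invented for this 
sketch.

```python
import re
from collections import Counter

# Deliberately simple ASCII pattern for e-mail addresses (an assumption of
# this sketch; it is far looser than the full address grammar).
EMAIL_RE = re.compile(rb"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")


def find_emails(raw: bytes) -> Counter:
    """Count e-mail-like strings found anywhere in `raw`, ignoring file boundaries."""
    return Counter(
        match.group().decode("ascii").lower()
        for match in EMAIL_RE.finditer(raw)
    )
```

Counting rather than merely listing the matches gives a crude signal of 
which addresses matter most on a drive, and the same counts can be 
compared across drives.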

This article introduces electronic document and media exploitation from 
that academic perspective. It presents a model for performing this kind 
of exploitation and discusses some of the relevant academic research. 
Properly done, DOMEX goes far beyond recovering documents from hard 
drives and storing them in searchable archives. Understanding this 
engineering problem gives insight that will be useful for designing any 
system that works with large amounts of unstructured, heterogeneous 
data.

Why "Exploitation"?

When researchers say that their work is centered on information or 
document "exploitation," eyebrows invariably rise. The word 
exploitation is provocative, attracting unwarranted attention to a 
process that could just as easily be classified as "computer forensics" 
or even "data recovery." But, in fact, the word is apropos.

The words exploit and exploitation imply using something in a manner 
that's "unfair or selfish."2 And it's true. People who are in the 
business of document and media exploitation really do seek to make 
unfair use of computer documents and electronic storage devices. Fair, 
after all, means following the rules. The "rules" of a computer system 
are the APIs, the data-storage standards, the file permissions, and 
other interfaces that were intended to be used by the file's creator. 
When a file in the computer's electronic trash is deleted by "emptying 
the trash," the rules say that the file's contents should no longer be 
accessible. The "undelete" command that is part of every forensic 
toolkit takes advantage of the fact that computer systems generally do 
not overwrite the contents of deleted files. This is a common problem in 
computer systems, affecting not only deleted files in file systems but 
also deleted paragraphs in word processors and even unallocated pages in 
virtual memory systems.
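
The reason "undelete" works can be shown with a toy model. The sketch 
below is not a real file system; the class ToyDisk and its methods are 
invented for illustration. It assumes the simplest possible design, in 
which deleting a file removes its directory entry and marks its block 
as free but leaves the block's contents in place.

```python
class ToyDisk:
    """A toy block store: a directory maps file names to block numbers."""

    def __init__(self, n_blocks: int = 8):
        self.blocks = [b""] * n_blocks    # raw storage
        self.directory = {}               # file name -> block index
        self.free = set(range(n_blocks))  # blocks available for reuse

    def write(self, name: str, data: bytes) -> None:
        idx = min(self.free)              # always reuse the lowest free block
        self.free.remove(idx)
        self.blocks[idx] = data
        self.directory[name] = idx

    def delete(self, name: str) -> None:
        idx = self.directory.pop(name)    # the name disappears...
        self.free.add(idx)                # ...but the data is not overwritten

    def scan_unallocated(self) -> list[bytes]:
        """What a forensic tool sees: leftover data in blocks marked free."""
        return [self.blocks[i] for i in sorted(self.free) if self.blocks[i]]


disk = ToyDisk()
disk.write("memo.txt", b"secret plan")
disk.delete("memo.txt")
print(disk.scan_unallocated())  # the deleted contents are still on the disk
```

In this model the old contents vanish only when a later write happens 
to reuse the same block, which is why recovery tends to become less 
likely the longer a device stays in use after a deletion.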

Computer forensic practitioners working for police departments and 
litigation support firms also make their living by recovering 
intentionally deleted data, but even these processes follow rules - 
though those involved in exploitation might choose to ignore them. The 
goal of computer forensics is to assist in some kind of investigation, 
which usually begins because a crime was committed and, hopefully, ends 
with the perpetrator being convicted in a court of law. With conviction 
as a goal, forensic practitioners must be concerned with evidentiary 
integrity and chain of custody - and they need to limit their search to 
information that is relevant to that investigation. In many cases the 
evidence will have been obtained under a search warrant or discovery 
procedure, the terms of which may limit the forensic examiner's actions 
or even which kinds of files may be examined. Evidence obtained by 
breaking the rules may even be suppressed.

For example, in the case of U.S. v. Carey, an investigator executing a 
narcotics warrant discovered files with a JPG extension that 
contained child pornography. Carey was indicted and convicted for 
possession of child pornography, but the appellate court reversed the 
ruling and remanded the case to the trial court, arguing that "the 
seizure of evidence was beyond the scope of the warrant."3 The evidence 
should have been suppressed.

Unlike the investigators in the Carey case, those engaged in document 
and media exploitation are not bound by any rules other than laws of 
physics and nature. The goal of information exploitation is to get and 
use the data - the ends justify the means. It's OK if these results 
aren't good enough for a conviction. Exploitation rarely seeks to prove 
or disprove the details of a case; instead, it seeks to make the fullest 
use of all the data that has been obtained. The standard of success is 
the usefulness of the result, not the reliability of the process.

If you find the preceding paragraph alarming, remember that DOMEX is 
about exploiting data, not people. "Exploitation" is precisely the 
attitude that you want when you take a crashed hard drive to a 
data-recovery firm. If you've just lost the only copy of a 400-page 
manuscript, it's probably OK with you if the firm is able to recover the 
first 200 pages of the September 20 version and the last 180 pages of 
the August 19 version. Although a good defense attorney might be able to 
suppress a document that was made by stitching together those two 
halves, you probably don't care about that if you are the author and the 
alternative is rewriting the 400 pages from memory. Likewise, if you are 
using some kind of desktop search system to index the files on your hard 
drive, you don't mind if the product makes a mistake or two and shows 
you files that you aren't "allowed" to see - just as long as you find 
what you're searching for.

