IEEE Report on Meta Data

By Ronald D. Hackett, PE

The Institute of Electrical and Electronics Engineers (IEEE) Computer Society has published an article by Simon Byers entitled "Information Leakage Caused by Hidden Data in Published Documents" in its new IEEE Security & Privacy magazine.  Simon Byers is a senior member of the technical staff at AT&T Labs.  The article is important because it appears in a professional magazine for computer and software engineers, and because it provides statistics on how often hidden data occurs in Word documents.

In his research, Mr. Byers used an existing web crawler to collect Microsoft Word documents from the open Internet.  The article reports collecting approximately 1,000 documents per hour over a cable modem.  The "first" 100,000 documents took up 16 gigabytes of disk space.  The collection did not appear to target any particular organization.
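
As a rough back-of-the-envelope check on those figures, the short Python sketch below derives the implied collection time and average document size.  The variable names are illustrative, and the sketch assumes a steady download rate and decimal gigabytes (10^9 bytes); neither assumption comes from the article.

```python
# Back-of-the-envelope figures derived from the numbers quoted above.
# Assumptions (not from the article): a constant download rate and
# decimal gigabytes (1 GB = 10**9 bytes).
DOCS = 100_000
DOCS_PER_HOUR = 1_000
TOTAL_BYTES = 16 * 10**9

hours = DOCS / DOCS_PER_HOUR          # time to collect all documents
days = hours / 24
avg_kb = TOTAL_BYTES / DOCS / 1_000   # average size per document, in KB

print(f"collection time: {hours:.0f} hours (about {days:.1f} days)")
print(f"average document size: {avg_kb:.0f} KB")
```

Under those assumptions, the collection works out to about four days of continuous downloading and an average of roughly 160 KB per document.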

Once the documents were downloaded, they were analyzed using open source tools.  The article discusses several search algorithms, and the end results are extremely interesting.  Approximately half of the 100,000 documents contained between 10 and 50 hidden words, one-third contained 50 to 500 hidden words, and 10 percent contained over 500 hidden words.  Added together, that is a whopping 93% incidence rate of documents transmitting information that was not intended for the recipient.
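
The 93% figure is consistent with simply adding the three categories.  A minimal sketch of that arithmetic follows; it takes "half" and "one-third" literally, which is an assumption, since the figures quoted above are approximate.

```python
from fractions import Fraction

# Shares of documents per hidden-word range, as quoted above.
# Treating "half" and "one-third" as exact fractions is an assumption.
shares = {
    "10 to 50 hidden words": Fraction(1, 2),
    "50 to 500 hidden words": Fraction(1, 3),
    "over 500 hidden words": Fraction(1, 10),
}

total = sum(shares.values())          # exact fraction: 14/15
affected = float(total) * 100_000     # estimated documents out of 100,000

print(f"combined share: {float(total):.1%}")
print(f"roughly {affected:,.0f} of 100,000 documents")
```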

The article includes some suggestions for working around the problem, but those recommendations are far from adequate.  While this is a very good article, it addresses only the well-known metadata issues.  It does not address the lesser-known and often overlooked problems cited in this author's briefings and reports.

Unfortunately, the article is available only through IEEE's subscription services.  You can obtain a copy through your library using the citation provided below.  Before you go to that trouble, note that I located an earlier version of the article that is available on the open Internet.  That citation and a link are also provided below.

Byers, Simon, "Information Leakage Caused by Hidden Data in Published Documents," IEEE Security & Privacy, Vol. 2, No. 2, pp. 23-27, IEEE Computer Society, March/April 2004.

Byers, Simon, "Scalable Exploitation of, and Responses to Information Leakage Through Hidden Data in Published Documents," 3 April 2003.

Mr. Byers's work is the first scientific research I have seen on the problem of hidden data in electronic documents, but numerous real-world examples are available.  British Prime Minister Tony Blair, the Danish Prime Minister, SCO, and the California Attorney General have all been embarrassed by hidden data in Microsoft Word documents.
 

  1. Smith, Richard M., "Microsoft Word Bytes Tony Blair in the Butt," 30 June 2003.
  2. "Danish Prime Minister Gets Bitten by Word," The Sydney Morning Herald (smh.com.au), 13 January 2004.
  3. Shankland, Stephen and Ard, Scott, "Document shows SCO prepped lawsuit against BofA," CNET News, 4 March 2004.
  4. Jardin, Xeni, "P2P in the Legal Crosshairs," Wired News, 15 March 2004.  Note: Skip to paragraph 5 to see why this article is listed.

Some of these articles imply that converting a document to Adobe's Portable Document Format (PDF) is a safe and appropriate method of publishing information electronically.  That perception is incorrect, as the following examples demonstrate.

  1. Poulsen, Kevin, "Justice e-censorship gaffe sparks controversy," SecurityFocus, 22 October 2003.
  2. Foss, Kurt, "Washington Post's scanned-to-PDF Sniper Letter More Revealing Than Intended," Planet PDF, 26 October 2002.

Sensitive and private information is routinely and unwittingly compromised by hidden data in electronic documents published to the web and sent by email.  Computer-generated documents and files often contain hidden information that is unknown to authors and readers, but that could be exploited by knowledgeable third parties.  It is each user's and author's responsibility to be aware of this hidden data and to remove it before publishing documents in electronic form.
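
One crude way to get a feel for what a file carries beyond its visible text is to list every run of printable characters in its raw bytes, much as the Unix `strings` utility does.  The Python sketch below is only an illustration of that idea and is not the method used in the Byers article; the function name `printable_runs`, the `min_len` threshold, and the sample bytes (including the file path in them) are all hypothetical.

```python
import re

def printable_runs(data: bytes, min_len: int = 6) -> list[str]:
    """Return runs of printable text found in raw file bytes."""
    runs = []
    # Plain single-byte ASCII text.
    for m in re.finditer(rb"[\x20-\x7e]{%d,}" % min_len, data):
        runs.append(m.group().decode("ascii"))
    # UTF-16LE text (ASCII characters each followed by a zero byte).
    for m in re.finditer(rb"(?:[\x20-\x7e]\x00){%d,}" % min_len, data):
        runs.append(m.group().decode("utf-16-le"))
    return runs

# Hypothetical sample: visible text plus a file path buried in binary data.
sample = (
    b"\x00\x01"
    + b"Visible body text"
    + b"\xff\xfe\x02"
    + "C:\\Drafts\\secret.doc".encode("utf-16-le")
    + b"\x00\x03"
)

for run in printable_runs(sample):
    print(run)
```

To try this on a real file, pass the file's bytes to the function, for example the result of `open(path, "rb").read()`.  A scan like this only hints at what may be present: compressed or encrypted content will not show up, so a clean result does not prove that a document is free of hidden data.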