In his research, Mr. Byers used an existing web crawler to collect Microsoft Word documents from the open Internet. The article reports a collection rate of approximately 1,000 documents per hour over a cable modem. The "first" 100,000 documents took 16 gigabytes of disk space. The collection did not appear to be targeted at any particular organization.
Once the documents were downloaded, they were analyzed using open-source tools. Several search algorithms are discussed, and the end results are extremely interesting. Approximately half of the 100,000 documents contained between 10 and 50 hidden words, one-third contained 50 to 500 hidden words, and 10 percent contained over 500 hidden words. That's a whopping 93% incidence rate of documents carrying information that was not intended for the recipient.
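The 93% figure is simply the sum of the three reported shares, and the crawl numbers can be sanity-checked the same way. The short Python sketch below reproduces that arithmetic. The function name `summarize` and its parameters are hypothetical, "half" and "one-third" are taken as exact fractions even though the article gives them as approximations, and a gigabyte is assumed to mean 1,000,000 kilobytes.

```python
from fractions import Fraction


def summarize(docs=100_000, docs_per_hour=1_000, disk_gb=16):
    """Back-of-the-envelope check of the figures quoted above.

    Assumptions (not from the article): the shares are exact
    fractions, and 1 GB is treated as 1,000,000 KB.
    """
    # Reported shares: 10-50, 50-500, and over 500 hidden words.
    shares = [Fraction(1, 2), Fraction(1, 3), Fraction(1, 10)]
    total = sum(shares)  # share of documents with 10 or more hidden words

    return {
        "crawl_hours": docs / docs_per_hour,
        "avg_kb_per_doc": disk_gb * 1_000_000 / docs,
        "pct_with_hidden_words": round(float(total) * 100),
        "docs_with_hidden_words": round(float(total) * docs),
    }


if __name__ == "__main__":
    for name, value in summarize().items():
        print(f"{name}: {value}")
```

Under these assumptions, the 100,000 documents represent roughly 100 hours of crawling at an average of about 160 KB per document, and about 93,000 of them carry at least 10 hidden words.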
The article includes some suggestions for working around the problem, but those recommendations are far from adequate. While this is a very good article, it only addresses well-known metadata issues. It does not address the other lesser-known and often overlooked problems cited in this author's briefings and reports.
Unfortunately, this article is only available through IEEE's subscription services. You can obtain a copy through your library using the citation provided below. Before you go to a lot of trouble, though, note that I did locate an earlier version of the article that is available on the open Internet. That citation and a link are also provided below.
Byers, Simon, "Information Leakage Caused by Hidden Data in Published Documents," IEEE Security & Privacy, Vol. 2, No. 2, pp. 23-27, IEEE Computer Society, March/April 2004.
Byers, Simon, "Scalable Exploitation of, and Responses to Information Leakage Through Hidden Data in Published Documents," 3 April 2003.
Mr. Byers's work is the first scientific research I have seen on the problem of hidden data in electronic documents, but numerous real-world examples are available. British Prime Minister Tony Blair, the Danish Prime Minister, SCO, and the California Attorney General have all been embarrassed by hidden data in Microsoft Word documents.