Purpose
In 2005, document handling was a $19 billion industry (per IDC), yet according to ImageTag, "Current paper-to-digital solutions capture less than 1% of the paper headed for the file cabinet." Digital documents have cost, access, speed, organization, durability, efficiency, environmental, competitive, discovery, and other compelling advantages over paper documents, but conventional technology offers either a good image or good compression, not both at once, which leaves no feasible solution. In this paper, we analyze and demonstrate the methods needed to get quality images and unprecedented compression at the same time.

Introduction
Paper-based filing systems are several thousand years old and found in every office across the world, yet digital documents would offer many clear advantages over paper-based filing systems if they simultaneously exhibited good compression and quality images.
1. Cost: Digital documents are nearly free, and they are getting cheaper with each passing day. Hard drives and other disks keep increasing in capacity while their prices remain the same. With state-of-the-art compression, such as Pac-n-Zoom®, a 10-page document requires only about 10 KBytes, or 1 KByte per page. A 250 GByte hard drive currently costs about $80.00, so one page can be stored for $80.00 / 250G * 1K = 32 millionths of a cent, and it can be backed up for much less than that. By comparison, it costs about 0.8 cents for the paper and another 2.4 cents for the ink to print a sheet of paper. This means it costs about (0.8 + 2.4) / 0.000032, or 100,000 times, as much to print a page as it does to write it to disk (see the arithmetic sketch after this list). Copying all of these papers consumes an average of 3% of all revenue generated by US businesses; if we assume 20% margins, then office copies cost 15% of our income. Besides having numerous other competitive advantages, digital documents are much cheaper.
2. Access: The value of a document is realized when someone views it. The cost of copying a digital document is less than a billionth of a cent, and the copies can be shipped simultaneously at the speed of light to most places on all 7 continents. With today's mobile and far-flung work force, it is increasingly unlikely that a needed document will be in the same office as the employee who needs it. There are many times when files need to be viewed by people outside the company; digital storage can provide cheap and manageable access for the IRS, the SEC, or civil subpoenas. By providing access to vendors and other outsourcing agents, many processes can be optimized. Digital documents can be served to almost anyone in any place for next to nothing.
3. Speed: Any accountant can tell you that time is money. In a well-designed system it takes less than two seconds to click on a link and retrieve a digital document, but on average it takes about 20 minutes to search for, retrieve, deliver, and re-file a paper document, and there is no need to put the digital document back. Taken together, these facts make digital files about 600 times faster than paper files, even when the paper files are in an ideal setting, which is an increasingly bad assumption. Faster access equals a quicker response to the customer's needs, which translates to greater profits.
4. Organization: Even if a filing clerk is able to label the files so that they can be found, there is almost no chance that the appropriate and necessary cross-references are included. Very few documents have the luxury of starting at the beginning and telling the whole story through to the finish. Most documents are a thread in a tapestry of thought, and without the appropriate cross-references, most paper documents are out of context. The author of a digital document can link in cross-references, so the organization is not left to a filing clerk who probably doesn't know what they should be. Without these vital links, 60% of employees waste more than 60 minutes each day duplicating work that has already been done (from www.openarchive.com). Paper files are so much trouble that the average disorganized office manager has 3,000 documents just "lying around" (from a study by U.S. News & World Report). The costs from late fees, premium prices, and other chaos expenses can eat up to 20% of the entire budget, and 80% of a company's information exists as unstructured data scattered throughout the enterprise (from KMWorld). The paper files are, ironically, the most disorganized part of many organizations. Digital files are easier to find, integrate, and organize.
5. Durability: (From World-scan.com) "More than 70% of today's businesses would fail within 3 weeks if they suffered a catastrophic loss of paper-based records due to fire or flood." For their part, digital documents tend to survive these unforeseen events. For example, in the World Trade Center tragedy nearly all the paper documents were lost, but almost all the digital documents survived. It doesn't take a catastrophe to lose paper: 22% of all documents get lost and 7.5% are never found, which wipes out 15% of all profits. Digital files are not normally removed from the filing cabinet, which dramatically reduces the chance of losing a document.
6. Efficiency: In many cases, it is easier to outsource a corporate function than to "reinvent the wheel" in-house. The primary problem of small companies (according to the Boulder County Business survey of companies in Boulder County) is the handling of government paperwork; the second largest problem (according to the same survey) is the handling of personnel. A company could specialize in handling government paperwork or personnel issues and be far more efficient than everyone doing the same paperwork themselves, but the specialized companies would need access to the necessary files, which digital files allow. Digital files also allow managers to easily verify the existence and accuracy of all the paper trails, which is nearly impossible with paper files. In these and many other ways, digital files make a company much more efficient than paper files.
7. Environmental: In 1993 US businesses used 2 million tons of copy paper, and by 2000 this had grown to 4.6 million tons, or more than 92 billion sheets of paper (Document Magazine). Since it takes 17 trees to make a ton of paper, the US used 78 million trees' worth of paper in 2000. To make matters worse, the use of paper is constantly and quickly increasing: as shown above, it more than doubled within 7 years, and paper already accounts for 40% of the municipal solid waste stream. Digital copies leave almost no environmental scars.
8. Competitive: A winning team plays together. While businesses increasingly organize and automate around the computer, paper documents resist efforts to increase productivity, so when a company automates paper processes, it gains a clear advantage over its competitors. Competitive companies are trying to move faster. For example, law firms manage millions of pages of documents, and it is imperative to a court case that the right documents and case files are available to the right person at the right time. To quote Ali Shahidi of Alschuler, Grossman, Stein, and Kahan LLP, a Santa Monica law firm: "We're doing things we couldn't have imagined a few years ago. We're smarter, better, and more nimble." At present, 99% of paper documents cannot be analyzed, automated, or organized by a computer. The paper part of the office is a remnant from the last century, and it doesn't allow a company to move forward with the modern techniques of the information age.
9. Discovery: As the Internet has proven, digital information is the easiest information to find. In a typical office, if an important document is viewed by an employee, there is a significant chance that the document will be lost forever; the average white-collar worker spends 1 hour each day looking for lost documents (from Esselte). Since digital documents are not removed from the filing cabinet, they are unlikely to be lost. A searcher may not know what label a paper file was filed under, but with the ability to contain links, digital documents are easier to cross-reference and keep in context, and the text of most digital documents can be recognized by the computer, which allows the searcher to search for phrases inside the document. Digital documents are easier to find than paper files because they are unlikely to be lost and their text is searchable by a computer.
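The storage-versus-printing arithmetic in item 1 is easy to check. Below is a minimal Python sketch that reproduces those figures; the drive price, page size, and printing costs are the same assumptions quoted above.

```python
# Reproduce the cost figures from item 1 (Cost) above.
drive_price_cents = 80.00 * 100      # $80.00 for a 250 GByte drive
drive_capacity_bytes = 250e9
page_size_bytes = 1e3                # ~1 KByte/page with strong compression

storage_cents_per_page = drive_price_cents / drive_capacity_bytes * page_size_bytes
print(f"storage: {storage_cents_per_page:.6f} cents/page")  # 0.000032 cents

paper_cents, ink_cents = 0.8, 2.4    # printing cost per sheet
ratio = (paper_cents + ink_cents) / storage_cents_per_page
print(f"printing costs {ratio:,.0f}x as much as disk storage")  # ~100,000x
```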
As we have shown, digital files have many advantages over paper, but only about 1% of the files in the filing cabinet have been digitized. With current technology, it is not practical to digitize a large percentage of the documents. Users can get a good image (e.g., JPEG) or good compression (e.g., TIFF G4), but they cannot get both a high quality image and a small file size at the same time.
If the text is big and black and the paper is clean and white (with no writing), a threshold segmenter followed by a statistical compressor, such as TIFF G4, yields file sizes that are usable (if a little annoying) on a LAN. Threshold segmentation is not the "silver bullet" people need, however, and the files image at a FAX-like quality. If we were looking at the yellow carbon copy of a receipt, much of the information would be lost. With threshold segmentation, TIFF G4 has adequate compression but poor image quality.
JPEG has moderately good image quality because it skips segmentation altogether and simply performs a discrete cosine transform (with Huffman encoding) on the image. The quality comes with a price: since all the noise is left in the image, the compression is relatively small. The user will have to wait a long time for the file to serve, transmit, and load; if the file is transmitted over the Internet, the user could easily wait tens of seconds to view a single page. Without segmentation, JPEG has a good image but poor compression.
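This trade-off is easy to reproduce. The sketch below is a minimal illustration using the Pillow imaging library (the input file name is a placeholder): the same scan is saved once as a thresholded TIFF G4 and once as a JPEG, so the file sizes and image quality can be compared directly.

```python
import os
from PIL import Image  # Pillow

scan = Image.open("receipt_scan.png")  # placeholder: a 300 dpi color scan

# TIFF G4 is bilevel, so the image is thresholded (segmented) first:
# a small file, but FAX-like quality.
scan.convert("1").save("scan_g4.tiff", compression="group4")

# JPEG keeps the tones -- and the scanner noise -- so the file stays large.
scan.convert("RGB").save("scan_jpeg.jpg", quality=85)

for name in ("scan_g4.tiff", "scan_jpeg.jpg"):
    print(name, os.path.getsize(name), "bytes")
```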
To move the files from the filing cabinet to the computer, we need a solution that has both a good image and high compression.

Human-Like Segmentation
Threshold segmentation is usually not good enough to handle the spectrum of document needs a business has. For example, a common receipt might have blue printing on yellow paper with a red number in the upper right-hand corner. It is not uncommon for some of the blue printing to be very fine while the issuing company's name is in big bold print. People often write and stamp on the receipt, and the colors they use don't have to conform to those on the receipt. The receipt might be a carbon copy, with the attendant image degradation. The handling of the receipt might also be an issue: it might have gotten dirty or crumpled. And these are only a small sample of the possible image degradations.
A simple receipt can challenge the computer's ability to store the image without loss while achieving usable compression, and there are many applications (such as X-rays and drawings) that are even more difficult. Threshold segmentation is fast and easy, but the quality of the segmentation often falls short of what is needed to do the job. In fact, the standard that everyone is held to is human-like segmentation: if people can't see what is on the receipt, then the receipt carries the blame.
Few things shed blame as well as humans. A compression system that segments worse than a human will dam the river of blame and divert it through the IT department. Maybe the accounts payable person did write "paid" on the invoice that was paid twice, or maybe the necessary scrawl was omitted; either way, the computer will be blamed if it has a reputation for missing such things. To avoid these accusations, the document imaging system needs to segment like a human (or better).
This
means that threshold segmentation is not good enough except in
certain conditions where image quality can be guaranteed. In fact,
the only acceptable segmentation for most of the paper in the office
is human-like segmentation. Humans set the standard.
When the computer industry started moving paper documents onto the computer, segmentation was used to create a higher-contrast (or sharper) image. A segmented image is more easily compressed because there are fewer artifacts to compress; segmentation removes many small defects (which we call noise) from the picture.
These are all convenient reasons to use segmentation, but unlike conventional document imaging methodology, humans require segmentation in order to achieve extraction. When we use our senses to perceive something, we are actually performing several steps. Since we use our senses so much, these steps have become second nature to us and may even be performed subconsciously.
The first step, which we call segmentation, is that of grouping like shades together. Let us use a black letter 'e' on a white background as an example. The black would span many different shades up and down the 'e' (as a typical 'e' scanned from a white sheet of paper shows), but we would need to group all of them together before we could recognize the letter.
While segmentation might seem to add artificiality to the picture, recognition requires segmentation. In other words, if we want the picture to mean anything to us, we have to segment it in our heads. The process of finding the shape (or any feature) of an object is called extraction. The extracted shape is compared against memory (probably a database, though not necessarily limited to template matching); if we can match the shape to a shape stored in memory, we recognize the shape. In our example, the segmentation must bring the 'e' in as a single region, or we will extract a shape not recognized by the database. In other words, when parts of a letter are missing, it is difficult to identify the partial letter.
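As a minimal sketch of this grouping step (assuming a NumPy/SciPy environment, with a toy patch standing in for a real scan), thresholding followed by connected-component labeling gathers the many shades of a letter into one region whose shape can then be extracted and matched:

```python
import numpy as np
from scipy import ndimage

# Toy grayscale patch shaped like an 'e' (low values = ink, 255 = paper).
# A real scan would show many different shades along the same letter.
patch = np.array([
    [255,  40,  55,  60,  70],
    [ 30, 255, 255, 255,  65],
    [ 35,  20,  45,  50,  40],
    [ 30, 255, 255, 255, 255],
    [ 25,  45,  60,  55, 255],
], dtype=np.uint8)

ink = patch < 128                    # segmentation: group the "like shades"
labels, count = ndimage.label(ink)   # extraction: find connected regions
print(count, "region(s) found")      # the letter must come out as ONE region
```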
The document imaging industry's use of segmentation to achieve OCR extraction is similar, but threshold segmentation is often a fatal shortcut that prevents robust extraction. For example, a computer cannot perform OCR extraction with the accuracy of a human in a multiple-color environment.
As computers get stronger, they can afford to forsake threshold segmentation for edge detection segmentation, which is more robust, intelligent, and human-like. A computer with intelligent, human-like segmentation isn't limited to more accurate extraction: as it begins to aggregate the image through edge detection segmentation, it can also achieve much better compression than it could with threshold segmentation.
To understand how these things occur, let's backtrack to explain the details of each type of segmentation.

Threshold Segmentation
Threshold segmentation is the simplest type of segmentation. In a simple example, with black text on white paper, we could set the threshold to gray: all text darker than the gray threshold is considered black, and any background lighter than the gray threshold is considered white. When the segmentation is complete, there are two colors in the picture.
Threshold segmentation can be much more complicated than this. For example, a histogram of colors could be taken across a region or an entire picture, the most predominant colors could be taken as the foreground and background, and the threshold could be set at the middle color between them. Of course, as threshold segmentation becomes more complicated, it runs slower.
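A minimal sketch of both variants, assuming a NumPy environment (the function names are ours, and the histogram variant simply treats the two most common gray levels as foreground and background, per the description above):

```python
import numpy as np

def threshold_fixed(gray: np.ndarray, level: int = 128) -> np.ndarray:
    """Simple threshold: anything darker than `level` is foreground."""
    return gray < level

def threshold_histogram(gray: np.ndarray) -> np.ndarray:
    """Histogram variant: threshold midway between the two most
    predominant gray levels (taken as foreground and background)."""
    hist = np.bincount(gray.ravel(), minlength=256)
    a, b = np.argsort(hist)[-2:]           # two most common levels
    return gray < (int(a) + int(b)) // 2   # midpoint threshold

page = (np.random.rand(8, 8) * 255).astype(np.uint8)  # stand-in for a scan
print(threshold_histogram(page).astype(int))
```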
For a standard 8.5 inch by 11 inch sheet of paper scanned at 300 dots per inch in all three primary colors, we have 8.5 * 11 * 300 * 300 * 3, or 25,245 KBytes of data, assuming 8 bits per primary color. Even today's computers take a noticeable amount of time to chew through 25 MBytes of data. Therefore, threshold segmentation (typically a simple variety) is usually used.
Threshold segmentation should be considered a fast but coarse segmentation; it does not address a number of the segmentation problems that human-like segmentation requires. For example, transition distortions are usually the largest distortion introduced by a scanner, and threshold segmentation neither finds nor rebuilds the border. In many cases, a pattern of colors is mixed together to provide some information (for example, a logo), and threshold segmentation cannot sort through the color complexities with human-like intelligence.

Edge Detection Segmentation
To segment like a human, the segmenter
needs to mimic the human process of edge detection segmentation. For
about 30 years, people have been trying to achieve high quality with
edge detection segmentation, but it
has only recently been accomplished.
Edge detection segmentation does a better job of supporting image restoration: the edge of a blob is distorted in a variety of ways, and the edge needs to be discovered before it can be restored. In the past, most edge detection segmentation was too coarse to be used in document imaging, but computers can theoretically handle edge detection better than humans. Humans can only see about 200,000 colors, while computers typically work with 16 million, and optoelectronic sensing technology is capable of many more.
When the computer does a better job of segmenting, it can do a better job of compressing. If the picture is over-segmented, the baby is thrown out with the bath water; in other words, the data washes out along with the noise. With better segmentation, more noise can be corrected while leaving the data intact.
The input distortion from the optoelectronic capture appliance (usually a scanner) prevents high levels of compression in color documents, so a big size and quality difference can currently be found between color documents created on a computer and those scanned from paper. For example, a scanned color document can be compressed about 3 times with a statistical compressor, but the same file created on the computer could be compressed about 100 times. Furthermore, the computer-generated document would be much cleaner and clearer.
Edge detection segmentation is a much more complicated and compute-intensive segmentation technique. At first, edge detection might seem simple, but it is complicated by several factors (a short edge detection sketch follows this list):
1. Continuous Tone: We may not be segmenting along a clear color transition; in fact, the image could be continuous tone with nearly unlimited color variations.

2. Image Distortion: Clear color transitions are usually smeared 4 or 5 pixels in two directions. Finer details become blurry, yet humans are able to segment some of this, and they then expect the computer to segment at a human level.

3. Fine Artifacts: Text (of a specific font) comes with a finite set of artifacts, and they have a minimum size. Artifacts below the minimum can be ignored in text, but they must be segmented in continuous tone.
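For reference, here is a minimal sketch of the basic operation (not the human-like segmenter described in this paper) using SciPy's Sobel operator: the gradient magnitude is computed in both directions, and pixels above a threshold are marked as edges.

```python
import numpy as np
from scipy import ndimage

def sobel_edges(gray: np.ndarray, thresh: float = 50.0) -> np.ndarray:
    """Mark pixels whose gradient magnitude exceeds `thresh` as edges."""
    g = gray.astype(float)
    gx = ndimage.sobel(g, axis=1)   # horizontal gradient
    gy = ndimage.sobel(g, axis=0)   # vertical gradient
    return np.hypot(gx, gy) > thresh

img = np.full((16, 16), 220, dtype=np.uint8)  # light background
img[4:12, 4:12] = 30                          # dark square ("blob")
print(sobel_edges(img).astype(int))           # edges trace the square's border
```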
As we would expect, edge detection segmentation takes much longer than threshold segmentation.