|
|
Setting the Configuration File:
|
While all parts
of Pac-n-Zoom can use any of the
three interfaces, to
keep it simple, we will use
samples from the configuration
file (which are given with a blue
background) to show which values
need to be set. The configuration
file uses the
standard
Pac-n-Zoom data format. There
are five data segments in the configuration
file that directly relate to the
golden files.
|
|
|
|
# BLOB COMPRESSOR PARAMETERS
0; Acceptable Tolerance
1; Write Golden File Flag
*
|
|
|
There are two
settings in the "BLOB COMPRESSOR
PARAMETERS" segment that need to
be correctly set to write the
golden file.
|
1. |
Acceptable Tolerance: The
first setting in the "BLOB
COMPRESSOR PARAMETERS" segment is
the acceptable tolerance. For
normal results, this value should
be set to 0. A higher value is
more efficient, but nonidentical
parts of different letters will be
substituted for each other. While
most people won't notice a value
that is less than 3, it might
bother some people.
|
2. |
Golden File Flag: The
second setting in the "BLOB
COMPRESSOR PARAMETERS" segment is
a flag that determines whether
the golden file should be written.
If the GUI were being used, the
golden file would be saved as a
*.pzl file. If the flag is set
(1), the golden file is written.
|
|
|
# CURRENT BC GOLDEN FILE NAMES
;PACNZOOM
*
|
|
|
This is a list of golden files
that will be used to build the
current golden files. In most
cases, no golden files should be
used to build additional golden
files, and there are no file names
in this list. In the example
above, the file name is
commented out.
If golden file names are given and
the acceptable tolerance is 0 (as
it should be), the only effect of
these extra golden file names is
to make the program take longer to
create the additional golden file.
If golden file names are given and
the acceptable tolerance is not
zero, the additional golden file
will contain morphological
genetics that are within the
acceptable tolerance of the given
golden files.
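The samples in this section suggest that a semicolon starts a comment: "0; Acceptable Tolerance" carries a value followed by a label, and ";PACNZOOM" is described as commented out. Under that assumption (it is inferred from the samples, not from a format specification), the active values of a data segment could be read with a minimal sketch like the following. The function name is hypothetical.

```python
def segment_values(lines):
    """Return the active values of one data segment.

    Assumption: everything from the first ';' to the end of the line is a
    comment, so a line that starts with ';' contributes no value.
    """
    values = []
    for line in lines:
        value = line.split(";", 1)[0].strip()
        if value:
            values.append(value)
    return values


# The two segments shown above, without their '#' header and '*' footer.
parameters = segment_values(["0; Acceptable Tolerance",
                             "1; Write Golden File Flag"])
golden_names = segment_values([";PACNZOOM"])
```

With these inputs, `parameters` holds the tolerance and the flag as strings, and `golden_names` is empty because the only entry is commented out.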
|
|
|
# NEXT BC GOLDEN FILE NAME
PACNZOOM
*
|
|
|
This segment contains the name of
the next golden file. In the
example above, the next
golden file will be named
"PACNZOOM.pzl".
|
|
|
# NEXT BC GOLDEN FILE FLAGS
Painted File
*
|
|
|
Frame flags can be written, but
the reading of frame flags is not
currently supported. This segment
contains the flags of the next
golden file. In the example
above, the next data frame will
be flagged:
~ Painted File
Frame flags are a tool that can
organize golden files. For
example, only those frames flagged
with "Painted File" would be read
if the flag was used to select
golden file frames.
Extra golden frames will slow down
Pac-n-Zoom encoding, and the data
could be snapped to an unintended
golden figure.
If only one frame is used per
file, the same result can be
achieved with the segment header:
# CURRENT BC GOLDEN FILE NAMES.
|
|
|
# BC GOLDEN TEXT
Times Roman 41 ver. 3
English
Bold
Italics
Underlined
*
|
|
|
The first line of this segment
contains the name of the font. All
the rest of the lines in this
segment are attributes of the
font. Instead of using the old
pitch system, the number of
vertical pixels in the tallest
letter should be used to name the
font. This convention, when
combined with the size of the
image, will produce more
consistent results across
different media and hold truer to
the original document.
The information in this segment
allows the figure to be recognized
(i.e., OCR), and for the first
time (I guess) the font and
attributes can be recognized.
While this might seem pretty
handy, the same formatting in
different programs can look
different. In other words, porting
formatting from one program to
another is tricky.
It is still a good idea to fill
this information in. Without it,
there will probably be 50,000
undocumented golden files, and
the time will likely come when
some of these will be preferred
over others. There won't be any
good clear way to do that without
this data segment.
|
|
|
|
|
|
Since so many golden files are
needed, it is easier to create
them automatically. During this
process, care should be taken to
make perfect characters. There
should be no stretching or scanning.
If OCR is desired, the characters
must be clustered to be
identified. The characters must be
replicated before they are
clustered.
After the golden file is written,
the characters need to be
identified. To make the
identification process automatic,
the file should be created with
the characters in lexicographic
(or better, ASCII) order. An extra
space should be placed between the
letters on the same row, and an
extra row should be placed between
rows. The following example is a
good form to follow.
|
|
! !
" "
# #
% %
' '
( (
) )
* *
+ +
, ,
- -
. .
/ /
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
: :
; ;
< <
= =
> >
? ?
@ @
A A
B B
C C
D D
E E
F F
G G
H H
I I
J J
K K
L L
M M
N N
O O
P P
Q Q
R R
S S
T T
U U
V V
W W
X X
Y Y
Z Z
[ [
\ \
] ]
^ ^
_ _
` `
a a
b b
c c
d d
e e
f f
g g
h h
i i
j j
k k
l l
m m
n n
o o
p p
q q
r r
s s
t t
u u
v v
w w
x x
y y
z z
{ {
| |
} }
~ ~
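The layout above could be produced by a short script rather than typed by hand. The following is a minimal sketch, not the author's tool: the function name and its parameters are hypothetical. It writes each printable ASCII character twice per row with a blank row between rows. The example list above does not show '$' or '&'; whether that omission is intentional is not stated, so the sketch leaves the choice to a `skip` parameter. The `gap` parameter sets the spacing between the two copies.

```python
def golden_text(first="!", last="~", skip="", gap=" "):
    """Build the text for a golden-file image.

    Each character appears twice on its own row (so the two identical
    blobs can form a cluster), rows are in ASCII order, and a blank row
    separates consecutive rows.
    """
    rows = []
    for code in range(ord(first), ord(last) + 1):
        ch = chr(code)
        if ch in skip:
            continue
        rows.append(ch + gap + ch)
    return "\n\n".join(rows) + "\n"


# Reproduce the list shown above, which has no '$' or '&' rows.
layout = golden_text(skip="$&")
```

The resulting text can be pasted into the textual program before the font, size, and attributes are set.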
|
|
|
|
|
|
If only a few golden
files need to be made (for example,
a special font the company might
use), it is probably easiest to
use the
manual method. When a
number of files need to be made,
some sort of
automation should be
used.
|
|
I. |
Manual Method: The manual
method contains four steps.
|
|
|
|
|
A. |
Create: The graphic file
should be created with perfect
fonts. A word processor, text
editor, paint program, or some
other program with textual
abilities should be used. The file
should not be scanned in or
captured with any sort of digital
sensor, because there should be no
noise in the image.
|
|
|
|
B. |
Print: The file should be
printed or saved to a
bitmap
(*.bmp) file.
|
|
|
|
C. |
Configure: The
configuration file should be set
as shown
above.
|
|
|
|
D. |
Run: Run Pac-n-Zoom and
save the file as a golden (*.pzl)
file.
|
|
|
II. |
Semiautomatic Method: When
more than a few golden
frames need
to be created, it is probably
easier to automate part of the
process. If the program creating
the graphic file has macros, the
entire process could be automated,
but it might take longer to get it
working than the time it saves.
The following provides a method
for a semiautomatic solution.
|
|
|
|
|
A. |
Software: The following
software programs are needed to
use the suggested method.
|
|
|
|
|
1. |
Macro Recorder: If the
program that creates the graphic
file does not have macros, the
process can still be automated by
using a separate program to record
the mouse and keyboard activity.
The following list of programs is
not exhaustive. It represents a
fraction of a short Internet
search.
a. |
Workspace Macro Pro - Automation Edition 5.5 from
Tethys Solutions
|
b. |
Journal Macro from Chosen Software
|
c. |
EZ Macros 5.0a by American Systems
|
d. |
Eventcorder by CMS |
|
|
|
|
2. |
Pac-n-Zoom:
|
|
|
|
3. |
Textual Program: This can
be almost any program that can
produce text. Some common
utilities are Word Pad, Note Pad,
and Paint. If the program uses
macros, it will be easier to
automate. It will be very useful
if the program fonts have a
selectable
pitch, where the
desired pitch can be typed in.
Better results can be obtained
with more pitch sizes.
|
|
|
|
4. |
Printer Driver: A virtual
printer driver that can print to
bitmap (*.bmp) will probably make
the most accurate golden file. The
more selectable resolutions the
virtual printer driver offers, the
better the results will be. A short
Internet search turned up three
virtual printer drivers, which are
listed here.
a. |
Zan Image Printer 4.0 from zan1011.com
|
b. |
Soft Copy 2.x from Dobysoft
|
c. |
KazStamp 9.0 from Kaczynski Software
|
|
|
|
|
5. |
Script Language: The script
language will need to be able to
read and modify a file. It will
launch and terminate Pac-n-Zoom,
and it will feed Pac-n-Zoom's
process controller (not to be
confused with the OS process
controller) the remote commands
that Pac-n-Zoom will execute.
There are many scripting languages
that can do these things. We use
Python, but Perl and others would
work well.
|
|
|
B. |
Bitmaps: There are many
different ways to make the graphic
files. The following method is one
of many.
|
|
|
|
|
1. |
List: In this method, we
will make all the graphic files we
need and keep a list of the file
names and text properties that were
used to build each file. This list
should be kept in order and
delimited, because some script
will use the list to modify the
Pac-n-Zoom configuration file.
|
|
|
|
2. |
Text: Put in the text as
shown
above.
|
|
|
|
3. |
Printer: Select the virtual
printer driver that will print to
a bitmap file, and set the printer to
the desired resolution. The
printing and the scanning should
be at the same resolution. For
example, if the scanning is going
to be at 300 dots per inch (DPI),
the printing should be set to 300
DPI. This will help the attributes
to be closer to the desired
results.
|
|
|
|
4. |
Font: Set the
font of the
text and store it in the list.
|
|
|
|
5. |
Pitch: Set the
pitch of the
text and store it in the list.
|
|
|
|
6. |
Attribute: Set the
attribute of the
text and store it in the list.
|
|
|
|
7. |
Name: Set the name of the
printed file and store it in the list.
|
|
|
|
8. |
Print: Save the image to a
bitmap (*.bmp) file by printing
the image.
|
|
|
|
9. |
Next: Start building the
next graphic file by going back to
step 4 (Font).
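The ordered, delimited list described in step 1 could be kept as a tab-delimited text file. The sketch below is one possible layout, not a format that Pac-n-Zoom defines: the column names, the tab delimiter, and the comma used inside the attribute and flag columns are all assumptions made for illustration. It uses only Python's standard `csv` module.

```python
import csv
import io

# Hypothetical column order; one record per bitmap file.
FIELDS = ["bitmap", "golden_name", "font", "pitch", "attributes", "flags"]


def append_record(handle, record):
    """Write one bitmap record as a tab-delimited line."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS,
                            delimiter="\t", lineterminator="\n")
    row = dict(record)
    row["attributes"] = ",".join(record["attributes"])
    row["flags"] = ",".join(record["flags"])
    writer.writerow(row)


def read_records(handle):
    """Read the records back in the order they were written."""
    reader = csv.DictReader(handle, fieldnames=FIELDS, delimiter="\t")
    records = []
    for row in reader:
        row["attributes"] = [a for a in row["attributes"].split(",") if a]
        row["flags"] = [f for f in row["flags"].split(",") if f]
        records.append(row)
    return records


# Usage with an in-memory file; a real script would open a file on disk.
buffer = io.StringIO()
append_record(buffer, {"bitmap": "TimRom41.bmp", "golden_name": "TimRom41",
                       "font": "Times Roman", "pitch": "41",
                       "attributes": ["Bold", "Italics"],
                       "flags": ["Painted File"]})
buffer.seek(0)
records = read_records(buffer)
```

Keeping one line per bitmap makes it easy for the master script to walk the list in order.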
|
|
|
C. |
Pac-n-Zoom: We are now
ready to use a master script that
will create the golden files by
running Pac-n-Zoom.
Pac-n-Zoom obtains the attributes
of the golden file frame from the
Pac-n-Zoom configuration file,
which is read only when Pac-n-Zoom
is launched. For each bitmap file,
we will need to take these steps.
|
|
|
|
|
1. |
Read: Read the following from
the list that was created when
building the bitmap files.
|
|
|
|
|
a. |
Bitmap: Read the name of the
bitmap file.
|
|
|
|
b. |
Name: Read the name assigned
to the golden file. The golden
file can contain more than one
text frame.
|
|
|
|
c. |
Font: Load the name of the
font.
|
|
|
|
d. |
Pitch: Get the size of the
font.
|
|
|
|
e. |
Attributes: Get the list of
attributes.
|
|
|
|
f. |
Flags: Read the flags that
will be set in this text frame.
|
|
|
2. |
Configure: The "# BLOB
COMPRESSOR PARAMETERS", "#
CURRENT BC GOLDEN FILE NAMES", and
most other data segments don't
change during the execution of the
master script, but the following
need to be set differently for
each bitmap file.
|
|
|
|
|
a. |
Name: The "# NEXT BC GOLDEN
FILE NAME" data segment should
contain the name of the golden
file with the extension
("*.pzl").
|
|
|
|
b. |
Flag: The "# NEXT BC GOLDEN
FILE FLAGS" data segment should
contain the flags desired in the
golden frame.
|
|
|
|
c. |
Font: The first line of the
"# BC GOLDEN TEXT" data segment
should contain the font and
pitch.
|
|
|
|
d. |
Language: The second line of
the "# BC GOLDEN TEXT" data
segment should contain the
language being used.
|
|
|
|
e. |
Attributes: The third line
and on of the "# BC GOLDEN TEXT"
data segment should contain the
remaining attributes of the
font.
|
|
|
3. |
Launch: The master script
launches Pac-n-Zoom.
|
|
|
|
4. |
Load: A remote command is used
to load the bitmap.
|
|
|
|
5. |
Write: A remote command orders
the golden file to be written. Since
the configuration file is set, the
program defaults to writing a
golden file and to the golden file
name.
|
|
|
|
6. |
Terminate: The master script
terminates Pac-n-Zoom.
|
|
|
|
7. |
Return: As long as there are
unprocessed bitmap files, go back
to step 1 (Read).
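The Configure step of the loop above could be sketched as follows. This is an illustration under stated assumptions, not the author's master script: the function names and the record keys are hypothetical, and the segment rules (a line starting with '#' opens a segment, a line starting with '*' closes it) are taken from the samples in this document. Launching Pac-n-Zoom and sending the remote commands are left out because their syntax is not given here.

```python
def set_segment(config_text, header, body_lines):
    """Return config_text with the body of one data segment replaced."""
    out = []
    open_header = None  # header of the segment currently open, if any
    found = False
    for line in config_text.splitlines():
        if open_header is None:
            out.append(line)
            if line.startswith("#"):
                open_header = line[1:].strip()
                if open_header == header:
                    found = True
                    out.extend(body_lines)
        elif line.startswith("*"):
            open_header = None
            out.append(line)
        elif open_header != header:
            out.append(line)  # other segments are copied unchanged
    if not found:
        raise KeyError(header)
    return "\n".join(out) + "\n"


def configure_for(config_text, rec, language):
    """Set the three segments that change for each bitmap file."""
    text = set_segment(config_text, "NEXT BC GOLDEN FILE NAME",
                       [rec["golden_name"]])
    text = set_segment(text, "NEXT BC GOLDEN FILE FLAGS", list(rec["flags"]))
    golden_text = [rec["font"] + " " + rec["pitch"], language]
    golden_text.extend(rec["attributes"])
    return set_segment(text, "BC GOLDEN TEXT", golden_text)
```

The master script would write the returned text to the configuration file before each launch, since the file is read only at start-up. Whether the golden file name carries the ".pzl" extension is left to the caller.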
|
|
|
|
|
|
|
While Pac-n-Zoom is able to group
things into
clusters within an
acceptable tolerance, without
using other golden files (which
would be pointless) it does not
have the ability to identify the
clusters in a meaningful
human-like way. Since a typical
application might use several
thousand golden files, the task of
manually identifying each cluster
would likely be long and tedious.
If the golden file was made in the
order shown above, a program can
automatically identify the
clusters by using the following
format.
|
|
|
I. |
Data Segment: There are several
data segments inside a font frame,
and data segments can have any
order. The cluster data segment
needs to be found.
|
|
|
|
A. |
Opening: Data segments are
opened when no other data segment
is opened and when a '#' character
is the first character of the
line.
|
|
|
|
|
B. |
Closing: A data segment is
closed (assuming it is opened)
when a '*' character is the first
character of the line.
|
|
|
|
|
C. |
Identification: The cluster
data segment is identified
with "# Clusters". To identify
the cluster segment, the
following checks should be made.
|
|
|
|
|
|
1. |
Open: A check should be made
that no other segments are opened.
|
|
|
|
|
2. |
'#': The program should check
that '#' is the first character
of the line.
|
|
|
|
|
3. |
" Clusters": " Clusters" should
follow the '#' found in step 2.
|
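The three checks above can be sketched as a small scanner over the lines of a font frame. This is a minimal illustration with a hypothetical function name. It assumes the header line is exactly "# Clusters" apart from trailing whitespace, which keeps "# Cluster Patterns" from matching.

```python
def cluster_rows(frame_lines):
    """Return the data lines of the '# Clusters' segment of one frame."""
    rows = []
    open_header = None  # text after '#' of the open segment, if any
    for line in frame_lines:
        if open_header is None:
            # Checks 1 and 2: no segment is open and '#' is the first character.
            if line.startswith("#"):
                open_header = line[1:].rstrip()
        elif line.startswith("*"):
            open_header = None  # '*' closes the open segment
        elif open_header == " Clusters":
            # Check 3: " Clusters" followed the '#'.
            rows.append(line)
    return rows


frame = [
    "{",
    "~ Painted File",
    "# Cluster Patterns",
    "9",
    "*",
    "# Clusters",
    "U| 41| 21| 37|895|5A 8C E1 0A 23 32 33 18",
    "*",
    "}",
]
found = cluster_rows(frame)
```

Lines outside any open segment, such as the braces and the frame flag, are skipped because they do not start with '#'.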
|
|
II. |
Cluster Row: A cluster row
consists of two parts and has the
following characteristics.
|
|
|
|
A. |
Preamble: When the golden file
is initially written, all of the
preambles are unidentified. The
identifying program changes the
cluster row preamble to one of the
identified formats. The cluster
row preamble has the following
formats.
|
|
|
|
|
|
1. |
Unidentified: An unidentified
preamble has the following format.
|
|
|
|
|
|
U| Height | Width | Column | Row |
|
|
|
|
a. |
U: Stands for unidentified
|
|
|
|
|
b. |
Height: The maximum height of
the cluster in pixels
|
|
|
|
|
c. |
Width: The maximum width of the
cluster in pixels
|
|
|
|
|
d. |
Column: The column of the
initial pixel
|
|
|
|
|
e. |
Row: The row of the initial
pixel
|
|
|
|
2. |
Text: A text preamble has 1
field that contains the text
character
|
|
|
|
|
3. |
Graphic: A graphic preamble has
the following format. The graphic
can be any shape.
| Name | Height | Width |
|
|
|
|
|
|
a. |
Name: The name of the cluster
that was manually inserted.
|
|
|
|
|
b. |
Height: The maximum height of
the cluster in pixels.
|
|
|
|
|
c. |
Width: The maximum width of the
cluster in pixels.
|
|
|
B. |
Frame: The cluster frame (not
to be confused with the font or
data frame) follows the last
vertical bar, '|'.
|
|
|
|
|
C. |
Line: With no exceptions, there
is one cluster row on each line
of data in the cluster data
segment.
|
|
|
|
|
D. |
Samples:
|
|
|
|
|
|
1. |
Unidentified: U| 41| 21| 37|895|5A 8C E1 0A 23 32 33 18
|
|
|
2. |
Unidentified: U|107|410|155|962|5A B4 01 22 04 B7 A0 1
|
|
|
3. |
Text: |a|5A 8C E1 0A 23 32 33 18
|
|
|
|
4. |
Graphic: |School|107|410|5A B4 01 22 04 B7 A0 1
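The three preamble formats can be told apart by the first field and the number of fields before the last vertical bar. The parser below is a sketch based only on the samples above: the function name and dictionary keys are hypothetical, and the format exceptions for '|' and blank classes are deliberately not handled. The cluster frame is kept as a string, because some samples end in a single hex digit.

```python
def parse_cluster_row(line):
    """Split one cluster row into its preamble fields and cluster frame."""
    # The cluster frame follows the last vertical bar.
    preamble, frame = line.rsplit("|", 1)
    fields = preamble.split("|")
    if fields[0] == "U" and len(fields) == 5:
        height, width, column, row = (int(f) for f in fields[1:])
        return {"kind": "unidentified", "height": height, "width": width,
                "column": column, "row": row, "frame": frame}
    if fields[0] == "" and len(fields) == 2:
        return {"kind": "text", "char": fields[1], "frame": frame}
    if fields[0] == "" and len(fields) == 4:
        return {"kind": "graphic", "name": fields[1],
                "height": int(fields[2]), "width": int(fields[3]),
                "frame": frame}
    raise ValueError("unrecognized cluster row: " + line)


unidentified = parse_cluster_row("U| 41| 21| 37|895|5A 8C E1 0A 23 32 33 18")
text = parse_cluster_row("|a|5A 8C E1 0A 23 32 33 18")
graphic = parse_cluster_row("|School|107|410|5A B4 01 22 04 B7 A0 1")
```

A text cluster whose character is 'U' still starts with a vertical bar, so it is not mistaken for an unidentified row.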
|
|
III. |
Identification Strategy: The
objective of the software is to
accurately convert the
unidentified clusters into text
clusters. This process is not as
simple as treating each cluster
row as another character.
|
|
|
|
A. |
Blob: A blob is a group of pixels
that meet the following conditions.
|
|
|
|
|
|
1. |
Color: All the pixels in a blob
are the same color.
|
|
|
|
|
2. |
Adjacent: Each pixel in the blob
touches at least one other pixel in
the blob.
|
|
|
|
B. |
Cluster: When two blobs are within
the acceptable tolerance (in the
case of golden files, the
acceptable tolerance should be
zero) of each other, they form a
cluster. This is why we put two
identical characters on each row
to build a golden file image.
|
|
|
|
|
|
1. |
Cluster Row: Each cluster row
contains a cluster, but a cluster
is not necessarily a character.
For example, the letter 'i' has
both a dot cluster and a body
cluster.
|
|
|
|
|
2. |
Super Cluster: While retaining
their cluster status, the dot and
body of the 'i' also fold together
to form a new cluster which is the
character 'i'. A super cluster is
the folding of two or more
clusters.
|
|
|
|
C. |
Spacing: By definition, a font of
a certain pitch will have an exact
number of text rows per a given
length. This rule can be relied
upon to determine which cluster
row contains the cluster that
holds the text character.
|
|
|
|
|
D. |
Beginning: In an unidentified
cluster, the Column and Row fields
(the last two fields of the
preamble) respectively hold the
column and row of the initial pixel
of the cluster. The initial pixel is
the top or most northern pixel. If
there is more than one top pixel,
the leftmost or most western pixel
is the initial pixel.
|
|
|
|
|
E. |
Order: The super clusters are
ordered after their cluster
components. For example, while the
clusters are ordered in the scan
(most north then most west is
first) order, in the letter 'i',
the body of the 'i' is ordered
before the super cluster that is
the character 'i'.
Then, the last cluster whose
beginning (
D. from above) falls
within the spacing (
C. from above)
is the super cluster that needs to
be changed from an unidentified
cluster (
II.A.1. from above) to a
text cluster (
II.A.2. from above).
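The selection rule in E can be sketched as follows. This is a simplified illustration, not the identification program itself. It assumes the text rows are evenly spaced, so that the band of a cluster can be computed from the row of its initial pixel. The names `top` (pixel row where the first text row begins) and `row_pitch` (pixel rows per text row) are hypothetical and would have to be measured from the image. The '|' and blank format exceptions are not handled.

```python
def identify(clusters, chars, top, row_pitch):
    """Map cluster index -> character for the super cluster of each text row.

    clusters: dicts with a 'row' key (row of the initial pixel), in file order.
    chars: the expected characters, one per text row, in image order.
    """
    last_in_band = {}
    for index, cluster in enumerate(clusters):
        band = (cluster["row"] - top) // row_pitch
        if 0 <= band < len(chars):
            # Super clusters come after their components, so the last
            # cluster that begins in a band is the one to relabel.
            last_in_band[band] = index
    return {index: chars[band] for band, index in last_in_band.items()}


def to_text_row(char, frame):
    """Rewrite an unidentified row as a text row, following sample 3."""
    return "|" + char + "|" + frame


# Dot, body, and super cluster of 'i', then a single cluster for 'j'.
clusters = [{"row": 10}, {"row": 14}, {"row": 10}, {"row": 95}]
labels = identify(clusters, ["i", "j"], top=0, row_pitch=82)
```

Only the clusters listed in `labels` are rewritten; component clusters such as the dot of the 'i' stay unidentified.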
|
|
|
|
IV. |
Format Exceptions: The following
text symbols could create problems
or confusion.
|
|
|
|
A. |
'U': The first letter of an
unidentified cluster is the letter
'U', but the first letter of a
text or graphic cluster that has a
class of 'U' is a vertical bar.
|
|
|
|
|
B. |
'|': The program reads "|||" as an
alternative for a line return.
Therefore the class of a graphic
or text cluster that is named '|'
should use a space, which is
deleted anyway. The following
example illustrates the method.
# Clusters
| ||5A 8C E1 0A 23 32 33 18
*
|
|
|
|
|
C. |
" |": A blank is ignored, because
a blank, ' ', cannot be a
cluster. In other words, the
following cluster is the same as
'|' which was given above. To put
this yet another way, '|' is the
same cluster as " |".
# Clusters
| ||5A 8C E1 0A 23 32 33 18
*
|
|
|
|
V. |
Sample Frame: The following sample
provides some clarity into the
cluster identification process. The
clusters will have no meaning without
the rest of the frame.
|
|
|
|
A. |
Unidentified:
|
|
|
{
~ Painted File
# Media File Name
TimRom41.pzh
*
# Media File Size
01 C1; Width
02 9E; Height
*
# TEXT
Times Roman 41
English
*
# Shapes
EE DF 40 00 00 00 00 00
FE EF F0 F0 0F 00 00 0B 0
*
# Borders
C0 03 6D CC C
F6 54 00 A4 08 81 0
*
# Cluster Patterns
9
8
5
*
# Clusters
U| 41| 21| 37|895|5A 8C E1 0A 23 32 33 18
U|107|410|155|962|6A 8C C1 8E 29 82 B3 84 34 5
*
}
|
|
B. |
Identified:
|
|
|
{
~ Painted File
# Media File Name
TimRom41.pzh
*
# Media File Size
01 C1; Width
02 9E; Height
*
# TEXT
Times Roman 41
English
*
# Shapes
EE DF 40 00 00 00 00 00
FE EF F0 F0 0F 00 00 0B 0
*
# Borders
C0 03 6D CC C
F6 54 00 A4 08 81 0
*
# Cluster Patterns
9
8
5
*
# Clusters
|a|5A 8C E1 0A 23 32 33 18
|b|6A 8C C1 8E 29 82 B3 84 34 5
*
}
|
|
|
|
|