The Lossless Color Component Finder
Libsupercc actually contains both the color and monochrome lossless component
finders, but the monochrome case is pretty simple, and is basically the same as
the lossy monochrome component finder. This document won't discuss the
monochrome stuff beyond the recognition of the files involved.
The lossless monochrome component finder
1. Files:
- cc2box.c (cc2box.h)
- tokenCC.c (tokenCC.h)
The lossless color component finder
1. Files:
(if you don't understand some of the terms below, keep reading, everything
will be made clear in its time...)
- colorCC.c (colorCC.h) - main interface for the color cc
(connected component) finder. basically just initializes (and cleans
up) the color cc finder cookie. the most interesting functions are:
- DgpCC_doStrip() - drives the color cc finding algorithms
- dgpcc_updateStatus() - attempts to remove any finished ccs and
convert them into DgpCC_ColorCC's to be returned to the caller
(in this case libcompdoc)
- dgpcc_movetoNextLine() - after a line of image data is read,
this function moves all of the still active ccs to the next line, and
indicates which ccstates are no longer active
- colorCCread.c - this is the next level below colorCC.c.
this is where a line of the image is read, and relationships between
neighboring ccstates are initialized (they can be changed later by the
heuristics). the interesting functions are:
- readNextRun() - this is a wrapper around the run reader
function passed in to DgpCC_doStrip(). it gets the next run
and combines any runs on the same line that should be one run (the
application sending us the data may choose to represent runs in whatever
fashion they choose, and two matching, contiguous runs could be broken
up by the application. this function corrects any issues like that
and returns the next run, or end of line). the other algorithms in
the program depend on runs not being contiguous on the same line (washes
are a special case), otherwise things start breaking.
- dgpcc_readLine() - this function keeps track of the current
line and the previous line. it matches up runs on the current line
with any overlapping runs on the previous line that are part of the same
cc. it also starts to define relationships between touching runs
that are not currently part of the same cc (for the heuristics).
- colorCCstate.c (colorCCstate.h) - the most important data
type of this whole process is the ccstate, which is the internal
representation of a cc while it is still being defined. this file
contains functions which manipulate the ccstate. the interesting
functions:
- ccstate_merge() - this takes two ccstates which were formerly
separate and makes them one ccstate.
- ccstate_convertToColorCC() - this takes the internal
representation of a cc (a ccstate) and turns it into an external
representation (a DgpCC_ColorCC). this happens whenever the
heuristics are "finished" with a ccstate (a ccstate is
finished when there is no possible way to add to it anymore, whether through
new runs, wash merges, or fills).
- ccstate_disassembleSampledCC() - small sampled ccs below a
certain threshold (passed in to DgpCC_initPage()), need to be
separated into solid color ccs, which is done by this function.
it's not as simple as it may seem, and this function had to be a little
more complicated than i originally thought. it does work, though,
and this case is pretty infrequent anyway.
- ccstate_movetoNextLine() - after a line of image runs is read,
every ccstate which is currently active (has runs on the current and
previous lines) needs to be "moved" to the next line.
moving entails taking any runs in the previous line and converting them
into semi-compact run segments, and taking any runs in the current line
and moving them to the previous line.
- ccstate_underFill() - if it is determined that one ccstate is
on top of another ccstate, sometimes the fill algorithm will fill
underneath the inside ccstate. this filling is done in this
function. it's very similar to merging the outside ccstate with a
copy of the inside ccstate.
- scrs_merge() - this does a lot of the work of ccstate_merge().
each line of semi-compact run segments (scrs) must be merged between the
two ccstates being merged. this function merges one line of
semi-compact run segments.
- scrs_fill() - similar to scrs_merge(), this function
does most of the work of ccstate_underFill(). each line of
semi-compact runs in the inner ccstate must be replicated and merged
with the corresponding line in the outer ccstate.
- colorCCwash.c - like the name implies, this file holds the wash
merging heuristics. interesting functions:
- dgpcc_doWashMerge1() - this function finds all the ccstates
that could wash merge with a given ccstate (at this time), and merges
them.
- colorCCfill.c - again, pretty obvious, this file holds the filling
heuristics. interesting functions:
- dgpcc_doFillHoles1() - this function finds all the ccstates
involved in a potential fill relationship with a given ccstate and
attempts to do the fills. this function involves some
recursiveness because inner ccstates which have inner ccstates need to
be fill checked before the outer ccstate.
- dgpcc_fillCheck() - this is the actual fill check heuristic
which determines how inner ccstates are handled. sometimes they
are filled under, and sometimes they could be merged with the ccstate
which is outside of the outer ccstate.
- colorCCutil.c (colorCCutil.h, colorCCutil_t.h) -
there are a bunch of data types used in this code which are all defined in colorCCutil_t.h.
many macros defined for using these data types are in colorCCutil.h.
all the functions for these data types are defined in colorCCutil.c
(except those for ccstates). interesting functions:
- cccmap_combine() - the relationships between ccstates are kept
track of in the cccmap (color connected component map). whenever a
ccstate is combined with another ccstate, the relationships between
these two ccstates and the rest of the ccstates must be combined, which
is done in this function.
- colorCC-priv.h - a few random function definitions.
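The run combining that readNextRun() is described as doing can be sketched roughly as follows. This is only an illustration: DemoRun and coalesce_line() are hypothetical names, the real function wraps the caller's run reader rather than working on an array, and the sketch only handles solid runs that match by color.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical run record; the real run structure is not shown in this doc. */
typedef struct {
    int xpos;        /* first pixel of the run */
    int length;      /* number of pixels in the run */
    unsigned color;  /* packed solid color */
} DemoRun;

/* Combine contiguous, same-colored runs of one line in place.
 * Returns the number of runs left after combining. */
static size_t coalesce_line(DemoRun *runs, size_t n)
{
    size_t out = 0;

    assert(runs != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (out > 0 &&
            runs[out - 1].color == runs[i].color &&
            runs[out - 1].xpos + runs[out - 1].length == runs[i].xpos) {
            /* same color and touching: extend the previous run */
            runs[out - 1].length += runs[i].length;
        } else {
            /* different color or a gap: start a new run */
            runs[out++] = runs[i];
        }
    }
    return out;
}
```

After this pass no two matching runs on a line are contiguous, which is the property the rest of the code is said to depend on.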
2. Important Data Types:
- CCstate - this is the internal representation of a color connected
component while it is still being defined. key fields:
- map_index - the current index of this ccstate in the cccmap
- num_neighbors - number of other ccstates touching this ccstate
- num_potential_washes - number of other ccstates which this
ccstate is touching which could potentially have a wash relationship
with it.
- num_potential_fills - number of other ccstates which this
ccstate is touching which could potentially have a fill relationship
with it.
- sc_lines - this is a list of semi-compact run segment (sc_run_seg's
or sc_xrun_seg's) lists which represent the runs of the ccstate
above the current and previous run lines.
- prev_run_line - this is a list of Run's which represent
the next to last line of this ccstate (as it is currently represented,
the next to last line may change as more lines of image runs are
read). see Note below
- cur_run_line - this is a list of Run's which represent
the last line of this ccstate. Note - once a ccstate is no longer
active, the Run's in the cur_run_line and prev_run_line
lists will be turned into semi-compact runs and added to the sc_lines
list.
- state
- normal - this is a solid ccstate
- sampled - this is a sampled ccstate
- wash - this ccstate is a wash
- background - this is a solid ccstate which we are
considering "background" (it's big and it matches the
pre-defined background color). Note - these four values (normal,
sampled, wash, and background) are mutually
exclusive.
- active - one of three values indicating whether this
ccstate is active, newly inactive (it just became inactive and the
heuristics have not been run yet), or inactive.
- above - this ccstate has been filled underneath of (it is
"above" another ccstate).
- Run - a run represents a run of image data in the last two lines of
image that have been read (the "current" and "previous"
lines). key fields:
- cccrun - info about the run (length, color, etc).
- ccstate - the ccstate which this Run is currently a part
of.
- sc_run_seg - the semi-compact run segment. this data type and
the sc_xrun_seg are used in a very tricky way in the code. they
were specifically laid out so that some algorithms could be made to work
easily on both data types. it is used to represent groups of solid
runs. it was created to be a compromise between size and speed.
each segment currently holds up to 4 runs, and a pointer to the next segment
in the line. this way runs can be added quickly and with little
copying (by inserting segments in the list), but don't use too much memory
(like a Run structure for each run, because some of the info in the Run
structure is redundant with the info held by the ccstate). key fields:
- next - pointer to next segment of runs in the line
- num_runs - number of runs currently held in this segment
- runs - array of xpos's and length's of up to 4 runs
- sc_xrun_seg - the semi-compact extended run segment. this
data type is laid out so that its beginning fields are the same as those of
the sc_run_seg, so certain algorithms can work with either data type.
but, it holds extra info, for use with sampled runs and washes. key
fields:
- next, num_runs, runs - same as above
- userdata - array of extra information for each run (for sampled
runs it holds a pointer to the sampled data, for wash runs it holds the
color of that run)
- DgpCC_ColorCC - external representation of a color cc (fields
are pretty self explanatory).
- cccMap - this is the color connected component map which keeps
track of the relationships between neighboring ccstates. it is
basically set up like a matrix cut in half diagonally. since all the
info between two ccstates is symmetric, you don't need both the (i, j) and
the (j, i) entries in the matrix, you only need (i, j) where i > j.
so, every pair (i, j) has exactly one entry in the matrix, where i !=
j. also, the size of the matrix is dynamic (each row is a separate
block of memory), so rows can be added and removed as the number of ccstates
changes (which happens a lot). key fields:
- col - this "column" holds pointers to each row of the
matrix and the ccstate associated with that row.
- num_els - the number of ccstates currently in the map
- max_num_els - the maximum number of ccstates that can be held
in the map (before it must be expanded)
- cached_index - this value is guaranteed to be either the index
of the lowest unused row in the map, or less than the index of the
lowest unused row in the map. so, if the row at cached_index
is used (col[cached_index].ccstate != NULL), the lowest
unused row can be found by incrementing cached_index until (col[cached_index].ccstate
== NULL), provided, of course, that num_els < max_num_els.
- cccMapRowData - each row pointed to by col is an array of cccMapRowData
structures, which contains info about the relationship between two ccstates.
- Example, given ccstate_i and ccstate_j, where ccstate_i != ccstate_j:
- if ccstate_i->map_index == i, then ccstate_i == col[i].ccstate
- if ccstate_j->map_index == j, then ccstate_j == col[j].ccstate
- let m = max(i, j)
- let n = min(i, j)
- then col[m].row[n] is the cccMapRowData containing info about the
relationship between ccstate_i and ccstate_j
key fields:
- distance - this is the maximum color distance between two
ccstates (sum of squares of each component: r, g, b). for
solid ccstates, the maximum value is the same anywhere.
however, for a wash, the color distance could vary depending on the
color where the two ccstates touch, in which case the maximum value
of all these values is saved.
- state
- flags
- neighbor - these two ccstates are neighbors
- potential_wash - these two ccstates have a
potential wash relationship
- potential_fill - these two ccstates have a
potential fill relationship
- potential_fill_conflict - these two ccstates have a
potential fill conflict
- all - this field is used for combining the above flags
for two different cccMapRowData's (when merging
occurs). since the all field is a union with the
above flags, the flags for two cccMapRowData's
can be merged by bit OR'ing the two all fields together.
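To make the half-matrix lookup and the flag merging concrete, here is a minimal sketch. All of the Demo* types and demo_* functions are hypothetical simplifications (the real cccMap also stores a ccstate pointer per row and grows dynamically), and taking the larger distance when merging two entries is an assumption based on the "maximum color distance" wording above.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for the flags/all union in cccMapRowData. */
typedef union {
    struct {
        unsigned neighbor : 1;
        unsigned potential_wash : 1;
        unsigned potential_fill : 1;
        unsigned potential_fill_conflict : 1;
    } flags;
    unsigned all;            /* overlays the bit flags above */
} DemoState;

typedef struct {
    unsigned distance;       /* max color distance seen so far */
    DemoState state;
} DemoRowData;

typedef struct {
    DemoRowData **rows;      /* rows[m] holds m entries: columns 0..m-1 */
    size_t num_rows;
} DemoMap;

static void demo_map_free(DemoMap *map)
{
    if (!map) return;
    for (size_t m = 0; m < map->num_rows; m++) free(map->rows[m]);
    free(map->rows);
    free(map);
}

static DemoMap *demo_map_new(size_t n)
{
    DemoMap *map = malloc(sizeof *map);
    if (!map) return NULL;
    map->rows = calloc(n ? n : 1, sizeof *map->rows);
    if (!map->rows) { free(map); return NULL; }
    map->num_rows = n;
    for (size_t m = 1; m < n; m++) {     /* row 0 has no entries */
        map->rows[m] = calloc(m, sizeof **map->rows);
        if (!map->rows[m]) { demo_map_free(map); return NULL; }
    }
    return map;
}

/* One entry per unordered pair: always index as [max][min]. */
static DemoRowData *demo_map_pair(DemoMap *map, size_t i, size_t j)
{
    assert(i != j && i < map->num_rows && j < map->num_rows);
    size_t m = i > j ? i : j;
    size_t n = i > j ? j : i;
    return &map->rows[m][n];
}

/* Fold src into dst: OR the flags via 'all', keep the larger distance. */
static void demo_merge_rowdata(DemoRowData *dst, const DemoRowData *src)
{
    if (src->distance > dst->distance) dst->distance = src->distance;
    dst->state.all |= src->state.all;
}
```

Because each row is its own allocation, adding or dropping a ccstate only touches one row, which fits the note above that the number of ccstates changes a lot.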
3. Algorithm:
Alrightee, now that you've been thoroughly inundated by confusing info, i'll
try to make sense of it all by describing the algorithm.
3.1 Reading Runs
Note - connected components are 8 connected, meaning for a given pixel 0,
all the X pixels could be connected to 0:
X X X
X 0 X
X X X
Runs are read from an image left to right, top to bottom, line by line.
The caller gives us a run reading function which we call to give us each
run. We keep the two most recent lines (the line we are reading and the
line before it) as arrays of Runs. This enables us to determine
neighbor relationships quickly and easily. As the current line is read (in
dgpcc_readLine()), runs are compared to their neighbors. If a run
matches its neighbor (at this point in the algorithm, the only way two runs can
match is if they are both the same color or they are both sampled), it is added
to the ccstate of the neighboring run. if the current run already has a
ccstate, and it's different from the ccstate of the matching run, these two
ccstates are merged. if two neighboring runs do not match, the color
distance between them is calculated, and it is determined if they could possibly
wash merge in the future, or possibly be part of a fill relationship, and the cccMapRowData
for these two runs' ccstates is updated accordingly. i'll describe the
wash and fill algorithms in depth below, but one other thing that is done at
this point is that potential fill conflicts are also detected. again, i'll
describe this in depth below.
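Two of the checks in this step are easy to show in isolation: whether a run on the current line touches a run on the previous line under 8-connectivity, and the sum-of-squares color distance. The sketch below uses hypothetical names (LineRun, runs_touch, color_distance) and assumes 8-bit r, g, b components; it is not the actual dgpcc_readLine() code.

```c
#include <assert.h>

/* Hypothetical solid run: x extent plus an 8-bit rgb color. */
typedef struct {
    int xpos;
    int length;
    unsigned char r, g, b;
} LineRun;

/* 8-connected test for runs on adjacent lines: the x ranges may
 * overlap or meet diagonally (be off by one pixel at either end). */
static int runs_touch(const LineRun *above, const LineRun *below)
{
    assert(above->length > 0 && below->length > 0);
    int a_first = above->xpos, a_last = above->xpos + above->length - 1;
    int b_first = below->xpos, b_last = below->xpos + below->length - 1;
    return a_first <= b_last + 1 && b_first <= a_last + 1;
}

/* Color distance as the sum of squares of the component differences. */
static unsigned color_distance(const LineRun *p, const LineRun *q)
{
    int dr = p->r - q->r;
    int dg = p->g - q->g;
    int db = p->b - q->b;
    return (unsigned)(dr * dr + dg * dg + db * db);
}
```

With 4-connectivity the "+ 1" terms would be dropped; keeping them is what lets a run connect through a corner, which matters later for the potential fill conflict case.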
after a line of the input image is read, all active ccstates must be moved to
the next line. an active ccstate is a ccstate which has runs in the
current line. it is active in that runs could still be added to it by the
next line of runs. active ccstates that had runs in the previous line, but
didn't gain any runs in the current line, will become newly inactive ccstates.
how does a newly inactive ccstate become inactive, and eventually a DgpCC_ColorCC?
At set times (currently after 80 newly inactive ccstates have been gathered, or
at the end of the page), the heuristics are run over the newly inactive ccstates.
once that happens, the ccstates become inactive. if an inactive ccstate
has no more potential to gain runs (num_potential_fills == 0 and num_potential_washes
== 0), then it can be removed entirely. this happens through a call to ccstate_convertToColorCC().
special cases: small sampled images are broken apart into solid ccstates
through a call to ccstate_disassembleSampledCC() (JPEG compression isn't
worth the effort below a certain size image); background ccstates are
simply deleted (they are unnecessary).
The heuristics themselves are described below; currently we only do wash
detection and filling.
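The life cycle described in this section (active, newly inactive, inactive, then converted) can be sketched as a small state machine. Everything here is a hypothetical simplification: DemoCCState only counts runs instead of holding them, and the demo_* functions stand in for the bookkeeping spread across dgpcc_movetoNextLine() and dgpcc_updateStatus().

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_HEURISTIC_BATCH 80   /* batch size quoted in the text above */

typedef enum { DEMO_ACTIVE, DEMO_NEWLY_INACTIVE, DEMO_INACTIVE } DemoActivity;

typedef struct {
    DemoActivity active;
    int prev_runs;               /* runs on the previous line */
    int cur_runs;                /* runs on the current line */
    int num_potential_washes;
    int num_potential_fills;
} DemoCCState;

/* After a line is read: ccstates with no runs on the current line become
 * newly inactive, the rest shift current -> previous.
 * Returns how many became newly inactive. */
static size_t demo_move_to_next_line(DemoCCState *ccs, size_t n)
{
    size_t newly_inactive = 0;

    assert(ccs != NULL || n == 0);
    for (size_t k = 0; k < n; k++) {
        if (ccs[k].active != DEMO_ACTIVE) continue;
        if (ccs[k].cur_runs == 0) {
            ccs[k].active = DEMO_NEWLY_INACTIVE;
            newly_inactive++;
        } else {
            ccs[k].prev_runs = ccs[k].cur_runs;
            ccs[k].cur_runs = 0;
        }
    }
    return newly_inactive;
}

/* Heuristics run once enough ccstates are pending, or at end of page. */
static int demo_should_run_heuristics(size_t pending, int end_of_page)
{
    return end_of_page || pending >= DEMO_HEURISTIC_BATCH;
}

/* Once the heuristics have run, newly inactive ccstates are inactive. */
static void demo_heuristics_done(DemoCCState *ccs, size_t n)
{
    for (size_t k = 0; k < n; k++)
        if (ccs[k].active == DEMO_NEWLY_INACTIVE)
            ccs[k].active = DEMO_INACTIVE;
}

/* An inactive ccstate with no pending washes or fills can be converted. */
static int demo_is_finished(const DemoCCState *cc)
{
    return cc->active == DEMO_INACTIVE &&
           cc->num_potential_fills == 0 &&
           cc->num_potential_washes == 0;
}
```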
3.2 Wash Detection
Applications like powerpoint can create these big, ugly color gradient
backgrounds (washes), which are drawn using GDI line drawing calls, so they
aren't seen as sampled data by the color cc finder. If these washes aren't
handled specially, the resulting XIFF files can be huge because each color
change is a new token, which can end up being thousands (or hundreds of
thousands) of tokens on the page. in order to combat this, we do wash
detection.
the grand total of wash detection is that we find ccs which are close in
color, and we merge them together. the closeness is calculated during the
run reading phase when neighbor relationships are defined. the most
important part of this process is that active ccstates cannot be wash merged
with another ccstate until they become newly inactive (or just inactive).
why is this important? because this alleviates the problem of wash merging
ccstates and later determining that they shouldn't be wash merged and trying to
pull them apart as an afterthought. why might you determine later that
something shouldn't be part of a wash? because a blue character might be
written over a bluish wash and at one point it may seem that the character is
part of the wash, but as more runs get added to the wash, we find out that the
blue character is far apart in color value from some of the later runs.
instead of having to try to remove the character from the wash, we wait until
the character is inactive, and then we can safely decide whether to add it to
a wash, because we've kept track of the maximum color distance between this
character and the wash. in this example, by the time the character is
inactive, the color distance is over the wash threshold, so it is not merged.
(of course, if the wash threshold is too high, or the
character is too close to the wash all over, then the character will become part
of the wash... such is life.)
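The blue character example comes down to tracking a running maximum and deferring the decision. A minimal sketch, with hypothetical names (WashLink, wash_link_touch, wash_link_can_merge) and assuming that a merge is only attempted once neither ccstate is active, which is one reading of the rule above:

```c
#include <assert.h>

/* Hypothetical per-pair record, standing in for the distance
 * field kept in the cccMapRowData for two touching ccstates. */
typedef struct {
    unsigned max_distance;   /* largest color distance seen at any touch */
} WashLink;

/* Called each time runs of the two ccstates touch while reading lines. */
static void wash_link_touch(WashLink *link, unsigned distance)
{
    assert(link != 0);
    if (distance > link->max_distance)
        link->max_distance = distance;
}

/* The merge decision is deferred until nothing can change anymore. */
static int wash_link_can_merge(const WashLink *link, int any_active,
                               unsigned threshold)
{
    return !any_active && link->max_distance <= threshold;
}
```

Early on, the character may be well under the threshold, but because the decision is only made after it goes inactive, the later and larger distances are already accounted for.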
once these wash ccstates are inactive, we turn them into sampled images, but
indicate to the caller that they came from a wash, because they can be JPEG'ed
at a lower quality (at the caller's discretion, of course) without it making a
noticeable difference.
by the way, i think the wash detection is fairly simple, so there probably
are few improvements that could be made to it, the filling code below, however,
is much different!
3.3 Filling
Why fill? well, if you'll notice, the "o"s in this sentence,
they actually contain two color connected components, the black "o",
and the white circle in the middle (yes, in this case you could ignore the white
circle because it's "background", but if the "o" was over a
solid blue box, then you have the same circle in the middle, it would now be
blue, just follow the example...). however, this white circle is redundant
information because it matches the ccstate underneath of the black
"o", in this case the background of the page (in the case with the
solid blue box, the blue circle matches the blue box outside of the
"o"). if we can detect this, we can fill underneath the
black "o", and ditch the white circle in the middle, thus bettering
compression. this is a very simple example, but actually filling occurs
all over a page when various items intersect with each other, creating ccstates
of one color completely inside ccstates of another color.
however, detecting this is *QUITE* hard. i started out trying to detect
almost all possible fills, but i ended up doing the easiest possible case and there
are still some difficulties (see the potential fill conflict explanation below).
The current algorithm basically just tries to find ccstates which are
guaranteed to be inside of another ccstate. the initial detection takes
place in
dgpcc_readLine() where runs are compared to the runs above them.
when a run first matches a run above it (if ever) the matching run is
recorded. then, if this same run overlaps any other runs above it which
happen to be a part of the same ccstate (which this run is a part of), we know
that we have closed a "hole" (well, not really, see potential fill
conflicts below). then, any ccstates which have runs between the above run
we originally matched and the run we just found (that has the same ccstate as
us), are holes in the current ccstate, and the potential_fill flags are
marked in the corresponding cccMapRowData's.
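The initial hole detection can be sketched on a single pair of lines. This is a loose illustration only: HoleRun, hole_touches() and mark_holes() are hypothetical, ccstates are reduced to integer ids, the result is a simple per-run flag instead of potential_fill flags in the cccMapRowData, and the potential fill conflict check is left out entirely.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical run: x extent plus the id of the ccstate it belongs to. */
typedef struct {
    int xpos;
    int length;
    int cc;
} HoleRun;

/* 8-connected touch test between runs on adjacent lines. */
static int hole_touches(const HoleRun *above, const HoleRun *below)
{
    return above->xpos <= below->xpos + below->length &&
           below->xpos <= above->xpos + above->length;
}

/* For one run on the current line, scan the previous line left to right.
 * Whenever the run touches a second run of its own ccstate, every run
 * between that one and the previously matched one is marked as a hole.
 * Returns the number of runs newly marked. */
static size_t mark_holes(const HoleRun *prev, size_t n,
                         const HoleRun *cur, int *is_hole)
{
    size_t marked = 0;
    size_t last_match = 0;
    int have_match = 0;

    assert(prev != NULL && cur != NULL && is_hole != NULL);
    for (size_t k = 0; k < n; k++) {
        if (prev[k].cc != cur->cc || !hole_touches(&prev[k], cur))
            continue;
        if (have_match) {
            for (size_t h = last_match + 1; h < k; h++) {
                if (!is_hole[h]) {
                    is_hole[h] = 1;
                    marked++;
                }
            }
        }
        have_match = 1;
        last_match = k;
    }
    return marked;
}
```

For example, if the previous line holds runs of ccstates 1, 2, 1 and the current run of ccstate 1 spans all three, the middle run is marked as a hole in ccstate 1.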
Later, in dgpcc_doFillHoles1() we find any ccstates with potential
fills (num_potential_fills > 0) and attempt to figure out the filling
relationship. The relationship works like this: given the outer_ccstate,
the ccstate, and the inner_ccstate:
- if the inner_ccstate matches the outer_ccstate, and the inner_ccstate is a
significant percentage of the ccstate, we merge the inner_ccstate with the
outer_ccstate.
- why does the inner_ccstate need to be a significant percentage of the
ccstate?
- imagine a large black box which is almost the size of the page
- put lots of white text inside the black box
- now, the white text matches the background around the black box,
so you could have a black box with a bunch of white holes, or
- you could realize that you would get better compression if you made
the black box solid and instead token compressed all the white
boxes.
- so, we detect that by requiring that the inner_ccstate(s) (the
white text) be a significant percentage of the ccstate (the black
box)
- else, we fill the ccstate underneath the inner_ccstate
And, this happens recursively, so each inner ccstate, if it has ccstates
inside of it, we fill detect on them first (trust me, it works better this way),
then fill detect on the inner_ccstate itself.
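The merge-or-fill rule and its recursion order can be sketched like this. It is not the real dgpcc_fillCheck(): FillNode, FillAction and fill_check() are hypothetical, "matches" is reduced to color equality, size is measured as a pixel area per inner ccstate, and min_fraction is a made-up parameter standing in for "a significant percentage".

```c
#include <assert.h>
#include <stddef.h>

typedef enum { FILL_NONE, FILL_MERGE_WITH_OUTER, FILL_UNDER } FillAction;

/* Hypothetical ccstate stand-in: color, size, and the ccstates inside it. */
typedef struct FillNode {
    unsigned color;
    long area;                  /* pixel count, assumed size measure */
    struct FillNode **inner;    /* ccstates found inside this one */
    size_t num_inner;
    FillAction action;          /* decision made about this node */
} FillNode;

/* Decide what happens to every ccstate nested inside cc.
 * outer is the ccstate surrounding cc (NULL at the top level). */
static void fill_check(const FillNode *outer, FillNode *cc,
                       double min_fraction)
{
    assert(cc != NULL);
    for (size_t k = 0; k < cc->num_inner; k++) {
        FillNode *in = cc->inner[k];

        /* innermost first: handle the holes of the inner ccstate */
        fill_check(cc, in, min_fraction);

        if (outer != NULL &&
            in->color == outer->color &&
            (double)in->area >= min_fraction * (double)cc->area) {
            /* big enough and matches what is outside: join the outer */
            in->action = FILL_MERGE_WITH_OUTER;
        } else {
            /* otherwise cc is filled underneath the inner ccstate */
            in->action = FILL_UNDER;
        }
    }
}
```

With a white page, a black box of area 5000, a small white letter of area 20 and a min_fraction of 0.25, the letter ends up as FILL_UNDER (the box is filled solid beneath it), which is the outcome the black box example argues for.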
Potential fill conflicts: The algorithm above almost works.
unfortunately, because we are using 8 connected components, we may have a
ccstate which seems to be a hole in another ccstate, but could actually leak out
through a corner connection. example:
In this case 0' seems to be a hole in the X connected component (using the
algorithm described above), but actually 0' connects to the 0 connected
component and isn't really a hole at all. in fact the filling relationship
between these two components is very difficult to define. (if you don't
understand it, just stare at it for a while, it'll come to you). We detect
this case in dgpcc_readLine() and mark ccstates which have a relationship
like this using the cccMapRowData potential_fill_conflict
flag. If this flag is set when a ccstate is being evaluated by dgpcc_doFillHoles1(),
the fill relationship between this ccstate and the other ccstate is just
ignored. believe me, it's almost impossible to figure out the relationship
without some time-consuming computation (you could go line by line through both
ccstates and see if one ccstate is outside of another ccstate by checking the
runs at the ends of every line and determining if the outer ccstate has the runs
at both ends of every line. i determined this to not really be worth the
effort, so i didn't do it...)
However, without much thinking, i'm sure you can come up with fifty fairly
simple cases that this algorithm misses (i know i can). however, speed is
of the essence in the ntprint world, and filling, while helping compression
somewhat, isn't necessary to achieve visually "correct" output, so i
went with the fast and simple solution (which, if you look at the code, isn't
even that simple...). good luck!
4. Current Issues
- profiling - i only did a little profiling of this code through the
coding process, so i'm sure some good bottlenecks could be found with a
healthy dose of quantify. from just a cursory view, it seems to me
that images still seem to be a big time cruncher, especially at higher
resolutions.
- scaled images with scaled masks - no, it's not a bug. you may
notice what seem to be bugs (in toursrus on the last page, if you
look *real* close) in the output, but what you are actually seeing is the
effects of scaled masks. in order to save space, large images can be
scaled (default is 100%). however, if those images have masks
(currently we use XIFF transparency masks), the masks must be the same size as
the image, so the masks need to be scaled as well. this produces masks
which are not as accurate when scaled up. this will probably go away
in TIFF-FX because masks are not used, you have to use a selector for
everything. this would allow you to have a scaled image and a full
size "mask" (the selector), which would be cool. we should
optimize santa, however, to render layer pairs like this only using a
selector the size of the image, not making a full page selector (faster,
better mem usage).
- better filling algorithm? - as i've noted above, the filling code
is very simplistic, and could be improved (but at what cost in terms of
speed?)...