The Lossless Color Component Finder

 

Libsupercc actually contains both the color and monochrome lossless component finders, but the monochrome case is pretty simple and is basically the same as the lossy monochrome component finder.  This document won't discuss the monochrome finder beyond listing the files involved.

The lossless monochrome component finder

 1. Files:

The lossless color component finder

1. Files:

(If you don't understand some of the terms below, keep reading; everything will be made clear in its time...)

2. Important Data Types:

3. Algorithm:

Alrightee, now that you've been thoroughly inundated by confusing info, I'll try to make sense of it all by describing the algorithm.

3.1 Reading Runs

Note - connected components are 8-connected, meaning that for a given pixel 0, any of the X pixels could be connected to it:

X X X
X 0 X
X X X
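
For runs rather than single pixels, 8-connectivity means that a run on one line touches a run on the line above if their column spans overlap or are off by one column (the diagonal case).  Here is a minimal sketch of that test.  The SketchRun type and the function name are made up for illustration and are not the library's Run type.

```c
#include <stdbool.h>

/* Hypothetical run type for this sketch: one horizontal span of pixels
   on a single scan line, with inclusive column bounds. */
typedef struct {
    int x0;  /* first column covered by the run */
    int x1;  /* last column covered by the run (inclusive) */
} SketchRun;

/* True if `cur` (on line y) touches `above` (on line y - 1) under
   8-connectivity.  Widening `above` by one column on each side accounts
   for the two diagonal neighbors. */
static bool sketch_runs_touch8(SketchRun above, SketchRun cur)
{
    return cur.x0 <= above.x1 + 1 && cur.x1 >= above.x0 - 1;
}
```

With `above` covering columns 2..4, a run covering 5..7 on the next line still touches it (through the corner), while a run covering 6..9 does not.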

Runs are read from an image left to right, top to bottom, line by line.  The caller gives us a run-reading function, which we call to get each run.  We keep the two most recent lines (the line we are reading and the line before it) as arrays of Runs.  This lets us determine neighbor relationships quickly and easily.  As the current line is read (in dgpcc_readLine()), runs are compared to their neighbors.  If a run matches its neighbor (at this point in the algorithm, the only way two runs can match is if they are both the same color or they are both sampled), it is added to the ccstate of the neighboring run.  If the current run already has a ccstate, and it's different from the ccstate of the matching run, the two ccstates are merged.  If two neighboring runs do not match, the color distance between them is calculated, we determine whether they could possibly wash merge in the future or be part of a fill relationship, and the cccMapRowData for the two runs' ccstates is updated accordingly.  I'll describe the wash and fill algorithms in depth below.  One other thing done at this point is the detection of potential fill conflicts, which I'll also describe in depth below.
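
The "add to the neighbor's ccstate, or merge the two ccstates" step can be sketched with a small union-find structure.  This is only an illustration of the bookkeeping: union-find is my substitution here, not necessarily how the library merges ccstates, and every name below is hypothetical.

```c
#define SKETCH_MAX_CC 1024

/* sketch_parent[i] == i means i is the representative of its ccstate. */
static int sketch_parent[SKETCH_MAX_CC];
static int sketch_count = 0;

/* Create a fresh ccstate id, or return -1 if the table is full. */
static int sketch_new_cc(void)
{
    if (sketch_count >= SKETCH_MAX_CC)
        return -1;
    sketch_parent[sketch_count] = sketch_count;
    return sketch_count++;
}

/* Follow parent links to the representative, halving the path as we go. */
static int sketch_find(int id)
{
    while (sketch_parent[id] != id) {
        sketch_parent[id] = sketch_parent[sketch_parent[id]];
        id = sketch_parent[id];
    }
    return id;
}

/* Merge two ccstates and return the surviving representative. */
static int sketch_merge(int a, int b)
{
    int ra = sketch_find(a);
    int rb = sketch_find(b);
    if (ra != rb)
        sketch_parent[rb] = ra;
    return ra;
}

/* A run that matches a neighbor either joins the neighbor's ccstate
   (cur_cc < 0 means "no ccstate yet") or forces a merge of the two. */
static int sketch_attach(int cur_cc, int neighbor_cc)
{
    if (cur_cc < 0)
        return sketch_find(neighbor_cc);
    return sketch_merge(cur_cc, neighbor_cc);
}
```

A run that matches two neighbors belonging to different ccstates would call sketch_attach() once per neighbor, which leaves all three sharing one representative.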

After a line of the input image is read, all active ccstates must be moved to the next line.  An active ccstate is a ccstate which has runs in the current line; it is active in that runs could still be added to it by the next line of runs.  Ccstates that had runs in the previous line but didn't gain any runs in the current line become newly inactive ccstates.

How does a newly inactive ccstate become inactive, and eventually a DgpCC_ColorCC?  At set times (currently after 80 newly inactive ccstates have been gathered, or at the end of the page), the heuristics are run over the newly inactive ccstates.  Once that happens, the ccstates become inactive.  If an inactive ccstate has no more potential to gain runs (num_potential_fills == 0 and num_potential_washes == 0), it can be removed entirely.  This happens through a call to ccstate_convertToColorCC().  Special cases:  small sampled images are broken apart into solid ccstates through a call to ccstate_disassembleSampledCC() (JPEG compression isn't worth the effort below a certain image size); background ccstates are simply deleted (they are unnecessary).
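
The two conditions in that paragraph (when to run the heuristics, and when an inactive ccstate can be converted) boil down to a pair of predicates.  This is a rough sketch: the struct is a stand-in that carries only the two counters named above, and the function names are invented.

```c
#include <stdbool.h>

/* Batch size quoted in the text; the constant name is hypothetical. */
enum { SKETCH_FLUSH_THRESHOLD = 80 };

/* Stand-in for a ccstate, holding only the two counters named above. */
typedef struct {
    int num_potential_fills;
    int num_potential_washes;
} SketchCCState;

/* Run the heuristics once enough newly inactive ccstates have piled up,
   or unconditionally at the end of the page. */
static bool sketch_should_run_heuristics(int num_newly_inactive, bool end_of_page)
{
    return end_of_page || num_newly_inactive >= SKETCH_FLUSH_THRESHOLD;
}

/* An inactive ccstate can be converted once nothing can still join it. */
static bool sketch_can_convert(const SketchCCState *cc)
{
    return cc->num_potential_fills == 0 && cc->num_potential_washes == 0;
}
```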

The heuristics themselves are described below.  Currently we only do wash detection and filling.

3.2 Wash Detection

Applications like PowerPoint can create these big, ugly color gradient backgrounds (washes), which are drawn using GDI line drawing calls, so they aren't seen as sampled data by the color cc finder.  If these washes aren't handled specially, the resulting XIFF files can be huge, because each color change is a new token, which can add up to thousands (or hundreds of thousands) of tokens on the page.  To combat this, we do wash detection.

The sum total of wash detection is that we find ccs which are close in color and merge them together.  The closeness is calculated during the run reading phase, when neighbor relationships are defined.  The most important part of this process is that active ccstates cannot be wash merged with another ccstate until they become newly inactive (or just inactive).  Why is this important?  Because it avoids the problem of wash merging ccstates, later determining that they shouldn't have been merged, and trying to pull them apart as an afterthought.  Why might you determine later that something shouldn't be part of a wash?  Because a blue character might be written over a bluish wash, and at one point it may seem that the character is part of the wash, but as more runs get added to the wash, we find out that the blue character is far apart in color value from some of the later runs.  Instead of having to try to remove the character from the wash, we wait until the character is inactive, and only then decide whether to add it to the wash.  This is safe because we've kept track of the maximum color distance between the character and the wash, and by the time the character is inactive, that distance is over the wash threshold, so the character stays out.  (Of course, if the wash threshold is too high, or the character is too close to the wash all over, then the character will become part of the wash... such is life.)
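
To make the "track the maximum distance, decide only once inactive" idea concrete, here is a small sketch.  The distance metric (largest per-channel difference), the threshold value used in the example, and all of the names are assumptions; the text above does not say which metric or threshold the library uses.

```c
#include <stdbool.h>
#include <stdlib.h>

typedef struct { unsigned char r, g, b; } SketchColor;

/* Assumed metric: the largest per-channel difference between two colors. */
static int sketch_color_distance(SketchColor a, SketchColor b)
{
    int dr = abs((int)a.r - (int)b.r);
    int dg = abs((int)a.g - (int)b.g);
    int db = abs((int)a.b - (int)b.b);
    int d = dr > dg ? dr : dg;
    return d > db ? d : db;
}

/* Hypothetical per-pair record: the largest distance seen so far between
   two neighboring ccstates. */
typedef struct {
    int max_distance;
} SketchWashLink;

/* Called whenever runs of the two ccstates turn out to be neighbors. */
static void sketch_note_neighbors(SketchWashLink *link, SketchColor a, SketchColor b)
{
    int d = sketch_color_distance(a, b);
    if (d > link->max_distance)
        link->max_distance = d;
}

/* Merging is considered only once neither ccstate is active, and only if
   the worst distance ever seen stays within the threshold. */
static bool sketch_may_wash_merge(const SketchWashLink *link,
                                  bool a_active, bool b_active, int threshold)
{
    return !a_active && !b_active && link->max_distance <= threshold;
}
```

In the blue character example, the character (0, 0, 200) first meets a wash run of (0, 0, 190), which looks like a match at distance 10.  Later it meets (0, 0, 120), which pushes the recorded maximum to 80, so with a threshold of, say, 16 the character is kept out of the wash once it goes inactive.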

Once these wash ccstates are inactive, we turn them into sampled images, but indicate to the caller that they came from a wash, because they can be JPEG'ed at a lower quality (at the caller's discretion, of course) without a noticeable difference.

By the way, I think the wash detection is fairly simple, so there are probably few improvements that could be made to it.  The filling code below, however, is a different story!

3.3 Filling

Why fill?  Well, if you look at the "o"s in this sentence, each one actually contains two color connected components: the black "o" and the white circle in the middle.  (Yes, in this case you could ignore the white circle because it's "background", but if the "o" were over a solid blue box, you would have the same circle in the middle, only blue.  Just follow the example...)  However, this white circle is redundant information, because it matches the ccstate underneath the black "o", in this case the background of the page (in the case with the solid blue box, the blue circle matches the blue box outside of the "o").  If we can detect this, we can fill underneath the black "o" and ditch the white circle in the middle, thus improving compression.  This is a very simple example, but filling actually occurs all over a page when various items intersect with each other, creating ccstates of one color completely inside ccstates of another color.

However, detecting this is *QUITE* hard.  I started out trying to detect almost all possible fills; I ended up doing the easiest possible case, and there are still some difficulties (see the potential fill conflict explanation below).

The current algorithm basically just tries to find ccstates which are guaranteed to be inside of another ccstate.  The initial detection takes place in dgpcc_readLine(), where runs are compared to the runs above them.  When a run first matches a run above it (if ever), the matching run is recorded.  Then, if this same run overlaps any other run above it which happens to be part of the same ccstate (the one this run is a part of), we know that we have closed a "hole" (well, not really; see potential fill conflicts below).  Any ccstates which have runs between the above run we originally matched and the run we just found (that has the same ccstate as us) are then holes in the current ccstate, and the potential_fill flags are marked in the corresponding cccMapRowData's.
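
Here is a rough sketch of that hole-closing scan for a single run.  It assumes the runs of the previous line are stored left to right, that each run carries a small integer ccstate id, and that a flat flag array indexed by ccstate id stands in for the potential_fill flags in cccMapRowData.  None of these names or layouts come from the library.

```c
/* Hypothetical run type: an inclusive column span plus the id of the
   ccstate that owns it. */
typedef struct {
    int x0, x1;
    int cc;
} SketchOwnedRun;

/* Scan the previous line's runs that touch `cur` (8-connected).  The
   first run of cur's own ccstate is remembered; any later run of that
   same ccstate closes a candidate hole, and every foreign ccstate with a
   run in between is flagged.  `potential_fill` must be indexable by every
   ccstate id in `above`.  Returns the number of flags newly set. */
static int sketch_mark_potential_fills(const SketchOwnedRun *above, int n_above,
                                       SketchOwnedRun cur,
                                       unsigned char *potential_fill)
{
    int first_match = -1;
    int marked = 0;

    for (int i = 0; i < n_above; i++) {
        if (above[i].x1 + 1 < cur.x0 || above[i].x0 - 1 > cur.x1)
            continue;                 /* does not touch cur */
        if (above[i].cc != cur.cc)
            continue;                 /* touches, but a different ccstate */
        if (first_match < 0) {
            first_match = i;          /* the run we originally matched */
            continue;
        }
        for (int j = first_match + 1; j < i; j++) {
            int cc = above[j].cc;
            if (cc != cur.cc && !potential_fill[cc]) {
                potential_fill[cc] = 1;
                marked++;
            }
        }
    }
    return marked;
}
```

For a previous line of X X 0 0 X X (ccstate 1, then 2, then 1 again) and a current run of ccstate 1 spanning all six columns, ccstate 2 is flagged as a potential fill.  A current run covering only the first two columns flags nothing, because it never reaches the second X run.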

Later, in dgpcc_doFillHoles1() we find any ccstates with potential fills (num_potential_fills > 0) and attempt to figure out the filling relationship.  The relationship works like this: given the outer_ccstate, the ccstate, and the inner_ccstate:

This happens recursively: if an inner ccstate has ccstates inside of it, we fill detect on those first (trust me, it works better this way), then fill detect on the inner_ccstate itself.
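
The innermost-first ordering is an ordinary post-order traversal.  The sketch below shows only the visiting order.  The tree structure is an assumption, and the actual fill decision is replaced by a placeholder that records which ccstate was evaluated.

```c
#define SKETCH_MAX_INNER 8

/* Hypothetical nesting record: a ccstate and the ccstates found inside it. */
typedef struct SketchCC {
    int id;
    int num_inner;
    struct SketchCC *inner[SKETCH_MAX_INNER];
} SketchCC;

/* Placeholder for the per-ccstate fill decision, which this sketch does
   not model.  It only records the order of evaluation. */
static void sketch_fill_detect_one(const SketchCC *cc, int *order, int *n)
{
    order[(*n)++] = cc->id;
}

/* Fill detect on everything inside `cc` first, then on `cc` itself. */
static void sketch_fill_detect(const SketchCC *cc, int *order, int *n)
{
    for (int i = 0; i < cc->num_inner; i++)
        sketch_fill_detect(cc->inner[i], order, n);
    sketch_fill_detect_one(cc, order, n);
}
```

For ccstate 1 containing ccstate 2, which in turn contains ccstate 3, the evaluation order is 3, 2, 1.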

Potential fill conflicts:  The algorithm above almost works.  Unfortunately, because we are using 8-connected components, we may have a ccstate which seems to be a hole in another ccstate but actually leaks out through a corner connection.  Example:

X X 0 0
X 0' X 0
X X X 0

In this case 0' seems to be a hole in the X connected component (using the algorithm described above), but actually 0' connects to the 0 connected component and isn't really a hole at all.  In fact, the filling relationship between these two components is very difficult to define.  (If you don't understand it, just stare at it for a while; it'll come to you.)  We detect this case in dgpcc_readLine() and mark ccstates which have a relationship like this using the cccMapRowData potential_fill_conflict flag.  If this flag is set when a ccstate is being evaluated by dgpcc_doFillHoles1(), the fill relationship between this ccstate and the other ccstate is just ignored.  Believe me, it's almost impossible to figure out the relationship without some time-consuming computation.  (You could go line by line through both ccstates and check whether one is outside the other by looking at the runs at the ends of every line and determining whether the outer ccstate has the runs at both ends of every line.  I decided this wasn't really worth the effort, so I didn't do it...)
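
For the curious, the line-by-line check described in the parentheses (the one that was not implemented) might look roughly like this.  It is a sketch under assumptions: runs are stored in one flat array with their line number and a ccstate id, and all names are invented.  It only tests the "outer has the runs at both ends of every line" condition and nothing more.

```c
#include <limits.h>
#include <stdbool.h>

/* Hypothetical run record for this sketch. */
typedef struct {
    int y;       /* scan line */
    int x0, x1;  /* inclusive column span */
    int cc;      /* ccstate id */
} SketchLineRun;

/* For every line on which `inner` has runs, require that `outer` has a
   run further left than inner's leftmost pixel and a run further right
   than inner's rightmost pixel. */
static bool sketch_outer_brackets_inner(const SketchLineRun *runs, int n,
                                        int outer, int inner,
                                        int y_min, int y_max)
{
    for (int y = y_min; y <= y_max; y++) {
        int in_lo = INT_MAX, in_hi = INT_MIN;
        int out_lo = INT_MAX, out_hi = INT_MIN;

        for (int i = 0; i < n; i++) {
            if (runs[i].y != y)
                continue;
            if (runs[i].cc == inner) {
                if (runs[i].x0 < in_lo) in_lo = runs[i].x0;
                if (runs[i].x1 > in_hi) in_hi = runs[i].x1;
            } else if (runs[i].cc == outer) {
                if (runs[i].x0 < out_lo) out_lo = runs[i].x0;
                if (runs[i].x1 > out_hi) out_hi = runs[i].x1;
            }
        }
        if (in_lo == INT_MAX)
            continue;             /* inner has no runs on this line */
        if (out_lo > in_lo || out_hi < in_hi)
            return false;         /* inner reaches an end of the line */
    }
    return true;
}
```

On the example above, the check fails on the very first line, because the 0 component extends further right than any X run.  On a plain 3x3 ring of X around a single 0 it succeeds.  The nested loop rescans every run for every line, which hints at why this was judged too slow to be worth it.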

Without much thinking, I'm sure you can come up with fifty fairly simple cases that this algorithm misses (I know I can).  However, speed is of the essence in the ntprint world, and filling, while it helps compression somewhat, isn't necessary to achieve visually "correct" output, so I went with the fast and simple solution (which, if you look at the code, isn't even that simple...).  Good luck!

4.  Current Issues