[Fontconfig] Regularizing contains operator semantics
Keith Packard
keithp at keithp.com
Fri Jul 11 14:54:56 PDT 2003
"Contains" matching issues.
The contains operator is currently used in font listing and can be used in
match/edit rules.
LISTING FONTS
When listing fonts, contains should have "obvious" semantics. I suggest
that those semantics depend on the type of the value:
string, number, boolean:
The font has an equal value for every value in the pattern. This means
that using 'times,courier' for the family will result in no fonts
being listed, as no font has both times and courier family names. In fact, I
can't see a good use for multiple values here as it would require multiple
values in the fonts; let's see if that is broken. For strings, the change
here is that 'contains' does not mean substring -- list 'courier' and you
won't see 'courier 10 pitch'. I think strings should be treated as atomic
values in this context; fontconfig doesn't have string operators, which
is at least consistent.
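
A minimal sketch of the atomic-value rule above, for illustration only. The function names and the tiny font table are hypothetical, and the comparison is plain equality (fontconfig's own string comparison rules, such as case handling, are not modelled here).

```python
def listing_contains(font_values, pattern_values):
    """Return True when every pattern value equals some font value."""
    return all(p in font_values for p in pattern_values)


def list_fonts(fonts, family_pattern):
    """Names of fonts whose family values satisfy the pattern."""
    return [name for name, families in fonts.items()
            if listing_contains(families, family_pattern)]


# Hypothetical font database: font name -> list of family values.
FONTS = {
    "Courier": ["courier"],
    "Courier 10 Pitch": ["courier 10 pitch"],
    "Times": ["times"],
}

# Exact match only: 'courier 10 pitch' is not listed for 'courier'.
print(list_fonts(FONTS, ["courier"]))           # ['Courier']
# No single font carries both family names, so nothing is listed.
print(list_fonts(FONTS, ["times", "courier"]))  # []
```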
charset:
The font contains the listed Unicode codepoints; in other words, the charset
provided by the font 'contains' all of the glyphs requested by the application.
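
As a rough illustration, charset containment is just a subset test. The sketch below models a charset as a Python set of codepoints, which is an assumption made for brevity and not how fontconfig stores charsets; `charset_contains` is a hypothetical name.

```python
def charset_contains(font_charset, requested):
    """True when every requested codepoint is present in the font."""
    return set(requested) <= set(font_charset)


# Hypothetical font covering only lowercase ASCII letters.
latin_font = {ord(c) for c in "abcdefghijklmnopqrstuvwxyz"}

print(charset_contains(latin_font, {ord("a"), ord("z")}))  # True
print(charset_contains(latin_font, {ord("a"), 0x4E2D}))    # False
```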
lang:
(Remember that 'lang' is a composite value consisting of a language value and
a territory value. The list of lang values in a font is computed from
Unicode coverage ranges based on orthographies. Except for Chinese, all of
these coverage ranges are (currently) associated only with a language and not
a territory. Chinese is (currently) split into three territory groups
(mainland China and Singapore, Hong Kong, Taiwan and Macau). So, most
language comparisons will be done with a language/territory pair supplied by
the application (often from the current locale) against fonts which know
only languages and not territories. However, applications will also provide
only languages at times to be matched against fonts which have languages and
territories.)
The font supports all of the langs requested by the application. I think
this means that the font 'contains' all of the langs requested by the
application (remember, we're talking about LISTING here). Now, the tricky
part is defining what 'support' means for a specific lang entry. When
the application provides a language/territory pair, then the font must
either provide a matching language/territory pair, or a bare language entry.
When the application provides a bare language, the font must either provide
a matching bare language entry or a language/territory pair with *any*
territory:
application    font     "supports"
-----------    ----     ----------
zh             zh_cn    YES
zh_tw          zh_cn    NO
en_gb          en       YES
en             en       YES
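
The "supports" rule and the table above can be sketched as follows. This is only an illustration: `split_lang`, `font_supports` and `font_lists` are hypothetical names, and lang tags are assumed to be simple `ll` or `ll_xx` strings.

```python
def split_lang(tag):
    """Split 'll' or 'll_xx' into (language, territory or None)."""
    lang, _, territory = tag.lower().replace("-", "_").partition("_")
    return lang, territory or None


def font_supports(app_tag, font_tag):
    """Apply the proposed LISTING rule to one application/font lang pair."""
    app_lang, app_terr = split_lang(app_tag)
    font_lang, font_terr = split_lang(font_tag)
    if app_lang != font_lang:
        return False
    # A bare language on either side matches any territory.
    if app_terr is None or font_terr is None:
        return True
    return app_terr == font_terr


def font_lists(requested_tags, font_tags):
    """A font is listed only if it supports every requested lang."""
    return all(any(font_supports(a, f) for f in font_tags)
               for a in requested_tags)


# Reproduce the table rows: (application, font).
for app, font in [("zh", "zh_cn"), ("zh_tw", "zh_cn"),
                  ("en_gb", "en"), ("en", "en")]:
    print(app, font, "YES" if font_supports(app, font) else "NO")
```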
MATCHING
The LISTING algorithm is designed to sharply restrict the set of provided
fonts; an empty list is often the result of overspecified patterns; that
matches the expected usage of providing precise information to users about
what actual fonts are available, rather than what font will be used when a
specific pattern is matched. In contrast, MATCHING is designed to always
provide a font, and in fact to provide a score measuring how accurate that
match is so that the set of available fonts can be sorted by this metric
and returned to the application.
When matching fonts, we're not using the boolean 'contains' operators, but
rather measuring distance from the pattern to the font (in CS terms, LISTING
is a constraint satisfaction problem while MATCHING is a constraint
optimization problem).
string, boolean:
Distance in these objects is measured with only two values -- matching and
nonmatching -- matching strings or booleans have distance 0 while
mismatching values have distance 1.
number:
Distance between two numbers is just the absolute value of their difference
(the obvious value). This is used for things like weight and slant; the
numeric values for those constants were carefully chosen to prefer reasonable
substitutions (italic and oblique are closer together than either is to
roman).
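
A small sketch of the two simplest distance measures. The function names are hypothetical; the slant values are the ones I understand fontconfig to use for its FC_SLANT_* constants (roman 0, italic 100, oblique 110), included only to show why italic and oblique substitute for each other before roman does.

```python
# Slant values as assumed from fontconfig's FC_SLANT_* constants.
ROMAN, ITALIC, OBLIQUE = 0, 100, 110


def atomic_distance(pattern_value, font_value):
    """Strings and booleans: 0 when equal, 1 otherwise."""
    return 0 if pattern_value == font_value else 1


def number_distance(pattern_value, font_value):
    """Numbers: absolute value of the difference."""
    return abs(pattern_value - font_value)


print(atomic_distance("times", "times"))   # 0
print(atomic_distance("times", "courier")) # 1
print(number_distance(ITALIC, OBLIQUE))    # 10
print(number_distance(ITALIC, ROMAN))      # 100
```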
charset:
Distance between two charsets is the count of characters requested by
the pattern but not provided by the font. This means that a font which
fully covers the requested characters has distance '0'.
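
Sketched with Python sets standing in for charsets (an assumption for illustration; `charset_distance` is a hypothetical name), the distance is the size of the set difference:

```python
def charset_distance(requested, font_charset):
    """Count codepoints the pattern requests that the font lacks."""
    return len(set(requested) - set(font_charset))


requested = {ord("a"), ord("b"), 0x4E2D}
font_charset = {ord(c) for c in "abc"}

print(charset_distance(requested, font_charset))       # 1
print(charset_distance({ord("a")}, font_charset))      # 0
```

Note the measure is one-sided: extra characters in the font that the pattern did not ask for do not add to the distance.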
lang:
Distance has three values:
0: Pattern and font have equal language/country,
or pattern has only language and font has language with
any country.
1: Pattern and font have equal language and different
country (zh_CN vs zh_TW).
2: Pattern and font have different language.
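
The three-valued lang distance might be sketched like this. The names are hypothetical. One case is not spelled out in the list above: a pattern with a territory against a font with only a bare language. The sketch assumes distance 0 there, by analogy with the LISTING rule; that is my assumption, not something the list states.

```python
def split_lang(tag):
    """Split 'll' or 'll_XX' into (language, territory or None)."""
    lang, _, territory = tag.lower().replace("-", "_").partition("_")
    return lang, territory or None


def lang_distance(pattern_tag, font_tag):
    """Return 0, 1 or 2 following the three cases listed above."""
    p_lang, p_terr = split_lang(pattern_tag)
    f_lang, f_terr = split_lang(font_tag)
    if p_lang != f_lang:
        return 2
    # Assumption: a bare language on either side counts as a full match.
    if p_terr is None or f_terr is None or p_terr == f_terr:
        return 0
    return 1


# Sort hypothetical fonts (name -> lang) by distance from the pattern lang.
fonts = {"A": "fr", "B": "zh_TW", "C": "zh_CN"}
ranked = sorted(fonts, key=lambda name: lang_distance("zh_CN", fonts[name]))
print(ranked)  # ['C', 'B', 'A']
```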
EDITING
The EDITING algorithm needs a method for matching patterns for each edit
operation; this is another constraint satisfaction problem as the edit rules
are either applied or not applied.
Match rules in edit instructions can use many different operators to
constrain pattern selection:
eq
not_eq
less
less_eq
more
more_eq
contains
not_contains
Each of these operators behaves differently for each datatype. For
datatypes which aren't ordered, I've defined the ordered operators to always
return false.
string:
I think these should be treated as unordered objects so that collation
isn't visible to the user. The remaining question is whether the 'contains'
operator should be used to detect substring presence. The LISTING
operation above should not do this as the operator is not selectable, but
allowing 'contains' to do substring detection in an EDITING context means
that LISTING won't use Contains, but rather some Contains-like analog which
is actually Equal for strings. Hmm. Permitting Contains for EDITING would
probably be useful, especially for FC_STYLE pattern elements.
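
To make the split concrete, here is a sketch of the two string behaviours side by side. Both function names are hypothetical, and case handling is ignored for simplicity.

```python
def listing_string_contains(font_value, pattern_value):
    """LISTING: the Contains-like analog is plain equality for strings."""
    return font_value == pattern_value


def editing_string_contains(pattern_value, rule_value):
    """EDITING: Contains does substring detection."""
    return rule_value in pattern_value


# LISTING: 'courier' does not pull in 'courier 10 pitch'.
print(listing_string_contains("courier 10 pitch", "courier"))  # False

# EDITING: a rule testing the style element for 'Italic' fires
# on a compound style string.
print(editing_string_contains("Bold Italic", "Italic"))        # True
```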
boolean, number:
These have obvious semantics for all of the operators if
contains/not_contains are allowed to be synonyms for eq/not_eq.
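
A sketch of the operator table for numbers, with contains/not_contains as synonyms for eq/not_eq, plus the rule that ordered operators return false for unordered datatypes. The `evaluate` function and its `ordered` flag are hypothetical, used only to illustrate the idea.

```python
import operator

# Operator names as listed above; contains/not_contains alias eq/not_eq.
OPS = {
    "eq": operator.eq, "not_eq": operator.ne,
    "less": operator.lt, "less_eq": operator.le,
    "more": operator.gt, "more_eq": operator.ge,
    "contains": operator.eq, "not_contains": operator.ne,
}
ORDERED_OPS = {"less", "less_eq", "more", "more_eq"}


def evaluate(op, pattern_value, rule_value, ordered=True):
    """Apply a match operator; ordered ops are False on unordered types."""
    if op in ORDERED_OPS and not ordered:
        return False
    return OPS[op](pattern_value, rule_value)


print(evaluate("more", 200, 100))                         # True
print(evaluate("contains", 100, 100))                     # True
print(evaluate("less", "bold", "light", ordered=False))   # False
```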
charset, lang:
I think the semantics described above for LISTING should apply here.
PROPOSED CHANGES
I believe the only changes necessary to implement these semantics are:
1) Use a Contains-alike operator for LISTING which does exact
matching for strings, permit Contains for EDITING to do
substring matching
2) Change lang Contains semantics to make ll_xx contain ll and
ll contain ll_xx (currently, I believe ll_xx does not contain ll)