TinyTM - Fuzzy Matching
Challenges
Fuzzy matching is the technique that compares paragraphs in the source
text with the translated segments in the TinyTM database. Fuzzy matching
is probably the most complex part of a translation memory:
- Fuzzy Matching deals with Natural Language
Fuzzy matching touches the area of Natural Language Processing (NLP) and
the inherent complexity of human language.
- Large TM Databases
The main value of a TM lies in the number of segments it contains - its size.
However, large databases tend to lead to slow response times.
- Speed!
TMs were created to save translators time. A slow TM might actually
slow a translator down, so fast response times are an essential characteristic
of any TM.
In order to get all three - complex comparisons with high speed on large
databases - the designers of a TM need to employ complex algorithms from
a number of advanced information technology areas. This section briefly
presents the techniques employed in TinyTM.
TinyTM as an Academic "Playing Ground"
The current V0.1 TinyTM fuzzy matching implementation has not yet been optimized
for large databases and doesn't yet employ all of the techniques below.
However, its foundations are strong, because the underlying database (currently
PostgreSQL) is user-extensible and already provides implementations
of the main techniques (see below).
We hope that universities and other academic institutions will discover
TinyTM as a convenient "playing ground" for innovative algorithms and
will contribute to the backend code.
Techniques Employed
- Levenshtein Editing Distance
TinyTM uses a variant of "Levenshtein
distance" as the main measure for the "% match" value between a source
segment and a segment from the TM.
TinyTM actually uses a "recursive" variant of the Levenshtein distance,
because the original algorithm gets very slow when comparing large segments.
Our "recursive Levenshtein" takes advantage of a "lower boundary" mathematical
characteristic of the Levenshtein distance and recursively breaks the
segments down into parts until their size is below a certain limit.
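As a point of reference, the classic (non-recursive) Levenshtein distance and a
"% match" value derived from it can be sketched in Python as follows. This is only
an illustrative sketch, not TinyTM's actual code: the function names and the
"% match" formula (distance normalized by the longer segment) are assumptions,
and the length-difference bound shown is one well-known lower bound that is not
necessarily the one TinyTM relies on.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming Levenshtein distance.

    Counts the minimum number of single-character insertions,
    deletions and substitutions needed to turn `a` into `b`.
    """
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def percent_match(source: str, candidate: str) -> float:
    """Assumed '% match' formula: 100 * (1 - distance / longer length)."""
    longest = max(len(source), len(candidate))
    if longest == 0:
        return 100.0  # two empty segments are treated as identical
    return 100.0 * (1.0 - levenshtein(source, candidate) / longest)


def length_lower_bound(a: str, b: str) -> int:
    """The distance can never be smaller than the difference in lengths."""
    return abs(len(a) - len(b))
```

A cheap lower bound like this lets a TM discard candidates whose best possible
"% match" is already too low before running the full, quadratic-time comparison.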
- Tagging and "Folksonomies"
TinyTM allows users to "tag" segments. Tagging represents a kind
of lightweight semantic markup and makes it possible to select parts of a TM
based on user characteristics. This allows user groups to separate their
translations from those of other groups (if needed) and to separate translations
in one context from those in other contexts.
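The idea of selecting part of a TM by tag can be sketched in Python as follows.
This is a minimal illustration only: the Segment class, the select_by_tags
function and the tag names are hypothetical and do not reflect TinyTM's actual
database schema.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Segment:
    """Hypothetical TM entry: a source/target pair plus free-form tags."""
    source: str
    target: str
    tags: frozenset = field(default_factory=frozenset)


def select_by_tags(segments, required_tags):
    """Return the segments that carry every tag in `required_tags`."""
    required = frozenset(required_tags)
    return [s for s in segments if required <= s.tags]


# Example: two groups keep their translations of the same source apart.
memory = [
    Segment("Hello", "Hallo", frozenset({"customer:acme", "domain:legal"})),
    Segment("Hello", "Servus", frozenset({"customer:other"})),
]
acme_only = select_by_tags(memory, {"customer:acme"})
```

Fuzzy matching would then run only against the selected subset, which also
reduces the number of comparisons.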
- Trigram Fuzzy String Indexing
Trigram string similarity is based on the number of trigrams (sequences of
three consecutive characters) that two strings have in common. PostgreSQL
already implements a trigram similarity function in its pg_trgm module, which
also provides generalized inverted index (GIN) support for trigrams. We will
use trigram indexing in later versions of TinyTM.
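To make the trigram measure concrete, here is a small Python sketch. It is a
simplified illustration and not PostgreSQL's implementation: the function names
are hypothetical, the whole string is padded as a single unit (two leading
spaces, one trailing space), and similarity is computed as shared trigrams
divided by all distinct trigrams.

```python
def trigrams(text: str) -> set:
    """Return the set of three-character sequences in the padded, lowercased text."""
    padded = "  " + text.lower() + " "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def trigram_similarity(a: str, b: str) -> float:
    """Shared trigrams divided by all distinct trigrams (0.0 to 1.0)."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)


def trigram_distance(a: str, b: str) -> float:
    """Distance is simply one minus the similarity."""
    return 1.0 - trigram_similarity(a, b)
```

Because each segment is reduced to a set of trigrams, an inverted index from
trigram to segments can find likely candidates without comparing the source
text against every segment in the TM.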