TinyTM - Fuzzy Matching

Challenges

Fuzzy matching is the technique that compares paragraphs in the source text with the translated segments in the TinyTM database. Fuzzy matching is probably the most complex part of a translation memory:

Fuzzy Matching deals with Natural Language
Fuzzy matching touches the area of Natural Language Processing (NLP) and the inherent complexity of human language.
Large TM Databases
The main value of a TM lies in the number of segments it contains, that is, its size. However, large databases tend to lead to slow response times.
Speed!
TMs were created to save translators time. A slow TM might actually slow a translator down, so fast response times are an essential characteristic of any TM.

In order to get all three - complex comparisons with high speed on large databases - the designers of a TM need to employ complex algorithms from a number of advanced information technology areas. This section briefly presents the techniques employed in TinyTM.

TinyTM as an Academic "Playing Ground"

The current V0.1 TinyTM fuzzy matching implementation has not yet been optimized for large databases and doesn't yet employ all of the techniques below. However, its foundations are strong, because the underlying database (currently PostgreSQL) is user-extensible and already provides implementations of the main techniques (see below).

We hope that universities and other academic institutions will discover TinyTM as a convenient "playing ground" for innovative algorithms and will contribute to the backend code.

Techniques Employed

Levenshtein Editing Distance
TinyTM uses a variant of "Levenshtein distance" as the main measure for the "% match" value between a source segment and a segment from the TM.
TinyTM actually uses a "recursive" variant of the Levenshtein distance, because the original algorithm gets very slow when comparing large segments. Our "recursive Levenshtein" takes advantage of a "lower boundary" mathematical characteristic of the Levenshtein distance and recursively breaks the segments down into parts until their size is below a certain limit.
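To make the "% match" idea concrete, here is a minimal sketch of the classic (non-recursive) Levenshtein distance in Python, not TinyTM's own code. The percentage formula in `match_percent` and the length-based pre-filter in `could_reach` are assumptions made for illustration. The pre-filter uses one well-known lower bound (the distance is never smaller than the difference in length), which may or may not be the bound the recursive variant relies on.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, O(len(a) * len(b)) time."""
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


def match_percent(source: str, candidate: str) -> float:
    """Assumed scoring: 100% for identical strings, 0% for no overlap."""
    longest = max(len(source), len(candidate))
    if longest == 0:
        return 100.0
    return 100.0 * (1 - levenshtein(source, candidate) / longest)


def could_reach(source: str, candidate: str, min_percent: float) -> bool:
    """Cheap pre-filter: the length difference is a lower bound on the distance."""
    longest = max(len(source), len(candidate))
    if longest == 0:
        return True
    best_possible = 100.0 * (1 - abs(len(source) - len(candidate)) / longest)
    return best_possible >= min_percent
```

A lookup could call `could_reach` first and compute the full distance only for candidates that pass, which avoids the expensive comparison for segments of very different length.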
Tagging and "Folksonomies"
TinyTM allows users to "tag" segments. Tagging is a kind of lightweight semantic markup that makes it possible to select parts of a TM based on user characteristics. This lets user groups separate their translations from those of other groups (if needed) and keep translations made in one context apart from those made in other contexts.
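As a rough illustration of tag-based selection, the following Python sketch keeps an in-memory inverted index from tags to segment ids. The `TagIndex` class and its methods are hypothetical names invented for this example; TinyTM itself indexes tags inside PostgreSQL, so this only shows the idea, not the actual implementation.

```python
from collections import defaultdict


class TagIndex:
    """Hypothetical in-memory index: tag -> ids of segments carrying that tag."""

    def __init__(self):
        self._by_tag = defaultdict(set)

    def add(self, segment_id: int, tags) -> None:
        # Tags are compared case-insensitively in this sketch (an assumption).
        for tag in tags:
            self._by_tag[tag.lower()].add(segment_id)

    def select(self, required_tags) -> set:
        """Return the ids of segments that carry every tag in required_tags."""
        id_sets = [self._by_tag.get(tag.lower(), set()) for tag in required_tags]
        if not id_sets:
            return set()
        return set.intersection(*id_sets)


index = TagIndex()
index.add(1, ["legal", "acme"])
index.add(2, ["legal"])
index.add(3, ["marketing", "acme"])
```

With this data, `index.select(["legal", "acme"])` yields only segment 1, so a user group could restrict fuzzy matching to its own context before any string comparison takes place.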
References:
- Wikipedia on Folksonomies
- PostgreSQL TSearch2 implementation (used to index tags)
- TinyTM Server source code
Trigram Fuzzy String Indexing
Trigram string distance is based on the number of trigrams (combinations of three consecutive letters) that two strings have in common. PostgreSQL already implements a trigram distance function. It also provides a generalized inverted index (GIN) on trigram vectors. We will use trigram indexing in later versions of TinyTM.
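The trigram idea can be sketched in a few lines of Python. This example is loosely modeled on how PostgreSQL's trigram module pads and compares words, but the padding rule, the word splitting and the function names are assumptions made here, so the values are not guaranteed to match what PostgreSQL returns.

```python
import re


def trigrams(text: str) -> set:
    """Collect the trigrams of each word, padded with two leading and one trailing space."""
    result = set()
    for word in re.findall(r"\w+", text.lower()):
        padded = f"  {word} "
        for i in range(len(padded) - 2):
            result.add(padded[i:i + 3])
    return result


def trigram_similarity(a: str, b: str) -> float:
    """Share of trigrams common to both strings (shared / union), from 0.0 to 1.0."""
    trigrams_a, trigrams_b = trigrams(a), trigrams(b)
    union = trigrams_a | trigrams_b
    if not union:
        return 0.0
    return len(trigrams_a & trigrams_b) / len(union)
```

Because the trigram sets can be precomputed and indexed, a similarity of this kind is useful for narrowing a large TM down to a short list of candidates, which can then be scored with the slower Levenshtein comparison.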
References:
- PostgreSQL Trigram Indexing
