TMX 2.0 comments

This wiki contains a log of comments received about the March 10, 2009 draft of TMX 2.0 (http://www.lisa.org/fileadmin/standards/tmx2_2009_03_09.html). Comments should be sent to Arle Lommel at arle@lisa.org.

See comments at http://lists.oasis-open.org/archives/xliff/200903/threads.html#00019

Discussion on XLIFF mailing list


Here are some general comments from the XLIFF committee:

  • Numerous requests to remove <sub> and represent it in another way in order to simplify markup
  • They would like to see <itag> recast in a way that makes it a node where the content is text rather than markup
  • There is considerable interest still in unifying the content markup approaches

From Oliver Christ (SDL)

Inclusion of segmentation rules

Often TMX files contain a mix of translation units generated in contexts with different, potentially conflicting segmentation rules, and the same TU may also be the result of multiple segmentation rule sets.

If SRX rules are to be included in the TM, one could argue that by the same token details about the filtering process should be included, as filter-level settings obviously affect the segmentation process significantly (e.g. whether HTML's <br/> should be treated as internal or external). Such settings may be user-specific, not application-specific.

TMs typically also contain a mixture of TUs from various sources (document formats, TMs, TM applications and versions). It is therefore hard, if not impossible, to specify a single segmentation rules set which would be valid for the whole TM as soon as the TM contains such legacy data.

It therefore does not seem very useful to include segmentation rules in a TMX file; they would be better placed in a translation package. Otherwise, a TM application would need to ensure that TUs generated by a conflicting set of segmentation rules cannot be imported, or the data in that TMX document would be inconsistent.

Tags

It may be useful to offer a simplified representation for tags in cases where the bracketing of paired tags in the TU is compliant with standard XML bracketing rules. This could simplify processing, but would require the introduction of a new element (similar to XLIFF’s <g>).
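
As a rough illustration of the condition described above, the following sketch checks whether the paired tags of a segment are well nested in the XML sense. The `(kind, pair_id)` tuple model and the function name are assumptions made for this example; they are not part of the TMX draft or of XLIFF.

```python
from typing import List, Tuple

# Each inline tag is modelled as (kind, pair_id), where kind is "open" or "close".
# This tuple model is an illustrative assumption, not a TMX or XLIFF structure.
Tag = Tuple[str, str]


def is_well_nested(tags: List[Tag]) -> bool:
    """Return True if the paired tags follow standard XML bracketing rules."""
    stack: List[str] = []
    for kind, pair_id in tags:
        if kind == "open":
            stack.append(pair_id)
        elif kind == "close":
            # A closing tag must match the most recently opened tag.
            if not stack or stack.pop() != pair_id:
                return False
        else:
            raise ValueError(f"unknown tag kind: {kind!r}")
    # Any tag still open at the end means the segment is not well nested.
    return not stack


# Properly nested: <1><2>...</2></1>
nested = [("open", "1"), ("open", "2"), ("close", "2"), ("close", "1")]
# Overlapping: <1><2>...</1></2>
overlapping = [("open", "1"), ("open", "2"), ("close", "1"), ("close", "2")]
```

Only segments for which such a check succeeds could use a simplified, element-style representation; overlapping pairs would still need the existing paired-code markup.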

TU Groups

TU groups seem to primarily serve the purpose of being able to partially recreate the original (bilingual) document segment order from a TMX file. However, the same TU typically occurs in several documents, or may be used in multiple locations in a document.

To resolve this, either the TUs need to be duplicated in the TMX file (and each having its own TU group information), or one would need a mechanism to associate a TU with its locations in multiple documents.

Further, TU group information is associated at the TU level. However, TUs can be multilingual (contain segments in more than two languages, or contain e.g. multiple target segments in the same language). Clearly, if a TU has a single source, but multiple target segments, the TU locations will differ for each target segment. This cannot be expressed in the current model, other than by duplicating the TU, which in turn would mandate a purely bilingual TU model at least on the TMX level. It seems that the current TU Group model and the idea of multilingual TUs are incompatible.

"Per-Document TM"

Inclusion of segmentation rules, TU groups, and the remarks in section 5 (TMX Compliance) indicate that TMX 2.0 is intended to become a per-document TM representation, which is not how most TMs are used today.

In particular, section 5 (Compliance) reads,

The tool XYZ supports TMX Export if the TMX document created by tool XYZ contains all the information required to re-create the translated document without loss of text, data or formatting.

The tool XYZ supports TMX Import if any TMX document containing all the information required to re-create the translated document (…), can be imported in tool XYZ and effectively be used to re-create the translated document without loss of text, data or formatting.

This seems to indicate that TMX 2.0 is intended to become a representation of the (textual portions of a) bilingual version of a document, i.e. a substitute for XLIFF. This is a significant change from (and restriction of) previous versions of the standard and does not reflect how TMX (or TMs) are used today.

More importantly, it seems to me that the compliance criteria significantly restrict the design and functionality of a TM application. In particular,

  • If the TM engine chooses not to disambiguate translations and only maintains a single translation per source segment,
  • or the TM engine allows users to interactively override segmentation during translation (i.e. join or split segments),
  • or the TM engine chooses not to emit or process TU Group information (which are optional),
  • or the TM engine chooses not to include the segmentation rules in the TMX file (which are optional),
  • or the TM engine does not use SRX for segmentation,

that TM engine cannot be TMX 2.0 compliant by definition, since the type of roundtripping required in the compliance criteria does not seem to be feasible. Compliance therefore seems to impose significant restrictions and requirements on the design and functionality of a TM application.

I also do not see how the application can recreate the translated document from the TMX and the monolingual source document alone unless the TMX document also contains filter-related information that controls how the input format is forward-converted (from the native input format into the bilingual format, e.g. XLIFF) and back-converted (from the bilingual format to the monolingual native target format). Clearly, information about the segmentation of the input and the TM data alone is not sufficient for roundtripping, as user settings (not only fixed application-level settings) influence the application’s behavior.

Conclusions

Overall, the TMX 2.0 proposal seems to be strongly influenced by an existing internal database and application design, and adjusted to match that design. The compliance requirement effectively obliges a compliant TM engine to include all information required for roundtripping in a TMX file (which means that some optional elements are no longer optional if the application is to be compliant). Compliance also requires a specific processing model, which significantly limits the way a TM application can work and effectively restricts the type of data included in a TMX file. Few, if any, of the proposed changes seem to be based on broader user feedback, and the weaknesses of TMX 1.4b (for example the lack of detailed semantics and the existing ambiguities) are not addressed. The other area that raises concerns is the overlap with XLIFF. There is no need to invent yet another bilingual document representation format, and instead of driving TMX towards an "XLIFF light", applications that need a per-document TM should use XLIFF directly for that purpose.


Michael Hinterseher (SPX)

What I couldn't find so far is any coverage of translation constraints in the TMX 2.0 standard.

It would be helpful to be able to send constraints along to the translator in TMX format. This is especially helpful if only small text fragments have to be translated.

Examples: maximum number of characters in the translation, category, notes, the translation memory to be used, or a reference (URL) to a binary with more information, such as a PDF.

This PDF could show the context of the document to be translated, including graphics and more.


Uma Umamaheswaran (umavs@ca.ibm.com)

In section 1.2 (character encoding), the 5th paragraph mentions the Private Use Area of Unicode, giving the range E000-F8FF.
I think the requirement stated here applies in general to the Private Use Planes 15 and 16 as well, even though today's implementations may be using only the BMP PUA.

So the range of private use should be extended to include .. U+F0000 to U+10FFFF.
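
A minimal sketch of a private-use test that covers the BMP area and both supplementary planes follows. The function and constant names are illustrative. Note that the comment above gives the extended span as U+F0000 to U+10FFFF; the sketch uses the per-plane ranges assigned by Unicode, which stop at U+FFFFD and U+10FFFD because the last two code points of every plane are noncharacters.

```python
# Private-use ranges as assigned by Unicode (inclusive bounds).
PRIVATE_USE_RANGES = (
    (0xE000, 0xF8FF),      # Private Use Area in the BMP
    (0xF0000, 0xFFFFD),    # Supplementary Private Use Area-A (plane 15)
    (0x100000, 0x10FFFD),  # Supplementary Private Use Area-B (plane 16)
)


def is_private_use(ch: str) -> bool:
    """Return True if the single character ch is a private-use code point."""
    cp = ord(ch)
    return any(low <= cp <= high for low, high in PRIVATE_USE_RANGES)
```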


March 12, 2009: from Anderson Wang
Dear Arle R. Lommel,

I noticed that the TMX 2.0 draft has been released and is very different from the previous TMX 2.0 draft and from TMX 1.4b.
Many elements have been removed from the new version (e.g. prop), but those elements are very popular and useful. Removing these widely used elements implies a great compatibility effort for legacy TMX-compliant software. We recommend keeping such elements unless LISA has a sound reason to remove them.

In addition, although CAT publishers can define their own elements or attributes to implement extensibility, a TMX application can safely ignore foreign elements or attributes present in a TMX document. If everybody does this, the TMX standard will actually harm interoperability among different CAT tools. This new standard seems to cause more trouble. Many translators will have to upgrade their TMX data, and many legacy TMX files would become invalid. All publishers (except for the composer, who is prepared) will need significant effort to maintain compatibility, which would have a negative impact on the adoption of the new standard.

We strongly recommend keeping maximal compatibility with TMX 1.4b in TMX 2.0 unless a change is really necessary.

Best regards,
Anderson


March 12, 2009 from Yves Savourel:

Indication about moving, deleting, etc. of tags

The TMX 2.0 draft does not have any provision for information indicating if an inline code can be deleted, moved, or duplicated.

One type of information that has been already discussed in TMX and XLIFF and even implemented by some for inline codes is one or more attributes to indicate when a code can be deleted or not, can be cloned or not, and can be moved out of sequence or not.

This type of info is useful when doing QA, when the translator is manually editing a segment, when composing a target based on various matches, or in many other scenarios.

I would expect to have that type of information in both XLIFF 2.0 and TMX 2.0, and obviously it would make sense to represent it the same way.
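
To make the QA scenario concrete, here is a small sketch of how such per-code flags might be consumed. The flag names (`can_delete`, `can_clone`, `can_move`), the `InlineCode` class, and the check itself are hypothetical; neither the TMX 2.0 draft nor XLIFF is claimed to define them in this form, and source code ids are assumed to be unique.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class InlineCode:
    """Hypothetical model of an inline code and its editing permissions."""
    code_id: str
    can_delete: bool = False
    can_clone: bool = False
    can_move: bool = False


def check_codes(source: List[InlineCode], target_ids: List[str]) -> List[str]:
    """Compare the code ids found in a target segment against the source codes."""
    problems: List[str] = []
    counts = Counter(target_ids)
    for code in source:
        n = counts.get(code.code_id, 0)
        if n == 0 and not code.can_delete:
            problems.append(f"code {code.code_id} was deleted")
        if n > 1 and not code.can_clone:
            problems.append(f"code {code.code_id} was duplicated")

    # Codes that may not move must keep their relative source order.
    fixed = [c.code_id for c in source if not c.can_move]
    fixed_set = set(fixed)
    seen: List[str] = []
    for tid in target_ids:
        if tid in fixed_set and tid not in seen:
            seen.append(tid)
    expected = [cid for cid in fixed if cid in seen]
    if seen != expected:
        problems.append("non-movable codes are out of sequence")
    return problems


source_codes = [
    InlineCode("1"),
    InlineCode("2", can_delete=True),
    InlineCode("3", can_move=True),
]
```

With `source_codes` as above, a target containing codes `1` and `3` passes, while a target with the order `2`, `1`, `3` is reported as out of sequence.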

Directionality

The TMX 2.0 draft does not have any information on how to represent directionality information.

For example, if there is a span of LTR text embedded inside an RTL segment, how should the change of directionality be indicated?

All inline codes allow for non-TMX attributes, so an its:dir attribute could be used, for example. But it would probably be useful to indicate this information in the specification.

The same remark applies to XLIFF 2.0.

PUA characters

In section 1.2 "Character Encoding" of the TMX 2.0 draft there is the following paragraph:

"In addition, if the source database or application generating a TMX file uses character codes in the Private Use Area of Unicode (code points U+E000-U+F8FF) it must convert those code points to their corresponding character entities in TMX files. For example, if a source document uses the "fft" ligature found in certain Adobe OpenType fonts at code point U+E097 in the Private Use Area, the corresponding TMX document would represent this character as &xE097;. This process is required since many text-processing tools do not support the PUA. Inclusion of such character entities in TMX files may necessitate additional negotiation between the creator and receiver of the file if such code points are to be properly interpreted. Such negotiations are outside the scope of the TMX standard and use of the PUA is discouraged when possible."

I believe the proper terminology is:
The construct &xE097; is a 'numeric character reference' not a 'character entity':

  • A character entity = <!ENTITY fft "&#xE097;">
  • A character entity reference = &fft;
  • A numeric character reference = &#xE097; (or &#57495;)

So there are the following errors:

  1. "character as &xE097;." is incorrect: It should be either "character as &#xE097;." or "character as &#57495;."
  2. "those code points to their corresponding character entities in TMX" is incorrect: It should be "those code points to their corresponding numeric character references in TMX".
  3. "Inclusion of such character entities in" should be "Inclusion of such numeric character references in".
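
The distinction can be checked mechanically. The sketch below, using only the Python standard library, produces the hexadecimal and decimal numeric character references for a code point and shows that the form printed in the draft (without the `#`) is not accepted by an XML parser when no such entity is declared. The helper names are illustrative.

```python
import xml.etree.ElementTree as ET


def to_ncr(ch: str, decimal: bool = False) -> str:
    """Return the numeric character reference for a single character."""
    cp = ord(ch)
    return f"&#{cp};" if decimal else f"&#x{cp:X};"


def escape_bmp_pua(text: str) -> str:
    """Replace BMP private-use characters (U+E000-U+F8FF) with hex NCRs."""
    return "".join(to_ncr(c) if 0xE000 <= ord(c) <= 0xF8FF else c for c in text)


def parses_as_xml(fragment: str) -> bool:
    """Return True if the fragment is accepted by the XML parser."""
    try:
        ET.fromstring(fragment)
        return True
    except ET.ParseError:
        return False


print(to_ncr("\ue097"))                # &#xE097;
print(to_ncr("\ue097", decimal=True))  # &#57495;
```

Here `parses_as_xml("<seg>&#xE097;</seg>")` succeeds, whereas `parses_as_xml("<seg>&xE097;</seg>")` fails because `&xE097;` is read as a reference to an undeclared entity named `xE097`.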

Overall Comment on this requirement:

I disagree strongly with this requirement. The only stated reason for it is "many text-processing tools do not support the PUA". That is the problem of the text-processing tool, not TMX.

The burden of having to check every character of TMX content for such characters (to know whether it needs to be escaped as an NCR) is disproportionate to the problem being solved. More importantly, TMX should not try to solve that problem. It should not impose burdensome requirements based on the shortcomings of some tools (which are not even TMX-specific tools).

TMX should just be a normal XML citizen. For example, many text-processing tools do not support the use of a BOM in UTF-8 files, but TMX, correctly following the Unicode standard, allows it.

I think PUA characters also need to be allowed without specific distinction in XLIFF 2.0, as they are currently.

Storing non-XML characters

TMX documents may have to store text that contains characters not allowed in an XML document (e.g. some control characters).

The draft of TMX 2.0 has no provision for working around this issue (which, in my experience, occurs a lot more often than the perceived problem with PUA characters).

While I do not have a solution to propose at this time, I believe it would be helpful to have one and it should be the same for both TMX 2.0 and XLIFF 2.0.
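
For reference, the characters in question can at least be detected. The sketch below tests text against the `Char` production of XML 1.0 and reports the offending positions; it only detects the problem and does not propose a representation for such characters. The function names are illustrative.

```python
from typing import List, Tuple


def is_xml10_char(ch: str) -> bool:
    """Return True if ch matches the Char production of XML 1.0."""
    cp = ord(ch)
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def find_non_xml_chars(text: str) -> List[Tuple[int, str]]:
    """Return (index, code point label) for each character XML 1.0 disallows."""
    return [
        (i, f"U+{ord(c):04X}")
        for i, c in enumerate(text)
        if not is_xml10_char(c)
    ]
```

For example, a form feed (U+000C) or a NUL inside extracted text would be reported, while tab, line feed, and carriage return would not.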

Character normalization

This is currently a hot topic in the W3C as more and more text is produced in languages that use multiple forms of the same character. As a format that deals with multilingual content, and a format designed for exchange, both XLIFF 2.0 and TMX 2.0 should probably have a common stated position on this issue (regardless what that position is).
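
The practical effect on matching is easy to show. In this sketch two strings that display identically compare as different until both are brought to the same normalization form; which form (if any) the formats should prescribe is exactly the open question. The `same_text` helper is illustrative.

```python
import unicodedata

composed = "\u00e9"      # é as a single precomposed code point
decomposed = "e\u0301"   # e followed by a combining acute accent


def same_text(a: str, b: str, form: str = "NFC") -> bool:
    """Compare two strings after applying the same Unicode normalization form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


# A naive comparison treats the two spellings as different segments.
naive_match = composed == decomposed
# Comparison after normalization treats them as the same text.
normalized_match = same_text(composed, decomposed)
```

A TM lookup that compares raw code points would miss an exact match between these two strings, which is why a stated position matters for an exchange format.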

Datatypes

The section on the datatype attribute includes a list of pre-defined values and equivalent MIME type values. I am wondering whether this attribute should simply take a MIME type value directly (and maybe have its name changed, but that is a lesser issue).

The same goes for XLIFF 2.0.

I see that among the pre-defined data types there is "xliff". That speaks volumes about the problem we have: if one generates a TMX document from an XLIFF document, it should use the datatype value specified in the XLIFF document in the equivalent datatype attribute, not "xliff"; otherwise we are losing information when going to TMX.

In order to do this both formats have to be using the same values: This is another example of the need to harmonize both formats into a coherent exchange system.

General comment on changes

In the TMX 2.0 proposal there are two main sets of changes:

The first set affects some elements outside <seg>. Some of these changes are good, some less good (IMO), but they generally bring new features and break compatibility only a little or not at all.

The second set is the new proposed content markup (<itag>). It does bring a massive compatibility break, but--and that is my main problem--does not bring any new feature: There is nothing you can do with <itag> that you cannot already do with the 1.4 content markup.

( I would even argue that it makes parsing somewhat more complicated. For example: now you have to query both the name of the element and its type attribute to know what kind of code it represents; before, in most cases, you could do this by just looking at its name. )

Maybe I am missing some important benefits, and then I would like to be enlightened. But so far I cannot see any (<sub> is still there, we still have some text nodes with real text and others with codes, etc.). It seems the proposed markup completely changes how the content is coded, but does not change any functionality.

I believe the content of both TMX and XLIFF is the same thing: an abstracted representation of extracted text with inline codes. It has the same purposes and the same requirements. In fact, when I read a TMX <seg> or an XLIFF <source> I use the same object to store their content, and I generate either format from that single type of object. This is a pretty strong indication that both are similar. And if they are, why do we need two different XML representations for it?

What that representation should be is a different question.

[[---> now comes the important part:

Establishing that uniqueness is important because it paves the way for an exchange format between translation tools at the text fragment level. As component-based and internet-driven technologies evolve, we need to make sure the tools of the future will be able to communicate as seamlessly as possible, not only by exchanging documents but also by exchanging small segments of information.

Many of the Web services, plugins, and other bricks that make up the tools being built today need to exchange data at the segment level, not at the file level. Whether these components identify terms, highlight spelling mistakes, provide TM matches, or MT guesses, they all, ultimately, need to access the same abstracted extracted text.

Having a single representation for TMX and XLIFF contents is not only logical, it is necessary to bring more interoperability between the tools being built today.

---]]

So Arle, I have an idea for OSCAR: Instead of creating a new content markup now, I would suggest:

a) If it seems important to have a new version of TMX published, to move to TMX 1.5 with some of the changes proposed in this draft, but without touching the content markup.

b) Then both groups can work together to come up with a common representation of the content markup in both formats.


Created by System User on Mar 12 3:37am. Updated by vivian cu on Mar 4 7:26pm. (14 revisions, 1,232 views)