There is an open issue regarding how DICOM should specify the use of a pass phrase as part of the addition of pass phrase based media encryption. This post discusses the issues and my current recommendations.
Meta-discussion (DICOM Philosophy)
DICOM provides a complete interoperability specification. It does not leave out critical components. But DICOM does not want to re-invent wheels. For example, DICOM devotes several pages to specifying precisely how TCP/IP is used, and several more pages to how DICOM conformance statements will document the use of network options.
There are issues surrounding the pass phrases for media encryption that need some degree of specification.
Scope and Assumptions
I will assume that the users are able to convey the pass phrase somehow, most likely orally or in some printed form. They can deal with issues of proper spelling, capitalization, writing system, etc. An example of writing system selection is conveying whether the hanzi are in traditional or simplified form. All of these can affect the use of a pass phrase as an encryption key. Users will understand this and be able to convey this kind of information.
The problem arises because the RFC specifies only how the byte sequence of the pass phrase is processed to create the encryption key. The RFC does not specify the relationship between that byte sequence and the user-conveyed pass phrase.
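To make the gap concrete, here is a minimal sketch of what happens when two systems feed different byte sequences for the same pass phrase into a key derivation. The post does not name the RFC, so PBKDF2-HMAC-SHA1 from the Python standard library stands in purely as an assumption. The helper name `derive_key`, the salt, and the iteration count are hypothetical.

```python
import hashlib

def derive_key(passphrase_bytes: bytes, salt: bytes) -> bytes:
    # Assumption: PBKDF2-HMAC-SHA1 stands in for whatever derivation the RFC defines.
    return hashlib.pbkdf2_hmac("sha1", passphrase_bytes, salt, 1000, dklen=16)

phrase = "caf\u00e9"    # "café", as the user would say or write it
salt = b"example-salt"  # hypothetical fixed salt, used only for this comparison

# Two systems that agree on the pass phrase but not on its byte encoding.
key_utf8 = derive_key(phrase.encode("utf-8"), salt)
key_latin1 = derive_key(phrase.encode("latin-1"), salt)

print(key_utf8.hex())
print(key_latin1.hex())
print(key_utf8 == key_latin1)
```

The users agreed on the pass phrase, yet the two systems end up with different keys because nothing told them which bytes to use.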
I believe that DICOM must specify this relationship and might give guidance regarding some of the operational pass phrase issues.
Issues
Encoding Systems
In China, computer systems use Big5, GB2312, GBK, GB18030, and Unicode. Internally, the Unicode systems might use UTF-8, UTF-16, UTF-32, UCS-2, or UCS-4 encoding. There are similar internal variants for GB18030. So there are more than ten possible byte sequences that might be used internally to represent the exact same pass phrase. With the exception of GB2312, GBK, and GB18030 using UTF-8, these will be different for a pass phrase using normal Chinese and Latin characters.
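A short sketch of the effect, using only Python's built-in codecs. The sample phrase is my own choice: two hanzi that exist in both Big5 and GB2312, followed by Latin letters. The byte-order variants of UTF-16 and UTF-32 are listed separately because they also yield different byte sequences.

```python
phrase = "\u4e2d\u6587abc"  # the hanzi for "Chinese language" followed by "abc"

encodings = [
    "big5", "gb2312", "gbk", "gb18030",
    "utf-8", "utf-16-le", "utf-16-be", "utf-32-le", "utf-32-be",
]

# Encode the identical pass phrase under each candidate internal encoding.
encoded = {name: phrase.encode(name) for name in encodings}

for name, data in encoded.items():
    print(f"{name:10} {data.hex()}")

# Count how many different byte sequences the one pass phrase produced.
distinct = set(encoded.values())
print(len(distinct), "distinct byte sequences")
```

For this particular phrase the three GB encodings happen to agree with each other, while Big5 and every Unicode encoding form disagree with them and with one another.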
Similar issues arise around the world with the mixture of ISO 2022 and five possible Unicode encodings. The only major world segment where this is a lesser issue is the English-speaking countries. The Latin alphabet is encoded the same way for ISO IR-6 in ISO 2022, ASCII, Big5, GB2312, etc. The issue of Unicode internal coding does remain a potential problem for the Latin alphabet.
Unicode (and GB18030) specific issues
There are some further issues that are unique to Unicode and GB18030. GB18030 inherits these issues from Unicode, so for the rest of this I will just refer to Unicode terminology.
Composing Characters and related issues
Many languages have accented and modified characters. Unicode includes some code points for pre-composed characters such as an accented "e". These characters can also be represented by two composing characters: an "e" and an accent. These will have two different byte sequences. The Unicode report TR-15 discusses the issues of composing characters in much more detail. Composing characters matter for string comparison, hash codes, and other computer functions.
The W3C, IHE, and others have recommended the use of Normalization Form C (NFC) from TR-15. DICOM will need to specify the normalization in order to get consistent byte sequences for pass phrases. NFC is the appropriate form to use.
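The accented "e" case can be sketched with Python's standard `unicodedata` module. The variable names are illustrative only.

```python
import unicodedata

precomposed = "\u00e9"  # accented e as one pre-composed code point
decomposed = "e\u0301"  # plain e followed by a combining acute accent

# The two forms display identically but are different byte sequences.
print(precomposed.encode("utf-8").hex())
print(decomposed.encode("utf-8").hex())

# After NFC normalization both collapse to the same code point sequence.
nfc_a = unicodedata.normalize("NFC", precomposed)
nfc_b = unicodedata.normalize("NFC", decomposed)
print(nfc_a == nfc_b)
print(nfc_a.encode("utf-8") == nfc_b.encode("utf-8"))
```

Without the normalization step, which of the two byte sequences reaches the encryption algorithm depends on the keyboard driver or input method, which the user cannot see.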
Composing Languages
Some languages, like Thai, have no pre-composed character set available. For this kind of language there are language-specific rules regarding the proper ordering of the code points used to compose the final representation.
Multi-lingual use
Many media exchanges will be among systems that all support the same language; but, there will also be exchanges that cross language boundaries. Issues that will arise when this happens include:
- homotypes - One example is the Serbian "dot-less i". This is a different code point than the composing dot-less i. It is nearly impossible to tell these two apart visually, and in many fonts the two are identical. A system configured for Serbian will use the Serbian character. A typical configuration for other European languages is to use the composing dot-less i. Data exchange between Serbian and non-Serbian systems will fail mysteriously. It is not reasonable for users to recognize this kind of problem.
Many of the punctuation marks, space characters, dashes, etc. are also homotypes.
- Ad hoc solutions - Enterprising users will try to use available alternative data entry systems for unsupported character sets. For example, Greek may be entered by using the local equation support. But some of the mathematical characters that look like the Greek characters are different code points. These are not true homotypes because the characters do look somewhat different. This difference is subtle and is likely not noticed by the users.
- Complete failure - There will be many system-language combinations that simply do not work. It is not practical to enter any of the composing writing systems (like Thai) on a system that lacks data entry support for that language.
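A rough illustration of why normalization alone does not rescue the look-alike cases in the list above. The character pairs are my own examples, not ones taken from a DICOM document: a Latin and a Cyrillic "a", two kinds of hyphen, and the Greek capital sigma next to the mathematical summation sign.

```python
import unicodedata

# Pairs of characters that look alike but are different code points.
pairs = [
    ("a", "\u0430"),       # Latin small a vs Cyrillic small a
    ("-", "\u2010"),       # hyphen-minus vs hyphen
    ("\u03a3", "\u2211"),  # Greek capital sigma vs n-ary summation sign
]

results = []
for left, right in pairs:
    # NFC only merges canonically equivalent sequences, not look-alikes.
    same_after_nfc = (
        unicodedata.normalize("NFC", left) == unicodedata.normalize("NFC", right)
    )
    results.append(same_after_nfc)
    print(unicodedata.name(left), "|", unicodedata.name(right), "|", same_after_nfc)
```

Every pair stays distinct after NFC, so a pass phrase containing one member of a pair will not match one typed with the other.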
My Recommendations
It is not reasonable to expect users to know how their characters are encoded internally. Only the software developers know what internal coding system is used and what byte sequence is provided to the encryption algorithm.
DICOM must specify the byte encoding algorithm to be used for encryption. I would require that it be Unicode text in Normalization Form C (NFC), encoded using a minimum-length UTF-8 encoding. This is regardless of the local system's internal encoding preferences or language support.
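As a sketch of what this requirement would mean for an implementer, assuming the pass phrase arrives either as text or as bytes in some local encoding: the function names `canonical_passphrase_bytes` and `from_local_bytes` are hypothetical, not taken from any DICOM toolkit.

```python
import unicodedata

def canonical_passphrase_bytes(text: str) -> bytes:
    # NFC first, then UTF-8. Python's UTF-8 encoder emits only the shortest form.
    return unicodedata.normalize("NFC", text).encode("utf-8")

def from_local_bytes(raw: bytes, local_encoding: str) -> bytes:
    # Convert a pass phrase held in a local encoding to the canonical bytes.
    return canonical_passphrase_bytes(raw.decode(local_encoding))

phrase = "\u4e2d\u6587abc"  # sample pass phrase: two hanzi followed by "abc"

# The same pass phrase as three differently configured systems might hold it.
from_big5 = from_local_bytes(phrase.encode("big5"), "big5")
from_gb = from_local_bytes(phrase.encode("gb18030"), "gb18030")
from_utf16 = from_local_bytes(phrase.encode("utf-16-le"), "utf-16-le")

print(from_big5.hex())
print(from_big5 == from_gb == from_utf16)
```

Each system keeps its own internal encoding and user interface, and converts only at the point where the bytes are handed to the key derivation.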
We need proper wording to explain why this matters and that this is a requirement for defining the byte sequence used to generate the encryption key, not a requirement on the user interface.
The multi-lingual issues, and the need to support archival uses, also drive the need for an option that does constrain the user interface somewhat. All of the encoding systems support the Latin alphabet (ISO IR-6). DICOM has mandated that, at a minimum, any DICOM conformant system must support the Latin alphabet. (For those not accustomed to the jargon, ASCII is an encoding of the Latin alphabet.)
DICOM should specify a pass phrase interface option that restricts the pass phrase to use only characters in the Latin alphabet. This option can then be used when describing a particular piece of media. The name "International Latin pass phrase" could be used to describe a software capability as well. The users can then specify that they want media with an "International Latin pass phrase" when they need to ensure that any DICOM compliant system can process the encrypted media.
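A user interface offering this option might check input along the following lines. This is only a sketch: it assumes that "Latin alphabet (ISO IR-6)" can be approximated as the space character plus the printable ASCII characters, and the function name `is_international_latin` is hypothetical.

```python
def is_international_latin(passphrase: str) -> bool:
    # Assumption: allowed characters are space through tilde (0x20 to 0x7E).
    return bool(passphrase) and all(0x20 <= ord(ch) <= 0x7E for ch in passphrase)

print(is_international_latin("Correct Horse 42!"))
print(is_international_latin("caf\u00e9"))      # contains an accented e
print(is_international_latin("p\u0430ssword"))  # contains a Cyrillic look-alike a
```

A check like this would also reject the look-alike characters discussed earlier, since they fall outside the allowed range.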