Subject: Hash compatibility: Unicode considerations
Date: Sat, 20 Apr 2002 10:29:55 -0400
From: Michel Gallant
Organization: Bell Sympatico
Newsgroups: microsoft.public.platformsdk.security,comp.lang.java.security

Obtaining a hash (digest) of textual data can introduce compatibility problems between different tools, depending on how the text is converted to bytes before being passed to the hash algorithm. For example, text represented internally as Unicode (as in JScript) might be passed to the hash algorithm as a little-endian Unicode byte encoding. This, of course, shows up as differences in signed hashes and in PKCS#7-encoded messages.

To help troubleshoot and resolve some of these issues, I have updated the MD5/SHA-1 signed Java applet calculator with an option to show the UnicodeLittleUnmarked hash value (mainly of interest on Win32/Intel) as well as the hash of the usual default character-to-byte encoding (typically UTF-8):

http://home.istar.ca/~neutron/messagedigest/

Some examples:

In Java, an internally represented String object (Unicode), when converted to a byte array with string.getBytes(), is encoded by default using the platform's default charset rather than a Unicode (UTF-16) byte encoding.

The CAPICOM 2 hashing examples (e.g. CHashData.vbs) use the little-endian Unicode byte representation for the HashedData.Hash content.

The .NET SDK makes it easy to control the encoding of the byte data passed to the hash algorithm, for example (the first two lines are alternative encodings of the same string):

  Byte[] data2hash = (new UTF8Encoding()).GetBytes(s);
  Byte[] data2hash = (new UnicodeEncoding()).GetBytes(s);
  byte[] hashvalue2 = (new MD5CryptoServiceProvider()).ComputeHash(data2hash);

Sometimes examples that hash text data explicitly append a null byte to the data to be hashed (for example, the MS PSDK CryptoAPI signature and hash demos), which of course changes the hash value and any subsequent signatures.

A short Java sketch tying these cases together is appended after the signature.

- Mitch Gallant
  http://home.istar.ca/~neutron/wsh
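
P.S. Here is a minimal, self-contained Java sketch (my own illustration, not code from the applet) that hashes the same string under the platform default charset, UTF-8, and UnicodeLittleUnmarked (UTF-16LE with no byte-order mark), plus a fourth variant with a single null byte appended. The class name and sample string are arbitrary placeholders.

  import java.security.MessageDigest;

  public class HashEncodingDemo {

      // Render a byte array as lowercase hex.
      private static String toHex(byte[] bytes) {
          StringBuilder sb = new StringBuilder();
          for (byte b : bytes) sb.append(String.format("%02x", b & 0xff));
          return sb.toString();
      }

      // MD5 digest of the given bytes, as a hex string.
      private static String md5Hex(byte[] data) throws Exception {
          return toHex(MessageDigest.getInstance("MD5").digest(data));
      }

      public static void main(String[] args) throws Exception {
          String s = "Hash compatibility";   // arbitrary sample text

          byte[] defaultBytes = s.getBytes();                        // platform default charset
          byte[] utf8Bytes    = s.getBytes("UTF-8");
          byte[] utf16le      = s.getBytes("UnicodeLittleUnmarked"); // UTF-16LE, no byte-order mark

          // Mimic samples that also hash a terminating null byte:
          // copy the UTF-8 bytes and leave one extra zero byte at the end.
          byte[] utf8WithNull = new byte[utf8Bytes.length + 1];
          System.arraycopy(utf8Bytes, 0, utf8WithNull, 0, utf8Bytes.length);

          System.out.println("default charset       : " + md5Hex(defaultBytes));
          System.out.println("UTF-8                 : " + md5Hex(utf8Bytes));
          System.out.println("UnicodeLittleUnmarked : " + md5Hex(utf16le));
          System.out.println("UTF-8 + null byte     : " + md5Hex(utf8WithNull));
      }
  }

The UTF-8, UnicodeLittleUnmarked, and null-terminated variants each produce a different digest; the platform-default line matches the UTF-8 line only when the default charset is ASCII-compatible and the text is plain ASCII. Comparing the UTF-8 and UnicodeLittleUnmarked lines of output against the applet's two displayed values is a quick way to tell which byte encoding a given tool is actually hashing.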