Dec 31 2007

Yahoo! messenger archive file format

Tags: , , , , Rajiv @ 8:46 pm UTC

The first step to get my Yahoo! messenger (YMessenger) conversations into windows desktop search is to decode the conversations stored in YMessenger archive files.

If you enable message archiving, YMessenger saves all the conversations with your friends in C:\Program Files\Yahoo!\Messenger\Profiles\${userid}\Archive\Messages\ directory in files with the extension .dat. The menu option Contacts -> Message Archive shows all the archived conversations grouped by user.

Figuring out the format of this file was thrilling … but probably not as challenging as the codes faced by codebreakers in the real world. (Simon Sigh’s The code book did leave a lasting impression on me … as did his other books!)

If you consider the equation encoded-message = code(original-message), the real codebreakers have access to only the encoded message. They have to figure out the code and in the process figure out the original message. The biggest leverage I had was that I had to figure out only the codem given encoded-message and original-message.

For example, I could send myself the message “a” and look at the contents of the .dat file. Then I could send myself the message “aa” and look at the contents of the .dat file. Then send the message “b”, followed by “bb”, followed by “ab”. Looking at the changes to the .dat file after every step.

Each message (from your to your buddy or vice-versa) in the .dat file is represented by a Record. Every Record has a timestamp of when it was sent, whether it was from you or from your buddy and the formatted (bold, italic etc) message. Your profile name and the the name of your buddy can be derived from the name of the .dat file and the name of its parent directory respectively.

The format for a Record is:

  • The first int (four bytes) represents the number of seconds from Java epoch (Jan 1, 1970)
  • The second int … i don’t know what it is
  • The third int indicates that the message is from you to your buddy if it is zero or from your buddy to you if it is non-zero
  • The fourth int (msgLen) represents the length of the encoded formatted-message. There is no encoding/encryption till this point.
  • Next msgLen bytes represent the encoded formatted-message

The formatting of the message is indicated by some special tokens in the .dat file. These font-attribute tokens always start with 0x1B5B. So if you type the message “this is NOT acceptable”, the formatted message would be “this is {[0x1B5B][0x31]}NOT{[0x1B5B]x[0x31]} acceptable” (where {} indicate one token and [] indicates bytes shown in hex value instead of ASCII values). {[0x1B5B][0x31]} is a token that indicates begin bold and {[0x1B5B]x[0x31]} indicated end bold.

There are tokens for (begin and end of) bold, italic, underline. Then there are tokens that mark the begining of custom and standard(/palette) colors. The ending of colors is indicated by a token that indicates begining of standard color: black! The only peculiarity I could notice was that they use HTML like tags (instead of tokens starting with 0x1B5B) when you send messages with color gradient.

The formatted-message is then encoded before saving into the record. One of the challenges with a web-desktop application is that while storing the encrypted data on the desktop, what key do we use for encryption? The key used for encryption has to be different for each user, should not be guessable by other users and should not be stored on the PC. One option could be to use the password of the user as the key. But, whenever the user changes the password the archive has to be decrypted with the old password and encrypted with the new one.

The alternate solution is to have a autogenerated key stored per user on the website. An authenticated user can download his key and decrypt the archive. Changing the password does not change this key.

YMessenger uses the simple XOR cipher to encrypt the messages. The key used for the cipher is highly guessable: your user-id! Every byte of the formatted-message is XOR’ed with a byte from the user-id. For example if you message was “Hello World!” and your user-id was “doofy“, then the encrypted bytes would be:
[H^d][e^o][l^o][l^f][o^y][ ^d][W^o][o^o][r^f][l^y][d^d][!^o]

The beauty of XOR cipher is that if encoded-message = xor-cipher(original-message, key) then original-message = xor-cipher(encoded-message, key) documents the file format in more detail. demonstrates how one can use the Parser to convert the .dat files to HTML or plain text format.

How did you spend your new year’s eve?! :D