Dec 31 2007

Yahoo! messenger archive file format

Rajiv @ 8:46 pm UTC

The first step in getting my Yahoo! Messenger (YMessenger) conversations into Windows Desktop Search is to decode the conversations stored in the YMessenger archive files.

If you enable message archiving, YMessenger saves all the conversations with your friends in the C:\Program Files\Yahoo!\Messenger\Profiles\${userid}\Archive\Messages\ directory, in files with the extension .dat. The menu option Contacts -> Message Archive shows all the archived conversations grouped by user.

Figuring out the format of this file was thrilling … but probably not as challenging as the codes faced by codebreakers in the real world. (Simon Singh’s The Code Book did leave a lasting impression on me … as did his other books!)

If you consider the equation encoded-message = code(original-message), the real codebreakers have access to only the encoded message. They have to figure out the code and, in the process, the original message. My biggest advantage was that I only had to figure out the code, given both the encoded-message and the original-message.

For example, I could send myself the message “a” and look at the contents of the .dat file. Then I could send myself the message “aa” and look at the contents of the .dat file again. Then I could send the message “b”, followed by “bb”, followed by “ab”, looking at the changes to the .dat file after every step.

Each message (from you to your buddy or vice versa) in the .dat file is represented by a Record. Every Record holds the timestamp of when the message was sent, a flag for whether it was from you or from your buddy, and the formatted (bold, italic etc.) message. Your profile name and the name of your buddy can be derived from the name of the .dat file and the name of its parent directory respectively.

The format for a Record is as follows (each int is four bytes, stored least-significant byte first):

  • The first int represents the number of seconds since the Unix epoch (Jan 1, 1970 UTC)
  • The second int … I don’t know what it is
  • The third int indicates that the message is from you to your buddy if it is zero, or from your buddy to you if it is non-zero
  • The fourth int (msgLen) represents the length of the encoded formatted-message. There is no encoding/encryption up to this point.
  • The next msgLen bytes represent the encoded formatted-message
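
The layout above can be sketched in Python roughly as follows. This is only a sketch, not the code from Parser.java: the names `Record`, `HEADER` and `read_record` are made up for illustration, the ints are assumed to be unsigned and little-endian (the byte order is the one pointed out in the comments below), and any bytes that may follow a message are not handled.

```python
import struct
from datetime import datetime, timezone
from typing import NamedTuple, Tuple


class Record(NamedTuple):
    sent_at: datetime   # first int: seconds since Jan 1, 1970 (UTC)
    unknown: int        # second int: meaning unknown
    from_buddy: bool    # third int: zero = from you, non-zero = from your buddy
    encoded: bytes      # msgLen bytes of the still-encoded formatted-message


# Four 32-bit ints, least-significant byte first (assumed byte order).
HEADER = struct.Struct("<4I")


def read_record(data: bytes, offset: int = 0) -> Tuple[Record, int]:
    """Parse one Record at offset; return it with the offset just past the message."""
    seconds, unknown, direction, msg_len = HEADER.unpack_from(data, offset)
    start = offset + HEADER.size
    encoded = data[start:start + msg_len]
    if len(encoded) != msg_len:
        raise ValueError("record is truncated")
    sent_at = datetime.fromtimestamp(seconds, tz=timezone.utc)
    return Record(sent_at, unknown, direction != 0, encoded), start + msg_len


# Hand-built record: timestamp bytes 3A 8B D6 4B (1272351546), sent by you,
# followed by a 2-byte encoded message.
sample = bytes.fromhex("3A8BD64B" "00000000" "00000000" "02000000") + b"\x0c\x06"
record, next_offset = read_record(sample)
print(record.sent_at.isoformat(), record.from_buddy, record.encoded, next_offset)
```

The function returns the offset just past the message so that a caller can loop over a whole file, adjusting for whatever else the file turns out to contain between records.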

The formatting of the message is indicated by some special tokens in the .dat file. These font-attribute tokens always start with 0x1B5B. So if you type the message “this is NOT acceptable” with NOT in bold, the formatted message would be “this is {[0x1B5B][0x31]}NOT{[0x1B5B]x[0x31]} acceptable” (where {} indicate one token and [] indicate bytes shown as hex values instead of ASCII characters). {[0x1B5B][0x31]} is a token that indicates begin bold and {[0x1B5B]x[0x31]} indicates end bold.
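
To make the notation concrete, here is the same example spelled out as raw bytes in Python. It only illustrates the bytes described in this post; the names `TOKEN_PREFIX`, `BEGIN_BOLD` and `END_BOLD` are invented for the example. Note that 0x1B is the ASCII escape character and 0x5B is “[”.

```python
TOKEN_PREFIX = b"\x1b["             # 0x1B 0x5B: escape character followed by "["
BEGIN_BOLD = TOKEN_PREFIX + b"1"    # {[0x1B5B][0x31]}
END_BOLD = TOKEN_PREFIX + b"x1"     # {[0x1B5B]x[0x31]}

# "this is NOT acceptable" with NOT in bold, before any encoding is applied.
formatted = b"this is " + BEGIN_BOLD + b"NOT" + END_BOLD + b" acceptable"
print(formatted)
```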

There are tokens for the beginning and end of bold, italic and underline. There are also tokens that mark the beginning of custom and standard (palette) colors. The end of a color is indicated by the token for the beginning of a standard color: black! The only peculiarity I noticed was that messages sent with a color gradient use HTML-like tags instead of tokens starting with 0x1B5B.

The formatted-message is then encoded before it is saved into the record. One of the challenges with a web-desktop application that stores encrypted data on the desktop is choosing the encryption key. The key has to be different for each user, should not be guessable by other users and should not be stored on the PC. One option could be to use the user’s password as the key. But then, whenever the user changes the password, the archive has to be decrypted with the old password and encrypted with the new one.

The alternative solution is to have an autogenerated key stored per user on the website. An authenticated user can download his key and decrypt the archive. Changing the password does not change this key.

YMessenger uses a simple XOR cipher to encrypt the messages. The key used for the cipher is highly guessable: your user-id! Every byte of the formatted-message is XOR’ed with a byte from the user-id, and the user-id is repeated as often as needed to cover the whole message. For example, if your message was “Hello World!” and your user-id was “doofy”, then the encrypted bytes would be:
[H^d][e^o][l^o][l^f][o^y][ ^d][W^o][o^o][r^f][l^y][d^d][!^o]

The beauty of the XOR cipher is that if encoded-message = xor-cipher(original-message, key) then original-message = xor-cipher(encoded-message, key).
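
A minimal Python sketch of this cipher, using the “Hello World!” and “doofy” example from above. The function name `xor_cipher` is just the name used in the equation; treating the user-id as ASCII bytes and restarting the key at the beginning of every message are assumptions of this sketch.

```python
from itertools import cycle


def xor_cipher(data: bytes, key: bytes) -> bytes:
    """XOR every byte of data with the key, repeating the key as needed."""
    if not key:
        raise ValueError("key must not be empty")
    return bytes(b ^ k for b, k in zip(data, cycle(key)))


message = b"Hello World!"
user_id = b"doofy"

encoded = xor_cipher(message, user_id)   # [H^d][e^o][l^o][l^f][o^y] ...
decoded = xor_cipher(encoded, user_id)   # applying the same key again undoes it
print(encoded, decoded)
```

Because the same function both encodes and decodes, a reader for the archive needs nothing more than the user-id to recover the formatted-message.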

Parser.java documents the file format in more detail. Main.java demonstrates how one can use the Parser to convert the .dat files to HTML or plain text format.

How did you spend your new year’s eve?! :D


7 comments


  1. KermodeBear


    Hey! Thank you VERY much for posting this information. I found it to be incredibly useful. I had been wanting to export my conversations with some friends to text files for quite a while, but was not interested in paying for utilities that do this. With your help I was able to quickly write up some PHP to do the job. Yes, I know, PHP isn’t the ideal language for this kind of thing, but it is the language in which I am most adept.



  2. KermodeBear


    Seems that the ‘code’ tag still hates less-than symbols inside. Tsk tsk. Giving this another try.


    #!/usr/bin/php
    <?php
    /**
    * Usage:
    * ./decodeYahoo.php account contact file
    *
    * 'account' is the Y!M account name.
    * 'contact' is the name of the contact
    * 'file' is the file to parse.
    *
    * Please note: There is no error checking
    * in the code below. If you want to use
    * this code for anything important, please
    * add some. Also, fopen/fread would be more
    * memory efficient than a file_get_contents,
    * but again, I'm being super lazy today. (o:
    */

    $account = $argv[1];
    $contact = $argv[2];
    $data = file_get_contents($argv[3]);
    $p = 0; // Position in the file's data.

    while ($p < strlen($data)) {
        $result = array();

        // Record header: four 32-bit little-endian integers (16 bytes).
        $pieces = unpack('Vtime/Vunknown/VtoFrom/Vlength', substr($data, $p, 16));

        $p += 16;

        $result['date'] = date('r', $pieces['time']);
        // toFrom is zero for messages you sent, non-zero for messages from the contact.
        $result['from'] = $pieces['toFrom'] ? $contact : $account;
        $cypherText = substr($data, $p, $pieces['length']);

        $p += $pieces['length'];

        // Generates the XOR key by repeating the account name
        // then chopping off extra characters so the entire message
        // can be XOR'ed in one operation.
        $key = str_repeat($account, (int) ceil($pieces['length'] / strlen($account)));
        $key = substr($key, 0, strlen($cypherText));
        $result['text'] = $key ^ $cypherText;

        // There seems to be an extra 4 bytes of junk here.
        $p += 4;

        // Nov 01, 2008 12:34 PersonTalking: That's what SHE said.
        echo "{$result['date']} {$result['from']}: {$result['text']}\n";
    }
    ?>


  3. JSG


    I think the format is actually:
    ———————————
    1. timestamp (5 bytes)
    2. padding (3 bytes)
    3. toOrFrom (1 byte)
    4. padding (3 bytes)
    5. mesg_len (1 byte)
    6. padding (3 bytes)

    Some of that is up for debate but the timestamp is definitely 5 bytes.


  4. KermodeBear


    It doesn’t make sense for the message length to be 1 byte, since messages can be over 256 characters in length. I’m also not convinced that the timestamp is 5 bytes either – it may be, but when I have decoded the archive files, the time stamps have always been accurate.

    That said, I haven’t tested this on the newest versions of the Yahoo! IM software, so something might have changed. Maybe it’s time to give it another go around and see what happens. (o:


  5. Joe Dassin


    Ok can someone do it the other way around ? I mean a YIM! Archives ENCODER …..


  6. Kyle E. Domingo


    Just to inform you guys, I tried to code a reader for YM archives, but this format (though the segments are still the same) no longer applies to my current YM version (9.0.0.1912). In this version, the bytes in each segment are reversed.

    For example, the four-byte representation of the timestamp 1272351546 is 4B D6 8B 3A. This is stored in the archive file as 3A 8B D6 4B. The same goes for the to_or_from and the message length (for example, 0200 0000 for a 2-letter message) and, I think, the message as well.


  7. Rajiv


    Hi Kyle, thanks for the note!

    The blog entry does not discuss the file format in detail. One of the undocumented details is the format in which an integer is stored/read.

    The attached file Parser.java is a more “complete reference” (to the extent I have understood). If you see the readInt method in the file you will notice the bytes are read in reverse like you mentioned.
