Dec 31 2007

Yahoo! messenger archive file format

Rajiv @ 8:46 pm GMT-0700

The first step to get my Yahoo! messenger (YMessenger) conversations into windows desktop search is to decode the conversations stored in YMessenger archive files.

If you enable message archiving, YMessenger saves all the conversations with your friends in C:\Program Files\Yahoo!\Messenger\Profiles\${userid}\Archive\Messages\ directory in files with the extension .dat. The menu option Contacts -> Message Archive shows all the archived conversations grouped by user.

Figuring out the format of this file was thrilling … but probably not as challenging as the codes faced by codebreakers in the real world. (Simon Singh’s The Code Book did leave a lasting impression on me … as did his other books!)

If you consider the equation encoded-message = code(original-message), the real codebreakers have access only to the encoded message. They have to figure out the code and, in the process, figure out the original message. The biggest leverage I had was that I had to figure out only the code, given both the encoded-message and the original-message.

For example, I could send myself the message “a” and look at the contents of the .dat file. Then I could send myself the message “aa” and look at the contents of the .dat file again. Then I could send the message “b”, followed by “bb”, followed by “ab”, looking at the changes to the .dat file after every step.

Each message (from you to your buddy or vice-versa) in the .dat file is represented by a Record. Every Record holds a timestamp of when the message was sent, a flag saying whether it was from you or from your buddy, and the formatted (bold, italic etc) message. Your profile name and the name of your buddy can be derived from the name of the .dat file and the name of its parent directory respectively.

The format for a Record is:

  • The first int (four bytes) represents the number of seconds since the Unix epoch (Jan 1, 1970), which is also the epoch Java uses
  • The second int … I don’t know what it is
  • The third int indicates that the message is from you to your buddy if it is zero or from your buddy to you if it is non-zero
  • The fourth int (msgLen) represents the length of the encoded formatted-message. There is no encoding/encryption till this point.
  • Next msgLen bytes represent the encoded formatted-message
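The record layout above can be sketched in Java roughly as follows. This is only an illustration of the fields listed in this post, not the code from Parser.java: the class and field names are made up, the byte order of the ints is not stated above and is assumed here to be little-endian, and the real file may contain extra bytes that the list does not describe.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.time.Instant;

class RecordReader {
    // ASSUMPTION: the post does not state the byte order of the four ints.
    static final ByteOrder ASSUMED_ORDER = ByteOrder.LITTLE_ENDIAN;

    /** One archived message. Names are hypothetical, chosen for this sketch. */
    static final class Record {
        final long timestampSeconds;  // first int: seconds since Jan 1, 1970
        final int unknown;            // second int: meaning unknown
        final boolean fromBuddy;      // third int: zero = from you, non-zero = from buddy
        final byte[] encodedMessage;  // msgLen bytes, still encoded

        Record(long timestampSeconds, int unknown, boolean fromBuddy, byte[] encodedMessage) {
            this.timestampSeconds = timestampSeconds;
            this.unknown = unknown;
            this.fromBuddy = fromBuddy;
            this.encodedMessage = encodedMessage;
        }
    }

    /** Reads the next Record, or returns null if the stream ends cleanly before a record starts. */
    static Record readRecord(InputStream in) throws IOException {
        DataInputStream data = new DataInputStream(in);
        int first = data.read();
        if (first < 0) {
            return null; // no more records
        }
        byte[] header = new byte[16]; // four 4-byte ints
        header[0] = (byte) first;
        data.readFully(header, 1, 15); // throws EOFException if the header is truncated

        ByteBuffer buf = ByteBuffer.wrap(header).order(ASSUMED_ORDER);
        long timestamp = buf.getInt() & 0xFFFFFFFFL; // read as unsigned (a choice made for this sketch)
        int unknown = buf.getInt();
        boolean fromBuddy = buf.getInt() != 0;
        int msgLen = buf.getInt();
        if (msgLen < 0) {
            throw new IOException("negative message length: " + msgLen);
        }
        byte[] message = new byte[msgLen];
        data.readFully(message);
        return new Record(timestamp, unknown, fromBuddy, message);
    }

    public static void main(String[] args) throws IOException {
        // Build one made-up record in memory instead of reading a real .dat file.
        byte[] sample = ByteBuffer.allocate(16 + 3).order(ASSUMED_ORDER)
                .putInt(1199145600)   // arbitrary timestamp
                .putInt(0)            // unknown field
                .putInt(1)            // non-zero: from buddy
                .putInt(3)            // msgLen
                .put(new byte[] {0x05, 0x0A, 0x0E})
                .array();
        InputStream in = new ByteArrayInputStream(sample);
        Record r;
        while ((r = readRecord(in)) != null) {
            System.out.println(Instant.ofEpochSecond(r.timestampSeconds)
                    + (r.fromBuddy ? " buddy -> you, " : " you -> buddy, ")
                    + r.encodedMessage.length + " encoded bytes");
        }
    }
}
```

To try it on a real archive you would pass a FileInputStream for one of the .dat files instead of the in-memory sample, and flip ASSUMED_ORDER if the timestamps come out as nonsense.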

The formatting of the message is indicated by special tokens in the .dat file. These font-attribute tokens always start with the two bytes 0x1B 0x5B (written 0x1B5B below; in ASCII these are ESC and “[”). So if you type the message “this is NOT acceptable” with NOT in bold, the formatted message would be “this is {[0x1B5B][0x31]}NOT{[0x1B5B]x[0x31]} acceptable” (where {} indicates one token and [] indicates bytes shown as hex values instead of ASCII characters). {[0x1B5B][0x31]} is the token that indicates begin bold and {[0x1B5B]x[0x31]} indicates end bold.

There are tokens for (begin and end of) bold, italic and underline. Then there are tokens that mark the beginning of custom and standard (palette) colors. The end of a color is indicated by a token that marks the beginning of a standard color: black! The only peculiarity I could notice was that they use HTML-like tags (instead of tokens starting with 0x1B5B) when you send messages with a color gradient.

The formatted-message is then encoded before saving into the record. One of the challenges with a web-desktop application is that while storing the encrypted data on the desktop, what key do we use for encryption? The key used for encryption has to be different for each user, should not be guessable by other users and should not be stored on the PC. One option could be to use the password of the user as the key. But, whenever the user changes the password the archive has to be decrypted with the old password and encrypted with the new one.

The alternate solution is to have an autogenerated key stored per user on the website. An authenticated user can download his key and decrypt the archive. Changing the password does not change this key.

YMessenger uses the simple XOR cipher to encrypt the messages. The key used for the cipher is highly guessable: your user-id! Every byte of the formatted-message is XOR’ed with a byte from the user-id, and the user-id is repeated as often as needed to cover the whole message. For example, if your message was “Hello World!” and your user-id was “doofy”, then the encrypted bytes would be:
[H^d][e^o][l^o][l^f][o^y][ ^d][W^o][o^o][r^f][l^y][d^d][!^o]

The beauty of XOR cipher is that if encoded-message = xor-cipher(original-message, key) then original-message = xor-cipher(encoded-message, key)
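A minimal Java sketch of this repeating-key XOR, using the “Hello World!” / “doofy” example from above. The class and method names are made up for this illustration (this is not the code from Parser.java), and treating the user-id and message as ASCII bytes is an assumption.

```java
import java.nio.charset.StandardCharsets;

class XorCipher {
    /** XORs every byte of data with the key, repeating the key as needed. */
    static byte[] xor(byte[] data, byte[] key) {
        if (key.length == 0) {
            throw new IllegalArgumentException("key must not be empty");
        }
        byte[] out = new byte[data.length];
        for (int i = 0; i < data.length; i++) {
            out[i] = (byte) (data[i] ^ key[i % key.length]);
        }
        return out;
    }

    public static void main(String[] args) {
        // ASSUMPTION: user-id and message are plain ASCII.
        byte[] key = "doofy".getBytes(StandardCharsets.US_ASCII);
        byte[] original = "Hello World!".getBytes(StandardCharsets.US_ASCII);

        byte[] encoded = xor(original, key); // encrypt
        byte[] decoded = xor(encoded, key);  // the same call decrypts

        System.out.println(new String(decoded, StandardCharsets.US_ASCII)); // prints "Hello World!"
    }
}
```

The same symmetry is what made the codebreaking easy: XOR-ing an encoded message with the original message I had sent myself gives back the repeating key.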

Parser.java documents the file format in more detail. Main.java demonstrates how one can use the Parser to convert the .dat files to HTML or plain text format.

How did you spend your new year’s eve?! 😀


Dec 21 2007

Addicted to search

Rajiv @ 11:39 am GMT-0700

For as long as I can remember, I have been too lazy to use my fingers to type and my brain to remember things. When I used to work on Linux, I relied heavily on the locate command to search for and open files: vi `locate math.h`. This was something I missed a lot in Windows. Finally I started using Launchy as a replacement for locate.
Locating source files using Launchy

Searching through mails also worked pretty well when I was using Evolution on Linux. But search in Outlook sucks, especially if you are using IMAP. The amazing Lookout plugin for Outlook was my saviour. Unfortunately, it had its own problems. It used to crash my Outlook 2K often; and once Microsoft bought them, there was no hope of getting things fixed. Microsoft has been pushing its own Windows Desktop Search instead of Lookout. Though not as fast as Lookout … it is the compromise solution I have been using for the sake of stability.

I know there are other desktop search products out there, including the one from Google. But the thing I like about Windows Desktop Search is that I can do Outlook operations on the search results (like forwarding the mails or moving them into folders). Now I am so addicted to search that I move mails to folders only when they have huge attachments. Otherwise, it is pretty much the Gmail model. I only use Inbox.yyyy and sent-mail.yyyy folders actively (yyyy being the year). I am considering setting up a rule in Outlook to save sent mail in Inbox; that way I would have only Inbox.yyyy folders and I could do a threaded view of the conversations (i.e. Gmail’s “All Mail” label)!

Apart from email, our other significant communication medium at work is Yahoo! messenger. I archive all my conversations and refer to them very often. The unfortunate side effect of this is that some conversations that start on email are concluded on chat, and six months later when I search mails, I do not find the mail with the conclusion. Over the years I have wised up: I search for the conversation in email and then follow up with a search in YMessenger. Unfortunately, search in YMessenger also sucks! You have to do a manual search based on the timestamp of the email conversation and the people involved.

YMessenger saves conversations in C:\Program Files\Yahoo!\Messenger\Profiles\${userid}\Archive\Messages\ directory in files with the extension .dat. It would be nice to have Windows Desktop Search (WDS) index these files and show my conversation results when I search for communications. I can think of a few approaches to achieve this:

  • Convert ymessenger archives to Outlook mailbox format (.pst) and let WDS index it
  • Convert ymessenger archives to RSS and import the RSS into Outlook using RSS Popper; once the messages are in Outlook, WDS will index them
  • Convert the ymessenger archive files (.dat) to html format and have WDS index these. Probably the easiest integration, but the limitation is that I would not be able to run searches of the type “customer requirements from:myYahooBuddy date:last month”
  • WDS supports plugging in IFilters to search new file types. I could implement an IFilter to index the ymessenger archive files (.dat).

All of these presume there is some API to decode the content in the ymessenger archive files (.dat). The search is on!