Does your favorite web service have a crappy interface? Make your own with Python, Python-Requests and BeautifulSoup!

There are some pretty useful sites out there, but some interfaces are just plain annoying.
Take pof.com for example: they have millions of users but haven’t touched their interface since the beginning, so if you get lots of messages it becomes a pain to go through them all quickly.
I figured it might be easier to use my hacking skills to create my own interface.
Step 1: Login
First take a look at the source code of the form on the login page:

<form action="https://www.pof.com/processLogin.aspx" method="post" id="frmLogin" name="frmLogin" class="form right">
	<div id="login-box">
		<input name="url" id="url" class="title" type="hidden">
		<input name="username" id="username" class="title input" type="text" value="l33tman">
        <label class="headline txtBlue size12 label username" for="username">Username</label>
		<input name="password" id="password" class="title input" type="password">
		<label class="headline txtBlue size12 label password" for="password">Password</label>
        <script type="text/javascript">
            var nowt = new Date(),
                tempt_F = nowt.getTimezoneOffset();
            document.write("<input type='hidden' value='" + tempt_F + "' name='tfset'/>");
        </script><input type="hidden" value="300" name="tfset">
		<input name="login" id="login" class="button norm-blue submit" type="submit" value="Check Mail!">
        <input name="callback" id="callback" type="hidden" value="http%3a%2f%2fwww.pof.com%2fstart.aspx">
        <input name="sid" id="sid" type="hidden" value="wcqugtcmwbpb2rvn345x4mxk">
	</div>
    <script type="text/javascript">
        if (document.getElementsByTagName("html").lang == undefined || document.getElementsByTagName("html").lang == null) {
            var html = document.getElementsByTagName("html")[0];
            html["lang"] = "en";
        }
    </script>
</form>

We will use python-requests to make all our requests with a simulated user session. See http://docs.python-requests.org/en/latest/ for more details.
We will start by passing all those input values to session.post:

import requests
session = requests.session()
payload = dict(username=username,
               password=password,
               tfset="300",
               callback="http%3a%2f%2fwww.pof.com%2finbox.aspx",
               sid="wcqugtcmwbpb2rvn345x4mxk")
response = session.post("http://pof.com/processLogin.aspx", data=payload)

By using session.post instead of the plain requests.post, we retain the cookies needed to simulate an actual logged-in user.
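
The tfset and sid values in the payload above were copied by hand from the form. A value like sid looks session-specific (that is an assumption, the site does not document it), so it may be safer to read the hidden inputs from the login page each time. Here is a minimal sketch using only the Python 3 standard library; HiddenInputCollector and hidden_fields are hypothetical names, and sample_form is an abbreviated copy of the form shown earlier:

```python
from html.parser import HTMLParser


class HiddenInputCollector(HTMLParser):
    """Collect name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attrs = dict(attrs)
        if attrs.get("type") == "hidden" and attrs.get("name"):
            # A hidden input without a value attribute is stored as "".
            self.fields[attrs["name"]] = attrs.get("value") or ""


def hidden_fields(html):
    """Return a dict of all hidden form fields found in the HTML string."""
    parser = HiddenInputCollector()
    parser.feed(html)
    return parser.fields


# Abbreviated copy of the login form from above (sample data).
sample_form = """
<form action="https://www.pof.com/processLogin.aspx" method="post">
    <input name="url" id="url" type="hidden">
    <input name="username" id="username" type="text">
    <input type="hidden" value="300" name="tfset">
    <input name="sid" id="sid" type="hidden" value="wcqugtcmwbpb2rvn345x4mxk">
</form>
"""

payload = hidden_fields(sample_form)
payload.update(username="your_username", password="your_password")
print(payload)
```

In a real run the HTML would come from fetching the login page with the same session object, so the cookies and the hidden values belong to the same visit.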
Step 2: Collect the message links
BeautifulSoup makes parsing html extremely simple. See http://www.crummy.com/software/BeautifulSoup/bs4/doc/ for the docs.
Say we have an html string and we’d like to find all the html elements with the “message” class. Here’s how we would do that with BeautifulSoup:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html)
for message in soup.find_all('a', 'message'):
    print(message)  # process your message

In step 1 we logged into pof.com and got a response object. We can pass the html contents of this object to BeautifulSoup to begin parsing.
For our case we need the next_page link and the links to the messages (the html code of POF is terrible, so some hackery was necessary to get the elements properly):

import re

soup = BeautifulSoup(response.text)
# find() returns None when there is no 'Next Page' link (the last page)
next_page = soup.find('a', text='Next Page').attrs['href']
message_links = []
for message_link in soup.find_all(attrs={'href': re.compile('viewallmessages.*')}):
    message_links.append(message_link.attrs['href'])
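
The next_page link is what lets us walk the whole inbox: keep fetching pages for as long as a "Next Page" link exists. The loop can be sketched without any network access. Here an in-memory dict stands in for the site; FAKE_PAGES, fetch_page, collect_all_links and the inbox.aspx?page=N names are all made up for illustration:

```python
# Fake inbox: each "page" maps to (message links on it, next page or None).
FAKE_PAGES = {
    "inbox.aspx?page=1": (
        ["viewallmessages.aspx?id=1", "viewallmessages.aspx?id=2"],
        "inbox.aspx?page=2",
    ),
    "inbox.aspx?page=2": (
        ["viewallmessages.aspx?id=2", "viewallmessages.aspx?id=3"],
        None,
    ),
}


def fetch_page(url):
    """Stand-in for session.get() followed by BeautifulSoup parsing."""
    return FAKE_PAGES[url]


def collect_all_links(first_page):
    links = []
    seen_pages = set()
    next_page = first_page
    # Stop when there is no next page, or when a page links back to one
    # we have already visited (guards against an endless loop).
    while next_page and next_page not in seen_pages:
        seen_pages.add(next_page)
        page_links, next_page = fetch_page(next_page)
        links.extend(page_links)
    # Drop duplicate links while keeping first-seen order (Python 3.7+).
    return list(dict.fromkeys(links))


print(collect_all_links("inbox.aspx?page=1"))
```

The same conversation can show up on more than one inbox page, which is why the links are de-duplicated before being fetched.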

Step 3: Collect the content
We now need to go to each link and fetch the message content and user data.
Continuing to use the session object for all requests, we get:

def parse_all_messages(links):
    messages = []
    for link in links:
        comment_page = session.get(link)
        soup = BeautifulSoup(comment_page.text)
        for message in soup.find_all(attrs={'style': re.compile('width:500px.*')}):
            user = soup.find('span', 'username-inbox')
            user_image_url = soup.find('td', attrs={'width':"60px"}).img.attrs['src']
            messages.append(dict(user_username=clean_string(user.text),
                                 user_url=pof_url(user.a.attrs['href']),
                                 user_image_url=user_image_url,
                                 date=user.parent.find('div').text,
                                 message=clean_string(message.text)))
    return sorted(messages, key=lambda m: to_date(m['date']), reverse=True)
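
The final sort relies on the to_date helper from the full script, which parses timestamps with the format '%m/%d/%Y %I:%M:%S %p'. A small sketch of why the parsing matters, using made-up sample dates: sorting the raw strings would compare them character by character and get the order wrong.

```python
from datetime import datetime


def to_date(date_string):
    # Same format string as the full script: 12-hour clock with AM/PM.
    return datetime.strptime(date_string, '%m/%d/%Y %I:%M:%S %p')


# Made-up sample data, deliberately out of order.
messages = [
    {'date': '1/5/2013 9:15:00 AM', 'message': 'first'},
    {'date': '1/5/2013 9:15:00 PM', 'message': 'third'},
    {'date': '1/5/2013 12:30:00 PM', 'message': 'second'},
]

# Newest first, comparing real datetime objects.
newest_first = sorted(messages, key=lambda m: to_date(m['date']), reverse=True)
print([m['message'] for m in newest_first])

# Comparing the raw strings instead puts 9:15 AM ahead of 12:30 PM.
by_string = sorted(messages, key=lambda m: m['date'], reverse=True)
print([m['message'] for m in by_string])
```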

Step 4: Pretty Print the data
I have opted to use Jinja2 to render the HTML, but this is not at all necessary. Jinja2 is a simple templating library that is used in many Python web frameworks. See http://jinja.pocoo.org/docs/ for a more in-depth tutorial.
It’s fairly simple to use:

>>> from jinja2 import Template
>>> template = Template('Hello {{ name }}!')
>>> template.render(name='John Doe')
u'Hello John Doe!'
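
The full script uses the same API with a {% for %} loop over a list of message dicts. Here is a smaller sketch of that pattern with made-up sample data. Note that it passes autoescape=True, which the script in this post does not do; that is my assumption that scraped message text should be shown literally rather than interpreted as HTML.

```python
from jinja2 import Template

template = Template(
    "<ol>{% for message in messages %}"
    "<li>{{ message.user_username }}: {{ message.message }}</li>"
    "{% endfor %}</ol>",
    autoescape=True,  # escape <, > and & in the rendered values
)

# Made-up sample message containing markup.
html = template.render(messages=[
    {'user_username': 'alice', 'message': 'hi <b>there</b>'},
])
print(html)
```

With autoescape on, the <b> tags in the message come out as &lt;b&gt; and &lt;/b&gt; instead of being rendered as bold text.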

Be careful to properly encode your strings when using Jinja2. POF has some malformed characters, which required cleaning the strings with string.encode('ascii', 'ignore').
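
To see what that cleaning step does, here is a small sketch with a made-up input string. The extra decode is my addition: on Python 3, encode returns a bytes object, so decoding back to text keeps the value printable in a template (on Python 2 the result is equivalent).

```python
def clean_string(text):
    # Drop every character that has no ASCII representation...
    ascii_bytes = text.encode('ascii', 'ignore')
    # ...then decode so the result is text again rather than bytes.
    return ascii_bytes.decode('ascii')


# The accented letter and the heart symbol are silently removed.
print(clean_string(u'caf\u00e9 \u2764 pof'))
```

This is lossy: any non-ASCII character is dropped rather than replaced, so accented names lose letters.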
Step 5: Run it!
Below is the script in its entirety.

####################################################################
# my_pof_messages.py
#
# A simple script to scrape your pof messages and
# print them to single html file. Also outputs to json.
#
# Usage:
# sudo pip install beautifulsoup4 requests jinja2
# python my_pof_messages.py <username> <password> <output_prefix>
# firefox output_prefix.html
#
# Author:
# Ramin Rahkhamimov
# ramin32@gmail.com
# http://raminrakhamimov.com
#####################################################################
import requests
from bs4 import BeautifulSoup
import re
from jinja2 import Template
import json
import sys
from datetime import datetime
pof_url = lambda x: "https://www.pof.com/%s" % x
session = requests.session()
def append_message_links(e, links):
    soup = BeautifulSoup(e.text)
    for a in soup.find_all(attrs={'href': re.compile('viewallmessages.*')}):
        links.append(pof_url(a.attrs['href']))
    next_page = soup.find('a', text='Next Page')
    return next_page and pof_url(next_page.attrs['href'])
def get_all_message_links(username, password):
    links = []
    payload = dict(username=username,
                   password=password,
                   tfset="300",
                   callback="http%3a%2f%2fwww.pof.com%2finbox.aspx",
                   sid="ikdnixh1pblvis1dlqaa0mb3")
    e = session.post(pof_url("processLogin.aspx"), data=payload)
    next_page = append_message_links(e, links)
    while next_page:
        e = session.get(next_page)
        next_page = append_message_links(e, links)
    return set(links)
def clean_string(string):
    return string.encode('ascii', 'ignore')
def to_date(date_string):
    return datetime.strptime(date_string, '%m/%d/%Y %I:%M:%S %p')
def parse_all_messages(links):
    messages = []
    for link in links:
        comment_page = session.get(link)
        soup = BeautifulSoup(comment_page.text)
        for message in soup.find_all(attrs={'style': re.compile('width:500px.*')}):
            user = soup.find('span', 'username-inbox')
            user_image_url = soup.find('td', attrs={'width':"60px"}).img.attrs['src']
            messages.append(dict(user_username=clean_string(user.text),
                                 user_url=pof_url(user.a.attrs['href']),
                                 user_image_url=user_image_url,
                                 date=user.parent.find('div').text,
                                 message=clean_string(message.text)))
    return sorted(messages, key=lambda m: to_date(m['date']), reverse=True)
def save_messages(messages, prefix):
    template = Template("""
    <html>
    <head>
        <style>
            .user, .message, .date {
                display: inline-block;
                vertical-align: top;
            }
            .message {
                width: 500px;
                padding-left: 10px;
            }
        </style>
    </head>
    <body>
        <ol>
        {% for message in messages %}
            <li>
            <a href="{{message.user_url}}" class="user">
            <img src="{{message.user_image_url}}"/>
            <div>
                {{message.user_username}}
            </div>
            </a>
            <div class="message">
                {{message.message}}
            </div>
            <div class="date">
                {{message.date}}
            </div>
            </li>
        {% endfor %}
        </ol>
    </body>
    </html>
    """)
    with open('%s.html' % prefix, 'w') as f:
        f.write(template.render(messages=messages))
    with open('%s.json' % prefix, 'w') as f:
        f.write(json.dumps(messages))
if __name__ == '__main__':
    if len(sys.argv) != 4:
        print("Usage: my_pof_messages.py <username> <password> <output_prefix>")
        sys.exit(1)  # stop here instead of failing later with an IndexError
    links = get_all_message_links(sys.argv[1], sys.argv[2])
    messages = parse_all_messages(links)
    save_messages(messages, sys.argv[3])

Install requests, beautifulsoup4 and jinja2 and run the script with python. Depending on your inbox size, this may take a couple of minutes. Once the script is done running, open the newly created HTML file with your favorite browser:

sudo pip install requests beautifulsoup4 jinja2
python my_pof_messages.py your_username your_password output
firefox output.html

This script can be easily tweaked to be used with your favorite service provider.
