There are some pretty useful sites out there, but some interfaces are just plain annoying.
Take pof.com for example: they have millions of users, but they haven't touched their interface since the beginning, and if you receive a lot of messages it becomes a pain to go through them all quickly.
I figured it might be easier to use my hacking skills to create my own interface.
Step 1: Login
First take a look at the source code of the form on the login page:
<form action="https://www.pof.com/processLogin.aspx" method="post" id="frmLogin" name="frmLogin" class="form right">
  <div id="login-box">
    <input name="url" id="url" class="title" type="hidden">
    <input name="username" id="username" class="title input" type="text" value="l33tman">
    <label class="headline txtBlue size12 label username" for="username">Username</label>
    <input name="password" id="password" class="title input" type="password">
    <label class="headline txtBlue size12 label password" for="password">Password</label>
    <script type="text/javascript">
      var nowt = new Date(), tempt_F = nowt.getTimezoneOffset();
      document.write('<input type="hidden" value="' + tempt_F + '" name="tfset"/>');
    </script>
    <input type="hidden" value="300" name="tfset">
    <input name="login" id="login" class="button norm-blue submit" type="submit" value="Check Mail!">
    <input name="callback" id="callback" type="hidden" value="http%3a%2f%2fwww.pof.com%2fstart.aspx">
    <input name="sid" id="sid" type="hidden" value="wcqugtcmwbpb2rvn345x4mxk">
  </div>
  <script type="text/javascript">
    if (document.getElementsByTagName("html").lang == undefined || document.getElementsByTagName("html").lang == null) {
      var html = document.getElementsByTagName("html")[0];
      html["lang"] = "en";
    }
  </script>
</form>
We will use python-requests to make all our requests with a simulated user session. See http://docs.python-requests.org/en/latest/ for more details.
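To see why the session object matters, note that any cookies a response sets are kept in the session's cookie jar and replayed on every later request. A minimal sketch (the cookie name and value here are made up for illustration):

```python
import requests

# A session persists cookies across requests; a plain requests.post
# would start from a clean slate every time.
session = requests.session()

# Simulate a cookie the server might set on login (hypothetical values).
session.cookies.set('pof_session', 'abc123', domain='www.pof.com')

# The cookie is now available to every request made through this session.
print(session.cookies.get('pof_session'))  # abc123
```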
We will start by passing in all those input values to requests.post:
import requests

session = requests.session()
payload = dict(username=username,
               password=password,
               tfset="300",
               callback="http%3a%2f%2fwww.pof.com%2finbox.aspx",
               sid="wcqugtcmwbpb2rvn345x4mxk")
response = session.post("https://www.pof.com/processLogin.aspx", data=payload)
By using session.post instead of the plain requests.post, we retain all the cookie information necessary to simulate an actual logged-in user.
Step 2: Collect the message links
BeautifulSoup makes parsing html extremely simple. See http://www.crummy.com/software/BeautifulSoup/bs4/doc/ for the docs.
Say we have an html string and we’d like to find all the html elements with the “message” class. Here’s how we would do that with BeautifulSoup:
from bs4 import BeautifulSoup

soup = BeautifulSoup(html)
for message in soup.find_all('a', 'message'):
    # process your message
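Here is the same idea as a self-contained sketch, run against a tiny hand-written page (the markup is invented for illustration):

```python
from bs4 import BeautifulSoup

# A minimal, made-up page with two links carrying the "message" class.
html = '''
<a class="message" href="/msg/1">First message</a>
<a class="other" href="/about">About</a>
<a class="message" href="/msg/2">Second message</a>
'''

soup = BeautifulSoup(html, 'html.parser')
# The second positional argument to find_all matches on CSS class.
texts = [a.text for a in soup.find_all('a', 'message')]
print(texts)  # ['First message', 'Second message']
```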
In step 1 we logged into pof.com and got a response object. We can pass the html contents of this object to BeautifulSoup to begin parsing.
For our case we need the next_page link and the links to the individual messages (POF's html is terrible, so some hackery was necessary to extract the elements reliably):
import re

soup = BeautifulSoup(response.text)
next_page = soup.find('a', text='Next Page').attrs['href']
message_links = []
for message_link in soup.find_all(attrs={'href': re.compile('viewallmessages.*')}):
    message_links.append(message_link.attrs['href'])
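The href pattern can be tried on its own with nothing but the re module. The inbox markup below is an invented, trimmed-down stand-in for the real page:

```python
import re

# Hypothetical inbox html, reduced to just the hrefs we care about.
html = '''
<a href="viewallmessages.aspx?profile_id=123">view</a>
<a href="editprofile.aspx">edit</a>
<a href="viewallmessages.aspx?profile_id=456">view</a>
'''

# Pull out only the message links, as the scraper's regex does.
links = re.findall(r'href="(viewallmessages[^"]*)"', html)
print(links)  # ['viewallmessages.aspx?profile_id=123', 'viewallmessages.aspx?profile_id=456']
```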
Step 3: Collect the content
We now need to go to each link and fetch the message content and user data.
Continuing to use the session object for all requests, we get:
def parse_all_messages(links):
    messages = []
    for link in links:
        comment_page = session.get(link)
        soup = BeautifulSoup(comment_page.text)
        for message in soup.find_all(attrs={'style': re.compile('width:500px.*')}):
            user = soup.find('span', 'username-inbox')
            user_image_url = soup.find('td', attrs={'width': "60px"}).img.attrs['src']
            messages.append(dict(user_username=clean_string(user.text),
                                 user_url=pof_url(user.a.attrs['href']),
                                 user_image_url=user_image_url,
                                 date=user.parent.find('div').text,
                                 message=clean_string(message.text)))
    return sorted(messages, key=lambda m: to_date(m['date']), reverse=True)
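The sorted() call at the end depends on to_date turning POF's date strings into real datetimes. A minimal sketch of that step, assuming the inbox uses a '%m/%d/%Y %I:%M:%S %p' format (the dates below are made up):

```python
from datetime import datetime

FMT = '%m/%d/%Y %I:%M:%S %p'

def to_date(date_string):
    # strptime accepts non-zero-padded month/day/hour values here.
    return datetime.strptime(date_string, FMT)

dates = ['1/2/2013 3:04:05 PM', '12/31/2012 11:59:59 PM']
newest_first = sorted(dates, key=to_date, reverse=True)
print(newest_first[0])  # 1/2/2013 3:04:05 PM
```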
Step 4: Pretty Print the data
I have opted to use Jinja2 to render the html, but this is not at all necessary. Jinja2 is a simple templating library used by many python web frameworks. See http://jinja.pocoo.org/docs/ for a more in-depth tutorial.
It’s fairly simple to use:
>>> from jinja2 import Template
>>> template = Template('Hello {{ name }}!')
>>> template.render(name='John Doe')
u'Hello John Doe!'
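Beyond single values, Jinja2 can loop over a list of dicts, which is exactly how the final template walks the messages. A small sketch with invented sample data shaped like the parsed messages:

```python
from jinja2 import Template

# Made-up sample data with the same shape as the scraped messages.
messages = [dict(user='alice', message='hi'),
            dict(user='bob', message='hello')]

# {% for %} iterates; {{ }} interpolates each dict's fields.
template = Template('{% for m in messages %}{{ m.user }}: {{ m.message }}\n{% endfor %}')
print(template.render(messages=messages))
```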
Be careful to properly encode your strings when using Jinja2. POF serves some malformed characters, which required cleaning the strings with string.encode('ascii', 'ignore').
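On Python 2 (which this script targets), that call silently drops any non-ascii characters. A tiny illustration, written to also run on Python 3 by decoding back to text:

```python
# A string with characters that won't survive an ascii encode.
s = u'caf\xe9 \u2019smart quotes\u2019'

# encode(..., 'ignore') drops anything outside ascii instead of raising.
cleaned = s.encode('ascii', 'ignore').decode('ascii')
print(cleaned)  # caf smart quotes
```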
Step 5: Run it!
Below is the script in its entirety.
####################################################################
# my_pof_messages.py
#
# A simple script to scrape your pof messages and
# print them to a single html file. Also outputs to json.
#
# Usage:
#   sudo pip install beautifulsoup4 requests jinja2
#   python my_pof_messages.py <username> <password> <output_prefix>
#   firefox output_prefix.html
#
# Author:
#   Ramin Rahkhamimov
#   ramin32@gmail.com
#   http://raminrakhamimov.com
####################################################################

import requests
from bs4 import BeautifulSoup
import re
from jinja2 import Template
import json
import sys
from datetime import datetime

pof_url = lambda x: "https://www.pof.com/%s" % x

session = requests.session()


def append_message_links(e, links):
    soup = BeautifulSoup(e.text)
    for a in soup.find_all(attrs={'href': re.compile('viewallmessages.*')}):
        links.append(pof_url(a.attrs['href']))
    next_page = soup.find('a', text='Next Page')
    return next_page and pof_url(next_page.attrs['href'])


def get_all_message_links(username, password):
    links = []
    payload = dict(username=username,
                   password=password,
                   tfset="300",
                   callback="http%3a%2f%2fwww.pof.com%2finbox.aspx",
                   sid="ikdnixh1pblvis1dlqaa0mb3")
    e = session.post(pof_url("processLogin.aspx"), data=payload)
    next_page = append_message_links(e, links)
    while next_page:
        e = session.get(next_page)
        next_page = append_message_links(e, links)
    return set(links)


def clean_string(string):
    return string.encode('ascii', 'ignore')


def to_date(date_string):
    return datetime.strptime(date_string, '%m/%d/%Y %I:%M:%S %p')


def parse_all_messages(links):
    messages = []
    for link in links:
        comment_page = session.get(link)
        soup = BeautifulSoup(comment_page.text)
        for message in soup.find_all(attrs={'style': re.compile('width:500px.*')}):
            user = soup.find('span', 'username-inbox')
            user_image_url = soup.find('td', attrs={'width': "60px"}).img.attrs['src']
            messages.append(dict(user_username=clean_string(user.text),
                                 user_url=pof_url(user.a.attrs['href']),
                                 user_image_url=user_image_url,
                                 date=user.parent.find('div').text,
                                 message=clean_string(message.text)))
    return sorted(messages, key=lambda m: to_date(m['date']), reverse=True)


def save_messages(messages, prefix):
    template = Template("""
    <html>
    <head>
        <style>
            .user, .message, .date { display: inline-block; vertical-align: top; }
            .message { width: 500px; padding-left: 10px; }
        </style>
    </head>
    <body>
        <ol>
        {% for message in messages %}
            <li>
                <a href="{{message.user_url}}" class="user">
                    <img src="{{message.user_image_url}}"/>
                    <div>{{message.user_username}}</div>
                </a>
                <div class="message">{{message.message}}</div>
                <div class="date">{{message.date}}</div>
            </li>
        {% endfor %}
        </ol>
    </body>
    </html>
    """)
    with open('%s.html' % prefix, 'w') as f:
        f.write(template.render(messages=messages))
    with open('%s.json' % prefix, 'w') as f:
        f.write(json.dumps(messages))


if __name__ == '__main__':
    if len(sys.argv) != 4:
        print "Usage: my_pof_messages.py <username> <password> <output_prefix>"
        sys.exit(1)
    links = get_all_message_links(sys.argv[1], sys.argv[2])
    messages = parse_all_messages(links)
    save_messages(messages, sys.argv[3])
Install requests, beautifulsoup4, and jinja2, then run the script with python. Depending on your inbox size, this may take a couple of minutes. Once the script is done running, open the newly created html file with your favorite browser:
sudo pip install requests beautifulsoup4 jinja2
python my_pof_messages.py your_username your_password output
firefox output.html
This script can easily be tweaked to work with your favorite service provider.