Basic WordPress crawler in Python
I’m developing a new personal website where I will probably include some links to this blog, so I thought it would be a good idea to have an automatic way to extract them from this site. As I was a bit bored last Monday and wanted to refresh my Python skills, I thought: “Let’s write a really simple WP crawler to do that work for me!”
The Python crawler below is a really simple script. I know it could have been written with a nice menu of options, some methods with default parameters and all that, but I just wanted to code it quickly. Every tricky part is commented, but I will explain it in a few lines.
- WordPress blogs show only the latest entries; at the bottom there’s a link to older entries (pointing to something like …/page/number).
- The crawler starts by looking for every link in the web page given as the source.
- For every link found, if it appears on a line containing an ‘h2’ tag, the crawler considers it a blog entry and stores it in the list wp_entries.
- If it’s an internal link, the crawler follows it as long as it contains the word ‘page’.
- There’s a depth limit, so it doesn’t crawl through every page unless desired.
- When it has finished crawling, it shows every entry found.
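As a rough illustration of the two WordPress-specific rules above (an ‘h2’ on the line means "entry", ‘page’ in the URL means "follow it"), here is a minimal sketch that applies them to a single line of HTML. It is written for Python 3 and is not the crawler itself: the function name `classify_line` and the `example.wordpress.com` URLs are made up for the example, and external links are simply skipped.

```python
import re

# Same pattern idea as in the post: grab the target of every <a href="...">
LINK_RE = re.compile(r'a href="(.*?)"')


def classify_line(line, source):
    """Return (entries, to_crawl) for one line of HTML.

    entries  -> internal links found on a line that mentions 'h2'
    to_crawl -> internal links whose URL contains the word 'page'
    """
    entries, to_crawl = [], []
    for link in LINK_RE.findall(line):
        if not link.startswith(source):
            continue  # external link: ignored in this simplified sketch
        if 'h2' in line:
            entries.append(link)
        if 'page' in link:
            to_crawl.append(link)
    return entries, to_crawl


# Hypothetical lines, as they might appear in a blog's HTML
source = 'http://example.wordpress.com/'
post_line = '<h2><a href="http://example.wordpress.com/2012/01/24/some-post/">Some post</a></h2>'
older_line = '<a href="http://example.wordpress.com/page/2/">Older entries</a>'

print(classify_line(post_line, source))
print(classify_line(older_line, source))
```

The first call reports the post as an entry and nothing to crawl; the second reports no entry but one pagination link to follow.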
Here is the code:
[sourcecode wraplines="false" language="python"]
import urllib2
import urlparse
import re


class Crawler:
    # Class-level defaults; the lists are re-created per instance in __init__
    _source = ''
    _depth = 0
    _links = []
    _wp_entries = []
    _debug = False

    def __init__(self, source, depth):
        self._source = source
        self._depth = depth
        self._links = []
        self._wp_entries = []

    def get_childs(self, url, level):
        if level <= self._depth:
            if self._debug: print 'Crawling ', url
            try:
                page = urllib2.urlopen(url)
                # Discard non html files
                page_info = page.info()['content-type']
                if not re.search('text/html', page_info):
                    if self._debug: print 'Found a non-html file ', url, ' :', page_info
                else:
                    for line in page:
                        # Find every link in source code
                        if re.search('a href=', line):
                            # If several links in a line --> Iterate through all
                            links_in_line = re.findall('a href="(.*?)"', line)
                            for link in links_in_line:
                                # For each link found
                                # 0 - Check the link is internal (regex source or a page in same dir)
                                # 1 - Create the new page (only if it's without source in link)
                                # 2 - Check it hasn't been crawled (check self._links)
                                # 3 - Add to the list (self._links)
                                # 4 - Crawl the new one according to the rules (match 'page' in this case)
                                new_link = ''
                                crawl_link = False
                                # re.escape so the dots in the source URL are matched literally
                                if re.match(re.escape(self._source), link):
                                    # Subpage sharing source (source/sthing) -> use that link
                                    if self._debug: print 'Found internal link ', link
                                    new_link = link
                                    crawl_link = True
                                elif re.match('http://', link):
                                    # External link -> do nothing
                                    if self._debug: print 'Found external link ', link
                                elif re.match(r'\w|_|\.', link):
                                    # Internal link -> construct full url
                                    if self._debug: print 'Found internal link', link
                                    new_link = urlparse.urljoin(url, link)
                                    crawl_link = True
                                else:
                                    # Weird link
                                    if self._debug: print 'Found weird link ', link
                                if crawl_link and new_link not in self._links:
                                    # Found a new link, store & crawl it
                                    self._links.append(new_link)
                                    # WP specific actions
                                    # Store only h2 links (meaning entries)
                                    # Just crawl whenever link contains 'page'
                                    if re.search('h2', line):
                                        if self._debug: print 'WP entry ', new_link
                                        self._wp_entries.append(new_link)
                                    level += 1
                                    if re.search('page', new_link): self.get_childs(new_link, level)
                                    level -= 1
            except urllib2.HTTPError as e:
                if self._debug: print 'Found error while crawling ', url, e
            except urllib2.URLError as e:
                if self._debug: print 'Found error while crawling ', url, e
        else:
            if self._debug: print 'Maximum depth level reached, skipping'

    def show_childs(self):
        print len(self._links), ' links were found, listing:'
        for link in self._links:
            print link

    def show_wp_entries(self):
        print len(self._wp_entries), ' WP entries were found, listing:'
        for head in self._wp_entries:
            print head


# Create a new crawler with page and maximum depth level
crawler = Crawler('http://hoyhabloyo.wordpress.com/', 5)
# Get childs for that page (could have used default params in method)
crawler.get_childs('http://hoyhabloyo.wordpress.com/', 0)
# Show me what you found
crawler.show_wp_entries()
[/sourcecode]
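To see how the depth limit and the list of already-visited links interact without touching the network, here is a small sketch of the same recursion over an in-memory "site". It is written for Python 3; the `crawl` function and the `site` dictionary (URL mapped to the links found on that page) are hypothetical stand-ins, not part of the script above.

```python
def crawl(url, site, depth, level=0, seen=None):
    """Depth-limited crawl over an in-memory site (dict: url -> list of links)."""
    if seen is None:
        seen = []
    if level > depth:
        return seen  # maximum depth level reached, skip this page
    for link in site.get(url, []):
        if link in seen:
            continue  # already stored, don't crawl it twice
        seen.append(link)
        # Same rule as the crawler: only follow pagination links
        if 'page' in link:
            crawl(link, site, depth, level + 1, seen)
    return seen


# Hypothetical blog: each page lists one post and a link to the next page
site = {
    '/': ['/post-a', '/page/2'],
    '/page/2': ['/post-b', '/page/3'],
    '/page/3': ['/post-c', '/page/4'],
}

print(crawl('/', site, 0))  # only the front page is read
print(crawl('/', site, 1))  # front page plus one older page
```

Since a page is read whenever its level is less than or equal to the depth, a depth of 5 starting from level 0 reads six pages in total.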
And now, let’s check the output, where I’m asking the crawler to list the 6 pages of entries of this blog:
[sourcecode wraplines="false"]
kets@ExoduS:~/programacion/sources/python$ python wordpres_crawl.py
30 WP entries were found, listing:
http://hoyhabloyo.wordpress.com/2012/01/24/mitos-y-verdades-sobre-las-becas-icex-en-informatica/
http://hoyhabloyo.wordpress.com/2012/01/18/xfce-display-switching-dual-single-monitor/
http://hoyhabloyo.wordpress.com/2012/01/08/ano-nuevo-vida-nueva/
http://hoyhabloyo.wordpress.com/2011/12/31/adios-2011/
http://hoyhabloyo.wordpress.com/2011/12/23/100-000-visitas/
http://hoyhabloyo.wordpress.com/2011/12/19/de-dibujos-animados/
http://hoyhabloyo.wordpress.com/2011/12/01/hablemos-de-los-rumanos-vorbim-despre-romanii/
http://hoyhabloyo.wordpress.com/2011/12/01/reencuentro-de-becarios-ic3x-en-navaluenga/
http://hoyhabloyo.wordpress.com/2011/11/29/sigo-vivo/
http://hoyhabloyo.wordpress.com/2011/10/07/va-de-despedidas/
http://hoyhabloyo.wordpress.com/2011/09/27/los-rincones-de-bucarest-piata-matache/
http://hoyhabloyo.wordpress.com/2011/09/22/los-rincones-de-bucarest-parcul-carol-i/
http://hoyhabloyo.wordpress.com/2011/09/13/los-rincones-de-bucarest-piata-universitatii/
http://hoyhabloyo.wordpress.com/2011/09/12/receta-hummus/
http://hoyhabloyo.wordpress.com/2011/09/08/viaje-por-los-balcanes/
http://hoyhabloyo.wordpress.com/2011/09/01/receta-gazpacho/
http://hoyhabloyo.wordpress.com/2011/08/31/los-rincones-de-bucarest-parcul-herestrau/
http://hoyhabloyo.wordpress.com/2011/08/16/viaje-expres-a-espana/
http://hoyhabloyo.wordpress.com/2011/08/02/ruta-por-la-rumania-profunda-valaquia/
http://hoyhabloyo.wordpress.com/2011/07/17/de-paseo-por-los-carpatos/
http://hoyhabloyo.wordpress.com/2011/07/10/guia-para-vivir-en-bucarest/
http://hoyhabloyo.wordpress.com/2011/06/30/viaje-por-asia/
http://hoyhabloyo.wordpress.com/2011/06/28/receta-pescado-blanco-al-microondas/
http://hoyhabloyo.wordpress.com/2011/05/20/la-revolucion-espanola-spanishrevolution/
http://hoyhabloyo.wordpress.com/2011/05/17/viaje-a-belgrado/
http://hoyhabloyo.wordpress.com/2011/04/29/semana-santa-por-la-rumania-profunda/
http://hoyhabloyo.wordpress.com/2011/04/14/bruselas-y-amsterdam/
http://hoyhabloyo.wordpress.com/2011/04/04/roma/
http://hoyhabloyo.wordpress.com/2011/03/17/las-1000-grullas/
http://hoyhabloyo.wordpress.com/2011/03/02/receta-torretas-de-berenejena-queso-y-tomate-vegetarianas/
[/sourcecode]
It works!! As you can see, about a hundred lines of Python were enough to achieve that, and much more could be done just by modifying some parameters. Hope it helps!