Basic WordPress crawler in Python


I’m developing a new personal website where I will probably include some links to this blog, so I thought it would be a good idea to have an automatic way to extract them from this site. As I was a bit bored last Monday and wanted to refresh my Python skills, I thought, “Let’s write a really simple WP crawler to do that work for me!”

The following Python crawler is a really simple script. I know it could have been coded with a nice menu of options, methods with default params and all that, but I just wanted to code it quickly. Every tricky part is commented, but I will explain it in a few lines.

  1. WordPress blogs show only the latest entries, and at the bottom there’s a link that takes you to older entries (pointing to something like …/page/number)

  2. The crawler starts by looking for every link in the web page given as the source

  3. For every link found, if it sits on a line containing an ‘h2’ tag, the crawler considers it a blog entry and stores it in the list _wp_entries

  4. If it’s an internal link, the crawler follows it only if the URL contains the word ‘page’

  5. There’s a depth limit, so it doesn’t have to crawl through every page if you don’t want it to

  6. When it has finished crawling, it shows every entry found
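As a rough illustration of steps 2 to 4, here is a minimal sketch that classifies the links found in a single line of HTML. It's written in Python 3 syntax, while the script in this post uses Python 2 modules. The helper `classify_links`, the tuple it returns and the example.com sample lines are all made up for this example. It also simplifies two checks: it uses `startswith` rather than a regex match against the source, and it treats `https://` links as external too.

```python
import re
from urllib.parse import urljoin  # Python 3 home of Python 2's urlparse.urljoin

LINK_RE = re.compile(r'a href="(.*?)"')


def classify_links(line, source, current_url):
    """Return (full_url, is_entry, follow) for each internal link in one HTML line.

    Hypothetical helper for illustration only; it is not part of the crawler.
    """
    results = []
    for link in LINK_RE.findall(line):
        if link.startswith(source):
            full_url = link                        # absolute link on the same site
        elif re.match(r'https?://', link):
            continue                               # external link: ignored
        elif re.match(r'[\w.]', link):
            full_url = urljoin(current_url, link)  # relative link: build the full URL
        else:
            continue                               # anything else (e.g. '#top'): ignored
        is_entry = 'h2' in line       # step 3: links on an h2 line count as entries
        follow = 'page' in full_url   # step 4: only '/page/N' style links are crawled
        results.append((full_url, is_entry, follow))
    return results


# Made-up sample lines, one tag per line because the h2 check is line-based.
SOURCE = 'http://example.com/'
SAMPLE_LINES = [
    '<h2><a href="http://example.com/2012/01/24/some-post/">Some post</a></h2>',
    '<a href="page/2/">Older entries</a>',
    '<a href="http://other.example.org/">Elsewhere</a>',
]

for sample in SAMPLE_LINES:
    print(classify_links(sample, SOURCE, SOURCE))
```

Because the ‘h2’ test looks at the whole line, any other link that happens to share a line with an h2 heading would be counted as an entry as well. That's good enough for a quick script, but worth keeping in mind.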

Here is the code:

[sourcecode wraplines="false" language="python"]
import urllib2
import urlparse
import re

class Crawler:
    _source = ''
    _depth = 0
    _links = []
    _wp_entries = []
    _debug = False

    def __init__(self, source, depth):
        self._source = source
        self._depth = depth
        self._links = []
        self._wp_entries = []

    def get_childs(self, url, level):
        if (level <= self._depth):
            if self._debug : print 'Crawling ', url
            try:
                page = urllib2.urlopen(url)

                # Discard non html files
                page_info = page.info()['content-type']
                if not(re.search('text\/html', page_info)):
                    if self._debug: print 'Found a non-html file ', url, ' :', page.info()['content-type']

                else:
                    for line in page:
                        # Find every link in source code
                        if ((re.search('a href=', line) != None)):
                            # If several links in a line -->  Iterate through all
                            links_in_line = re.findall('a href="(.*?)"', line)
                            for link in links_in_line:
                                # For each link found
                                #   0 - Check the link is internal (regex source or a page in same dir)
                                #   1 - Create the new page (only if it's without source in link)
                                #   2 - Check it hasn't been crawled yet (check self._links)
                                #   3 - Add to the list (self._links)
                                #   4 - Crawl the new one according to the rules (match 'page' in this case)
                                new_link = ''
                                crawl_link = False
                                if (re.match(self._source, link)):
                                    # Subpage sharing source (source/sthing) -> use that link
                                    if self._debug: print 'Found internal link ', link
                                    new_link = link
                                    crawl_link = True

                                elif (re.match('http:\/\/', link)):
                                    # External link -> do nothing
                                    if self._debug: print 'Found external link ', link

                                elif (re.match('\w|\_|\.', link)):
                                    # Internal link -> construct full url
                                    if self._debug: print 'Found internal link', link
                                    new_link = urlparse.urljoin(url, link)
                                    crawl_link = True

                                else:
                                    # Weird link
                                    if self._debug: print  'Found weird link ', link

                                if crawl_link and new_link not in self._links:
                                    # Found a new link, store & crawl it
                                    self._links.append(new_link)

                                    # WP specific actions
                                    # Store only h2 links (meaning entries)
                                    # Just crawl whenever link contains 'page'
                                    if(re.search('h2', line)):
                                        if self._debug: print 'WP entry ', new_link
                                        self._wp_entries.append(new_link)

                                    level += 1
                                    if(re.search('page', new_link)): self.get_childs(new_link, level)
                                    level -= 1

            except urllib2.HTTPError as e:
                if self._debug: print 'Found error while crawling ', url, e

            except urllib2.URLError as e:
                if self._debug: print 'Found error while crawling ', url, e
        else:
            if self._debug: print 'Maximum depth level reached, skipping'

    def show_childs(self):
        print len(self._links),' links were found, listing:'
        for link in self._links:
            print link

    def show_wp_entries(self):
        print len(self._wp_entries),' WP entries were found, listing:'
        for head in self._wp_entries:
            print head

# Create a new crawler with page and maximum depth level
crawler = Crawler('http://hoyhabloyo.wordpress.com/', 5)

# Get childs for that page (could have used default params in method)
crawler.get_childs('http://hoyhabloyo.wordpress.com/', 0)

# Show me what you found
crawler.show_wp_entries()
[/sourcecode]
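To play with the depth limit without touching the network, here is a minimal Python 3 sketch that walks a made-up, in-memory chain of pages. `FAKE_SITE`, `crawl`, `crawl_site` and the example.com URLs are invented for this illustration. Only the level check, the list of already seen links and the ‘page’ rule mirror what `get_childs` does.

```python
# Hypothetical stand-in for the blog: each page maps to the links it contains.
FAKE_SITE = {
    'http://example.com/': ['http://example.com/post-a/', 'http://example.com/page/2/'],
    'http://example.com/page/2/': ['http://example.com/post-b/', 'http://example.com/page/3/'],
    'http://example.com/page/3/': ['http://example.com/post-c/', 'http://example.com/page/4/'],
    'http://example.com/page/4/': ['http://example.com/post-d/'],
}


def crawl(url, level, depth, seen):
    """Record the links on url, following 'page' links while level <= depth."""
    if level > depth:
        return  # same cut-off as the 'level <= self._depth' check
    for link in FAKE_SITE.get(url, []):
        if link in seen:
            continue  # already stored, don't store or crawl it again
        seen.append(link)
        if 'page' in link:
            crawl(link, level + 1, depth, seen)


def crawl_site(start, depth):
    """Convenience wrapper: start at level 0 with an empty list of seen links."""
    seen = []
    crawl(start, 0, depth, seen)
    return seen


print(crawl_site('http://example.com/', 1))
print(len(crawl_site('http://example.com/', 5)))
```

Since the first page is read at level 0, a depth of N means up to N + 1 pages get read. With the depth of 5 used in the script, that is the start page plus five older pages.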

And now, let’s check the output, where I’m asking the crawler to list the entries from six pages of this blog:

[sourcecode wraplines="false"]
kets@ExoduS:~/programacion/sources/python$ python wordpres_crawl.py
30 WP entries were found, listing:
http://hoyhabloyo.wordpress.com/2012/01/24/mitos-y-verdades-sobre-las-becas-icex-en-informatica/
http://hoyhabloyo.wordpress.com/2012/01/18/xfce-display-switching-dual-single-monitor/
http://hoyhabloyo.wordpress.com/2012/01/08/ano-nuevo-vida-nueva/
http://hoyhabloyo.wordpress.com/2011/12/31/adios-2011/
http://hoyhabloyo.wordpress.com/2011/12/23/100-000-visitas/
http://hoyhabloyo.wordpress.com/2011/12/19/de-dibujos-animados/
http://hoyhabloyo.wordpress.com/2011/12/01/hablemos-de-los-rumanos-vorbim-despre-romanii/
http://hoyhabloyo.wordpress.com/2011/12/01/reencuentro-de-becarios-ic3x-en-navaluenga/
http://hoyhabloyo.wordpress.com/2011/11/29/sigo-vivo/
http://hoyhabloyo.wordpress.com/2011/10/07/va-de-despedidas/
http://hoyhabloyo.wordpress.com/2011/09/27/los-rincones-de-bucarest-piata-matache/
http://hoyhabloyo.wordpress.com/2011/09/22/los-rincones-de-bucarest-parcul-carol-i/
http://hoyhabloyo.wordpress.com/2011/09/13/los-rincones-de-bucarest-piata-universitatii/
http://hoyhabloyo.wordpress.com/2011/09/12/receta-hummus/
http://hoyhabloyo.wordpress.com/2011/09/08/viaje-por-los-balcanes/
http://hoyhabloyo.wordpress.com/2011/09/01/receta-gazpacho/
http://hoyhabloyo.wordpress.com/2011/08/31/los-rincones-de-bucarest-parcul-herestrau/
http://hoyhabloyo.wordpress.com/2011/08/16/viaje-expres-a-espana/
http://hoyhabloyo.wordpress.com/2011/08/02/ruta-por-la-rumania-profunda-valaquia/
http://hoyhabloyo.wordpress.com/2011/07/17/de-paseo-por-los-carpatos/
http://hoyhabloyo.wordpress.com/2011/07/10/guia-para-vivir-en-bucarest/
http://hoyhabloyo.wordpress.com/2011/06/30/viaje-por-asia/
http://hoyhabloyo.wordpress.com/2011/06/28/receta-pescado-blanco-al-microondas/
http://hoyhabloyo.wordpress.com/2011/05/20/la-revolucion-espanola-spanishrevolution/
http://hoyhabloyo.wordpress.com/2011/05/17/viaje-a-belgrado/
http://hoyhabloyo.wordpress.com/2011/04/29/semana-santa-por-la-rumania-profunda/
http://hoyhabloyo.wordpress.com/2011/04/14/bruselas-y-amsterdam/
http://hoyhabloyo.wordpress.com/2011/04/04/roma/
http://hoyhabloyo.wordpress.com/2011/03/17/las-1000-grullas/
http://hoyhabloyo.wordpress.com/2011/03/02/receta-torretas-de-berenejena-queso-y-tomate-vegetarianas/
[/sourcecode]

It works! As you can see, only about a hundred lines of Python were needed to achieve that, and much more could be done just by modifying some parameters. Hope it helps!