Basic WordPress crawler in Python

I’m developing a new personal website where I will probably include some links to the posts on this blog, so I thought it would be a good idea to have an automatic way to extract them from this site. As I was a bit bored last Monday and wanted to brush up on my Python skills, I thought: “Let’s write a really simple WP crawler to do that work for me!”

The following Python crawler is a really simple script. I know it could have been written with a nice menu of options, methods with default parameters and all that, but I just wanted to code it quickly. Every tricky part is commented, but I will explain it in a few lines.

WordPress blogs show only the latest entries, and at the bottom there is a link to older entries (pointing to something like …/page/number).
The crawler starts by looking for every link in the web page given as the source.
For every link found, if the line it appears on contains ‘h2’, the crawler treats it as a blog entry and stores it in the list wp_entries.
If it is an internal link, the crawler follows it as long as it contains the word ‘page’.
There is a depth limit, so it doesn’t have to crawl through every page if you don’t want it to.
When it has finished crawling, it shows every entry found.
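
The per-line checks in the steps above can be tried out without touching the network. What follows is a minimal sketch, not the script itself: it is written for Python 3, the sample HTML lines are made up, and `scan_line` is a hypothetical helper name. It uses the same href pattern as the crawler, treats a link as an entry when its line contains 'h2', and flags it for crawling when the URL contains 'page'.

```python
import re

# Same pattern the crawler uses to pull every href out of a line of HTML.
HREF_PATTERN = re.compile(r'a href="(.*?)"')


def scan_line(line):
    """Return (link, is_entry, should_crawl) for every link in one HTML line.

    is_entry: the line mentions 'h2', so the link is taken as a post title.
    should_crawl: the URL contains 'page', so it is taken as pagination.
    """
    is_entry = 'h2' in line
    return [(link, is_entry, 'page' in link)
            for link in HREF_PATTERN.findall(line)]


# Made-up lines imitating a WordPress index page (an assumption, not real markup).
title_line = '<h2><a href="http://example.com/2012/01/08/some-post/">Some post</a></h2>'
pager_line = '<div><a href="http://example.com/page/2/">Older entries</a></div>'

print(scan_line(title_line))
print(scan_line(pager_line))
```

Because the check is a plain substring test on the whole line, it only works when each post title sits on its own line of the page source, which is what this crawler relies on.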

Here is the code:

[sourcecode wraplines="false" language="python"]
import urllib2
import urlparse
import re


class Crawler:
    _source = ''
    _depth = 0
    _links = []
    _wp_entries = []
    _debug = False

    def __init__(self, source, depth):
        self._source = source
        self._depth = depth
        self._links = []
        self._wp_entries = []

    def get_childs(self, url, level):
        if level <= self._depth:
            if self._debug: print 'Crawling ', url
            try:
                page = urllib2.urlopen(url)

                # Discard non-HTML files
                page_info = page.info()['content-type']
                if not re.search('text/html', page_info):
                    if self._debug: print 'Found a non-html file ', url, ' :', page_info

                else:
                    for line in page:
                        # Find every link in the source code
                        if re.search('a href=', line) is not None:
                            # If there are several links in a line, iterate through all of them
                            links_in_line = re.findall('a href="(.*?)"', line)
                            for link in links_in_line:
                                # For each link found:
                                #   0 - Check the link is internal (matches source or is a page in the same dir)
                                #   1 - Build the full URL (only if the link does not include the source)
                                #   2 - Check it has not been crawled yet (check self._links)
                                #   3 - Add it to the list (self._links)
                                #   4 - Crawl the new one according to the rules (match 'page' in this case)
                                new_link = ''
                                crawl_link = False
                                if re.match(re.escape(self._source), link):
                                    # Subpage sharing source (source/sthing) -> use that link
                                    if self._debug: print 'Found internal link ', link
                                    new_link = link
                                    crawl_link = True

                                elif re.match('http://', link):
                                    # External link -> do nothing
                                    if self._debug: print 'Found external link ', link

                                elif re.match(r'[\w.]', link):
                                    # Internal relative link -> construct the full URL
                                    if self._debug: print 'Found internal link ', link
                                    new_link = urlparse.urljoin(url, link)
                                    crawl_link = True

                                else:
                                    # Weird link
                                    if self._debug: print 'Found weird link ', link

                                if crawl_link and new_link not in self._links:
                                    # Found a new link, store & crawl it
                                    self._links.append(new_link)

                                    # WP specific actions
                                    # Store only h2 links (meaning entries)
                                    # Just crawl whenever the link contains 'page'
                                    if re.search('h2', line):
                                        if self._debug: print 'WP entry ', new_link
                                        self._wp_entries.append(new_link)

                                    if re.search('page', new_link):
                                        self.get_childs(new_link, level + 1)

            except urllib2.HTTPError as e:
                if self._debug: print 'Found error while crawling ', url, e

            except urllib2.URLError as e:
                if self._debug: print 'Found error while crawling ', url, e
        else:
            if self._debug: print 'Maximum depth level reached, skipping'

    def show_childs(self):
        print len(self._links), ' links were found, listing:'
        for link in self._links:
            print link

    def show_wp_entries(self):
        print len(self._wp_entries), ' WP entries were found, listing:'
        for head in self._wp_entries:
            print head

# Create a new crawler with the start page and maximum depth level
crawler = Crawler('http://hoyhabloyo.wordpress.com/', 5)

# Get the children of that page (could have used default params in the method)
crawler.get_childs('http://hoyhabloyo.wordpress.com/', 0)

# Show me what you found
crawler.show_wp_entries()
[/sourcecode]
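
The three link branches in `get_childs` (absolute internal, external, relative internal) can be looked at in isolation. The following is a sketch under assumptions rather than part of the script: it targets Python 3, where `urljoin` lives in `urllib.parse` instead of `urlparse`, `classify_link` is a hypothetical name, the example URLs are made up, and the prefix tests use `str.startswith` as a swapped-in simplification.

```python
import re
from urllib.parse import urljoin


def classify_link(source, page_url, link):
    """Mirror the crawler's branches; return (kind, absolute_url_or_None)."""
    if link.startswith(source):
        # Absolute link below the source URL: use it as is.
        return 'internal', link
    if link.startswith('http://'):
        # Absolute link to some other site: the crawler ignores it.
        return 'external', None
    if re.match(r'[\w.]', link):
        # Relative link: resolve it against the page it was found on.
        return 'internal', urljoin(page_url, link)
    # Anything else, e.g. '#top' or a link starting with '/'.
    return 'weird', None


source = 'http://example.com/'          # made-up blog root
page = 'http://example.com/page/2/'     # made-up page being crawled

for link in ('http://example.com/2012/01/08/post/', 'http://other.org/',
             'about.html', '../3/', '#top'):
    print(link, '->', classify_link(source, page, link))
```

Note that, as in the original branches, anything starting with a word character falls into the relative case, so links such as `https://…` or `mailto:…` are not recognised as external by this logic.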

And now, let’s check the output, where I’m asking the crawler to list this blog’s 6 pages of entries:

[sourcecode wraplines="false"]
kets@ExoduS:~/programacion/sources/python$ python wordpres_crawl.py
30 WP entries were found, listing:
http://hoyhabloyo.wordpress.com/2012/01/24/mitos-y-verdades-sobre-las-becas-icex-en-informatica/
http://hoyhabloyo.wordpress.com/2012/01/18/xfce-display-switching-dual-single-monitor/
http://hoyhabloyo.wordpress.com/2012/01/08/ano-nuevo-vida-nueva/
http://hoyhabloyo.wordpress.com/2011/12/31/adios-2011/
http://hoyhabloyo.wordpress.com/2011/12/23/100-000-visitas/
http://hoyhabloyo.wordpress.com/2011/12/19/de-dibujos-animados/
http://hoyhabloyo.wordpress.com/2011/12/01/hablemos-de-los-rumanos-vorbim-despre-romanii/
http://hoyhabloyo.wordpress.com/2011/12/01/reencuentro-de-becarios-ic3x-en-navaluenga/
http://hoyhabloyo.wordpress.com/2011/11/29/sigo-vivo/
http://hoyhabloyo.wordpress.com/2011/10/07/va-de-despedidas/
http://hoyhabloyo.wordpress.com/2011/09/27/los-rincones-de-bucarest-piata-matache/
http://hoyhabloyo.wordpress.com/2011/09/22/los-rincones-de-bucarest-parcul-carol-i/
http://hoyhabloyo.wordpress.com/2011/09/13/los-rincones-de-bucarest-piata-universitatii/
http://hoyhabloyo.wordpress.com/2011/09/12/receta-hummus/
http://hoyhabloyo.wordpress.com/2011/09/08/viaje-por-los-balcanes/
http://hoyhabloyo.wordpress.com/2011/09/01/receta-gazpacho/
http://hoyhabloyo.wordpress.com/2011/08/31/los-rincones-de-bucarest-parcul-herestrau/
http://hoyhabloyo.wordpress.com/2011/08/16/viaje-expres-a-espana/
http://hoyhabloyo.wordpress.com/2011/08/02/ruta-por-la-rumania-profunda-valaquia/
http://hoyhabloyo.wordpress.com/2011/07/17/de-paseo-por-los-carpatos/
http://hoyhabloyo.wordpress.com/2011/07/10/guia-para-vivir-en-bucarest/
http://hoyhabloyo.wordpress.com/2011/06/30/viaje-por-asia/
http://hoyhabloyo.wordpress.com/2011/06/28/receta-pescado-blanco-al-microondas/
http://hoyhabloyo.wordpress.com/2011/05/20/la-revolucion-espanola-spanishrevolution/
http://hoyhabloyo.wordpress.com/2011/05/17/viaje-a-belgrado/
http://hoyhabloyo.wordpress.com/2011/04/29/semana-santa-por-la-rumania-profunda/
http://hoyhabloyo.wordpress.com/2011/04/14/bruselas-y-amsterdam/
http://hoyhabloyo.wordpress.com/2011/04/04/roma/
http://hoyhabloyo.wordpress.com/2011/03/17/las-1000-grullas/
http://hoyhabloyo.wordpress.com/2011/03/02/receta-torretas-de-berenejena-queso-y-tomate-vegetarianas/
[/sourcecode]

It works!! As you can see, it took only about a hundred lines of Python to achieve that, and much more could be done just by modifying some parameters. Hope it helps!
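
As one example of such a parameter, the depth limit decides how many “older entries” pages get visited. Here is a minimal sketch of that idea, assuming Python 3 and an in-memory dictionary standing in for the blog; the paths, the `PAGES` map and the `crawl` function are all made up for illustration. `level` defaults to 0, so the caller only passes the start page and the maximum depth.

```python
def crawl(pages, url, depth, level=0, seen=None):
    """Depth-limited walk over an in-memory link map.

    pages maps a URL to the list of links found on it. Every new link is
    recorded, but only links containing 'page' are followed, and only while
    level does not exceed depth.
    """
    if seen is None:
        seen = []
    if level > depth:
        return seen
    for link in pages.get(url, []):
        if link not in seen:
            seen.append(link)
            if 'page' in link:
                crawl(pages, link, depth, level + 1, seen)
    return seen


# Hypothetical site: each index page lists one post and the next older page.
PAGES = {
    '/': ['/post-a', '/page/2'],
    '/page/2': ['/post-b', '/page/3'],
    '/page/3': ['/post-c'],
}

for depth in (0, 1, 2):
    print(depth, crawl(PAGES, '/', depth))
```

With depth 0 only the links on the start page are collected; each extra level reaches one more pagination page.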

Jaime
