Roll your own spider bot

Who wants to wait for a hosted service to crawl their site? This is the 21st century, for (insert favorite expletive)’s sake! I want results yesterday, not a week from now. So how does Google do it? How do they get all twenty fingers and toes in all the digital pies ever baked in inter-blogo-sphere-ville? I know! They must have warehouses full of monkeys trained to click on all the links they see, and then they… type the results into Google Docs… and… wait. Aren’t computers good at doing repetitive tasks in bulk really quickly?

All joking aside, armed with a few under-appreciated command line tools and, God willing, a little patience with regular expressions, you can build a simple but powerful script that will crawl any site and return useful information. This type of tool is especially great if you are working on SEO, because you can ask it to give you page titles, or meta tags, or any other piece of content you are interested in. It is definitely not a replacement for a full-featured SEO service and Google’s webmaster tools, but it’s another trick in your bag of tricks. It’s also a good idea to constantly improve your command-line ninja skills, and I don’t know about you, but I need a functional project to work on if I’m going to tackle something as annoying as reg-ex in bash.

The Goods:

Here is the source code for those of you who prefer reading the cryptic nonsense that is bash over English.

#!/bin/bash
#
# Crawls a domain
# Retrieves all visible URLs and their page titles
# Saves to CSV

# Text color variables
txtund=$(tput sgr 0 1)            # Underline
txtbld=$(tput bold)               # Bold
bldred=${txtbld}$(tput setaf 1)   # red
bldblu=${txtbld}$(tput setaf 4)   # blue
bldgreen=${txtbld}$(tput setaf 2) # green
bldwht=${txtbld}$(tput setaf 7)   # white
txtrst=$(tput sgr0)               # Reset
info=${bldwht}*${txtrst}          # Feedback
pass=${bldblu}*${txtrst}
warn=${bldred}*${txtrst}
ques=${bldblu}?${txtrst}

printf "%s=== Crawling $1 ===\n%s" "$bldgreen" "$txtrst"

# wget in spider mode, output captured in the wglog file
# -R switch to ignore specific file types (images, javascript etc.)
wget --spider -r -l inf -w .25 -nc -nd $1 -R bmp,css,gif,ico,jpg,jpeg,js,mp3,mp4,pdf,png,swf,txt,xml,xls,zip 2>&1 | tee wglog

printf "\n%s==========================================\n" "$bldgreen"
printf "%s=== Crawl Finished... ===%s\n" "$bldgreen" "$txtrst"
printf "%s=== Begin retrieving page titles... ===%s\n" "$bldgreen" "$txtrst"
printf "%s==========================================\n\n" "$bldgreen"

printf "%s** Run tail -f $1.csv for progress%s\n\n" "$bldred" "$txtrst"

# From wglog, grab the URLs,
# then curl each URL and grep the title
cat wglog | grep '^--' | awk '{print $3}' | sort | uniq | while read url; do {
    printf "%s* Retrieving title for: %s$url%s\n" "$bldgreen" "$txtrst$txtbld" "$txtrst"
    printf "\"${url}\",\"`curl -# ${url} | sed -n -e 's!.*<title>\(.*\)</title>.*!\1!p'`\"\n" >> $1.csv
    printf "\n"
}; done

# Clean up the log file
rm wglog
exit
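For reference, a run looks something like this (the file name crawl.sh is just an example; the script only cares about its first argument, which becomes $1):

# Save the script (crawl.sh is an example name) and make it executable
chmod +x crawl.sh

# Crawl a site; the first argument becomes $1, so the results land in example.com.csv
./crawl.sh example.com

# In another terminal, watch the CSV fill in as titles are retrieved
tail -f example.com.csv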

The real meat of this script happens in two separate sections. The first uses wget to recursively crawl a specified domain name and save the output to a file, wglog.

wget --spider -r -l inf -w .25 -nc -nd $1 -R bmp,css,gif,ico,jpg,jpeg,js,mp3,mp4,pdf,png,swf,txt,xml,xls,zip 2>&1 | tee wglog  

Check out the manual for wget to see what each option does, but the gist is: look for HTML while ignoring images and other useless content, follow every link to its end without leaving the domain, and make at most four requests per second to give the server some breathing room. The rate limit isn’t strictly necessary, but I think it’s polite. The end of this command pipes the output of wget to two places: your screen, so you can watch the progress, and a temporary file called wglog.
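If you’d rather not dig through the man page right now, here is an annotated restatement of the options used above; nothing new, just the same command spelled out:

# --spider            only check that pages exist; don't actually save them
# -r -l inf           recurse through links with no depth limit (wget stays on the starting host by default)
# -w .25              wait a quarter of a second between requests (at most ~4 per second)
# -nc -nd             no clobbering and no local directory tree for the (empty) downloads
# -R bmp,css,...      reject list: skip images, stylesheets, scripts and other non-HTML file types
# 2>&1 | tee wglog    send wget's log to the screen and to the wglog file at the same time

The next section of the script uses some standard Unix tools to slice and dice the results into a simple list of URLs.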

cat wglog | grep '^--' | awk '{print $3}' | sort | uniq | while read url; do {  

From left to right: grep the lines of wget’s output that contain URLs, use awk to pull the URL field out of each line, sort the results (the URLs) alphabetically, remove the duplicates that the greedy grep lets through, and then loop over whatever is left. There is probably a more efficient way of doing this (one is sketched below), but hey, it’s fun to use as many tools and pipes as possible.
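For what it’s worth, here is a slightly tighter version of the same pipeline; this is just an alternative sketch, not the code the script actually uses: grep can read wglog directly, and sort -u folds sort and uniq into one step.

# Same idea with two fewer processes: grep reads the file itself, sort -u de-duplicates
grep '^--' wglog | awk '{print $3}' | sort -u | while read -r url; do
    printf '%s\n' "$url"   # placeholder body; the real script curls each URL here
done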

Inside the while loop, we have access to each URL that wget found. cURL fetches that URL, and sed pulls the interesting data out of the response. This is where regex, unfortunately, comes in handy.

...
printf "%s* Retrieving title for: %s$url%s\n" "$bldgreen" "$txtrst$txtbld" "$txtrst"
printf "\"${url}\",\"`curl -# ${url} | sed -n -e 's!.*<title>\(.*\)</title>.*!\1!p'`\"\n" >> $1.csv
printf "\n"
}; done

This variant of the sed command fetches the page title, which gets paired with the URL and saved to a CSV file. By changing the pattern in the sed command, you can grab virtually any piece of content you’re interested in: meta tags, headers, whatever. The script could easily be modified to include selectable default expressions for common page content. For example:

sed -n -e 's!.*<meta name="description" content=\(.*\).*!\1!p'

…could be used to save the meta description content instead. This isn’t perfect: if the description contains quotes or commas, it adds extra columns to the resulting CSV, but there is a workaround (sketched below).
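One possible workaround, a sketch rather than part of the original script, is to capture the description into a variable first and let printf build a properly quoted CSV row. The pattern below also tightens the capture so it stops at the closing quote; it assumes it runs inside the same while loop, so $url and $1 are available.

# Capture the description (stopping at the closing quote), then write a quoted CSV row
desc=`curl -# ${url} | sed -n -e 's!.*<meta name="description" content="\([^"]*\)".*!\1!p'`
# Double any embedded double quotes ("" is the CSV escape); commas then stay inside one field
printf '"%s","%s"\n' "$url" "${desc//\"/\"\"}" >> $1.csv

Happy crawling!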