Roll your own spider bot

by Praxent on November 3, 2012

Who wants to wait for a hosted service to crawl their site? This is the 21st century for (insert favorite expletive)’s sake! I want results yesterday, not a week from now. So how does Google do it? How do they get all twenty fingers and toes in all the digital pies ever baked in inter-blogo-sphere-ville? I know! They must have warehouses full of monkeys trained to click on all the links they see, and then they… type the results into Google Docs… and… wait. Aren’t computers good at doing repetitive tasks in bulk really quickly?

All joking aside, armed with a few under-appreciated command line tools and, God willing, a little patience with regular expressions, you can build a simple but powerful script that will crawl any site and return useful information. This kind of tool is especially handy if you are working on SEO, because you can ask it for page titles, meta tags, or any other piece of content you are interested in. It is definitely not a replacement for a full-featured SEO service or Google’s Webmaster Tools, but it’s another trick for your bag. It’s also a good idea to keep improving your command-line ninja skills, and I don’t know about you, but I need a functional project to work on if I’m going to tackle something as annoying as regex in bash.

The Goods:

Here is the source code for those of you who prefer reading the cryptic nonsense that is bash over English.

#!/bin/bash
#
# Crawls a domain
# Retrieves all visible URLs and their page titles
# Saves to CSV

# Text color variables
txtund=$(tput sgr 0 1)            # Underline
txtbld=$(tput bold)               # Bold
bldred=${txtbld}$(tput setaf 1)   # red
bldblu=${txtbld}$(tput setaf 4)   # blue
bldgreen=${txtbld}$(tput setaf 2) # green
bldwht=${txtbld}$(tput setaf 7)   # white
txtrst=$(tput sgr0)               # Reset
info=${bldwht}*${txtrst}          # Feedback
pass=${bldblu}*${txtrst}
warn=${bldred}*${txtrst}
ques=${bldblu}?${txtrst}

printf "%s=== Crawling $1 ===%s\n" "$bldgreen" "$txtrst"

# wget in spider mode, outputs to wglog file
# -R switch to ignore specific file types (images, javascript etc.)
wget --spider -r -l inf -w .25 -nc -nd $1 -R bmp,css,gif,ico,jpg,jpeg,js,mp3,mp4,pdf,png,swf,txt,xml,xls,zip 2>&1 | tee wglog

printf "\n%s==========================================\n" "$bldgreen"
printf "%s=== Crawl Finished... ===%s\n" "$bldgreen" "$txtrst"
printf "%s=== Begin retrieving page titles... ===%s\n" "$bldgreen" "$txtrst"
printf "%s==========================================\n\n" "$bldgreen"

printf "%s** Run tail -f $1.csv for progress%s\n\n" "$bldred" "$txtrst"

# from wglog, grab URLs
# curl each URL and grep the title
cat wglog | grep '^--' | awk '{print $3}' | sort | uniq | while read url; do {
    printf "%s* Retrieving title for: %s$url%s\n" "$bldgreen" "$txtrst$txtbld" "$txtrst"
    printf "\"${url}\",\"`curl -# ${url} | sed -n -e 's!.*<title>\(.*\)</title>.*!\1!p'`\"\n" >> $1.csv
    printf "\n"
}; done

# clean up log file
rm wglog
exit

The real meat of this script happens in two separate sections. The first uses wget to recursively crawl a specified domain name and save the output to a file, wglog.

wget --spider -r -l inf -w .25 -nc -nd $1 -R bmp,css,gif,ico,jpg,jpeg,js,mp3,mp4,pdf,png,swf,txt,xml,xls,zip 2>&1 | tee wglog  
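
If you would rather not dig through the man page right away, here is the same invocation sketched with a bash array so each flag can carry a comment. The array (and the opts name) is my own restatement, not part of the original script; the behavior should be identical.

# The crawl command again, one option per array element so each can be annotated
opts=(
    --spider          # check that pages exist without saving their bodies
    -r -l inf         # recurse with no depth limit; wget stays on the starting host unless told otherwise
    -w .25            # wait a quarter of a second between requests
    -nc -nd           # don't re-fetch files, don't build a directory tree on disk
    -R bmp,css,gif,ico,jpg,jpeg,js,mp3,mp4,pdf,png,swf,txt,xml,xls,zip   # skip non-HTML assets
)
wget "${opts[@]}" "$1" 2>&1 | tee wglog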

Check out the manual for wget to see what the options do, but the gist is: look for HTML while ignoring images and other non-page content, follow every link to its end without leaving the domain, and wait a quarter of a second between requests (at most four per second) to give the server some breathing room. The rate limit isn’t strictly necessary, but I think it’s polite. The end of this command pipes the output of wget to two places: your screen, so you can watch the progress, and a temporary file called wglog. The next section of the script uses some standard Unix tools to slice and dice the results into a simple list of URLs.

cat wglog | grep '^--' | awk '{print $3}' | sort | uniq | while read url; do {  

From left to right: grep the request lines in the wget output, use awk to grab the URL substring from each, sort the results (the URLs) alphabetically, remove the duplicates that the greedy grep lets through, and then loop over whatever is left. There is probably a more efficient way to do this, but hey, it’s fun to use as many tools and pipes as possible.
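
For the curious, a roughly equivalent pipeline reads the log with grep directly (no cat) and folds the de-duplication into sort -u; the echo below is just a placeholder for the curl and sed work shown next. This is a sketch of an alternative, not what the original script does.

# Same idea with two fewer processes: grep reads wglog itself and
# sort -u replaces the separate sort | uniq pair
grep '^--' wglog | awk '{print $3}' | sort -u | while read -r url; do
    echo "$url"    # placeholder; the real loop body curls each URL
done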

In the while loop, we have access to each URL that wget found. cURL fetches that URL, and sed pulls the interesting data out of the response. This is where regex, unfortunately, comes in handy.

...
    printf "%s* Retrieving title for: %s$url%s\n" "$bldgreen" "$txtrst$txtbld" "$txtrst"
    printf "\"${url}\",\"`curl -# ${url} | sed -n -e 's!.*<title>\(.*\)</title>.*!\1!p'`\"\n" >> $1.csv
    printf "\n"
}; done

This variant of the sed command fetches the page title, which gets paired with the URL and saved to a CSV file. By changing the pattern in the sed command, you can grab virtually any piece of content you’re interested in: meta tags, headers, whatever. The script could easily be modified to include selectable default expressions for common page content. For example:

sed -n -e 's!.*<meta name="description" content=\(.*\).*!\1!p'

…could be used to save meta description content. This isn’t perfect, because quotes and commas inside the description add extra columns to the resulting CSV, but I’m sure there is a workaround; one possibility is sketched below.
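
One possible workaround, and this is my own sketch rather than part of the original script: follow the usual CSV convention of doubling any embedded double quotes and wrapping every field in quotes, so embedded commas stay inside a single column. The sed pattern here also assumes the description value is wrapped in double quotes in the HTML.

# Hypothetical tweak to the loop body: capture the extracted text first,
# double any embedded quotes ("" is the CSV escape), then emit a fully
# quoted row
desc=$(curl -# "${url}" | sed -n -e 's!.*<meta name="description" content="\(.*\)".*!\1!p')
desc=${desc//\"/\"\"}    # escape embedded double quotes
printf '"%s","%s"\n' "${url}" "${desc}" >> "$1.csv"

Happy crawling!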

1 Comment

  1. Random surfer says

    August 24, 2017 at 3:59 pm

    This is super handy! Thank you for sharing.

    BTW, for the newbies, save this script as, say, "spiderbot.sh". Then "chmod +x spiderbot.sh". Then run the script, passing in the name of the site you want to crawl, for example "./spiderbot.sh http://www.someSite.com".
