
chklnks.rb

Website link crawler and checker written in Ruby
git clone http://ratfactor.com/repos/chklnks.rb/chklnks.rb.git

# chklnks.rb

Website link crawler and checker written in Ruby.

* Around 250 lines of Ruby
* Requires no gems - uses only Ruby core/std libraries
* Recursively crawls "internal" links
* Checks status of "external" links
* Follows (one level of) redirection
* Tracks all page origin(s) of each unique link
* Generates HTML report

Each line of the report is a unique link in your site in queue order:

![output report screenshot](raw/screenshot1.png)

You can click to expand the full details and headers from each link check:

![output detail expanded screenshot](raw/screenshot2.png)

## Usage

`chklnks` writes an HTML report to STDOUT and progress messages to STDERR.
You'll almost certainly want to redirect the report to a file:

    ./chklnks 'http://example.com/' > report.html

While it runs, you'll be able to see queue progress. It may take a while, but
at least it's fun to watch the queue grow and then shrink.

When it's done, open the report in a browser and enjoy.

## Limitations

chklnks assumes that all links with just a path (e.g. `foo/bar/baz.html`) are
"internal" to the site and that all links with a protocol and host
(e.g. `http://example.com/...`) are external. This is true of my website, but
may not be true of yours.

Redirects (HTTP 301 and HTTP 302) are only followed one level deep. Anything
other than an HTTP 200 OK after the first redirect is marked as an "error",
even if the link is actually fine. The rationale is that if you're being
redirected more than once, you should probably update your link, right?

_All_ link checks begin with a HEAD request. "External" links are _only_
checked with a HEAD request. I'm starting to see servers that respond to HEAD
requests with HTTP 405 Method Not Allowed. I personally think those servers
aren't being good citizens.

HTML `<base>` tags are not handled. The base href defines a URI base for
relative links on the page.
I don't use this tag on my websites, so I didn't bother to implement support
for it. I'll happily accept a pull request with the addition.

## How it works

Starting with the first page, chklnks scans the page for anchor tags
(`<a href="...">...</a>`).

Each link is parsed as a normalized Ruby `URI` and merged with the page's URI.

The string form of the normalized URI is used as the key under which a new
`Link` (created as a `Struct`) is stored in a `Hash`, ensuring exactly *one*
copy of each unique link is kept.

Unique Links are put in a `Queue` to guarantee that each link is visited
exactly once.

An HTTP HEAD request is performed on each link, and the resulting status code
and headers are stored with the link. If the result is a redirect, the
redirected URI is tried immediately (with another HEAD request) and that
result is also stored.

If the link is considered "internal" to the site (and appears to be
`text/html` content), it is scanned for more links and the circle of life
continues.

## Contributing

Every website is a unique snowflake, so you may wish to simply fork this repo
as a base for _your own_ link checker. On the other hand, suggestions will be
considered and improvements are very welcome.
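The core of the approach described above can be sketched in plain Ruby using
only the standard library. Note this is a minimal illustration, not the actual
script: the names `Link`, `enqueue`, and `head` are assumptions for the sake
of the example.

```ruby
require "uri"
require "net/http"

# Illustrative sketch only; identifiers here are not the script's own.
# A Link remembers its check result and every page it appeared on.
Link = Struct.new(:uri, :status, :sources)

links = {}        # normalized URI string => Link (one copy per unique link)
queue = Queue.new # each unique Link is queued, and so visited, exactly once

# Merge an href with the URI of the page it appeared on, then store it
# once, tracking all page origins of the link.
def enqueue(href, page_uri, links, queue)
  uri = page_uri.merge(URI(href).normalize)
  key = uri.to_s
  if links.key?(key)
    links[key].sources << page_uri.to_s
  else
    links[key] = Link.new(uri, nil, [page_uri.to_s])
    queue << links[key]
  end
rescue URI::InvalidURIError
  nil # skip hrefs that don't parse
end

# HEAD request, following at most one level of redirection.
def head(uri, redirects_left = 1)
  res = Net::HTTP.start(uri.host, uri.port,
                        use_ssl: uri.scheme == "https") do |http|
    http.head(uri.request_uri)
  end
  if res.is_a?(Net::HTTPRedirection) && redirects_left > 0
    head(uri.merge(res["location"]), redirects_left - 1)
  else
    res
  end
end
```

Keying the `Hash` on the merged, normalized URI string is what makes the
dedup work: two pages linking to the same target produce the same key, so the
link is checked once but both origins are recorded.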