gothwalk: (Default)
([personal profile] gothwalk Oct. 20th, 2008 03:49 pm)
So, one of my colleagues is starting out on a large new project here. One thing he has to do is collate information about the client's website, which contains somewhere between 200 and 1600 pages. Is there any piece of software out there that will spider through a site and, at the end, provide a listing - ideally a CSV or something - containing all the page URLs and titles?

He's not overly technical, so nothing TOO involved, if it can be helped.
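
For the curious, this is roughly what I'd hack together myself if I had the time - a quick Python sketch, just to show the sort of listing I'm after. The start URL, filenames and page limit are placeholders, and it's not been tested against anything real:

#!/usr/bin/env python
# Crawl one site breadth-first and write url,title rows to a CSV.
# START_URL, OUTPUT_CSV and MAX_PAGES are placeholders - adjust as needed.

import csv
import urllib.request
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

START_URL = "http://www.example.com/"   # stand-in for the client's site
OUTPUT_CSV = "site_pages.csv"
MAX_PAGES = 2000                        # the site is somewhere between 200 and 1600 pages

class PageParser(HTMLParser):
    """Collects the <title> text and every href on a page."""
    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data

def crawl(start_url):
    host = urlparse(start_url).netloc
    queue, seen, rows = deque([start_url]), {start_url}, []
    while queue and len(rows) < MAX_PAGES:
        url = queue.popleft()
        try:
            with urllib.request.urlopen(url) as response:
                if "text/html" not in response.headers.get("Content-Type", ""):
                    continue
                html = response.read().decode("utf-8", errors="replace")
        except Exception:
            continue  # skip anything that won't fetch cleanly
        parser = PageParser()
        parser.feed(html)
        rows.append((url, parser.title.strip()))
        for link in parser.links:
            absolute = urljoin(url, link).split("#")[0]  # drop fragments
            if urlparse(absolute).netloc == host and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return rows

if __name__ == "__main__":
    with open(OUTPUT_CSV, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["url", "title"])
        writer.writerows(crawl(START_URL))

But really, he needs something he can just point at the site and run.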

From: [identity profile] puppytown.livejournal.com


I use Xenu Link Sleuth - it's a link checker, but it also lists all the URLs on a site. Free, and anti-Scientologist!

From: [identity profile] sciamachy.livejournal.com


I believe there are ways to do this with Unix's wget (wget -m, maybe?) to mirror the whole site to disk.
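
Once wget has pulled the site down, a few lines of Python could walk the mirrored files and spit out a url,title CSV. A rough sketch only - the mirror directory and site root below are placeholders for whatever wget actually creates:

#!/usr/bin/env python
# Walk a wget -m mirror directory and write url,title pairs to a CSV.
# MIRROR_DIR and SITE_ROOT are placeholders - adjust to the real site.

import csv
import os
import re

MIRROR_DIR = "www.example.com"          # directory created by wget -m
SITE_ROOT = "http://www.example.com"    # prefix used to rebuild the original URLs
OUTPUT_CSV = "site_pages.csv"

TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)

with open(OUTPUT_CSV, "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["url", "title"])
    for dirpath, _dirnames, filenames in os.walk(MIRROR_DIR):
        for filename in filenames:
            if not filename.endswith((".html", ".htm")):
                continue
            path = os.path.join(dirpath, filename)
            with open(path, encoding="utf-8", errors="replace") as f:
                match = TITLE_RE.search(f.read())
            title = match.group(1).strip() if match else ""
            # Rebuild the URL from the path relative to the mirror root.
            relative = os.path.relpath(path, MIRROR_DIR).replace(os.sep, "/")
            writer.writerow([SITE_ROOT + "/" + relative, title])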

There's a Java web-crawler explained here:
http://java.sun.com/developer/technicalArticles/ThirdParty/WebCrawler/

You'd just need to download the source code & compile it, then jar it up for him. I'd do it myself but I can't email you compiled code from my work.

Actually, the W3C have a link checker as part of their validation site.

From: [identity profile] sbisson.livejournal.com


There are lots of web site mapping tools out there...

From: [identity profile] mollydot.livejournal.com


There's a search engine called swish-e. I think it outputs the paths when spidering, or if not by default, it can probably be told to.