gothwalk: (Default)
([personal profile] gothwalk Oct. 20th, 2008 03:49 pm)
So, one of my colleagues is starting out on a large new project here. One thing he has to do is collate information about the client's website, which contains somewhere between 200 and 1600 pages. Is there any piece of software out there that will spider through a site and, at the end, provide a listing - ideally a CSV or something - containing all the page URLs and titles?

He's not overly technical, so nothing TOO involved, if it can be helped.
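
For the curious, this is roughly what I'd hack together myself if I had the time - a quick Python sketch, just to show the sort of listing I'm after. The start URL, filenames and page limit are placeholders, and it's not been tested against anything real:

#!/usr/bin/env python
# Crawl one site breadth-first and write url,title rows to a CSV.
# START_URL, OUTPUT_CSV and MAX_PAGES are placeholders - adjust as needed.

import csv
import urllib.request
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

START_URL = "http://www.example.com/"   # stand-in for the client's site
OUTPUT_CSV = "site_pages.csv"
MAX_PAGES = 2000                        # the site is somewhere between 200 and 1600 pages

class PageParser(HTMLParser):
    """Collects the <title> text and every href on a page."""
    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data

def crawl(start_url):
    host = urlparse(start_url).netloc
    queue, seen, rows = deque([start_url]), {start_url}, []
    while queue and len(rows) < MAX_PAGES:
        url = queue.popleft()
        try:
            with urllib.request.urlopen(url) as response:
                if "text/html" not in response.headers.get("Content-Type", ""):
                    continue
                html = response.read().decode("utf-8", errors="replace")
        except Exception:
            continue  # skip anything that won't fetch cleanly
        parser = PageParser()
        parser.feed(html)
        rows.append((url, parser.title.strip()))
        for link in parser.links:
            absolute = urljoin(url, link).split("#")[0]  # drop fragments
            if urlparse(absolute).netloc == host and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return rows

if __name__ == "__main__":
    with open(OUTPUT_CSV, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["url", "title"])
        writer.writerows(crawl(START_URL))

But really, he needs something he can just point at the site and run.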

From: [identity profile] puppytown.livejournal.com


I use Xenu Link Sleuth - it's a link checker, but it also lists all the URLs on a site. Free, and anti-Scientologist!

From: [identity profile] sciamachy.livejournal.com


I believe there are ways to do this with Unix's wget (wget -m, maybe?) to mirror the whole site to disk.
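
Once wget has pulled the site down, a few lines of Python could walk the mirrored files and spit out a url,title CSV. A rough sketch only - the mirror directory and site root below are placeholders for whatever wget actually creates:

#!/usr/bin/env python
# Walk a wget -m mirror directory and write url,title pairs to a CSV.
# MIRROR_DIR and SITE_ROOT are placeholders - adjust to the real site.

import csv
import os
import re

MIRROR_DIR = "www.example.com"          # directory created by wget -m
SITE_ROOT = "http://www.example.com"    # prefix used to rebuild the original URLs
OUTPUT_CSV = "site_pages.csv"

TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)

with open(OUTPUT_CSV, "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["url", "title"])
    for dirpath, _dirnames, filenames in os.walk(MIRROR_DIR):
        for filename in filenames:
            if not filename.endswith((".html", ".htm")):
                continue
            path = os.path.join(dirpath, filename)
            with open(path, encoding="utf-8", errors="replace") as f:
                match = TITLE_RE.search(f.read())
            title = match.group(1).strip() if match else ""
            # Rebuild the URL from the path relative to the mirror root.
            relative = os.path.relpath(path, MIRROR_DIR).replace(os.sep, "/")
            writer.writerow([SITE_ROOT + "/" + relative, title])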

There's a Java web-crawler explained here:
http://java.sun.com/developer/technicalArticles/ThirdParty/WebCrawler/

You'd just need to download the source code & compile it, then jar it up for him. I'd do it myself but I can't email you compiled code from my work.

Actually, the W3C have a link checker as part of their validation site.

From: [identity profile] sbisson.livejournal.com


There are lots of web site mapping tools out there...

From: [identity profile] mollydot.livejournal.com


There's a search engine called swish-e. I think it outputs the paths when spidering, or if not by default, it can probably be told to.