Software to scrape forum threads?

RWP

Anyone know of software to scrape forum threads?

ES is just great... and a PITA when you're trying to follow a long thread. I would like to set loose some kind of bot that would ball up a complete thread for me to read offline.
Does this beast exist?

Thanks,
Roy
 
Download This Site
http://www.httrack.com/
can download the entire forum, pictures and all, but it will take a long time (probably days) the first time around. It can be set to just refresh / update an existing download, so you only get the stuff that wasn't there the last time you used it.

You do have to set it for a single level of URL; any deeper and it'll go way out on the web and start downloading stuff from all the links in all the posts and signatures, which could be several dozen gigabytes of data by the time it's done and take weeks. ;)

I know of no way to make it get just part of a forum's posts, as there is no pattern to the URLs of phpBB posts/threads/forums based on what forum or thread they are in. So there's no way to tell any software like that which threads to get.

It could be possible for a plugin or software to be made (or it may exist already) to do this, but it probably isn't quite as simple as it was with the old NNTP newsgroup readers.
 
Like amberwolf says, you can use HTTrack. To download everything in a thread, you give it the URL of that thread. For example:
http://endless-sphere.com/forums/viewtopic.php?f=6&t=42313
(that's HombreNeuvoElectro's excellent Giant NRS MAC 8T Build Thread--3 pages, 32 posts)
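
For anyone who wants to script this, here is a rough Python sketch that pulls the query parameters out of a thread URL like the one above. On phpBB boards `f` is normally the forum id and `t` the topic id, but that reading is my assumption, and the helper name `thread_ids` is made up for illustration.

```python
from urllib.parse import urlsplit, parse_qs

# The thread URL quoted in the post above.
THREAD_URL = "http://endless-sphere.com/forums/viewtopic.php?f=6&t=42313"


def thread_ids(url):
    """Return the query parameters of a viewtopic.php URL as a dict of strings.

    Hypothetical helper: it only splits the URL, it does not contact the forum.
    """
    query = parse_qs(urlsplit(url).query)
    # parse_qs maps each key to a list of values; keep the first of each.
    return {key: values[0] for key, values in query.items()}


ids = thread_ids(THREAD_URL)
print(ids)  # {'f': '6', 't': '42313'}
```

The `t` value is the part that changes from thread to thread, so it makes a handy name for the folder each download goes into.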

I set maximum mirroring depth to 3 and the following excludes to try to prevent HTTrack from going to these pages:
[pre]-http://endless-sphere.com/forums/index.php
-http://endless-sphere.com/forums/search.php
-http://endless-sphere.com/forums/viewforum.php
-http://endless-sphere.com/forums/ucp.php
-http://endless-sphere.com/forums/faq.php
-http://endless-sphere.com/forums/memberlist.php
-http://endless-sphere.com/forums/viewonline.php[/pre]
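
If you'd rather use the command-line version than the GUI, the same settings can be put together roughly like this. It's only a sketch: it assumes the `httrack` binary is on your PATH and uses its `-O` (output folder) and `-rN` (mirror depth) options with the excludes passed as `-` filters. The output folder name `es-thread-42313` and the function name are made up, and I haven't checked whether the filters need a trailing `*` to catch URLs with query strings.

```python
import shlex

THREAD_URL = "http://endless-sphere.com/forums/viewtopic.php?f=6&t=42313"
FORUM_BASE = "http://endless-sphere.com/forums/"

# Pages to keep HTTrack away from, taken from the exclude list above.
EXCLUDED_PAGES = [
    "index.php",
    "search.php",
    "viewforum.php",
    "ucp.php",
    "faq.php",
    "memberlist.php",
    "viewonline.php",
]


def build_httrack_command(url, output_dir, depth, excluded_pages):
    """Build an httrack argument list (hypothetical helper; it runs nothing)."""
    command = ["httrack", url, "-O", output_dir, f"-r{depth}"]
    # Each exclude filter is the full page URL prefixed with "-".
    command += [f"-{FORUM_BASE}{page}" for page in excluded_pages]
    return command


command = build_httrack_command(THREAD_URL, "es-thread-42313", 3, EXCLUDED_PAGES)
# shlex.join quotes the "?" and "&" so the line can be pasted into a shell.
print(shlex.join(command))
```

To actually start the download, hand the list to `subprocess.run(command, check=True)` instead of printing it.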

It took around 7 minutes (lots of big pictures in that thread), but once it was done I had the whole thread on my disk to browse offline.
 
RWP said:
I can use Fusion or dual boot to get to my Win side and give it a try.

It's available on OS X as a MacPorts package too. Probably the command-line version.
 