Tuesday 2 September 2008

New SocSciBot 4 allowing multiple crawls

There is a new version of SocSciBot 4 online (on the SocSciBot web site) that has a new "multiple crawl" mode. In this mode, you can give it a set of home pages and it will run crawls of all the sites simultaneously. This should make it easier to run projects involving many crawls of small web sites. The maximum number of URLs in total for all crawls is 900,000, with approximately 15,000 per individual site, so it would probably not be a good idea to crawl more than 40 sites unless they are all small. Also, crawling many sites may take a long time!
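Treating the limits above as given (900,000 URLs in total, roughly 15,000 per site), the raw arithmetic can be sketched as follows. The constants come from the post; the function name is my own, and the post sensibly recommends staying nearer 40 sites to leave some headroom:

```python
# Back-of-the-envelope check of the crawl limits quoted in the post.
# The two constants are from the post; everything else is illustrative.
TOTAL_URL_LIMIT = 900000   # maximum URLs across all crawls combined
PER_SITE_LIMIT = 15000     # approximate maximum URLs per individual site

def max_full_size_sites(total=TOTAL_URL_LIMIT, per_site=PER_SITE_LIMIT):
    """How many sites fit if every site hits its per-site limit."""
    return total // per_site

print(max_full_size_sites())  # 60 in the worst case; the post suggests ~40
```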

9 comments:

Unknown said...

Dear Mr. Thelwall,

I tried your new SSB version 4.1 over the past weekend and unfortunately it always freezes. I enter a list of 90 URLs and it crawls for some hours until it freezes. Do you have an idea what the problem might be?
Thanks a lot, Martin

Mike Thelwall said...

Dear Martin,

This has also happened to me once but I don't know what the cause is yet and it may take some time to find out. Do you know if it happens on a particular site? If so, please email me the URL of the start of the crawl for the site.

As a temporary measure, I would suggest splitting the 90 URLs into smaller groups - e.g., 45 or even 10 each. This might give you all the data or at least narrow down which one does not work.
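As a rough illustration of that splitting step (the file names and the batch size of 10 are my own choices, not SocSciBot settings), something like this would break one long URL list into smaller files:

```python
# Hypothetical helper: split a file of start URLs (one per line)
# into smaller batch files that can each be loaded for a separate crawl.
def split_url_list(path, batch_size=10):
    with open(path) as f:
        urls = [line.strip() for line in f if line.strip()]
    count = 0
    for i in range(0, len(urls), batch_size):
        count += 1
        with open(f"urls_batch{count}.txt", "w") as out:
            out.write("\n".join(urls[i:i + batch_size]) + "\n")
    return count  # number of batch files written
```

Each `urls_batchN.txt` file could then be loaded in turn, which should also help narrow down which start URL triggers the freeze.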

Best wishes,
Mike

Unknown said...

Dear Mr. Thelwall,

which of the three possible ways of multiple crawling that your SSB allows should I use for the split parts of my 90 URLs? Why is it that I can only press the "Load list of URLs to Crawl" button with the first and third options, but not with the second one, which I would prefer to use?!
Once I have loaded a small list of 10 to 20 URLs with the first or third option on the multiple crawl page, do I still have to enter a URL in the "First URL to Crawl" boxes? And if so, which ones?
Thank you very much again! Sincerely yours,

Martin Klaus

p.s.

My URL.txt looks like this for example:


http://www.huffingtonpost.com
http://www.techcrunch.com
http://www.engadget.com/
http://gizmodo.com/
http://boingboing.net/
http://lifehacker.com/
http://arstechnica.com/
http://mashable.com/
http://www.dailykos.com/
http://www.readwriteweb.com/
http://smashingmagazine.com/
http://beppegrillo.it/
http://googleblog.blogspot.com/
http://sethgodin.typepad.com/
http://www.problogger.net/
http://perezhilton.com/
http://gigazine.net/
http://doshdosh.com/
http://postsecret.blogspot.com/
http://gawker.com/
http://treehugger.com/
http://kotaku.com/
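A list in this format can be sanity-checked before loading. The sketch below only reflects my assumption about what the crawler expects (one full home-page URL per line, with a scheme); none of it is SocSciBot's own code:

```python
# Hypothetical pre-flight check for a URL list file:
# drop blank lines, add a missing http:// scheme, keep lines with a host.
from urllib.parse import urlparse

def clean_url_list(lines):
    cleaned = []
    for line in lines:
        url = line.strip()
        if not url:
            continue                 # skip blank lines
        if "://" not in url:
            url = "http://" + url    # assume the scheme was omitted
        if urlparse(url).netloc:     # keep only entries with a host name
            cleaned.append(url)
    return cleaned
```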

Mike Thelwall said...

Dear Martin,

Sorry, the disabled URL list button for the second option is a bug in the system. To work around it, please switch to one of the other two options before loading the URL list and then switch back after loading the list. This bug will be removed in the next version.

There is no need to add a "First URL to crawl" if you are loading a list of URLs.

Best wishes,
Mike

Mike Thelwall said...

Dear Martin,
This freezing problem should be fixed now - see the new blog post. Sorry it took so long.
Best wishes,
Mike

Anonymous said...

Dear Mike,

Regarding splitting the URLs into small groups, what are the steps to combine the results when I am about to do the analysis?

Mike Thelwall said...

Dear Chien,
There are no extra steps to take - for example if you enter 8 URLs for a multiple crawl then when the crawl is finished you can analyse as normal - as if you had done 8 consecutive crawls (for a link analysis). You can't split up a single crawl into multiple part crawls using this feature though. Is this what you meant?
Best wishes,
Mike

Anonymous said...

Dear Mike

Sorry for not explaining it clearly.

My situation is that I forgot to include a few URLs in the list (say, list A). Can I crawl those URLs (say, list B) using this feature and then combine A & B for the analysis?

Mike Thelwall said...

Dear Chien,
Sorry for the very delayed reply. You can add extra crawls but only one at a time, and not using the multiple crawl feature. Hope this helps.
Best wishes,
Mike