Solr 5 started supporting simple web crawling ([Java Doc][1]). If you want search, Solr is the tool; if you want to crawl, Nutch/Scrapy is better :)
To get it up and running, you can take a detailed look [here][2]. However, here is how to get it up and running in one line:
    java
    -classpath <pathtosolr>/dist/solr-core-5.4.1.jar
    -Dauto=yes
    -Dc=gettingstarted                    -> collection: gettingstarted
    -Ddata=web                            -> web crawling and indexing
    -Drecursive=3                         -> go 3 levels deep
    -Ddelay=0                             -> for the impatient; use 10+ for production
    org.apache.solr.util.SimplePostTool   -> SimplePostTool
    <url-to-crawl>                        -> a test WordPress blog
The crawler here is very "naive"; you can find all the code in [this][3] Apache Solr GitHub repo.
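To give a feel for what "go N levels deep" means, here is a minimal sketch of a depth-limited, breadth-first crawl. It is not SimplePostTool's actual code: the function `crawl`, the `get_links` callback, and the in-memory `TOY_SITE` graph are all assumptions made for illustration, and no network access or Solr posting is involved.

```python
import time
from typing import Callable, Dict, Iterable, List, Tuple

# Hypothetical toy site: each page maps to the links found on it.
TOY_SITE: Dict[str, List[str]] = {
    "/": ["/a", "/b"],
    "/a": ["/b", "/c"],
    "/b": ["/"],
    "/c": ["/d"],
    "/d": [],
}


def crawl(start_url: str,
          get_links: Callable[[str], Iterable[str]],
          max_depth: int,
          delay_seconds: float = 0.0) -> List[Tuple[str, int]]:
    """Visit pages level by level, never going deeper than max_depth."""
    seen = {start_url}          # every URL discovered so far
    visited = []                # (url, depth) in the order pages were processed
    frontier = [start_url]      # pages to process at the current level
    for depth in range(max_depth + 1):
        next_frontier = []
        for url in frontier:
            visited.append((url, depth))   # a real crawler would index the page here
            for link in get_links(url):
                if link not in seen:       # only queue links not seen before
                    seen.add(link)
                    next_frontier.append(link)
            if delay_seconds > 0:
                time.sleep(delay_seconds)  # be polite to external sites
        frontier = next_frontier
        if not frontier:
            break
    return visited


if __name__ == "__main__":
    for url, depth in crawl("/", lambda u: TOY_SITE.get(u, []), max_depth=2):
        print(f"visited {url} (depth: {depth})")
```

With `max_depth=2`, the page `/d` is discovered but never visited, because it sits at level 3.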
Here is what the response looks like:
    SimplePostTool version 5.0.0
    Posting web pages to Solr url <url>
    Entering auto mode. Indexing pages with content-types corresponding to file endings xml,json,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log
    SimplePostTool: WARNING: Never crawl an external web site faster than every 10 seconds, your IP will probably be blocked
    Entering recursive mode, depth=3, delay=0s
    Entering crawl at level 0 (1 links total, 1 new)
    POSTed web resource <url> (depth: 0)
    Entering crawl at level 1 (52 links total, 51 new)
    POSTed web resource <url> (depth: 1)
    ...
    Entering crawl at level 2 (266 links total, 215 new)
    ...
    POSTed web resource <url> (depth: 2)
    ...
    Entering crawl at level 3 (846 links total, 656 new)
    POSTed web resource <url> (depth: 3)
    SimplePostTool: WARNING: The URL <url> returned a HTTP result status of 302
    423 web pages indexed.
    COMMITting Solr index changes to <url>...
    Time spent: 0:05:55.059
In the end, you can see all the data are indexed properly.
[![enter image description here][4]][4]
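If you prefer to check the result without the admin UI, you can ask the collection's `select` handler for the number of indexed documents. This is a rough sketch under the assumption that Solr is reachable at the default `http://localhost:8983/solr` and that the collection is `gettingstarted`; the helper names are made up for this example.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def count_query_url(solr_base: str, collection: str) -> str:
    """Build a match-all query URL that returns only the hit count (rows=0)."""
    params = urlencode({"q": "*:*", "rows": 0, "wt": "json"})
    return f"{solr_base.rstrip('/')}/{collection}/select?{params}"


def parse_num_found(body: str) -> int:
    """Extract the total number of matching documents from a JSON response."""
    return int(json.loads(body)["response"]["numFound"])


def count_documents(solr_base: str, collection: str) -> int:
    """Query Solr and return how many documents the collection holds."""
    with urlopen(count_query_url(solr_base, collection)) as response:
        return parse_num_found(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Assumed defaults: adjust the base URL and collection name to your setup.
    print(count_documents("http://localhost:8983/solr", "gettingstarted"))
```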
[1]:
[To see links please register here]
[2]:
[To see links please register here]
[3]:
[To see links please register here]
[4]: