regex - Nutch - Why are my URL exclusions not excluding those URLs?

Surprise! I have an Apache Nutch v1.5 question. In crawling and indexing our site into Solr via Nutch, I need to be able to exclude any content that falls under a certain path.

So say we have our site, http://oursite.com/, and we have a path that we don't want to index at http://oursite.com/private/

I have http://oursite.com/ in the seed.txt file and +^http://www.oursite.com/([a-z0-9\-a-z]*\/)* in the regex-urlfilter.txt file.

I thought that putting -.*/private/.* in the regex-urlfilter.txt file would exclude that path and everything under it, but the crawler is still fetching and indexing content under the /private/ path.

Is there some kind of restart I need to do on the server, like Solr? Or is my regex not the right way to do this?

Thanks

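My guess is that the URL is accepted by the first regex, so the second one isn't checked anymore. Nutch evaluates the rules in regex-urlfilter.txt from top to bottom, and the first pattern that matches decides whether the URL is kept or dropped. If you want to deny URLs, put those regexes first in the list, above the + rule.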
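To illustrate why the order matters, here is a rough Python sketch of first-match-wins filtering. It is not Nutch's code, only a small model of the behaviour described above; the function name `first_match_filter` and the sample URL are made up for the example, and it assumes each rule is a `+` or `-` sign followed by a regex that may match anywhere in the URL.

```python
import re


def first_match_filter(rules, url):
    """Return True if url is accepted, False if it is rejected.

    rules: strings in regex-urlfilter.txt style ('+regex' or '-regex').
    The first rule whose regex is found in the URL decides the outcome;
    a URL that matches no rule is rejected.
    """
    for rule in rules:
        sign, pattern = rule[0], rule[1:]
        if re.search(pattern, url):
            return sign == "+"
    return False


# The two rules from the question, copied as written.
ACCEPT = r"+^http://www.oursite.com/([a-z0-9\-a-z]*\/)*"
DENY = r"-.*/private/.*"

# Hypothetical URL under the path that should be excluded.
url = "http://www.oursite.com/private/report.html"

print(first_match_filter([ACCEPT, DENY], url))  # True: accept rule hits first
print(first_match_filter([DENY, ACCEPT], url))  # False: deny rule hits first
```

With the accept rule on top, the /private/ URL is let through before the deny rule is ever consulted; swapping the two lines in regex-urlfilter.txt should give the exclusion you expect.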
