Restrict authorized Scrapy redirections to the website start_urls
scrapy-redirect restricts authorized HTTP redirections to the website's start_urls.
Why?
If the Scrapy REDIRECT_ENABLED setting is False and a request to the homepage of the crawled website returns a 3XX status code, the crawl stops immediately, because the redirection is not followed.
To avoid this problem, scrapy-redirect forces Scrapy to tolerate redirections that originate from the start_urls when REDIRECT_ENABLED = False.
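The rule described above can be sketched as a small standalone function. This is only an illustration of the decision, not the package's actual code: the name redirect_is_followed is hypothetical, and comparing URLs by exact string equality is an assumption (the real middleware may compare or normalize URLs differently).

```python
def redirect_is_followed(request_url, start_urls, redirect_enabled):
    """Hypothetical helper: tell whether a 3XX response to request_url is followed.

    Illustrative only; it does not use or reproduce scrapy-redirect internals.
    """
    if redirect_enabled:
        # REDIRECT_ENABLED = True: Scrapy follows redirections on its own,
        # so scrapy-redirect has nothing to do.
        return True
    # REDIRECT_ENABLED = False: only redirections coming from one of the
    # start_urls are tolerated (exact string match is an assumption).
    return request_url in start_urls


start_urls = ["http://example.com/"]
print(redirect_is_followed("http://example.com/", start_urls, False))
print(redirect_is_followed("http://example.com/page", start_urls, False))
```

In this sketch, only the homepage request keeps its redirection when redirections are otherwise disabled; any other page is still blocked.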
Installation
$ pip install scrapy-redirect
Configuration
Enable scrapy-redirect in your Scrapy middlewares by adding the following key/value pair to the SPIDER_MIDDLEWARES setting (in settings.py):
SPIDER_MIDDLEWARES = {
    ...
    'scrapyredirect.HomepageRedirectMiddleware': 575,
    ...
}
Note that the middleware order value must be lower than 600 (the default value of the 'scrapy.contrib.downloadermiddleware.redirect.RedirectMiddleware' middleware), because scrapy-redirect must run before Scrapy blocks the redirection.
NB: if REDIRECT_ENABLED = True, scrapy-redirect does nothing.
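Putting the pieces together, a minimal settings.py fragment might look like the sketch below. It only combines the two settings discussed in this README; the absence of any other middleware entries is an assumption made to keep the example short, and a real project would keep its existing entries.

```python
# settings.py (minimal sketch; other project settings are omitted)

# Redirections are disabled globally. This is the case scrapy-redirect targets.
REDIRECT_ENABLED = False

# Register the middleware with an order value below 600, as required above.
SPIDER_MIDDLEWARES = {
    'scrapyredirect.HomepageRedirectMiddleware': 575,
}
```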
License
scrapy-redirect is published under the MIT License.