Wednesday, June 26, 2013

let wget ignore robots.txt

We can write a robots.txt file to ask robots and crawlers not to crawl our website indiscriminately.

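wget obeys robots.txt quite strictly when it downloads recursively, but it can be told to ignore the file. This does not disguise wget as a browser; it only skips the robots.txt check: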

wget -e robots=off [url]

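-e lets you pass a setting on the command line as if it were a line in wgetrc, which is handy for settings you have not written there. From the man page: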


--execute command
           Execute command as if it were a part of .wgetrc.  A command thus invoked will be executed after the commands in .wgetrc, thus taking precedence over them.  If you need to specify more than one wgetrc command, use multiple instances of -e.
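
As a rough sketch of using several -e instances together, the helper below wraps the command in a small shell function. The function name fetch_ignoring_robots, the WGET override variable, the one-second wait, and the recursion depth of 1 are my own assumptions for illustration, not something from the wget manual; robots and wait are wgetrc commands, and --recursive and --level are regular wget options.

```shell
# Hypothetical helper: recursively fetch a URL while ignoring robots.txt.
# WGET may be set to another command name (for example a stub) for a dry run;
# by default the real wget is used.
fetch_ignoring_robots() {
    # Each -e passes one wgetrc command; these are applied after ~/.wgetrc
    # has been read, so they take precedence over what is written there.
    #   robots=off : do not consult robots.txt
    #   wait=1     : pause one second between requests (assumed value)
    "${WGET:-wget}" -e robots=off -e wait=1 --recursive --level=1 "$1"
}
```

Calling fetch_ignoring_robots with a URL would then run wget with both settings applied. If you always want this behaviour, putting the line robots = off in ~/.wgetrc should have the same effect without -e. Since robots.txt expresses the site owner's wishes, keeping a wait between requests is a polite default.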


Reference

http://www.gnu.org/software/wget/manual/html_node/Robot-Exclusion.html
