Fetching a web page with urllib fails with the error "HTTP Error 403: Forbidden".
The exception is raised when running the following code:
    import json
    from urllib import request

    def fetch_data(url):
        req = request.Request(url)
        with request.urlopen(req) as f:
            return json.loads(f.read().decode('utf-8'))
The traceback is as follows:
    Traceback (most recent call last):
      File ".\btin_urllib.py", line 23, in <module>
        data = fetch_data(URL)
      File ".\btin_urllib.py", line 18, in fetch_data
        with request.urlopen(req) as f:
      File "C:\Users\Epins\AppData\Local\Programs\Python\Python38\lib\urllib\request.py", line 222, in urlopen
        return opener.open(url, data, timeout)
      File "C:\Users\Epins\AppData\Local\Programs\Python\Python38\lib\urllib\request.py", line 531, in open
        response = meth(req, response)
      File "C:\Users\Epins\AppData\Local\Programs\Python\Python38\lib\urllib\request.py", line 640, in http_response
        response = self.parent.error(
      File "C:\Users\Epins\AppData\Local\Programs\Python\Python38\lib\urllib\request.py", line 569, in error
        return self._call_chain(*args)
      File "C:\Users\Epins\AppData\Local\Programs\Python\Python38\lib\urllib\request.py", line 502, in _call_chain
        result = func(*args)
      File "C:\Users\Epins\AppData\Local\Programs\Python\Python38\lib\urllib\request.py", line 649, in http_error_default
        raise HTTPError(req.full_url, code, msg, hdrs, fp)
    urllib.error.HTTPError: HTTP Error 403: Forbidden
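This happens because a request sent by urllib does not look like one from a browser. Unless told otherwise, urllib identifies itself with a User-Agent of the form Python-urllib/3.8, so the server learns nothing about a browser, operating system, or hardware platform. Requests like this usually come from automated clients such as crawlers. To block them, some sites check the User-Agent header and reject the request when it is missing or looks abnormal.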
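To see what the server actually receives, the default header can be inspected without making any network request. The following is a minimal sketch; `default_user_agent` is a hypothetical helper name, not part of urllib.

```python
from urllib import request

def default_user_agent():
    """Return the User-Agent value urllib sends when none is set explicitly."""
    opener = request.build_opener()
    # OpenerDirector.addheaders holds the default headers as (name, value) pairs.
    for name, value in opener.addheaders:
        if name.lower() == 'user-agent':
            return value
    return None

if __name__ == '__main__':
    # Prints a string of the form Python-urllib/X.Y for the running interpreter.
    print(default_user_agent())
```

A value of this form is easy for a server to recognize as a script rather than a browser.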
Solution:
Add a User-Agent header to the request, as shown in the following code:
    import json
    from urllib import request

    def fetch_data(url):
        req = request.Request(url)
        # Present a browser-like User-Agent so the server does not reject the request.
        req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0')
        with request.urlopen(req) as f:
            return json.loads(f.read().decode('utf-8'))
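The header can also be passed through the `headers` argument of `request.Request`, and the `HTTPError` can be caught so that a rejected request does not end in a traceback. The sketch below is one possible variant, not the only way to do it: `BROWSER_UA`, `build_request`, `fetch_data_safe`, and the 10-second timeout are hypothetical choices made for illustration.

```python
import json
from urllib import error, request

# Assumption: a browser-like string is enough for the target site.
# This value is the one used in the post.
BROWSER_UA = ('Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) '
              'Gecko/20100101 Firefox/23.0')

def build_request(url, user_agent=BROWSER_UA):
    """Create a Request that carries an explicit User-Agent header."""
    return request.Request(url, headers={'User-Agent': user_agent})

def fetch_data_safe(url):
    """Fetch and decode JSON from url; return None if the server answers with an HTTP error."""
    try:
        with request.urlopen(build_request(url), timeout=10) as f:
            return json.loads(f.read().decode('utf-8'))
    except error.HTTPError as e:
        # e.code is the status code (for example 403), e.reason the status text.
        print(f'Request failed: HTTP {e.code} {e.reason}')
        return None
```

If the server still answers 403 with a browser-like User-Agent, it is probably checking something else, such as cookies, other headers, or the client IP address.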