Open
Description
Hi,
Thanks you guys for the great framework.
I am using scrapy to crawl multiple sites. Sites are diffrerent encodings.
One site is encoding as 'gbk' and it's declared in HTML meta. but scrapy can not auto detect the encoding.
I tried using Beautiful soup, it can parse it correctly. So I dig into w3lib. found that the pattern
_BODY_ENCODING_BYTES_RE
can not correctly found the encoding in meta.
HTML snippet as below:
b'<HTML>\r\n <HEAD>\r\n <TITLE>\xcd\xf8\xd5\xbe\xb5\xd8\xcd\xbc</TITLE>\r\n <meta httpequiv="ContentType" content="text/html; charset=gbk" />\r\n <META NAME="Keywords" CONTENT="\xe5\xd0\xa1\xcb\xb5,\xcd\xf8\xd5\xbe\xb5\xd8\xcd\xbc">'
my test :
>>> from w3lib.encoding import html_body_declared_encoding
>>> b
b'<HTML>\r\n <HEAD>\r\n <TITLE>\xcd\xf8\xd5\xbe\xb5\xd8\xcd\xbc</TITLE>\r\n <meta httpequiv="ContentType" content="text/html; charset=gbk" />\r\n <META NAME="Keywords" CONTENT="\xe5\xd0\xa1\xcb\xb5,\xcd\xf8\xd5\xbe\xb5\xd8\xcd\xbc">'
>>> html_body_declared_encoding(b)
>>> enc = html_body_declared_encoding(b)
>>> enc
>>> print('"%s"' % enc)
"None"
>>> soup = BeautifulSoup(b)
>>> soup.title
<title>网站地图</title>
>>> soup.original_encoding
'gbk'
>>>