场景
在抓取第三方网站的数据时,有时会遇到这种情况:请求的url响应的是一段js文本,你要的是其中的数据。
样例
jsstr:
<pre mdtype="fences" cid="n56" lang="" class="md-fences md-end-block ty-contain-cm modeLoaded" spellcheck="false" style="box-sizing: border-box; overflow: visible; font-family: var(--monospace); font-size: 0.9em; display: block; break-inside: avoid; text-align: left; white-space: normal; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(248, 248, 248); position: relative !important; border: 1px solid rgb(231, 234, 237); border-radius: 3px; padding: 8px 4px 6px; margin-bottom: 15px; margin-top: 15px; width: inherit; color: rgb(51, 51, 51); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">var result = {column:{name:0,age:1,tel:2,addr:3}, data:[['张三',19,'13500223344','北四环12号'],['李四',34,'88220334',],['王五',22,,]]};</pre>
说明:
key是不带引号的
存在空值,有的空值在中间,有的在末尾
字符串值用了单引号
以上均不符合json的标准.
方法
采用eval()
json.loads()
采用PyExecJs
采用demjson (执行效率慢)
采用Js2Py (推荐)
PyV8(未验证)
处理过程
eval
对等号右侧的字符串进行eval().但是因为key不带引号,不是合法的json字符串.当遇到不规范的情况时,会报异常。
第一步:规范json字符串
<pre mdtype="fences" cid="n103" lang="python" class="md-fences md-end-block ty-contain-cm modeLoaded" spellcheck="false" style="box-sizing: border-box; overflow: visible; font-family: var(--monospace); font-size: 0.9em; display: block; break-inside: avoid; text-align: left; white-space: normal; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(248, 248, 248); position: relative !important; border: 1px solid rgb(231, 234, 237); border-radius: 3px; padding: 8px 4px 6px; margin-bottom: 15px; margin-top: 15px; width: inherit; color: rgb(51, 51, 51); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;"> jsonstr = jsstr.split('=')[1] #取等号后面的部分
jsonstr = jsonstr.strip().rstrip(';') #去掉末尾;号
jsonstr = jsonstr.replace(''', '"') #单引号替换为双引号
jsonstr = jsonstr.replace(', ,', ', None,') #替换中间空值
jsonstr = jsonstr.replace(', ]', ', None]') #替换末尾空值</pre>
第二步:调整eval参数
<pre spellcheck="false" class="md-fences md-end-block ty-contain-cm modeLoaded" lang="" cid="n109" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: var(--monospace); font-size: 0.9em; display: block; break-inside: avoid; text-align: left; white-space: normal; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(248, 248, 248); position: relative !important; border: 1px solid rgb(231, 234, 237); border-radius: 3px; padding: 8px 4px 6px; margin-bottom: 15px; margin-top: 15px; width: inherit; color: rgb(51, 51, 51); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">eval(jsonstr, type('Dummy', (dict,), dict(getitem=lambda s,n:n))())</pre>
json.loads
会报类似json.decoder.JSONDecodeError: Expecting property name enclosed in double quotes:
的异常。 即属性名必须用双引号包裹。而给属性名包裹引号有点麻烦。
PyExecJS
pip install PyExecJS
方式一:
<pre mdtype="fences" cid="n116" lang="" class="md-fences md-end-block ty-contain-cm modeLoaded" spellcheck="false" style="box-sizing: border-box; overflow: visible; font-family: var(--monospace); font-size: 0.9em; display: block; break-inside: avoid; text-align: left; white-space: normal; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(248, 248, 248); position: relative !important; border: 1px solid rgb(231, 234, 237); border-radius: 3px; padding: 8px 4px 6px; margin-bottom: 15px; margin-top: 15px; width: inherit; color: rgb(51, 51, 51); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">>>>d = execjs.eval(jsstr)
type(d)
<class 'dict'></pre>
方式二:
<pre mdtype="fences" cid="n123" lang="" class="md-fences md-end-block ty-contain-cm modeLoaded" spellcheck="false" style="box-sizing: border-box; overflow: visible; font-family: var(--monospace); font-size: 0.9em; display: block; break-inside: avoid; text-align: left; white-space: normal; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(248, 248, 248); position: relative !important; border: 1px solid rgb(231, 234, 237); border-radius: 3px; padding: 8px 4px 6px; margin-bottom: 15px; margin-top: 15px; width: inherit; color: rgb(51, 51, 51); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">>>>js = execjs.compile(jsstr)
js.eval('result')
{'column': {'name': 0, 'age': 1, 'tel': 2, 'addr': 3}, 'data': [['锟斤拷锟斤拷', 19, '13500223344', '锟斤拷锟侥伙拷锟斤拷锟斤拷锟斤拷'], ['锟斤拷锟斤拷', 34, '88220334'], ['锟斤拷锟斤拷', 22, None]]}
type(js)
<class 'execjs._external_runtime.ExternalRuntime.Context'></pre>
缺点:中文处理不太好。
demjson
demjson的简介
python处理json是需要第三方json库来支持,工作中遇到处理json数据,是没有安装第三方的json库。demjson模块提供用于编码或解码用语言中性JSON格式表示的数据的类和函数(这在ajax Web应用程序中通常被用作XML的简单替代品)。此实现尽量尽可能遵从JSON规范(RFC 4627),同时仍然提供许多可选扩展,以允许较少限制的JavaScript语法。它包括完整的Unicode支持,包括UTF-32,BOM,和代理对处理。它还可以支持JavaScript的南方和无穷数字类型以及它的“未定义”类型。它还包括一个皮棉像JSON语法验证器测试对于严格遵守标准的JSON文本。
<pre mdtype="fences" cid="n152" lang="" class="md-fences md-end-block ty-contain-cm modeLoaded" spellcheck="false" style="box-sizing: border-box; overflow: visible; font-family: var(--monospace); font-size: 0.9em; display: block; break-inside: avoid; text-align: left; white-space: normal; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(248, 248, 248); position: relative !important; border: 1px solid rgb(231, 234, 237); border-radius: 3px; padding: 8px 4px 6px; margin-bottom: 15px; margin-top: 15px; width: inherit; color: rgb(51, 51, 51); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">>>>data = demjson.decode(jsonstr)
data['column']
{'addr': 3, 'age': 1, 'name': 0, 'tel': 2}
type(data)
<class 'dict'></pre>
优点:良好的处理中文处理
缺点:执行效率比较慢;
不算缺点的缺点:decode的返回值为dict,只能以data['column']方式引用属性值
Js2Py
pip install Js2Py
<pre mdtype="fences" cid="n139" lang="" class="md-fences md-end-block ty-contain-cm modeLoaded" spellcheck="false" style="box-sizing: border-box; overflow: visible; font-family: var(--monospace); font-size: 0.9em; display: block; break-inside: avoid; text-align: left; white-space: normal; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(248, 248, 248); position: relative !important; border: 1px solid rgb(231, 234, 237); border-radius: 3px; padding: 8px 4px 6px; margin-bottom: 15px; margin-top: 15px; width: inherit; color: rgb(51, 51, 51); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">>>>import js2py
result = js2py.eval_js(jsstr)
result
{'column': {'addr': 3, 'age': 1, 'name': 0, 'tel': 2}, 'data': [['张三', 19, '13500223344', '北四环12号'], ['李四', 34, '88220334'], ['王五', 22, None]]}
result.column.addr
type(js)
<class 'js2py.base.JsObjectWrapper'></pre>
优点1:比较完美地处理中文
优点2:返回值为js2py.base.JsObjectWrapper对象,可直接使用属性名引用,如result.column
PyV8
Python3 安装不要使用pip,因为官方只支持 Python2,需要在这里下载对应系统的二进制文件:emmetio/pyv8-binaries
然后解压后将 PyV8.py 与 _PyV8.so (注意:如不是这两个文件名需要修改) ,将两文件复制到 Python 的 site-packages 目录下,如 /usr/local/lib/python3.6/site-packages
。
(转)
缺点:目前(2019.7)只支持到Python 3.3