Python + Baidu Translate API: Quickly Screening Thousands of English Papers (Source Code Included)
Overview
Use Python to systematically scrape the key information of English-language papers (title, abstract, keywords, etc.) and build a database for a given body of literature. Then call the Baidu Translate API to machine-translate the key fields, producing a rough literature-retrieval library that makes first-pass screening easier, gives an overall preview of the foreign literature, and simplifies management.
Pros: fast rough translation of a large volume of papers; easier targeted selection of papers for close reading; systematic and complete collection and management of the literature.
Cons: the initial crawl is slow (the sites are hosted overseas), and machine translation is not as good as human translation.
Result Preview
A translation library of every paper in the Journal of Rural Studies
(3,007 papers in total)
With a little formatting afterwards, it is easy to read.
① Preview the titles, keywords, etc. → ② Find the papers you need → ③ Click the link → ④ Read the article in detail
Inspiration
Fast literature screening: Chinese vs. English
To find papers relevant to your own research, you should first screen broadly, pick out the articles that fit your topic, and only then read them closely to study their arguments and methods. Compared with English, we read Chinese faster and can use the key information more effectively to pick out the papers we actually need, avoiding the situation where you finally finish a long paper only to discover it has little to do with your research.
That is where the idea came from: if English papers could be automatically batch-translated into a database, reading would be easier, you could absorb more information in the same amount of time, the breadth of literature screening would grow, and there would be no network latency interrupting your train of thought. I tried it out at home over the winter break and the results were good; the implementation is described below.
Logical Design
Approach & Framework
Implementing the idea involves two main parts: first, scraping the relevant data with Python; second, calling the Baidu Translate API for automatic translation. The detailed workflow is summarized in the figure below:
Physical Design
Source Code & Implementation
1. Scraping the Literature Data
This test uses the Journal of Rural Studies; the journal's page is linked below. The task is to scrape the information of every article the journal has published since its founding.
https://www.journals.elsevier.com/journal-of-rural-studies/
# Import libraries (note: requests is aliased as re throughout this post)
import requests as re
from lxml import etree
import pandas as pd
import time
# Build the request headers
headers = {'user-agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36'}
First, test a single page to check the XPath parsing result.
url = 'https://www.sciencedirect.com/journal/journal-of-rural-studies/issues'
res = re.get(url,headers = headers).text
res = etree.HTML(res)
testdata = res.xpath("//a[@class='anchor js-issue-item-link text-m']/@href")
testdata
It turns out that parsing the first-level page yields only one second-level link, when in principle it should return all of them. After several attempts it became clear that, in the page's design, only the first second-level link is delivered by a GET request while the remaining ones are loaded via POST requests, and the listing spans two pages. For convenience, it is easier to expand all the links in the browser, save the two pages as HTML files, and import those instead.
html1 = etree.parse('G:\\Pythontest\\practice\\test1.html',etree.HTMLParser())
html2 = etree.parse('G:\\Pythontest\\practice\\test2.html',etree.HTMLParser())
data1 = html1.xpath("//a[@class='anchor js-issue-item-link text-m']/@href")
data2 = html2.xpath("//a[@class='anchor js-issue-item-link text-m']/@href")
LINKS = []
LINKS.extend(data1)
LINKS.extend(data2)
TLINKS = []
for i in LINKS:
    link = 'https://www.sciencedirect.com' + i
    TLINKS.append(link)
TLINKS now holds the links to all the first-level pages; its length is 158, so the data was retrieved correctly. Next, fetch all the second-level links. Feel free to watch a livestream or something in the meantime, since access to the overseas site is a little slow. When it finishes there are 3,007 second-level links, i.e. 3,007 articles.
SUBLINKS = []
for link in TLINKS:
    subres = re.get(link,headers = headers).text
    subres = etree.HTML(subres)
    sublinks = subres.xpath("//a[@class = 'anchor article-content-title u-margin-xs-top u-margin-s-bottom']/@href")
    SUBLINKS.extend(sublinks)
    print("Issue page",TLINKS.index(link),"is OK")
    time.sleep(0.2)
print('ALL IS OK')
LINKS = []
for i in SUBLINKS:
    link = 'https://www.sciencedirect.com' + i
    LINKS.append(link)
With the second-level links in hand, the next step is to analyze the structure of the third-level pages, pick out the required information, and organize it into a dictionary for storage.
allinfo = []
for LINK in LINKS:
    info = {}
    res = re.get(LINK,headers=headers).text
    res = etree.HTML(res)
    vol = res.xpath("//a[@title = 'Go to table of contents for this volume/issue']/text()")
    datainfo = res.xpath("//div[@class = 'text-xs']/text()")
    timu = res.xpath("//span[@class = 'title-text']/text()")
    givenname = res.xpath("//span[@class='text given-name']/text()")
    surname = res.xpath("//span[@class='text surname']/text()")
    web = res.xpath("//a[@class='doi']/@href")
    abstract = res.xpath("//p[@id='abspara0010']/text()")
    keywords = res.xpath("//div[@class='keyword']/span/text()")
    highlights = res.xpath("//dd[@class='list-description']/p/text()")
    info['vol'] = vol
    info['datainfo'] = datainfo
    info['timu'] = timu
    info['givenname'] = givenname
    info['surname'] = surname
    info['web'] = web
    info['abstract'] = abstract
    info['keywords'] = keywords
    info['highlights'] = highlights
    allinfo.append(info)
    print("Article",LINKS.index(LINK),"is finished, overall progress:",(LINKS.index(LINK)+1)/len(LINKS))
df = pd.DataFrame(allinfo)
df
df.to_excel(r'G:\PythonStudy\practice1\test.xls',sheet_name='sheet1')
With that, the scraping is complete and we have a DataFrame containing the information of every article.
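A quick sanity check on the result, assuming all 3,007 articles came through and the nine fields above were collected:

print(df.shape)   # expected: (3007, 9)
df.head()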
2. Data Cleaning
Remove the extraneous characters from the data and split the fields that were merged during scraping, producing a DataFrame ready for translation.
# Remove the extraneous characters
data = df.copy()
cols = ['abstract','datainfo','givenname','highlights','keywords','surname','timu','vol','web']
for col in cols:
    # each field was scraped as a list, so cast to string first, then strip brackets and quotes
    data[col] = (data[col].astype(str)
                          .str.replace('[','',regex=False)
                          .str.replace(']','',regex=False)
                          .str.replace('\'','',regex=False))
# Split the merged fields
data['date'] = data['datainfo'].str.split(',').str.get(1)
data['page'] = data['datainfo'].str.split(',').str.get(2)
3. Batch Translation of the Key Fields
Once the DataFrame with the full literature information is ready, the Baidu Translate API is called for batch translation. It is worth reading the official documentation carefully; the required request parameters are described in detail there:
https://api.fanyi.baidu.com/doc/21
| Field | Type | Required | Description | Notes |
| --- | --- | --- | --- | --- |
| q | TEXT | Y | Text to be translated | UTF-8 encoded |
| from | TEXT | Y | Source language | zh for Chinese, en for English |
| to | TEXT | Y | Target language | zh for Chinese, en for English |
| salt | TEXT | Y | Random number | |
| appid | TEXT | Y | APP ID | apply for your own |
| sign | TEXT | Y | Signature | MD5 of appid+q+salt+secret key |
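To make the sign rule concrete, here is a minimal sketch of computing the signature exactly as described in the last row of the table; the appid, secret key, and query text below are placeholders.

import hashlib
import random

appid = 'your_appid'          # placeholder
secretKey = 'your_secret'     # placeholder
q = 'Rural governance'        # text to translate
salt = str(random.randint(32768, 65536))
# sign = MD5 of appid + q + salt + secret key, hex-encoded
sign = hashlib.md5((appid + q + salt + secretKey).encode('utf-8')).hexdigest()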
# Import the required libraries
import http.client
import hashlib
import urllib
import random
import json
import requests as re
# Build the automatic translation function translateBaidu
def translateBaidu(content):
    appid = '20200119000376***'
    secretKey = 'd7SAX0xhIHEEYQ7qp***'
    url = 'http://api.fanyi.baidu.com/api/trans/vip/translate'
    fromLang = 'en'
    toLang = 'zh'
    salt = str(random.randint(32555,65333))
    sign = appid + content + salt + secretKey
    sign = hashlib.md5(sign.encode('utf-8')).hexdigest()
    try:
        params = {
            'appid' : appid,
            'q' : content,
            'from' : fromLang,
            'to' : toLang,
            'salt' : salt,
            'sign' : sign
        }
        res = re.get(url,params)
        jres = res.json()
        # Inspect the returned JSON and pull out the translated text
        dst = str(jres['trans_result'][0]['dst'])
        return dst
    except Exception as e:
        print(e)
After building the function, a quick test shows the results come back correctly; when the input is empty, the response contains no translation, so the function just prints the 'trans_result' KeyError.
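For example, a quick test could look like this (the English sentence is just a stand-in, and a valid appid and secret key are assumed):

print(translateBaidu('Rural land use change in developing countries'))
# prints the Chinese translation returned by the API
print(translateBaidu(''))
# empty input: the response has no 'trans_result' key, so the KeyError is printed
# and the function returns None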
Everything is now in place; all that remains is to run the scraped literature data through translateBaidu and build the new DataFrame columns.
# Add the corresponding new columns to the DataFrame
data['trans-timu'] = 'NULL'
data['trans-keywords'] = 'NULL'
data['trans-abstract'] = 'NULL'
data['trans-highlights'] = 'NULL'
# Translate and assign
for i in range(len(data)):
    data.loc[i,'trans-timu'] = translateBaidu(data['timu'][i])
    data.loc[i,'trans-keywords'] = translateBaidu(data['keywords'][i])
    data.loc[i,'trans-abstract'] = translateBaidu(data['abstract'][i])
    data.loc[i,'trans-highlights'] = translateBaidu(data['highlights'][i])
    # Per the API documentation, do not send more than 10 requests per second
    time.sleep(0.5)
print('ALL FINISHED')
Take a look at the translation results.
Finally, the data is written to a database over an ODBC connection. Once it is saved, running the script every so often before bed keeps the literature library up to date. Write a similar script for each journal you read regularly and you can easily keep track of new literature.
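As a minimal sketch of that last step, assuming a SQL Server database reachable through an ODBC driver (the connection string and table name below are placeholders):

from sqlalchemy import create_engine

# Placeholder connection string; adjust user, password, host, database and driver to your setup
engine = create_engine('mssql+pyodbc://user:password@localhost/literature?driver=ODBC+Driver+17+for+SQL+Server')
# Append the translated DataFrame to a table in the database
data.to_sql('journal_of_rural_studies', engine, if_exists='append', index=False)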
Quality Check
Machine Translation vs. Human Translation
After the translation finished, I was still slightly worried about the quality of Baidu's machine translation (the Google API is a bit of a hassle to access), so I randomly sampled a few entries to check. Having skimmed them, honestly, they read better than my own translations...
Translation accuracy: keywords > titles > abstracts > highlights
A rough read-through shows no real problems: the general meaning comes across and comprehension is not affected.
Consolidated Code
# Import the required libraries
import requests as re
from lxml import etree
import pandas as pd
import time
import http.client
import hashlib
import urllib
import random
import json
# Build the request headers
headers = {'user-agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36'}
# Get the first-level page links
html1 = etree.parse('G:\\Pythontest\\practice\\test1.html',etree.HTMLParser())
html2 = etree.parse('G:\\Pythontest\\practice\\test2.html',etree.HTMLParser())
data1 = html1.xpath("//a[@class='anchor js-issue-item-link text-m']/@href")
data2 = html2.xpath("//a[@class='anchor js-issue-item-link text-m']/@href")
LINKS = []
LINKS.extend(data1)
LINKS.extend(data2)
TLINKS = []
for i in LINKS:
    link = 'https://www.sciencedirect.com' + i
    TLINKS.append(link)
# Get the second-level page links
SUBLINKS = []
for link in TLINKS:
    subres = re.get(link,headers = headers).text
    subres = etree.HTML(subres)
    sublinks = subres.xpath("//a[@class = 'anchor article-content-title u-margin-xs-top u-margin-s-bottom']/@href")
    SUBLINKS.extend(sublinks)
    print("Issue page",TLINKS.index(link),"is OK")
    time.sleep(0.2)
print('ALL IS OK')
LINKS = []
for i in SUBLINKS:
    link = 'https://www.sciencedirect.com' + i
    LINKS.append(link)
# Scrape the data from the third-level pages
allinfo = []
for LINK in LINKS:
    info = {}
    res = re.get(LINK,headers=headers).text
    res = etree.HTML(res)
    vol = res.xpath("//a[@title = 'Go to table of contents for this volume/issue']/text()")
    datainfo = res.xpath("//div[@class = 'text-xs']/text()")
    timu = res.xpath("//span[@class = 'title-text']/text()")
    givenname = res.xpath("//span[@class='text given-name']/text()")
    surname = res.xpath("//span[@class='text surname']/text()")
    web = res.xpath("//a[@class='doi']/@href")
    abstract = res.xpath("//p[@id='abspara0010']/text()")
    keywords = res.xpath("//div[@class='keyword']/span/text()")
    highlights = res.xpath("//dd[@class='list-description']/p/text()")
    # Organize the data inside the dictionary
    info['vol'] = vol
    info['datainfo'] = datainfo
    info['timu'] = timu
    info['givenname'] = givenname
    info['surname'] = surname
    info['web'] = web
    info['abstract'] = abstract
    info['keywords'] = keywords
    info['highlights'] = highlights
    allinfo.append(info)
    print("Article",LINKS.index(LINK),"is finished, overall progress:",(LINKS.index(LINK)+1)/len(LINKS))
# Save the data to an Excel file
df = pd.DataFrame(allinfo)
df
df.to_excel(r'G:\PythonStudy\practice1\test.xls',sheet_name='sheet1')
# Initial data cleaning
data = df.copy()
cols = ['abstract','datainfo','givenname','highlights','keywords','surname','timu','vol','web']
for col in cols:
    # each field was scraped as a list, so cast to string first, then strip brackets and quotes
    data[col] = (data[col].astype(str)
                          .str.replace('[','',regex=False)
                          .str.replace(']','',regex=False)
                          .str.replace('\'','',regex=False))
data['date'] = data['datainfo'].str.split(',').str.get(1)
data['page'] = data['datainfo'].str.split(',').str.get(2)
# Build the automatic translation function translateBaidu
def translateBaidu(content):
    appid = '20200119000376***'
    secretKey = 'd7SAX0xhIHEEYQ7qp***'
    url = 'http://api.fanyi.baidu.com/api/trans/vip/translate'
    fromLang = 'en'
    toLang = 'zh'
    salt = str(random.randint(32555,65333))
    sign = appid + content + salt + secretKey
    sign = hashlib.md5(sign.encode('utf-8')).hexdigest()
    try:
        params = {
            'appid' : appid,
            'q' : content,
            'from' : fromLang,
            'to' : toLang,
            'salt' : salt,
            'sign' : sign
        }
        res = re.get(url,params)
        jres = res.json()
        # Inspect the returned JSON and pull out the translated text
        dst = str(jres['trans_result'][0]['dst'])
        return dst
    except Exception as e:
        print(e)
# Add the corresponding new columns to the DataFrame
data['trans-timu'] = 'NULL'
data['trans-keywords'] = 'NULL'
data['trans-abstract'] = 'NULL'
data['trans-highlights'] = 'NULL'
# Translate and assign
for i in range(len(data)):
    data.loc[i,'trans-timu'] = translateBaidu(data['timu'][i])
    data.loc[i,'trans-keywords'] = translateBaidu(data['keywords'][i])
    data.loc[i,'trans-abstract'] = translateBaidu(data['abstract'][i])
    data.loc[i,'trans-highlights'] = translateBaidu(data['highlights'][i])
    # Per the API documentation, do not send more than 10 requests per second
    time.sleep(0.5)
print('ALL FINISHED')
# Save the file
data.to_excel(r'G:\PythonStudy\practice1\test.xls',sheet_name='sheet1')
This article is reposted from the WeChat public account @OCD Planners.