Python + Baidu Translate API: Quickly Screening Thousands of English Papers (Source Code Included)
Overview
Use Python to systematically scrape the key information of English-language papers (title, abstract, keywords, etc.) and build a database for a given body of literature. Then call the Baidu Translate API to machine-translate the key fields, producing a rough literature-retrieval library that makes first-pass screening easier, gives an overall preview of the foreign literature, and simplifies management.
Pros: fast rough translation of a large volume of papers; easier targeted selection of papers for close reading; systematic and complete collection and management of the literature.
Cons: the initial crawl is slow (the sites are hosted overseas), and machine translation is not as good as human translation.
Result Preview
A translation library of every paper in the Journal of Rural Studies
(3,007 papers in total)
With a little formatting afterwards, it is easy to read.
① Preview the titles, keywords, etc. → ② Find the papers you need → ③ Click the link → ④ Read the article in detail
Inspiration
Fast literature screening: Chinese vs. English
To find papers relevant to your own research, you should first screen broadly, pick out the articles that fit your topic, and only then read them closely to study their arguments and methods. Compared with English, we read Chinese faster and can use the key information more effectively to pick out the papers we actually need, avoiding the situation where you finally finish a long paper only to discover it has little to do with your research.
That is where the idea came from: if English papers could be automatically batch-translated into a database, reading would be easier, you could absorb more information in the same amount of time, the breadth of literature screening would grow, and there would be no network latency interrupting your train of thought. I tried it out at home over the winter break and the results were good; the implementation is described below.
Logical Design
Approach & Framework
Implementing the idea involves two main parts: first, scraping the relevant data with Python; second, calling the Baidu Translate API for automatic translation. The detailed workflow is summarized in the figure below:
Physical Design
Source Code & Implementation
1. Scraping the Literature Data
This test uses the Journal of Rural Studies; the journal's page is linked below. The task is to scrape the information of every article the journal has published since its founding.
https://www.journals.elsevier.com/journal-of-rural-studies/
# Import libraries (note: requests is aliased as re throughout this post)
import requests as re
from lxml import etree
import pandas as pd
import time
# Build the request headers
headers = {'user-agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36'}
First, test a single page to check the XPath parsing result.
url = 'https://www.sciencedirect.com/journal/journal-of-rural-studies/issues'
res = re.get(url,headers = headers).text
res = etree.HTML(res)
testdata = res.xpath("//a[@class='anchor js-issue-item-link text-m']/@href")
testdata
It turns out that parsing the first-level page yields only one second-level link, when in principle it should return all of them. After several attempts it became clear that, in the page's design, only the first second-level link is delivered by a GET request while the remaining ones are loaded via POST requests, and the listing spans two pages. For convenience, it is easier to expand all the links in the browser, save the two pages as HTML files, and import those instead.
html1 = etree.parse('G:\\Pythontest\\practice\\test1.html',etree.HTMLParser())
html2 = etree.parse('G:\\Pythontest\\practice\\test2.html',etree.HTMLParser())
data1 = html1.xpath("//a[@class='anchor js-issue-item-link text-m']/@href")
data2 = html2.xpath("//a[@class='anchor js-issue-item-link text-m']/@href")
LINKS = []
LINKS.extend(data1)
LINKS.extend(data2)
TLINKS = []
for i in LINKS:
    link = 'https://www.sciencedirect.com' + i
    TLINKS.append(link)
TLINKS now holds the links to all the first-level pages; its length is 158, so the data was retrieved correctly. Next, fetch all the second-level links. Feel free to watch a livestream or something in the meantime, since access to the overseas site is a little slow. When it finishes there are 3,007 second-level links, i.e. 3,007 articles.
SUBLINKS = []
for link in TLINKS:
    subres = re.get(link,headers = headers).text
    subres = etree.HTML(subres)
    sublinks = subres.xpath("//a[@class = 'anchor article-content-title u-margin-xs-top u-margin-s-bottom']/@href")
    SUBLINKS.extend(sublinks)
    print("Issue page",TLINKS.index(link),"is OK")
    time.sleep(0.2)
print('ALL IS OK')
LINKS = []
for i in SUBLINKS:
    link = 'https://www.sciencedirect.com' + i
    LINKS.append(link)
With the second-level links in hand, the next step is to analyze the structure of the third-level pages, pick out the required information, and organize it into a dictionary for storage.
allinfo = []
for LINK in LINKS:
    info = {}
    res = re.get(LINK,headers=headers).text
    res = etree.HTML(res)
    vol = res.xpath("//a[@title = 'Go to table of contents for this volume/issue']/text()")
    datainfo = res.xpath("//div[@class = 'text-xs']/text()")
    timu = res.xpath("//span[@class = 'title-text']/text()")
    givenname = res.xpath("//span[@class='text given-name']/text()")
    surname = res.xpath("//span[@class='text surname']/text()")
    web = res.xpath("//a[@class='doi']/@href")
    abstract = res.xpath("//p[@id='abspara0010']/text()")
    keywords = res.xpath("//div[@class='keyword']/span/text()")
    highlights = res.xpath("//dd[@class='list-description']/p/text()")
    info['vol'] = vol
    info['datainfo'] = datainfo
    info['timu'] = timu
    info['givenname'] = givenname
    info['surname'] = surname
    info['web'] = web
    info['abstract'] = abstract
    info['keywords'] = keywords
    info['highlights'] = highlights
    allinfo.append(info)
    print("Article",LINKS.index(LINK),"is finished, overall progress:",(LINKS.index(LINK)+1)/len(LINKS))
df = pd.DataFrame(allinfo)
df
df.to_excel(r'G:\PythonStudy\practice1\test.xls',sheet_name='sheet1')
With that, the scraping is complete and we have a DataFrame containing the information of every article.
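A quick sanity check on the result, assuming all 3,007 articles came through and the nine fields above were collected:

print(df.shape)   # expected: (3007, 9)
df.head()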
2. Data Cleaning
Remove the extraneous characters from the data and split the fields that were merged during scraping, producing a DataFrame ready for translation.
# Remove the extraneous characters
data = df.copy()
cols = ['abstract','datainfo','givenname','highlights','keywords','surname','timu','vol','web']
for col in cols:
    # each field was scraped as a list, so cast to string first, then strip brackets and quotes
    data[col] = (data[col].astype(str)
                          .str.replace('[','',regex=False)
                          .str.replace(']','',regex=False)
                          .str.replace('\'','',regex=False))
# Split the merged fields
data['date'] = data['datainfo'].str.split(',').str.get(1)
data['page'] = data['datainfo'].str.split(',').str.get(2)
3. Batch Translation of the Key Fields
Once the DataFrame with the full literature information is ready, the Baidu Translate API is called for batch translation. It is worth reading the official documentation carefully; the required request parameters are described in detail there:
https://api.fanyi.baidu.com/doc/21
| Field | Type | Required | Description | Notes |
| --- | --- | --- | --- | --- |
| q | TEXT | Y | Text to be translated | UTF-8 encoded |
| from | TEXT | Y | Source language | zh for Chinese, en for English |
| to | TEXT | Y | Target language | zh for Chinese, en for English |
| salt | TEXT | Y | Random number | |
| appid | TEXT | Y | APP ID | apply for your own |
| sign | TEXT | Y | Signature | MD5 of appid+q+salt+secret key |
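To make the sign rule concrete, here is a minimal sketch of computing the signature exactly as described in the last row of the table; the appid, secret key, and query text below are placeholders.

import hashlib
import random

appid = 'your_appid'          # placeholder
secretKey = 'your_secret'     # placeholder
q = 'Rural governance'        # text to translate
salt = str(random.randint(32768, 65536))
# sign = MD5 of appid + q + salt + secret key, hex-encoded
sign = hashlib.md5((appid + q + salt + secretKey).encode('utf-8')).hexdigest()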
# Import the required libraries
import http.client
import hashlib
import urllib
import random
import json
import requests as re
# Build the automatic translation function translateBaidu
def translateBaidu(content):
    appid = '20200119000376***'
    secretKey = 'd7SAX0xhIHEEYQ7qp***'
    url = 'http://api.fanyi.baidu.com/api/trans/vip/translate'
    fromLang = 'en'
    toLang = 'zh'
    salt = str(random.randint(32555,65333))
    sign = appid + content + salt + secretKey
    sign = hashlib.md5(sign.encode('utf-8')).hexdigest()
    try:
        params = {
            'appid' : appid,
            'q' : content,
            'from' : fromLang,
            'to' : toLang,
            'salt' : salt,
            'sign' : sign
        }
        res = re.get(url,params)
        jres = res.json()
        # Inspect the returned JSON and pull out the translated text
        dst = str(jres['trans_result'][0]['dst'])
        return dst
    except Exception as e:
        print(e)
After building the function, a quick test shows the results come back correctly; when the input is empty, the response contains no translation, so the function just prints the 'trans_result' KeyError.
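For example, a quick test could look like this (the English sentence is just a stand-in, and a valid appid and secret key are assumed):

print(translateBaidu('Rural land use change in developing countries'))
# prints the Chinese translation returned by the API
print(translateBaidu(''))
# empty input: the response has no 'trans_result' key, so the KeyError is printed
# and the function returns None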
Everything is now in place; all that remains is to run the scraped literature data through translateBaidu and build the new DataFrame columns.
# Add the corresponding new columns to the DataFrame
data['trans-timu'] = 'NULL'
data['trans-keywords'] = 'NULL'
data['trans-abstract'] = 'NULL'
data['trans-highlights'] = 'NULL'
# Translate and assign
for i in range(len(data)):
    data.loc[i,'trans-timu'] = translateBaidu(data['timu'][i])
    data.loc[i,'trans-keywords'] = translateBaidu(data['keywords'][i])
    data.loc[i,'trans-abstract'] = translateBaidu(data['abstract'][i])
    data.loc[i,'trans-highlights'] = translateBaidu(data['highlights'][i])
    # Per the API documentation, do not send more than 10 requests per second
    time.sleep(0.5)
print('ALL FINISHED')
Take a look at the translation results.
Finally, the data is written to a database over an ODBC connection. Once it is saved, running the script every so often before bed keeps the literature library up to date. Write a similar script for each journal you read regularly and you can easily keep track of new literature.
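As a minimal sketch of that last step, assuming a SQL Server database reachable through an ODBC driver (the connection string and table name below are placeholders):

from sqlalchemy import create_engine

# Placeholder connection string; adjust user, password, host, database and driver to your setup
engine = create_engine('mssql+pyodbc://user:password@localhost/literature?driver=ODBC+Driver+17+for+SQL+Server')
# Append the translated DataFrame to a table in the database
data.to_sql('journal_of_rural_studies', engine, if_exists='append', index=False)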
Quality Check
Machine Translation vs. Human Translation
After the translation finished, I was still slightly worried about the quality of Baidu's machine translation (the Google API is a bit of a hassle to access), so I randomly sampled a few entries to check. Having skimmed them, honestly, they read better than my own translations...
Translation accuracy: keywords > titles > abstracts > highlights
A rough read-through shows no real problems: the general meaning comes across and comprehension is not affected.
Consolidated Code
# Import the required libraries
import requests as re
from lxml import etree
import pandas as pd
import time
import http.client
import hashlib
import urllib
import random
import json
# Build the request headers
headers = {'user-agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36'}
# Get the first-level page links
html1 = etree.parse('G:\\Pythontest\\practice\\test1.html',etree.HTMLParser())
html2 = etree.parse('G:\\Pythontest\\practice\\test2.html',etree.HTMLParser())
data1 = html1.xpath("//a[@class='anchor js-issue-item-link text-m']/@href")
data2 = html2.xpath("//a[@class='anchor js-issue-item-link text-m']/@href")
LINKS = []
LINKS.extend(data1)
LINKS.extend(data2)
TLINKS = []
for i in LINKS:
    link = 'https://www.sciencedirect.com' + i
    TLINKS.append(link)
# Get the second-level page links
SUBLINKS = []
for link in TLINKS:
    subres = re.get(link,headers = headers).text
    subres = etree.HTML(subres)
    sublinks = subres.xpath("//a[@class = 'anchor article-content-title u-margin-xs-top u-margin-s-bottom']/@href")
    SUBLINKS.extend(sublinks)
    print("Issue page",TLINKS.index(link),"is OK")
    time.sleep(0.2)
print('ALL IS OK')
LINKS = []
for i in SUBLINKS:
    link = 'https://www.sciencedirect.com' + i
    LINKS.append(link)
# Scrape the data from the third-level pages
allinfo = []
for LINK in LINKS:
    info = {}
    res = re.get(LINK,headers=headers).text
    res = etree.HTML(res)
    vol = res.xpath("//a[@title = 'Go to table of contents for this volume/issue']/text()")
    datainfo = res.xpath("//div[@class = 'text-xs']/text()")
    timu = res.xpath("//span[@class = 'title-text']/text()")
    givenname = res.xpath("//span[@class='text given-name']/text()")
    surname = res.xpath("//span[@class='text surname']/text()")
    web = res.xpath("//a[@class='doi']/@href")
    abstract = res.xpath("//p[@id='abspara0010']/text()")
    keywords = res.xpath("//div[@class='keyword']/span/text()")
    highlights = res.xpath("//dd[@class='list-description']/p/text()")
    # Organize the data inside the dictionary
    info['vol'] = vol
    info['datainfo'] = datainfo
    info['timu'] = timu
    info['givenname'] = givenname
    info['surname'] = surname
    info['web'] = web
    info['abstract'] = abstract
    info['keywords'] = keywords
    info['highlights'] = highlights
    allinfo.append(info)
    print("Article",LINKS.index(LINK),"is finished, overall progress:",(LINKS.index(LINK)+1)/len(LINKS))
# Save the data to an Excel file
df = pd.DataFrame(allinfo)
df
df.to_excel(r'G:\PythonStudy\practice1\test.xls',sheet_name='sheet1')
# Initial data cleaning
data = df.copy()
cols = ['abstract','datainfo','givenname','highlights','keywords','surname','timu','vol','web']
for col in cols:
    # each field was scraped as a list, so cast to string first, then strip brackets and quotes
    data[col] = (data[col].astype(str)
                          .str.replace('[','',regex=False)
                          .str.replace(']','',regex=False)
                          .str.replace('\'','',regex=False))
data['date'] = data['datainfo'].str.split(',').str.get(1)
data['page'] = data['datainfo'].str.split(',').str.get(2)
# Build the automatic translation function translateBaidu
def translateBaidu(content):
    appid = '20200119000376***'
    secretKey = 'd7SAX0xhIHEEYQ7qp***'
    url = 'http://api.fanyi.baidu.com/api/trans/vip/translate'
    fromLang = 'en'
    toLang = 'zh'
    salt = str(random.randint(32555,65333))
    sign = appid + content + salt + secretKey
    sign = hashlib.md5(sign.encode('utf-8')).hexdigest()
    try:
        params = {
            'appid' : appid,
            'q' : content,
            'from' : fromLang,
            'to' : toLang,
            'salt' : salt,
            'sign' : sign
        }
        res = re.get(url,params)
        jres = res.json()
        # Inspect the returned JSON and pull out the translated text
        dst = str(jres['trans_result'][0]['dst'])
        return dst
    except Exception as e:
        print(e)
# Add the corresponding new columns to the DataFrame
data['trans-timu'] = 'NULL'
data['trans-keywords'] = 'NULL'
data['trans-abstract'] = 'NULL'
data['trans-highlights'] = 'NULL'
# Translate and assign
for i in range(len(data)):
    data.loc[i,'trans-timu'] = translateBaidu(data['timu'][i])
    data.loc[i,'trans-keywords'] = translateBaidu(data['keywords'][i])
    data.loc[i,'trans-abstract'] = translateBaidu(data['abstract'][i])
    data.loc[i,'trans-highlights'] = translateBaidu(data['highlights'][i])
    # Per the API documentation, do not send more than 10 requests per second
    time.sleep(0.5)
print('ALL FINISHED')
# Save the file
data.to_excel(r'G:\PythonStudy\practice1\test.xls',sheet_name='sheet1')
This article is reposted from the WeChat public account @OCD Planners.