grobid的使用

发布于 2024年02月03日 • 3 分钟 • 1131 字

Table of contents

1. Grobid 介绍
2. Grobid 安装
3. Grobid 使用

最近被文本分块虐得不轻，看到有人介绍grobid，赶紧用上了。

1. Grobid 介绍

Grobid 的全称是Generation of Bibliographic Data。它用机器学习来解析、提取文档。

2. Grobid 安装

用docker安装有两个版本，机器内存大用docker pull grobid/grobid:0.8.0（需要10GB），有crf和deep learning两种模型，内存小用docker pull lfoppiano/grobid:0.8.0（需要300MB），只有crf模型。

m1芯片需要JVM 这里有解决办法

我在本地安装的是小模型版本。

docker pull lfoppiano/grobid:0.8.0

docker run --rm --init --ulimit core=0 -p 8070:8070 lfoppiano/grobid:0.8.0

运行上面两条命令之后就可以在http://localhost:8070/ 看到官网的经典页面了。（官网demo：https://kermitt2-grobid.hf.space/）

3. Grobid 使用

3.1 Web 端

web端的使用很简单，在http://localhost:8070/上传文档，点submit就可以了，还能下载TEI结果。（TEI是Text Encoding Initiative，规定了电子文档的结构。）

3.2 API 调用

如果想使用API调用，有Node.js、Jave、Python三种方式。

我选择的是Python调用。GitHub Repo在此：https://github.com/kermitt2/grobid_client_python

git clone https://github.com/kermitt2/grobid_client_python
cd grobid_client_python
python3 setup.py install

运行


from grobid_client.grobid_client import GrobidClient

client = GrobidClient(config_path="./config.json")
client.process("processFulltextDocument", "/mnt/data/covid/pdfs", n=20)

4. grobid_client的语法

usage: grobid_client [-h] [--input INPUT] [--output OUTPUT] [--config CONFIG]
                     [--n N] [--generateIDs] [--consolidate_header]
                     [--consolidate_citations] [--include_raw_citations]
                     [--include_raw_affiliations] [--force] [--teiCoordinates]
                     [--verbose]
                     service

5. grobid_client的参数解释

service，
- ‘processFulltextDocument’
  - 处理整个文档
- ‘processHeaderDocument’
  - HeaderDocument是一种文档的顶部信息部分，通常出现在文档的开头。它包含了关于文档的元数据和描述性信息，通常包含标题（Title）、作者（Author）、日期（Date）、版本（Version）、摘要（Abstract）、目录（Table of Contents）、页眉（Header）和页脚（Footer）
- ‘processReferences’
- ‘processCitationList’
- ‘processCitationPatentST36’
- ‘processCitationPatentPDF’

–input INPUT

包含 PDF 文件或 .txt 文件的目录路径（如果用于processCitationList，每行一个引用） –output OUTPUT

将结果放置的目录路径 –config CONFIG

配置文件路径，默认为 ./config.json –n N

并发服务使用数 –generateIDs

为结果文件中的文本XML元素生成随机 xml:id –consolidate_header

调用 GROBID 时合并从标头中提取的元数据 –consolidate_citations 调用 GROBID 时合并提取的文献引用 –include_raw_citations 调用 GROBID 请求提取原始引文 –include_raw_affiliations 调用 GROBID 请求提取原始机构 –force 当已存在 TEI 输出文件时，强制重新处理 PDF 输入文件 –teiCoordinates 将原始的 PDF 坐标（边界框）添加到提取的元素中 –segmentSentences 在文档的文本内容中使用额外的 <s> 元素分割句子 –verbose 在控制台打印有关已处理文件的信息

有位网友提供了一种用requests来解析文档的方法。

import os
import requests

def getXml( file_path, output_filepath):
    url = "http://localhost:8070/api/processFulltextDocument"
    filename = file_path.split("/")[-1].split(".")[-2]
    params = dict(input=open(file_path , 'rb'))
    response = requests.post(url, files=params, timeout=300)
    fh = open(os.path.join(output_filepath ,filename + ".xml"), "w", encoding="utf-8")
    fh.write(response.text)
    fh.close()

def run(files_paths,files):
    
    for file_path in files_paths :
        
        getXml(file_path, files)

if __name__ ==  "__main__":
    files = "./data/grobid_data"
    files_path = [os.path.join(files, i)  for i in os.listdir(files)]
    run(files_path,files)