AWS Textract OCR reads a PDF as a single line instead of preserving line breaks

Founder

2024-11-18 12:01:27

To read a PDF as a single line with AWS Textract OCR, you can use an AWS SDK such as the Python SDK (boto3). Here is a simple code example:

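import os
import time

import boto3

def pdf_to_single_line(pdf_path, bucket):
    # Create the S3 and Textract clients
    s3_client = boto3.client('s3')
    textract_client = boto3.client('textract')

    # Upload the PDF: the asynchronous Textract API reads its input
    # from S3 and does not accept raw bytes
    key = os.path.basename(pdf_path)
    s3_client.upload_file(pdf_path, bucket, key)

    # Call Textract's StartDocumentTextDetection API
    response = textract_client.start_document_text_detection(
        DocumentLocation={
            'S3Object': {'Bucket': bucket, 'Name': key}
        }
    )

    # Get the JobId
    job_id = response['JobId']

    # Poll Textract's GetDocumentTextDetection API until the job finishes
    while True:
        response = textract_client.get_document_text_detection(
            JobId=job_id
        )
        status = response['JobStatus']

        if status != 'IN_PROGRESS':
            break
        time.sleep(2)  # pause between polls to avoid throttling

    if status == 'FAILED':
        raise RuntimeError(response.get('StatusMessage', 'Textract job failed'))

    # Collect the LINE blocks, following NextToken across result pages
    lines = []
    while True:
        for block in response['Blocks']:
            if block['BlockType'] == 'LINE':
                lines.append(block['Text'])
        next_token = response.get('NextToken')
        if not next_token:
            break
        response = textract_client.get_document_text_detection(
            JobId=job_id, NextToken=next_token
        )

    return " ".join(lines)

# Usage example
pdf_path = 'path/to/your/pdf/file.pdf'
bucket = 'your-bucket-name'  # placeholder: an S3 bucket you can write to
result = pdf_to_single_line(pdf_path, bucket)
print(result)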

In the code example above, we first create a Textract client. We then submit the PDF to Textract for OCR with the start_document_text_detection API and get back a JobId.

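Next, we use the get_document_text_detection API to fetch the OCR result. Textract needs some time to process the PDF and produce results, so we call this API in a loop until the job is no longer in progress.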
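The polling step can be sketched as a small helper that waits between calls and gives up after a timeout. This is a minimal sketch rather than part of the original answer: the name wait_for_job and the parameters poll_seconds and timeout_seconds are hypothetical, and the client is passed in so that any object exposing get_document_text_detection (such as a boto3 Textract client) can be used.

```python
import time


def wait_for_job(client, job_id, poll_seconds=2.0, timeout_seconds=300.0):
    """Poll get_document_text_detection until the job leaves IN_PROGRESS.

    wait_for_job, poll_seconds and timeout_seconds are illustrative names,
    not part of the Textract API.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        response = client.get_document_text_detection(JobId=job_id)
        status = response['JobStatus']
        if status != 'IN_PROGRESS':
            break
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"Textract job {job_id} still running after {timeout_seconds}s"
            )
        time.sleep(poll_seconds)  # pause so the API is not called in a tight loop

    if status == 'FAILED':
        # StatusMessage is only present on some responses, hence the fallback
        raise RuntimeError(response.get('StatusMessage', 'Textract job failed'))
    return response
```

Checking for anything other than IN_PROGRESS, instead of listing the finished states, means the loop also ends if the job reports PARTIAL_SUCCESS.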

Finally, we iterate over the blocks in the result and append the text of every block of type "LINE" to the final result, separated by spaces.
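The loop over LINE blocks can be sketched as a pure function that takes the separator as a parameter: a space gives the single-line output described here, while "\n" keeps one detected line per output line. This is a minimal sketch under assumptions: join_lines and separator are hypothetical names, and the sample blocks are hand-written stand-ins containing only the fields the function reads.

```python
def join_lines(blocks, separator=" "):
    """Join the text of the LINE blocks from a Textract response.

    join_lines and separator are illustrative names, not Textract API.
    """
    return separator.join(
        block['Text'] for block in blocks if block['BlockType'] == 'LINE'
    )


# Hand-written sample shaped like entries of response['Blocks']
sample_blocks = [
    {'BlockType': 'PAGE'},
    {'BlockType': 'LINE', 'Text': 'Hello world'},
    {'BlockType': 'WORD', 'Text': 'Hello'},
    {'BlockType': 'WORD', 'Text': 'world'},
    {'BlockType': 'LINE', 'Text': 'Second line'},
]

print(join_lines(sample_blocks))        # all text on one line
print(join_lines(sample_blocks, "\n"))  # one detected line per output line
```

Filtering on LINE also avoids duplicating text, since each WORD block repeats part of the line it belongs to.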

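Before running the code, make sure the AWS SDK for Python (boto3) is installed and your AWS credentials are configured, and replace the placeholder values in the usage example, such as 'path/to/your/pdf/file.pdf', with your own.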
