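Building on the previous article, "Taking Photos with a Laptop Camera and Testing PaddleOCR with PyQt5", this post adds a simple extension:
converting PDF documents to Word (.docx) documents.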
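I. Add a button to the interface, as shown in the figure below:
II. Source code changes
1. Copy paddleocr.py directly from the PaddleOCR-release-2.6 source downloaded from GitHub, but comment out the def main(): block.
2. After modifying the UI, save the file. Back in the PyCharm project, right-click ocr_camera.ui, find "External Tools", and choose PyUIC to regenerate ocr_camera.py.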
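The PyUIC external tool usually just wraps PyQt5's pyuic module, so the same regeneration can be done outside PyCharm. A minimal sketch, assuming PyQt5 is installed in the active interpreter; the helper names build_pyuic_command and regenerate_ui are hypothetical and not part of the original project:

```python
import subprocess
import sys
from pathlib import Path


def build_pyuic_command(ui_file):
    """Return the command that converts a .ui file into a same-named .py file."""
    ui_path = Path(ui_file)
    py_path = ui_path.with_suffix(".py")
    # "python -m PyQt5.uic.pyuic in.ui -o out.py" is the usual PyUIC invocation.
    return [sys.executable, "-m", "PyQt5.uic.pyuic", str(ui_path), "-o", str(py_path)]


def regenerate_ui(ui_file):
    """Run pyuic; raises subprocess.CalledProcessError if the conversion fails."""
    subprocess.run(build_pyuic_command(ui_file), check=True)


# Example call (requires PyQt5 and the .ui file to be present):
# regenerate_ui("ocr_camera.ui")
```

Building the command in a separate function makes it easy to print and inspect before actually overwriting ocr_camera.py.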
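3. Changes to the main function:
3.1 Connect the click signal of the new PDF-to-DOC button to its slot:
3.2 The slot function is defined as follows. It opens a file dialog to select a PDF file and then performs the conversion.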
# Module-level imports these methods rely on (assumed; the original listing does not show them).
import os
import cv2
from PyQt5.QtWidgets import QFileDialog
import paddleocr
from paddleocr import get_image_file_list, logger

# The two functions below are methods of the main window class.
def pdfRecognition(self, image_dir):
    # Slot for the PDF-to-DOC button: choose a PDF file, then convert it.
    fname, _ = QFileDialog.getOpenFileName(self, 'Select PDF file', './', 'PDF files(*.PDF *.pdf)')
    if not fname:
        # The dialog was cancelled; nothing to convert.
        return
    self.showtext.append("loadPDF {}".format(fname))
    self.PDFmain(fname)

def PDFmain(self, image_dir):
    # Adapted from main() in paddleocr.py, with the PDF recovery options forced on.
    args = paddleocr.parse_args(mMain=True)
    args.use_pdf2docx_api = True
    args.recovery = True
    args.output = 'pdf'
    print("{}".format(image_dir))
    if paddleocr.is_link(image_dir):
        paddleocr.download_with_progressbar(image_dir, 'tmp.jpg')
        image_file_list = ['tmp.jpg']
    else:
        image_file_list = get_image_file_list(image_dir)
    if len(image_file_list) == 0:
        logger.error('no images find in {}'.format(image_dir))
        self.showtext.append('no images find in {}'.format(image_dir))
        return
    engine = paddleocr.PPStructure()
    for img_path in image_file_list:
        img_name = os.path.basename(img_path).split('.')[0]
        logger.info('{}{}{}'.format('*' * 10, img_path, '*' * 10))
        img, flag_gif, flag_pdf = paddleocr.check_and_read(img_path)
        if not flag_gif and not flag_pdf:
            img = cv2.imread(img_path)

        # PDF input: convert directly with pdf2docx and skip the OCR pipeline.
        if args.recovery and args.use_pdf2docx_api and flag_pdf:
            from pdf2docx.converter import Converter
            # Make sure the output directory exists before pdf2docx writes to it.
            os.makedirs(args.output, exist_ok=True)
            docx_file = os.path.join(args.output, '{}.docx'.format(img_name))
            cv = Converter(img_path)
            cv.convert(docx_file)
            cv.close()
            logger.info('docx save to {}'.format(docx_file))
            self.showtext.append('docx save to {}'.format(docx_file))
            continue

        if not flag_pdf:
            if img is None:
                logger.error("error in loading image:{}".format(img_path))
                continue
            img_paths = [[img_path, img]]
        else:
            # Save each PDF page as a JPEG so it can be processed page by page.
            img_paths = []
            for index, pdf_img in enumerate(img):
                os.makedirs(os.path.join(args.output, img_name), exist_ok=True)
                pdf_img_path = os.path.join(args.output, img_name,
                                            img_name + '_' + str(index) + '.jpg')
                cv2.imwrite(pdf_img_path, pdf_img)
                img_paths.append([pdf_img_path, pdf_img])

        all_res = []
        for index, (new_img_path, img) in enumerate(img_paths):
            logger.info('processing {}/{} page:'.format(index + 1, len(img_paths)))
            self.showtext.append('processing {}/{} page:'.format(index + 1, len(img_paths)))
            new_img_name = os.path.basename(new_img_path).split('.')[0]
            result = engine(new_img_path, img_idx=index)
            paddleocr.save_structure_res(result, args.output, img_name, index)

            if args.recovery and result != []:
                from copy import deepcopy
                from ppstructure.recovery.recovery_to_doc import sorted_layout_boxes
                h, w, _ = img.shape
                result_cp = deepcopy(result)
                result_sorted = sorted_layout_boxes(result_cp, w)
                all_res += result_sorted

        if args.recovery and all_res != []:
            try:
                from ppstructure.recovery.recovery_to_doc import convert_info_docx
                convert_info_docx(img, all_res, args.output, img_name)
            except Exception as ex:
                logger.error("error in layout recovery image:{}, err msg: {}".format(img_name, ex))
                continue

        for item in all_res:
            item.pop('img')
            item.pop('res')
            logger.info(item)
        logger.info('result save to {}'.format(args.output))
        self.showtext.append('result save to {}'.format(args.output))
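With the OCR branches set aside, the PDF case in the listing above comes down to a single pdf2docx call. A minimal standalone sketch, assuming pdf2docx is installed; the helper names docx_path_for and pdf_to_docx are hypothetical. It uses os.path.splitext instead of split('.')[0], so a file such as report.v2.pdf keeps its full base name:

```python
import os


def docx_path_for(pdf_path, output_dir):
    """Build the output path: same base name as the PDF, with a .docx extension."""
    stem = os.path.splitext(os.path.basename(pdf_path))[0]
    return os.path.join(output_dir, stem + ".docx")


def pdf_to_docx(pdf_path, output_dir="pdf"):
    """Convert one PDF to DOCX with pdf2docx and return the output path."""
    # Imported lazily so the path helper works even without pdf2docx installed.
    from pdf2docx import Converter

    os.makedirs(output_dir, exist_ok=True)
    docx_file = docx_path_for(pdf_path, output_dir)
    cv = Converter(pdf_path)
    try:
        cv.convert(docx_file)
    finally:
        cv.close()  # release the PDF even if the conversion fails
    return docx_file
```

Calling pdf_to_docx("sample.pdf") would write pdf/sample.docx, matching the output directory used in PDFmain; sample.pdf is only a placeholder file name.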
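III. Running
The first run fails with: AttributeError: 'Document' object has no attribute 'pageCount'
According to posts online, this is caused by an incompatible PyMuPDF version, and the suggested fix is to install version 1.18.14: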
pip install PyMuPDF==1.18.14
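After the installation, pip warns that pdf2docx requires PyMuPDF>=1.19.0. Running the program anyway does indeed fail.
After reinstalling version 1.19.0, the program runs successfully: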
pip install PyMuPDF==1.19.0
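The version constraint can be checked before launching the program. A minimal sketch, assuming Python 3.8+ (for importlib.metadata) and plain dotted version strings; the function names are hypothetical, and the 1.19.0 minimum is simply the requirement reported by pip above:

```python
from importlib.metadata import PackageNotFoundError, version


def parse_version(text):
    """Turn a plain dotted version such as '1.19.0' into a tuple of ints.

    Simplification: suffixes like 'rc1' are not handled.
    """
    return tuple(int(part) for part in text.split("."))


def meets_minimum(installed, minimum="1.19.0"):
    """Return True if the installed version is at least the minimum."""
    # Comparing int tuples avoids string pitfalls such as '1.9' > '1.19'.
    return parse_version(installed) >= parse_version(minimum)


def check_pymupdf(minimum="1.19.0"):
    """Report whether the installed PyMuPDF satisfies the minimum version."""
    try:
        installed = version("PyMuPDF")
    except PackageNotFoundError:
        return "PyMuPDF is not installed"
    status = "OK" if meets_minimum(installed, minimum) else "too old"
    return "PyMuPDF {} ({}, need >= {})".format(installed, status, minimum)
```

With these helpers, 1.18.14 is rejected and 1.19.0 is accepted, which matches the two installation attempts described above.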
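IV. Testing
1. On the first test run, the required models are downloaded.
2. The opened PDF file is converted to a docx file with the same name.
A before-and-after comparison of the conversion is shown below: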