update posts: pdf2docx technical documentation
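---
categories: [process automation]
tags: [python]
---

# pdf2docx Development Overview: Parsing Paragraphs

---

After [table parsing](2020-08-15-pdf2docx开发概要:解析表格.md), we have a consolidated set of block elements: text/image blocks (`TextBlock`) and table blocks (`TableBlock`). Text/image blocks are rebuilt as paragraphs and table blocks as tables; block elements inside table cells are processed recursively with the same logic. On this basis, we compute the spacing between adjacent elements, such as the space before and after a paragraph in the vertical direction and the paragraph indentation in the horizontal direction.
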
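## Vertical Positioning: Paragraph Spacing

Block-level elements are positioned relative to each other by spacing. Because paragraphs in Word **have space-before and space-after properties**, text blocks serve as the reference elements for positioning.

Rules for determining vertical spacing:

- Consider two vertically adjacent block-level elements: the former is the reference block and the latter is the current block. For the first block-level element, the reference is the top margin and the current block is the element itself.

- If the current block is a text or image block (regardless of whether the reference block is text, image, or table), set the current block's `before_space` to the vertical distance between the two.

- If the current block is a table block, look at the reference block:
    - If the reference block is a text or image block, set the reference block's `after_space`.
    - If the reference block is also a table block, set the current block's `before_space`.

!!! warning "Note"
    - `docx` cannot set the spacing between two tables directly. As a workaround when creating tables, insert an empty text block in the gap and set that block's space-before.
    - If a table is at the end, `MS Word` automatically appends an empty paragraph with standard spacing, which may cause an unintended page break. In that case, explicitly add an empty paragraph with the smallest possible spacing, for example zero space before and after and a line height of 1 pt.
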
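The vertical spacing rules can be sketched roughly as follows. This is a minimal illustration, not the pdf2docx implementation: the `Block` class, the `kind` labels, and `assign_vertical_spacing` are hypothetical names, and the handling of a table that is the first element on the page (its gap goes to its own `before_space`) is an assumption.

```python
from dataclasses import dataclass

TEXT, IMAGE, TABLE = "text", "image", "table"


@dataclass
class Block:
    """Hypothetical stand-in for a block-level element."""
    kind: str            # TEXT, IMAGE or TABLE
    top: float           # y-coordinate of the upper edge
    bottom: float        # y-coordinate of the lower edge
    before_space: float = 0.0
    after_space: float = 0.0


def assign_vertical_spacing(blocks, top_margin):
    """Assign before/after spacing to blocks sorted from top to bottom."""
    ref = None                # reference block; None means the top margin
    ref_bottom = top_margin
    for block in blocks:
        gap = max(block.top - ref_bottom, 0.0)
        if block.kind != TABLE:
            # Text or image block: always owns the gap as space-before.
            block.before_space = gap
        elif ref is not None and ref.kind != TABLE:
            # Table after text/image: the gap goes to the reference block.
            ref.after_space = gap
        else:
            # Table after table, or a table first on the page (assumption).
            block.before_space = gap
        ref, ref_bottom = block, block.bottom
    return blocks


blocks = assign_vertical_spacing(
    [
        Block(TEXT, 100, 150),
        Block(TABLE, 160, 300),
        Block(TABLE, 320, 400),
        Block(IMAGE, 410, 500),
    ],
    top_margin=72,
)
```

When the document is written, a table's `before_space` would still have to be realized through an empty paragraph, as described in the note above.
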
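## Vertical Positioning: Line Spacing

Given a block's height and the number of lines it contains, the average line spacing is easy to compute. Word offers two ways to set line spacing:

- Fixed value: set it directly to the computed average line spacing. It positions text precisely, but it does not adapt to font-size changes, which makes editing harder. For example, increasing the font size may cause the text on a line to be clipped.

- Multiple line spacing: the ratio to single line spacing. Its pros and cons are exactly the opposite of fixed spacing.

Overall, multiple line spacing adapts better, so it has been used since v0.5.2.

Note that multiple line spacing is neither simply the ratio of line height to font size nor the commonly cited factor of 1.2; it **depends on the specific font**. A separate post will cover how to compute it:

> [To be written...](to_do)
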
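As a small worked example of the two options, the sketch below computes the average line spacing and converts it to a multiple. Both function names are hypothetical, and `single_line_height` (the height of one line at single spacing for the font in use) is assumed to be known; obtaining that font-dependent value is exactly the open question mentioned above.

```python
def average_line_spacing(block_height, line_count):
    """Average distance between lines, from block height and line count."""
    if line_count < 1:
        raise ValueError("a block must contain at least one line")
    return block_height / line_count


def line_spacing_multiple(avg_spacing, single_line_height):
    """Ratio to single line spacing; single_line_height is font-dependent."""
    if single_line_height <= 0:
        raise ValueError("single_line_height must be positive")
    return avg_spacing / single_line_height


fixed = average_line_spacing(block_height=60.0, line_count=4)
multiple = line_spacing_multiple(fixed, single_line_height=12.0)
```
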
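## Horizontal Positioning: Alignment and Indentation

Horizontal alignment (left, center, right, or justified) is determined from two perspectives: internal (how the `Line` elements inside a block align with each other) and external (where the block sits on the page). Left alignment is the default, because combined with paragraph left indentation and tab stops it can always position any element correctly.

**Internal alignment**: the positional relationship between lines

- If the block contains discontinuous lines (adjacent `line`s are separated by a noticeable gap), use left alignment so that tab stops can position them.
- If there is only one line, fall back to the external alignment.
- Otherwise, check whether the differences among the lines' left edges, right edges, and centers are below a given threshold, i.e., whether the lines are aligned:
    - Both left and right aligned: with at least 3 lines the block is justified; otherwise the alignment is undetermined and the external alignment must be consulted.
    - Otherwise, test left, right, and center alignment in that order.

!!! warning "Note"
    - When testing left alignment, exclude the first line, since it may have a first-line indent or a hanging indent. If left alignment holds, compute the first line's indent (a negative value means a hanging indent).
    - When testing right alignment, exclude the last line to allow for justified text: the last line may not be full, yet the block still counts as justified.
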
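A minimal sketch of the internal alignment test, under stated assumptions: each line is reduced to an `(x0, x1)` pair of left and right edges, `tol` is a hypothetical threshold, the detection of discontinuous lines is passed in as a flag, and falling back to left alignment when no test succeeds is an assumption based on left being the default.

```python
LEFT, RIGHT, CENTER, JUSTIFY, UNKNOWN = "left", "right", "center", "justify", "unknown"


def spread(values):
    """Largest difference within a list of coordinates (0 if empty)."""
    return max(values) - min(values) if values else 0.0


def internal_alignment(lines, tol=1.0, has_gap=False):
    """Guess alignment from line edges; lines are (x0, x1), top to bottom."""
    if has_gap:
        return LEFT       # discontinuous lines: rely on tab stops
    if len(lines) <= 1:
        return UNKNOWN    # defer to the external alignment
    # Skip the first line for the left test (it may be indented) and the
    # last line for the right test (it may be short in justified text).
    left_ok = spread([x0 for x0, _ in lines[1:]]) <= tol
    right_ok = spread([x1 for _, x1 in lines[:-1]]) <= tol
    center_ok = spread([(x0 + x1) / 2 for x0, x1 in lines]) <= tol
    if left_ok and right_ok:
        return JUSTIFY if len(lines) >= 3 else UNKNOWN
    if left_ok:
        return LEFT
    if right_ok:
        return RIGHT
    if center_ok:
        return CENTER
    return LEFT           # assumption: fall back to the default


def first_line_indent(lines):
    """Indent of the first line relative to the second; negative = hanging."""
    return lines[0][0] - lines[1][0]
```

For example, `internal_alignment([(120, 500), (100, 500), (100, 500), (100, 300)])` reports justified text with a first-line indent of 20, even though the last line is short.
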
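**External alignment**: the block's position on the page

Compute the distances between the block and the page, namely the left margin, the right margin, and the difference between centers, then test in order:

- The block center is very close to the page center -> center alignment
- Otherwise, compare the left and right margins; the smaller one gives the alignment
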
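The external test can be sketched in a few lines. The function name, the `tol` threshold, and the choice to describe the page by the horizontal bounds of its content area (`page_x0`, `page_x1`) are assumptions made for illustration.

```python
def external_alignment(block_x0, block_x1, page_x0, page_x1, tol=1.0):
    """Guess alignment from the block's position within the page content area."""
    left_gap = block_x0 - page_x0
    right_gap = page_x1 - block_x1
    center_diff = abs((block_x0 + block_x1) / 2 - (page_x0 + page_x1) / 2)
    if center_diff <= tol:
        return "center"
    # Whichever side the block sits closer to decides the alignment.
    return "left" if left_gap <= right_gap else "right"
```
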
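Finally, when creating the `docx`, the result is applied by setting the paragraph's left/right indentation and alignment. For left alignment, **tab stops** are additionally needed to fix the horizontal position of the different lines within the paragraph.

!!! warning "Some optimizations when rebuilding the docx"
    - Setting both the left and right margins pins the paragraph's horizontal extent strictly, which may cause unexpected line wraps. The margin on the "opposite" side can therefore be relaxed: relax (reduce) the right margin for left alignment, the left margin for right alignment, and both margins for center alignment.
    - If there is only one line, push this relaxation to the limit by setting the corresponding margin to 0.
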
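The relaxation could look like the sketch below. The function name and the `factor` parameter are hypothetical; the text only states that the opposite margin is reduced and that it becomes 0 for a single line, so the value 0.5 is purely illustrative.

```python
def relax_indents(alignment, left, right, line_count, factor=0.5):
    """Return (left, right) indents with the 'opposite' side relaxed."""
    if line_count == 1:
        factor = 0.0      # single line: relax to the limit
    if alignment == "left":
        right *= factor
    elif alignment == "right":
        left *= factor
    elif alignment == "center":
        left *= factor
        right *= factor
    # Other alignments (e.g. justified) keep both indents unchanged.
    return left, right
```
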
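## Text Styles

The parsing above determines a paragraph's position on the page and its presentation; next we go down to the text inside it. The raw text blocks extracted by `PyMuPDF` already carry properties such as font, color, italic, and bold, but styles such as `highlight`, `underline`, and `strike-through` must be inferred from the **positional relationship between text and shapes**.

See the following post for details:

> [pdf2docx Development Overview: Parsing Text Styles](2020-07-20-pdf2docx开发概要:解析文本样式.md)

## Data Structure

In summary, on top of the standard `Block` data structure (`Line` -> `Span` -> `Char`), the text block `TextBlock` introduces the following positioning-related properties:
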
```python
# text block
{
    "type": 0,
    "bbox": [float, float, float, float],

    # ----- vertical spacing -----
    "before_space": float,
    "line_space": float,
    "after_space": float,

    # ----- horizontal spacing -----
    "alignment": int,
    "left_space": float,
    "right_space": float,
    "first_line_space": float,
    "tab_stops": [float, float, ...],

    "lines": [
        {
            "bbox": [float, float, float, float],
            "wmode": int,
            "dir": [float, float],
            "line_break": int,  # new property
            "tab_stop": int,    # new property
            "spans": [
                {
                    "bbox": [float, float, float, float],
                    "color": int,
                    "font": str,
                    "size": float,
                    "flags": int,
                    "text": str,
                    "chars": [{
                        "bbox": [float, float, float, float],
                        "c": str
                    }]
                }
            ]
        }
    ]
}
```