# ofdtotext

**Repository Path**: cheeryoung/ofdtotext

## Basic Information

- **Project Name**: ofdtotext
- **Description**: Extract text from OFD documents
- **Primary Language**: Python
- **License**: Not specified
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 1
- **Created**: 2024-10-22
- **Last Updated**: 2024-10-22

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# ofd2txt

## Screenshot

![](./screenshot.png)

## Usage

From the command line:
```bash
python3 ofd_test.py 1.ofd
```

From Python code:
```python
from ofdtotext import OFDFile


doc = OFDFile('test.ofd')
print(doc.get_text())
```


# ref
The core code is adapted from [ofd2img](https://github.com/geniusnut/ofd2img).

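## How it works

First, the OFD file is unzipped using code from the **ofd2img** project (like docx, an OFD file is a zip archive).
Next, convert the XML to JSON with an [online converter](https://www.freeformatter.com/xml-to-json-converter.html#ad-output); the JSON makes it easy to see where the text sits in the hierarchy. Defining the following data structures, one per level, is then enough to extract all the text: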

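```python
# Each wrapper takes a parsed XML node. As in the original sketch, a node is
# assumed to expose `.text`, `.children`, and child lookup by tag (node['Tag']).

class TextCode:
    def __init__(self, text_code):
        # The text itself is the content of the TextCode element.
        self.text = text_code.text


class TextObject:
    def __init__(self, text_obj):
        self.text_code = TextCode(text_obj['TextCode'])


class Layer:
    def __init__(self, layer):
        # One TextObject per child of the layer.
        self.text_obj = [TextObject(i) for i in layer.children]


class Content:
    def __init__(self, content):
        self.layer = Layer(content['Layer'])
```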
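
The same idea can be tried with only the Python standard library. The sketch below is not this project's implementation: `extract_text` and `local_name` are hypothetical helpers. It assumes that the text lives in `TextCode` elements and that scanning every `.xml` entry of the archive is acceptable. Text is returned in archive order, which is not guaranteed to match page order.

```python
import zipfile
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit('}', 1)[-1]


def extract_text(ofd_path):
    """Return the text of every TextCode element found in the archive.

    Hypothetical helper: walks all XML entries instead of following the
    document's page index, so the order is the zip archive's order.
    """
    pieces = []
    with zipfile.ZipFile(ofd_path) as archive:
        for name in archive.namelist():
            if not name.lower().endswith('.xml'):
                continue  # skip images, fonts, and other resources
            root = ET.fromstring(archive.read(name))
            for element in root.iter():
                if local_name(element.tag) == 'TextCode' and element.text:
                    pieces.append(element.text)
    return pieces
```

Calling `extract_text('test.ofd')` returns a list of strings, which can be joined with newlines to get plain text. Matching on the local tag name avoids hard-coding an XML namespace.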