perf: Enhance Word parsing #2612

Merged
merged 1 commit into from
Mar 19, 2025
39 changes: 33 additions & 6 deletions apps/common/handle/impl/doc_split_handle.py
@@ -110,24 +110,51 @@ def get_image_id(image_id):
     return get_image_id


+title_font_list = [
+    [36, 100],
+    [26, 36],
+    [24, 26],
+    [22, 24],
+    [18, 22],
+    [16, 18]
+]
+
+
+def get_title_level(paragraph: Paragraph):
+    try:
+        if paragraph.style is not None:
+            psn = paragraph.style.name
+            if psn.startswith('Heading') or psn.startswith('TOC 标题') or psn.startswith('标题'):
+                return int(psn.replace("Heading ", '').replace('TOC 标题', '').replace('标题',
+                                                                                      ''))
+        if len(paragraph.runs) == 1:
+            font_size = paragraph.runs[0].font.size
+            pt = font_size.pt
+            if pt >= 16:
+                for _value, index in zip(title_font_list, range(len(title_font_list))):
+                    if pt >= _value[0] and pt < _value[1]:
+                        return index + 1
+    except Exception as e:
+        pass
+    return None
+
+
 class DocSplitHandle(BaseSplitHandle):
     @staticmethod
     def paragraph_to_md(paragraph: Paragraph, doc: Document, images_list, get_image_id):
         try:
-            psn = paragraph.style.name
-            if psn.startswith('Heading') or psn.startswith('TOC 标题') or psn.startswith('标题'):
-                title = "".join(["#" for i in range(
-                    int(psn.replace("Heading ", '').replace('TOC 标题', '').replace('标题',
-                                                                                  '')))]) + " " + paragraph.text
+            title_level = get_title_level(paragraph)
+            if title_level is not None:
+                title = "".join(["#" for i in range(title_level)]) + " " + paragraph.text
                 images = reduce(lambda x, y: [*x, *y],
                                 [get_paragraph_element_images(e, doc, images_list, get_image_id) for e in
                                  paragraph._element],
                                 [])

                 if len(images) > 0:
                     return title + '\n' + images_to_string(images, doc, images_list, get_image_id) if len(
                         paragraph.text) > 0 else images_to_string(images, doc, images_list, get_image_id)
                 return title

         except Exception as e:
             traceback.print_exc()
             return paragraph.text
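
The font-size fallback added in this hunk can be tried in isolation. The following is a minimal sketch, assuming the same half-open point ranges as `title_font_list`; `level_for_point_size` is a hypothetical helper name that is not part of the PR, and it takes a plain number rather than a python-docx `Paragraph`.

```python
from typing import Optional

# Half-open ranges [lower, upper) in points, copied from the diff above.
# Position in the list determines the heading level (index 0 -> level 1).
TITLE_FONT_RANGES = [
    (36, 100),
    (26, 36),
    (24, 26),
    (22, 24),
    (18, 22),
    (16, 18),
]


def level_for_point_size(pt: Optional[float]) -> Optional[int]:
    """Hypothetical helper: map a font size in points to a heading level."""
    if pt is None:
        # A run may carry no explicit size; treat that as "not a title".
        return None
    for level, (lower, upper) in enumerate(TITLE_FONT_RANGES, start=1):
        if lower <= pt < upper:
            return level
    return None
```

Under these ranges a 28 pt run maps to level 2, while anything below 16 pt, or at 100 pt and above, yields `None`.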
Contributor Author

In general, there is nothing fundamentally wrong with the code. Here are some suggestions for improvement:

Improvements & Suggestions

  1. Consistent Error Handling:

    • In the get_title_level function, the except block swallows every exception with pass. Log the error or re-raise it so that failures do not go unnoticed.
  2. Code Structure:

    • Consider extracting common logic into separate methods for readability and maintainability.
  3. Return Type Hinting (Optional but Recommended):

    • get_title_level can return None, so a return annotation such as Optional[int] makes that explicit and helps catch errors during development.
  4. Exception Handling:

    • Catch specific exceptions where possible, and prefer Python's logging module over traceback.print_exc(), since it gives structured, configurable output.
  5. Performance Optimizations:

    • There are no significant performance bottlenecks in this code; keeping it clean and well documented will benefit future maintenance.

Here's an updated version with some of these improvements:

from functools import reduce
from typing import Optional
import traceback

from docx.document import Document
from docx.text.paragraph import Paragraph


# Half-open point-size ranges [lower, upper); index 0 maps to heading level 1
title_font_list = [
    (36, 100),
    (26, 36),
    (24, 26),
    (22, 24),
    (18, 22),
    (16, 18)
]


def get_title_level(paragraph: Paragraph) -> Optional[int]:
    """Determine the heading level from the style name, falling back to font size."""
    try:
        psn = paragraph.style.name if paragraph.style is not None else ''
        if psn.startswith('Heading') or psn.startswith('TOC 标题') or psn.startswith('标题'):
            # Extract the number from the style name, e.g. 'Heading 2' -> 2
            return int(''.join(filter(str.isdigit, psn)))
        elif len(paragraph.runs) == 1:
            font_size = paragraph.runs[0].font.size
            if font_size is not None:
                for index, (lower_bound, upper_bound) in enumerate(title_font_list):
                    if lower_bound <= font_size.pt < upper_bound:
                        return index + 1
    except Exception as e:
        # Log the error or re-raise it depending on your needs
        print(f"Error determining title level: {e}")
    return None


# BaseSplitHandle, get_paragraph_element_images and images_to_string come from the surrounding module
class DocSplitHandle(BaseSplitHandle):
    @staticmethod
    def paragraph_to_md(paragraph: Paragraph, doc: Document, images_list, get_image_id):
        """Convert a Word document paragraph to Markdown format."""
        try:
            title_level = get_title_level(paragraph)
            if title_level is not None:
                title = "#" * title_level + " " + paragraph.text
                images = reduce(lambda x, y: [*x, *y],
                                [get_paragraph_element_images(e, doc, images_list, get_image_id) for e in
                                 paragraph._element],
                                [])
                if len(images) > 0:
                    images_md = images_to_string(images, doc, images_list, get_image_id)
                    return f"{title}\n{images_md}" if len(paragraph.text) > 0 else images_md
                return title
        except Exception:
            # Optionally log this exception or re-raise
            traceback.print_exc()
            return paragraph.text

These changes enhance the readability and robustness of your code while maintaining the functionality you intended.
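
Suggestion 4 can be illustrated with the standard `logging` module. This is a minimal sketch rather than the PR's code: `parse_heading_level` is a hypothetical helper that works on a style-name string, and it catches only `ValueError`, which is what `int()` raises for a string with no digits.

```python
import logging
from typing import Optional

logger = logging.getLogger(__name__)


def parse_heading_level(style_name: str) -> Optional[int]:
    """Hypothetical helper: extract the level from names like 'Heading 2'."""
    digits = "".join(ch for ch in style_name if ch.isdigit())
    try:
        return int(digits)
    except ValueError:
        # int('') raises ValueError when the style name carries no number,
        # e.g. a bare 'TOC 标题'; report it instead of passing silently.
        logger.warning("No heading level in style name %r", style_name)
        return None
```

Catching the narrow exception keeps unrelated bugs visible, and the logger call can be routed or silenced through normal logging configuration.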

12 changes: 8 additions & 4 deletions apps/common/util/split_model.py
@@ -339,13 +339,14 @@ def parse(self, text: str):
         for e in result:
             if len(e['content']) > 4096:
                 pass
-        return [item for item in [self.post_reset_paragraph(row) for row in result] if
+        title_list = list(set([row.get('title') for row in result]))
+        return [item for item in [self.post_reset_paragraph(row, title_list) for row in result] if
                 'content' in item and len(item.get('content').strip()) > 0]

-    def post_reset_paragraph(self, paragraph: Dict):
+    def post_reset_paragraph(self, paragraph: Dict, title_list: List[str]):
         result = self.filter_title_special_characters(paragraph)
         result = self.sub_title(result)
-        result = self.content_is_null(result)
+        result = self.content_is_null(result, title_list)
         return result

     @staticmethod
@@ -357,11 +358,14 @@ def sub_title(paragraph: Dict):
         return paragraph

     @staticmethod
-    def content_is_null(paragraph: Dict):
+    def content_is_null(paragraph: Dict, title_list: List[str]):
         if 'title' in paragraph:
             title = paragraph.get('title')
             content = paragraph.get('content')
             if (content is None or len(content.strip()) == 0) and (title is not None and len(title) > 0):
+                find = [t for t in title_list if t.__contains__(title) and t != title]
+                if find:
+                    return {'title': '', 'content': ''}
                 return {'title': '', 'content': title}
         return paragraph

Contributor Author

The code looks mostly correct but contains some potential improvements:

  1. Type Hinting Consistency: The paragraph parameter of post_reset_paragraph is annotated as Dict, in line with the other methods. A more specific annotation such as Dict[str, Any] would make the expected shape of the dictionary explicit.

  2. Return Value Type: post_reset_paragraph returns a single dictionary, not a list; the list is built by the comprehension in parse. Annotating the return type as a dictionary documents the output structure without requiring the reader to trace the body.

Here is the proposed change:

    def post_reset_paragraph(self, paragraph: Dict[str, Any], title_list: List[str]) -> Dict[str, Any]:
        ...  # Existing implementation

With these annotations the code becomes easier to read and maintain, and the same treatment applies to the static methods sub_title and content_is_null.
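
The effect of the new `title_list` argument can be seen in a standalone sketch. It mirrors the `content_is_null` logic from the diff as a free function, assuming paragraphs are plain dicts with `title` and `content` keys; the sample titles are invented for illustration, and the comments on intent are an interpretation rather than something the PR states.

```python
from typing import Dict, List


def content_is_null(paragraph: Dict, title_list: List[str]) -> Dict:
    """Standalone copy of the diff's handling of a title with no content."""
    if 'title' in paragraph:
        title = paragraph.get('title')
        content = paragraph.get('content')
        if (content is None or len(content.strip()) == 0) and (title is not None and len(title) > 0):
            # Another, longer title contains this one, so the text presumably
            # already appears elsewhere: drop the paragraph entirely.
            if any(title in t and t != title for t in title_list):
                return {'title': '', 'content': ''}
            # Otherwise keep the orphan title by demoting it to content.
            return {'title': '', 'content': title}
    return paragraph


titles = ['Intro', 'Intro Overview', 'Summary']
dropped = content_is_null({'title': 'Intro', 'content': ''}, titles)
demoted = content_is_null({'title': 'Summary', 'content': ''}, titles)
```

Here `dropped` comes back empty because 'Intro Overview' contains 'Intro', while `demoted` keeps 'Summary' as content. If `title_list` can contain `None` (a row without a title), the substring test would raise, so callers may want to filter such entries first.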
