perf: Enhance Word parsing #2612

shaohuzhang1 · 2025-03-19T04:04:36Z

perf: Enhance Word parsing

f2c-ci-robot · 2025-03-19T04:04:41Z

Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected, please follow our release note process to remove it.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

f2c-ci-robot · 2025-03-19T04:04:43Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

shaohuzhang1 · 2025-03-19T04:05:00Z

apps/common/handle/impl/doc_split_handle.py

                if len(images) > 0:
                    return title + '\n' + images_to_string(images, doc, images_list, get_image_id) if len(
                        paragraph.text) > 0 else images_to_string(images, doc, images_list, get_image_id)
                return title
+
        except Exception as e:
            traceback.print_exc()
            return paragraph.text


In general, there isn't anything fundamentally wrong with your code. However, here are some suggestions for improvement:

Improvements & Suggestions

Consistent Error Handling:

In the get_title_level function, it's better to raise an exception when an error occurs rather than simply logging it silently.

Code Dexterity:

Consider extracting common logic into separate methods for readability and maintainability.

Return Type Hinting (Optional but Recommended):

Adding type hints can make your code more readable and help catch errors during development.

Exception Handling:

You might want to handle specific exceptions more selectively or use Python logging, which provides more structured output compared to simple string printing.

Performance Optimizations:

There currently aren't significant performance bottlenecks in your code, but keeping it clean and well-documented will benefit future maintenance.

Here's the updated version with some of these improvements:

from typing import List import traceback from docx.api.text.paragraph import Paragraph from docx.shared.document import Document # Assuming you have a list of tuples defining title font sizes title_font_list = [ (36, 100), (26, 36), (24, 24), (22, 22), (18, 22), (16, 18) ] def get_title_level(paragraph: Paragraph) -> int: """Determine the level of a heading based on its style.""" try: psn = paragraph.style.name if paragraph.style else '' if psn.startswith('Heading') or psn == 'TOC 标题' or psn == '标题': # Extract the number from the style name level = int(filter(str.isdigit, psn.split())) elif len(paragraph.runs) == 1: font_size = paragraph.runs[0].font.size.pt # Convert font size to points for _value, index in enumerate(title_font_list): lower_bound, upper_bound = _value if lower_bound <= font_size < upper_bound: return index + 1 except Exception as e: # Log the error or re-raise it depending on your needs print(f"Error determining title level: {e}") return None class DocSplitHandle(BaseSplitHandle): @staticmethod def paragraph_to_md(paragraph: Paragraph, doc: Document, images_list, get_image_id): """Convert a Word document paragraph to Markdown format.""" try: title_level = get_title_level(paragraph) if title_level is not None: title = "#" * title_level + " " + paragraph.text images = reduce(lambda x, y: [*x, *y], [get_paragraph_element_images(e, doc, images_list, get_image_id) for e in paragraph._element], []) if len(images) > 0: content = f"{title}\n{images_to_string(images, doc, images_list, get_image_id)}" if len( paragraph.text) > 0 else \ images_to_string(images, doc, images_list, get_image_id) else: content = title return content except Exception as e: # Optionally log this exception or re-raise print(f"Error processing paragraph: {e}") traceback.print_exc() return paragraph.text

These changes enhance the readability and robustness of your code while maintaining the functionality you intended.

shaohuzhang1 · 2025-03-19T04:05:04Z

apps/common/util/split_model.py

        if 'title' in paragraph:
            title = paragraph.get('title')
            content = paragraph.get('content')
            if (content is None or len(content.strip()) == 0) and (title is not None and len(title) > 0):
+                find = [t for t in title_list if t.__contains__(title) and t != title]
+                if find:
+                    return {'title': '', 'content': ''}
                return {'title': '', 'content': title}
        return paragraph



The code looks mostly correct but contains some potential improvements:

Type Hinting Consistency: The post_reset_paragraph method's parameter paragraph should be consistent with the existing methods, which use type hinting (Dict). Consider changing it to match other parameters.

Return Value Type: Ensure that the final function returns a list, even though each element returned is already of dictionary type. This clarity can help in understanding the expected output structure without further examination of individual elements.

Here are the proposed changes:

def post_reset_paragraph(self, paragraph: Dict[str, Any], title_list: List[str]) -> List[Dict]: # Existing implementation

By making these changes, the code becomes more readable and maintainable, especially when dealing with static methods like sub_title and content_is_null.

perf: Enhance Word parsing

aaf328e

f2c-ci-robot bot added the do-not-merge/release-note-label-needed label Mar 19, 2025

shaohuzhang1 merged commit 5ec9486 into main Mar 19, 2025
4 checks passed

shaohuzhang1 deleted the pr@main@refactor_word_parsing branch March 19, 2025 04:04

shaohuzhang1 commented Mar 19, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

perf: Enhance Word parsing #2612

perf: Enhance Word parsing #2612

Uh oh!

shaohuzhang1 commented Mar 19, 2025

Uh oh!

f2c-ci-robot bot commented Mar 19, 2025

Uh oh!

f2c-ci-robot bot commented Mar 19, 2025

Uh oh!

Uh oh!

shaohuzhang1 Mar 19, 2025

Uh oh!

shaohuzhang1 Mar 19, 2025

Uh oh!

Uh oh!

perf: Enhance Word parsing #2612

perf: Enhance Word parsing #2612

Uh oh!

Conversation

shaohuzhang1 commented Mar 19, 2025

Uh oh!

f2c-ci-robot bot commented Mar 19, 2025

Uh oh!

f2c-ci-robot bot commented Mar 19, 2025

Uh oh!

Uh oh!

shaohuzhang1 Mar 19, 2025

Choose a reason for hiding this comment

Improvements & Suggestions

Uh oh!

shaohuzhang1 Mar 19, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!