Skip to content

perf: Enhance Word parsing #2612

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Mar 19, 2025
Merged

Conversation

shaohuzhang1
Copy link
Contributor

perf: Enhance Word parsing

Copy link

f2c-ci-robot bot commented Mar 19, 2025

Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected, please follow our release note process to remove it.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Copy link

f2c-ci-robot bot commented Mar 19, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@shaohuzhang1 shaohuzhang1 merged commit 5ec9486 into main Mar 19, 2025
4 checks passed
@shaohuzhang1 shaohuzhang1 deleted the pr@main@refactor_word_parsing branch March 19, 2025 04:04
if len(images) > 0:
return title + '\n' + images_to_string(images, doc, images_list, get_image_id) if len(
paragraph.text) > 0 else images_to_string(images, doc, images_list, get_image_id)
return title

except Exception as e:
traceback.print_exc()
return paragraph.text
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general, there isn't anything fundamentally wrong with your code. However, here are some suggestions for improvement:

Improvements & Suggestions

  1. Consistent Error Handling:

    • In the get_title_level function, it's better to raise an exception when an error occurs rather than simply logging it silently.
  2. Code Dexterity:

    • Consider extracting common logic into separate methods for readability and maintainability.
  3. Return Type Hinting (Optional but Recommended):

    • Adding type hints can make your code more readable and help catch errors during development.
  4. Exception Handling:

    • You might want to handle specific exceptions more selectively or use Python logging, which provides more structured output compared to simple string printing.
  5. Performance Optimizations:

    • There currently aren't significant performance bottlenecks in your code, but keeping it clean and well-documented will benefit future maintenance.

Here's the updated version with some of these improvements:

from typing import List
import traceback
from docx.api.text.paragraph import Paragraph
from docx.shared.document import Document


# Assuming you have a list of tuples defining title font sizes
title_font_list = [
    (36, 100),
    (26, 36),
    (24, 24),
    (22, 22),
    (18, 22),
    (16, 18)
]


def get_title_level(paragraph: Paragraph) -> int:
    """Determine the level of a heading based on its style."""
    try:
        psn = paragraph.style.name if paragraph.style else ''
        if psn.startswith('Heading') or psn == 'TOC 标题' or psn == '标题':
            # Extract the number from the style name
            level = int(filter(str.isdigit, psn.split()))
        elif len(paragraph.runs) == 1:
            font_size = paragraph.runs[0].font.size.pt  # Convert font size to points
            for _value, index in enumerate(title_font_list):
                lower_bound, upper_bound = _value
                if lower_bound <= font_size < upper_bound:
                    return index + 1
    except Exception as e:
        # Log the error or re-raise it depending on your needs
        print(f"Error determining title level: {e}")
    return None




class DocSplitHandle(BaseSplitHandle):
    @staticmethod
    def paragraph_to_md(paragraph: Paragraph, doc: Document, images_list, get_image_id):
        """Convert a Word document paragraph to Markdown format."""
        
        try:
            title_level = get_title_level(paragraph)
            
            if title_level is not None:
                title = "#" * title_level + " " + paragraph.text
                images = reduce(lambda x, y: [*x, *y],
                                [get_paragraph_element_images(e, doc, images_list, get_image_id) for e in
                                 paragraph._element],
                                [])
                
                if len(images) > 0:
                    content = f"{title}\n{images_to_string(images, doc, images_list, get_image_id)}" if len(
                        paragraph.text) > 0 else \
                                    images_to_string(images, doc, images_list, get_image_id)
                else:
                    content = title
            
                return content
        
        except Exception as e:
            # Optionally log this exception or re-raise
            print(f"Error processing paragraph: {e}")
            traceback.print_exc()
            return paragraph.text

These changes enhance the readability and robustness of your code while maintaining the functionality you intended.

if 'title' in paragraph:
title = paragraph.get('title')
content = paragraph.get('content')
if (content is None or len(content.strip()) == 0) and (title is not None and len(title) > 0):
find = [t for t in title_list if t.__contains__(title) and t != title]
if find:
return {'title': '', 'content': ''}
return {'title': '', 'content': title}
return paragraph

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code looks mostly correct but contains some potential improvements:

  1. Type Hinting Consistency: The post_reset_paragraph method's parameter paragraph should be consistent with the existing methods, which use type hinting (Dict). Consider changing it to match other parameters.

  2. Return Value Type: Ensure that the final function returns a list, even though each element returned is already of dictionary type. This clarity can help in understanding the expected output structure without further examination of individual elements.

Here are the proposed changes:

    def post_reset_paragraph(self, paragraph: Dict[str, Any], title_list: List[str]) -> List[Dict]:
        # Existing implementation

By making these changes, the code becomes more readable and maintainable, especially when dealing with static methods like sub_title and content_is_null.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant