-
Notifications
You must be signed in to change notification settings - Fork 2.3k
perf: Enhance Word parsing #2612
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected, please follow our release note process to remove it. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
if len(images) > 0: | ||
return title + '\n' + images_to_string(images, doc, images_list, get_image_id) if len( | ||
paragraph.text) > 0 else images_to_string(images, doc, images_list, get_image_id) | ||
return title | ||
|
||
except Exception as e: | ||
traceback.print_exc() | ||
return paragraph.text |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general, there isn't anything fundamentally wrong with your code. However, here are some suggestions for improvement:
Improvements & Suggestions
-
Consistent Error Handling:
- In the
get_title_level
function, it's better to raise an exception when an error occurs rather than simply logging it silently.
- In the
-
Code Dexterity:
- Consider extracting common logic into separate methods for readability and maintainability.
-
Return Type Hinting (Optional but Recommended):
- Adding type hints can make your code more readable and help catch errors during development.
-
Exception Handling:
- You might want to handle specific exceptions more selectively or use Python
logging
, which provides more structured output compared to simple string printing.
- You might want to handle specific exceptions more selectively or use Python
-
Performance Optimizations:
- There currently aren't significant performance bottlenecks in your code, but keeping it clean and well-documented will benefit future maintenance.
Here's the updated version with some of these improvements:
from typing import List
import traceback
from docx.api.text.paragraph import Paragraph
from docx.shared.document import Document
# Assuming you have a list of tuples defining title font sizes
title_font_list = [
(36, 100),
(26, 36),
(24, 24),
(22, 22),
(18, 22),
(16, 18)
]
def get_title_level(paragraph: Paragraph) -> int:
"""Determine the level of a heading based on its style."""
try:
psn = paragraph.style.name if paragraph.style else ''
if psn.startswith('Heading') or psn == 'TOC 标题' or psn == '标题':
# Extract the number from the style name
level = int(filter(str.isdigit, psn.split()))
elif len(paragraph.runs) == 1:
font_size = paragraph.runs[0].font.size.pt # Convert font size to points
for _value, index in enumerate(title_font_list):
lower_bound, upper_bound = _value
if lower_bound <= font_size < upper_bound:
return index + 1
except Exception as e:
# Log the error or re-raise it depending on your needs
print(f"Error determining title level: {e}")
return None
class DocSplitHandle(BaseSplitHandle):
@staticmethod
def paragraph_to_md(paragraph: Paragraph, doc: Document, images_list, get_image_id):
"""Convert a Word document paragraph to Markdown format."""
try:
title_level = get_title_level(paragraph)
if title_level is not None:
title = "#" * title_level + " " + paragraph.text
images = reduce(lambda x, y: [*x, *y],
[get_paragraph_element_images(e, doc, images_list, get_image_id) for e in
paragraph._element],
[])
if len(images) > 0:
content = f"{title}\n{images_to_string(images, doc, images_list, get_image_id)}" if len(
paragraph.text) > 0 else \
images_to_string(images, doc, images_list, get_image_id)
else:
content = title
return content
except Exception as e:
# Optionally log this exception or re-raise
print(f"Error processing paragraph: {e}")
traceback.print_exc()
return paragraph.text
These changes enhance the readability and robustness of your code while maintaining the functionality you intended.
if 'title' in paragraph: | ||
title = paragraph.get('title') | ||
content = paragraph.get('content') | ||
if (content is None or len(content.strip()) == 0) and (title is not None and len(title) > 0): | ||
find = [t for t in title_list if t.__contains__(title) and t != title] | ||
if find: | ||
return {'title': '', 'content': ''} | ||
return {'title': '', 'content': title} | ||
return paragraph | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code looks mostly correct but contains some potential improvements:
-
Type Hinting Consistency: The
post_reset_paragraph
method's parameterparagraph
should be consistent with the existing methods, which use type hinting (Dict
). Consider changing it to match other parameters. -
Return Value Type: Ensure that the final function returns a list, even though each element returned is already of dictionary type. This clarity can help in understanding the expected output structure without further examination of individual elements.
Here are the proposed changes:
def post_reset_paragraph(self, paragraph: Dict[str, Any], title_list: List[str]) -> List[Dict]:
# Existing implementation
By making these changes, the code becomes more readable and maintainable, especially when dealing with static methods like sub_title
and content_is_null
.
perf: Enhance Word parsing