perf: Enhance Word parsing #2612
@@ -339,13 +339,14 @@ def parse(self, text: str):
         for e in result:
             if len(e['content']) > 4096:
                 pass
-        return [item for item in [self.post_reset_paragraph(row) for row in result] if
+        title_list = list(set([row.get('title') for row in result]))
+        return [item for item in [self.post_reset_paragraph(row, title_list) for row in result] if
                 'content' in item and len(item.get('content').strip()) > 0]

-    def post_reset_paragraph(self, paragraph: Dict):
+    def post_reset_paragraph(self, paragraph: Dict, title_list: List[str]):
         result = self.filter_title_special_characters(paragraph)
         result = self.sub_title(result)
-        result = self.content_is_null(result)
+        result = self.content_is_null(result, title_list)
         return result

     @staticmethod
@@ -357,11 +358,14 @@ def sub_title(paragraph: Dict):
         return paragraph

     @staticmethod
-    def content_is_null(paragraph: Dict):
+    def content_is_null(paragraph: Dict, title_list: List[str]):
         if 'title' in paragraph:
             title = paragraph.get('title')
             content = paragraph.get('content')
             if (content is None or len(content.strip()) == 0) and (title is not None and len(title) > 0):
+                find = [t for t in title_list if t.__contains__(title) and t != title]
+                if find:
+                    return {'title': '', 'content': ''}
                 return {'title': '', 'content': title}
         return paragraph

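The effect of the new `title_list` argument can be tried in isolation. The following is a minimal standalone sketch, not the project's code: `content_is_null` is copied from the diff as a free function, with `in` used in place of `__contains__` and a `None` guard added as an assumption (titles collected via `row.get('title')` can be `None` when a row has no title). The sample rows are made up for illustration.

```python
from typing import Dict, List, Optional


def content_is_null(paragraph: Dict, title_list: List[Optional[str]]) -> Dict:
    """Standalone copy of the logic in the diff (simplified for illustration)."""
    if 'title' in paragraph:
        title = paragraph.get('title')
        content = paragraph.get('content')
        if (content is None or len(content.strip()) == 0) and (title is not None and len(title) > 0):
            # Assumption: skip None entries so the substring test cannot fail.
            find = [t for t in title_list if t is not None and title in t and t != title]
            if find:
                # A different, longer title already contains this one: empty the paragraph.
                return {'title': '', 'content': ''}
            # Otherwise keep the title text as the paragraph content.
            return {'title': '', 'content': title}
    return paragraph


# Made-up sample rows; the last one has no 'title' key at all.
rows = [
    {'title': 'Install', 'content': ''},
    {'title': 'Install on Linux', 'content': 'Run the installer.'},
    {'title': 'Usage', 'content': ''},
    {'content': 'No title here.'},
]
title_list = list(set(row.get('title') for row in rows))
processed = [content_is_null(row, title_list) for row in rows]
# Same filter as in parse(): drop paragraphs whose content is empty.
kept = [item for item in processed if 'content' in item and len(item.get('content').strip()) > 0]
```

With these rows, the empty `Install` paragraph is dropped because `Install on Linux` contains its title, while the empty `Usage` paragraph survives with its title moved into `content`.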
The code looks mostly correct, but there is room for some improvements.
Here are the proposed changes:
def post_reset_paragraph(self, paragraph: Dict[str, Any], title_list: List[str]) -> List[Dict]:
    # Existing implementation
With these changes, the code becomes more readable and maintainable, especially when dealing with static methods.
In general, there isn't anything fundamentally wrong with your code. However, here are some suggestions for improvement:
Improvements & Suggestions
Consistent Error Handling: In the get_title_level function, it's better to raise an exception when an error occurs rather than simply logging it silently.
Code Dexterity:
Return Type Hinting (Optional but Recommended):
Exception Handling: Use logging, which provides more structured output compared to simple string printing.
Performance Optimizations:
Here's the updated version with some of these improvements:
These changes enhance the readability and robustness of your code while maintaining the functionality you intended.
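The error-handling and logging suggestions can be illustrated with a small sketch. The body of `get_title_level` is not shown in this diff, so everything below is hypothetical: `get_title_level_sketch`, `TitleLevelError`, and the "Heading N" style-name convention are assumptions made only to show the pattern of logging through `logging` and then raising instead of failing silently.

```python
import logging

logger = logging.getLogger(__name__)


class TitleLevelError(ValueError):
    """Hypothetical exception type, introduced only for this sketch."""


def get_title_level_sketch(style_name: str) -> int:
    """Hypothetical stand-in: read a level from names such as 'Heading 2'."""
    try:
        # Take the last whitespace-separated token and parse it as the level.
        return int(style_name.split()[-1])
    except (ValueError, IndexError) as exc:
        # Log through the logging module rather than print, then re-raise
        # so the caller can decide how to handle the failure.
        logger.warning("cannot derive title level from %r", style_name)
        raise TitleLevelError(f"unrecognised heading style: {style_name!r}") from exc
```

A caller that prefers a fallback level can catch `TitleLevelError` explicitly, which keeps the failure visible instead of hiding it inside the helper.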