gh-135661: Fix CDATA section parsing in HTMLParser #135665

serhiy-storchaka · 2025-06-18T10:49:00Z

"] ]>" and "]] >" no longer end the CDATA section.

Issue: HTMLParser differences from the HTML5 specification #135661
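To illustrate the description above, here is a minimal sketch comparing a lenient terminator that tolerates whitespace with the strict `]]>` match. The names `LENIENT_CLOSE` and `cdata_body` are hypothetical and were invented for this illustration; they are not code from the PR.

```python
import re

# Hypothetical lenient pattern: allows whitespace inside the terminator.
LENIENT_CLOSE = re.compile(r'\]\s*\]\s*>')
CDATA_OPEN = '<![CDATA['


def cdata_body(text, lenient):
    """Return the text between '<![CDATA[' and the terminator, or None."""
    start = text.index(CDATA_OPEN) + len(CDATA_OPEN)
    if lenient:
        match = LENIENT_CLOSE.search(text, start)
        end = match.start() if match else -1
    else:
        # Strict rule: only the exact three characters ']]>' end the section.
        end = text.find(']]>', start)
    return None if end < 0 else text[start:end]


sample = '<![CDATA[] ]>]]>'
lenient_result = cdata_body(sample, lenient=True)   # stops at '] ]>'
strict_result = cdata_body(sample, lenient=False)   # keeps '] ]>' as content
```

With the strict rule, the sequences `] ]>` and `]] >` stay inside the section body instead of closing it early.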

serhiy-storchaka · 2025-07-04T06:20:47Z

Lib/html/parser.py

+            j = rawdata.find(']]>')
+            if j < 0:
+                return -1
+            self.unknown_decl(rawdata[i+3: j])
+            return j + 3
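The return values in this hunk can be read as a small contract: `-1` means "terminator not seen yet, wait for more input", and `j + 3` is the offset just past `]]>`. The following standalone sketch mirrors that contract under stated assumptions: `parse_cdata` is a hypothetical free function, it returns the payload instead of calling `unknown_decl()`, and it passes a start offset to `find()`, which is a choice of this sketch rather than something shown in the diff.

```python
def parse_cdata(rawdata, i):
    """Hypothetical standalone version of the CDATA branch.

    Returns (next_index, payload); next_index is -1 when the section
    is not terminated yet and more data has to be buffered.
    """
    assert rawdata.startswith('<![CDATA[', i)
    # Search only after the opening '<![CDATA[' (9 characters).
    j = rawdata.find(']]>', i + 9)
    if j < 0:
        return -1, None
    # i + 3 skips '<![' so the payload starts with 'CDATA[', as in the diff.
    return j + 3, rawdata[i + 3:j]


# Incremental feeding: the first chunk contains '] ]>' but no real terminator.
buffer = '<p><![CDATA[a ] ]> b'
first_attempt = parse_cdata(buffer, 3)    # (-1, None): keep buffering
buffer += ' c]]><hr>'
second_attempt = parse_cdata(buffer, 3)   # section is now complete
```

After the second call, `buffer[second_attempt[0]:]` is the markup that follows the section.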


According to the HTML5 standard (https://html.spec.whatwg.org/multipage/parsing.html#markup-declaration-open-state), this should be either data or a bogus comment (which ends with >, not ]]>), depending on the context. I may be misreading the HTML5 standard, because this part is difficult to implement.

I tried copying the content of the tests in the following file:

    <!DOCTYPE html>
    <html>
    <body>
    <![CDATA[just some plain text]]><hr>
    <![CDATA[]]><hr>
    <![CDATA[&not-an-entity-ref;]]><hr>
    <![CDATA[<not a='start tag'>]]><hr>
    <![CDATA[]]><hr>
    <![CDATA[[[I have many brackets]]]]><hr>
    <![CDATA[I have a > in the middle]]><hr>
    <![CDATA[I have a ]] in the middle]]><hr>
    <![CDATA[] ]>]]><hr>
    <![CDATA[]] >]]><hr>
    <![CDATA[ if (a < b && a > b) { printf("[<marquee>How?</marquee>]"); } ]]><hr>
    </body>
    </html>

and this was the result on Firefox:

    <html><head></head><body>
    <hr>
    ]]><hr>
    <hr>
    ]]><hr>
    <hr>
    <hr>
     in the middle]]><hr>
    <hr>
    ]]><hr>
    ]]><hr>
     b) { printf("[<marquee>How?</marquee>]"); } ]]><hr>
    </body></html>
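The leftover text in the Firefox output is consistent with the bogus-comment rule, where everything after `<!` up to the first `>` becomes comment data and the rest is ordinary text. A minimal sketch of that rule follows; `split_bogus_comment` and `SAMPLES` are hypothetical names for this illustration, and the function models only this one tokenizer state, not a full HTML5 tokenizer.

```python
def split_bogus_comment(markup):
    """Split markup starting with '<!' into (comment_data, remainder).

    Models the bogus comment rule only: the comment ends at the
    first '>' (or at end of input if there is none).
    """
    if not markup.startswith('<!'):
        raise ValueError("expected markup starting with '<!'")
    end = markup.find('>', 2)
    if end < 0:
        return markup[2:], ''
    return markup[2:end], markup[end + 1:]


# A few of the test inputs quoted above.
SAMPLES = [
    '<![CDATA[just some plain text]]>',
    '<![CDATA[I have a > in the middle]]>',
    '<![CDATA[] ]>]]>',
    '<![CDATA[]] >]]>',
]

# Text that would remain visible after each bogus comment.
LEFTOVER = [split_bogus_comment(sample)[1] for sample in SAMPLES]
```

For these inputs the remainders match the visible fragments in the Firefox result, for example ` in the middle]]>` and `]]>`.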

Yes, and if you try <svg><text y="100"><![CDATA[foo<br>bar]]></text></svg>, you will see that the content between <![CDATA[ and ]]> is interpreted as raw data.

This is context dependent.
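As a rough sketch of what "context dependent" means here: the HTML5 tokenizer treats `<![CDATA[` as a real CDATA section only when the current node is outside the HTML namespace, such as inside `<svg>`. The helper below is a simplified illustration; `cdata_is_raw` and the tuple-based stack are hypothetical, and the check uses the last open element as a stand-in for the specification's "adjusted current node".

```python
HTML_NS = 'http://www.w3.org/1999/xhtml'
SVG_NS = 'http://www.w3.org/2000/svg'


def cdata_is_raw(open_elements):
    """Return True if '<![CDATA[' should open a CDATA section.

    open_elements is a list of (namespace, name) tuples, innermost last.
    Simplification: the last entry stands in for the adjusted current node.
    """
    if not open_elements:
        return False
    namespace, _name = open_elements[-1]
    return namespace != HTML_NS


html_stack = [(HTML_NS, 'html'), (HTML_NS, 'body')]
svg_stack = html_stack + [(SVG_NS, 'svg'), (SVG_NS, 'text')]

in_body = cdata_is_raw(html_stack)   # False: treated as a bogus comment
in_svg = cdata_is_raw(svg_stack)     # True: content up to ']]>' is raw data
```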

HTMLParser is actually just a tokenizer. To determine the context automatically, it would need to maintain the stack of open elements and to know which elements are in the HTML namespace. This is all in the specification, and we will implement it in the future, but that is a different level of complexity. So I solved the issue by letting the user determine the context: the new method support_cdata() sets how HTMLParser will parse CDATA. This is not ideal, but perhaps better than the current state.
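Because HTMLParser only reports tokens through callbacks, one way to see how a given Python version handles CDATA is to record which callbacks fire. The sketch below uses only long-standing `HTMLParser` methods; `EventRecorder` and `record_events` are hypothetical names. It deliberately does not call `support_cdata()`, since its signature is not shown in this thread, and the events recorded for the CDATA part will differ between Python versions.

```python
from html.parser import HTMLParser


class EventRecorder(HTMLParser):
    """Collects (kind, payload) tuples for the callbacks of interest."""

    def __init__(self):
        super().__init__()
        self.events = []

    def unknown_decl(self, data):
        # Called for marked sections such as CDATA when they are recognized.
        self.events.append(('unknown_decl', data))

    def handle_comment(self, data):
        # Called for comments, including bogus comments.
        self.events.append(('comment', data))

    def handle_data(self, data):
        self.events.append(('data', data))

    def handle_starttag(self, tag, attrs):
        self.events.append(('starttag', tag))


def record_events(markup):
    parser = EventRecorder()
    parser.feed(markup)
    parser.close()
    return parser.events


events = record_events('<![CDATA[x]]><hr>')
```

Inspecting `events` shows whether the section arrived through `unknown_decl()` or `handle_comment()` on the interpreter in use.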

ezio-melotti · 2025-07-05T00:06:41Z

Lib/html/parser.py

+            j = rawdata.find(']]>')
+            if j < 0:
+                return -1
+            self.unknown_decl(rawdata[i+3: j])


Suggested change

-            self.unknown_decl(rawdata[i+3: j])
+            self.unknown_decl(rawdata[i+3:j])

ezio-melotti · 2025-07-05T00:19:18Z

Lib/html/parser.py

+            j = rawdata.find(']]>')
+            if j < 0:
+                return -1
+            self.unknown_decl(rawdata[i+3: j])
+            return j + 3


I tried copying the content of the tests in the following file:

    <!DOCTYPE html>
    <html>
    <body>
    <![CDATA[just some plain text]]><hr>
    <![CDATA[]]><hr>
    <![CDATA[&not-an-entity-ref;]]><hr>
    <![CDATA[<not a='start tag'>]]><hr>
    <![CDATA[]]><hr>
    <![CDATA[[[I have many brackets]]]]><hr>
    <![CDATA[I have a > in the middle]]><hr>
    <![CDATA[I have a ]] in the middle]]><hr>
    <![CDATA[] ]>]]><hr>
    <![CDATA[]] >]]><hr>
    <![CDATA[ if (a < b && a > b) { printf("[<marquee>How?</marquee>]"); } ]]><hr>
    </body>
    </html>

and this was the result on Firefox:

    <html><head></head><body>
    <hr>
    ]]><hr>
    <hr>
    ]]><hr>
    <hr>
    <hr>
     in the middle]]><hr>
    <hr>
    ]]><hr>
    ]]><hr>
     b) { printf("[<marquee>How?</marquee>]"); } ]]><hr>
    </body></html>

pythongh-135661: Fix CDATA section parsing in HTMLParser

f7f9f56

"] ]>" and "]] >" no longer end the CDATA section.

serhiy-storchaka requested a review from ezio-melotti as a code owner June 18, 2025 10:49

serhiy-storchaka added the labels "needs backport to 3.13" (bugs and security fixes) and "needs backport to 3.14" (bugs and security fixes) Jun 18, 2025

bedevere-app bot added the awaiting core review label Jun 18, 2025

bedevere-app bot mentioned this pull request Jun 18, 2025

HTMLParser differences from the HTML5 specification #135661

Open

serhiy-storchaka added 4 commits July 3, 2025 18:16

Merge branch 'main' into htmlparser-cdata

816f34e

Move to Security.

cf918e3

Update 2025-06-18-13-34-55.gh-issue-135661.NZlpWf.rst

d346c10

Merge branch 'main' into htmlparser-cdata

9e1ae33

serhiy-storchaka added the labels "type-security" (A security issue), "needs backport to 3.9" (only security fixes), "needs backport to 3.10" (only security fixes), "needs backport to 3.11" (only security fixes) and "needs backport to 3.12" (only security fixes) Jul 4, 2025

serhiy-storchaka commented Jul 4, 2025

ezio-melotti reviewed Jul 5, 2025

* Make CDATA section parsing context depending.

524cac5

* Add HTMLParser.support_cdata().
