Skip to content

gh-76960: Fix urljoining with an empty query string. #5645

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 3 commits into from
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
6 changes: 6 additions & 0 deletions Lib/test/test_urlparse.py
Original file line number Diff line number Diff line change
Expand Up @@ -437,6 +437,12 @@ def test_urljoins(self):
# issue 23703: don't duplicate filename
self.checkJoin('a', 'b', 'b')

# issue 32779: clear the query string when joining with '?'
self.checkJoin('http://a/b/c?d=e', '?', 'http://a/b/c')
self.checkJoin('http://a/b/c?d=e', 'http:?', 'http://a/b/c')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is the test case that @orsenthil suggests is wrong as it doesn't match other languages which result in http:? as the output.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's correct. We need to remove this. Perhaps I will modify this and bring this PR to a close.

self.checkJoin('http://a/b/c#d?e=f', '?#g', 'http://a/b/c#g')
self.checkJoin('http://a/b/c?d=e#f', '#?', 'http://a/b/c?d=e#?')

def test_RFC2732(self):
str_cases = [
('http://Test.python.org:5432/foo/', 'test.python.org', 5432),
Expand Down
5 changes: 4 additions & 1 deletion Lib/urllib/parse.py
Original file line number Diff line number Diff line change
Expand Up @@ -535,7 +535,10 @@ def urljoin(base, url, allow_fragments=True):
if not path and not params:
path = bpath
params = bparams
if not query:
# since urlparse doesn't leave any evidence of whether there was a bare
# '?' with an empty query string, we need to check whether it was there.
has_empty_query = url[0] == '?' or url.startswith(scheme + ':?')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is this url.startswith(scheme + ':?') special case requirement for and where is this referenced?

The behavior in Ruby and Golang was different for this scenario (but consistent).

require 'uri'

base_url = 'https://www.example.com/?a=b'
relative_url = 'https:?'

url = URI.join(base_url, relative_url).to_s
puts url

https:?

And with golang https://go.dev/play/p/lui16M9pFyo

package main

import (
	"fmt"
	"net/url"
)

func main() {
	base, _ := url.Parse("https://example.com/?a=b")
	u, _ := url.Parse("http:?")
	fmt.Println(base.ResolveReference(u))
}

http:?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this condition is removed or url.startswith(scheme + ':?') from this patch, we can consider this PR as it brings the expected behavior seen across other language libraries.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From https://datatracker.ietf.org/doc/html/rfc3986#section-5.2.2:

      -- A non-strict parser may ignore a scheme in the reference
      -- if it is identical to the base URI's scheme.

For now, urllib.parse behaves as a non-strict parser (there is special test for this). We can add an option to switch this, but this is a different feature.

But testing only url.startswith(scheme + ':?') is not enough, because http:?#z should clear query as well. And there are similar cases with other empty components. I am working on larger and more general PR.

if not has_empty_query:
query = bquery
return _coerce_result(urlunparse((scheme, netloc, path,
params, query, fragment)))
Expand Down
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
Fix urljoining with '?', a URL containing only a bare query string.