Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

encoding/xml: accepts various ill-formed XML #68293

Open
14 tasks
DemiMarie opened this issue Jul 4, 2024 · 21 comments
Open
14 tasks

encoding/xml: accepts various ill-formed XML #68293

DemiMarie opened this issue Jul 4, 2024 · 21 comments
Labels
NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Milestone

Comments

@DemiMarie
Copy link
Contributor

DemiMarie commented Jul 4, 2024

Go version

go version go1.21.11 linux/amd64

Output of go env in your module/workspace:

GO111MODULE=''
GOARCH='amd64'
GOBIN=''
GOCACHE='/home/user/.cache/go-build'
GOENV='/home/user/.config/go/env'
GOEXE=''
GOEXPERIMENT=''
GOFLAGS=''
GOHOSTARCH='amd64'
GOHOSTOS='linux'
GOINSECURE=''
GOMODCACHE='/home/user/go/pkg/mod'
GONOPROXY=''
GONOSUMDB=''
GOOS='linux'
GOPATH='/home/user/go'
GOPRIVATE=''
GOPROXY='direct'
GOROOT='/usr/lib/golang'
GOSUMDB='off'
GOTMPDIR=''
GOTOOLCHAIN='local'
GOTOOLDIR='/usr/lib/golang/pkg/tool/linux_amd64'
GOVCS=''
GOVERSION='go1.21.11'
GCCGO='gccgo'
GOAMD64='v1'
AR='ar'
CC='gcc'
CXX='g++'
CGO_ENABLED='1'
GOMOD='/dev/null'
GOWORK=''
CGO_CFLAGS='-O2 -g'
CGO_CPPFLAGS=''
CGO_CXXFLAGS='-O2 -g'
CGO_FFLAGS='-O2 -g'
CGO_LDFLAGS='-O2 -g'
PKG_CONFIG='pkg-config'
GOGCCFLAGS='-fPIC -m64 -pthread -Wl,--no-gc-sections -fmessage-length=0 -ffile-prefix-map=/tmp/go-build3671854511=/tmp/go-build -gno-record-gcc-switches'

What did you do?

https://go.dev/play/p/r8y4cgcybkS

What did you see happen?

encoding/xml accepts ill-formed XML.

What did you expect to see?

encoding/xml should reject all ill-formed XML. Except for the last tryUnmarshal() call I linked, the constraints that it fails to check can be checked for without resolving namespaces, and therefore can be checked for even by RawToken().

See:

Edit: removed #53728 because it is about serialization, not parsing.

@ianlancetaylor
Copy link
Contributor

Unfortunately our experience with encoding/xml is that making it more strictly inevitably breaks existing working code.

@DemiMarie
Copy link
Contributor Author

Does that apply to rejecting clearly ill-formed XML (#68294, #68295)? Could there be new APIs that take a strict bool flag?

One option for detecting future such bugs would be to differentially fuzz encoding/xml against libxml2, which is known for its standards conformance.

@ianlancetaylor
Copy link
Contributor

We could introduce new API. Or in some cases it might be more appropriate to introduce a GODEBUG setting. But we can't make the package more strict without supporting easy fallbacks. Either approach would require a proposal; see https://go.dev/s/proposal.

@cagedmantis cagedmantis added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Jul 8, 2024
@cagedmantis cagedmantis added this to the Backlog milestone Jul 8, 2024
@Zxilly
Copy link
Member

Zxilly commented Jul 9, 2024

Maybe it would be better to choose a 3rd party library to handle this? Do we have such an option now?

@DemiMarie
Copy link
Contributor Author

Maybe it would be better to choose a 3rd party library to handle this? Do we have such an option now?

I’m not aware of an open-source pure-Go third-party XML tokenizer.

@bjorndm
Copy link

bjorndm commented Jul 15, 2024

It seems the encoding/xml package is broken in various ways. The solution would be to make a new encoding/xml/v2 package ad to not break backwards compatibility. In doing so the API can also be improved. However this is a large effort, and I suspect someone else than the Go dev team will have to do it.

@DemiMarie
Copy link
Contributor Author

It seems the encoding/xml package is broken in various ways.

That it is. There are several problems with it:

  1. It accepts directives by default. This requires either accepting ill-formed XML or performing complex checks on DTDs.
  2. It can return the namespace prefix or the namespace URL, but not both. This means that one cannot round-trip XML without reimplementing namespace tracking.
  3. encoding/xml returns leading and trailing whitespace in a well-formed document as tokens. Changing this might break backwards compatibility. The same problem arises for the XML declaration.
  4. There are no guarantees of round-trip stability. Those who need that must reimplement serialization themselves, or use a third-party validator.
  5. There is no support for schema validation. Right now, I believe that all current solutions use libxml2 bindings. There are tools that generate Go code from XML Schema Definition files, but I’m not sure if xml.Unmarshal is sufficiently expressive to guarantee that only documents that validate against the schema are accepted, though it might actually be.

@bjorndm
Copy link

bjorndm commented Jul 16, 2024

@DemiMarie my meaning is that this would be a great occasion to "scratch your own itch" and make such a Go language package for the benefit of the whole community. The Go developers likely have their hands full with other issues.

@DemiMarie
Copy link
Contributor Author

@bjorndm I don’t have anywhere near enough time for that, sorry. If they have interest, I think @russellhaering might be a good choice, since they have had to deal with the limitations of encoding/xml in their own libraries.

@DemiMarie
Copy link
Contributor Author

My current thoughts:

  1. The current Name API is broken. Name needs to have both the prefix and the URL.
  2. The automatic conversion of namespace URLs to prefixes done during serialization is both buggy and ill-advised. I recommend requiring the caller to provide proper prefixes instead, and merely checking that the caller made a choice that results in well-formed and namespace-well-formed XML.
  3. The acceptance of arbitrary directives was a mistake.
  4. Only toplevel documents need to be supported.
  5. Round-trip stability should be guaranteed.
  6. Well-formedness should be checked during both parsing and serialization.

@bjorndm
Copy link

bjorndm commented Jul 31, 2024

Sorry, but looking at the current use of the xml package your point 4 is not correct. For example for generating and parsing SVG or other XML documents it is necessary to generate and parse partial XML as well. For such use cases parsing is necessarily less strict.

@DemiMarie
Copy link
Contributor Author

Do you have an example @bjorndm?

@bjorndm
Copy link

bjorndm commented Jul 31, 2024

@DemiMarie
Copy link
Contributor Author

Well, excelize and svgo come to mind.

https://github.com/qax-os/excelize https://github.com/ajstarks/svgo

Can you provide specific files that need to be parsed?

@bjorndm
Copy link

bjorndm commented Jul 31, 2024

Excelize parses and generates Microsoft Excel xls files.
Svgo generates scalable vector graphics.
You can find examples of those everywhere on the web. The hard part is supporting all features of these file formats. The source code of the projects mentioned is more instructive there.

@DemiMarie
Copy link
Contributor Author

Do either of those formats use DTDs? If not, they only need support for parsing & generating well-formed XML toplevel documents.

@bjorndm
Copy link

bjorndm commented Jul 31, 2024

These formats can use DTDs. However, in practice, these two libraries and many others use encoding/xml to generate and parse XML fragments, often using xml.Marshal and XML Unmarshal. The generation of xml fragments is this an important use case.

@DemiMarie
Copy link
Contributor Author

Are these fragments ever not well-formed themselves?

@bjorndm
Copy link

bjorndm commented Aug 1, 2024

These fragments are likely to be well formed, but they are not complete XML documents as they do not have the headers.

@DemiMarie
Copy link
Contributor Author

Well-formedness is sufficient here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Projects
None yet
Development

No branches or pull requests

6 participants