Preliminary XML support #224

Open
wants to merge 2 commits into
base: master

retrography

This commit adds preliminary support for XML dataset import and export (no databook yet). The code uses only Python's standard library and works on both Python 2 and 3. It supports reading XML datasets with fields stored either as elements or as attributes. I am sure the code has a lot of room for improvement, but I would prefer to get some early feedback before finishing up.

@kennethreitz
Contributor

This library has had a joke in the documentation that "xml will never be supported". :)

I enjoy this joke, and it would entertain me to see it remain true. But, it can easily be removed instead.

@kennethreitz
Contributor

I haven't executed the code, but the approach looks relatively sensible. Do you think having xml in/out will be useful, even though the various forms it takes are so variable?

@retrography
Author

Haha, I am aware of the joke. We will try to find another format to denigrate once this is done (let's say RDF)!

Actually most datasets provided in XML format come in two pretty simple flavours: 1) One record per line, fields as attributes (like Stack Exchange data dump), and 2) Records as elements, fields as sub-elements. In my experience many XML data dumps are either already in one of these formats or can be reduced to one of these with a simple XPath.
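
A minimal sketch of how those two flavours could be read with the standard library's `xml.etree.ElementTree`. This is not the code from the pull request; the function name `read_rows` and the sample documents are hypothetical, and the sketch assumes every child of the root element is one record.

```python
import xml.etree.ElementTree as ET


def read_rows(xml_text):
    """Return (headers, rows) from either flavour (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    records = []
    for record in root:
        if len(record):
            # Flavour 2: the record has child elements, one per field.
            records.append(
                {child.tag: (child.text or "").strip() for child in record}
            )
        else:
            # Flavour 1: the record is a single element with fields as attributes.
            records.append(dict(record.attrib))

    # Collect headers in first-seen order so ragged records still line up.
    headers = []
    for rec in records:
        for key in rec:
            if key not in headers:
                headers.append(key)

    rows = [tuple(rec.get(h, "") for h in headers) for rec in records]
    return headers, rows


# Made-up sample data in each flavour.
attr_xml = '<posts><row Id="1" Score="5"/><row Id="2" Score="7"/></posts>'
elem_xml = (
    "<posts>"
    "<row><Id>1</Id><Score>5</Score></row>"
    "<row><Id>2</Id><Score>7</Score></row>"
    "</posts>"
)
```

Both sample documents reduce to the same headers and rows, which is the shape a tabular dataset needs.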

My objective, if I have time, is to support these two formats first. Then we can add XPath support in a later version. That is exactly what Google Sheets does, and I have found it largely sufficient for most data imports.
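
As a rough illustration of the XPath idea: `xml.etree.ElementTree` already understands a limited XPath subset through `findall`, which is enough to pick the record elements out of a more deeply nested file. The document layout and the helper name `select_records` below are assumptions made for the example, not part of this pull request.

```python
import xml.etree.ElementTree as ET

# Made-up document where the records sit below some wrapper elements.
nested_xml = (
    "<dump>"
    "<meta><source>example</source></meta>"
    '<data><posts><row Id="1" Score="5"/><row Id="2" Score="7"/></posts></data>'
    "</dump>"
)


def select_records(xml_text, path):
    """Return the elements matched by an ElementTree path (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    return root.findall(path)


# The path is relative to the root element <dump>.
records = select_records(nested_xml, "./data/posts/row")
ids = [record.get("Id") for record in records]
```

Once the record elements are selected, they are in one of the two simple flavours again and can be read the same way.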

For now the reading code is a bit buggy, but it handles 70-80 percent of the files I have tried in the two formats I just mentioned. The XML writer should be more robust; I spent some time on it this morning.
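
For comparison, a minimal sketch of what writing a dataset back out in either flavour might look like with `xml.etree.ElementTree`. It is an illustration only, not the writer from this pull request: `write_rows`, its parameters, and the default tag names are hypothetical, and it assumes the headers are already valid XML names (it does no sanitizing).

```python
import xml.etree.ElementTree as ET


def write_rows(headers, rows, root_tag="dataset", row_tag="row",
               as_attributes=False):
    """Serialize rows to an XML string (hypothetical helper)."""
    root = ET.Element(root_tag)
    for row in rows:
        record = ET.SubElement(root, row_tag)
        for header, value in zip(headers, row):
            if as_attributes:
                # Flavour 1: fields as attributes of the record element.
                record.set(header, str(value))
            else:
                # Flavour 2: fields as sub-elements of the record element.
                ET.SubElement(record, header).text = str(value)
    # encoding="unicode" returns a str rather than bytes (Python 3).
    return ET.tostring(root, encoding="unicode")


element_style = write_rows(["Id", "Score"], [(1, 5)])
attribute_style = write_rows(["Id", "Score"], [(1, 5)], as_attributes=True)
```

ElementTree escapes special characters in text and attribute values on output, which removes one common source of writer bugs.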

I sincerely hate dealing with XML files, and that is why I am writing this: I just want to be able to turn them into other less finicky formats with as little hassle as possible. I am more of a data analyst than a programmer, and I think such a tool can be very useful for people like me.

Try it with some data from Stack Exchange. It doesn't work with every dataset yet, but the outcome is pretty cool.

@kennethreitz
Contributor

We support RDF too! Don't worry, we'll think of something ;)

This is great work – I'm excited about it. If there's anything I can do to support the process, please let me know!

@retrography
Author

Just let me know if there are conventions you have followed in the code so far that I should respect. I am pretty excited too: finally I will have one data interchange package for all my needs (or most of them at least...)!
