Task list (e.g. for GSOC) #1169

Closed
cartazio opened this issue Feb 19, 2014 · 14 comments

Comments

@cartazio

tasks that might be suitable for part of a GSOC for some lucky student!

@jgm
Owner

jgm commented Mar 11, 2014

Here is a list of projects that could be done in pandoc:

1. Use Text instead of String throughout. This would involve API changes and extensive (but not too difficult) changes to
   • pandoc-types
   • pandoc
   • texmath
   • highlighting-kate
   • pandoc-citeproc

   Also worth considering whether it would help to use my custom text-based parser combinators from cheapskate (perhaps with some amplifications) instead of the slower parsec.
2. Add a Haddock writer (#1135)
3. Create a new Haddock reader based on the current haddock code. (The present reader is based on older haddock code that used alex and happy, and it doesn't match current haddock syntax.)
4. Modify Image type to allow explicit encoding of embedded images (data instead of URL). That is, instead of containing a field for a URL, an Image would contain a field that could be either a URL or an encoded image (ByteString and MIME type).
5. Add a Lines (or LineBlock) Block element, and modify readers, writers, and associated code. Currently we just use a Para with LineBreaks. This is non-ideal, especially when converting into formats allowing line blocks.
6. Add an EPUB reader. (The embedded images would be important for this, since EPUBs frequently contain images.) (#652)
7. Create a flexible system for labeling objects (images, tables, code) and referring back by number and/or link. Or, more minimally: handle \label and \ref better when parsing LaTeX.
8. Automatic identifiers for images and tables in HTML output. (#208) Must take care not to break existing documents relying on automatic identifiers for headers.
9. Syntax and (block or inline?) element for anchors.
10. Allow attributes on links, images?
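
Item 4 could be sketched roughly as below. This is only an illustration under stated assumptions, not pandoc's actual API: `ImageSource`, `ImageURL`, `ImageData`, `MimeType`, `Image`, `isEmbedded`, and `describeSource` are all hypothetical names, and the `Image` record here is a simplified stand-in for the real inline element in pandoc-types.

```haskell
import qualified Data.ByteString as B

-- All names below are hypothetical; none are taken from pandoc-types.
type MimeType = String

-- | Where an image's content comes from: a URL, or embedded bytes
-- tagged with a MIME type.
data ImageSource
  = ImageURL String
  | ImageData MimeType B.ByteString
  deriving (Eq, Show)

-- | Simplified stand-in for an image element: alt text plus its source.
data Image = Image
  { imageAlt    :: String
  , imageSource :: ImageSource
  } deriving (Eq, Show)

-- | True when the image carries its own bytes rather than a reference.
isEmbedded :: Image -> Bool
isEmbedded img = case imageSource img of
  ImageData _ _ -> True
  ImageURL _    -> False

-- | Short human-readable description of a source.
describeSource :: ImageSource -> String
describeSource (ImageURL url)      = "url: " ++ url
describeSource (ImageData mime bs) =
  mime ++ ", " ++ show (B.length bs) ++ " bytes"
```

With a sum type like this, a reader that unpacks a container format (such as the EPUB reader in item 6) could attach image bytes directly, while writers pattern-match on the source to decide whether to emit a link or inline data.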

@cartazio
Author

great!

@knrafto

knrafto commented Mar 13, 2014

How about a PDF reader/writer pair for pandoc? The writer could be written with HPDF. The reader would strip stuff it can't understand, but try to keep text, headings, images, and the like. I've been (minimally) working on a PDF reading library already, and this looks like a great application.

@jgm
Owner

jgm commented Mar 13, 2014

+++ Kyle Raftogianis [Mar 13 14 13:56 ]:

How about a PDF reader/writer pair for pandoc? The writer could be
written with HPDF. The reader would strip stuff it can't understand,
but try to keep text, headings, images, and the like. I've been
(minimally) working on a PDF reading library already, and this looks
like a great application.

Interesting idea. My worry is that functionality would be too limited
for both the reader and the writer. I suppose proper typesetting of
math is out of the question. But if the writer could do everything
else - emphasis, paragraph layout, lists, tables - then I'd be open
to it. As for the reader, how much structure can be gotten from a PDF?
I'm skeptical but open-minded.

@knrafto

knrafto commented Mar 13, 2014

I see your point. PDFs are for rendered documents, not for markup. However, PDF documents still store metadata, and can store outlines (which can be turned into section headings) and "article threads", which describe how the text is connected into sections. It definitely needs more thought, though.
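
To make the outline idea concrete, here is a minimal sketch of turning an outline tree into section headings by using nesting depth as the heading level. It assumes the outline has already been extracted from the PDF; `Outline`, `Heading`, and `outlineToHeadings` are hypothetical names, not part of any PDF library or of pandoc.

```haskell
-- Hypothetical types; not taken from any PDF library or from pandoc-types.

-- | One entry in a PDF outline (bookmark tree): a title and nested children.
data Outline = Outline String [Outline]
  deriving (Eq, Show)

-- | A section heading: nesting level (1 = top level) and title.
data Heading = Heading Int String
  deriving (Eq, Show)

-- | Flatten an outline forest into headings in document order,
-- using tree depth as the heading level.
outlineToHeadings :: [Outline] -> [Heading]
outlineToHeadings = go 1
  where
    go level = concatMap (\(Outline title kids) ->
      Heading level title : go (level + 1) kids)
```

This only recovers the heading skeleton. Deciding which body text belongs under which heading would still need page positions or article threads, which this sketch does not attempt.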

@cartazio
Author

Typesetting math at some level could be done; it just might require having access to the CM fonts or something.

@cartazio
Author

Though that might be outside the scope of what's safe for a GSOC project.

@KurtPfeifle

On Thu, Mar 13, 2014 at 10:01 PM, John MacFarlane wrote:

+++ Kyle Raftogianis [Mar 13 14 13:56 ]:

How about a PDF reader/writer pair for pandoc? The writer could be
written with HPDF. The reader would strip stuff it can't understand,
but try to keep text, headings, images, and the like. I've been
(minimally) working on a PDF reading library already, and this looks
like a great application.

Interesting idea. My worry is that functionality would be too limited
for both the reader and the writer. I suppose proper typesetting of
math is out of the question. But if the writer could do everything
else - emphasis, paragraph layout, lists, tables - then I'd be open
to it. As for the reader, how much structure can be gotten from a PDF?

It's not possible to answer that in a generic way. From some PDFs you
cannot get any structure at all.

The best chances of recovering most of the structure are with "tagged"[*] PDFs and
PDF/UA (this is a new ISO standard meaning "universal accessibility"), but
these are still very rare out there in the big, big world...


[*] If you do not know about "tagged" PDF, just try to imagine a lot of
additional markup being contained invisibly in the rendered document.

@cartazio
Author

But for the writer there should be a way, right?

@knrafto

knrafto commented Mar 13, 2014

I think the LaTeX writer offers much more than a PDF writer would. I don't think this idea would be very feasible. Thanks for the comments!

@mpickering
Collaborator

I am in the process of writing a proposal for adding an EPUB reader.

@mpickering
Collaborator

Items 1, 5, 7, 8, 9, and 10 from this list are still open, for anyone finding it.

@cartazio
Author

cartazio commented Dec 8, 2014

and some would probably make great GSOCs!

@jgm changed the title from "Document Task wishlist" to "Task list (e.g. for GSOC)" on Jan 2, 2015
@jgm
Owner

jgm commented Jan 2, 2015

Closing. Opened new list at #1852.

@jgm closed this as completed on Jan 2, 2015
5 participants