Syzygist

Syzygist is a small set of utilities for working with Scalaz streams. The split package is a loose port of Haskell's Data.List.Split, and the parse package provides some tools for running parboiled2 parsers in the context of Scalaz streams.

The split package is reasonably well documented and tested. The parse package is not.

Example usage

The Penn Treebank includes several thousand files containing parsed sentences in the following format:

( (S 
    (NP-SBJ (DT The) 
      (ADJP (RBS most) (JJ troublesome) )
      (NN report) )
    (VP (MD may) 
      (VP (VB be) 
        (NP-PRD 
          (NP (DT the) (NNP August) (NN merchandise) (NN trade) (NN deficit) )
          (ADJP (JJ due) 
            (ADVP (IN out) )
            (NP-TMP (NN tomorrow) )))))
    (. .) ))

We can write a simple s-expression parser with parboiled2:

import org.parboiled2._
import org.syzygist.parse._
import scalaz.Tree

class SentenceParser(input: ParserInput) extends ValueParser(input) {
  type Value = Tree[String]

  def value: Rule1[Tree[String]] = rule {
    Whitespace ~ OpenBracket ~ Node ~ CloseBracket
  }

  def Whitespace: Rule0 = rule { zeroOrMore(anyOf(" \t\n")) }
  def OpenBracket: Rule0 = rule { '(' ~ Whitespace }
  def CloseBracket: Rule0 = rule { ')' ~ Whitespace }

  def Terminal: Rule1[String] = rule {
    capture(oneOrMore(noneOf(" ()"))) ~ Whitespace
  }

  def Node: Rule1[Tree[String]] = rule {
    OpenBracket ~ (Branch | Leaf) ~ CloseBracket
  }

  def Branch: Rule1[Tree[String]] = rule {
    Terminal ~ oneOrMore(Node) ~> (
      (tag: String, nodes: Seq[Tree[String]]) => Tree.node(tag, nodes.toStream)
    )
  }

  def Leaf: Rule1[Tree[String]] = rule {
    Terminal ~ Terminal ~> (
      (tag: String, word: String) => Tree.node(tag, Stream(Tree.leaf(word)))
    )
  }
}

This parser accepts one sentence at a time, but we want to stream through thousands of files, each of which may contain many sentences. The split package's whenElt and the parse package's parseWith make this easy:

import org.syzygist.split.Splitter.whenElt
import scalaz.concurrent.Task
import scalaz.stream._

val sentenceSplitter = whenElt[String](_.startsWith("(")).keepDelimsL.split

def parseFile(file: String): Process[Task, Tree[String]] =
  io.linesR(file)
    .pipe(sentenceSplitter)
    .map(_.mkString)
    .filter(_.nonEmpty)
    .evalMap(parseWith(new SentenceParser(_)))

val sentences = parseFile("penn-treebank-rel3/parsed/mrg/wsj/24/wsj_2400.mrg")
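
To see what sentenceSplitter does to the lines of a file, here is a minimal sketch of the keepDelimsL behaviour on an in-memory sequence, using only the Scala standard library. It assumes the split package follows Data.List.Split's semantics: each delimiter element starts a new chunk, and input that begins with a delimiter produces an empty first chunk, which would explain the filter(_.nonEmpty) step in parseFile. The names SplitSketch and splitKeepDelimsL are hypothetical and are not part of Syzygist.

```scala
object SplitSketch {
  // Hypothetical helper, not Syzygist's API: splits xs so that every element
  // satisfying isDelim starts a new chunk (Data.List.Split's keepDelimsL).
  def splitKeepDelimsL[A](xs: Seq[A])(isDelim: A => Boolean): Vector[Vector[A]] =
    xs.foldLeft(Vector(Vector.empty[A])) { (chunks, x) =>
      if (isDelim(x)) chunks :+ Vector(x)       // a delimiter opens a new chunk
      else chunks.init :+ (chunks.last :+ x)    // other elements extend the current chunk
    }

  // Two abbreviated sentences; only sentence-opening lines start with "(".
  val lines: Seq[String] = Seq(
    "( (S",
    "    (NP (DT The) (NN report) )",
    "    (. .) ))",
    "( (S",
    "    (VP (VB be) )",
    "    (. .) ))"
  )

  // Mirrors the pipe / map / filter steps of parseFile, minus the parsing.
  val sentences: Vector[String] =
    splitKeepDelimsL(lines)(_.startsWith("("))
      .map(_.mkString)
      .filter(_.nonEmpty)
}
```

In this sketch SplitSketch.sentences holds two strings, one per sentence. The split itself yields three chunks, because the input begins with a delimiter line and so the first chunk is empty; the filter drops it before anything reaches the parser.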
