Popular repositories

  1. behemoth Public archive

    Behemoth is an open source platform for large scale document analysis based on Apache Hadoop.

    Java 281 60

  2. TextClassification Public

    A Text Classification API in Java originally developed by DigitalPebble Ltd. The API is independent from the ML implementations used and can be used as a front end to various ML algorithms. libSVM …

    Java 48 21

  3. textclassification-examples Public

    Use cases for DigitalPebble's TextClassification API

    Java 10 3

  4. stormcrawlerfight Public

    Crawl configurations for benchmarking / testing StormCrawler

    Shell 10 5

  5. stormcrawler-docker Public

    Resources for running StormCrawler with Docker services

    Dockerfile 10 3

  6. ansible-storm Public

    Ansible playbook for deploying a Storm cluster

    7 1

Repositories

Showing 10 of 26 repositories
  • stormcrawler-docker Public

    Resources for running StormCrawler with Docker services

    Dockerfile 10 Apache-2.0 3 0 0 Updated Nov 10, 2024
  • digitalpebble.github.io Public

    Resources for the DigitalPebble website

    SCSS 0 0 0 0 Updated Jul 17, 2024
  • crawlurlfrontier Public

    Crawl config used to test URL Frontier on a large scale and produce WARCs for CommonCrawl.

    FLUX 1 0 0 0 Updated May 16, 2024
  • storm Public Forked from apache/storm

    Mirror of Apache Storm

    Java 0 Apache-2.0 4,154 0 0 Updated Apr 10, 2024
  • tika-detector-stormcrawler Public

    Wraps the charset detection logic from StormCrawler as a Tika module

    Java 0 Apache-2.0 1 0 0 Updated Feb 2, 2024
  • tika Public Forked from apache/tika

    The Apache Tika toolkit detects and extracts metadata and text from over a thousand different file types (such as PPT, XLS, and PDF).

    Java 0 Apache-2.0 793 0 0 Updated Jan 25, 2024
  • benchmark Public

    StormCrawler topology to evaluate the performance of different backends and configurations

    Shell 0 0 0 0 Updated Jan 22, 2024
  • docs Public Forked from docker-library/docs

    Documentation for Docker Official Images in docker-library

    Shell 0 MIT 2,286 0 0 Updated Jan 16, 2024
  • ansible-storm Public

    Ansible playbook for deploying a Storm cluster

    7 1 0 0 Updated Dec 7, 2023
  • nutch Public Forked from apache/nutch

    Apache Nutch is an extensible and scalable web crawler

    Java 1 Apache-2.0 1,253 0 0 Updated Nov 8, 2023
