Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Splitting code base in multiple packages #228

Closed
nicola opened this issue Apr 10, 2015 · 16 comments
Closed

Splitting code base in multiple packages #228

nicola opened this issue Apr 10, 2015 · 16 comments

Comments

@nicola
Copy link

nicola commented Apr 10, 2015

I think this could be useful, so that one could just import tfidf it that's the only bit needed.
In other words, modularization the node way.

@kkoch986
Copy link
Member

what is the goal? If all you want to do is load tfidf, you can do that, see https://github.com/NaturalNode/natural/blob/master/lib/natural/index.js

If the goal is not to have to download everything, that would probably be pretty tricky short of moving all the code into separate npm packages which doesnt really seem great to me.

@nicola
Copy link
Author

nicola commented Apr 10, 2015

@kkoch986 that is exactly what I was thinking. I was thinking about writing up an interactive tutorial in the browser and I felt downloading the whole library is an overkill.

I am not sure about this being a great idea, yet. Thanks for considering this

@kkoch986
Copy link
Member

no problem, so you are using browserify? I'm really interested in an online interactive tutorial, actually @chrisumbel and I have been working on plans for naturalnode.com and i think that could be a pretty cool feature.

Im not super familiar with how browserify and friends work but i image you could download the package locally and just include the things you need directly. i.e. rather than include natural include something like lib/natural/tfidf.

let me know if that makes any sense of if you have any ideas on how it would work better.

@nicola
Copy link
Author

nicola commented Apr 10, 2015

Ok, a quick fix for now could be I think in browserify, you still want to do require('natural/tfidf').

Please do ping me for naturalnode.com

@jefffriesen
Copy link

I would love to see this modularized for similar reasons as @nicola. Doing interactive NLP in browsers opens opens up a lot of options that you just can't do on the server. For example, you can interactively train Bayes Classifiers for a subset of text to build intuition for a project you're working on (my current need). Or building a linguistic and machine learning teaching tool for a wider audience.

I know there are already GH issues for running this in the browser (#253, #25, #189) with some Browserify hacks, but this is another gentle lobby for reducing the barrier to getting it running on a browser by modularizing.

To start though, I tried the suggestion of require('natural/tfidf') and it did't work for me. Same with BayesClassifier. Here's a simple script where both of those fail:

// Doesn't work: Cannot find module 'natural/BayesClassifier'
// var BayesClassifier = require('natural/BayesClassifier')
// var classifier = new BayesClassifier()

// Doesn't work: Cannot find module 'natural/tfidf'
// var tfidf = require('natural/tfidf')

// Works fine
var natural = require('natural')
var classifier = new natural.BayesClassifier()

classifier.addDocument('i am long qqqq', 'buy')
classifier.addDocument('buy the q\'s', 'buy')
classifier.addDocument('short gold', 'sell')
classifier.addDocument('sell gold', 'sell')
classifier.train()

console.log('classifier: ', classifier.classify('i am short silver'))
console.log('classifier: ', classifier.classify('i am long copper'))
// output: classifier:  sell
// output: classifier:  buy

I traced the exports from here https://github.com/NaturalNode/natural/blob/master/lib/natural/index.js
They looked right but maybe I missed something simple.

As far as full-fledged modularizing goes, there are some popular examples in the wild to look at.

D3 did it by splitting it up into their own separate repos (as @kkoch986 suggested): https://github.com/d3

Lodash is all in one repo: https://github.com/lodash/lodash/ This is really nicely done. This allows you to do:

var merge = require('lodash/merge') 

// or in ES6: 
import {merge} from 'lodash'

// Now you can use merge instead of _.merge if that's all you want in your project
var ab = merge({a:1},{b:2}) 

I think the ideal browser setup would be to be able to import just the component you need, similar to the lodash example. It would import only the modules it needs. Maybe we can even specify the language (English, Russian, etc) to prevent loading additional files. It may also make sense to deprecate functions that require node-only modules such as fs. load() and save() are nice helpers but are fairly simple to replace if you're in a node environment (but not in a browser environment).

Unfortunately, at the moment I have almost no bandwidth to contribute to this effort. This is just a suggestion and I realize this is a ton of work. But I do think getting it working well in a browser and doing a couple browser-based demos would give it a lot of visibility. Either way, it's still a great project. Thank you.

@jefffriesen
Copy link

For those coming to this who want a workaround using webpack I have it working based off this these threads:

http://stackoverflow.com/a/27275791
pugjs/pug-loader#8

var webpack = require('webpack')
var ignore = new webpack.IgnorePlugin(new RegExp('^(lapack|WNdb)$'))
module.exports = {
  // lots of other options go here
  plugins: [ignore],
  node: {
    fs: "empty"  // loads an empty 'fs' module. Should work as long as you don't call functions that depend on 'fs'
  }
}

@kkoch986
Copy link
Member

@jefffriesen all interesting points, im also interested the the language stuff you mentioned (i mentioned it breifly in #159).

i dont think splitting it into different repos is the best option but id definitely be open to moving towards the lodash approach. I unfortunately dont have much bandwidth at the moment either.

I still have a somewhat big project to wrap up which might make the browserify thing a little less simple, i think the basic problem is that as we get into more complicated algorithms external files become more and more important. As a matter of fact the code im working on now will introduce a corpus-type interface (think nltk http://www.nltk.org/data.html). So i think maybe we need to plan a way to divide the browser-compatible code from the not but i dont think deprecating/removing will be it.

I would be happy to use this issue to figure out a solid plan for reorganizing things to make it more browser-friendly while keeping it flexible and keep moving towards getting more algorithms implemented. For starters i think ill merge #159 into this issue, i think both problems can be solved at once.

@jefffriesen
Copy link

Makes sense to merge #159 into this thread.

Just a clarification:

i think the basic problem is that as we get into more complicated algorithms external files become more and more important.

Do you mean external large files become more and more important? That is definitely true. I think external files are important for both server and browser environments though. (of course the browser can only handle smaller files). I'm not just talking about file uploads through the UI (although that is possible for public tools). I'm loading CSVs as the app starts up. The value that I'm seeing by doing this is that I can build interactive tools and visualizations to get a feel for the data and the algorithms. Once I'm confident in the approach, then I can point it to really big sets of files server-side.

I think there are two problems to solve for the browser:

  1. Node-only modules: From what I've seen, the only one is fs. My suggestion earlier about deprecating save() and load() wasn't that people wouldn't be able to load or save files, it was that if they needed to and they were on the server, they could use fs directly. But there may be another approach to take than removing those functions. A while back I ran into a similar problem where a server-side library needed the canvas element. So it required https://www.npmjs.com/package/canvas which is written in C behind the scenes. So my client-side app was ironicaly trying to load a C-based recreation of the canvas element, which is native to the browser. Someone wrote a really wrapper that picked the right canvas element based on the environment (https://github.com/dominictarr/canvas-browserify). I just checked and someone did something similar here with fs: https://www.npmjs.com/package/browserify-fs. I haven't used it and there may be others. I'm posting it here because I think the idea is interesting.
  2. Loading unnecessary files: language files, Wordnet database, etc. You summed this up nicely already. I also agree the lodash approach would be ideal. If I think of any other libraries that have done this effectively I'll add them to this thread.

@wassname
Copy link

wassname commented Apr 3, 2016

I successfully loaded natural into webpack to use in the front-end. I just copied index.js and commented out the exports that require 'fs'.

Here is the code. Hopefully it helps other with this.

@i039666
Copy link

i039666 commented Aug 24, 2020

Separate packages would be nice since it would mean that we also don’t need to download giant e.g. wordnet-db files as part of npm install. Exposing separate “public” entrypoints (e.g. import {PorterStemmer} from 'natural/porter-stemmer'; ) would be a runner-up as it would make the install size of the package big but let us avoid including the entire lib in the bundled app.

@alfaproject
Copy link

If the goal is better tree-shaking this is also a good alternative #608

(it won't help the npm package size but at least we can have smaller bundles in the browser/lambda)

@Hugo-ter-Doest
Copy link
Collaborator

Hugo-ter-Doest commented Nov 11, 2021

Some time ago I created an index.js in each module folder. The main index.js in lib/natural includes all of these index files. You can either require the complete index file or one of the modules.
Including a module looks like this:

const tfidf = require('natural/lib/natural/tfidf')

I think this addresses the issue.

@Gipetto
Copy link

Gipetto commented Nov 27, 2023

This is a few years after the fact, but it would be nice if the above information was in the documentation. I'm just using natural for the distance calculators right now and only importing those saves A LOT of space.

@Hugo-ter-Doest
Copy link
Collaborator

Hugo-ter-Doest commented Nov 27, 2023

I see what you mean. I will update the documentation for the use of separate modules.

@Hugo-ter-Doest
Copy link
Collaborator

See https://naturalnode.github.io/natural/development.html

@Gipetto
Copy link

Gipetto commented Nov 27, 2023

Awesome, thank you!

Also a side note: Using this technique to bring the distance measurement functions in to a SvelteKit app hosted on Vercel. Built using Vite.

Thank you for all your hard work!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

8 participants