Removing WikiHow dataset at WikiHow request #3034
Sadly, the WikiHow team contacted me and said that their data cannot be used in this way...
I have contacted LAION and some other devs about the legality of the WikiHow ban and potential solutions, but we probably won't be able to use my dataset in its current form.
What legal basis or ability does WikiHow even have to take action against an open-source project like this? Is there any precedent for lawsuits over scraping websites and training a model on them? Training algorithms on copyrighted data is not illegal, according to the United States 2nd Circuit Court, btw.
I think you're right that there's no legal problem here, and we should not talk as if there is. However, I think that as a project OA is inclined to respect the wishes of website operators if they ask us not to use their data.
Regarding the removal of the WikiHow dataset, I share your disappointment and believe that the decision made by the WikiHow team was unjustified considering the nature of both projects. The fact that wikiHow content is licensed under a Creative Commons Unported License indicates that reuse and distribution are permitted, provided that attribution is given (if they want credit, let's just give it to them). Therefore, I don't see any clear reason why the wikiHow dataset couldn't be incorporated into the Open Assistant project. It's important to note that open-source projects rely heavily on the contributions made by many individuals and other projects. It wouldn't make sense to exclude valuable, relevant sources simply because someone claims ownership over them. This is not how open source works, and I imagine that many contributors to the wikiHow platform, who spend countless hours contributing their knowledge under a Creative Commons license, would agree! Openness, sharing, and collaboration between projects should be encouraged to ensure the continued advancement of machine learning technology. Finally, I would like to add that many closed-source LLMs like ChatGPT also rely on open-source projects to train their models, and I don't think they offer an opt-out option. In contrast, Open Assistant is an open-source project that gives anyone the opportunity to contribute to and improve the dataset. By removing WikiHow articles, we are not only limiting the knowledge pool but also unfairly handicapping the project's potential. What is the next exclusion, the Wikimedia Foundation?