data.table principles #5693

Closed
TysonStanley opened this issue Sep 27, 2023 · 25 comments
Labels
governance Project governance

Comments

@TysonStanley
Member

TysonStanley commented Sep 27, 2023

As part of #5676 we would also like to compile a list of principles tied to data.table. This will be incorporated into other material over the next few months, but we wanted to see what you all thought about the list we have initially put together.

  • Few (if any) dependencies
  • Time & memory efficiency
  • Concise syntax (minimal redundancy in code)
  • Few breaking changes
  • Backward compatibility (certain time period of support, e.g., R version from 5 years ago?)
  • Comprehensive and accessible documentation
  • Clear error and warning messages

Anything you'd add to this list? Anything you'd argue does not belong on the list?

Thanks!

@jangorecki
Member

jangorecki commented Sep 28, 2023

Low memory usage.
Re 6: I am not sure. We provide only Chinese translations, so I don't think the point fits very well.

@TysonStanley
Member Author

Thanks, @jangorecki. I guess 6 was more of a goal than a current practice? I really like the addition of low memory usage. Will include that.

@jangorecki
Member

I don't think we want to maintain multiple languages. Chinese, Russian, Spanish/Portuguese is a reasonable maximum IMO.

@DavidArenburg
Member

Good list
re 5, I think you could rephrase to "backward compatibility"
re 6, what does it mean? Like documentation wise? If so, who is going to maintain that?

@jangorecki
Member

jangorecki commented Sep 28, 2023

@DavidArenburg only error/warning/verbose messages. We have them translated to Chinese for users whose session runs in a Chinese locale. Point 4 is about backward compatibility (in our API). Point 5 extends it to running on old R versions.

@jangorecki
Member

jangorecki commented Sep 28, 2023

I would possibly add comprehensive documentation to the list. I haven't seen a package documented better than DT, actually. Many just write a minimal manual and put more info into vignettes, which is indirect documentation when it comes to the description of a function, its return value, etc. Vignettes should be accompanying documentation, not the main one.

@tdhock
Member

tdhock commented Sep 28, 2023

About international/multilingual/translations, it is true that only Chinese is supported in current message translations. Going forward in the next two years, I plan to invite more translators (of messages and docs), and I actually have money to pay them (20 translation projects, US$500 each). I expect that whoever contributes the initial translation may be interested in maintaining it in the future. The goal of the translation effort is to increase the number of potential users and contributors in the data.table ecosystem.

@tdhock
Member

tdhock commented Sep 28, 2023

I wonder if you could please clarify point 4? Maybe change "Few breaking changes" to "Few breaking changes, to make it easy for other packages to use data.table". Is that what you meant?

@TysonStanley
Member Author

By point 4, yes, in my experience DT was always very careful about any releases that would have breaking changes requiring changes to other packages/code bases. There could be a better way of phrasing it but that was the idea behind it.

@TysonStanley
Member Author

> I would possibly add comprehensive documentation to the list. I haven't seen a package documented better than DT, actually. Many just write a minimal manual and put more info into vignettes, which is indirect documentation when it comes to the description of a function, its return value, etc. Vignettes should be accompanying documentation, not the main one.

@jangorecki I agree. I'll add comprehensive documentation to the list as number 8.

@markseeto
Contributor

Maybe consider including something about readability/usability. This could be its own principle, or part of principle 3, e.g. "Concise syntax (minimal redundancy in code), while maintaining readability and ease of use".

The reason I suggest this is that data.table seems to have a reputation for being fast but relatively difficult to learn and use. I sometimes see comments like (paraphrasing) "tidyverse is fantastic, and in situations where speed is really important, there's data.table", as though the only advantage of data.table is its speed.

Maybe also consider adding something about extensive functionality, unless this goes without saying.

@jangorecki
Member

jangorecki commented Sep 29, 2023

> relatively difficult to learn and use

It boils down to where you as a user are coming from.
If you are a psychologist just doing some stats, then I can imagine you may find it hard.
If you are coming from data analytics (databases, SQL), maths, relational algebra, or engineering, you are likely to find it not only easy, but much easier than anything else that exists in R (for data.frame), and far superior to the tools you are coming from.
My career shifted from data warehouses to R exactly because of that.

I understand your point well, and am observing the same. It's just that if we want to counter some judgments about data.table syntax, marketed at some point by a new project that was targeting a less technical audience, then we should try to do it very precisely. @arunsrinivasan made a nice comment on syntax in his SO answer here: https://stackoverflow.com/a/27718317/2490497

> Maybe also consider adding something about extensive functionality, unless this goes without saying.

Another good point.

@TysonStanley
Member Author

@markseeto for the extensive functionality, I think that makes a lot of sense. As I think about it, there is definitely some overlap with concise syntax as there are a bunch of things that can be done without going away from the DT[i, j, by] syntax (e.g., any data frame operation, grouping functions, aggregation, joins, etc.). Is there a way to communicate that concisely in the list? Something like "Extensive functionality with minimal need for additional functions" or something?

@jangorecki thanks for that link. Feel like that answer should be turned into a blog post or something too. So much gold in that. Also, I think your point of it naturally fitting with SQL (relational databases) is one of its immediate strengths in learning the code. Was wondering if there is a principle there, potentially? Like "syntactic overlap with data analytics, engineering, and mathematics" or something like that?

@markseeto
Contributor

@TysonStanley For "extensive functionality", what I'm thinking of is separate from concise syntax, although I agree that there is some overlap. I'm thinking of the ability to do an extensive range of useful operations with the data, whether that's with DT[i, j, by] syntax or with functions like dcast, groupingsets, etc. But maybe this isn't really a "principle" like the principles you've listed.
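As a rough illustration of the two kinds of functionality discussed above, here is a minimal sketch. The toy table and its column names (`id`, `year`, `value`) are made up for this example and do not come from the thread. It shows one `DT[i, j, by]` call that filters, aggregates, and groups in a single expression, followed by a reshape with `dcast`.

```r
library(data.table)

# Hypothetical toy data, invented for illustration only
DT <- data.table(
  id    = c(1L, 1L, 2L, 2L),
  year  = c(2022L, 2023L, 2022L, 2023L),
  value = c(10, 20, 30, 40)
)

# i filters rows, j computes a summary, by groups: all in one call
totals <- DT[value > 10, .(total = sum(value)), by = id]

# A dedicated function for an operation outside the bracket syntax:
# reshape from long to wide, one column per year
wide <- dcast(DT, id ~ year, value.var = "value")

print(totals)
print(wide)
```

The first call covers what would otherwise take separate filter, group, and summarise steps, which is the overlap with concise syntax mentioned earlier; the `dcast` call is an example of functionality that lives in its own function.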

@MichaelChirico
Member

> I don't think we want to maintain multiple languages. Chinese, Russian, Spanish/Portuguese is a reasonable maximum IMO.

I think this table, which I prepared for core R, is a useful reference:
Table 'Languages with R Translations' from https://docs.google.com/document/d/1XbfOf3CLVb2UFyUZGJoVLkBUDZ6Hs3APCDW8UzuOvZk/edit

A list which includes Russian/Spanish should include at least Arabic and a South Asian language (e.g. Hindi).

Anyway, agree there is some maintenance overhead, but tooling changes can reduce that overhead. Rather than set an "arbitrary" limit, I'd rather the maintainers decide incrementally (1/2 languages at a time) whether to accept new translations.

For now, my bigger concern has been package size. The checked-in .mo binaries are about 0.22 MiB per language, and the plain-text .po files are about 0.26 MiB -- precious storage given we're always bumping up against the size limit that triggers a CRAN note. There is some initial discussion with R core about generating .mo at build time, but that's probably a ways off still.

BTW, in the initial quest for Chinese translations, I made sure to make note of other community members offering translations in other languages. Those are: Vietnamese, French, Russian, Portuguese, Farsi, Turkish, Hindi. That was already 4 years ago, so of course we would need to check their interest again.

@MichaelChirico
Member

> Comprehensive documentation

I would say 'Comprehensive and accessible documentation'. I think we strive for Rd/vignettes, error/warning messages, and NEWS entries that are both technically complete and user-friendly.

@MichaelChirico
Member

Is the list meant to be numbered? i.e. are these principles ranked? If so, putting computational & memory efficiency in the same bullet makes sense to me.

@TysonStanley
Member Author

@MichaelChirico thanks! It's not ranked necessarily so I made it bullets instead. And updated it with your suggestions.

@tdhock
Member

tdhock commented Oct 2, 2023

I feel like the international/multilingual bullet point could be deleted, since that is the "accessible" part of "Comprehensive and accessible documentation"?

@tdhock
Member

tdhock commented Oct 12, 2023

I think it would clarify/simplify things to combine "Few breaking changes" with "Backward compatibility", since they are both about stability of the code. How about "Stable code base (easy for users to upgrade to new data.table, and compatible with old R versions)"?
Or clarify each item?
"Few breaking changes (easy for users to upgrade to new data.table versions)"
"Compatible with old versions of base R"

@tdhock
Member

tdhock commented Oct 12, 2023

@MichaelChirico "I made sure to make note of other community members offering translations in other languages, those are: Vietnamese, French, Russian, Portuguese, Farsi, Turkish, Hindi. That's already 4 years ago, so of course would need to check their interest again." -> could you please send me their contact info, so I can ask if they would be interested in applying for translation project awards?

@MichaelChirico
Member

MichaelChirico commented Oct 12, 2023 via email

@stefanfritsch

I'd definitely add clear error messages that provide underlying causes, explanations and possible solutions.

I.e., not "NA where TRUE/FALSE needed" but "it seems you didn't specify x but we need it because of y. Try z if unsure."

Your errors have helped me convert a few users.
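The kind of message described above can be sketched in a few lines. This is only an illustration: `scale_by` and its `factor` argument are hypothetical and are not part of data.table. The error states what is missing, why it is needed, and what to try.

```r
# Hypothetical helper, not a data.table function
scale_by <- function(x, factor = NULL) {
  if (is.null(factor)) {
    # Name the cause, explain why it matters, and suggest a fix
    stop(
      "It seems you didn't specify 'factor', but it is needed to rescale 'x'. ",
      "Try factor = 1 if unsure.",
      call. = FALSE
    )
  }
  x * factor
}

# Capture the message instead of aborting, so it can be inspected
msg <- tryCatch(scale_by(1:3), error = function(e) conditionMessage(e))
print(msg)
```

Compared with a bare "missing value" failure deep inside the function, the check at the top reports the problem in terms of the argument the user actually controls.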

@tdhock pinned this issue Oct 17, 2023

@leofontenelle
Contributor

leofontenelle commented Oct 18, 2023

> About international/multilingual/translations, it is true that only Chinese is supported in current message translations. Going forward in the next two years, I plan to invite more translators (of messages and docs), and I actually have money to pay them (20 translation projects, US$500 each). I expect that whoever contributes the initial translation may be interested to maintain in the future. The goal of the translation effort is to increase the number of potential users and contributors in the data.table ecosystem.

If Brazilian Portuguese is to be one of the languages, please contact me. I used to translate GNOME to pt_BR and even coordinated the national l10n team until I decided to focus on activities closer to my profession (medicine), which eventually came to mean doing research, which is how I know data.table. I'm not necessarily offering myself (although the money is tempting), but I can find one or another competent free software translator here and help them as needed.

edit: now I see someone else volunteered already, so I guess they should probably be the first choice

@MichaelChirico
Member

> edit: now I see someone else volunteered already, so I guess they should probably be the first choice

FWIW Mandarin took a team of 26 translators -- it's a rather sizeable pool of messages to translate, so having >1 hand available will be appreciated.


8 participants