Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement the exercise "custom-set" #748

Open
siebenschlaefer opened this issue Jan 31, 2022 · 11 comments
Open

Implement the exercise "custom-set" #748

siebenschlaefer opened this issue Jan 31, 2022 · 11 comments

Comments

@siebenschlaefer
Copy link
Contributor

I want a translation of the exercise custom-set to C.

But I want YOUR input before I submit a PR.

Q: Which approaches do we want to allow/encourage?
The three most "obvious" approaches are

  1. Use a sorted (dynamically allocated) array.
    Pro: Probably the easiest to implement, cache-friendly.
    Con: insert() is in O(n).
  2. Use a binary tree (e.g. Red-Black-Tree).
    Pro: In my experience having implemented it correctly is a great achievement.
    Con: That's hard to implement and time-consuming to mentor.
  3. Use a hash table.
    Pro: insert() is in amortized O(1), contains() is in expected O(1).
    Con: For beginners that's a hard task, they have to choose a hash function, an approach for collision resolution, and they might even have to resize it. And it's probably hard to mentor, too.

The tests and the initial .h file can be written in a way that encourages/allows/disallows those three approaches, e.g. if custom_set_create() takes a hash function or if the struct custom_set is defined in the .h file that would steer the students in a certain direction.

Q: Is there a fixed limit for the size of the set?

  • Yes: That's much easier to implement.
  • No: IMHO that's closer to many "real-life" applications.

Q: Do we want to allow/encourage/enforce arbitrary element types?

  • Yes: That's a nice thing to learn and it should give some insight on how to implement those type of (old-school) generic data structures and how qsort() and bsearch() work.
  • No: The canonical tests only use integers. That's much easier to implement.

(BTW: The tests mention "mixed-type sets" in a comment, but I wouldn't know how to do that properly in C without hard-coding the allowed types or having some sort of "base struct".)

The canonical tests require these operations:
some sort of creation, add, empty, contains, subset, disjoint, equal, intersection, difference, and union.
Q: Do we want anything further, e.g. size or some sort of access to the elements?

And last not least, Q: Should be have this exercise at all in the C track?
Please be candid, I'm open to all sorts of constructive criticism.


I already did some of the work and created two translations. Both have a Python script that reads the canonical-data.json and prints the complete test_custom_set.c. Both implement the struct custom_set with a dynamically allocated sorted array.

In the first one the type of the elements is int, the function custom_test_create() takes a size and a variable argument list (probably a first in the C track, right?). The .h file looks like this:

struct custom_set;

struct custom_set *custom_set_create(size_t size, ...);
void custom_set_destroy(struct custom_set *cs);

size_t custom_set_size(struct custom_set const *cs);
bool custom_set_is_empty(struct custom_set const *cs);
bool custom_set_contains(struct custom_set const *cs, int element);
bool custom_set_is_subset_of(struct custom_set const *subset, struct custom_set const *superset);
bool custom_set_disjoint(struct custom_set const *cs1, struct custom_set const *cs2);
bool custom_set_equal(struct custom_set const *cs1, struct custom_set const *cs2);

bool custom_set_add(struct custom_set *cs, int element);
struct custom_set *custom_set_intersection(struct custom_set const *cs1, struct custom_set const *cs2);
struct custom_set *custom_set_difference(struct custom_set const *cs1, struct custom_set const *cs2);
struct custom_set *custom_set_union(struct custom_set const *cs1, struct custom_set const *cs2);

The second one is more generic, its custom_test_create() takes the size of the element type and a comparison function. The .h file looks like this:

struct custom_set;

struct custom_set *custom_set_create(
        size_t elem_size,
        int (*cmp)(void const *, void const *));
struct custom_set *custom_set_create_with_elems(
        size_t elem_size,
        int (*cmp)(void const *, void const *),
        size_t nelems, void const *elements);
void custom_set_destroy(struct custom_set *cs);

size_t custom_set_size(struct custom_set const *cs);
bool custom_set_is_empty(struct custom_set const *cs);
bool custom_set_contains(struct custom_set const *cs, void const *element);
bool custom_set_is_subset_of(struct custom_set const *subset, struct custom_set const *superset);
bool custom_set_disjoint(struct custom_set const *cs1, struct custom_set const *cs2);
bool custom_set_equal(struct custom_set const *cs1, struct custom_set const *cs2);

bool custom_set_add(struct custom_set *cs, void const *element);
struct custom_set *custom_set_intersection(struct custom_set const *cs1, struct custom_set const *cs2);
struct custom_set *custom_set_difference(struct custom_set const *cs1, struct custom_set const *cs2);
struct custom_set *custom_set_union(struct custom_set const *cs1, struct custom_set const *cs2);

Both translations compile, pass the tests and make memcheck.
Let me know what you think.

@wolf99
Copy link
Contributor

wolf99 commented Jan 31, 2022

Will definitely welcome such a PR!

Some thoughts:

Q: Which approaches do we want to allow/encourage?

Open to any approach really. Some general thoughts though:

  • When deciding on the complexity of the implied solution it will be good to consider where in the track progression it will fit
  • We already have a few exercises that boil down to: do a lot of fiddly stuff with dynamic allocation just to move some numbers around. I have been generally trying to avoid implementations where the memory allocation work heavily outweighs the actual work of the exercise just for the sake of implementing classic design structures. There's nothing necessarily wrong with those types of implementation, but I feel they might get boring/frustrating quite quickly.

Q: Is there a fixed limit for the size of the set?

Yes: That's much easier to implement.
No: IMHO that's closer to many "real-life" applications.

This might be a nice exercise to introduce varargs which, as you pointed out, is not used in any other exercise on the track yet.

Q: Do we want to allow/encourage/enforce arbitrary element types?

  • Mixed typing gets convoluted fast in C I feel. If you see an implementation that looks not too complicated then go for it. But again I would have concerns about the "busy-work" of allowing that outweighing the logic of the exercise itself.
  • If the canonical data sticks with integers that makes things easier for us and the students - remember we have to be able to also test these things in a clear enough manner that a student can infer what is expected of them

The canonical tests require these operations:
some sort of creation, add, empty, contains, subset, disjoint, equal, intersection, difference, and union.
Q: Do we want anything further, e.g. size or some sort of access to the elements?

Not sure - nothing springs to mind at the moment.

And last not least, Q: Should be have this exercise at all in the C track?

I see no reason not to add it and I feel it would be a welcome addition!

An additional thought, one of the slight sticking points I have come across before with exercises that can imply opaque data types, is that naive implementations can lead to the tests having to rely on the implementation to query the underlying data. Obviously it would be good to avoid that if possible, such that each test only tests one function, rather than some foo function plus inherently testing an accessor type function at the same time.

@wolf99
Copy link
Contributor

wolf99 commented Jan 31, 2022

A very small thought about the samples you have added:

Instead of always needing to refer to struct in struct custom_set cs, you could add a typedef in the header file such that typedef struct custom_set custom_set_t or similar and then simply use custom_set_t cs. Perhaps the shown usage is simply a result of using a generator script?

Other than that the first sample looks like a slightly simpler interface and includes the varargs which, as mentioned would be good to add to the track.

@ryanplusplus
Copy link
Member

I agree with @wolf99's comments. I'll add also that pushing for a red-black tree or hash would probably be a mistake as the complexity of the both of these far exceeds the complexity of set operations. If possible it would be nice to be as implementation-agnostic as possible so that students can start with a sorted (or even unsorted!) array and go with another representation if they're looking for a challenge.

@siebenschlaefer
Copy link
Contributor Author

When deciding on the complexity of the implied solution it will be good to consider where in the track progression it will fit

I haven't thought about that at all. Can you help with that?

Mixed typing gets convoluted fast in C I feel. If you see an implementation that looks not too complicated then go for it. But again I would have concerns about the "busy-work" of allowing that outweighing the logic of the exercise itself.

Then maybe we should have a C-specific exercise for that, e.g. implement dynamic array for elements with a size specified by the caller, with a few "generic" functions like partition() or find_if() that take a "comparator" or "predicate" function. But that's a completely separate topic.

If the canonical data sticks with integers that makes things easier for us and the students - remember we have to be able to also test these things in a clear enough manner that a student can infer what is expected of them
[...]
Other than that the first sample looks like a slightly simpler interface and includes the varargs which, as mentioned would be good to add to the track.

OK, I will ditch the "generic" version.

An additional thought, one of the slight sticking points I have come across before with exercises that can imply opaque data types, is that naive implementations can lead to the tests having to rely on the implementation to query the underlying data. Obviously it would be good to avoid that if possible, such that each test only tests one function, rather than some foo function plus inherently testing an accessor type function at the same time.

I'm not sure I understand. Do you mind rephrasing that?

I think there are just two alternatives:

  1. Define the struct in the .h file and have the tests access it directly.
  2. Rely on interface and call the functions from the solution, e.g. call union() to get a new set and then call equal() (also defined by the user) to compare it with the expected result. Or instead of equal() have some way of accessor functions that allow inspecting the size and elements ("binary-search-tree" does something like that).

I tend to go with calling equal().

Instead of always needing to refer to struct in struct custom_set cs, you could add a typedef in the header file such that typedef struct custom_set custom_set_t or similar and then simply use custom_set_t cs. Perhaps the shown usage is simply a result of using a generator script?

Actually that was a deliberate decision (just like the "east const"). As far as I can tell there's no consensus in the C community about whether to "always typedef" or "never typedef". The proponents for the typedef say it's shorter and easier to read, the opponents say that it hides the nature of the type. And both sides can be pretty opinionated (see Why should we typedef a struct so often in C? on StackOverflow).
(BTW: Personally I have no problem with either style, I use the established style of the existing code, but if I start something new I avoid those typedef-ed structs).

I did not use the typedef here to show new C programmers that they don't have to use it but of course they are free to add it in their solution.
But that's not essential. I'll add the typedef if that's the consensus here.

I'll add also that pushing for a red-black tree or hash would probably be a mistake as the complexity of the both of these far exceeds the complexity of set operations.

All the programming languages I know implement sets either with balanced binary trees or hash tables.
But I agree, that's pretty complex.

@wolf99
Copy link
Contributor

wolf99 commented Feb 1, 2022

An additional thought, one of the slight sticking points I have come across before with exercises that can imply opaque data types, is that naive implementations can lead to the tests having to rely on the implementation to query the underlying data. Obviously it would be good to avoid that if possible, such that each test only tests one function, rather than some foo function plus inherently testing an accessor type function at the same time.

I'm not sure I understand. Do you mind rephrasing that?

An example of the logic of a sample test function (psuedo code):

add_a_value_to_a_set(the_set, the_value);   // the function intended to be tested
result = retrieve_the_set_values(set);  // this  call is a problem as it means the test relies on the implementation to inspect the data
test(result, expected);

The first line is fine, that's the mutator that is under test.
If the data structure is opaque to the tests, then for the second line it must use an implementation defined accessor (e.g. something like a void const *custom_set_get(struct custom_set *cs); function or the equal() function you mention ) to retrieve a result that can be input to the test framework.
Having the tests rely on the implementation is bad practice as it effectively means that actually two things are under test
This could be an issue for example, when creating tests for the add test cases.

The problem is avoided if the data type is transparent, i.e. the struct is defined in the .h file.

I did not use the typedef here to show new C programmers that they don't have to use it but of course they are free to add it in their solution.
But that's not essential. I'll add the typedef if that's the consensus here.

Agreed, typedef is not mandatory and its OK to not use it here if you prefer not to 🙂 .

@wolf99
Copy link
Contributor

wolf99 commented Feb 1, 2022

When deciding on the complexity of the implied solution it will be good to consider where in the track progression it will fit

I haven't thought about that at all. Can you help with that?

Anyone know where that track curriculum visualiser is and whether it works with v3? (asking on slack: https://exercism-team.slack.com/archives/CAQP7JL3T/p1643749842495399)
TBH I am more used to looking at the repo and so thinking of the exercises in alphabetical order than in the track progression!
Very generally speaking I guess a more complex implementation would fit later in the track rather than earlier in the curriculum

@ryanplusplus
Copy link
Member

When deciding on the complexity of the implied solution it will be good to consider where in the track progression it will fit

I haven't thought about that at all. Can you help with that?

Anyone know where that track curriculum visualiser is and whether it works with v3? (asking on slack: https://exercism-team.slack.com/archives/CAQP7JL3T/p1643749842495399) TBH I am more used to looking at the repo and so thinking of the exercises in alphabetical order than in the track progression! Very generally speaking I guess a more complex implementation would fit later in the track rather than earlier in the curriculum

I'm not at a computer now, but it used to be part of configlet and may still be.

@wolf99
Copy link
Contributor

wolf99 commented Feb 9, 2022

This is the tool I was thinking of but it does not work with v3 configs :(
https://jonmcalder.shinyapps.io/exercism-config-viz/

@siebenschlaefer
Copy link
Contributor Author

@ryanplusplus wrote

If possible it would be nice to be as implementation-agnostic as possible so that students can start with a sorted (or even unsorted!) array and go with another representation if they're looking for a challenge.

@wolf99 wrote

If the data structure is opaque to the tests, then for the second line it must use an implementation defined accessor (e.g. something like a void const *custom_set_get(struct custom_set *cs); function or the equal() function you mention ) to retrieve a result that can be input to the test framework.
Having the tests rely on the implementation is bad practice as it effectively means that actually two things are under test
This could be an issue for example, when creating tests for the add test cases.

I have thought long and hard about that, and I can't see how I can both keep the tests implementation-agnostic and avoid relying on the implementation (e.g. avoiding having to call retrieve_the_set_values() to see whether add_a_value_to_a_set() was successful.)

I believe the statement "Having the tests rely on the implementation is bad practice as it effectively means that actually two things are under test" is wrong. In consequence that would mean when some function changes some state we would always need some way to inspect the changed state directly. The could be no hidden state, static global variables, no opaque data structures. Imagine that in the Java world for example. You have an operation on a class and you can't call any other method of that class because that would put that second method under test, too, instead all the member variables would have to be public. Black box testing in general would not be possible.
And wouldn't that mean that we should rewrite the tests of several exercises? The tests of binary-search-tree, circular-buffer, linked-list, and react call functions to arrange some state, call a function to act, and then call functions to assert the expected state.

I would like to test the public interface through the functions of the set. Objections? Did I miss or misunderstand anything?

@wolf99
Copy link
Contributor

wolf99 commented Apr 8, 2022

Hi,
I think you have a good point!
So far when creating exercises I have tried to keep the tested data public.

But reading your thoughts above I agree that this does not need to be a hard requirement.

@ryanplusplus
Copy link
Member

I would like to test the public interface through the functions of the set. Objections? Did I miss or misunderstand anything?

💯

This is what I was trying to advocate. Using only the public API means that we don't couple the tests to the implementation details so that a wide variety of implementations (with varying performance and difficulty!) are possible.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants