Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Modification during iteration #17

Open
rtheunissen opened this issue Mar 6, 2016 · 42 comments
Open

Modification during iteration #17

rtheunissen opened this issue Mar 6, 2016 · 42 comments

Comments

@rtheunissen
Copy link
Member

It's currently possible to modify / update a collection during iteration. This leads to undefined behaviour, because the internal iterator is not aware of these modifications. There are three possible solutions here:

  1. Keep a modcount, throw exception during iterator validation if modcount is different.
  2. Keep an iterator count on the structure, and throw an exception during updates if > 0.
  3. Leave as is, it's the user's responsibility to be sensible.

Could also just raise a warning instead of throwing an exception.

@krakjoe, curious to hear your thoughts. 💡

@rtheunissen rtheunissen self-assigned this Mar 6, 2016
@rtheunissen rtheunissen added this to the 1.0.0-alpha milestone Mar 6, 2016
@rtheunissen
Copy link
Member Author

I'm leaning towards 1, as a modcount can also be used to make other shortcuts, such as having sort reset the modcount to 0, and have modcount == 0 imply a sorted collection. (Just an idea)

The overhead is similar between 1 and 2, where in the former you increment during updates and compare during validation, and in the latter you increment during iterator init but compare during updates.

@nikic
Copy link
Contributor

nikic commented Mar 6, 2016

However, shouldn't sorting also be forbidden while iterating? If you mix modcount usage for both, this will not be properly enforced, right?

@rtheunissen
Copy link
Member Author

Sorting wouldn't be allowed for obvious reasons. You could therefore add an element, then sort, and the iterator wouldn't know about it. So it's a flawed idea, unless sort uses a 'last sorted modcount' rather than updating the modcount.

That was just an example though, a potential bonus.

@john-wilkinson
Copy link

I think modifying values during iteration should be fine. The only concern here is modifying the actual collection- i.e., changing the size of the collection.

I feel like it would be fairly straightforward to template behaviour off Java in this instance.
https://docs.oracle.com/javase/7/docs/api/java/util/ConcurrentModificationException.html

@rtheunissen rtheunissen removed this from the 1.0.0-alpha milestone Jul 30, 2016
@rtheunissen rtheunissen added this to the 2.0.0 milestone Sep 4, 2016
@rtheunissen
Copy link
Member Author

Going with the "please be sensible" option here, it's on the user for now until this causes a problem that can't be avoided or is a big risk. Maybe 2.0 can see a modification guard that is incremented and decremented by the iterator, and all functions that modify should check the guard first.

@dktapps
Copy link
Contributor

dktapps commented Jan 26, 2019

I think this does need to be addressed with explicit exceptions. I'm currently in the process of migrating to Map from array on a project, and I ran into this unexpected behaviour on investigating a bug:

$m = new \Ds\Map();
for($i = 0; $i < 100000; ++$i){
	$m[$i] = $i;
}

foreach($m as $k => $v){
	unset($m[$k]);
}
var_dump(count($m));

outputs int(50000) instead of int(0). I did not realize this code wasn't supported until finding this issue.

@rtheunissen
Copy link
Member Author

I agree and will work to implement this asap. 👍

@krakjoe
Copy link
Contributor

krakjoe commented Jan 27, 2019

I'm sure I replied to your question outside of github ... but in case I didn't ... 1) ...

@rtheunissen
Copy link
Member Author

@krakjoe I was actually thinking 2. We can assume that there will be more modifications than iterator instances, so was thinking an if (UNEXPECTED(obj->iteratorCount)) { throw } on update could be faster than obj->modcount++; which would be checked during each step of the iteration? (Not that that is reason enough / significant).

Modcount:

  • Increment on update
  • Set starting modcount on iterator ctor
  • Check if modcount != starting modcount in valid?

Iterator count:

  • Check if iterator count > 0 on update
  • Increment iterator count on iterator init
  • Decrement iterator count on iterator dtor

There's also the possibility of modcount overflowing, though it would always be non-zero in that case, right?

@dktapps
Copy link
Contributor

dktapps commented Jan 27, 2019

Completely disallowing modifications during iteration sounds more desirable, because it'll make it easier to find where a bug is coming from.

If one part of user code adds a new hashtable key during iteration (without realizing it wasn't just overwriting) and the code only crashes later on when the next iterator next occurs, this is going to create some very hard-to-find bugs. The user will know an illegal modification occurred, but not where it occurred.

@dktapps
Copy link
Contributor

dktapps commented Jan 27, 2019

With that said, it's also possible to just disallow size-changing modifications if iteratorCount > 0 I guess. I strongly feel that any exception should be thrown at the point of attempted modification though.

@rtheunissen
Copy link
Member Author

I agree with that @dktapps

@krakjoe
Copy link
Contributor

krakjoe commented Jan 27, 2019

So long as this is true:

I strongly feel that any exception should be thrown at the point of attempted modification though.

It's all good ...

@rtheunissen
Copy link
Member Author

I have a draft implementation here: 76eec53

This follows @dktapps suggestion of size-changing modifications, which makes sense to me. I don't think we should disallow all modifications - only those that would corrupt the iterator.

The list I came up with and applied is:
Sequence:

  • push
  • pop
  • shift
  • unshift
  • clear

Map:

  • put
  • putAll
  • clear

Set:

  • add
  • remove
  • clear

Stack/Queue

  • push
  • pop
  • clear

This allows us to still do stuff like:

foreach ($sequence as $index => $value) {
    $sequence[$index - 1] = $value;
}

@rtheunissen
Copy link
Member Author

I will leave this open for discussion, but will clean up and release at some stage if no changes are suggested. I will need to update the tests and the polyfill also, but those are easy to do.

@rtheunissen
Copy link
Member Author

This will be implemented in 2.0 using copy-on-write for internal buffers.

@dktapps
Copy link
Contributor

dktapps commented Mar 14, 2019

If I understand you correctly @rtheunissen that means the iteration behaviour will be consistent with arrays (iterating on a copy, modification of original permitted), did I understand that correctly?

@rtheunissen
Copy link
Member Author

That is correct.

@rtheunissen
Copy link
Member Author

@dktapps it also means that a clone does not clone the data until the cloned object is written to. It creates a new instance which shares the same data, but is separated from other containers when modified.

@rtheunissen rtheunissen reopened this Mar 14, 2019
@rtheunissen rtheunissen removed this from the 2.0.0 milestone Mar 14, 2019
@rtheunissen
Copy link
Member Author

Still valid for 1.x

@githubeing
Copy link

is it safe to use IteratorIterator($collection) while modifying the $collection?

as i know, ArrayIterator can do this, why Collection cannot?

doesn't IteratorIterator create its own iterator or is it using the collection's one?

@githubeing
Copy link

e.g. if i want:

$iterator = new IteratorIterator($map);
$iterator->rewind();
$iterator->current();
$iterator->next();
$iterator->current();
$map->put($key, $value);
$iterator->next();
$iterator->current();

@githubeing
Copy link

githubeing commented Apr 14, 2019

my keys implement Hashable, so if Collection cannot do this, i'll have to use ArrayIterator and instead of $map[$key] = $value; i will use $arrayIterator[$key->hash()] = $value; directly.

I just don't get why Collection cannot simply do what ArrayIterator can

@rtheunissen
Copy link
Member Author

ArrayIterator can do this, why Collection cannot?

This is because arrays are copy-on-write/persistent and ext-ds structures do not support this yet. When you iterate an array it increases its reference count, so that when you modify the array it actually creates a copy of the entire array and the modification is done on the copy. The iteration will still continue on the previous version of the array.

doesn't IteratorIterator create its own iterator or is it using the collection's one?

I think it behaves a bit like this:

https://3v4l.org/G3cI7

<?php

// Collection
$collection = [1, 2, 3];

// IteratorIterator
$generator = function (iterable $i) {
    yield from $i;
};

// new IteratorIterator
$iterator = $generator($collection);

var_dump($iterator->current());

my keys implement Hashable, so if Collection cannot do this, i'll have to use ArrayIterator and instead of $map[$key] = $value; i will use $arrayIterator[$key->hash()] = $value; directly.

I would strongly suggest against this pattern. The hash is not the key. Multiple keys that are not equal at all can share the same hash.

I just don't get why Collection cannot simply do what ArrayIterator can

Because arrays are copy-on-write/persistent. ext-ds 2.0 will focus on persistence, but there are currently no plans to backport that to the 1.x branch.

If you must modify a collection that may be iterated somewhere, you should clone it before the iteration. However, creating a full copy every time is a major waste of resources so this should only be seen as a temporary solution while 2.x is being developed.

@githubeing
Copy link

githubeing commented Apr 14, 2019

my keys implement Hashable, so if Collection cannot do this, i'll have to use ArrayIterator and instead of $map[$key] = $value; i will use $arrayIterator[$key->hash()] = $value; directly.

I would strongly suggest against this pattern. The hash is not the key. Multiple keys that are not equal at all can share the same hash.

idk if i'm doing this right but.. according to ddd, i have an Entity and an EntityId classes. something like this:

class Entity { private EntityId $entityId; }
class EntityId { private int $integerId; }

in this case, $entity->entityId->integerId is both the hash and the unique identifier (unique amongst Entity instances).
I understand that "hash is not the key" but what is wrong with "the key is a hash"?
Or did you mean I should correct it in such a way: $arrayIterator[(string) $key->integerId] = $value?

@githubeing
Copy link

@rtheunissen thank you for your replies, it was interesting and informative to read them 👍

@githubeing
Copy link

ArrayIterator can do this, why Collection cannot?
This is because arrays are copy-on-write/persistent and ext-ds structures do not support this yet.

fyi, i've just got to know, that in php 7.3 ArrayIterator has a bug and cannot do this too. e.g.: https://3v4l.org/Ft0Dh

Output for 7.3.0 - 7.3.4
15:         valid(): bool(false)
16:       current(): NULL
17: offsetSet(5, 5): NULL
18:         valid(): bool(true)
19:       current(): UNKNOWN:0

Notice: Undefined variable: s in /in/2Aelr on line 5
20:                : 

after returning an undefined value on current() call, it stops the iteration even though i append new values

@nikic
Copy link
Contributor

nikic commented Apr 15, 2019

@githubeing Please create a report on bugs.php.net. That UNKNOWN:0 is definitely a bug.

@rtheunissen
Copy link
Member Author

@githubeing

I understand that "hash is not the key" but what is wrong with "the key is a hash"?

In your specific case, it seems like the hash is unique so could be used as a key, sure. It is not recommended when you can not guarantee that a hash will be unique.

@githubeing if you use a generator rather than an IteratorIterator and you clone the structure before you iterate it, does that solve your use case? I know that can be an expensive operation (the clone every time), but it may carry you until we implement an improvement around iteration.

@githubeing
Copy link

@githubeing Please create a report on bugs.php.net. That UNKNOWN:0 is definitely a bug.

i've done: https://bugs.php.net/bug.php?id=77903

you may vote for it if you want btw

@githubeing
Copy link

githubeing commented Apr 15, 2019

if you use a generator rather than an IteratorIterator and you clone the structure before you iterate it, does that solve your use case?

not sure i understand what you mean. in my specific case, i use the ArrayIterator as a cache to convert a given Generator to an Iterator which will also have the Map's get(EntityId $obj):Entity method. for me, this bug isn't a big problem, 'cause i can simply not to use the ArrayIterator Iterator methods untill i'm done with caching Generator's output during the first pass. it's difficult to explain with words, the code speaks much more comprehensible: https://3v4l.org/4dkKn (this is the version that modifies ArrayIterator during the iteration, which fails because of the php bug introduced in 7.3 version)

@githubeing
Copy link

githubeing commented Apr 15, 2019

the idea was to make a class that would wrap a generator, lazy cache it and perform fetches from database only at moments when the data is actually used. and the get(object $key) method fetches the database until it finds the requested item (and caches all it's fetched).
see example: https://3v4l.org/SWo7c

it prints the following (when all classes are loaded):

lazy load entities from repository.
doing something else...

start using our first 5 entities one by one...
only now we're querying the database:
SELECT * FROM entity WHERE id IN (40,37,34,31,28,25,22,19,16,13,10,7,4,1)
fetching another entity from database: # 40
using the entity # 40...
fetching another entity from database: # 37
using the entity # 37...
fetching another entity from database: # 34
using the entity # 34...
fetching another entity from database: # 31
using the entity # 31...
fetching another entity from database: # 28
using the entity # 28...

start using some particular entities: # 19, 28, 13...
going to use entity # 19...
fetching another entity from database: # 25
fetching another entity from database: # 22
fetching another entity from database: # 19
using the entity # 19...
going to use entity # 28...
using the entity # 28...
going to use entity # 13...
fetching another entity from database: # 16
fetching another entity from database: # 13
using the entity # 13...

start using our first 12 entities one by one...
using the entity # 40...
using the entity # 37...
using the entity # 34...
using the entity # 31...
using the entity # 28...
using the entity # 25...
using the entity # 22...
using the entity # 19...
using the entity # 16...
using the entity # 13...
fetching another entity from database: # 10
using the entity # 10...
fetching another entity from database: # 7
using the entity # 7...

i.e. the entity isn't fetched until it's being used directly.
as you can see, the ArrayIterator (in IndexedMapCachingOverGenerator->$generatorCache, that i've sent earlier) works fine on php 7.2 allowing to modify it during iteration
it'il be good if/when Map will can do it in the same way ArrayIterator does it now, but currently (taking into account that my Entities have integer ids) this isn't a problem for me

@githubeing
Copy link

@nikic , i've only just seen that you're the same person that fixed the bug 😂 cool 👍

@cspray
Copy link

cspray commented Aug 24, 2019

I encountered this issue where iterating over a Map I needed to conditionally remove some keys. The iteration using foreach would never get to the last element of the Map. If I used Map::map I was able to remove each element while also iterating over the entire collection.

@JasperHorn
Copy link

JasperHorn commented Mar 6, 2020

Until this is fixed, this should really be in the documentation!

(I was looking at the documentation of php.net of this extension. I do not know if it's actually generated from something in this repo, or that such a note would have to be added elsewhere...)

I was aware of the potential problem, but without a note in the documentation, I assumed the foreach would either be unaffected or just get the new object. Instead, the foreach ended up encountering an int (despite the fact that no int had ever been in the Set in question). Truly undefined behavior indeed.

@dktapps
Copy link
Contributor

dktapps commented Mar 12, 2020

@rtheunissen out of curiosity, how are you planning to implement COW? The only solution I can think of is to make the modification APIs fluent (return modified copy, or self if refcount == 1) and get rid of in-place modification (except for structures that specifically want to be modified in-place).

@rtheunissen
Copy link
Member Author

rtheunissen commented Mar 15, 2020

@dktapps internally we track the refcount of the internal structure attached to each object, and we copy that structure as necessary on write.

@dktapps
Copy link
Contributor

dktapps commented Mar 15, 2020

@rtheunissen My question was aimed at the PHP reference side of things.

So far as I know we can't easily mutate variable referencing this from within an extension to directly point to something else, which would be necessary to prevent other stuff from still referencing the same object, e.g. if two variables reference the same Vector, one of them would have to be changed to point to a different Vector PHP object when COW occurs. That's why I mentioned the fluency thing (it's the only practical way I can think of to control the PHP references).

It's possible my understanding of PHP internals is flawed, so this may not apply at all.

@dktapps
Copy link
Contributor

dktapps commented Mar 15, 2020

In code:

$vec = new Vector();
$vec2 = $vec;
$vec->push(1); //here $vec would need to be changed to a new Vector
var_dump($vec2); //should be empty

@rtheunissen
Copy link
Member Author

Each of those objects will hold a reference to an internal vector, and it is only that vector will which be copied. The values aren't stored in or alongside the PHP object itself. Two vectors can share the same internal vector.

@dktapps
Copy link
Contributor

dktapps commented Dec 11, 2020

For v1: What about implementing IteratorAggregate in the extension? That would enable the underlying datastructure to be directly copied on start of iteration into an iterator, which would bypass this problem entirely.

This issue is the biggest thing stopping me from adopting ds fully and it doesn't look like v2 is coming anytime soon, so my attention is focused on v1.

(I notice the polyfill already does this using generators. For example, Set's underlying array is referenced by-value here: https://github.com/php-ds/polyfill/blob/master/src/Set.php#L413)

@enumag
Copy link

enumag commented Dec 11, 2020

For v1: What about implementing IteratorAggregate in the extension?

I agree. It would be nice. See #166

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

9 participants