Polyflow task sync sometimes not working #1251

simonhir · 2024-01-29T09:35:01Z

Very rarely, task changes in the engine are not correctly reflected in Polyflow: tasks are either not created or not closed.

Observations with below example

The task and the process have been completed in the engine
TaskCompletedEngineEvent is correctly stored in the DOMAIN_EVENT_ENTRY table in the engine db
The task is still present in the plf_task table in the tasklist db

Acceptance criteria

All events are correctly represented in the tasklist

Reference

INC0380533
SCTASK0714575
INC0382149
INC0382566
INC0382560
https://git.muenchen.de/digitalisierung/digiwf-support/-/issues/442


simonhir · 2024-01-29T10:46:40Z

The problem seems to be that the order of the messages by timestamp conflicts with their order by global_index. As a result, the task was closed and then recreated by the assign event.
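To illustrate how a conflicting order can close a task and then recreate it, here is a minimal Python sketch. The `project` function, the event dictionaries, and the upsert behavior of the assign event are hypothetical simplifications for illustration, not Polyflow's actual projection code.

```python
# Hypothetical, simplified task view; not Polyflow's actual projection code.
def project(events, order_key):
    """Replay events sorted by order_key and return the resulting open tasks."""
    tasks = {}
    for event in sorted(events, key=lambda e: e[order_key]):
        task_id = event["task_id"]
        if event["type"] == "created":
            tasks[task_id] = {"assignee": None}
        elif event["type"] == "assigned":
            # Assumption: assignment is an upsert, so it recreates a missing task.
            tasks.setdefault(task_id, {})["assignee"] = event["assignee"]
        elif event["type"] == "completed":
            tasks.pop(task_id, None)
    return tasks


# The assign event is older than the complete event by timestamp,
# but it carries a higher global_index.
events = [
    {"task_id": "t1", "type": "created", "timestamp": 1, "global_index": 1},
    {"task_id": "t1", "type": "assigned", "assignee": "alice",
     "timestamp": 2, "global_index": 52},
    {"task_id": "t1", "type": "completed", "timestamp": 3, "global_index": 3},
]

by_timestamp = project(events, "timestamp")        # task ends up closed
by_global_index = project(events, "global_index")  # task reappears after completion
```

Replayed by timestamp the task is gone at the end; replayed by global_index the late assign event leaves a task behind, which matches the symptom described above.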

simonhir · 2024-01-29T10:51:07Z

For the above-mentioned incident, the solution was to manually delete the task together with all related data.

simonhir · 2024-02-02T08:13:33Z

The problem reoccurred, but this time the timestamps and global_index are not in conflicting order. So this does not seem to be the cause of the problem.

simonhir · 2024-02-02T10:58:31Z

The problem reoccurred, but this time the task was not created in the tasklist.

zambrovski · 2024-02-08T12:56:33Z

We need more metrics in Prometheus to detect the occurrences.

Commit referencing this issue:

* ongoing metrics redefinition
* configured kafka-ui, resorted services
* change visibility
* wip: new gauge
* new logging
* more logging
* #1251: move event EventMessageCountingMonitor setup to constructor. add /actuator/prometheus to permitted paths
* create additional metrics for tasks, reduce loggingin integration itests
* confgured monitoring
* tasklist service fix prometheus path
* remove unnneded config
* rename metric to match the sender side
* rename label

Co-authored-by: stephan.strehler, Simon Hirtreiter

darenegade · 2024-02-20T09:15:03Z

Waiting for #1229 in order to narrow down the error

Later commits that include these changes: Feature/monitoring (#1306); configurble metrics, fix #1338 (#1353); Feature/metrics for 1.7 release (#1356).

simonhir · 2024-03-01T10:03:40Z

@zambrovski as part of #1392 I triggered a reimport on Processes-Hotfix and noticed the following:

Initial state: the engine has 6 more tasks than the tasklist (engine: 120, tasklist: 114)
During the import, an Axon event is sent and received for every task in the engine (120), but the number of tasks in the tasklist does not increase
No errors are visible in the tasklist log

To me it looks as if the tasklist changes nothing for some events, and as if this is caused by the tasks themselves, because apparently the same tasks are ignored every time.
The affected tasks are linked here: #1392 (comment)

zambrovski · 2024-03-07T10:36:42Z

This was not confirmed... Improved logging has been added and released with Polyflow 4.1.4. It is now integrated, and we will see what happens.

darenegade · 2024-03-21T12:48:30Z

The problem could not be reproduced in a coordinated load test and has not occurred for quite some time.

We are setting up an alert on Processes-* and Prod. The ticket is closed for now until the problem comes up again.

simonhir · 2024-03-22T12:18:36Z

Occurred on Processes-Test for instance ecb94808-e82a-11ee-8b4a-0a580a8a27ad with task 9ef184d3-e82b-11ee-90eb-0a580a8a319e:

In this case the engine crashed at that moment, which led to the following error message. As the error message suggests, no Axon event was created at all in this case.

Log

2024-03-22T10:07:38.741

org.axonframework.commandhandling.NoHandlerForCommandException: No handler was subscribed for command [io.holunda.camunda.taskpool.api.task.CreateTaskCommand].
	at org.axonframework.commandhandling.SimpleCommandBus.doDispatch(SimpleCommandBus.java:167)
	at org.axonframework.commandhandling.SimpleCommandBus.lambda$dispatch$1(SimpleCommandBus.java:131)
	at org.axonframework.tracing.Span.run(Span.java:101)
	at org.axonframework.commandhandling.SimpleCommandBus.dispatch(SimpleCommandBus.java:125)
	at org.axonframework.commandhandling.gateway.AbstractCommandGateway.send(AbstractCommandGateway.java:76)
	at org.axonframework.commandhandling.gateway.DefaultCommandGateway.send(DefaultCommandGateway.java:83)
	at io.holunda.polyflow.taskpool.sender.gateway.AxonCommandListGateway.sendToGateway(AxonCommandListGateway.kt:31)
	at io.holunda.polyflow.taskpool.sender.task.DirectTxAwareAccumulatingEngineTaskCommandSender.send(DirectTxAwareAccumulatingEngineTaskCommandSender.kt:27)
	at io.holunda.polyflow.taskpool.sender.task.TxAwareAccumulatingEngineTaskCommandSender$send$2.beforeCommit(TxAwareAccumulatingEngineTaskCommandSender.kt:46)
	at org.springframework.transaction.support.TransactionSynchronizationUtils.triggerBeforeCommit(TransactionSynchronizationUtils.java:97)
	at org.springframework.transaction.support.AbstractPlatformTransactionManager.triggerBeforeCommit(AbstractPlatformTransactionManager.java:916)
	at org.springframework.transaction.support.AbstractPlatformTransactionManager.processCommit(AbstractPlatformTransactionManager.java:727)
	at org.springframework.transaction.support.AbstractPlatformTransactionManager.commit(AbstractPlatformTransactionManager.java:711)
	at org.springframework.transaction.support.TransactionTemplate.execute(TransactionTemplate.java:152)
	at org.camunda.bpm.engine.spring.SpringTransactionInterceptor.execute(SpringTransactionInterceptor.java:71)
	at org.camunda.bpm.engine.impl.interceptor.ProcessApplicationContextInterceptor.execute(ProcessApplicationContextInterceptor.java:70)
	at org.camunda.bpm.engine.impl.interceptor.CommandCounterInterceptor.execute(CommandCounterInterceptor.java:35)
	at org.camunda.bpm.engine.impl.interceptor.LogInterceptor.execute(LogInterceptor.java:33)
	at org.camunda.bpm.engine.impl.interceptor.ExceptionCodeInterceptor.execute(ExceptionCodeInterceptor.java:55)
	at org.camunda.bpm.engine.impl.jobexecutor.ExecuteJobHelper.executeJob(ExecuteJobHelper.java:57)
	at org.camunda.bpm.engine.impl.jobexecutor.ExecuteJobsRunnable.executeJob(ExecuteJobsRunnable.java:110)
	at org.camunda.bpm.engine.impl.jobexecutor.ExecuteJobsRunnable.run(ExecuteJobsRunnable.java:71)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
	at java.base/java.lang.Thread.run(Thread.java:840)

darenegade · 2024-03-22T12:38:02Z

There seems to be a race condition during graceful shutdown, because the handler for CreateTaskCommand does exist; otherwise nothing would work at all.

darenegade · 2024-03-22T12:53:53Z

Change for graceful shutdown:
#1207

simonhir · 2024-03-25T08:31:23Z

Corresponding Axon issue: AxonFramework/AxonFramework#891
I still need to check in which version this was fixed and whether it should actually already work.

darenegade · 2024-03-25T08:38:26Z

"I still need to check in which version this was fixed and whether it should actually already work."

I had already checked that. Axon has had this since 4.3, and we use 4.9.

What I am not yet sure about is whether it is integrated correctly on our side so that it actually takes effect, and whether Polyflow supports it correctly.

simonhir · 2024-03-28T09:32:03Z

Current status: events are always written correctly to the DOMAIN_EVENT_ENTRY table, but are not forwarded from there to the tasklist.

darenegade · 2024-03-28T09:34:48Z

https://docs.axoniq.io/reference-guide/extensions/kafka

simonhir · 2024-03-28T11:15:33Z

Debugging by setting the token to a past value in order to trigger the events being sent again.

select processor_name, to_char(token) from token_entry where processor_name = 'de.muenchen.oss.digiwf.task.polyflow.kafka';


update token_entry set token = to_blob( utl_raw.cast_to_raw('{"index":1010650,"gaps":[]}')) where processor_name = 'de.muenchen.oss.digiwf.task.polyflow.kafka';
commit;

Missing tasks are still not created.

zambrovski · 2024-03-28T14:57:39Z

Current diagnosis

global_index and timestamp are not consistent in domain_event_entry: events with a higher index appear earlier in time than events with lower numbers. As a result, the TEP (tracking event processor) does not deliver the events to Kafka correctly, which leads to the discrepancy between the engine and the tasklist.

Cause

Using the JPA/Hibernate default sequence increment of 50 leads to this behavior when more than one node is running.
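The effect can be sketched with a small Python simulation. It is a deliberately simplified model, not Hibernate's actual pooled optimizer: each node reserves a block of `increment` ids from a shared sequence and hands them out locally. `SharedSequence`, `Node`, `simulate`, and `out_of_order_pairs` are hypothetical names introduced only for this illustration.

```python
class SharedSequence:
    """Stand-in for the database sequence; every call advances by `increment`."""

    def __init__(self, increment):
        self.increment = increment
        self.value = 1 - increment  # so that the first call returns 1

    def next_value(self):
        self.value += self.increment
        return self.value


class Node:
    """One application node that hands out ids from a locally reserved block."""

    def __init__(self, sequence):
        self.sequence = sequence
        self.next_id = None
        self.block_end = None

    def allocate_id(self):
        # Reserve a new block only when the local block is used up.
        if self.next_id is None or self.next_id > self.block_end:
            start = self.sequence.next_value()
            self.next_id = start
            self.block_end = start + self.sequence.increment - 1
        allocated = self.next_id
        self.next_id += 1
        return allocated


def simulate(increment, writers):
    """writers lists the writing node per event in time order.

    Returns (timestamp, global_index) pairs, with the list position as timestamp.
    """
    sequence = SharedSequence(increment)
    nodes = [Node(sequence) for _ in range(max(writers) + 1)]
    return [(t, nodes[w].allocate_id()) for t, w in enumerate(writers)]


def out_of_order_pairs(events):
    """Count pairs where the earlier event has the higher global_index."""
    return sum(
        1
        for i, (_, earlier_index) in enumerate(events)
        for (_, later_index) in events[i + 1:]
        if earlier_index > later_index
    )


writers = [0, 1, 0, 1, 0, 1]           # two nodes writing alternately
with_pool = simulate(50, writers)      # indexes 1, 51, 2, 52, 3, 53
without_pool = simulate(1, writers)    # indexes 1, 2, 3, 4, 5, 6
```

With an increment of 50 the second node writes index 51 before the first node writes index 2, so index order and time order diverge; with an increment of 1 every id comes straight from the sequence and the two orders agree in this model.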

Fix

Set INCREMENT BY of DOMAIN_EVENT_ENTRY_SEQ to 1 (in the DB)
Set the Spring property: spring.jpa.properties.hibernate.id.optimizer.pooled.preferred=none

More details

Further notes

In case this does not work after all, I have also read that there is hibernate.id.optimizer.pooled.preferred=org.hibernate.id.enhanced.NoopOptimizer... Maybe that one helps then.

darenegade · 2024-04-02T07:03:49Z

Shut down the pods
In the DB:

ALTER SEQUENCE domain_event_entry_seq
INCREMENT BY 1;

Add the config:
SPRING_JPA_PROPERTIES_HIBERNATE_ID_OPTIMIZER_POOLED_PREFERRED = none
Start the pods

markostreich · 2024-04-02T07:19:00Z

@zambrovski @darenegade do domain_event_entry_seq and global_index also need to be aligned?
They have currently drifted apart by 878 values.

darenegade · 2024-04-02T07:45:18Z

Let's discuss that briefly via direct message. If you looked at Demo, you must not trust the values there, since I am currently running a few tests for the hotfix.

darenegade · 2024-04-02T07:56:32Z

The change has been rolled out and tested on Demo. In principle the config works and tasks are synchronized. The old cases can of course not be "healed" this way, since their order is not corrected. They can be fixed manually, though, and then everything is consistent again.

Here is a query to find the affected cases:

-- Count TaskCreated events that are earlier by timestamp but have a higher
-- global_index than the Completed/Deleted event of the same aggregate.
SELECT count(*) FROM domain_event_entry t1
INNER JOIN domain_event_entry t2
ON t1.aggregate_identifier = t2.aggregate_identifier
WHERE t1.TIME_STAMP < t2.TIME_STAMP
AND t1.GLOBAL_INDEX > t2.GLOBAL_INDEX
AND t1.payload_type='io.holunda.camunda.taskpool.api.task.TaskCreatedEngineEvent'
AND t2.payload_type IN (
    'io.holunda.camunda.taskpool.api.task.TaskCompletedEngineEvent',
    'io.holunda.camunda.taskpool.api.task.TaskDeletedEngineEvent'
    );

Number of cases:

Prod: 1272
Proc-Test: 949
Proc-Demo: 242
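To try the detection logic without touching a real event store, the same join can be run against an in-memory SQLite table from Python. This is only a sketch: the table is reduced to the four columns the query uses, the sample rows are made up, and `count_affected` is a hypothetical helper name.

```python
import sqlite3

CREATED = "io.holunda.camunda.taskpool.api.task.TaskCreatedEngineEvent"
COMPLETED = "io.holunda.camunda.taskpool.api.task.TaskCompletedEngineEvent"
DELETED = "io.holunda.camunda.taskpool.api.task.TaskDeletedEngineEvent"

QUERY = """
SELECT count(*)
FROM domain_event_entry t1
INNER JOIN domain_event_entry t2
  ON t1.aggregate_identifier = t2.aggregate_identifier
WHERE t1.time_stamp < t2.time_stamp
  AND t1.global_index > t2.global_index
  AND t1.payload_type = ?
  AND t2.payload_type IN (?, ?)
"""


def count_affected(rows):
    """Load rows into an in-memory table and count the out-of-order cases."""
    conn = sqlite3.connect(":memory:")
    try:
        # Assumption: only the columns needed by the query are modelled here.
        conn.execute(
            "CREATE TABLE domain_event_entry ("
            "global_index INTEGER, aggregate_identifier TEXT, "
            "time_stamp TEXT, payload_type TEXT)"
        )
        conn.executemany(
            "INSERT INTO domain_event_entry VALUES (?, ?, ?, ?)", rows
        )
        (affected,) = conn.execute(QUERY, (CREATED, COMPLETED, DELETED)).fetchone()
        return affected
    finally:
        conn.close()


# Made-up sample data: task-a is out of order, task-b is consistent.
sample_rows = [
    (51, "task-a", "2024-04-02T10:00:00Z", CREATED),
    (2, "task-a", "2024-04-02T10:00:05Z", COMPLETED),
    (3, "task-b", "2024-04-02T10:01:00Z", CREATED),
    (4, "task-b", "2024-04-02T10:02:00Z", COMPLETED),
]

affected_cases = count_affected(sample_rows)
```

Only task-a is counted, because its created event is earlier in time but has the higher global_index.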

darenegade · 2024-04-02T10:12:48Z

Fix rolled out on Processes-*. Now waiting for feedback on whether this resolves the problem...

darenegade · 2024-04-02T10:46:44Z

Apart from that, the solution is still somewhat fragile with regard to a fresh installation or similar. In that case INCREMENT BY is set to the default of 50 again. We still need to adjust that.

See: #1097

Update: adjusted here: #1527

darenegade · 2024-04-02T10:47:46Z

"In case this does not work after all, I have also read that there is hibernate.id.optimizer.pooled.preferred=org.hibernate.id.enhanced.NoopOptimizer... Maybe that one helps then."

none should actually select the NoopOptimizer, which would make this config equivalent
https://docs.jboss.org/hibernate/orm/5.5/javadocs/org/hibernate/id/enhanced/StandardOptimizerDescriptor.html#NONE

darenegade · 2024-04-02T13:51:14Z

@simonhir We also urgently need to re-enable the monitoring. Currently the engine data cannot be seen on any environment.

simonhir · 2024-04-10T11:19:37Z

A difference I noticed on dev compared to processestraining:

dev: same user for both schemas
processestraining: a different user per schema

With REQ0682591 I requested that dev be changed to the same state to rule this out. I think this could well be the cause, though.

simonhir · 2024-04-16T12:12:30Z

Splitting into two separate PostgreSQL users for engine and tasklist fixed the problem, as suspected above.

simonhir · 2024-04-19T13:40:41Z

After further errors, note for the future: have the schema recreated instead of transferring it to a new user

Errors fixed afterwards

simonhir added the bug and digiwf-task labels Jan 29, 2024

simonhir self-assigned this Jan 29, 2024

simonhir assigned zambrovski Jan 30, 2024

simonhir unassigned zambrovski and simonhir Feb 6, 2024

zambrovski self-assigned this Feb 8, 2024

StephanStrehlerCGI added a commit that referenced this issue Feb 12, 2024

#1251: move event EventMessageCountingMonitor setup to constructor. add /actuator/prometheus to permitted paths

961f37f

darenegade added the internal label Feb 21, 2024

simonhir mentioned this issue Feb 29, 2024

Camunda Process Monitoring #1229

Closed

2 tasks

darenegade closed this as completed Mar 21, 2024

darenegade reopened this Mar 22, 2024

darenegade assigned simonhir and unassigned zambrovski Mar 22, 2024

darenegade mentioned this issue Apr 2, 2024

Deactivate Hibernate ID-Pool #1527

Merged

9 tasks

darenegade mentioned this issue Apr 5, 2024

Release 2024 April #1544

Closed

7 tasks

simonhir closed this as completed Apr 16, 2024

simonhir reopened this Apr 17, 2024

simonhir closed this as completed Apr 19, 2024
