[Bug][Jira] issues disappearing from dataset, progressive load issue? #7826
Comments
Hello, @wouldd! I'm a bot that can help you solve bugs, answer questions, and become a contributor. I'm here to assist you while you wait for a human maintainer.

The disappearance of Jira issues from the dataset in DevLake, especially for projects with a large number of issues, could be due to the way issue IDs are composed and collected. To resolve this, ensure that each issue's `id` follows the expected format. Here is the relevant part of the schema for reference:

| Column | Type | Length | Description | Key |
| ------ | ---- | ------ | ----------- | --- |
| `id` | varchar | 255 | An issue's `id` is composed of `<plugin>:<Entity>:<PK0>[:PK1]...`. For GitHub issues, the id looks like `github:GithubIssues:<GithubIssueId>`, e.g. `github:GithubIssues:1049355647`. For Jira issues, the id looks like `jira:JiraIssues:<JiraSourceId>:<JiraIssueId>`, e.g. `jira:JiraIssues:1:10063`. `<JiraSourceId>` identifies which Jira source the issue came from, since DevLake users can import data from several different Jira instances at the same time. | PK |

Furthermore, if you are using webhooks, ensure that each project has its own webhook. DevLake determines which project an issue or deployment belongs to by segregating between webhooks. If a webhook is shared by multiple projects, all issues or deployments published through it will be replicated across those projects, leading to potential data inconsistencies, so each project should have its own dedicated webhook [3][4][5].

For further troubleshooting, you can refer to the DORA Validation Dashboard in DevLake, which provides a step-by-step guide to check which step went wrong for each of the four DORA metrics. This can help you identify whether the issue lies in the data mapping or in the querying process. Each chart in the dashboard has a hidden button in the top-right corner that opens the context menu, from which you can inspect the query behind the panel.

If none of these solutions resolve your problem, consider creating an issue on the DevLake GitHub repository for further assistance.
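For illustration, the id composition above amounts to something like this minimal Go sketch (the helper name is an assumption for illustration, not DevLake's actual code):

```go
package example

import "fmt"

// jiraIssueDomainId sketches how the domain-layer primary key described above
// is composed: "jira:JiraIssues:<JiraSourceId>:<JiraIssueId>".
// e.g. jiraIssueDomainId(1, 10063) == "jira:JiraIssues:1:10063"
func jiraIssueDomainId(sourceId, issueId uint64) string {
	return fmt.Sprintf("jira:JiraIssues:%d:%d", sourceId, issueId)
}
```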
@klesh Any update on this? Is there anything I can do to get more information about what is happening, to help you run it down? This is happening regularly for 2 specific projects; each time I can force a full refresh and the data returns, but one refresh later much of the data for those projects 'disappears'.
The issue you're experiencing with Jira issues disappearing is likely related to the progressive load mechanism in DevLake. This mechanism first deletes existing data and then inserts new data; if the deletion step completes but the insertion step fails, it can result in data loss [1]. To gather more information and troubleshoot further, check the pipeline logs for failed subtasks and compare the collected raw data with what ends up in the domain-layer tables. By following these steps, you should be able to identify the root cause of the issue.
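Conceptually, the progressive-load pattern described above reduces to the following simplified sketch (illustrative names, not DevLake's actual implementation):

```go
package example

import "gorm.io/gorm"

// Issue is a minimal stand-in for DevLake's domain-layer issue record.
type Issue struct {
	ID      string `gorm:"primaryKey"`
	BoardID uint64
	Title   string
}

// progressiveLoad illustrates the delete-then-insert pattern. If the insert
// fails after the delete has already run, the board's issues are gone until
// the next successful pipeline run.
func progressiveLoad(db *gorm.DB, boardID uint64, fresh []Issue) error {
	// Step 1: wipe everything previously stored for this board.
	if err := db.Where("board_id = ?", boardID).Delete(&Issue{}).Error; err != nil {
		return err
	}
	// Step 2: re-insert the newly extracted records. A failure here
	// (rate limit, deadlock, crash) leaves the gap being described.
	return db.Create(&fresh).Error
}
```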
@wouldd Sorry, no update. I don't really understand what happened, nor do I have any theories at this point; the description is too broad and vague, and there is too little information to work with. I need at least one of a few conditions to be fulfilled before I can do anything, for example a reliable way to reproduce the problem or detailed pipeline logs from a run where data went missing.
@klesh I'd love to be able to provide more info. Is there any way to explicitly capture logs for a particular pipeline? I had a theory that the default of 'skipping failed steps' might be causing issues; the projects that seem to suffer have quite a lot of git repos configured for collection after the Jira step. But I also need to understand more about the sequence of events in the process to guess at what's going wrong. Is the Jira task entirely self-contained, or does something wait until all the other steps in the pipeline run?
You can download the pipeline log via the download button on the pipeline's page in the config UI.

To address the issue, I suggest we start by identifying which issues are missing and looking for any patterns. One approach could be to back up the database before each pipeline execution and compare it with the version where issues are missing. This could help us pinpoint where the discrepancies occur.
@klesh Do those logs survive a pod restart? When I try downloading from a previously run pipeline I get a 0-length file.
A little more on this. Exploring the db, I see that when I query the raw Jira tables and compare them with the issues that were converted, the counts roughly line up. This suggests to me that the collector is working (though I'm not sure why there are 11 unaccounted-for raw data rows).
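The kind of cross-check I mean is roughly the following sketch (table names follow DevLake's conventions, but treat them and the helper as assumptions to verify against your own schema):

```go
package example

import (
	"database/sql"
	"fmt"

	_ "github.com/go-sql-driver/mysql"
)

// countCollectedVsConverted compares how many raw Jira issue payloads were
// collected against how many rows made it into the domain-layer issues table.
func countCollectedVsConverted(dsn string) error {
	db, err := sql.Open("mysql", dsn)
	if err != nil {
		return err
	}
	defer db.Close()

	var rawCount, issueCount int
	// _raw_jira_api_issues holds the collected API payloads; issues is the
	// domain-layer table. Verify both names against your own schema.
	if err := db.QueryRow("SELECT COUNT(*) FROM _raw_jira_api_issues").Scan(&rawCount); err != nil {
		return err
	}
	if err := db.QueryRow("SELECT COUNT(*) FROM issues WHERE id LIKE 'jira:JiraIssues:%'").Scan(&issueCount); err != nil {
		return err
	}
	fmt.Printf("raw payloads: %d, converted issues: %d\n", rawCount, issueCount)
	return nil
}
```

Running something like this before and after a pipeline makes it easier to see whether rows vanished in the convert step or never arrived from the collector.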
@wouldd The logs should be available after a restart if the logging directory is mapped to persistent storage in your deployment configuration.

Did all 169 issues come from the same Jira board? How many rows are there in the corresponding raw table?
@klesh I'm deploying into Kubernetes using your Helm chart, so I assume those settings would be correct there? I'll double check, but if you're not making a persistent volume claim then they probably disappear on node restart.
@klesh Somewhat related to this, I'm wondering if something can be done to avoid dropping all the previous data at the start of the refresh process? Even if it works, I wind up with a potentially long period where a whole project just disappears from the graphs whilst that project's refresh is running. It seems like it would be better to only replace rows that get updated, rather than block-delete everything and then refill. Even a working system creates quite long periods where you cannot trust that the graphs are showing an accurate picture.
@wouldd, I'm not particularly familiar with the Helm chart either, and you're right about the persistent volume claim (AFAIK). By the way, do these issues belong to multiple boards?
@klesh Good question. I'm almost certain there will be multiple boards in existence that reference the same tickets, but I'm not sure whether they all belong to multiple boards that are being processed by DevLake. I know some teams do have boards that pull from multiple projects, so it's a distinct possibility. Would that cause problems?
@wouldd It could be: extractors and converters wipe out all records of the board before populating the target table.
@klesh To be clear, nothing is disappearing on the Jira side; the board filter has not changed, and if the board filter did change, the refresh refuses to run without a full refresh anyway. I'm still quite concerned about any logic that wipes out data before loading new data. This is happening on a refresh right now for one of our bigger, more important projects: there is only one Jira board associated, it is a simple query, and there are >60k issues all time in this project. As soon as a refresh starts, this project completely disappears from the graphs and remains gone for quite a long time whilst the refresh is running. Given this could happen at any point in the day, it causes concern with the business. Is it not possible to avoid this protracted period of having no data?
@wouldd I completely understand your concern because I share it. Would it be possible to set up a new instance with a fresh database, focusing solely on the important board? This way, we can test whether the problem persists in that isolated environment.
@klesh We do have a test environment for DevLake, but we're already hitting Jira API rate limits and I'm not sure about duplicating a large project; what would this achieve that we can't do in our main system? I'm still not able to get useful logs out, which I think is because the Helm chart does not define any persistent volume claims, so those logs do not last through a restart. That's going to be the same in our dev environment.
@klesh So I think I may have finally tripped this scenario whilst slowly controlling the refreshes but also putting the system under some load. In this case the Jira refresh starts, the database wipes the info, but then the collector fails because Jira throws a 429 (too many requests). I'll paste the full error at the end.
I normally use docker-compose for development, but our SaaS service does use the Helm chart, and we have a centralized logging system so we don't need the persistent volume. I don't think the 429 error is the cause of the missing-data problem, because incremental mode keeps all the previously collected data and the subsequent subtasks like extractors and converters should be fine. Honestly, I don't have enough material to investigate, so I don't have any clue how to proceed next.
@klesh Sadly the cloud offering won't be of much use to us, since you don't officially support on-prem Azure DevOps anyway. Here is an excerpt from the task log:

time="2024-09-06 11:45:27" level=info msg=" [pipeline service] [pipeline #3736] [task #88338] executing subtask convertIssues"

It's not clear to me how it got 12768 issues from the board filter but converted 1? Perhaps I'm misunderstanding what these logs show. I can say that I am trying at the moment with a new board filter which excludes everything outside the last 365 days, since this particular project actually has >65k issues all time, but I really only care about the last year at most.

task-88338-2-1-jira.log
Interesting, that was indeed very odd... |
How about your database setup? Is it an external database server? |
@klesh Yes, it's MySQL (8.0) in Amazon RDS.
Weird, RDS should be fine. |
@klesh My version is approximately your v1 branch, with some customisation of the Azure DevOps Go plugin to support our internal server setup.
@wouldd I'm not entirely sure at this point. It might help if we cross-reference the code changes with the time the issue first appeared. Do you recall when this problem began? |
@klesh The problem is not happening consistently (which is part of the problem), so I don't think it's something obviously coinciding with code changes. Rather, I suspect a subtle timing condition based on the Jira project itself and how things happen to run in the code. In general, my observation is that the structure of these raw data tables forces a situation where there is no unique identifier for a given issue payload. Maybe I'm misreading things, but it would seem there would be no need to purge this table ahead of a full refresh if the id were based on the unique Jira issue id; it could just do a createOrUpdate, which would mean you'd never have weird gaps when the data is dropped, roughly as sketched below. I will say that, having instrumented the code and switched on debug logging, I have not caught a failure scenario, which could be bad luck, or it could be that the act of logging more has shifted the timing enough to make it less of a problem.
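To make the idea concrete, what I have in mind is roughly this (a sketch only; the struct and field names are assumptions, not the current schema):

```go
package example

import (
	"gorm.io/gorm"
	"gorm.io/gorm/clause"
)

// RawJiraIssue sketches a raw row keyed by the Jira issue id instead of an
// auto-generated primary key. Field names are illustrative assumptions.
type RawJiraIssue struct {
	ConnectionID uint64 `gorm:"primaryKey"`
	IssueID      uint64 `gorm:"primaryKey"`
	Payload      []byte
}

// upsertRawIssues writes collected payloads with create-or-update semantics,
// so a refresh never needs to purge the table first and a failed run never
// leaves a gap.
func upsertRawIssues(db *gorm.DB, rows []RawJiraIssue) error {
	return db.Clauses(clause.OnConflict{UpdateAll: true}).Create(&rows).Error
}
```

With a deterministic key like this, a full refresh becomes idempotent: re-running it overwrites rows in place instead of leaving a window where the table is empty.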
Wow, this might be the key issue! It looks like the BatchSaveDivider could be accessed by multiple threads, and without any locking mechanism in place, it's highly likely that this is causing the problem. Great catch, well done! Would you be able to implement a locking mechanism and verify if this resolves the issue?
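Something along these lines is what I mean, purely as a sketch of the locking idea rather than the actual BatchSaveDivider code:

```go
package example

import "sync"

// batchSaveDivider sketches guarding the divider's shared map with a mutex so
// concurrent subtasks cannot race on it. It mirrors the shape of DevLake's
// BatchSaveDivider but is not the actual implementation.
type batchSaveDivider struct {
	mu      sync.Mutex
	batches map[string]*batchSave
}

type batchSave struct {
	rows []interface{}
}

// ForType returns the batch for a table, creating it under the lock so two
// goroutines never create (and then overwrite) the same entry concurrently.
func (d *batchSaveDivider) ForType(table string) *batchSave {
	d.mu.Lock()
	defer d.mu.Unlock()
	if d.batches == nil {
		d.batches = make(map[string]*batchSave)
	}
	b, ok := d.batches[table]
	if !ok {
		b = &batchSave{}
		d.batches[table] = b
	}
	return b
}
```

The important part is that lookup and creation happen under the same lock, so two subtasks can't race on the map.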
FWIW, I have implemented a fix which I'm testing this week. I'm actually on holiday this week, but I'm leaving things running with my fork and I'll check next week to see if it survived without losing any data. We shall see.
Looking forward to it. |
@wouldd Is there any progress? |
@d4x1 Yes, my fork has been running for a couple of weeks now without dropping any issues, so I'm happy that I've fixed the problem we were having. However, to do so I made some changes to the core logic that require plugin changes, and I have obviously only updated the two plugins that I am using. I also need to merge some of the latest changes into my fork just to be properly up to date, but I didn't want to rock the boat on my side before fixing the core issue.
@wouldd I am curious about what's wrong with the current code; could you give us some hints? :) We can evaluate the priority of this bug. If it is an emergency, we should fix it ASAP. If its impact is limited, feel free to submit a PR to fix it.
@d4x1 So I alluded to the observation in an earlier comment: the current implementation is designed in such a way that it must delete all the contents from the raw tables before populating them again, because it just uses randomly generated primary keys. So if anything goes wrong during the process then you can wind up without data. I'm not 100% certain, but I think there are cases where an SQL deadlock error in a batch save can cause a failure that gets swallowed.
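On the swallowed-deadlock point, the retry I'm describing is conceptually just the following; it's only a sketch, and matching on the MySQL deadlock message is a simplification:

```go
package example

import (
	"strings"
	"time"
)

// retryOnDeadlock re-runs a batch flush a few times when MySQL reports a
// deadlock (error 1213, "Deadlock found when trying to get lock"), instead of
// letting the error get lost.
func retryOnDeadlock(flush func() error, attempts int) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = flush(); err == nil {
			return nil
		}
		if !strings.Contains(err.Error(), "Deadlock found") {
			return err // not a deadlock, surface it immediately
		}
		// back off a little longer on each attempt before retrying
		time.Sleep(time.Duration(i+1) * 200 * time.Millisecond)
	}
	return err
}
```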
@wouldd Thanks for your reply. You've made two significant improvements: a unique id in the raw table and a deadlock retry. Can you disable one of these two improvements and see what happens? That would tell us which part is doing the work. As to Grafana, feel free to upgrade it. (We also found some vulnerabilities and are waiting for the Grafana team to fix them.)
@d4x1 Hi, I'm afraid this is a pretty busy time of year for me at work and I don't have a setup that would reasonably let me test these independently. The changes work for me and my users are happy, so I'm not going to risk breaking that again. Once the busy time has passed, I may be able to at least bring my fork in line with latest to make it easier to assess my changes as a potential feature.
@wouldd Take your time; we're not in a hurry.
This issue has been automatically marked as stale because it has been inactive for 60 days. It will be closed in the next 7 days if no further activity occurs.
Search before asking
What happened
We have noticed in some of our graphs that sometimes we see a complete picture, and other times we're only seeing some fraction of the relevant Jira issues.
In both cases this seems to impact Jira projects that I know have quite a lot of issues, in one case >10k in the last year or so.
When this was first brought up, I triggered a full refresh, which did not seem to fix it, but then seemingly after another couple of 'normal' refreshes the issues did re-populate the db.
However, now a couple of days later they've gone again. It seems mostly that I'm left with small numbers of issues from back at the beginning of 2023, which makes me think maybe something is emptying out the previous data, kicking off a process to work through the re-import, but then failing for some unknown reason.
What do you expect to happen
I expected the Jira issues to consistently be present; they're still in Jira under the board filter etc., so there is no obvious reason that they should disappear from DevLake.
How to reproduce
Hard to say at this time. I'd suggest a long-lived Jira project with thousands of issues spread over several months is a good place to start. I guess run repeated refresh cycles and see if the data population in DevLake changes between them in ways it should not.
Anything else
So far I've seen this happen for 2 specific projects (out of about 25 that we sync), though it's possible that the same issue happens elsewhere and is just less obvious in a graph.
Not sure if it matters, but some of our pipelines are also syncing quite a few Azure DevOps repos, so the project pipeline itself can take >6 hours to run.
I tried looking in the container logs but did not spot anything that looked like an error.
I'm happy to try running with additional debug or whatever else would help in understanding this issue better.
Obviously it rather undermines the business's faith in the graphs to have them suddenly underreporting by hundreds of missing Jira issues.
Version
v1-custom
Are you willing to submit PR?
Code of Conduct