Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Step Loss due to Printcounter (SKR1.3 & Possibly Similar Boards) #20785

Closed
ConstantijnCrijnen opened this issue Jan 16, 2021 · 16 comments
Closed

Comments

@ConstantijnCrijnen
Copy link
Contributor

Bug Description

As already stated in my comment on issue 20645, I'm having a problem wich I've been trying to hunt down for over a year. I always thought that it was a mechanical issue, but after a year of tinkering and observing I'm now pretty sure that it's a software problem.
Issue 20645 states a knocking noise which seems to be coming from the use of SoftwareSerial when MONITOR_DRIVER_STATUS is enabled. I thought that my issue was related, but after doing some more testing I'm now confident that it's not.

There are two different issues, which are so similar that I think there might be a common issue between both of them.

1. [MAIN ISSUE] Step loss during SD Print
The drivers will randomly give a loud knock and and skip a couple of steps (around 0.5 to 2mm with GT2 20T pulleys and 1.8deg motors). It doesn't happen often and its occurrence seems to correlate with the feedrate. At 30mm/s max. speed the issue will occur around once per week with 24/7 printing. At 60mm/s about every two days, at 100mm/s 1-2 times per day and at 150mm/s every 1-2 hours.
The step loss always seems to occur in both X and Y simultaneously and without any kind of unexpected mechanical load such as collisions with boogers or warped prints.

2. SD Abort stops print but then continues for a bit more
Here's a video of what happens when I stop the SD print via LCD screen. The print stops immediately, then continues for a short moment and then proceeds with EVENT_GCODE_SD_ABORT as expected. During the short move after the initial stop, the axes sometimes seem to move in a different direction as they otherwise should have if the print wasn't stopped.
As @thinkyhead commented on issue 5982, this shouldn't be happening since the command queue should have been cleared and blocked.

Configuration Files

Configuration.zip
In trinamic.cpp under #if HAS_DRIVER(TMC2209) I changed the following blank time from 1 to 3 since I'm running quite high-inductance motors (6mH per phase) chopconf.tbl = 0b11; // blank_time = 40.

The version I'm using is the bugfix from around a week ago. Unfortunately I can't just upload and experiment with new bugfixes because the printers are running at my university and it's nearing the end of the semester so the printers have to printing non-stop.

Steps to Reproduce

I'm not entirely sure what it takes to reproduce. As I said, it seems to correlate with feedrate. My current solution is throttling the FR to max. 30mm/s. That way there is a step loss of around once per week. If I increase the FR, the time till failure (step loss) decreases.

Testing

As stated before, I did some testing to see if the issue is related to MONITOR_DRIVER_STATUS. I printed solid 100mm diameter cylinders without perimeters and 190mm/s print speed. The print was aborted as soon as steps were lost, after which the TTF (Time Till Failure) was noted and a picture of the top surface was taken.
As a control I tested with MONITOR_DRIVER_STATUS enabled and SQUARE_WAVE_STEPPING disabled.
To test issue 20645, I first disabled MONITOR_DRIVER_STATUS and then enabled SQUARE_WAVE_STEPPING (see pictures).

IMG_0924
All tests combined. Green arrow: step loss, orange arrow: initiation of SD abort, red arrow: initiation of EVENT_GCODE_SD_ABORT.

IMG_0925
Close-up. Green arrow: step loss, orange arrow: initiation of SD abort, red arrow: initiation of EVENT_GCODE_SD_ABORT.

Discussion

At least with this small sample size, I didn't see any improvement by disabling MONITOR_DRIVER_STATUS and enabling SQUARE_WAVE_STEPPING. In fact, the TTF was worse with those prints. The data is of course not significant enough to make any quantitative statements, but qualitatively, the step loss-issue did persist.
I didn't capture the step loss-issue on video, but it sounds identical to the video I shared above when aborting via LCD. The movement pattern of both the step loss and SD abort also looks very similar. This leads me to believe that there might be an issue where an error state is triggered which calls one of the functions from the SD abort routine. That's a wild guess of course.

Other than that, I'm a bit out of ideas unfortunately. I do want to optimise the chopper parameters according to Trinamic's datasheet in the future, but I'll need to get a current probe for that. My current chopper settings are all theoretical and based on Trinamic's formulas. Maybe those are just wrong. But I can confirm that the stock chopper settings gave the same issue and made my specific motors sound worse.

Additional Information

M122 after losing steps and SD abort:

22:34:37.763 > X Y Z Z2 E
22:34:37.763 > Address 0 0 0 0 0
22:34:37.763 > Enabled false false true true false
22:34:37.826 > Set current 800 1100 800 800 500
22:34:37.826 > RMS current 795 1049 795 795 489
22:34:37.871 > MAX current 1121 1479 1121 1121 689
22:34:37.916 > Run current 25/31 18/31 25/31 25/31 15/31
22:34:37.916 > Hold current 12/31 9/31 12/31 12/31 7/31
22:34:37.916 > CS actual 12/31 9/31 12/31 12/31 7/31
22:34:37.961 > PWM scale
22:34:37.961 > vsense 1=.18 0=.325 1=.18 1=.18 1=.18
22:34:38.006 > stealthChop false false true true true
22:34:38.050 > msteps 16 16 16 16 16
22:34:38.095 > interp true true true true true
22:34:38.140 > tstep max max max max max
22:34:38.184 > PWM thresh.
22:34:38.229 > [mm/s]
22:34:38.229 > OT prewarn false false false false false
22:34:38.274 > pwm scale sum 14 11 28 30 19
22:34:38.319 > pwm scale auto 0 0 0 0 0
22:34:38.363 > pwm offset auto 36 36 71 76 77
22:34:38.408 > pwm grad auto 14 14 14 14 23
22:34:38.453 > off time 3 3 4 4 4
22:34:38.497 > blank time 54 54 54 54 54
22:34:38.542 > hysteresis
22:34:38.542 > -end 0 0 2 2 2
22:34:38.587 > -start 4 1 1 1 1
22:34:38.631 > Stallguard thrs 0 0 0 0 0
22:34:38.631 > uStep count 392 376 104 232 984
22:34:38.676 > DRVSTATUS X Y Z Z2 E
22:34:38.676 > sg_result 0 2 148 46 78
22:34:38.721 > stst
22:34:38.765 > olb
22:34:38.810 > ola
22:34:38.855 > s2gb
22:34:38.899 > s2ga
22:34:38.944 > otpw
22:34:38.988 > ot
22:34:39.033 > 157C
22:34:39.077 > 150C
22:34:39.122 > 143C
22:34:39.166 > 120C
22:34:39.211 > s2vsa
22:34:39.211 > s2vsb
22:34:39.211 > Driver registers:
22:34:39.211 > X 0x80:0C:00:00
22:34:39.220 > Y 0x80:09:00:00
22:34:39.229 > Z 0xC0:0C:00:00
22:34:39.238 > Z2 0xC0:0C:00:00
22:34:39.247 > E 0xC0:07:00:00
22:34:39.256 >
22:34:39.256 >
22:34:39.256 > Testing X connection... OK
22:34:39.265 > Testing Y connection... OK
22:34:39.274 > Testing Z connection... OK
22:34:39.283 > Testing Z2 connection... OK
22:34:39.292 > Testing E connection... OK
22:34:39.301 > ok P31 B3

@ConstantijnCrijnen ConstantijnCrijnen changed the title [BUG] Step Loss during SD Print and SD Abort (SKR1.4, TMC2209) [BUG] Step Loss during SD Print and SD Abort (SKR1.3, TMC2209) Jan 16, 2021
@daleckystepan
Copy link
Contributor

At the first, thank you for exhaustive information about your issue. I have just went through you config and have some ideas you can try:

  1. Disable S_CURVE_ACCELERATION and EXPERIMENTAL_SCURVE
  2. Check temperature on motors and drivers. You should be able to touch them and hold them without pain. That is below 60 C approximately.
  3. Disable EEPROM_SETTINGS and EEPROM_CHITCHAT . Sometimes a lot of "garbage" is in the EEPROM and you cannot be sure which settings are really in place.
  4. Try CLASSIC_JERK instead of JUNCTION_DEVIATION
  5. SLOWDOWN_DIVISOR 4 with BLOCK_BUFFER_SIZE 32
  6. Check voltage stability of the power supply under the load.

🤞

@sjasonsmith sjasonsmith added Bug: Potential ? Needs: More Data We need more data in order to proceed labels Jan 17, 2021
@ConstantijnCrijnen
Copy link
Contributor Author

Thanks for your suggestions! I'm not sure if my new data set is significant enough, but if it is, then Something seems to have fixed the problem now; let me explain.

Temperatures
First of all, the temperature of the motors is warm but not hot and the drivers have a lot of forced air moving over the heatsinks and at least the heat sinks are cool to the touch. I've been monitoring them with M122 closely over the last year and I've never seen an OT flag come up.

Voltage Stability
Voltage seems stable at 24.1V, but I only measured with a multimeter. Tomorrow I could measure with an oscilloscope but with my recent observations I don't think that that is necessary.

Testing
I tested your suggested settings on the original configuration files from my original post, so with MONITOR_DRIVER_STATUS enabled and SQUARE_WAVE_STEPPING disabled.
One thing I didn't mention with that post was, that I did the testing on two identical machines. The testing of your suggestions was done on the same two machines. This time I used one color per machine, so red for printer 1 and yellow for printer 2.
After each firmware update I did a power cycle and selected Restore Defaults followed by Store Settings via the lcd screen. I've also been doing that with the tests from the first post.

IMG_0204

On printer 1 I first tested disabling S_CURVE_ACCELERATION and EXPERIMENTAL_SCURVE.
In the mean time I set printer 2 up with EEPROM_SETTINGS and EEPROM_CHITCHAT disabled. I also had to disable PRINTCOUNTER.
After two quick successive fails on printer 1 I reverted the firmware and enabled CLASSIC_JERK.
To my surprise, both printers were able to print two full pucks without any issues.
After that, wasting more time and plastic on more pucks felt unnecessary (180g per puck 🤫), so I skipped testing SLOWDOWN_DIVISOR 4. However, a divisor of 4 isn't a bad idea with a buffer size of 32 anyway, is it?

Since I now had four successful pucks from two printers with two different sets of settings I was kind of sceptical, so I did one more control print with my original settings on printer 2, which also succeeded.

With my current understanding, that would make sense if there was indeed some garbage in the eeprom, but the build with the disabled eeprom settings never touched printer 1, yet those pucks also finished without any issues.

While the printers were printing the pucks with CLASSIC_JERK and EEPROM_SETTINGS, EEPROM_CHITCHAT and PRINTCOUNTER disabled, I looked at what M503 had to say. I should have done that earlier, because it seems like by that point I had already done something that fixed the problem. Unfortunately I didn't think of it early enough...

M503 Printer 1 M503 Printer 2
00:07:53.709 > echo: G21 ; Units in mm (mm) 00:10:59.393 > echo: G21 ; Units in mm (mm)
00:07:53.710 > echo: M149 C ; Units in Celsius 00:10:59.394 > echo: M149 C ; Units in Celsius
00:07:53.710 > 00:10:59.394 >
00:07:53.710 > echo:; Filament settings: Disabled 00:10:59.394 > echo:; Filament settings: Disabled
00:07:53.710 > echo: M200 S0 D1.75 00:10:59.394 > echo: M200 S0 D1.75
00:07:53.710 > echo:; Steps per unit: 00:10:59.394 > echo:; Steps per unit:
00:07:53.710 > echo: M92 X80.00 Y80.00 Z400.00 E380.00 00:10:59.394 > echo: M92 X80.00 Y80.00 Z400.00 E380.00
00:07:53.711 > echo:; Maximum feedrates (units/s): 00:10:59.394 > echo:; Maximum feedrates (units/s):
00:07:53.711 > echo: M203 X300.00 Y300.00 Z5.00 E30.00 00:10:59.394 > echo: M203 X300.00 Y300.00 Z5.00 E30.00
00:07:53.712 > echo:; Maximum Acceleration (units/s2): 00:10:59.394 > echo:; Maximum Acceleration (units/s2):
00:07:53.712 > echo: M201 X800.00 Y800.00 Z300.00 E1000.00 00:10:59.394 > echo: M201 X800.00 Y800.00 Z300.00 E1000.00
00:07:53.713 > echo:; Acceleration (units/s2): P<print_accel> R<retract_accel> T<travel_accel> 00:10:59.395 > echo:; Acceleration (units/s2): P<print_accel> R<retract_accel> T<travel_accel>
00:07:53.713 > echo: M204 P700.00 R800.00 T700.00 00:10:59.395 > echo: M204 P700.00 R800.00 T700.00
00:07:53.713 > echo:; Advanced: B<min_segment_time_us> S<min_feedrate> T<min_travel_feedrate> X<max_x_jerk> Y<max_y_jerk> Z<max_z_jerk> E<max_e_jerk> 00:10:59.395 > echo:; Advanced: B<min_segment_time_us> S<min_feedrate> T<min_travel_feedrate> J<junc_dev>
00:07:53.714 > echo: M205 B20000.00 S0.00 T0.00 X10.00 Y10.00 Z0.40 E7.00 00:10:59.396 > echo: M205 B20000.00 S0.00 T0.00 J0.02
00:07:53.715 > echo:; Home offset: 00:10:59.396 > echo:; Home offset:
00:07:53.715 > echo: M206 X0.00 Y0.00 Z0.00 00:10:59.396 > echo: M206 X0.00 Y0.00 Z0.00
00:07:53.722 > echo:; Auto Bed Leveling: 00:10:59.396 > echo:; Auto Bed Leveling:
00:07:53.722 > echo: M420 S1 Z10.00 00:10:59.396 > echo: M420 S1 Z10.00
00:07:53.722 > echo: G29 W I0 J0 Z-0.16750 00:10:59.396 > echo: G29 W I0 J0 Z0.03500
00:07:53.722 > echo: G29 W I1 J0 Z-0.13000 00:10:59.397 > echo: G29 W I1 J0 Z-0.14250
... ...
00:07:53.725 > echo:; Material heatup parameters: 00:10:59.404 > echo:; Material heatup parameters:
00:07:53.725 > echo: M145 S0 H210 B60 F0 00:10:59.404 > echo: M145 S0 H210 B60 F0
00:07:53.725 > echo: M145 S1 H240 B70 F0 00:10:59.404 > echo: M145 S1 H240 B70 F0
00:07:53.725 > echo:; PID settings: 00:10:59.404 > echo:; PID settings:
00:07:53.725 > echo: M301 P30.65 I3.41 D68.85 00:10:59.404 > echo: M301 P30.65 I3.41 D68.85
00:07:53.726 > echo: M304 P23.30 I2.94 D123.13 00:10:59.404 > echo: M304 P23.30 I2.94 D123.13
00:07:53.726 > echo:; Power-Loss Recovery: 00:10:59.404 > echo:; Power-Loss Recovery:
00:07:53.727 > echo: M413 S1 00:10:59.404 > echo: M413 S1
00:07:53.727 > echo:; Z-Probe Offset (mm): 00:10:59.404 > echo:; Z-Probe Offset (mm):
00:07:53.727 > echo: M851 X38.00 Y18.00 Z-2.43 00:10:59.404 > echo: M851 X38.00 Y18.00 Z-2.15
00:07:53.727 > echo:; Stepper driver current: 00:10:59.404 > echo:; Stepper driver current:
00:07:53.727 > echo: M906 X800 Y1100 Z800 00:10:59.405 > echo: M906 X800 Y1100 Z800
00:07:53.728 > echo: M906 I1 Z800 00:10:59.405 > echo: M906 I1 Z800
00:07:53.728 > echo: M906 T0 E500 00:10:59.405 > echo: M906 T0 E500
00:07:53.728 > 00:10:59.405 >
00:07:53.728 > echo:; Driver stepping mode: 00:10:59.405 > echo:; Driver stepping mode:
00:07:53.728 > echo: M569 S1 Z 00:10:59.405 > echo: M569 S1 Z
00:07:53.728 > echo: M569 S1 I1 Z 00:10:59.405 > echo: M569 S1 I1 Z
00:07:53.728 > echo: M569 S1 T0 E 00:10:59.405 > echo: M569 S1 T0 E
00:07:53.728 > echo:; Linear Advance: 00:10:59.405 > echo:; Linear Advance:
00:07:53.728 > echo: M900 K0.09 00:10:59.405 > echo: M900 K0.09
00:07:53.728 > echo:; Filament load/unload lengths: 00:10:59.405 > echo:; Filament load/unload lengths:
00:07:53.729 > echo: M603 L45.00 U60.00 00:10:59.405 > echo: M603 L45.00 U60.00
00:07:53.729 > echo:; Filament runout sensor: 00:10:59.405 > echo:; Filament runout sensor:
00:07:53.729 > echo: M412 S1 00:10:59.405 > echo: M412 S1
00:07:53.729 > ok P0 B0 00:10:59.405 > ok P0 B0

So at this point I'd say that there was indeed something with the eeprom settings. But I'm still confused why printer 1 suddenly started printing well. As I said, I had always selected Restore Defaults via the lcd screen and this time was no different. My understanding is, that clicking on Restore Defaults followed by Store Settingsis the same as sending M502 and M500, isn't it? And what could have been in EEPROM that wouldn't have been overwritten by M502 and M500?

Anyway, I'm really grateful for your help. I will do some more "real-world" prints to see if it is really fixed, but at this point it seems that it is.

@ConstantijnCrijnen
Copy link
Contributor Author

On printer 1 I first tested disabling S_CURVE_ACCELERATION and EXPERIMENTAL_SCURVE.
In the mean time I set printer 2 up with EEPROM_SETTINGS and EEPROM_CHITCHAT disabled. I also had to disable PRINTCOUNTER.
After two quick successive fails on printer 1 I reverted the firmware and enabled CLASSIC_JERK.
To my surprise, both printers were able to print two full pucks without any issues.

I just went through the code again to make 100% sure that I reverted all the settings correctly and realised that I had forgotten to turn PRINTCOUNTER back on, which I had turned off when I disabled EEPROM_SETTINGS and EEPROM_CHITCHAT. That means that for all the successive builds and prints, it was turned off.

I now enabled PRINTCOUNTER again started the same test print with printer 2. Printer 1 is currently on a job but once it's available again, I'll test that one as well.

@ConstantijnCrijnen
Copy link
Contributor Author

It seems like there is something wrong with PRINTCOUNTER as all prints failed. Can anyone confirm?

IMG_0207

@sjasonsmith
Copy link
Contributor

Do they fail at the one hour mark?

@ConstantijnCrijnen
Copy link
Contributor Author

Do they fail at the one hour mark?

Not really...
Here's a sorted list of all the TTF:
0:10
0:46
0:48
0:50
0:56
1:00
1:11
1:42
1:54
1:56
1:56
2:58

If I squint really hard, they do seem to accumulate a bit around the full hours though.

@sjasonsmith
Copy link
Contributor

I think the default interval for PRINTCOUNTER to store stats is one hour.

On boards that store “EEPROM” contents in flash it has to write the entire EEPROM contents. It can’t update only the bytes needed for print stats. My guess is that is interfering with the print.

One detail that might be throwing off your testing is that the flash sector is divided into 8 “eeprom slots” on this board. After storing to eeprom 8 times, the ninth has to perform a full erase of the sector, which will take much longer that a normal store.

@sjasonsmith sjasonsmith removed the Needs: More Data We need more data in order to proceed label Jan 18, 2021
@ConstantijnCrijnen
Copy link
Contributor Author

Yes that seems to be the case:

if (ELAPSED(now, eeprom_next)) {
    eeprom_next = now + saveInterval * 1000;
    saveStats();
}

with

/**
* @brief Interval in seconds between EEPROM saves
* @details This const value defines what will be the time between each
* EEPROM save cycle, the development team recommends to set this value
* no lower than 3600 secs (1 hour).
*/
static constexpr uint16_t saveInterval = 3600;

I just tested using saveInterval = 60 and I could hear an audible "clunk" exactly every minute. So within 10 minutes I could hear 10 times a "clunk". 2 of those times caused a minor step loss of up to a mm and 6 times it caused a major step loss of up to a cm.
IMG_0208

So knowing that that is an issue with at least the SKR1.3 boards, could we just dismiss this query for all affected boards

if (ELAPSED(now, eeprom_next)) {
    eeprom_next = now + saveInterval * 1000;
    saveStats();
}

and just rely on

bool PrintCounter::stop() {
  TERN_(DEBUG_PRINTCOUNTER, debug(PSTR("stop")));

  if (super::stop()) {
    data.finishedPrints++;
    data.printTime += deltaDuration();

    if (duration() > data.longestPrint)
      data.longestPrint = duration();

    saveStats();
    return true;
  }
  else return false;
}

to do the job? Or is there a specific reason why you'd want to save the stats to EEPROM on a regular basis?

@sjasonsmith
Copy link
Contributor

I like the idea to still allow print stats but only at the end of each print. That seems like a good compromise.

I think we would want to apply that to any board using FLASH_EEPROM_EMULATION.

@ConstantijnCrijnen
Copy link
Contributor Author

VSCode is telling me that FLASH_EEPROM_EMULATION isn't enabled in my build. Are you sure about that?

@sjasonsmith
Copy link
Contributor

Actually you are right. The E3 Mini V2 and probably MX have actual eeprom chips. Older do not. Maybe the problem is wider spread than I thought.

@ConstantijnCrijnen
Copy link
Contributor Author

Unfortunately I know too little about microcontrollers and marlin for that matter to know what exactly is going on 😕

@ConstantijnCrijnen ConstantijnCrijnen changed the title [BUG] Step Loss during SD Print and SD Abort (SKR1.3, TMC2209) [BUG] Step Loss due to Printcounter (SKR1.3 & Possibly Similar Boards) Jan 19, 2021
ConstantijnCrijnen added a commit to ConstantijnCrijnen/Marlin that referenced this issue Jan 21, 2021
@ConstantijnCrijnen
Copy link
Contributor Author

2. SD Abort stops print but then continues for a bit more
Here's a video of what happens when I stop the SD print via LCD screen. The print stops immediately, then continues for a short moment and then proceeds with EVENT_GCODE_SD_ABORT as expected. During the short move after the initial stop, the axes sometimes seem to move in a different direction as they otherwise should have if the print wasn't stopped.
As @thinkyhead commented on issue 5982, this shouldn't be happening since the command queue should have been cleared and blocked.

I did some more experimenting and found that this second issue I was having was also related to the printcounter, as the board would save the stats to eeprom as soon as I stopped the print. Unfortunately I can't do a pull request on my fix as it's always showing me the unicorn. I've never done this before... Am I doing something wrong?

@slowbro
Copy link
Member

slowbro commented Jan 23, 2021

@ConstantijnCrijnen That happens to me too. Make sure you are doing a PR against the bugfix-2.0.x branch. It seems to default to 2.0.x no matter what... I always have to edit the URL.

sjasonsmith added a commit to ConstantijnCrijnen/Marlin that referenced this issue Jan 24, 2021
ConstantijnCrijnen added a commit to ConstantijnCrijnen/Marlin that referenced this issue Jan 25, 2021
thinkyhead added a commit to ConstantijnCrijnen/Marlin that referenced this issue Jan 26, 2021
@thisiskeithb
Copy link
Member

#20856 was merged. Closing.

@github-actions
Copy link

This issue has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@github-actions github-actions bot locked and limited conversation to collaborators Mar 28, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

5 participants