Skip to content

PMM-14576 Backup improvements.#5490

Open
JiriCtvrtka wants to merge 20 commits into
mainfrom
PMM-14576-backup-improvements
Open

PMM-14576 Backup improvements.#5490
JiriCtvrtka wants to merge 20 commits into
mainfrom
PMM-14576-backup-improvements

Conversation

@JiriCtvrtka

@JiriCtvrtka JiriCtvrtka commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

PMM-14576

Fixes false failed to get backup status / failed to get restore status errors during MongoDB PBM backup and restore jobs when describe-backup / describe-restore temporarily fail while the operation is still running or metadata is not ready yet.

Refactors backup/restore completion polling into a shared describePoller that:

  1. treats known PBM transient errors as retryable (no such file, file is empty, get backup meta: not found, etc.)
  2. keeps waiting while pbm status reports the operation as running
  3. falls back to pbm status snapshots (backup) or pbm list --restore (restore) when describe keeps failing after completion
  4. preserves the full retry budget until the operation has finished

@JiriCtvrtka

Copy link
Copy Markdown
Contributor Author

@copilot review

@codecov

codecov Bot commented Jun 11, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 64.64286% with 99 lines in your changes missing coverage. Please review.
✅ Project coverage is 44.25%. Comparing base (2478fef) to head (dccff14).

Files with missing lines Patch % Lines
agent/runner/jobs/pbm_helpers.go 68.18% 74 Missing and 3 partials ⚠️
agent/runner/jobs/mongodb_restore_job.go 0.00% 20 Missing ⚠️
agent/utils/poll/poll.go 88.88% 1 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #5490      +/-   ##
==========================================
+ Coverage   43.50%   44.25%   +0.74%     
==========================================
  Files         413      414       +1     
  Lines       42953    41908    -1045     
==========================================
- Hits        18687    18546     -141     
+ Misses      22392    21596     -796     
+ Partials     1874     1766     -108     
Flag Coverage Δ
agent 49.88% <64.64%> (+0.85%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@JiriCtvrtka

Copy link
Copy Markdown
Contributor Author

@copilot review

@JiriCtvrtka

Copy link
Copy Markdown
Contributor Author

@copilot review

@JiriCtvrtka

Copy link
Copy Markdown
Contributor Author

@copilot review

@JiriCtvrtka JiriCtvrtka marked this pull request as ready for review June 16, 2026 09:11
@JiriCtvrtka JiriCtvrtka requested a review from a team as a code owner June 16, 2026 09:11
@JiriCtvrtka JiriCtvrtka removed the request for review from a team June 16, 2026 09:11
JiriCtvrtka and others added 2 commits June 17, 2026 17:27
Resolve conflict in pbm_helpers.go by keeping the refactored
describePoller-based polling from this branch.

Co-authored-by: Cursor <cursoragent@cursor.com>

@maxkondr maxkondr Jun 22, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can it be implemented easier?
here is the native poller implementation (I would place it into utils/) and just call it with appropriate values:

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// ConditionFunc returns true when the condition is successfully met.
type ConditionFunc func(ctx context.Context) (done bool, err error)

// PollUntilContextTimeout native standard library equivalent
func PollUntilContextTimeout(ctx context.Context, interval time.Duration, condition ConditionFunc) error {
	// 1. Run an immediate initial check before starting the ticker.
	// If the condition is already true, we skip waiting.
	if done, err := condition(ctx); err != nil {
		return err
	} else if done {
		return nil
	}

	// 2. Initialize a ticker for subsequent intervals.
	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	for {
		select {
		case <-ctx.Done():
			// The context timeout expired or was canceled by the caller
			return ctx.Err()

		case <-ticker.C:
			// Trigger the condition check on every tick
			done, err := condition(ctx)
			if err != nil {
				return err
			}
			if done {
				return nil
			}
		}
	}
}

@JiriCtvrtka

Copy link
Copy Markdown
Contributor Author

@copilot review

@JiriCtvrtka

Copy link
Copy Markdown
Contributor Author

@copilot review

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants