u/SquachSeven

I wanted to share our experience as a VSP after upgrading our Cloud Connect environment to Veeam 13, partly to see whether other providers are seeing similar behaviour, and partly because it has been a very difficult few weeks operationally.

To be clear upfront, we did not initially intend to upgrade as early as we did.

As a service provider, we tend to prioritise stability over moving immediately onto major new versions. However, a growing number of customers began upgrading their own VBR servers due to the security vulnerability concerns and advisories around older versions.

As many providers will know, once customers upgrade ahead of the provider platform, Cloud Connect compatibility becomes an issue. Even though Veeam does warn customers during installation, many understandably proceed anyway due to the security messaging, and we eventually reached a point where enough customers had upgraded themselves that we effectively had to move the provider platform forward sooner than planned.

Unfortunately, since upgrading, we have experienced a sequence of different issues affecting various parts of the platform.

The first issue we encountered involved infrastructure components apparently presenting incorrect or mismatched identities/certificates after upgrade. This seemed to lead to transport instability and certificate validation problems between components.

After working through that, we then began seeing .NET SSL-related issues within VSPC.

Separately, we also experienced periods where Veeam ONE activity appeared to place significant load on the Veeam RESTful API, which impacted provider-side administration tasks such as editing tenants, managing the platform and even running PowerShell scripts.

Most recently, we are now dealing with a large amount of Cloud Connect retry activity associated with Continuous Copy Jobs.

Initially we saw intermittent failures, but over time this developed into very large numbers of repeated retries across multiple tenants. While investigating, we began noticing some chains with extremely large incremental depths, and from our observations it appears possible that retention/consolidation may not always be occurring as expected in certain scenarios.

The concern from our side is that these large chains then appear to contribute to ongoing retry behaviour, with jobs repeatedly attempting to process throughout the day.

Storage consumption has also become a concern. In some cases, triggering an Active Full appears to temporarily restore job operation. However, if the previous chain is not cleaned up as expected, storage usage can grow rapidly, because the older incremental chains remain on disk alongside the new fulls and their new incrementals.
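To make the chain-depth and storage concern easier to reason about, here is a minimal Python sketch. Everything in it is an assumption for illustration: the `RestorePoint` record, the "full"/"increment" labels, the sizes and the depth threshold are hypothetical and do not come from any Veeam API or from our environment. It only shows the arithmetic of splitting a list of restore points into chains, counting increments per chain, and adding up what sits on disk when an old chain is left behind.

```python
from dataclasses import dataclass


@dataclass
class RestorePoint:
    kind: str       # hypothetical label: "full" or "increment"
    size_gb: float  # hypothetical size on the cloud repository


def split_chains(points):
    """Split an oldest-to-newest list of restore points into chains.

    A new chain starts at every full; increments attach to the latest chain.
    """
    chains = []
    for point in points:
        if point.kind == "full" or not chains:
            chains.append([point])
        else:
            chains[-1].append(point)
    return chains


def summarise(points, max_depth):
    """Return depth, size and a 'too deep' flag for each chain."""
    report = []
    for chain in split_chains(points):
        depth = sum(1 for p in chain if p.kind == "increment")
        report.append({
            "depth": depth,
            "size_gb": sum(p.size_gb for p in chain),
            "too_deep": depth > max_depth,
        })
    return report


# Made-up example: an old chain with 40 increments that was never
# consolidated, followed by a new Active Full and 3 increments.
points = (
    [RestorePoint("full", 500.0)]
    + [RestorePoint("increment", 10.0)] * 40
    + [RestorePoint("full", 500.0)]
    + [RestorePoint("increment", 10.0)] * 3
)
report = summarise(points, max_depth=14)
total_gb = sum(chain["size_gb"] for chain in report)
```

With these made-up numbers the old chain alone accounts for 900 GB of the 1430 GB total, which is the kind of growth described above when the previous chain is not removed.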

This evening our second ticket was closed and the remaining problem re-opened under another ticket as a new issue (each time something is fixed, a new ticket has to be opened for the other parts). We were also provided with a potential workaround specifically relating to the recurring terminated/killed session behaviour we are currently investigating.

The proposed workaround from Veeam R&D was to add the following registry key on the customer-side VBR server:

Path:
Computer\HKEY_LOCAL_MACHINE\SOFTWARE\Veeam\Veeam Backup and Replication

Name:
ProcessVirtualizationExecutionsPerManager

Type:
DWORD

Value:
1 (Decimal)
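
For anyone who would rather script this than click through regedit, here is a minimal Python sketch that builds the equivalent `reg add` command from the path, name, type and value above. The function names and the dry-run default are my own assumptions, not anything Veeam supplied, and it only runs the command when explicitly told to on a Windows machine with admin rights.

```python
import subprocess
import sys

# Path, name and value exactly as given in the workaround above.
KEY = r"HKLM\SOFTWARE\Veeam\Veeam Backup and Replication"
NAME = "ProcessVirtualizationExecutionsPerManager"


def build_reg_add(value=1):
    """Build the reg.exe argument list that sets the DWORD value."""
    return ["reg", "add", KEY, "/v", NAME,
            "/t", "REG_DWORD", "/d", str(value), "/f"]


def apply_workaround(dry_run=True):
    """Return the command; only execute it on Windows when dry_run is False."""
    cmd = build_reg_add()
    if dry_run or sys.platform != "win32":
        return cmd
    subprocess.run(cmd, check=True)  # needs an elevated session
    return cmd


# Dry run: shows the command a customer would need to run.
command_line = " ".join(
    f'"{part}"' if " " in part else part for part in apply_workaround()
)
```

The dry-run default is deliberate: on a customer-side VBR server the change still needs the customer's agreement, so the sketch produces the command to hand over rather than applying it silently.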

The difficulty for us operationally is that, as a provider, we do not necessarily have administrative access to customer VBR servers. We also cannot realistically perform out-of-hours reboots or manual changes in customer environments while active backups are running, and any such change would need prior customer agreement. So sadly it's going to be another day before we get to see whether this will work or not.

We were advised that this workaround is currently the available recommendation from R&D and that a more permanent fix is expected in a future product version.

I also want to be fair regarding support. The individual engineers we have interacted with have generally been polite and professional throughout. However, the overall process has been quite difficult to navigate during an active provider-side incident.

Because the issues have manifested in different ways over time, cases have sometimes been separated or recreated, which has unfortunately led to repeated log uploads, repeated explanations of earlier findings, and delays while context is re-established between teams and regions.

There have also been occasions where we have remained available outside normal working hours after being asked whether we could continue troubleshooting with teams in other regions, but communication and handovers have not always flowed as smoothly as we would have hoped during those periods.

From a provider perspective, the overall operational impact has been significant:

customer backup failures
increased retry activity
storage growth concerns
platform administration difficulties
and increased customer concern/confidence issues

At the moment we are still actively working through these issues with support and trying to stabilise the platform.

I would genuinely be interested to hear whether any other VCSPs or Cloud Connect providers have experienced similar behaviour after moving to Veeam 13, particularly around:

Continuous Copy Job retries
large incremental chain depth
retention/consolidation behaviour
VSPC SSL issues
or provider-side administration/API performance concerns

Hopefully this helps start a constructive conversation between providers experiencing similar issues.

Thought I’d throw this up here because this has bitten me a few times over the years now, and every single time I end up finding absolutely nothing useful online about it.

Had it happen again today during a Veeam upgrade and finally thought, right, I’m posting the fix in case it saves somebody else the headache.

The issue:

You run a Veeam install, upgrade, or Service Provider Console upgrade.

Installer says prerequisites need installing.
It asks for a reboot.
You reboot.
Installer comes back.
Same prerequisite message again.
Rinse and repeat forever.

Nothing actually installs.
No obvious errors.
Just stuck in a stupid reboot/prerequisite loop.

What made this especially frustrating was that Windows itself isn't reporting any pending reboot states. All the usual checks look clean.

In every instance, the culprit has been a stale Veeam RunOnce registry value that never got cleared.

This was the value:

HKLM\SOFTWARE\Microsoft\Windows\CurrentVersion\RunOnce\VeeamWizardEngineNeedReboot

More specifically:

Key:

HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows\CurrentVersion\RunOnce

Value:

VeeamWizardEngineNeedReboot

Deleted that value, reran the installer, and the upgrade immediately carried on normally.

No more reboot loop.

Obviously usual disclaimer applies:
Be careful in the registry, export the key first if you’re unsure, etc.
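
If you want to script the check, export and delete, here is a rough Python sketch that wraps the standard `reg query`, `reg export` and `reg delete` commands. The function names, the backup filename and the dry-run default are assumptions of mine for illustration; run it elevated, and only on the machine where the installer is looping.

```python
import subprocess
import sys

RUNONCE_KEY = r"HKLM\SOFTWARE\Microsoft\Windows\CurrentVersion\RunOnce"
VALUE_NAME = "VeeamWizardEngineNeedReboot"


def build_commands(backup_file="runonce-backup.reg"):
    """Build the three reg.exe calls: check, back up the key, delete the value."""
    return [
        ["reg", "query", RUNONCE_KEY, "/v", VALUE_NAME],
        ["reg", "export", RUNONCE_KEY, backup_file, "/y"],
        ["reg", "delete", RUNONCE_KEY, "/v", VALUE_NAME, "/f"],
    ]


def clear_stale_flag(dry_run=True):
    """Return the commands; only execute them on Windows when dry_run is False."""
    commands = build_commands()
    if dry_run or sys.platform != "win32":
        return commands
    query, export, delete = commands
    # reg query exits non-zero when the value is absent: nothing to clean up.
    if subprocess.run(query, capture_output=True).returncode != 0:
        return []
    subprocess.run(export, check=True)  # export the key first, per the disclaimer
    subprocess.run(delete, check=True)
    return commands


planned = clear_stale_flag()  # dry run: just lists what would be executed
```

The export step writes the whole RunOnce key to a .reg file before anything is removed, so the value can be re-imported if deleting it turns out not to be the fix.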

Not saying this is the fix for every prerequisite issue, but if:

Windows says no reboot pending
Veeam still insists it needs one
installer keeps looping endlessly

…definitely check that key before you lose half your day to it like I did.

Hopefully helps someone in the future searching this at 2am while questioning their life choices.

VSP Experience After Upgrading to Veeam 13

Veeam Upgrade / Install Stuck in Endless Prerequisite Reboot Loop? Check This Registry Key