u/noamiko2004

EventID 393 / 2041 / 2042 / 2153 on Exchange SE DAG — passives flap every few minutes, suspect network layer

Update on my earlier post — consolidated with my teammate who owns the Exchange platform. Picture is broader than I first described, so re-posting with the full state.

Environment

  • 16 Exchange Server SE mailbox servers in a single DAG, split across 2 sites
  • All virtualized on VMware ESXi, Windows Server 2025
  • 3 copies per DB (1 active + 2 passive), DBs are brand new on SE (not migrated)
  • Single NIC per server — MAPI and Replication share the same network (no dedicated replication network)
  • No AV, no host firewall on the Exchange servers
  • DAG witness / AD / DNS all healthy

Symptom

Passive copies on all 16 servers go Disconnected → reconnected every few minutes. Happens both inter-site and intra-site, not just DR. Active copies are clean. Test-ReplicationHealth is green. CopyQueueLength / ReplayQueueLength stay near 0 (occasional 1).
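
To put a number on "every few minutes", here is a rough Python sketch for checking whether the disconnects are evenly spaced. It assumes the disconnect event timestamps have been exported as ISO 8601 strings. The function names, the 20% tolerance and the sample timestamps are all made up for illustration and are not data from our servers. Evenly spaced gaps would hint at a timer somewhere in the path (for example an idle timeout), while irregular gaps would fit load or packet loss better.

```python
from datetime import datetime
from statistics import median


def disconnect_gaps(timestamps):
    """Return the gaps in seconds between consecutive events, oldest first."""
    times = sorted(datetime.fromisoformat(t) for t in timestamps)
    return [(b - a).total_seconds() for a, b in zip(times, times[1:])]


def looks_periodic(gaps, tolerance=0.2):
    """True if every gap lies within `tolerance` (a fraction) of the median gap."""
    if len(gaps) < 2:
        return False
    mid = median(gaps)
    return all(abs(g - mid) <= tolerance * mid for g in gaps)


# Hypothetical sample timestamps, NOT taken from our event logs.
sample = [
    "2025-01-01T10:00:00",
    "2025-01-01T10:05:02",
    "2025-01-01T10:09:58",
    "2025-01-01T10:15:01",
]
gaps = disconnect_gaps(sample)
print(gaps, looks_periodic(gaps))
```

The 20% tolerance is an arbitrary choice; tightening it makes the check stricter about what counts as a fixed interval.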

Main events on the passive side — three of the four are from the HighAvailability source, which puts this squarely in the Microsoft.Exchange.Cluster.Replay log-copy channel (hostnames lightly redacted):

Event 393 — Source: HighAvailability, Task Category: ReplayState

>SetDisconnected called for the local copy of database DB21. LastCopied: 0x3FE82C (4188204) LastNotified: 0x3FE82C (4188204)
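
In the message above, LastCopied and LastNotified are the same generation, so if we are reading those fields correctly the copy had no backlog when it was marked disconnected. That would match the near-zero CopyQueueLength. A small Python sketch for pulling these values out of exported 393 messages, assuming they all follow the format shown above; the regex and the "backlog" field are our own naming, not something Exchange exposes:

```python
import re

# Matches the Event 393 text as quoted above; other message variants are not handled.
EVENT_393 = re.compile(
    r"database (?P<db>\S+)\. "
    r"LastCopied: 0x(?P<copied>[0-9A-Fa-f]+) \(\d+\) "
    r"LastNotified: 0x(?P<notified>[0-9A-Fa-f]+) \(\d+\)"
)


def parse_event_393(message):
    """Extract the database name and log generations from an Event 393 message."""
    match = EVENT_393.search(message)
    if match is None:
        raise ValueError("not an Event 393 SetDisconnected message")
    copied = int(match.group("copied"), 16)
    notified = int(match.group("notified"), 16)
    return {
        "database": match.group("db"),
        "last_copied": copied,
        "last_notified": notified,
        # Our own derived value: generations notified but not yet copied.
        "backlog": notified - copied,
    }


message = (
    "SetDisconnected called for the local copy of database DB21. "
    "LastCopied: 0x3FE82C (4188204) LastNotified: 0x3FE82C (4188204)"
)
print(parse_event_393(message))
```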

Event 2041 — Source: HighAvailability, Task Category: NetworkMonitoring

>A network error happened at LogCopyServer.SendLogs: Microsoft.Exchange.Cluster.Replay.NetworkCommunicationException: An error occurred while communicating with server mbx-pr03. Error: Unable to write data to the transport connection: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond. ---> System.IO.IOException: Unable to write data to the transport connection: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond. ---> System.Net.Sockets.SocketException: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.

   at System.Net.Sockets.NetworkStream.Write(Byte[] buffer, Int32 offset, Int32 size)
   --- End of inner exception stack trace ---
   at System.Net.Sockets.NetworkStream.Write(Byte[] buffer, Int32 offset, Int32 size)
   at System.Net.Security.NegotiateStream.StartWriting(Byte[] buffer, Int32 offset, Int32 count, AsyncProtocolRequest asyncRequest)
   at System.Net.Security.NegotiateStream.ProcessWrite(Byte[] buffer, Int32 offset, Int32 count, AsyncProtocolRequest asyncRequest)
   at System.Net.Security.NegotiateStream.Write(Byte[] buffer, Int32 offset, Int32 count)
   at Microsoft.Exchange.Cluster.Replay.NetworkPackagingLayer.WriteXpressBlock(Byte[] buf, Int32 offset, Int32 length)
   at Microsoft.Exchange.Cluster.Replay.NetworkPackagingLayer.WriteXpress(Byte[] buf, Int32 off, Int32 len)
   at Microsoft.Exchange.Cluster.Replay.NetworkChannel.<>c__DisplayClass110_0.<Write>b__0()
   at Microsoft.Exchange.Cluster.Replay.NetworkChannel.InvokeWithCatch(CatchableOperation op)
   --- End of inner exception stack trace ---
   at Microsoft.Exchange.Cluster.Replay.NetworkChannel.InvokeWithCatch(CatchableOperation op)
   at Microsoft.Exchange.Cluster.Replay.MonitoredDatabase.SendLog(Int64 logGen, NetworkChannel channel, SourceDatabasePerformanceCountersInstance perfCounters, Boolean useCopyLogReply2, Boolean transmissionThrottled, String fullBlockModeFileName, Nullable`1 blockModePos, Nullable`1 blockModeUtc)
   at Microsoft.Exchange.Cluster.Replay.LogCopyServerContext.SendNextLog()
   at Microsoft.Exchange.Cluster.Replay.LogCopyServerContext.SendLogs()
   at Microsoft.Exchange.Cluster.Replay.LogCopyServerContext.SendLogsEntryPoint(Object dummy)

Event 2042 — Source: HighAvailability

>A network timeout happened at LogCopyServer.SendLogs: Microsoft.Exchange.Cluster.Replay.NetworkTimeoutException: A timeout occurred while communicating with server mbx-pr03. Error: The network read operation didn't complete within 5 seconds.

   at Microsoft.Exchange.Cluster.Replay.NetworkChannel.InvokeWithCatch(CatchableOperation op)
   at Microsoft.Exchange.Cluster.Replay.LogCopyServerContext.EnterBlockMode()
   at Microsoft.Exchange.Cluster.Replay.LogCopyServerContext.SendNextLog()
   at Microsoft.Exchange.Cluster.Replay.LogCopyServerContext.SendLogs()
   at Microsoft.Exchange.Cluster.Replay.LogCopyServerContext.SendLogsEntryPoint(Object dummy)

Event 2153 — Source: MSExchangeRepl, Task Category: Service

>The log copier was unable to communicate with server mbx-pr03.contoso.local. The copy of database DB21\mbx-dr07 is in a disconnected state. The communication error was: An error occurred while communicating with server mbx-pr03. Error: Unable to write data to the transport connection: An established connection was aborted by the software in your host machine. The copier will automatically retry after a short delay.

The 2042 timeout being 5 seconds stands out — that feels low as a hard cutoff for log shipping, but I can't find documentation on whether that's tunable on SE.

What we've tried

  • Suspend-MailboxDatabaseCopy + Resume-MailboxDatabaseCopy (the workaround from the 2021 MS Q&A) — does not stick, error returns
  • Disk I/O — Avg Disk sec/Read and /Write well within Exchange thresholds
  • Connectivity — ping/MTU/routing between all nodes is clean
  • AV / host firewall — none installed
  • NIC type swap — older VMXNET3 NIC showed huge ReceivedDiscardedPackets, matching VMware KB 2039495. Swapped 3 of 16 servers to a different NIC type (1 Gbps), discards dropped to 0 on those — but the replication flapping continues on both swapped and unswapped servers
  • VMXNET3 advanced settings on the original NICs: disabled Recv Segment Coalescing (IPv4/IPv6), IPv4 Checksum Offload, Large Send Offload V2 (IPv4/IPv6); maxed Rx Ring #1 Size and Small Rx Buffers — no change to the replication behavior

We haven't ruled VMXNET3 out as part of the picture — clearing the discards on 3 servers didn't stop the flapping, but that just means it isn't the sole cause. Strong suspicion is still on the network/transport layer.
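
Since ping and MTU checks do not exercise TCP on the replication port, one extra test is to time repeated TCP handshakes from a passive node to the active one. A minimal Python sketch, assuming the DAG still uses the default replication port 64327 (we have not verified that it was never changed) and that Python is available on a test machine in the same subnet. probe_tcp is a hypothetical helper name, and the 5-second default timeout is our choice to mirror the timeout in Event 2042. It only measures connection setup, so a clean result would not clear the long-lived log shipping stream.

```python
import socket
import time


def probe_tcp(host, port, attempts=5, timeout=5.0, pause=0.0):
    """Time TCP handshakes to host:port.

    Returns one entry per attempt: the handshake time in seconds,
    or None if the attempt failed or timed out.
    """
    results = []
    for _ in range(attempts):
        start = time.monotonic()
        try:
            # create_connection performs the full TCP handshake, then we close.
            with socket.create_connection((host, port), timeout=timeout):
                results.append(time.monotonic() - start)
        except OSError:
            # Covers refused connections, timeouts and name resolution failures.
            results.append(None)
        time.sleep(pause)
    return results
```

For example, probe_tcp("mbx-pr03", 64327, attempts=20, pause=1.0) run from the mbx-dr07 side would return a list of handshake times in seconds, with None for any attempt that failed or took longer than 5 seconds.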

Health Checker findings (one server, representative)

  • Packets Received Discarded: 138,330,656 — flagged as error (KB 2039495 territory on the older NIC)
  • Sleepy NIC Disabled: False — warning, NIC power saving not disabled
  • NIC Teamed: False
  • Disable IPv6 Correctly: False — IPv6 is not fully disabled by intent; only some NIC-level checkboxes are unchecked. Health Checker flags DisabledComponents = -1 as an error.
  • Nothing else flagged

Where we are

Fairly confident the root cause is in the network / transport layer. The 2041 and 2042 messages both name LogCopyServer.SendLogs, and their stack traces run through Microsoft.Exchange.Cluster.Replay.LogCopyServerContext.SendLogs, failing with either a NetworkCommunicationException (write failed) or a NetworkTimeoutException (read didn't complete in 5s). Not sure yet whether the right thing to look at is VMXNET3, the shared MAPI+Replication NIC topology, TCP behavior on Server 2025, or something between the sites.

Questions

  1. With Exchange SE on Server 2025 + VMXNET3, is a dedicated replication network essentially required now? On 2019 we got away with single-NIC DAGs in similar environments.
  2. Is the 5-second LogCopyServer read timeout configurable on SE, or is that fixed? It feels like the bar to trip is very low.
  3. Anyone seen this exact combo (393 / 2041 / 2042 / 2153, all LogCopyServer.SendLogs failures) and traced it to a specific root cause?

Happy to share Get-DatabaseAvailabilityGroupNetwork, full Health Checker output, or anything else useful. Thanks!

u/noamiko2004 — 1 day ago

EventID 2153 (MSExchangeRepl) on Exchange SE across two sites — log copier "connection aborted by software in your host" on DR-side passives

Hey Guys!

Following up on this recent post and the older 2021 Microsoft Q&A on the same Event ID. Both threads stalled — the 2021 one ended on Suspend/Resume-MailboxDatabaseCopy as a temporary workaround that was never confirmed as a real fix, and the recent Reddit thread never got an answer. We're hitting the exact same symptom on a fresh Exchange SE deployment and looking for someone who's actually root-caused it.

Environment

  • 16 Exchange Server SE mailbox servers in a single DAG, split across 2 sites (primary datacenter + DR site, separate subnets/VLANs)
  • All virtualized on VMware ESXi
  • Windows Server 2025
  • 3 copies per database (1 active + 2 passive), DBs are newly created on SE — not migrated from a previous version
  • DAG witness, AD, DNS — all healthy
  • Active copies currently live on PR-site nodes

Symptom

Application log on the DR-site SE nodes is filling with EventID 2153 from MSExchangeRepl:

> The log copier was unable to communicate with server 'Exchange1.Domain.com'. The copy of database 'MailDBxx\Exchange1' is in a disconnected state. The communication error was: An error occurred while communicating with server 'Exchange1'. Error: Unable to write data to the transport connection: An established connection was aborted by the software in your host machine. The copier will automatically retry after a short delay.

Same error across all databases on the DR-side passive copies. PR-site nodes log nothing.

On the DR nodes, running Get-MailboxDatabaseCopyStatus -ConnectionStatus | FT Identity,IncomingLogCopyingNetwork shows the disconnected/aborted state on the MapiDagNetwork. CopyQueueLength and ReplayQueueLength are 0 most of the time, with an occasional 1.

What we've tried / ruled out

  • Test-ReplicationHealth on all nodes → all green
  • Suspend-MailboxDatabaseCopy + Resume-MailboxDatabaseCopy (the "fix" from the 2021 thread) → does not resolve it, error returns
  • Disk I/O angle from the 2021 thread — Avg Disk sec/Read and Avg Disk sec/Write are well within Exchange thresholds on both sides. Not an I/O issue.
  • L3 between PR and DR — all servers ping each other, no drops, MTU consistent
  • No relevant errors on the active node side
  • DBs are brand new (created on SE), so this isn't legacy / migrated-from-2019 baggage

Question

Is this a known issue with Exchange SE DAG members across two networks/subnets specifically? Anything around:

  • VMXNET3 offloads / RSS / RSC settings on Windows Server 2025 VMs
  • TCP behaviour or RPC over HTTP/MapiHttp changes specific to SE
  • A DAG network configuration nuance that's different on SE vs. 2019

We can share Get-DatabaseAvailabilityGroup, Get-DatabaseAvailabilityGroupNetwork, NIC binding/offload settings, ESXi host config — whatever helps narrow it down.

Disclaimer, we did use AI to help refine this post haha. Thanks in advance!

u/noamiko2004 — 10 days ago