u/alexjalexj

My crazy Aruba and NVX lesson and follow-up (with a resolution, sort of)

This is a follow up to my post the other week: https://www.reddit.com/r/ArubaNetworks/comments/1taak5b/aoscx_switches_leaking_igmp_group_memberships_at/

I'm mostly posting this because I've been banging my head against the wall trying to figure this out over many weeks. My hope is that other people can avoid this or similar pitfalls in the future. I already have both Crestron and Aruba engineers working on resolutions, so I don't expect much useful additional information from the community.

That said, I will start with something that would've been great to know about NVX, even if not using Aruba switches. It is something I've never seen discussed or documented anywhere before:

Two different IGMP message methods

After the IGMP querier sends an IGMP query, the devices all respond with IGMP membership reports. RFC 4541 recommends that a snooping switch forward these reports only to ports with multicast routers attached, so the switch uses them to build its snooping tables and other end devices never see them. If you connect to such a switch and capture in Wireshark, you should see only the queries and your own reports. Some switches do the opposite: they forward the IGMP membership reports to every port all the time.
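To make the two methods concrete, here is a minimal Python sketch of the forwarding decision for a membership report. It is a toy model of one VLAN, not any vendor's implementation; the class name, the `suppress_reports` flag, and the port numbers are all made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class SnoopingSwitch:
    """Toy model of IGMP snooping on one VLAN (hypothetical, not vendor code)."""
    ports: set
    router_ports: set = field(default_factory=set)
    suppress_reports: bool = True
    # group address -> ports that have reported membership
    members: dict = field(default_factory=dict)

    def handle_report(self, in_port: int, group: str) -> set:
        """Learn the membership, then return the ports the report is forwarded to."""
        self.members.setdefault(group, set()).add(in_port)
        if self.suppress_reports:
            # RFC 4541 style: only multicast router ports see the report.
            return self.router_ports - {in_port}
        # Flooding style: every other port on the VLAN sees the report.
        return self.ports - {in_port}


suppressing = SnoopingSwitch(ports={1, 2, 3})
flooding = SnoopingSwitch(ports={1, 2, 3}, suppress_reports=False)

# A decoder on port 2 joins 239.1.0.0; the encoder is on port 1.
print(sorted(suppressing.handle_report(2, "239.1.0.0")))
print(sorted(flooding.handle_report(2, "239.1.0.0")))
```

In the suppressing case the encoder on port 1 never sees the decoder's join. In the flooding case it sees every one.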

Netgear: Tested on an M4250 with various profiles. It forwards IGMP memberships to everyone all the time. I didn't try it, but it seems like you can configure this behavior so it will suppress these messages as well. They might do this intentionally for NVX, for reasons that will become clear below.

Cisco: I don't have Cisco gear to test, but I believe they suppress the IGMP membership messages. I believe it is configurable to flood or suppress.

Aruba: Suppresses IGMP membership messages, and there is no way to configure this behavior. Tested on 6000 (12, 24, 48 port), 6100 (24 port), 6300 (24 port)

How NVX responds

NVX works with either method above, both the suppression and the flooding. However, if a device sees a single IGMP membership join for a group it is sending to, it will want to see those messages repeatedly forever.

Let's say you have an encoder sending on 239.1.0.0. The decoder will send IGMP joins for 239.1.0.0. If these are flooded, the encoder will see those joins from the decoder every {query interval} seconds. If they are suppressed, the encoder will never see those joins. If the encoder sees a single IGMP join for 239.1.0.0, it starts a timer for about 4 minutes and 30 seconds. If it does not see another join before the timer expires, it will stop sending the multicast data. You have to stop and start the stream again. The NVX units will still think everything is happily working, as does the switch.
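Here is a rough Python model of the encoder behavior described above, just to show the three cases side by side. It is only my reading of what I observed, not Crestron's actual logic: the class, the method names, and the exact 270-second value are assumptions.

```python
from typing import Optional

JOIN_TIMEOUT_S = 270.0  # about 4 min 30 s as observed; not a documented value


class EncoderModel:
    """Hypothetical model of the observed NVX encoder behavior."""

    def __init__(self, group: str) -> None:
        self.group = group
        self.last_join_seen: Optional[float] = None  # None: no join seen yet
        self.stalled = False

    def on_igmp_join(self, group: str, now: float) -> None:
        # Each join seen for our own group restarts the timer.
        if group == self.group and not self.stalled:
            self.last_join_seen = now

    def is_streaming(self, now: float) -> bool:
        # If no join was ever seen, the timer never starts and the stream stays up.
        if self.last_join_seen is not None and not self.stalled:
            if now - self.last_join_seen >= JOIN_TIMEOUT_S:
                self.stalled = True  # latched until the stream is restarted
        return not self.stalled


# Case 1: joins suppressed, the encoder never sees one.
suppressed = EncoderModel("239.1.0.0")

# Case 2: joins flooded every query interval (125 s here).
flooded = EncoderModel("239.1.0.0")
for t in range(0, 1001, 125):
    flooded.on_igmp_join("239.1.0.0", float(t))

# Case 3: a single join leaks through at boot, then nothing.
leaked = EncoderModel("239.1.0.0")
leaked.on_igmp_join("239.1.0.0", 5.0)

print(suppressed.is_streaming(1000.0), flooded.is_streaming(1000.0), leaked.is_streaming(1000.0))
```

Only the third case stops streaming, which is the one a boot-time leak produces.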

Why would Crestron do this? Not sure, but maybe so it can sort of work on a switch with a querier but no snooping? Maybe it saves on power usage? It is like it is trying to manage multicast itself instead of just letting the switch do it.

AES67 SAP

This also happens for SAP announcement messages. Since both encoders and decoders using NAX join the SAP group of 239.255.255.255, this one tends to break even more easily. Also, the only way I know to recover is to reboot the NVX unit. However, this only tends to happen if you have a Q-SYS on the network, and possibly other AES67 devices. I explain why more below.

How it really gets messed up on Aruba

The boot process on Aruba is a bit slow. It first brings up layer 2, then layer 3. What I am seeing is that multicast floods on my Aruba switches for 20-40 seconds at boot. This includes IGMP messages.

There is a command on Aruba called "filter-unknown-mcast" that is supposed to suppress this stuff at boot, but it only partially works. For what it's worth, I've tested what I'm talking about across AOS-CX 10.11, 10.13, 10.16, 10.17 and different sub-versions as well, and they all behaved the same.

The problem here is that during this boot window, the NVX units see a single IGMP join because of this flooding and then never see another. As discussed above, this breaks the streams 4m30s later, along with the AES67 SAP announcements.

When the devices first come online, they immediately send an unsolicited IGMP membership join - this is what is flooding during the boot process. The list of groups in these unsolicited joins actually appears to differ from the list in the solicited joins. Interestingly, NVX won't join the SAP group of 239.255.255.255 unsolicited, so AES67 SAP won't ever break if using JUST NVX. However, Q-SYS does send an unsolicited join for 239.255.255.255, which will then break the announcements on all the NVX units during boot.

Once the Aruba querier is running and sends the very first membership query, it is properly suppressing everything so the solicited joins are never seen by other devices.

You can actually mimic (and prove) this behavior by mirroring an NVX decoder's tx to an NVX encoder's rx. This will send the IGMP join packets to the encoder. Then break the mirroring so the IGMP messages are cut off. The NVX will stop sending video 4m30s later, and SAP will break at that point as well.

Different behavior on different Aruba models

This started to drive me really crazy. Why isn't this reported more? Well, I'm using a newer series heavily, the 6000.

It appears only the Aruba 6000 series floods these unsolicited joins during boot. The 6100 and 6300 I tested do not do this.

However, the 6100 and 6300 are not flawless. There is a feature on Aruba called IGMP fastlearn (different from fast leave). If you enable this, it sends an IGMP query any time the switch topology changes (like a link coming online). So at boot, you will see many IGMP queries in rapid succession. Since these occur before the querier is fully ready, the responses from the devices are then flooded, causing the same issue to occur on 6100 and 6300.

Edit to add: The real nitty-gritty of why this happens on the 6000 is in the event logs of the switch. On the 6000, the "filter-unknown-mcast" command I mentioned is applied after the interfaces come online. On the other models, it is applied before the interfaces come online. Regardless, filter-unknown-mcast will filter unsolicited joins, but it won't filter the joins triggered by fastlearn before the switch querier is fully up and running.

Final conclusion

Don't use Aruba 6000 with NVX for now, until either Aruba or Crestron has a fix for this behavior. I don't see a solution at this point other than a firmware fix. Oh, also, if ALL the NVX units are powered by PoE, you won't have this issue since they power up later in the boot process. I am using cards.

For 6100 and up, do not use IGMP fastlearn (fast leave is OK). If you really want to be careful, you can make an ACL on 6200 and up (6000 and 6100 don't support outbound ACLs). The ACL should block IGMP messages OUT only, toward edge devices, for the NVX audio/video group ranges plus SAP. I don't think the ACL would work for IGMPv3, since IGMPv3 sends all reports to the same destination address (224.0.0.22) instead of the group address.
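A quick Python sketch of why a destination-based ACL catches IGMPv2 joins but not IGMPv3 ones. The destination rules are standard IGMP behavior (v1/v2 reports go to the group address, v3 reports go to 224.0.0.22). The blocked ranges below are placeholders, not my real NVX ranges, and the function names are made up.

```python
import ipaddress

# All IGMPv3 membership reports are addressed here, regardless of group.
IGMPV3_REPORT_DST = ipaddress.ip_address("224.0.0.22")

# Hypothetical ranges an outbound ACL might block: an example A/V range plus SAP.
BLOCKED_RANGES = ["239.1.0.0/16", "239.255.255.255/32"]


def report_destination(version: int, group: str):
    """Return the IP destination address of a membership report for a group."""
    if version in (1, 2):
        return ipaddress.ip_address(group)  # v1/v2: sent to the group itself
    if version == 3:
        return IGMPV3_REPORT_DST
    raise ValueError(f"unknown IGMP version: {version}")


def acl_blocks(dst, blocked_ranges=BLOCKED_RANGES) -> bool:
    """True if a deny rule keyed on destination address would match."""
    return any(dst in ipaddress.ip_network(r) for r in blocked_ranges)


for version in (2, 3):
    dst = report_destination(version, "239.1.0.5")
    print(version, dst, acl_blocks(dst))
```

With IGMPv3 the group only appears inside the report payload, so an ACL that matches on destination address cannot tell one group's join from another's.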

As for Netgear, I imagine it is possible they forward these IGMP joins everywhere all the time just to avoid potential issues like what I see on Aruba.

Crossing my fingers for Aruba or Crestron (or both) to fix this issue so I don't have to swap out all my Aruba 6000s.

u/alexjalexj — 3 days ago

AOS-CX switches leaking IGMP group memberships at boot

Tested on: Aruba 6000 48G and 6300 24G, firmware 10.13.1161, 10.13.1170, 10.16.1010, 10.16.1040, 10.17.1001, 10.17.1010. I am testing everything below in a single switch lab environment, with nothing else on the switch but two test devices.

Very basic example config:

ip igmp snooping drop-unknown vlan-shared
ip igmp snooping filter-unknown-mcast
vlan 1
    name av-general
    ip igmp snooping enable
    ip igmp snooping version 2
    client track ip
interface vlan 1
    ip address 10.119.24.11/22
    ip igmp enable
    ip igmp querier
    ip igmp querier-wait-time 1
    ip igmp version 2
    ip igmp querier interval 125
    ip igmp query-max-response-time 10
!
ip route 0.0.0.0/0 10.119.24.1
!
!Switchport config starts here
!
interface 1/1/1-1/1/52
    no shutdown
    mtu 9198
    vlan access 1
    ip igmp snooping fastleave vlan 1

Basic Problem: When the switch boots, devices connect and immediately send IGMP group membership messages. These IGMP join messages are not supposed to be seen by other devices. However, the switch is flooding them to all ports on the VLAN. Once the switch is fully up and running, it properly filters these messages.

Extra detail: IGMPv2 group membership messages are sent to the multicast address for the group. I am using 239.255.255.255 (SAP announcements), for example. I have filter-unknown-mcast enabled and drop-unknown enabled as well, in an attempt to prevent oddities at boot. However, I am still seeing these group membership messages at boot in Wireshark on my laptop, which is connected to another port. It behaves normally once the querier is actually running: I don't see any IGMP group membership messages at all in Wireshark except from my own computer. Technically, even if you are subscribed to that multicast group, you still shouldn't see the IGMP join messages from other devices. This is how the Aruba works - once it is running fully.
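If you would rather script the check than eyeball Wireshark, here is a minimal Python sketch that picks IGMPv2 membership reports out of raw IPv4 packets and flags any that did not come from your own machine. The function names and the IP addresses are made up for illustration, and the packet builder exists only so the example runs without a capture file (it leaves checksums at zero and omits the Router Alert option).

```python
import socket
import struct

IGMP_PROTO = 2          # IP protocol number for IGMP
IGMP_V2_REPORT = 0x16   # IGMPv2 membership report type


def parse_v2_report(ip_packet: bytes):
    """Return (source_ip, group) for an IGMPv2 report, else None."""
    if len(ip_packet) < 20 or ip_packet[0] >> 4 != 4:
        return None
    ihl = (ip_packet[0] & 0x0F) * 4  # IPv4 header length in bytes
    if ihl < 20 or ip_packet[9] != IGMP_PROTO or len(ip_packet) < ihl + 8:
        return None
    if ip_packet[ihl] != IGMP_V2_REPORT:
        return None
    src = socket.inet_ntoa(ip_packet[12:16])
    group = socket.inet_ntoa(ip_packet[ihl + 4:ihl + 8])
    return src, group


def leaked_reports(packets, my_ip: str):
    """List (source_ip, group) for reports sent by hosts other than my_ip."""
    found = []
    for packet in packets:
        parsed = parse_v2_report(packet)
        if parsed is not None and parsed[0] != my_ip:
            found.append(parsed)
    return found


def build_v2_report(src: str, group: str) -> bytes:
    """Build a minimal IPv4 + IGMPv2 report (test helper, checksums zeroed)."""
    igmp = struct.pack("!BBH4s", IGMP_V2_REPORT, 0, 0, socket.inet_aton(group))
    ip_header = struct.pack(
        "!BBHHHBBH4s4s",
        0x45, 0, 20 + len(igmp), 0, 0, 1, IGMP_PROTO, 0,
        socket.inet_aton(src), socket.inet_aton(group),  # v2 report dst = group
    )
    return ip_header + igmp


LAPTOP = "10.119.24.50"   # hypothetical capture machine
DEVICE = "10.119.24.60"   # hypothetical AES67 device
capture = [
    build_v2_report(LAPTOP, "239.255.255.255"),
    build_v2_report(DEVICE, "239.255.255.255"),
]
print(leaked_reports(capture, my_ip=LAPTOP))
```

On a switch that suppresses properly, the list should be empty once the querier is running; anything in it during boot is a leaked join.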

Why this is problematic: I have devices that stop sending multicast data to a group if they see multicast join messages for that group. I believe this is to prevent flooding on switches that aren't working properly or don't support IGMP snooping. So if I reboot my Aruba switch, these devices see a rogue IGMP membership message to their group, stop sending, and require a reboot to start working again. I am working with the device manufacturer, but they are also going to point to the Aruba behavior as problematic.

Possible solutions? On the 6300, I think I could set up ACLs to block IGMP membership back to the devices. However, I am using 6000 series heavily. I don't see any solution to this on 6000 series. Is there a way to change the boot behavior or config to prevent this?

u/alexjalexj — 12 days ago