My crazy Aruba and NVX lesson and follow-up (with a resolution, sort of)
This is a follow-up to my post from the other week: https://www.reddit.com/r/ArubaNetworks/comments/1taak5b/aoscx_switches_leaking_igmp_group_memberships_at/
I'm mostly posting this because I've been banging my head against the wall trying to figure this out over many weeks. My hope is that other people can avoid this pitfall, or similar ones, in the future. I already have both Crestron and Aruba engineers working on resolutions, so I don't think there will be a lot of useful additional information from the community.
That said, I will start with something that would've been great to know about NVX, even if not using Aruba switches. It is something I've never seen discussed or documented anywhere before:
Two different IGMP message methods
After the IGMP querier sends an IGMP query, the devices all respond with IGMP membership reports. RFC 4541 recommends that a snooping switch not flood these reports: only the switch itself should see them and use them to build its snooping tables (or forward them to multicast routers). Other devices should not see anyone else's membership reports. If you connect to the switch and capture in Wireshark, you should only see the queries and your own reports. Some switches do roughly the opposite and send the membership reports to everyone all the time.
Netgear: Tested on an M4250 with various profiles. It forwards IGMP memberships to everyone all the time. I didn't try it, but it seems like you can configure this behavior so it will suppress these messages as well. They might do this intentionally for NVX, for reasons you will see later.
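If you would rather check a capture with a script than by eye, here is a minimal Python sketch that labels the raw IGMP payload of a packet. The type codes are the standard IGMP ones; the function name `describe_igmp` is made up for illustration, and getting the payload bytes out of your capture is left to whatever tool you use.

```python
import socket
import struct

# Standard IGMP message type codes.
IGMP_TYPES = {
    0x11: "membership query",
    0x12: "IGMPv1 membership report",
    0x16: "IGMPv2 membership report",
    0x17: "IGMPv2 leave group",
    0x22: "IGMPv3 membership report",
}


def describe_igmp(payload: bytes) -> str:
    """Label an IGMP message given the bytes that follow the IP header."""
    if len(payload) < 8:
        raise ValueError("IGMP message must be at least 8 bytes")
    # Common layout: type, max response time, checksum, group address.
    msg_type, _max_resp, _checksum, group_raw = struct.unpack("!BBH4s", payload[:8])
    name = IGMP_TYPES.get(msg_type, f"unknown type 0x{msg_type:02x}")
    if msg_type == 0x22:
        # IGMPv3 reports carry a list of group records instead of one group field;
        # bytes 6-7 hold the number of records.
        (num_records,) = struct.unpack("!H", payload[6:8])
        return f"{name} ({num_records} group record(s))"
    return f"{name} for {socket.inet_ntoa(group_raw)}"
```

On a port where reports are suppressed, everything this labels as a membership report should have come from your own machine. If you see reports from other hosts, the switch is flooding them.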
Cisco: I don't have Cisco gear to test, but I believe they suppress the IGMP membership messages. I believe it is configurable to flood or suppress.
Aruba: Suppresses IGMP membership messages, and there is no way to configure this behavior. Tested on 6000 (12, 24, 48 port), 6100 (24 port), 6300 (24 port)
How NVX responds
NVX works with either method above, both the suppression and the flooding. However, if a device sees a single IGMP membership join for a group it is sending to, it will want to see those messages repeatedly forever.
Let's say you have an encoder sending on 239.1.0.0. The decoder will send IGMP joins for 239.1.0.0. If these are flooded, the encoder will see those joins from the decoder every {query interval} seconds. If they are suppressed, the encoder will never see those joins. If the encoder sees a single IGMP join for 239.1.0.0, it starts a timer for about 4 minutes and 30 seconds. If it does not see another join before that timer expires, it will stop sending the multicast data. You have to stop and start the stream again. The NVX units will still think everything is happily working, as does the switch.
Why would Crestron do this? Not sure, but maybe to somehow sort of work on a switch with a querier but no snooping? Maybe it saves on power usage? It is like it is trying to manage multicast itself instead of just letting the switch do it.
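The timer behavior described above can be pictured with a small Python model. This is only a sketch of the observed behavior, not Crestron's actual logic: the 270-second value is just the roughly 4m30s from the post, and `stream_stop_time` is a made-up name.

```python
JOIN_TIMEOUT_S = 270.0  # ~4m30s as observed; not a documented Crestron value


def stream_stop_time(join_times, horizon, timeout=JOIN_TIMEOUT_S):
    """Return the time (seconds) at which the modeled encoder stops sending,
    or None if it keeps sending through `horizon`.

    join_times: times at which the encoder itself saw a join for its group.
    """
    deadline = None  # no timer runs until the first join is seen
    for t in sorted(join_times):
        if deadline is not None and t >= deadline:
            break  # the timer already expired before this join arrived
        deadline = t + timeout  # each join seen in time restarts the timer
    if deadline is not None and deadline <= horizon:
        return deadline
    return None


# Suppressed: the encoder never sees a join, so no timer ever starts.
print(stream_stop_time([], horizon=600.0))
# Flooded: a join every 125 s (the standard default query interval) keeps it alive.
print(stream_stop_time(range(0, 601, 125), horizon=600.0))
# One leaked join at t=10 s and nothing after: the stream stops at t=280 s.
print(stream_stop_time([10.0], horizon=600.0))
```

The third case is the trouble spot: a single join that leaks through, followed by silence.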
AES67 SAP
This also happens for SAP announcement messages. Since both encoders and decoders using NAX join the SAP group of 239.255.255.255, this one tends to break even more easily. Also, the only way I know to recover is to reboot the NVX unit. However, this only tends to happen if you have a Q-SYS on the network, and possibly other AES67 devices. I explain why more below.
How it really gets messed up on Aruba
The boot process on Aruba is a bit slow. It first brings up layer 2, then layer 3. What I am seeing is that multicast is flooded on my Aruba switches for 20-40 seconds at boot. This includes IGMP messages.
There is a command on Aruba called "filter-unknown-mcast" that is supposed to suppress this stuff at boot, but it only partially works. For what it's worth, I've tested what I'm talking about across AOS-CX 10.11, 10.13, 10.16, 10.17 and different sub-versions as well, and they all behaved the same.
The problem here is that during this boot window, the NVX units see a single IGMP join because of this flooding and then never see one again. As discussed above, this breaks the streams 4m30s later, along with the AES67 SAP announcements.
When the devices first come online, they immediately send an unsolicited IGMP membership join - this is what is flooding during the boot process. This unsolicited list actually appears to differ from the solicited IGMP membership joins. Interestingly, NVX won't join the SAP group of 239.255.255.255 unsolicited. So AES67 SAP won't ever break if using JUST NVX. However, Q-SYS does send an unsolicited join for 239.255.255.255, which will then break the announcements on all the NVX units during boot.
Once the Aruba querier is running and sends the very first membership query, it is properly suppressing everything so the solicited joins are never seen by other devices.
You can actually mimic (and prove) this behavior by mirroring an NVX decoder's TX to an NVX encoder's RX. This will send the IGMP join packets to the encoder. Then break the mirroring so the IGMP messages are cut off. The NVX will stop sending video 4m30s later, and SAP will break at that point as well.
Different behavior on different Aruba models
This started to drive me really crazy. Why isn't this reported more? Well, I'm using a newer series heavily, the 6000.
It appears only the Aruba 6000 series floods these unsolicited joins during boot. The 6100 and 6300 I tested do not do this.
However, the 6100 and 6300 are not flawless. There is a feature on Aruba called IGMP fastlearn (different from fast leave). If you enable this, it sends an IGMP query any time the switch topology changes (like a link coming online). So at boot, you will see many IGMP queries in rapid succession. Since these occur before the querier is fully ready, the responses from the devices are then flooded, causing the same issue to occur on 6100 and 6300.
Edit to add: The real nitty-gritty of why this happens on the 6000 is in the event logs of the switch. The "filter-unknown-mcast" command I mentioned is applied after the interfaces come online on the 6000. On the others, it is applied before the interfaces come online. Regardless, filter-unknown-mcast will filter unsolicited joins, but it won't filter the joins triggered by fastlearn before the switch querier is fully up and running.
Final conclusion
Don't use Aruba 6000 with NVX for now, until either Aruba or Crestron has a fix for this behavior. I don't see a solution at this point other than a firmware fix. Oh, also if ALL the NVX units are PoE-powered, you won't have this issue since they power up later in the boot process. I am using cards.
For 6100 and up, do not use IGMP fastlearn (fast leave is OK). If you really want to be careful, you can make an ACL on 6200 and up (6000 and 6100 don't support outbound ACLs). The ACL should block IGMP messages OUT only, toward edge devices, for the NVX audio/video group ranges plus SAP. I don't think the ACL would work for IGMPv3, since IGMPv3 sends all membership reports to the same destination address (224.0.0.22) rather than to the group being joined.
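To make the ACL idea and the IGMPv3 caveat concrete, here is a minimal Python sketch of the matching logic only. It is not switch configuration. The 239.1.0.0/16 range is a hypothetical example, so substitute whatever ranges your NVX deployment uses, and `acl_blocks` is a made-up name.

```python
import ipaddress

# Hypothetical ranges for illustration only; use your own NVX group ranges.
BLOCKED_GROUP_RANGES = [
    ipaddress.ip_network("239.1.0.0/16"),        # assumed NVX audio/video range
    ipaddress.ip_network("239.255.255.255/32"),  # SAP group mentioned in the post
]


def acl_blocks(protocol: str, dst_ip: str) -> bool:
    """True if an outbound rule matching protocol IGMP plus destination IP
    would drop this packet toward an edge port."""
    if protocol.lower() != "igmp":
        return False  # the multicast stream itself (UDP) must still pass
    dst = ipaddress.ip_address(dst_ip)
    return any(dst in net for net in BLOCKED_GROUP_RANGES)


# IGMPv2 report: the destination is the group itself, so the rule matches.
print(acl_blocks("igmp", "239.1.0.0"))
# IGMPv3 report: the destination is always 224.0.0.22, so the rule never matches.
print(acl_blocks("igmp", "224.0.0.22"))
# Video data to the same group is UDP and is left alone.
print(acl_blocks("udp", "239.1.0.0"))
```

With IGMPv3 the group addresses sit inside the report payload rather than in the IP destination, so a rule that only looks at protocol and destination address cannot single out the NVX groups.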
As for Netgear, I imagine it is possible they forward these IGMP joins everywhere all the time just to avoid potential issues like what I see on Aruba.
Crossing my fingers for Aruba or Crestron (or both) to fix this issue so I don't have to swap out all my Aruba 6000s.