u/lofat

Possible bug with MV and cluster by auto in pipeline?

Ran into a new one today and I'm curious to know if anyone else has hit this.

I've had a pipeline running for a bit. Works like a champ. Uses materialized views with cluster by auto set to true.

Today, I started getting a pipeline validation error.

> Cannot resolve the clustering column enzyme__row__id.__enzyme__row__id__table__2 in root

Genie is telling me this is probably a bug with cluster by auto.

> A Lakeflow SDP pipeline using cluster_by_auto=True and pipeline_internal.enzymeMode: Advanced fails on update because Enzyme selected its own internal struct fields as clustering columns for output materialized views.

> These fields exist in the internal _materialization_mat_487f530d..._qm_gen_info_1 table as STRUCT<__enzyme__row__id__table__1:bigint, __enzyme__row__id__table__2:bigint> at ordinal_position 0, but do NOT exist in the user-facing output MV schema.
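
To make the quoted diagnosis concrete, here is a minimal sketch of the kind of check that would produce a "cannot resolve the clustering column" error. This is not Databricks' actual validation code. The function name `unresolved_clustering_columns`, the nested-dict schema representation, and the `order_id` / `amount` columns are all hypothetical, and treating `enzyme__row__id` as the name of the struct column is an assumption based on the error message.

```python
def unresolved_clustering_columns(clustering_columns, schema):
    """Return the clustering columns whose dotted path is not in `schema`.

    `schema` is a hypothetical nested dict: a column name maps to a type
    string, or to another dict when the column is a struct.
    """
    missing = []
    for path in clustering_columns:
        node = schema
        for part in path.split("."):
            # Stop as soon as one step of the dotted path cannot be followed.
            if not isinstance(node, dict) or part not in node:
                missing.append(path)
                break
            node = node[part]
    return missing


# Internal materialization table: carries the struct described in the quote.
# (The struct column name is assumed from the error message.)
internal_schema = {
    "enzyme__row__id": {
        "__enzyme__row__id__table__1": "bigint",
        "__enzyme__row__id__table__2": "bigint",
    },
    "order_id": "bigint",   # hypothetical user column
    "amount": "double",     # hypothetical user column
}

# User-facing MV schema: the internal struct is not exposed here.
output_schema = {"order_id": "bigint", "amount": "double"}

# The clustering column named in the validation error.
selected = ["enzyme__row__id.__enzyme__row__id__table__2"]

resolved_internally = unresolved_clustering_columns(selected, internal_schema)
resolved_in_output = unresolved_clustering_columns(selected, output_schema)
```

Under these assumptions the column resolves against the internal table but not against the output MV, which matches the behavior described: the clustering choice is valid for one schema and gets validated against the other.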

I thought changing the pipeline and redeploying as FULL reload might resolve it, but no. The cluster by auto setting is still on the object, so validation fails and I can't publish the update. I can't manually alter the clustering either, because the object is managed and it refuses to let me update it. That leaves me with one last option: dropping the impacted objects.

Anyone else run into this? It isn't happening on all of my pipelines, and it only just started. My guess is the optimization processes finally rolled around to selecting these internal columns.

u/lofat — 3 days ago