ACM Sigmod 2026 Pre-Workshop Talks

Published on 2026-05-30

ai databases data

Test Automation for Low-Code ETL Workflows

Speaker: Meenakshi D'Souza (IIT Madras)

Introduction

ETL is a data integration process that combines data from various sources according to the desired requirements, and then loads it into a target (typically a data warehouse). No-code/low-code ETL platforms provide features for business users to develop ETL pipelines without writing much code. Low-code open source ETL tools: Apache Airflow, Apache Ni-Fi, CDAP.

Objective

Can we design a set of functional testing plugins for ETL workflows, mainly for data pipelines using ETL tools?

Structure of the ETL Process

Data from multiple sources are transformed using a well defined set of syntactic rules that are applied step by step, and loaded into a sink. Typically, this is represented as a DAG. Each node in the DAG is a transformation step, and edges connect one transformation step into another.

Proposed Framework (EasyTest ETL)

Consists of 3 plugins: Assertion, Fixture and Mutation.

Assertion: Evaluates a validation rule. Uses Java EXL library to specify rule expressions.
Fixture: This transform implements a singleton class.
Mutation: This project presents a general framework for plugins to facilitate functional testing of low-code ETL workflows.

Fueling Enterprise AI through Robust Data Ingestion

Speaker: Prasad M Deshpande (Databricks)

Introduction

Typical enterprise data is spread over multiple sources. The first step is to get all this data into a single platform. 3 steps in the process:

Ingest
Transform
Orchestrate (Make it run regularly and reliably while balancing cost and latency)

Why is Ingestion Hard?

There are hundreds of source types, each with different APIs, protocols and quirks. Enterprises also have these requirements:

Scalability
Efficiency and cost
Handle updates and deletes
Low latency
Failure recovery
Governance Ingestion has two phases: Snapshot (initial copy) and Incremental (keeping the data fresh). Sources can be of 2 types: SaaS Systems and Database Systems

Another challenge is rate limits while using APIs for SaaS sources: Use smart backoff strategies.

Overall Flow

Source > Reader > Buffer > Merger > Destination

Reader: Extract data from source
Buffer: Design choice. It decouples reading from merging, and enables independent scaling
Merger: Merges multiple data sources.

Incremental vs Batch Ingestion

Cursor

Challenge: How do we fetch only new data? This can be done using the Cursor logic. Steps:

Find a col that changes whenever a record is updated
Keep track of maximum value seen so far.

The ideal cursor field changes whenever the record is updated, always increases and is strictly ordered by update time.

Chunking Large Datasets

Naive Approach

Chunking using offset + limit
Go to an offset and then fetch n records.
Problem: Sort order is unstable when cursor fields change. Standard offsets miss records during updates.

Robust Chunking

Do not rely on offset. Use last updated cursor value.
This works, but it cannot handle bulk updates.
LIMIT cuts in the middle of a group.
Can use >= but it causes duplicates.
Cursor does not progress if the whole chunk has the same value.

Keyset Pagination Solution

Idea: Sort on cursor + primary key.
Use composite as chunking mechanism

Unstructured Data for AI

Challenges:

Parsing complexity
Content representation
Incremental syncing

ACLs and Permission Problem

Approach 1: Data Storage

Store ACLs as data
Easy implementation
Reies on trusting application layer

Approach 2: Platform Security

Map to destination row-level security.

Database Sources

These sources have CDCs that can read the transaction log. The two phases here are CDC (read change stream) and Snapshot (capture current state by querying tables). CDC needs to start before snapshot. CDC records have a sequence number; The correct sequence number S should be the LSN at CDC start.

Snapshot Split Problem

A 1TB table at 100MB/s requires nearly 3 hours for a full scan. In such a situation, the solution is to read in parallel, checkpoint individually, and retry independently if issues occur. Splits should ideally be even. Use composite key partitioning for even splits.

Challenges

Merging

Duplicates
Out of Order
Partial Records Merger must reconcile these technical hurdles to ensure data integrity.

PostgreSQL Architecture

Speaker: Pavan Deolasse (EDB)

Architecture

Postgres uses processes, not threads. Helps with portability, debugging and fault isolation, but leads to high process overhead.

Postmaster: Supervisor, listens for incoming connections and forks per-connection backends.
Per-Connection Backend: Isolated env. Entire query lifecycle runs in a single dedicated OS process.
Background Workers: System maintenance.

Query Processing

Parser > Analyzer > Rewriter > Planner > Executor Planner/optimizer is the most active contribution area in core Postgres.

Storage

Heap Tables: Primary storage area. Default page size is 8kB, and is segmented at 1GB files on disk.
Indexes: Similar structures to heap tables, but diff page format optimized for search.
TOAST: Large values are transparently compressed or moved to side tables.
Free Space/Visibility Maps: Per-page bitmaps that guide insert and vacuum ops for efficiency.
Tablespaces: Allows pointing tables and indexes at diff filesystems for storage tiering.
TAM: Pluggable columnar architecture.

MVCC

Multiversion Concurrency Control. The mechanism is that tuples are never updated in-place. Always create a new version of the row that sits next to the old version of the row. This ensures readers and writers do not block each other. It gives accurate snapshot isolation, and rollbacks are easy.

WAL

WAL is used for durability and replication. Changes are logged before hitting the disk. Replication is both physical and logical in postgres.

Pluggable Indexes

Built-in Index Types

B-tree
Hash
GiST and SP-GiST
GIN
BRIN

Plugins

pgvector for AI embeddings
pg_trgm for trigram via GIN.

Extensibility

Extensibility is a design principle in Postgres. It is built to be extended without forking. The primary mode of contributions is through the pgsql-hackers mailing list.

Rethinking Relational DBs in the Age of GenAI

Speaker: Carsten Binnig (TU Darmstadt)

Introduction

Almost every critical system depends on Relational DBs, and cloud helps it become more scalable. However, there is a high overhead to pay, to use these DBs. Not much changed even when everything moved to cloud.

Original Promises:

Easy to use data model and queries
DB optimizes the execution for you. However, data tends to be unstructured, and does not come in tables.

Relational tax: Overheads are rooted in the design of the relational model.

Query tax: Query authoring is complex. Tuning tax: DBs require massive tuning.

Cutting these Relational Taxes with AI

Query and Data Tax

Natural language queries easier for user, but can be hard to interpret.
A more efficient query type for the DB is sloppy SQL: a mix of natural language and SQL. This is an approach a lot of companies are considering.

Major Issues of using LLMs as DBs

A pure LLM/RAG-style approach for natural language queries leads to some issues.

Limited to simple NL queries
Limited data understanding if data is already structured.
Enterprise data is stored in structured form
Black-box processing
High cost.

A Solution

LLM-Augmented DBs: Extend DBs with LLMs as needed. LLMs and DBs can be used to complement each other.
Relational + LLM-based operators. Use LLM-driven Multimodal filters. LLM can be used for query planning. Carsten's project is called Caesura. Working: Take NL query and logical operators, and ask the LLM to create a logical plan to possibly execute the query. This logical plan is then converted into a physical plan (actual execution strategy with the tech stack being used). LLM is used to reason over data and logical operations.

Text to SQL in Enterprise Data

Benchmarks: Spider and BIRD.
SeRA: Semantic Restauration Agent. Uses ReACT + Reasoning over semantic meaning of schema elements.
Bespoke DBs: A DBMS tailored for one specific workload.