anshul paruchuri

ACM SIGMOD 2026 Pre-Workshop Talks

Anshul Paruchuri — Sat, 30 May 2026 00:00:00 +0000

Test Automation for Low-Code ETL Workflows

Speaker: Meenakshi D'Souza (IIT Madras)

Introduction

ETL is a data integration process that combines data from various sources according to the desired requirements, and then loads it into a target (typically a data warehouse). No-code/low-code ETL platforms let business users develop ETL pipelines without writing much code. Open-source low-code ETL tools include Apache Airflow, Apache NiFi and CDAP.

Objective

Can we design a set of functional testing plugins for ETL workflows, mainly for data pipelines using ETL tools?

Structure of the ETL Process

Data from multiple sources is transformed using a well-defined set of syntactic rules that are applied step by step, and then loaded into a sink. Typically, this is represented as a DAG. Each node in the DAG is a transformation step, and each edge connects one transformation step to the next.

Proposed Framework (EasyTest ETL)

Consists of 3 plugins: Assertion, Fixture and Mutation.

Assertion: Evaluates a validation rule. Uses Java EXL library to specify rule expressions.
Fixture: This transform implements a singleton class.
Mutation: This project presents a general framework for plugins to facilitate functional testing of low-code ETL workflows.

Fueling Enterprise AI through Robust Data Ingestion

Speaker: Prasad M Deshpande (Databricks)

Introduction

Typical enterprise data is spread over multiple sources. The first step is to get all this data into a single platform. 3 steps in the process:

Ingest
Transform
Orchestrate (Make it run regularly and reliably while balancing cost and latency)

Why is Ingestion Hard?

There are hundreds of source types, each with different APIs, protocols and quirks. Enterprises also have these requirements:

Scalability
Efficiency and cost
Handle updates and deletes
Low latency
Failure recovery
Governance

Ingestion has two phases: Snapshot (initial copy) and Incremental (keeping the data fresh). Sources can be of 2 types: SaaS systems and database systems.

Another challenge is API rate limits on SaaS sources. Smart backoff strategies help here.
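The talk did not say which backoff strategy is used. As a rough sketch, exponential backoff with full jitter is one common choice. `RateLimitError`, `call_with_backoff` and all the parameter names below are hypothetical, not part of any specific SaaS client.

```python
import random
import time


class RateLimitError(Exception):
    """Hypothetical error raised when the source API reports a rate limit."""


def call_with_backoff(fetch, max_retries=5, base_delay=1.0, max_delay=60.0, sleep=time.sleep):
    """Call fetch(), retrying on RateLimitError with exponential backoff and full jitter."""
    for attempt in range(max_retries + 1):
        try:
            return fetch()
        except RateLimitError:
            if attempt == max_retries:
                raise  # give up after the last allowed retry
            # Cap the exponential delay, then wait a random time below the cap.
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, delay))
```

The jitter spreads retries out so that many readers hitting the same limit do not all retry at the same moment.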

Overall Flow

Source > Reader > Buffer > Merger > Destination

Reader: Extract data from source
Buffer: Design choice. It decouples reading from merging, and enables independent scaling
Merger: Merges multiple data sources.

Incremental vs Batch Ingestion

Cursor

Challenge: How do we fetch only new data? This can be done using the Cursor logic. Steps:

Find a column that changes whenever a record is updated.
Keep track of maximum value seen so far.

The ideal cursor field changes whenever the record is updated, always increases and is strictly ordered by update time.
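A minimal sketch of the cursor logic above, assuming an in-memory list of rows and a hypothetical `updated_at` column as the cursor field (neither comes from the talk):

```python
def fetch_incremental(rows, last_cursor):
    """Return the rows updated after last_cursor, plus the new cursor value."""
    new_rows = [r for r in rows if r["updated_at"] > last_cursor]
    # Track the maximum cursor value seen so far; keep the old one if nothing is new.
    new_cursor = max((r["updated_at"] for r in new_rows), default=last_cursor)
    return new_rows, new_cursor


source = [{"id": 1, "updated_at": 10}, {"id": 2, "updated_at": 20}]
first_batch, cursor = fetch_incremental(source, last_cursor=0)

# Later: one record is inserted and one existing record is updated.
source.append({"id": 3, "updated_at": 30})
source[0]["updated_at"] = 35
second_batch, cursor = fetch_incremental(source, cursor)
```

The second call returns only records 1 and 3. Record 2 is skipped because its cursor value did not move.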

Chunking Large Datasets

Naive Approach

Chunking using offset + limit
Go to an offset and then fetch n records.
Problem: Sort order is unstable when cursor fields change. Standard offsets miss records during updates.

Robust Chunking

Do not rely on offset. Use last updated cursor value.
This works, but it cannot handle bulk updates.
LIMIT cuts in the middle of a group.
Can use >= but it causes duplicates.
Cursor does not progress if the whole chunk has the same value.

Keyset Pagination Solution

Idea: Sort on cursor + primary key.
Use the composite (cursor, primary key) as the chunking mechanism.
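A small sketch of keyset pagination using Python's built-in `sqlite3`. The `records` table and its `id` and `updated_at` columns are made up for illustration. Each chunk resumes strictly after the last `(updated_at, id)` pair seen, so a bulk update that gives many rows the same cursor value neither stalls the cursor nor causes duplicates.

```python
import sqlite3


def read_in_chunks(conn, chunk_size):
    """Yield chunks ordered by (updated_at, id), resuming after the last row seen."""
    last = None
    while True:
        if last is None:
            rows = conn.execute(
                "SELECT id, updated_at FROM records ORDER BY updated_at, id LIMIT ?",
                (chunk_size,),
            ).fetchall()
        else:
            last_id, last_cursor = last
            rows = conn.execute(
                "SELECT id, updated_at FROM records "
                "WHERE updated_at > ? OR (updated_at = ? AND id > ?) "
                "ORDER BY updated_at, id LIMIT ?",
                (last_cursor, last_cursor, last_id, chunk_size),
            ).fetchall()
        if not rows:
            return
        yield rows
        last = rows[-1]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, updated_at INTEGER)")
# A bulk update left five rows with the same cursor value (100).
conn.executemany(
    "INSERT INTO records VALUES (?, ?)",
    [(i, 100) for i in range(1, 6)] + [(6, 200)],
)
chunks = list(read_in_chunks(conn, chunk_size=2))
```

With a chunk size of 2, the five rows sharing cursor value 100 are split across chunks by `id`, and every row is read exactly once.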

Unstructured Data for AI

Challenges:

Parsing complexity
Content representation
Incremental syncing

ACLs and Permission Problem

Approach 1: Data Storage

Store ACLs as data
Easy implementation
Relies on trusting the application layer

Approach 2: Platform Security

Map to destination row-level security.

Database Sources

These sources support CDC (change data capture), which reads the transaction log. The two phases here are CDC (read the change stream) and Snapshot (capture the current state by querying tables). CDC needs to start before the snapshot. CDC records have a sequence number; the correct sequence number S should be the LSN at CDC start.

Snapshot Split Problem

A 1TB table at 100MB/s requires nearly 3 hours for a full scan. In such a situation, the solution is to read in parallel, checkpoint individually, and retry independently if issues occur. Splits should ideally be even. Use composite key partitioning for even splits.
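The arithmetic behind the 3-hour figure, and a rough sketch of cutting a key range into even splits. It assumes decimal units (1 TB = 10^12 bytes) and a dense integer key, which is a simplification: it does not implement the composite key partitioning mentioned above. Both function names are hypothetical.

```python
def scan_hours(table_bytes, bytes_per_second, parallelism=1):
    """Hours needed to scan a table, assuming throughput scales with parallelism."""
    return table_bytes / (bytes_per_second * parallelism) / 3600


def even_splits(min_key, max_key, num_splits):
    """Cut the inclusive integer key range into contiguous, near-equal ranges."""
    total = max_key - min_key + 1
    base, extra = divmod(total, num_splits)
    splits, start = [], min_key
    for i in range(num_splits):
        size = base + (1 if i < extra else 0)
        if size == 0:
            continue  # more splits requested than keys available
        splits.append((start, start + size - 1))
        start += size
    return splits


single_reader = scan_hours(10**12, 100 * 10**6)        # about 2.78 hours
eight_readers = scan_hours(10**12, 100 * 10**6, 8)     # about 0.35 hours
ranges = even_splits(1, 10, 3)
```

Each range can then be read, checkpointed and retried on its own.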

Challenges

Merging

Duplicates
Out of Order
Partial Records

The merger must reconcile these technical hurdles to ensure data integrity.
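One way a merger could deal with duplicates and out-of-order records is to keep only the change with the highest sequence number per key. This is a sketch under that assumption, not the approach described in the talk. The record layout (`key`, `seq`, `op`, `row`) is hypothetical, and partial records are not handled here.

```python
def merge_changes(target, changes):
    """Apply change records to target (a dict keyed by primary key), idempotently."""
    for change in changes:
        current = target.get(change["key"])
        if current is not None and change["seq"] <= current["seq"]:
            continue  # duplicate or older record: already reflected in target
        target[change["key"]] = {
            "seq": change["seq"],
            "deleted": change["op"] == "delete",  # keep a tombstone so late updates lose
            "row": change.get("row"),
        }
    return target


changes = [
    {"key": 1, "seq": 5, "op": "upsert", "row": {"v": "b"}},
    {"key": 1, "seq": 3, "op": "upsert", "row": {"v": "a"}},  # arrives out of order
    {"key": 1, "seq": 5, "op": "upsert", "row": {"v": "b"}},  # duplicate delivery
    {"key": 2, "seq": 4, "op": "upsert", "row": {"v": "x"}},
    {"key": 2, "seq": 6, "op": "delete"},
]
merged = merge_changes({}, changes)
```

Because older sequence numbers are ignored, replaying the same changes in any order gives the same result.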

PostgreSQL Architecture

Speaker: Pavan Deolasse (EDB)

Architecture

Postgres uses processes, not threads. Helps with portability, debugging and fault isolation, but leads to high process overhead.

Postmaster: Supervisor, listens for incoming connections and forks per-connection backends.
Per-Connection Backend: Isolated env. Entire query lifecycle runs in a single dedicated OS process.
Background Workers: System maintenance.

Query Processing

Parser > Analyzer > Rewriter > Planner > Executor

Planner/optimizer is the most active contribution area in core Postgres.

Storage

Heap Tables: Primary storage area. Data is stored in pages (8kB by default), and each table is split into 1GB segment files on disk.
Indexes: Similar structures to heap tables, but with a different page format optimized for search.
TOAST: Large values are transparently compressed or moved to side tables.
Free Space/Visibility Maps: Per-page bitmaps that guide insert and vacuum ops for efficiency.
Tablespaces: Allow pointing tables and indexes at different filesystems for storage tiering.
TAM (Table Access Method): Pluggable storage interface that allows alternative table formats, such as columnar storage.

MVCC

Multiversion Concurrency Control. Tuples are never updated in place: an update always creates a new version of the row, which sits next to the old version. This ensures readers and writers do not block each other. It gives accurate snapshot isolation, and rollbacks are easy.
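A toy model of the versioning idea, heavily simplified: it assumes every transaction has committed and that transaction IDs are assigned in commit order, which real Postgres does not assume. The `xmin`/`xmax` field names mirror Postgres tuple headers; the `Row` class and its methods are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TupleVersion:
    value: str
    xmin: int                   # transaction that created this version
    xmax: Optional[int] = None  # transaction that superseded it, if any


class Row:
    def __init__(self, value, txid):
        self.versions = [TupleVersion(value, xmin=txid)]

    def update(self, value, txid):
        # Never overwrite: mark the old version as superseded and append a new one.
        self.versions[-1].xmax = txid
        self.versions.append(TupleVersion(value, xmin=txid))

    def read(self, snapshot_txid):
        """Return the value visible to a snapshot that sees transactions <= snapshot_txid."""
        for version in self.versions:
            created = version.xmin <= snapshot_txid
            superseded = version.xmax is not None and version.xmax <= snapshot_txid
            if created and not superseded:
                return version.value
        return None


row = Row("v1", txid=1)
row.update("v2", txid=5)
old_reader = row.read(snapshot_txid=3)  # snapshot taken before the update
new_reader = row.read(snapshot_txid=5)  # snapshot that includes the update
```

The older snapshot keeps reading the old version while the writer adds the new one, which is why neither has to wait for the other.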

WAL

WAL (write-ahead log) is used for durability and replication. Changes are logged before the modified data pages are written to disk. Postgres supports both physical and logical replication.

Pluggable Indexes

Built-in Index Types

B-tree
Hash
GiST and SP-GiST
GIN
BRIN

Plugins

pgvector for AI embeddings
pg_trgm for trigram similarity search via GIN.

Extensibility

Extensibility is a design principle in Postgres. It is built to be extended without forking. The primary mode of contributions is through the pgsql-hackers mailing list.

Rethinking Relational DBs in the Age of GenAI

Speaker: Carsten Binnig (TU Darmstadt)

Introduction

Almost every critical system depends on relational DBs, and the cloud has made them more scalable. However, using these DBs carries a high overhead, and not much changed even when everything moved to the cloud.

Original Promises:

Easy to use data model and queries
DB optimizes the execution for you.

However, data tends to be unstructured, and does not come in tables.

Relational tax: Overheads are rooted in the design of the relational model.

Query tax: Query authoring is complex.
Tuning tax: DBs require massive tuning.

Cutting these Relational Taxes with AI

Query and Data Tax

Natural language queries are easier for the user, but can be hard to interpret.
A more efficient query type for the DB is sloppy SQL: a mix of natural language and SQL. This is an approach a lot of companies are considering.

Major Issues of using LLMs as DBs

A pure LLM/RAG-style approach for natural language queries leads to some issues.

Limited to simple NL queries
Limited data understanding if data is already structured.
Enterprise data is stored in structured form
Black-box processing
High cost.

A Solution

LLM-Augmented DBs: Extend DBs with LLMs as needed. LLMs and DBs can be used to complement each other.
Relational + LLM-based operators. Use LLM-driven multimodal filters. The LLM can also be used for query planning. Carsten's project is called CAESURA. How it works: take the NL query and the available logical operators, and ask the LLM to create a logical plan that could execute the query. This logical plan is then converted into a physical plan (the actual execution strategy for the tech stack being used). The LLM is used to reason over the data and the logical operations.

Text to SQL in Enterprise Data

Benchmarks: Spider and BIRD.
SeRA: Semantic Restauration Agent. Uses ReAct + reasoning over the semantic meaning of schema elements.
Bespoke DBs: A DBMS tailored for one specific workload.

GOSS and LightGBM

Anshul Paruchuri — Thu, 28 May 2026 00:00:00 +0000

Introduction

LightGBM is a FOSS gradient boosting framework for ML developed by Microsoft. It uses the Gradient-Based One-Side Sampling (GOSS) mechanism.

GOSS

GOSS is a sampling method that first sorts the training data by the magnitude of the gradients of the loss function with respect to the current model, and then selects a subset of the data based on those magnitudes. In regular gradient boosting, a model is trained by iteratively adding weak learners, with each new learner trained on the residual errors of the previous learners. This continues until the model reaches a pre-defined stopping criterion. GOSS, on the other hand, trains each new learner on a subset of the training data chosen by gradient magnitude. It has 2 steps:

For each data instance, the algorithm computes its gradient, and adds it to a sorted list. This is then divided into 2 parts: The top k gradients, and the bottom n-k gradients.
For the large gradients, the algorithm includes all of the corresponding data instances in the subset used to evaluate split points. For the small gradients, the algorithm randomly samples a fixed number of data instances to include in the subset. The number of instances to sample is set by a sampling ratio (other_rate in LightGBM's parameters).

The idea is to sample instances according to the gradient of the loss function with respect to the predictions made by the model.
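The two steps can be sketched in plain Python as follows. This is an illustration of the sampling idea, not LightGBM's implementation. `goss_sample` is a hypothetical helper; `top_rate` and `other_rate` borrow the names of LightGBM's parameters, and the weight `(1 - top_rate) / other_rate` is the correction factor applied to the sampled small-gradient instances.

```python
import random


def goss_sample(gradients, top_rate=0.2, other_rate=0.1, seed=0):
    """Return (indices, weights) chosen by gradient-based one-side sampling."""
    n = len(gradients)
    # Step 1: sort instances by gradient magnitude, largest first.
    order = sorted(range(n), key=lambda i: abs(gradients[i]), reverse=True)
    top_k = int(top_rate * n)
    rand_k = int(other_rate * n)
    top, rest = order[:top_k], order[top_k:]
    # Step 2: keep every large-gradient instance, sample from the remainder.
    sampled = random.Random(seed).sample(rest, rand_k)
    # Upweight the sampled small-gradient instances to offset the sampling.
    amplify = (1 - top_rate) / other_rate
    indices = top + sampled
    weights = [1.0] * top_k + [amplify] * rand_k
    return indices, weights


gradients = [i / 100 for i in range(100)]
indices, weights = goss_sample(gradients)
```

With 100 instances and the default rates, 20 large-gradient instances are kept and 10 of the remaining 80 are sampled.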

Why LightGBM?

While most boosting libraries grow trees level-by-level, LightGBM grows them leaf-wise. It splits the single leaf whose split will reduce the loss the most. This produces deeper, more asymmetric trees with fewer splits.

In sparse data, many features are rarely non-zero at the same time. LightGBM bundles such mutually exclusive features into a single feature, reducing dimensionality without losing information.
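A toy illustration of the bundling idea, assuming two non-negative features that are never non-zero in the same row. The second feature is shifted by an offset so that the two value ranges do not overlap. `bundle_exclusive` is a hypothetical helper and not LightGBM's code, which works on histogram bins.

```python
def bundle_exclusive(feature_a, feature_b, offset):
    """Merge two mutually exclusive features into a single column."""
    bundled = []
    for a, b in zip(feature_a, feature_b):
        if a != 0 and b != 0:
            raise ValueError("features are not mutually exclusive")
        if a != 0:
            bundled.append(a)            # values from feature_a keep their range
        elif b != 0:
            bundled.append(b + offset)   # values from feature_b are shifted above it
        else:
            bundled.append(0)
    return bundled


feature_a = [1, 0, 3, 0]
feature_b = [0, 2, 0, 0]
# Use the maximum of feature_a as the offset so the ranges stay separate.
bundled = bundle_exclusive(feature_a, feature_b, offset=max(feature_a))
```

Any bundled value above the offset came from the second feature, so the original columns can still be told apart.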

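With GOSS, it sorts all samples by gradient magnitude, keeps every large-gradient sample, randomly samples the small-gradient ones, and upweights the sampled small-gradient examples to correct for the sampling bias. This cuts the amount of data scanned per iteration while keeping the gain estimates close to those from the full data.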

LightGBM is particularly useful in ML hackathons as it usually trains faster than XGBoost or Random Forest, does not require feature scaling or normalization, and can handle missing values natively.

Context Graphs

Anshul Paruchuri — Sun, 03 May 2026 00:00:00 +0000

Introduction

Can current systems survive the shift to agents?
Agents become the interface (instead of record systems such as Salesforce, Workday, etc.)
Decision Traces: Missing layer that actually runs enterprises. They capture what happens in specific cases.
Rules: Tell an agent what should happen in general.
Agents do not just need rules. They need access to decision traces to show how rules were applied in the past, and how conflicts are resolved.
Agent systems sit in the execution path. They see the full context at decision time. If these traces are persisted, we get a queryable record of how decisions were made. This currently does not exist.
The accumulated structure formed by these traces is called a context graph. It is a living record of decision traces stitched across entities and time, so that precedent becomes searchable.

What Current Systems do not Capture

Agents ship into real workflows. Decision traces are missing.
We can store the following as durable artifacts:
1. Exception logic that lives in people's heads
2. Precedent from past decisions
3. Cross-system synthesis
4. Approval chains
When startups instrument the agent orchestration layer to emit a decision trace on every run, they get something that enterprises almost never have today: a replayable history of how context was turned into action.
Over time, records should naturally form a context graph.

Why it is not Currently Possible

Operational incumbents are siloed, and they prioritize their current state. Even if these systems introduce agents, they do not preserve the context that justified the decision.
You cannot replay the state of the world at decision time.
A system of agents has an advantage that the agents are in the orchestration path. When an agent triages an escalation, responds to an incident or decides a discount, it pulls context from multiple systems and acts.
The orchestration layer sees the full picture, because it executes the workflow.

blogs

Anshul Paruchuri — Thu, 30 Apr 2026 00:00:00 +0000

coming soon.

MedGraphRAG

Anshul Paruchuri — Tue, 10 Mar 2026 00:00:00 +0000

Summary

MedGraphRAG is an innovative framework designed to improve the accuracy and safety of LLMs in the medical field.
Uses a 3-tier hierarchical graph that links private user data to established medical textbooks and foundation dictionaries.

Key Points

The 3 tiers are as follows:
1. Top level (User-Provided)
2. Medium level (Medical Papers and Books)
3. Bottom level (Fundamental Medical Dictionary)
The paper proposes a U-retrieve strategy that combines top-down retrieval with bottom-up response generation to answer user queries. This is designed to keep the LLM from generating unsupported information and to ground its answers in the retrieved facts.
Meta-Graphs: These are weighted graphs, one per data chunk, that are merged to construct the system's comprehensive global knowledge graph.
The pre-defined medical categories used for tag generation are symptoms, patient history, body functions and medications.
The paper suggests a hybrid static-semantic method to divide larger medical documents into manageable data chunks. It applies a technique called Proposition Transfer to the text, which transforms the raw paragraphs into standalone, self-contained statements. These are then fed to an LLM that uses a zero-shot approach to decide whether a statement belongs to an existing data chunk or requires starting a new chunk.

Notes

3-Tier Graph

Top Level:
- Consists of specific, confidential user data.
- User-specific and experiences the highest frequency of updates and changes.
- The paper uses MIMIC-IV for this.
- Entities are extracted from documents and then linked to entities in the second tier based on relevance.
Medium Level:
- Built from up-to-date, peer-reviewed medical textbooks and articles.
- Acts as a bridge.
- Updated at a medium frequency, typically on an annual basis.
- MedC-K dataset used by the paper.
Bottom Level:
- Provides detailed explanation of medical terms and their semantic relationships.
- Most fundamental and authoritative data tier.
- UMLS dataset used for this layer.

U-Retrieve Strategy

Top-Down Retrieval:
- Structure the user's query using predefined medical tags.
- Using these summarized tag descriptions, the system performs a top-down matching process, starting from the largest, highest-level global graphs, and progressively indexes down into the smaller, more specific graphs.
- This downward matching is repeated until the system reaches the foundational layer where it activates multiple relevant medical entities.
- All the pertinent information related to these activated medical entities is gathered. This includes the content of the entities, their top-k related entities, their relationships and any associated foundational medical knowledge.
Bottom-Up Response Generation:
- Once the content is retrieved, the LLM is prompted to generate an initial, intermediate text response.
- This is then carried upwards and combined with the summarized tag information of the next higher-level graph.
- This is repeated until the highest level of the graph structure is reached.

Meta-Graphs

After user documents are segmented into chunks, and entities are extracted and linked, the system creates a meta-graph for each individual data chunk.
The system prompts an LLM to identify relationships between the extracted entities based on their names, descriptions, definitions and associated lower-level medical knowledge.
The LLM establishes these relationships by identifying the source and target entities, and then assigning a closeness score. This resulting weighted graph is what is referred to as a meta-graph.
These individual meta-graphs are then merged iteratively using the generated tags, and similarity calculation.
This bottom-up merging process repeats until a single global graph remains.
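A rough sketch of the iterative merge, under loud assumptions: tags are plain string sets, similarity is Jaccard overlap of those sets, and when both graphs contain the same edge the higher closeness score is kept. None of these choices are taken from the paper, and `merge_meta_graphs` is a hypothetical helper that expects at least one graph.

```python
def jaccard(a, b):
    """Overlap between two tag sets, from 0.0 (disjoint) to 1.0 (identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def merge_meta_graphs(graphs):
    """Repeatedly merge the two most tag-similar graphs until one remains."""
    graphs = list(graphs)
    while len(graphs) > 1:
        pairs = [(i, j) for i in range(len(graphs)) for j in range(i + 1, len(graphs))]
        i, j = max(pairs, key=lambda p: jaccard(graphs[p[0]]["tags"], graphs[p[1]]["tags"]))
        # Edges map (source, target) to a closeness score.
        edges = dict(graphs[i]["edges"])
        for edge, score in graphs[j]["edges"].items():
            edges[edge] = max(score, edges.get(edge, score))
        merged = {"tags": graphs[i]["tags"] | graphs[j]["tags"], "edges": edges}
        graphs = [g for k, g in enumerate(graphs) if k not in (i, j)] + [merged]
    return graphs[0]


chunk_graphs = [
    {"tags": {"symptoms", "medications"}, "edges": {("fever", "paracetamol"): 0.9}},
    {"tags": {"symptoms", "patient history"}, "edges": {("fever", "paracetamol"): 0.7, ("fever", "flu"): 0.8}},
    {"tags": {"body functions"}, "edges": {("heart", "circulation"): 0.95}},
]
global_graph = merge_meta_graphs(chunk_graphs)
```

The first two graphs share a tag, so they merge first. The third is merged last, leaving a single global graph.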

Expanding MedGraphRAG

Temporal Knowledge Graphs
Real-time physiological data streams
Standardized clinical risk-scoring systems.
The base idea is to augment the 3-tier structure of MedGraphRAG with time-stamped edges and agentic reasoning loops.
The static meta-graphs in MedGraphRAG can be evolved into Temporal KGs that can model a patient's health trajectory as a sequence of state-dependent snapshots.
In our expansion, we can have a patient-centred graph that is defined by specific temporal and causal relationships.

anshul paruchuri

Anshul Paruchuri — Thu, 01 Jan 1970 00:00:00 +0000

bengaluru evenings

i'm anshul. third-year cse student at pes university, bengaluru. interested in ai and data. other than that, i like football and photography.
previous sde intern @ karplexus; ex-core @ hsp (foss club)

links: github · mail · photography

this site is inspired by my friend aditya hegde's site.

photography

Anshul Paruchuri — Thu, 01 Jan 1970 00:00:00 +0000

resume

Anshul Paruchuri — Thu, 01 Jan 1970 00:00:00 +0000

I am actively looking for internship opportunities in Bengaluru (or remote), primarily in Data Analytics, MLOps and AI Infrastructure during Fall 2026. If something sounds relevant, please reach out to me via email (anshulparuchuri@gmail.com).

download resume (pdf)