AI Foundations  ·  Data Engineering

Data Engineering for AI

Before any model trains, data must be collected, cleaned, transformed, versioned, and served reliably. This guide covers everything an AI engineer needs to know about the data layer — from pipelines and file formats to feature stores, orchestration, streaming, and vector databases.

10 sections · ~45 min read · Intermediate
1 What is Data Engineering for AI?

Data engineering is the discipline of designing, building, and operating the infrastructure that collects, stores, transforms, and delivers data reliably. For AI/ML workloads, it takes on a specific character: data quality directly determines model quality. Garbage in, garbage out is not a cliche — it is the single biggest reason production AI systems fail.

💡 In most production ML teams, 60–80% of engineering effort is spent on data — not models. A fast pipeline with clean data consistently beats a fancy model trained on dirty data.
Traditional DE vs AI-Focused DE
Traditional Data Engineering
  • Serves BI dashboards and reports
  • Schema-on-write (structured tables)
  • Batch daily or hourly is acceptable
  • SQL is the primary interface
  • Data freshness: hours is fine
AI/ML Data Engineering
  • Serves training pipelines and model inference
  • Schema-on-read (semi-structured data lakes)
  • May need sub-second real-time features
  • Python, Spark, Arrow as primary tools
  • Data versioning and reproducibility are critical
Key Roles in the AI Data Stack
RolePrimary ResponsibilityData Tasks
Data EngineerBuilds and operates data pipelinesIngestion, transformation, storage, orchestration
ML EngineerTrains models and deploys themFeature engineering, data versioning, training pipelines
Data ScientistAnalyses data and experiments with modelsEDA, feature selection, data labelling
MLOps EngineerOperates ML systems in productionData drift monitoring, retraining triggers, CI/CD

At smaller companies one person covers all four roles. At scale they split into dedicated teams. Either way, understanding the data engineering layer is non-negotiable for any AI engineer who wants to ship reliable systems. This knowledge sits at Layer 1 of the AI Skills Stack — everything else builds on it.

The AI Data Engineering Lifecycle
📥 Collect
APIs, databases, web scraping, sensors, user events, third-party data
🔒 Store
Object storage, data lakes, warehouses, lakehouses with proper partitioning
🔧 Transform
Clean, deduplicate, type-cast, join, aggregate into features
✅ Validate
Schema checks, null rates, distribution tests, drift detection
🚀 Serve
Feature store for training; online store for inference; APIs for models
2 Data Pipeline Architecture

A data pipeline is a sequence of stages that moves data from source systems to a destination where it can be consumed by a dashboard, a model, or an API. For AI, the canonical three phases are Ingest → Transform → Serve.

The Medallion Architecture (Bronze → Silver → Gold)

The most widely adopted pattern in modern AI data stacks is the medallion architecture, popularised by Databricks. Raw data lands in Bronze, is cleaned into Silver, and is aggregated into Gold for models and reports.

📥 Bronze (Raw)
Exact copy of source data. Logs, API dumps, DB exports. Never delete. Append-only.
🔨 Silver (Cleaned)
Deduplicated, typed, null-handled. Schema enforced. Ready for analysis and feature engineering.
✨ Gold (Curated)
Aggregated, joined, feature-engineered. Directly consumed by ML models and BI dashboards.
Batch vs Streaming Pipelines
Batch Processing
  • Processes data in chunks on a schedule
  • Tools: Spark, dbt, pandas, Polars, DuckDB
  • Latency: minutes to hours
  • Use cases: daily training jobs, offline reports
  • Simpler to debug and reprocess
Stream Processing
  • Processes events continuously as they arrive
  • Tools: Kafka, Flink, Spark Streaming
  • Latency: milliseconds to seconds
  • Use cases: fraud detection, real-time recommendations
  • Harder to manage state and exactly-once semantics
A Minimal Python Batch Pipeline (Bronze to Silver)
Python — Bronze to Silver transformation
import pandas as pd
from pathlib import Path

# Bronze: read raw data exactly as received
bronze = pd.read_json("data/bronze/events_2026-06-30.jsonl", lines=True)

# Silver: clean and type
silver = (bronze
  .drop_duplicates(subset=["event_id"])
  .dropna(subset=["user_id", "timestamp"])
  .assign(
    timestamp=pd.to_datetime(bronze["timestamp"]),
    amount=bronze["amount"].astype(float).clip(lower=0)
  )
)

# Write Silver as Parquet (columnar, compressed, fast for ML)
Path("data/silver").mkdir(parents=True, exist_ok=True)
silver.to_parquet("data/silver/events_2026-06-30.parquet", index=False)
print(f"Bronze rows: {len(bronze):,}  ->  Silver rows: {len(silver):,}")
Polars: The Faster Pandas for Large Datasets

Polars is a Rust-based DataFrame library that is typically 5–20x faster than pandas for large dataset operations. It uses lazy evaluation (builds a query plan and optimises before executing), native multi-threading, and Arrow as its memory format. For Gold-layer transformations on datasets over 10M rows, Polars is becoming the default choice.

Polars — lazy evaluation pipeline
import polars as pl

# Lazy: builds query plan, doesn't execute yet
result = (
    pl.scan_parquet("data/silver/*.parquet")   # lazy read
    .filter(pl.col("amount") > 0)
    .group_by("user_id")
    .agg([
        pl.col("amount").sum().alias("total_spend"),
        pl.col("event_id").count().alias("num_events"),
        pl.col("timestamp").max().alias("last_seen"),
    ])
    .sort("total_spend", descending=True)
)

# .collect() executes the optimised query plan
df = result.collect()   # uses all CPU cores automatically
3 File Formats & Serialization

The file format you choose has a larger impact on ML pipeline performance than most engineers expect. For a 100M-row dataset, the difference between CSV and Parquet can be 10–50x in read speed and 5–10x in storage cost.

Format Comparison Table
FormatLayoutCompressionBest ForAvoid When
CSVRow (text)NoneSmall datasets, human inspection, data interchangeLarge ML datasets, analytical queries
ParquetColumnar (binary)Snappy / ZSTDML feature tables, analytics on selected columnsAppend-heavy workloads, tiny files
Apache Arrow (IPC)Columnar (in-memory)LZ4Zero-copy data sharing across Pandas, Spark, DuckDBOn-disk long-term storage
JSON Lines (JSONL)Row (text)None / GZIPLLM training datasets, log events, streamingAnalytical queries, large numerical arrays
HDF5 / NPYArray (binary)OptionalNumerical tensors, numpy arrays, scientific dataHeterogeneous schemas, non-numerical data
TFRecordSequential binaryGZIPTensorFlow training datasets with tf.dataNon-TensorFlow frameworks
AvroRow (binary)Deflate / SnappyKafka message schemas, schema evolutionColumn-heavy analytics workloads
Why Parquet Dominates ML Workloads

Parquet stores data column-by-column rather than row-by-row. When your training script reads only user_age, purchase_amount, and label from a 200-column table, Parquet reads only 3 column chunks off disk. CSV reads all 200 columns and discards 197. That difference compounds massively at scale.

Python — CSV vs Parquet read performance
import pandas as pd, time, os

# Same 10M-row, 50-column dataset in both formats
cols_needed = ["age", "purchase_amount", "label"]

t = time.time()
df_csv = pd.read_csv("data.csv", usecols=cols_needed)
print(f"CSV:     {time.time()-t:.1f}s  |  disk: {os.path.getsize('data.csv')/1e9:.1f} GB")

t = time.time()
df_pq = pd.read_parquet("data.parquet", columns=cols_needed)
print(f"Parquet: {time.time()-t:.1f}s  |  disk: {os.path.getsize('data.parquet')/1e9:.1f} GB")

# Typical results:
# CSV:     18.4s  |  disk: 4.7 GB
# Parquet:  0.9s  |  disk: 0.4 GB  (5x smaller, 20x faster)
JSONL for LLM Training Data

Large language model training datasets are almost always stored as JSONL — one JSON object per line. This allows streaming reads (no need to load the entire file), easy sharding across GPU workers, and simple append-only updates. HuggingFace datasets natively supports JSONL.

JSONL — LLM instruction tuning dataset format
{"instruction": "Explain gradient descent in simple terms", "response": "Gradient descent is an optimisation algorithm..."}
{"instruction": "What is a transformer architecture?", "response": "A transformer uses self-attention to process sequences..."}
{"instruction": "Write a Python function to sort a list", "response": "def sort_list(lst): return sorted(lst)"}
Apache Arrow: Zero-Copy Data Transfer

Apache Arrow defines a standardised in-memory columnar format. Because Pandas, Spark, Polars, DuckDB, and PyArrow all speak Arrow natively, you can pass a DataFrame between them without serialisation or copying. For ML pipelines that chain Python tools (pandas → scikit-learn → model), Arrow is the invisible glue that keeps it fast.

4 Storage Architecture: Lakes, Lakehouses & Warehouses

Where you store data shapes how fast your ML pipeline runs, how much it costs, and whether you can reproduce experiments from six months ago. Three patterns dominate modern AI stacks.

Data Lake

A data lake stores raw files in cheap object storage (Amazon S3, Google GCS, Azure Blob) in any format — structured, semi-structured, or unstructured. Extremely cheap (~$0.02/GB/month on S3). No enforced schema. Great for raw Bronze data and unstructured content (text, images, audio) for AI training. The risk is the data swamp: without governance, nobody knows what data exists or whether it is reliable.

Data Warehouse

A data warehouse (Snowflake, BigQuery, Redshift) enforces schema, supports SQL, and provides ACID transactions. Fast for analytical queries on aggregated Gold-layer data. Too expensive to use as primary storage for raw training data. Best for model monitoring dashboards, A/B test results, and business metrics.

Data Lakehouse (The AI-Native Pattern)

The lakehouse combines lake economics (cheap object storage) with warehouse reliability (ACID, schema enforcement, time travel). Enabled by open table formats sitting on top of Parquet in object storage:

🔺
Delta Lake
Open-source by Databricks. ACID on Parquet. Time travel up to 30 days. Largest ecosystem and community.
🪪
Apache Iceberg
Netflix-origin. Best schema evolution and hidden partitioning. Supported by Spark, Flink, Trino, Snowflake.
🌊
Apache Hudi
Uber-origin. Best for streaming upserts and CDC (change data capture) patterns on a data lake.
Time travel is critical for ML reproducibility. SELECT * FROM training_data TIMESTAMP AS OF '2026-01-15' lets you recreate the exact dataset your model trained on 6 months ago, even if the table has since changed. Without this, experiment reproducibility is impossible.
Object Storage Best Practices for AI Workloads
  • Partition by date and entity: s3://bucket/events/year=2026/month=06/day=30/ — Hive-compatible partitioning enables Spark to prune scan to only relevant partitions, cutting costs 10–100x on large tables.
  • Target file size 128MB–1GB: Too many small files (under 1MB each) kills Spark performance due to task scheduling overhead. Use coalesce() or repartition(n) before writing.
  • Lifecycle policies: Auto-archive Bronze data older than 90 days to Glacier ($0.004/GB/month). Keep Gold layer in Standard storage for fast access.
  • Separate buckets by tier: Prevent Gold-layer data from being accidentally overwritten by Bronze ingestion jobs. Use separate IAM permissions per bucket.
  • Enable versioning on Gold: S3 bucket versioning lets you restore accidentally deleted or overwritten model training datasets.
5 ETL vs ELT for ML Workloads

ETL and ELT both move data from sources to a destination. The difference is where and when transformation happens — and that determines your cost model, flexibility, and ability to reprocess historical data.

ETL (Extract → Transform → Load)
  • Transform before loading into the destination
  • Used with legacy data warehouses (limited compute)
  • Transformation done in custom Python or Spark code
  • Harder to reprocess — logic lives outside the destination
  • Better for sensitive data (PII masked before loading)
ELT (Extract → Load → Transform)
  • Load raw data then transform inside the destination
  • Used with modern cloud warehouses (BigQuery, Snowflake)
  • Transformation done in SQL/dbt inside the warehouse
  • Easy to re-run transformations on historical raw data
  • More cost-effective: cloud warehouses scale compute elastically
dbt: The Standard ELT Transformation Tool

dbt (data build tool) is the de-facto standard for the Transform step in ELT pipelines. It lets you write SQL transformations as version-controlled models with built-in testing, documentation generation, and data lineage graphs. Every serious ML team running on BigQuery or Snowflake uses dbt.

dbt — Silver model example (SQL file in models/silver/)
-- models/silver/user_features.sql
{{ config(materialized='table', schema='silver') }}

WITH raw AS (
  SELECT * FROM {{ source('bronze', 'user_events') }}
  WHERE event_date >= '2026-01-01'
),
cleaned AS (
  SELECT
    user_id,
    DATE(event_timestamp)           AS event_date,
    COUNT(*)                        AS daily_events,
    SUM(purchase_amount)            AS daily_spend,
    AVG(session_duration_seconds)   AS avg_session_sec
  FROM raw
  WHERE user_id IS NOT NULL
    AND purchase_amount >= 0
  GROUP BY 1, 2
)
SELECT * FROM cleaned

-- Run: dbt run --select silver.user_features
-- Test: dbt test --select silver.user_features
Data Contracts: Preventing Silent Model Corruption

A data contract is a formal schema agreement between a data producer (upstream team) and consumer (ML pipeline). When a data contract is broken — an upstream team renames a column, changes a data type, or drops a field — it can silently corrupt model inputs and produce wrong predictions for hours before anyone notices. Contracts are enforced as schema checks in CI/CD using tools like Soda Core, Great Expectations, or custom Pydantic validators.

Common ML-Specific ETL/ELT Patterns
  • Point-in-time joins: When building training data, join features using only information available at or before the event timestamp. Violating this causes label leakage — the most dangerous form of data contamination in ML.
  • Snapshot tables: Store daily snapshots of slowly-changing dimensions (user attributes, product catalogues) so you can reconstruct historical states for training.
  • Data versioning: Use Delta Lake or Iceberg time travel to tag each training run with the exact data version used. Essential for audit trails and experiment reproducibility.
6 Feature Engineering & Feature Stores

A feature is a measurable property of data used as input to a model. Raw data rarely goes directly into models — it must be transformed into numerical representations that capture the signal the model needs. Feature engineering is often the highest-leverage activity in applied ML, and bad features are impossible for even the best model to overcome.

Common Feature Engineering Techniques
TechniqueWhen to UseExample
Normalisation / ScalingNumerical features with different rangesStandardScaler on age; MinMax on price [0,1]
One-Hot EncodingLow-cardinality categoricals (<50 values)country → [is_US, is_UK, is_IN, ...]
Target EncodingHigh-cardinality categoricalsReplace city with average conversion rate per city
Binning / BucketisationContinuous to ordinal conversionage → [18-25, 26-35, 36-50, 51+]
Window AggregationsTime-series and user behaviourspend_last_7d, clicks_last_30d, avg_order_last_3
EmbeddingsText, images, high-cardinality IDsproduct_id → 128-dim dense vector
Log TransformRight-skewed distributions (price, revenue)log1p(purchase_amount) normalises heavy tail
Interaction FeaturesCapturing combined signalage * income, clicks / impressions
Python — Feature pipeline with scikit-learn ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer

numerical   = ["age", "account_age_days", "spend_last_30d", "num_sessions"]
categorical = ["country", "device_type", "plan_tier"]

preprocessor = ColumnTransformer([
    ("num", StandardScaler(), numerical),
    ("cat", OneHotEncoder(handle_unknown="ignore", sparse_output=False), categorical),
])

# CRITICAL: fit on training data only, then transform both sets
# This prevents data leakage from test set statistics entering training
X_train_enc = preprocessor.fit_transform(X_train)
X_test_enc  = preprocessor.transform(X_test)   # reuses train mean/std
Training-Serving Skew: The Silent Model Killer

Training-serving skew occurs when feature computation at training time differs from feature computation at inference time. It is extremely common and nearly impossible to detect without deliberate monitoring. Classic examples:

  • Training computes "purchases in last 30 days" from a historical Parquet snapshot. Production computes it from a real-time Redis aggregation with a slightly different time-window boundary (e.g., UTC vs local timezone).
  • Training imputes missing ages with the mean of the training set. Inference uses a hardcoded constant that differs.
  • Training uses log1p(amount). Production endpoint applies log(amount+1) — same math, different edge case at amount=0.
Feature Stores: Solving Training-Serving Skew by Design

A feature store is a centralised system that computes, stores, and serves features identically for both training and inference. It eliminates training-serving skew by ensuring the same code path runs in both contexts.

🍽️
Feast (Open Source)
Registry-based. Offline store on Parquet/BigQuery. Online store on Redis/DynamoDB. Free, widely adopted.
Tecton
Managed, enterprise-grade. Streaming feature pipelines. Best for real-time ML at scale. Expensive.
🔷
Databricks Feature Store
Integrated with MLflow. Automatic lineage tracking. Best choice if already running on Databricks.

Key concept: the offline store (Parquet, BigQuery) provides point-in-time-correct features for training, preventing label leakage. The online store (Redis, DynamoDB) serves those same features at millisecond latency for real-time inference. Feature definitions are written once, executed in both contexts.

7 Data Quality & Validation

Data quality issues cause more production ML failures than model bugs. A model trained on corrupted or drifted data produces wrong predictions silently — often for days — before anyone notices. Automated data validation is your first line of defence and should run at every stage of the pipeline.

The Four Dimensions of Data Quality
Completeness
Are required fields always present? Is the null rate for each column below its threshold? Are critical partitions never missing?
📐
Consistency
Is the same entity represented identically across tables? No contradictory values? Referential integrity maintained?
🕐
Timeliness
Did data arrive within the expected time window? No missing daily partitions? No stale features being served to the model?
📊
Distribution
Do feature distributions match what the model was trained on? Are there sudden spikes, drops, or anomalous values?
Validation with Great Expectations

Great Expectations is the most widely used open-source data validation framework. You define "expectations" — assertions about your data — and run them in your pipeline CI/CD. A failed expectation blocks the pipeline before corrupted data reaches your model.

Python — Great Expectations validation suite
import great_expectations as gx

context = gx.get_context()
batch = context.sources.pandas_default.read_parquet("data/silver/users.parquet")

# Define expectations (assertions about data)
batch.expect_column_to_exist("user_id")
batch.expect_column_values_to_not_be_null("user_id")
batch.expect_column_values_to_be_unique("user_id")
batch.expect_column_values_to_be_between("age", min_value=0, max_value=120)
batch.expect_column_mean_to_be_between("spend_last_30d", min_value=5, max_value=500)
batch.expect_column_values_to_be_in_set("country", ["US","UK","IN","DE","AU"])
batch.expect_column_proportion_of_unique_values_to_be_between("plan_tier", min_value=0, max_value=0.1)

# Run validation - fails pipeline if any expectation is not met
result = batch.validate()
assert result.success, f"Data quality FAILED:
{result.to_json_dict()}"
print(f"All {result.statistics['evaluated_expectations']} checks passed.")
Schema Validation with Pandera

Pandera provides DataFrame schema validation with a Python-native API. It integrates directly into pipeline functions as decorators, checking inputs and outputs automatically.

Python — Pandera schema validation as decorator
import pandera as pa
from pandera import check_types
from pandera.typing import DataFrame, Series

class UserFeatureSchema(pa.DataFrameModel):
    user_id:   Series[str]   = pa.Field(nullable=False, unique=True)
    age:       Series[int]   = pa.Field(ge=0, le=120)
    spend:     Series[float] = pa.Field(ge=0.0)
    country:   Series[str]   = pa.Field(isin=["US","UK","IN","DE","AU"])

    class Config:
        coerce = True  # auto-cast types if possible

@check_types
def build_features(raw: DataFrame) -> DataFrame[UserFeatureSchema]:
    # If output doesn't match UserFeatureSchema, pandera raises at runtime
    return raw.rename(columns={"total_spend": "spend"}).dropna()
Data Drift Detection

Data drift occurs when the statistical distribution of production data shifts from what the model was trained on. This is one of the leading causes of silent model degradation in production. Detect it with:

  • Population Stability Index (PSI): Standard for categorical drift. PSI above 0.2 indicates significant distribution shift and should trigger model retraining.
  • Kolmogorov-Smirnov (KS) test: Detects distribution shift in continuous features. A low p-value (typically <0.05) means drift is detected.
  • Jensen-Shannon Divergence: Measures similarity between two distributions. Works for both continuous and discrete features.
  • Evidently AI: Open-source library that generates data and model drift reports with a single call. Integrates with MLflow, Grafana, and Prometheus for production monitoring.
Python — Drift detection with Evidently AI
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset, DataQualityPreset

report = Report(metrics=[
    DataDriftPreset(),      # checks all features for distribution shift
    DataQualityPreset(),    # checks nulls, ranges, duplicates
])
report.run(
    reference_data=training_df,    # what model was trained on
    current_data=production_df     # what's arriving in production now
)

# Check programmatically
result = report.as_dict()
drift_detected = result["metrics"][0]["result"]["dataset_drift"]
drifted_features = result["metrics"][0]["result"]["number_of_drifted_columns"]

if drift_detected:
    print(f"WARNING: {drifted_features} features drifted - trigger retraining")
8 Orchestration: Airflow, Prefect & Dagster

An ML pipeline has many interdependent steps: ingest data → validate → transform → build features → train model → evaluate → deploy. These steps have dependencies, can fail, need retries, and must run on a schedule. A data orchestrator manages all of this with dependency awareness, retries, monitoring, and alerting.

The Three Main Orchestrators Compared
ToolParadigmBest ForLearning Curve
Apache AirflowDAG-based (Python decorated tasks)Large teams, complex dependencies, existing Airflow investmentHigh (separate scheduler, worker, webserver components)
PrefectPython-native flows and tasksTeams wanting simpler Python API with cloud deployment optionMedium (very Pythonic, fast to get started)
DagsterSoftware-defined assetsML pipelines needing strong lineage and data-aware schedulingMedium (asset paradigm is different but powerful for ML)
Apache Airflow: DAG-Based Orchestration
Apache Airflow — Daily ML training pipeline DAG
from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime, timedelta

def ingest_data(**ctx):    ...  # pull from data sources
def validate_data(**ctx):  ...  # run Great Expectations checks
def build_features(**ctx): ...  # Silver -> Gold transformation
def train_model(**ctx):    ...  # fit model on Gold features
def evaluate_model(**ctx): ...  # compute metrics, push to MLflow

with DAG(
    dag_id="ml_training_pipeline",
    schedule="0 2 * * *",        # run daily at 2am UTC
    start_date=datetime(2026, 1, 1),
    catchup=False,
    default_args={
        "retries": 2,
        "retry_delay": timedelta(minutes=5),
        "on_failure_callback": alert_slack,
    },
) as dag:
    t1 = PythonOperator(task_id="ingest",    python_callable=ingest_data)
    t2 = PythonOperator(task_id="validate",  python_callable=validate_data)
    t3 = PythonOperator(task_id="features",  python_callable=build_features)
    t4 = PythonOperator(task_id="train",     python_callable=train_model)
    t5 = PythonOperator(task_id="evaluate",  python_callable=evaluate_model)

    t1 >> t2 >> t3 >> t4 >> t5  # sequential dependency chain
Dagster: Software-Defined Assets (Recommended for ML)

Dagster's asset-based model aligns better with ML workflows than task-based DAGs. You define what you produce (datasets, models, metrics) rather than what tasks run. This gives you automatic data lineage, data-aware scheduling (only re-run when upstream assets change), and built-in observability.

Dagster — ML pipeline as software-defined assets
from dagster import asset, define_asset_job, ScheduleDefinition, AssetIn

@asset
def bronze_events():
    """Raw event data ingested from S3."""
    return pd.read_json("s3://bucket/events/today.jsonl", lines=True)

@asset
def silver_features(bronze_events):
    """Cleaned and validated feature table derived from bronze events."""
    return clean_and_validate(bronze_events)

@asset
def trained_model(silver_features):
    """XGBoost model trained on silver features. Logged to MLflow."""
    model = train_xgboost(silver_features)
    mlflow.sklearn.log_model(model, "model")
    return model

@asset
def model_metrics(trained_model, silver_features):
    """Evaluation metrics for the latest trained model."""
    return evaluate(trained_model, get_test_split(silver_features))

# Lineage: model_metrics <- trained_model <- silver_features <- bronze_events
daily_job = define_asset_job("daily_training", selection="*")
daily_schedule = ScheduleDefinition(job=daily_job, cron_schedule="0 2 * * *")
💡 If starting fresh in 2026, choose Dagster for ML pipelines (best data lineage, asset-aware scheduling) or Prefect for simpler pipelines. Use Airflow only if your team already has it deployed — the migration cost from Airflow to Dagster is non-trivial but the observability gains are significant for large ML systems.
9 Streaming Data & Real-Time AI

Real-time AI features — fraud scores updated with every transaction, personalised recommendations updated after every click — require streaming data pipelines. Batch pipelines introduce latency measured in hours. Streaming pipelines introduce latency measured in milliseconds.

Apache Kafka: The Standard Event Bus

Kafka is the de-facto standard message broker for real-time ML pipelines. It decouples producers (systems generating events) from consumers (ML services processing them) via a distributed, fault-tolerant, ordered log. Key concepts:

  • Topic: A named stream of events (similar to a database table, but for streams). E.g., user-clicks, payment-events, sensor-readings.
  • Partition: Topics are split into partitions for parallelism and scalability. A topic with 12 partitions can be processed by 12 consumer instances simultaneously.
  • Offset: Each message has a monotonically increasing offset within its partition. Consumers track their position — enabling replay from any historical offset.
  • Consumer Group: A group of consumers sharing a topic's partitions. Kafka guarantees each partition is consumed by exactly one member of the group, enabling parallel processing with no duplicate work.
  • Retention: Kafka retains messages on disk (default 7 days, configurable to forever). This enables the Kappa architecture: replay all history to rebuild derived datasets.
Python — Kafka producer and consumer for feature pipeline
from confluent_kafka import Producer, Consumer
import json

# PRODUCER: application sends events to Kafka
producer = Producer({"bootstrap.servers": "kafka:9092"})
producer.produce(
    topic="user-events",
    key="user_123",
    value=json.dumps({
        "event": "purchase",
        "amount": 49.99,
        "item_id": "prod_456",
        "ts": "2026-06-30T10:00:00Z"
    })
)
producer.flush()

# CONSUMER: real-time feature computation service
consumer = Consumer({
    "bootstrap.servers": "kafka:9092",
    "group.id": "realtime-feature-pipeline",
    "auto.offset.reset": "earliest",
    "enable.auto.commit": False,   # manual commit for exactly-once
})
consumer.subscribe(["user-events"])

while True:
    msg = consumer.poll(timeout=1.0)
    if msg is None: continue
    if msg.error(): handle_error(msg.error()); continue

    event = json.loads(msg.value().decode("utf-8"))
    # Compute real-time feature and update online feature store (Redis)
    update_spend_aggregate(event["key"], event["amount"])
    consumer.commit(message=msg)   # only commit after successful processing
Stream Processing Frameworks
Apache Flink
True event-time streaming. Stateful computations with checkpointing. Best for complex stream processing with sub-second latency requirements.
🌊
Spark Structured Streaming
Micro-batch processing (100ms–1s latency). Same Spark SQL API as batch. Good if team already knows Spark.
🔨
Faust (Python)
Lightweight Python stream processing on Kafka. Great for simpler feature computations without full Flink/Spark overhead.
Lambda vs Kappa Architecture
Lambda Architecture
  • Two separate pipelines: batch (accurate) + streaming (fast)
  • Merge results at query time in serving layer
  • Complex: same transformation logic written twice
  • Pre-2018 standard approach
Kappa Architecture (Modern)
  • Single streaming pipeline handles everything
  • Reprocessing via Kafka log replay from offset 0
  • One codebase, simpler operations
  • Kafka's long retention makes historical replay feasible
Real-Time ML Feature Patterns

The most common real-time AI architecture: Kafka → Flink/consumer → Redis (online feature store) → ML model serving API. The batch pipeline (Airflow/Spark) keeps the offline feature store updated. The streaming pipeline keeps Redis up-to-date with sub-second latency. The model reads from Redis at inference time.

10 Vector Databases & Embeddings Storage

Vector databases are purpose-built storage and retrieval systems for high-dimensional embedding vectors — the numerical representations produced by LLMs, image encoders, and recommendation models. They are the storage backbone of every RAG (Retrieval Augmented Generation) system and modern semantic search application.

Why Traditional Databases Cannot Handle Vectors

Finding the 5 most similar embeddings to a query vector requires computing cosine similarity or Euclidean distance against every stored vector. For 10M vectors at 1536 dimensions (OpenAI's text-embedding-3-large), brute-force exact search takes 2–5 seconds per query. Production systems need this in under 10 milliseconds. Traditional B-tree indexes are built for equality and range queries, not approximate nearest-neighbour (ANN) search. Vector databases solve this with specialised ANN indexes.

ANN Index Algorithms
HNSW (Hierarchical Navigable Small World)
  • Graph-based multi-layer index structure
  • Fastest query speed (1–5ms for millions of vectors)
  • High memory usage: index must fit in RAM
  • Best for: production retrieval where speed is critical
  • Used by: Qdrant, Weaviate, Milvus, pgvector
IVF-PQ (Inverted File + Product Quantisation)
  • Divides vector space into clusters; quantises vectors
  • Much lower memory: quantised vectors fit on disk
  • Slower than HNSW but scales to billions of vectors
  • Best for: cost-sensitive billion-scale deployments
  • Used by: FAISS, Pinecone (internal compression)
Choosing a Vector Database
DatabaseTypeBest ForMetadata FilteringScale
PineconeManaged SaaSFastest time-to-production, zero ops✅ Strong100M+ vectors
QdrantOSS / CloudHigh performance + rich payload filtering, self-host✅ Best-in-class100M+ vectors
WeaviateOSS / CloudGraphQL API, multi-tenancy, hybrid search built-in✅ Good100M+ vectors
ChromaOSS (local)Prototyping, local RAG development, no infra needed✅ BasicMillions
pgvectorPostgreSQL extensionExisting Postgres stack, avoids new infra✅ Full SQLUnder 5M vectors
MilvusOSS distributedBillion-scale, GPU-accelerated search, Kubernetes✅ GoodBillions
Python — Index and query documents with Qdrant
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct, Filter, FieldCondition, MatchValue
from sentence_transformers import SentenceTransformer

client  = QdrantClient(url="http://localhost:6333")
encoder = SentenceTransformer("all-MiniLM-L6-v2")   # 384-dim embeddings

# Create collection with HNSW index
client.create_collection("knowledge_base", vectors_config=VectorParams(
    size=384, distance=Distance.COSINE
))

# Index documents (batch upsert for efficiency)
docs = ["LLMs use attention mechanisms", "RAG combines retrieval with generation", ...]
tags = ["llm", "rag", ...]
embeddings = encoder.encode(docs, batch_size=64, show_progress_bar=True)
client.upsert("knowledge_base", points=[
    PointStruct(
        id=i,
        vector=emb.tolist(),
        payload={"text": doc, "category": tag, "source": "internal-docs"}
    )
    for i, (doc, emb, tag) in enumerate(zip(docs, embeddings, tags))
])

# Query: semantic search with metadata filter
query_embedding = encoder.encode(["how does retrieval augmented generation work"])[0]
results = client.search(
    collection_name="knowledge_base",
    query_vector=query_embedding.tolist(),
    limit=5,
    query_filter=Filter(must=[
        FieldCondition(key="category", match=MatchValue(value="rag"))
    ])
)
for r in results:
    print(f"Score: {r.score:.3f} | {r.payload['text'][:80]}")
Hybrid Search: Combining Vector and Keyword Search

Pure vector search misses exact keyword matches (a product SKU, a person's name, a technical term). Pure keyword search (BM25) misses semantic meaning. Hybrid search combines both using Reciprocal Rank Fusion (RRF) to merge ranked results. Weaviate and Qdrant support this natively. It consistently outperforms either approach alone in production RAG systems.

🔗 Vector databases are the storage layer for RAG systems. For the full picture of how they fit into different retrieval architectures, see the Top 6 RAG Architectures guide.
📊

You've completed Layer 1 of the AI Skills Stack

Data engineering is the foundation everything else builds on. Now continue up the stack, explore the full 10-layer roadmap, or test your knowledge with an AI mock interview.

Next: AI Introduction → View Full AI Skills Stack Practice AI Interview

📯 Stay updated on AI engineering

New guides, roadmaps, and career tips — no spam, unsubscribe any time.

✓ You're subscribed!
</