AI Log Analytics with ClickHouse: Tracing Requests and Extracting JSON
This is part 2 of building cost-effective log analytics with ClickHouse Cloud. In part 1, we built CloudFront analytics for $3,300/year. Now let’s add Lambda logs with request tracing and JSON extraction.
If you haven’t read part 1, go read it first. The IAM setup and architecture concepts apply here too.
Why Lambda Logs Matter
You’re running serverless. Lambda functions power your APIs, process events, handle webhooks. But CloudWatch Logs Insights? That’s $0.005 per GB scanned. Query your logs a few times and you’ve spent more than this entire solution costs.
ClickHouse gives you:
- Request tracing across multiple log lines
- JSON structure extraction from application logs
- Performance percentiles (cold starts, duration, memory)
- Sub-second queries on millions of log lines
- Fixed costs instead of per-query charges
SMART Goals: Lambda Log Analytics
Building on the CloudFront setup:
- Specific: Parse Lambda CloudWatch logs exported to S3, extract JSON structured data from application logs, and enable request-level tracing.
- Measurable: Costs. Adding Lambda logs shouldn’t double the total monthly spend from part 1.
- Achievable: Reuse ClickHouse infrastructure, IAM roles, and S3 bucket from CloudFront setup.
- Relevant: Enable performance monitoring and debugging for serverless applications.
- Time Bound: Setup should take about a day.
Architecture
Building on the CloudFront setup from part 1, we’re adding a second S3Queue pipeline:
Data Flow:
- Lambda executes, logs go to CloudWatch Logs
- CloudWatch Logs export to S3 (hourly or daily)
- S3Queue polls the `s3://bucket/lambda/**` prefix
- ClickHouse parses logs and extracts structured data
- Query with Grafana, CLI, or ClickHouse console
Same infrastructure:
- Same S3 bucket (different prefix)
- Same IAM roles (already configured)
- Same ClickHouse Cloud instance
Costs
We’re reusing infrastructure from the CloudFront setup. The main cost variable is how you get logs to S3.
Ingestion approaches:
| Approach | Ingestion Cost | Notes |
|---|---|---|
| CloudWatch export to S3 | $0.50/GB | Simple, mature Terraform support |
| Lambda direct to S3 | $0.25/GB | Native AWS feature (2025), Terraform support still maturing |
| Kinesis + logging library | $0.12/GB | Best for high volume, requires app changes |
Lambda’s direct S3 streaming is a first-party AWS feature released in 2025. It halves ingestion costs but Terraform support is still buggy. For very high volumes, using AWS MSK or Kinesis with a logging library that writes directly to the bus gives you more control and potentially lower costs.
This article uses CloudWatch export for simplicity and reliability.
Incremental costs (CloudWatch export approach):
- CloudWatch Logs ingestion: $0.50/GB uncompressed
- Storage: S3 at $0.023/GB/month + ClickHouse at $0.025/GB/month
- ClickHouse compute: Already running the 8GiB instance ($220/month)
- Data transfer: Zero (same region, S3 API reads)
- S3 LIST requests: ~$0.04/month (5-10 minute polling)
Assuming moderate Lambda logging volume (~250GB compressed/month, 0.15 compression ratio = ~1,667GB uncompressed):
- CloudWatch ingestion: 1,667GB × $0.50 = $833/month
- S3 storage: 250GB × $0.023 = $5.75/month
- ClickHouse storage: 250GB × $0.025 = $6.25/month
- Total incremental: ~$845/month
Combined solution costs:
- CloudFront logs: $275/month (from part 1)
- Lambda logs: +$845/month
- Total: ~$1,120/month or ~$13,440/year
CloudWatch ingestion dominates. With direct S3 streaming, you’d cut that roughly in half (~$417/month). Either way, the value is making logs queryable without per-query costs from CloudWatch Logs Insights ($0.005/GB scanned).
Understanding Lambda Log Format
Lambda logs exported from CloudWatch have this format:
TIMESTAMP MESSAGE
Example log types:
2026-01-19T20:50:10.108Z INIT_START Runtime Version: provided:al2023.v117
2026-01-19T20:50:10.216Z START RequestId: f6499439-8583-43e7-bdcb-b43eba4ed50a Version: $LATEST
2026-01-19T20:50:10.217Z 2026/01/19 20:50:10 192.168.1.1 POST /api/process Took: 1.879µs
2026-01-19T20:50:10.219Z END RequestId: f6499439-8583-43e7-bdcb-b43eba4ed50a
2026-01-19T20:50:10.219Z REPORT RequestId: f6499439-8583-43e7-bdcb-b43eba4ed50a Duration: 2.12 ms Billed Duration: 106 ms Memory Size: 512 MB Max Memory Used: 26 MB Init Duration: 103.69 ms
Key patterns to extract:
- Timestamps: ISO 8601 format (`2026-01-19T20:50:10.108Z`)
- Log types: `INIT_START`, `START`, `END`, `REPORT`, or application logs
- RequestId: Appears in START/END/REPORT lines (enables request tracing)
- REPORT metrics: Duration, memory usage, cold start time
- Application logs: Custom format with HTTP method, path, duration
- Embedded JSON: Some logs contain structured JSON data
Implementation
Enable CloudWatch Logs Export to S3
Configure your Lambda functions to export logs to the same S3 bucket using the lambda/ prefix:
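One way to produce these exports, sketched with the AWS CLI (the log group name, bucket, and epoch-millisecond time range below are placeholders; in practice you’d drive this from Terraform or EventBridge):

```shell
# Export one hour of a function's logs to s3://my-logs-bucket/lambda/my-function/
# Adjust --log-group-name, --destination, and the --from/--to millisecond timestamps.
aws logs create-export-task \
  --log-group-name /aws/lambda/my-function \
  --from 1737316800000 \
  --to 1737320400000 \
  --destination my-logs-bucket \
  --destination-prefix lambda/my-function
```

The export task runs asynchronously; `aws logs describe-export-tasks` shows its status.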
Or use CloudWatch Logs subscription filters for real-time export, or schedule exports with EventBridge.
Reuse IAM Roles
If you followed part 1, your IAM role already has access to the S3 bucket. No changes needed. The ClickHouse instance can already read from s3://bucket/lambda/**.
Create Lambda Log Schema
The schema handles multiple complexities:
- Function name extraction: Auto-detects Lambda function name from S3 path
- Stream ID extraction: Identifies execution environment (container) from S3 path
- Log type classification: Detect INIT_START, START, END, REPORT, or custom logs
- RequestId extraction: Pull from START/END/REPORT lines for tracing
- REPORT parsing: Extract duration, memory, cold start metrics
- JSON extraction: Find and extract `{...}` patterns, parse as JSON type
- HTTP pattern matching: Parse custom app logs like `IP METHOD /path Took: Xµs`
The complete schema: lambda-clickhouse-schema.sql
Key features:
- Multi-function support: Logs from multiple Lambda functions in one table
- Stream ID extraction: Track container reuse and concurrency patterns
- Daily partitioning for efficient querying
- JSON type for dynamic schema and simple queries
- Materialized columns for common filters (`is_cold_start`, `is_error`, `has_json`)
- Extracted fields for REPORT metrics
- Polymorphic JSON support (different event types, different fields)
- Efficient filtering by function name (indexed in ORDER BY)
- Request-level tracing via `request_id`, container-level via `stream_id`
- MCP-ready for AI-powered exploration
Apply the schema:
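For example, with `clickhouse-client` (hostname and password are placeholders for your ClickHouse Cloud credentials):

```shell
# Run the DDL file against the Cloud instance over TLS.
clickhouse-client \
  --host your-instance.clickhouse.cloud \
  --secure \
  --password "$CLICKHOUSE_PASSWORD" \
  --queries-file lambda-clickhouse-schema.sql
```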
Testing and Debugging
Verify S3 access before the queue starts consuming:
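A spot-check along these lines reads a few raw lines directly with the `s3()` table function (the bucket name is a placeholder; on ClickHouse Cloud the instance’s IAM role handles authentication):

```sql
-- Read a handful of raw log lines straight from S3, bypassing the queue.
SELECT raw_line
FROM s3(
    'https://my-logs-bucket.s3.amazonaws.com/lambda/**.gz',
    'LineAsString',
    'raw_line String'
)
LIMIT 5;
```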
Important: The `'raw_line String'` structure parameter is required with the LineAsString format - it defines the single-column structure.
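A quick count confirms the glob actually matches files:

```sql
-- Should return a non-zero count if the path pattern and permissions are right.
SELECT count() AS total_lines
FROM s3(
    'https://my-logs-bucket.s3.amazonaws.com/lambda/**.gz',
    'LineAsString',
    'raw_line String'
);
```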
If you get 0 rows, check:
- Bucket name: Verify S3 bucket exists and has files
- Path pattern: Ensure `lambda/**.gz` matches your actual structure
- IAM permissions: Role must have `s3:GetObject` and `s3:ListBucket`
- Region: Bucket and ClickHouse instance should be in the same region
Check queue processing status:
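The `system.s3queue_log` table (populated when the s3queue log is enabled on your instance) shows per-file outcomes:

```sql
-- Which files the S3Queue processed recently, and whether any failed.
SELECT
    file_name,
    status,
    exception,
    processing_end_time
FROM system.s3queue_log
ORDER BY processing_end_time DESC
LIMIT 20;
```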
Querying Lambda Logs
Request Tracing
The killer feature: trace a single request across all log lines.
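A query along these lines, using the schema’s columns, pulls every line for a single invocation:

```sql
SELECT
    timestamp,
    log_type,
    message,
    duration_ms,
    max_memory_used_mb,
    init_duration_ms
FROM lambda_logs
WHERE request_id = 'f6499439-8583-43e7-bdcb-b43eba4ed50a'
ORDER BY timestamp;
```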
Results:
┌───────────────timestamp─┬─log_type─┬─message────────────────────────┬─duration_ms─┬─max_memory_used_mb─┬─init_duration_ms─┐
│ 2026-01-19 20:50:10.216 │ START │ START RequestId: f6499... │ 0 │ 0 │ 0 │
│ 2026-01-19 20:50:10.217 │ INFO │ 2603:6080... OPTIONS /api/... │ 0 │ 0 │ 0 │
│ 2026-01-19 20:50:10.219 │ END │ END RequestId: f6499... │ 0 │ 0 │ 0 │
│ 2026-01-19 20:50:10.219 │ REPORT │ REPORT RequestId: f6499... │ 2.12 │ 26 │ 103.69 │
└─────────────────────────┴──────────┴────────────────────────────────┴─────────────┴────────────────────┴──────────────────┘
See everything: when it started, what it did, how long it took, memory used, whether it was a cold start.
Multi-Function Analysis
The schema automatically extracts the Lambda function name from the S3 path. This lets you analyze multiple functions in one table.
List all functions:
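For example, counting total lines and `START` markers (one per invocation) per function:

```sql
SELECT
    function_name,
    count() AS log_lines,
    countIf(log_type = 'START') AS invocations
FROM lambda_logs
GROUP BY function_name
ORDER BY log_lines DESC;
```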
Results:
┌─function_name────┬─log_lines─┬─invocations─┐
│ roi-calculator │ 45,823 │ 12,450 │
│ webhook-handler │ 23,102 │ 5,234 │
│ image-processor │ 15,891 │ 3,876 │
└──────────────────┴───────────┴─────────────┘
Compare performance across functions:
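A sketch comparing duration percentiles, peak memory, and cold start rate per function, using only REPORT lines (the one line per invocation that carries metrics):

```sql
SELECT
    function_name,
    round(quantile(0.5)(duration_ms), 2) AS p50_ms,
    round(quantile(0.99)(duration_ms), 2) AS p99_ms,
    max(max_memory_used_mb) AS peak_memory_mb,
    round(countIf(is_cold_start) / count() * 100, 2) AS cold_start_pct
FROM lambda_logs
WHERE log_type = 'REPORT'
GROUP BY function_name
ORDER BY p99_ms DESC;
```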
This shows which functions are slow, memory-hungry, or experiencing cold starts.
Filter queries by function:
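For example:

```sql
SELECT timestamp, log_type, message
FROM lambda_logs
WHERE function_name = 'roi-calculator'
  AND date >= today() - 7
ORDER BY timestamp DESC
LIMIT 100;
```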
The function_name is first in the ORDER BY clause, making these queries efficient.
Performance Monitoring
Lambda duration percentiles:
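Using `quantiles` over REPORT lines, for example:

```sql
-- p50 / p90 / p99 duration over the last day.
SELECT quantiles(0.5, 0.9, 0.99)(duration_ms) AS duration_percentiles_ms
FROM lambda_logs
WHERE log_type = 'REPORT'
  AND date >= today() - 1;
```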
Cold Start Analysis
Track cold start frequency and performance:
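A sketch using the `is_cold_start` materialized column:

```sql
SELECT
    toStartOfHour(timestamp) AS hour,
    countIf(is_cold_start) AS cold_starts,
    count() AS invocations,
    round(cold_starts / invocations * 100, 2) AS cold_start_pct,
    round(avgIf(init_duration_ms, is_cold_start), 2) AS avg_init_ms
FROM lambda_logs
WHERE log_type = 'REPORT'
GROUP BY hour
ORDER BY hour;
```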
Container Reuse and Concurrency Patterns
Understanding Log Streams:
CloudWatch organizes Lambda logs into streams, where each stream represents a single execution environment (container). When Lambda scales, it creates new containers with new streams. In CloudWatch, this prevents log interweaving.
What happens when we consolidate into ClickHouse?
- We preserve per-request tracing via `request_id`
- We add container tracking via `stream_id` (extracted from the S3 path)
- We lose the strict stream-level isolation CloudWatch provides
But we gain powerful analysis capabilities:
Container reuse patterns:
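A query along these lines produces the table below:

```sql
SELECT
    function_name,
    uniqExact(stream_id) AS unique_containers,
    countIf(log_type = 'START') AS total_invocations,
    round(total_invocations / unique_containers, 2) AS avg_invocations_per_container
FROM lambda_logs
GROUP BY function_name
ORDER BY avg_invocations_per_container DESC;
```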
Results show how efficiently Lambda reuses containers:
┌─function_name────┬─unique_containers─┬─total_invocations─┬─avg_invocations_per_container─┐
│ roi-calculator │ 142 │ 12,450 │ 87.68 │
│ webhook-handler │ 523 │ 5,234 │ 10.01 │
└──────────────────┴───────────────────┴───────────────────┴───────────────────────────────┘
High reuse (roi-calculator) means warm containers and low cold start rate. Low reuse (webhook-handler) suggests frequent cold starts or high concurrency.
Concurrent execution analysis:
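For example, counting distinct containers active per minute:

```sql
SELECT
    toStartOfMinute(timestamp) AS minute,
    uniqExact(stream_id) AS concurrent_containers
FROM lambda_logs
WHERE log_type = 'START'
GROUP BY minute
ORDER BY concurrent_containers DESC
LIMIT 20;
```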
This identifies peak concurrency periods where multiple containers run simultaneously.
Trace a specific container’s lifetime:
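For example (the stream id below is a placeholder; copy a real one from a previous query):

```sql
SELECT timestamp, log_type, request_id, message
FROM lambda_logs
WHERE stream_id = '2026/01/19/[$LATEST]abcdef1234567890'
ORDER BY timestamp;
```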
See every request handled by a single container, including the initial cold start.
Practical implications:
- Request tracing still works perfectly: Use `request_id` to follow a single invocation
- Container insights are new: CloudWatch doesn’t make this analysis easy
- Log interweaving: Only matters if querying by timestamp alone (use `request_id` or `stream_id` for grouping)
- Same as most observability platforms: Datadog, New Relic, etc. also consolidate streams
JSON Data Extraction
This schema uses ClickHouse’s JSON type for dynamic schema inference and simpler querying. This is perfect for polymorphic data where different event types have different fields.
Extract structured data from application logs:
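A sketch using the JSON dot syntax (the field names `event`, `observabilitySpend`, and `savings3Year` appear elsewhere in this article; `savingsPct` is an assumption - substitute your own fields):

```sql
SELECT
    timestamp,
    json_data.event AS event_type,
    json_data.observabilitySpend AS observability_spend,
    json_data.savings3Year AS savings_3year,
    json_data.savingsPct AS savings_pct
FROM lambda_logs
WHERE has_json
  AND json_data.event = 'calculation_tracked'
ORDER BY timestamp DESC
LIMIT 10;
```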
Notice the simple json_data.fieldName syntax instead of JSONExtractString() functions.
Results:
┌───────────────timestamp─┬─event_type──────────┬─observability_spend─┬─savings_3year─┬─savings_pct─┐
│ 2026-01-19 20:50:10.272 │ calculation_tracked │ 500000 │ 2115000 │ 50 │
└─────────────────────────┴─────────────────────┴─────────────────────┴───────────────┴─────────────┘
Discover all JSON fields:
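One option is the `JSONAllPaths` introspection function (available on recent ClickHouse releases alongside the JSON type):

```sql
-- List every JSON path seen so far, with how many rows contain it.
SELECT
    arrayJoin(JSONAllPaths(json_data)) AS path,
    count() AS rows_with_path
FROM lambda_logs
WHERE has_json
GROUP BY path
ORDER BY rows_with_path DESC;
```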
This shows all the dynamically created columns from your JSON data. As new event types arrive with different fields, ClickHouse automatically adds columns.
Why JSON type matters:
- Simple SQL syntax (no need to remember which JSONExtract function to use)
- Schema discovery (run DESCRIBE to see all fields)
- Type inference (ClickHouse detects types automatically)
- Perfect for MCP integration (Claude can explore the schema dynamically)
- Handles polymorphic data (different event types with different fields)
Now you can analyze business metrics embedded in logs.
HTTP Endpoint Analysis
If your application logs include HTTP patterns:
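Assuming the schema’s HTTP pattern matching populates columns like `http_method` and `http_path` (names may differ in your schema):

```sql
SELECT
    http_method,
    http_path,
    count() AS requests
FROM lambda_logs
WHERE http_method != ''
GROUP BY http_method, http_path
ORDER BY requests DESC
LIMIT 20;
```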
Error Rate Monitoring
Track errors over time:
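A sketch using the `is_error` materialized column:

```sql
SELECT
    toStartOfHour(timestamp) AS hour,
    countIf(is_error) AS errors,
    count() AS total,
    round(errors / total * 100, 2) AS error_rate_pct
FROM lambda_logs
GROUP BY hour
ORDER BY hour;
```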
Visualization
Grafana Dashboards:
- Invocation count over time
- Duration percentiles (p50, p90, p99)
- Cold start rate
- Memory utilization
- Error rate
- Per-endpoint latency
ClickHouse Console: Create dashboards for at-a-glance monitoring.
Advanced: Combining CloudFront and Lambda Logs
Join CloudFront requests with Lambda executions:
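Since CloudFront and Lambda logs share no request id, one approach is to correlate by time bucket. This is a sketch only - the `cloudfront_logs` table and its `time`/`time_taken` columns from part 1 are assumptions; adjust to your actual schema:

```sql
SELECT minute, edge_requests, avg_edge_seconds, avg_lambda_ms
FROM
(
    -- Per-minute edge traffic from part 1's CloudFront table.
    SELECT
        toStartOfMinute(time) AS minute,
        count() AS edge_requests,
        avg(time_taken) AS avg_edge_seconds
    FROM cloudfront_logs
    GROUP BY minute
) AS cf
LEFT JOIN
(
    -- Per-minute backend duration from Lambda REPORT lines.
    SELECT
        toStartOfMinute(timestamp) AS minute,
        avg(duration_ms) AS avg_lambda_ms
    FROM lambda_logs
    WHERE log_type = 'REPORT'
    GROUP BY minute
) AS lam USING (minute)
ORDER BY minute;
```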
This correlates backend performance with frontend experience.
MCP Integration: AI-Powered Log Analysis
The JSON type makes this setup perfect for MCP (Model Context Protocol) integration. The official ClickHouse MCP server lets Claude query your logs naturally.
Why JSON type is ideal for MCP:
- Schema discovery: Claude can run `DESCRIBE TABLE` to see what fields exist
- Simple SQL: Claude generates cleaner queries with `json_data.field` syntax
- Dynamic exploration: No hardcoded field names needed
- Type safety: ClickHouse handles type inference automatically
Example MCP interaction:
User: "Show me ROI calculations from the last hour with savings over $1M"
Claude (via MCP):
SELECT
timestamp,
json_data.event,
json_data.observabilitySpend,
json_data.savings3Year,
json_data.ipAddress
FROM lambda_logs
WHERE date = today()
AND timestamp > now() - INTERVAL 1 HOUR
AND json_data.event = 'calculation_tracked'
AND json_data.savings3Year > 1000000
ORDER BY timestamp DESC;
Claude can explore your logs interactively without you writing SQL. The JSON type makes this seamless.
Setting Up MCP with Claude Code
The official ClickHouse MCP server makes your logs conversationally accessible. Here’s how to set it up:
1. Configure the MCP Server
Add the ClickHouse MCP server to your Claude Code configuration file (~/.claude.json). This example uses uv (the fast Python package manager) to run the MCP server:
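A sketch of the configuration (check the mcp-clickhouse README for the current argument and environment variable names; the host, password, and database values below are placeholders):

```json
{
  "mcpServers": {
    "clickhouse": {
      "command": "uv",
      "args": ["run", "--with", "mcp-clickhouse", "--python", "3.13", "mcp-clickhouse"],
      "env": {
        "CLICKHOUSE_HOST": "abc123.us-east-2.aws.clickhouse.cloud",
        "CLICKHOUSE_PORT": "8443",
        "CLICKHOUSE_USER": "default",
        "CLICKHOUSE_PASSWORD": "your-password",
        "CLICKHOUSE_SECURE": "true",
        "CLICKHOUSE_DATABASE": "o11y"
      }
    }
  }
}
```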
Why UV? The uv package manager handles Python dependencies cleanly without polluting your global environment. It downloads and caches the mcp-clickhouse package automatically on first run.
2. Get Your ClickHouse Connection Details
From your ClickHouse Cloud console:
- Host: Your instance hostname (e.g., `abc123.us-east-2.aws.clickhouse.cloud`)
- Port: `8443` for HTTPS (important: this is the HTTP interface port, not the native protocol port 9440)
- User: Typically `default`
- Password: Your ClickHouse password
- Database: The database containing your logs (e.g., `o11y`)
Important: Use port 8443 (HTTPS), not 9440 (native protocol). The MCP server uses ClickHouse’s HTTP interface, which requires the HTTPS port on ClickHouse Cloud.
3. Restart Claude Code
The MCP server loads on startup. After updating ~/.claude.json, restart Claude Code to pick up the new configuration.
Tip: Run /mcp in Claude Code to see the status of configured MCP servers. This command helps debug connection issues by showing which servers are active, any error messages, and available tools.
4. Test the Connection
Try asking Claude:
“Show me the schema for the lambda_logs table”
or
“What Lambda functions do I have logs for in the last 24 hours?”
Claude will now have tools like:
- `clickhouse_query` - Execute SELECT queries
- `clickhouse_execute` - Run DDL/DML statements
- `clickhouse_ping` - Test connection
Benefits of MCP Integration:
- Natural language queries: Ask questions without writing SQL
- Schema discovery: Claude can explore what fields exist in your JSON data
- Contextual analysis: Claude can suggest queries based on your data structure
- Multi-step analysis: Ask follow-up questions and refine queries
- Cost insights: Ask “What’s costing me the most?” and get actionable answers
Example interactions:
You: "Show me high-value ROI calculations that didn't convert to leads"
Claude: [Queries lambda_logs for calculation_tracked events with high savings
that don't have matching roi_generated events from the same IP]
You: "What's the cold start rate for roi-calculator this week?"
Claude: [Calculates percentage of invocations with init_duration_ms > 0]
You: "Find the slowest Lambda invocations today"
Claude: [Queries REPORT lines ordered by duration_ms DESC]
This transforms your logs into a conversational analytics platform. The JSON type makes it seamless - Claude can discover fields dynamically and generate clean queries without hardcoded field names.
Real Example: Cold Start Analysis
Here’s an actual Claude Code session querying Lambda cold start metrics:

This question, “Show me the cold start rate,” isn’t something AWS CloudWatch answers directly. CloudWatch gives you invocation counts, duration, and errors as built-in metrics, but cold start rate requires querying the logs, finding REPORT lines with Init Duration present, and calculating the percentage yourself.
With MCP, you just ask. Claude generates the SQL, runs it against ClickHouse, and formats the results. The P90 init duration (50.48ms in this case) tells you what your users actually experience during cold starts, not just that cold starts happened.
Teaching Claude Your Schema
MCP gives Claude access to ClickHouse, but Claude doesn’t automatically know your schema or application-specific JSON structure. For better results, create a Claude skill file that documents:
- Database and table names: Which database to query (`o11y.lambda_logs`)
- Column definitions: What each column contains and its type
- JSON field schemas: Your application’s event types and their fields
- Common query patterns: Examples for performance metrics, cold starts, business analytics
See the complete example: lambda-analytics.skill.md
Place this Claude skill file in ~/.claude/skills/<skill-name>/SKILL.md or reference it in your project’s CLAUDE.md. Claude Code will use it to generate accurate queries for your specific data model.
Tradeoffs and Considerations
Claude Isn’t Magic:
- Claude can discover your schema but doesn’t inherently know what your data means
- It won’t understand that `json_data.observabilitySpend` represents annual cost without context
- The Claude skill file becomes a shared library for your SRE team to ensure consistent interpretation
- Without it, each person asking questions gets different (possibly wrong) assumptions
Export Latency:
- CloudWatch Logs export to S3 runs on schedule (hourly/daily)
- Not real-time, but sufficient for analytics and debugging
S3Queue Configuration:
- Unordered mode: Handles late arrivals and complex path structures
- CloudWatch can delay log delivery up to 24 hours
- Multiple functions create overlapping prefixes (`lambda/func-a/...`, `lambda/func-b/...`)
- Unordered mode tracks each file individually, won’t skip late arrivals
- Polling interval: 5-10 minutes for hourly exports
- Balances latency with S3 LIST API costs (~$0.04/month)
- Adjust based on your export frequency (more frequent exports → faster polling)
- Ordered mode pitfall: Don’t use with multi-function setup
- Processes files lexicographically, can skip late-arriving files
- Only safe for single-function, predictable export schedules
Storage Costs:
- Lambda logs are verbose (every log line is a row)
- Use the 90-day TTL to control costs
- Use Lambda’s new ability to stream directly to S3.
Query Performance:
- Daily partitioning enables fast date range queries
- `request_id` in ORDER BY enables efficient request tracing
- Use materialized columns for common filters
JSON Type Considerations:
- Requires a recent ClickHouse release: the new JSON type shipped as beta in 24.8 and was marked production-ready in 25.3 (ClickHouse Cloud stays current automatically)
- Dynamic schema grows as new JSON fields appear
- Default limit of 1024 unique paths per JSON column (`max_dynamic_paths`); beyond this, paths go into less efficient shared storage
- Field names are case-sensitive (`observabilitySpend`, not `observabilityspend`)
- NULL values for fields that don’t exist in specific rows (expected behavior)
- Consider adding indexes on frequently queried JSON fields if needed
Wrapping Up
Did we meet our goal of not doubling costs? Not exactly.
CloudFront analytics from part 1 cost $275/month. Adding Lambda logs brings us to ~$1,120/month. That’s 4x, not 2x. But the cost breakdown tells a different story:
- CloudWatch ingestion: $833/month (unavoidable with CloudWatch export approach)
- Incremental analytics cost: ~$12/month (S3 + ClickHouse storage)
The analytics infrastructure barely registers. CloudWatch ingestion dominates. This is why we discussed alternatives: Lambda’s direct S3 streaming ($0.25/GB) or Kinesis ($0.12/GB) can cut that $833 down to $417 or $200 respectively. The logging path you choose matters far more than the analytics backend.
What you get for that cost:
- Request-level tracing across all log lines
- Performance percentiles (duration, memory, cold starts)
- Cold start monitoring and analysis
- JSON type for dynamic schema and simple queries
- MCP integration for conversational queries
- Sub-second queries on millions of log lines
The power here is SQL and the JSON type. You can ask any question of your data. The JSON type makes it simple to query polymorphic event data. And with MCP integration, Claude can explore your logs conversationally without you writing SQL.
One observation: AI-assisted observability isn’t a magical fix. Claude can correlate latency spikes and parse REPORT lines all day. That’s the commodity layer we all share. But it won’t understand that json_data.observabilitySpend represents your customer’s annual budget, or that a calculation_tracked event without a following roi_generated event means a lost lead. That’s your business domain, your architectural decisions, your telemetry schema.
This problem isn’t new. Most teams didn’t rigorously document their runbooks and telemetry before AI either. Everyone is hoping AI will solve this, or at least paper over it. But AI isn’t a substitute for good engineering practices. It will amplify what you have, good or bad.
The Claude skill file isn’t optional documentation. It’s the foundation that makes AI tooling useful instead of just impressive demos on standardized data.
Need help building your observability stack? Contact me today!
Resources: