AI-Ready Data Lakes with Starburst + Scality RING

From warehouses to lakehouses, each evolutionary step in data architecture has solved one problem while creating another.

Data architecture has come a long way since the golden age of traditional data warehouses. Those legacy systems served their purpose — but they weren’t built for the complexity, scale, and speed demanded by today’s AI and analytics-driven world.

Traditional data warehouses — Early 2000s era:

✅ Structured, reliable, optimized for reporting

❌ Rigid, costly, and unable to manage the surge of unstructured/semi-structured data

Data lakes — Big data era:

✅ Vast, low-cost repositories for storing raw data of any type

❌ Often became “data swamps” with weak governance, poor performance, and limited real-time analytics support

Data lakehouses — Modern hybrid era (warehouses + lakes):

✅ Combine the governance and schema of warehouses with the flexibility and scale of lakes

❌ Still maturing, with challenges in consistency, performance, and enterprise-wide adoption

In part 1 of our AI blog series, we broke down the core concepts driving today’s AI boom — clarifying the terminology and cutting through the hype with A primer on the concepts of AI: ML, LLMs, DL, NLP, GenAI, and the rise of RAG. Part 2 takes the next step: showing how Scality RING + Starburst turn messy, distributed data into a solid foundation for AI.

As AI and machine learning workloads accelerate, enterprises need data environments that go beyond the tradeoffs of past architectures — bringing together scale, governance, and real-time access in a way that makes data truly AI-ready.

That’s where Scality RING, a massively scalable and immutable object storage platform, and Starburst, a modern SQL-based query engine, come together to offer a practical path to AI-ready data lakes. Together, they form a validated architecture that turns enterprise-scale data sprawl into fast, queryable insight.

Most blogs stop at theory. This one goes further. Here, we’ll walk you step by step through integrating Starburst with Scality RING, creating schemas, and running SQL queries on real data — building a fast, flexible, and governed platform for AI/ML workloads at scale.

Ready to roll up your sleeves?

What is Starburst?

Starburst is a data lakehouse platform that provides fast, secure access to data where it lives (across clouds, on-prem storage, databases, or data lakes) without needing to move or duplicate it. Being able to access and analyze data from diverse sources helps organizations respond faster to changing business needs and shortens their time to insight.

Starburst enables high-performance queries across:

Structured data (e.g., tables, Apache Iceberg, Parquet)
Semi-structured data (e.g., JSON, CSV, YAML)
Unstructured data (tagged via metadata)

This makes it a powerful tool for unifying data access across environments. And when paired with Scality RING, it becomes an enterprise-ready platform for analytics and AI that respects scale, performance, and security.

Scality-validated design application

Starburst has been certified by Scality as a validated design. Certification involves deploying Starburst and Scality RING together in a lab environment, tuning both products for predictable performance, and testing workloads that reflect typical customer scenarios. As a result, customers get integration instructions and sizing guidance, which make for smoother onboarding and more predictable performance.

Scality’s partner app certification program helps our customers ensure seamless integration.

Why Scality RING + Starburst?

Scality RING isn’t just an object storage platform — it’s a foundation for next-generation data analytics and AI/ML workloads. When you integrate it with Starburst, several powerful capabilities come together:

Search unstructured and semi-structured data

While SQL can’t query raw unstructured data directly, metadata tagging on RING makes it possible to organize and classify unstructured content, such as documents, images, or log files, so that it becomes discoverable and useful. Starburst can query that metadata alongside structured and semi-structured data for a comprehensive view.
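One way to picture this is a small metadata manifest that describes each unstructured object in a form SQL can read. The Python sketch below is an illustration, not part of the validated design: the helper name build_manifest, the record field names, and the tag keys (content_type, department) are all assumptions. It writes one JSON record per line, a layout commonly used for JSON-backed tables.

```python
import json


def build_manifest(objects):
    """Serialize object descriptions as JSON Lines (one record per line).

    Each input item is a dict with 'key', 'size_bytes', and 'tags' (a dict
    of string labels). These field names are assumptions for this sketch.
    """
    lines = []
    for obj in objects:
        record = {
            "object_key": obj["key"],
            "size_bytes": obj["size_bytes"],
            # Flatten tags into fixed columns so a table schema can describe them.
            "content_type": obj["tags"].get("content_type"),
            "department": obj["tags"].get("department"),
        }
        lines.append(json.dumps(record, sort_keys=True))
    return "".join(line + "\n" for line in lines)


if __name__ == "__main__":
    # Hypothetical objects, for illustration only.
    print(build_manifest([
        {"key": "docs/contract.pdf", "size_bytes": 48213,
         "tags": {"content_type": "application/pdf", "department": "legal"}},
        {"key": "images/site-01.png", "size_bytes": 90211, "tags": {}},
    ]), end="")
```

A manifest like this could be stored under its own prefix in a bucket and exposed as a table, so that tags can be filtered and joined like any other column. Missing tags are written as null rather than dropped, which keeps every record on the same schema.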

Structured query power at scale

Starburst brings high-performance SQL querying to structured data sitting in RING. Think Iceberg tables, ORC files, Parquet — queryable at speed and at scale, even across petabytes of data.

Fuel for AI model training

Every AI/ML workflow starts with a dataset. The better organized and governed that data is, the more accurate your models are likely to be. With Starburst + RING, you can:

Create tables and schemas on your data
Run SQL queries to extract the right training sets
Enrich or label data before feeding it into AI pipelines

And best of all, you don’t need to move the data.
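As a rough illustration of the labeling step, the Python sketch below appends a label column to rows that a SQL query has already returned. The column names, the sample rows, and the threshold rule are hypothetical; a real pipeline would apply whatever labeling logic the model needs.

```python
def label_rows(columns, rows, label_fn, label_name="label"):
    """Return (columns, rows) with one extra label column appended.

    label_fn receives each row as a dict keyed by column name.
    """
    labeled_columns = list(columns) + [label_name]
    labeled_rows = [
        list(row) + [label_fn(dict(zip(columns, row)))]
        for row in rows
    ]
    return labeled_columns, labeled_rows


if __name__ == "__main__":
    # Hypothetical query result and labeling rule, for illustration only.
    columns = ["id", "name", "value"]
    rows = [[1, "sensor-a", 0.2], [2, "sensor-b", 0.9]]
    print(label_rows(columns, rows, lambda r: int(r["value"] > 0.5)))
```

The input rows are left unchanged, so the same query result can be labeled in several ways without re-running the query.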

Built-in governance and access controls

Security and data governance are non-negotiable, especially when working with sensitive or regulated data. Scality RING supports granular access controls and permissioning. Starburst respects and extends these controls, letting you enforce role-based access policies across your entire data lake.

In practice, that governance means you can:

Enable data access for specific teams or users
Control visibility at the dataset or metadata level
Keep your compliance team happy

How it works: A quick look

Setting up Starburst to query data on RING is straightforward.

Step 1: Connect Starburst to RING as an object storage data source.
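Create a new catalog configuration file in the etc/catalog folder, and add the following configuration. The file name (here, s3.properties) determines the catalog name used in the SQL statements that follow (s3):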

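# etc/catalog/s3.properties
# Hive connector with a file-based metastore
connector.name=hive
hive.metastore=file
# Replace the RING_* placeholders with your RING S3 credentials and endpoint URL
hive.s3.aws-access-key=RING_ACCESS_KEY
hive.s3.aws-secret-key=RING_SECRET_KEY
hive.s3.endpoint=RING_S3_ENDPOINT
# Path-style requests over HTTPS
hive.s3.path-style-access=true
hive.s3.ssl.enabled=true
hive.s3.max-connections=100
# Allow writes to external (non-managed) tables; keep S3 Select pushdown off
hive.non-managed-table-writes-enabled=true
hive.s3select-pushdown.enabled=false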
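To keep credentials out of version control, a catalog file like the one above could be generated at deploy time instead of edited by hand. The following is a minimal Python sketch, not part of the validated design: the helper render_catalog and the environment variable names (RING_ACCESS_KEY, RING_SECRET_KEY, RING_S3_ENDPOINT) are assumptions, while the property keys are copied from the configuration above.

```python
import os

# Property keys are taken verbatim from the catalog file above.
STATIC_PROPERTIES = {
    "connector.name": "hive",
    "hive.metastore": "file",
    "hive.s3.path-style-access": "true",
    "hive.s3.ssl.enabled": "true",
    "hive.s3.max-connections": "100",
    "hive.non-managed-table-writes-enabled": "true",
    "hive.s3select-pushdown.enabled": "false",
}

# Maps each secret property to the (assumed) environment variable holding it.
ENV_PROPERTIES = {
    "hive.s3.aws-access-key": "RING_ACCESS_KEY",
    "hive.s3.aws-secret-key": "RING_SECRET_KEY",
    "hive.s3.endpoint": "RING_S3_ENDPOINT",
}


def render_catalog(env=None):
    """Return the text of a catalog properties file.

    Raises KeyError naming any required variable that is missing or empty.
    """
    env = os.environ if env is None else env
    missing = [var for var in ENV_PROPERTIES.values() if not env.get(var)]
    if missing:
        raise KeyError("missing environment variables: " + ", ".join(missing))
    properties = dict(STATIC_PROPERTIES)
    for key, var in ENV_PROPERTIES.items():
        properties[key] = env[var]
    return "".join(f"{key}={value}\n" for key, value in properties.items())


if __name__ == "__main__":
    # Placeholder values; a real run would read them from os.environ.
    print(render_catalog({
        "RING_ACCESS_KEY": "example-access-key",
        "RING_SECRET_KEY": "example-secret-key",
        "RING_S3_ENDPOINT": "https://ring.example.com",
    }), end="")
```

Failing fast on a missing variable avoids writing a catalog file that only breaks once the first query runs. Note that this sketch writes values as-is; values containing backslashes would need escaping for the Java properties format.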

Step 2: Define schemas and tables on structured/semi-structured datasets.

CREATE SCHEMA s3.mydata
WITH (location = 's3://your-bucket/path/to/data');

CREATE TABLE s3.mydata.mytable (
    id BIGINT,
    name VARCHAR,
    value DOUBLE
)
WITH (
    external_location = 's3://your-bucket/path/to/data/mytable/',
    format = 'ORC'
);

Step 3: Query your data. Use SQL queries to explore, join, and extract insights.
a. Simple query:

SELECT * FROM s3.mydata.mytable LIMIT 10;

b. Join with another table:

SELECT
    t1.name,
    t2.value
FROM s3.mydata.table1 t1
JOIN s3.mydata.table2 t2
    ON t1.id = t2.id;

c. Aggregation:

SELECT
    date_column,
    COUNT(*) AS count,
    SUM(value) AS total
FROM s3.mydata.mytable
WHERE date_column >= DATE '2024-01-01'
GROUP BY date_column;
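Queries like these can also be submitted from code, for example by an AI pipeline that needs a fresh training set. The sketch below uses only the Python standard library and assumes the coordinator speaks the Trino client REST protocol (a POST to /v1/statement, then following nextUri until it disappears) with no authentication. The coordinator URL and user name are hypothetical, and in practice an official client library would usually be the better choice.

```python
import json
import urllib.request


def http_json(method, url, body=None, headers=None):
    """Send one HTTP request and decode the JSON response."""
    data = body.encode("utf-8") if body is not None else None
    request = urllib.request.Request(
        url, data=data, headers=headers or {}, method=method
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


def run_query(sql, base_url, user, send=http_json):
    """Submit a statement and return (column_names, rows).

    The transport is injectable through `send`, so the paging logic can be
    exercised without a running cluster.
    """
    headers = {"X-Trino-User": user}
    page = send("POST", base_url.rstrip("/") + "/v1/statement", sql, headers)
    columns, rows = None, []
    while True:
        if "error" in page:
            raise RuntimeError(page["error"].get("message", "query failed"))
        if columns is None and "columns" in page:
            columns = [column["name"] for column in page["columns"]]
        # Result rows arrive in chunks; a page may carry no data at all.
        rows.extend(page.get("data", []))
        next_uri = page.get("nextUri")
        if next_uri is None:
            return columns, rows
        page = send("GET", next_uri, None, headers)


if __name__ == "__main__":
    # Hypothetical coordinator address and user; requires a reachable cluster.
    names, result = run_query(
        "SELECT * FROM s3.mydata.mytable LIMIT 10",
        "http://coordinator.example.com:8080",
        "analyst",
    )
    print(names, result)
```

Secured deployments would also need credentials and TLS settings, which this sketch leaves out.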

Use tagged metadata to organize unstructured data for future use. This architecture allows your AI initiatives to take off faster — without rebuilding data pipelines or duplicating data across systems.

The steps we’ve walked through above aren’t just about setting up queries — they’re about transforming your raw, distributed data into a foundation that AI can actually use.

Why this matters for AI

When your leadership says, “We want to do something around AI,” this is what they’re talking about. A platform like Starburst on Scality RING checks all the boxes:

✅ Scalable, cost-efficient data lake foundation
✅ Real-time analytics without data movement
✅ Built-in governance and security
✅ Structured + semi-structured data support
✅ Metadata tagging for unstructured data
✅ Compatible with AI/ML model training workflows

For IT practitioners, this is more than a checkbox exercise — it’s a future-proof strategy. You’re not just standing up another analytics tool; you’re building an environment where data becomes fuel for innovation, ready for AI today and adaptable for what’s ahead.

Bringing it all together

The journey from warehouses to lakes to lakehouses has always been about balance: structure vs. flexibility, cost vs. performance, governance vs. freedom.

With the validated integration of Starburst and Scality RING, enterprises don’t have to choose. You can unify sprawling data, make it queryable at scale, and do it with governance and security baked in — turning a complex data landscape into an AI-ready platform.

Next steps

If you’re ready to move from theory to practice:

Explore part 1 of our AI series for a foundational primer on AI-ready infrastructure.
Put part 2 into action by following this validated design to integrate Starburst with Scality RING.
Connect with our team to see how this fits into your AI roadmap and to start accelerating your own initiatives.

The future of AI is being built on the data foundations we lay today. With Starburst and Scality, you’re not just keeping up — you’re getting ahead.

Other AI resources:

A primer on the concepts of AI: ML, LLMs, DL, NLP, GenAI, and the rise of RAG

Stop building dumb chatbots: The RAG + Scality RING solution

Enterprise AI in action: 5 real-world use cases powered by object storage

AI can’t wait for your data — How Hammerspace and Scality keep GPUs fed

Multidimensional scale: 10 must-have data storage dimensions to power your AI workloads

The AI storage problem you didn’t see coming — and how Scality RING already solved it


Rahul Padigela
