Welcome
Welcome to the Databend documentation! Databend is an open-source, elastic, and workload-aware modern cloud data warehouse designed to meet businesses' massive-scale analytics needs at low cost and with low complexity.
This welcome page guides you through the features, architecture, and other important details about Databend.
Why Databend?
Databend is always searching for and incorporating the most advanced and innovative technologies to provide you with an exceptional user experience.
- Performance
  - Blazing-fast data analytics on object storage.
  - Leverages data-level parallelism and instruction-level parallelism for optimal performance.
  - Supports Git-like MVCC storage for easy querying, cloning, and restoration of historical data.
  - No indexes to build, no manual tuning, and no need to figure out partitions or shard data.
- Compatibility
  - Compatible with MySQL / ClickHouse.
  - ANSI SQL compliant.
  - Easy connection with existing tools such as MySQL Client, ClickHouse HTTP Handler, Vector, DBeaver, Jupyter, JDBC, and more.
- Data Manipulation
  - Supports atomic operations such as SELECT, INSERT, DELETE, UPDATE, COPY, and ALTER.
  - Provides advanced features such as Time Travel and Multi Catalog (Apache Hive / Apache Iceberg); see the Time Travel sketch after this list.
  - Supports ingestion of semi-structured data in various formats like CSV, JSON, and Parquet.
  - Supports semi-structured data types such as ARRAY, MAP, and JSON.
- Cloud Storage
  - Supports various cloud storage platforms, including Amazon S3, Azure Blob, Google Cloud Storage, Alibaba Cloud OSS, Tencent Cloud COS, Huawei Cloud OBS, Cloudflare R2, Wasabi, and MinIO.
  - Allows instant elasticity, enabling users to scale up or down based on their application needs.
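Time Travel, for instance, builds directly on the MVCC storage noted above: every change produces a new snapshot, and past snapshots remain queryable. A minimal sketch (the table name and timestamp here are hypothetical):

```sql
-- Query the current state of a table.
SELECT * FROM ontime;

-- Query the same table as it was at an earlier point in time,
-- served from a historical MVCC snapshot (the timestamp is illustrative).
SELECT * FROM ontime AT (TIMESTAMP => '2023-01-01 00:00:00'::TIMESTAMP);
```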
Databend Architecture
Databend's high-level architecture is composed of a meta-service layer, a compute layer, and a storage layer.
Meta-Service Layer
Databend supports multiple tenants; the meta-service layer serves these tenants and stores their respective states in a persistent key-value store. The meta-service layer plays a vital management role in Databend by:
- Managing all metadata related to databases, tables, clusters, transactions, and more.
- Storing user information, access control data, usage statistics, and other related information.
- Performing user authentication and authorization to ensure a secure environment.
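For example, users and privileges created with statements like the following are persisted and served by the meta-service layer (the user name and password are hypothetical):

```sql
-- The created user's credentials are stored in the meta-service.
CREATE USER analyst IDENTIFIED BY 'Str0ng_p@ss';

-- Access-control data, such as this grant, also lives in the meta-service.
GRANT SELECT ON default.* TO analyst;
```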
If you're interested, you can find the code related to the meta-service layer in the meta folder of the GitHub repository.
Compute Layer

The compute layer's main responsibility is to execute computations for queries, and it can include several clusters, each with multiple nodes.
To form a cluster, several nodes are grouped together under a shared namespace. This lets a cluster's nodes access the same database and handle multiple user queries concurrently. New nodes can be added to a cluster to scale up computational tasks that are already running; this process, known as "work-stealing", keeps the system operating efficiently as nodes are added.
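Once nodes have registered under the same namespace, the resulting cluster can be inspected from SQL. A quick sketch (the exact columns returned vary by version):

```sql
-- Lists the nodes that have joined the cluster via the shared namespace.
SELECT * FROM system.clusters;
```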
A node serves as a fundamental unit of the compute layer and is composed of the following components:
- Planner: The planner creates a plan for executing the SQL statement provided by the user. This plan includes various types of operators from relational algebra, such as Projection, Filter, and Limit, to represent the query.
```sql
EXPLAIN SELECT avg(number) FROM numbers(100000) GROUP BY number % 3;
```

```
EvalScalar
├── expressions: [avg(number) (#3)]
└── AggregateFinal
    ├── group by: [(number % 3)]
    ├── aggregate functions: [avg(number)]
    └── AggregatePartial
        ├── group by: [(number % 3)]
        ├── aggregate functions: [avg(number)]
        └── EvalScalar
            ├── expressions: [%(numbers.number (#0), 3)]
            └── TableScan
                ├── table: default.system.numbers
                ├── read rows: 100000
                ├── read bytes: 800000
                ├── partitions total: 2
                ├── partitions scanned: 2
                └── push downs: [filters: [], limit: NONE]
```
- Optimizer: A rule-based optimizer that applies a set of predefined rules, such as "predicate pushdown" and "pruning of unused columns", to determine the most efficient way to execute a query.
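As a hedged illustration of predicate pushdown: for a query like the one below, the filter should appear inside the TableScan's push downs rather than as a separate Filter step above the scan (the exact plan text varies by version):

```sql
-- The optimizer pushes the predicate down to the storage scan, so the plan
-- should report: push downs: [filters: [number > 90000], limit: NONE]
EXPLAIN SELECT number FROM numbers(100000) WHERE number > 90000;
```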
- Processors: A query execution pipeline is a series of steps that retrieves data for a given query. Databend builds the query execution pipeline according to the planner's instructions and follows a Pull&Push approach, meaning it can both pull and push data as needed. Each step in the pipeline is called a "processor", such as SourceTransform or FilterTransform, and can have multiple inputs and outputs. The processors are connected to form a pipeline that executes the query. To optimize performance, the pipeline can be distributed across multiple nodes depending on the query workload.
```sql
EXPLAIN PIPELINE SELECT avg(number) FROM numbers(100000) GROUP BY number % 3;
```

```
CompoundChunkOperator(Rename) × 1 processor
  CompoundChunkOperator(Project) × 1 processor
    CompoundChunkOperator(Map) × 1 processor
      GroupByFinalTransform × 1 processor
        Merge (GroupByPartialTransform × 2 processors) to (GroupByFinalTransform × 1)
          GroupByPartialTransform × 2 processors
            CompoundChunkOperator(Map) × 2 processors
              CompoundChunkOperator(Project->Rename) × 2 processors
                NumbersSourceTransform × 2 processors
```
If you're interested, you can find the code related to the compute layer in the query folder of the GitHub repository.
Storage Layer

Databend uses an open-source and efficient columnar format known as Parquet to store data. To make querying faster, Databend also creates indexes for each Parquet file. There are two types of indexes:
- min_max.idx: Stores the minimum and maximum values of the Parquet file.
- sparse.idx: Stores the mapping between keys and the corresponding Parquet pages at a granularity of every [N] records.
With these indexes, Databend can speed up queries by reducing the I/O and CPU costs. For example, if Parquet file f1 has min_max.idx of [3, 5) and Parquet file f2 has min_max.idx of [4, 6) in column x, and the query is looking for values where x is less than 4, only f1 needs to be accessed and processed.
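This pruning effect is visible in plan output. A minimal sketch, assuming each INSERT below produces its own Parquet block (the table and values are hypothetical):

```sql
CREATE TABLE t (x INT);
INSERT INTO t VALUES (3), (4);  -- first block:  min_max index covers 3..4
INSERT INTO t VALUES (4), (5);  -- second block: min_max index covers 4..5

-- Only the first block can contain x < 4; the second is pruned via its
-- min_max index, so EXPLAIN should show partitions scanned < partitions total.
EXPLAIN SELECT * FROM t WHERE x < 4;
```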
Community
The Databend community is open to data professionals, students, and anyone who has a passion for cloud data warehouses. Feel free to click on the links below to be a part of the excitement:
- Slack: Chat with the community
- GitHub: Feature requests, bug reports, and contributions
- Twitter: Stay in the know
- Weekly: Weekly updates
Roadmap
License
Databend is licensed under the Apache License 2.0.
Acknowledgments
- Databend is inspired by ClickHouse and Snowflake, and its computing model is based on apache-arrow.
- The documentation website is hosted by Vercel.