Introduction to DuckDB - In memory database for Analytics

DuckDB is an in-memory database that we can interact with using SQL language. Think of it like SQLite but for analytics, or Redis with much more power and flexibility.

DuckDB is easy to install and supports extensions, focusing on performing analytics calculations. It's portable and the perfect tool for exploring datasets from CSV, JSON or Parquet files. It can be used to work locally or from a remote server via an HTTP file system.

Installation

To install DuckDB you can download the binary or install as dependencies for your python, nodejs or c++ language.

In this tutorial we will use the binary and install it using commandline brew this only work on mac os if you use linux or windows see the installtion guide here.

brew install duckdb

After installation finish validate the installation by running the duckdb command like this.

duckdb

And it should show you repl like this:

run duckdb

Query

Let see how we can run a query to duckdb. DuckDB query language is SQL so if you have any knowledge with sql database before learning the query language should be very easy.

Create table

CREATE TABLE news (
    url text,
    title varchar,
);

CREATE TABLE links (
    url text,
    host varchar,
);

List table:

SHOW TABLES;

Insert database:

INSERT INTO news(url, title) VALUES(
    'https://jack-vanlightly.com/blog/2023/11/14/the-architecture-of-serverless-data-systems',
    'The Architecture Of Serverless Data Systems'
);

Select

There are two ways to select the data. First the normal sql syntact.

SELECT * FROM news;

SELECT url, title FROM news;

Or you can use shortcut wihtout using SELECT *

FROM news;

Delete

DELETE FROM news where url='https://jack-vanlightly.com/blog/2023/11/14/the-architecture-of-serverless-data-systems'

And that's it, look the same as the normal sql.

Bonus

You can also create table and insert the data in one query:

CREATE TABLE news AS SELECT 'https://jack-vanlightly.com/blog/2023/11/14/the-architecture-of-serverless-data-systems' AS url, 'The Architecture Of Serverless Data Systems' AS title;

External data source

What I like about DuckDB is that the ability to work with non sql data like csv, json and parquet.

Let's take a look how to work with csv data. Let's use public dataset Football Wages Dataset.

To import csv into DuckDB table you can run this sql query.

CREATE TABLE salary AS SELECT * FROM read_csv_auto('SalaryPrediction.csv');

Now we can see the highlight of the data using from salary; query.

result

If you want to see the table scema:

describe salary;
┌─────────────┬─────────────┬─────────┬─────────┬─────────┬───────┐
│ column_name │ column_type │  null   │   key   │ default │ extra │
│   varchar   │   varchar   │ varchar │ varchar │ varchar │ int32 │
├─────────────┼─────────────┼─────────┼─────────┼─────────┼───────┤
│ Wage        │ VARCHAR     │ YES     │         │         │       │
│ Age         │ BIGINT      │ YES     │         │         │       │
│ Club        │ VARCHAR     │ YES     │         │         │       │
│ League      │ VARCHAR     │ YES     │         │         │       │
│ Nation      │ VARCHAR     │ YES     │         │         │       │
│ Position    │ VARCHAR     │ YES     │         │         │       │
│ Apps        │ BIGINT      │ YES     │         │         │       │
│ Caps        │ BIGINT      │ YES     │         │         │       │
└─────────────┴─────────────┴─────────┴─────────┴─────────┴───────┘

Let's run some query to get to average salary by Club and get the top 10.

SELECT Club, AVG(CAST(REPLACE(Wage, ',', '') AS DOUBLE)) AS AverageWage
FROM salary
WHERE Wage IS NOT NULL
GROUP BY Club
ORDER BY AverageWage DESC
LIMIT 10;

Result:

┌───────────────────┬────────────────────┐
│       Club        │    AverageWage     │
│      varchar      │       double       │
├───────────────────┼────────────────────┤
│ PSG               │         11046675.0 │
│ R. Madrid         │  9128782.608695652 │
│ FC Bayern         │   8708043.47826087 │
│ Barcelona         │  6313604.166666667 │
│ Man UFC           │ 4622173.9130434785 │
│ Man City          │ 4526184.2105263155 │
│ Chelsea           │   4246629.62962963 │
│ Juventus          │        4120171.875 │
│ A. Madrid         │  4096264.705882353 │
│ Borussia Dortmund │  3907576.923076923 │
├───────────────────┴────────────────────┤
│ 10 rows                      2 columns │
└────────────────────────────────────────┘