Migrate from BigQuery to ClickHouse Cloud using CDC
This guide walks you through how you can migrate your BigQuery projects to ClickHouse Cloud with ClickPipes, a managed CDC solution available in ClickHouse Cloud. It uses the Stack Overflow dataset as a practical example.
Pre-requisites
Before proceeding with this migration guide, make sure you have the following:
- A ClickHouse Cloud account, with an empty service that you will be using for the migration
- You have completed the BigQuery to ClickHouse migration guide setup guide which walks you through how to set up the example dataset used in this guide
- You have an understanding of how primary/order by keys work in ClickHouse
- You have read Schema design
Migration approach and benefits
Change Data Capture (CDC) is the process by which tables are kept in sync between two databases. This is significantly more complex if updates and deletes are to be handled in near real-time. ClickHouse Cloud offers a managed CDC solution for easily transferring data BigQuery and ClickHouse Cloud.
Migration steps
With the Stack Overflow dataset set up in your BigQuery project, and a GCS bucket and service account key in hand, you're now ready to use ClickPipes to ingest the data. Follow the steps below to set up the BigQuery ClickPipe:
Select the data source
- Open the ClickHouse Cloud console
- Select the service you wish to use, or create a new empty service
- Select Data sources in the main navigation menu and click Create ClickPipe.
- Click the BigQuery tile.
Set up your ClickPipe connection
To set up a new ClickPipe, you must provide details on how to connect to and authenticate with your BigQuery data warehouse, and a staging GCS bucket.
- Give your ClickPipe a name, for example bigquery-stackoverflow-clickpipe
- Upload the
.jsonkey for the service account you created and downloaded in the BigQuery to ClickHouse migration guide setup guide. - Select Initial load only as the Replication method.
- Under GCS staging bucket URL, enter the gsutil URI of your bucket which you obtained earlier.
- Click Next to validate.
Configure your ClickPipe connection
You can select the number of parallel threads for initial load, snapshot number of rows per partition, and the number of tables to snapshot in parallel. For the purposes of this guide, you can leave these as default. Click Next
Configure tables
You will be asked if you want to use an existing database or create a new database. Select New database and give it a name like "stackoverflow."
For the purposes of this guide, you can deselect the "Prefix default destination table names with schema name," and leave the "Preserve NULL values from source" toggle unselected.
Expand the dropdown named stackoverflow to view the tables available to ClickPipes from your BigQuery project. Turn the Select all tables toggle switch on.
At this stage it's necessary to give thought to how to order the data, as choosing an effective primary key (sorting key) in ClickHouse is crucial for query performance and storage efficiency. Sorting keys must also be defined on table creation and can't be added after.
You must define a sorting key for the replicated tables to optimize query performance in ClickHouse. Otherwise, the sorting key will be set as tuple(), which means no primary index will be created and ClickHouse will perform full table scans for all queries on the table.
For each of the selected tables, click the Advanced settings toggle, and set the following settings per table:
| Table | Engine | Custom partitioning key for initial load | Custom sorting key |
|---|---|---|---|
badges | MergeTree | - | (UserId, Class, Date) |
comments | MergeTree | toYear(CreationDate) | (PostId, CreationDate) |
postHistory | MergeTree | toYear(CreationDate) | (PostId, CreationDate, PostHistoryTypeId) |
postLinks | MergeTree | - | (PostId, RelatedPostId) |
posts | ReplacingMergeTree | toYear(CreationDate) | (PostTypeId, toDate(CreationDate), Id) |
users | ReplacingMergeTree | toYear(CreationDate) | (Id) |
votes | MergeTree | toYear(CreationDate) | (PostId, VoteTypeId, CreationDate) |
Decisions for your custom partitioning key, and custom sorting key often come down to what your access patterns look like. To better understand how to set up your ClickHouse schema, see the article Schema design which uses the same data set.
Configure permissions
ClickPipes will create a dedicated user for writing data into a destination table. You can select a role for this internal user using a custom role or one of the predefined roles. Select Full access, and
Complete setup
Ingesting all the data will take around 40 minutes to complete, and incur a cost of around $10 in compute. We recommend using a free trial account with $300 in credits.
Click Create ClickPipe to complete the setup. You'll be redirected to the overview page, where you can the progress of the initial load and click through to see the details for your BigQuery ClickPipes.
You should see a new ClickPipe listed with type BigQuery. You can click it to open a monitoring dashboard that shows you progress on the amount of data ingested, the overall status of the pipe, and per table statuses.