Migrate from BigQuery to ClickHouse Cloud with GCS bulk-load
This guide walks you through how you can migrate your BigQuery projects to ClickHouse Cloud by bulk-loading exported BigQuery data into ClickHouse. It uses the Stack Overflow dataset as a practical example.
Pre-requisites
Before proceeding with this migration guide, make sure you have the following:
- A ClickHouse Cloud account, with an empty 3x16 GB service that you will be using for the migration
- You have completed the BigQuery to ClickHouse migration guide setup, which walks you through how to set up the example dataset used in this guide. You will need:
- A BigQuery project with the Stack Overflow dataset
- A GCS bucket
- You have an understanding of how primary/order by keys work in ClickHouse
- You have read Schema design
Migration approach
BigQuery supports exporting data to Google's object store (GCS). We will use this as a data source for ingestion into ClickHouse by:
- Exporting tables to GCS.
- Importing the data into ClickHouse Cloud using the
gcsors3Clusterfunctions.
The approach is illustrated below:
This approach has several advantages:
- BigQuery export functionality supports a filter for exporting a subset of data.
- BigQuery supports exporting to Parquet, Avro, JSON, and CSV formats and several compression types - all supported by ClickHouse.
- GCS supports object life cycle management, allowing data that has been exported and imported into ClickHouse to be deleted after a specified period.
- Google allows up to 50 TB per day to be exported to GCS for free. Users only pay for GCS storage.
- Exports produce multiple files automatically, limiting each to a maximum of 1 GB of table data. This is beneficial to ClickHouse since it allows imports to be parallelized.
Before trying the following examples, we recommend users review the permissions required for export and locality recommendations to maximize export and import performance.
Migration steps
Export BigQuery data
- Go to the Google Cloud Console.
- Click on the project picker at the top of the page next to the Google Cloud logo and select the BigQuery project you made previously
- Select the Run a query in BigQuery button to open the studio.
- Now select the SQL query button.
- Run the following DDL commands, substituting
<bucket-uri>for the gsutil URI of your own GCS bucket e.g.gs://bigquery-to-clickhouse
Create schemas on ClickHouse Cloud
- Open the ClickHouse Cloud console
- Select the service you wish to use, or create a new empty service
- Select SQL console
- Select + next to the home icon to create a new query
- Run the following DDL statements to create the equivalent ClickHouse database and tables:
The schemas above are provided to you in an already optimized state. In practice, you will go through some iteration to find optimal schemas for your data.
We recommend focusing on migrating the primary table first. This may not necessarily be the largest table but rather the one on which you expect to run the most analytical queries. This approach allows you to familiarize yourself with the main ClickHouse concepts. This table may require remodeling as additional tables are added to fully exploit ClickHouse features and obtain optimal performance. This modeling process is explored further in Schema design and the recommended next steps guides of that article.
As an illustrative example, for this dataset, you would begin with the posts table, which is the table likely to receive the most analytical queries.
The BigQuery schema for posts is shown below:
Applying the process described in Schema design and going further by specifying column level compression codecs results in the following optimized schema:
Insert data from GCS
Insert the data for each table using the s3Cluster function:
The s3Cluster function above is used when your service contains multiple replicas, which allows processing files from both Amazon S3 and Google Cloud Storage in parallel with multiple replicas in a specified Cloud service.
You can also use the gcs function (an alias for the s3 table function) if your service has only a single replica.
The ACCESS_ID and SECRET used in the above query is your HMAC key associated with your GCS bucket.