This is interesting! Do you plan to add support for distinguishing between blocking and non-blocking/async migrations? This seems to be a common challenge when using Clickhouse and needing to change the ordering of existing tables.
Context: When adding Clickhouse to Langfuse [1] to store all observability data, the migration DX was a big challenge. Some migrations need to copy/reorder data between tables which can take a long time and thus need be performed in the background/async. Many teams run Langfuse self-hosted and reducing downtime of an upgrade is important, thus, we added some custom logic to run these “background migrations” for now [2]. For regular migrations we use golang-migrate, works decently although DX isn’t as good as eg Prisma for Postgres.
Interesting, we didn't run into the need for blocking and async migrations for our use case, because we do incremental updates to our clickhouse schema. What do you mean by needing to change the ordering of existing tables? I'll check out the background migrations link tomorrow!
caust1c 23 days ago [-]
In my experience, my pain hasn't been from managing the migration files themselves.
My pain comes from needing to modify the schema in such a way it requires a data migration, reprocessing data to some degree, and managing the migration of that data in a sane way.
You can't just run a simple `INSERT INTO table SELECT * FROM old_table`, or anything like that because if the data is large, it takes forever and a failure in the middle could be fairly difficult to recover from.
So what I do is I split the migration into time-based chunks, because nearly every table has a time component that is immutable after writing, but I really want a migration tool that can figure out what that column is, what those chunks are, and incrementally apply a data migration in batches so that if one batch fails I can go in there to investigate and know exactly how much progress on the data migration has been made.
vortex_ape 23 days ago [-]
This makes a ton of sense! We've faced the same exact problem because whenever we need to add a new materialized table or column, we need to backfill data into it.
For materialized columns, it's a "easy" (not really because you still need to monitor the backfill) because we can run something like `ALTER TABLE events UPDATE materialized_column = materialized_column WHERE 1`. Depending on the materialization that can bring the load up on clickhouse because it still creates a lot of background mutations that we need to monitor, because they can fail due to all sorts of reasons (memory limit or disk space errors), in which case we need to jump in and fix things by hand.
For materialized tables, it's a bit harder because we need to write custom scripts to load data in day-wise (or month, depending on the data size) chunks like you mentioned, because an `INSERT INTO table SELECT * FROM another_table` will for sure run into memory limit errors depending on the data size.
I would love to think more about this to see if it's something that would make sense to handle within Houseplant.
Would love to chat more here or there if you're keen!
akdor1154 23 days ago [-]
On one hand i see SQL-in-yaml and instantly nope outta here.
On the other, I've definitely had trouble with getting an existing migrations tool to work with Clickhouse.. currently using golang-migrate but it's not painless. I think Atlas might have the right idea but I'm hesitant to lock my company into commercial tooling. Sqlmesh may also kind of do what's needed, but its focus is more on transforming tables of data, managing your ddl happens as a happy accident along the way.
vortex_ape 23 days ago [-]
We could look into supporting another way to define migrations that is not yaml. What would you have in mind and why is yaml bad?
gregw2 20 days ago [-]
Why mot use comment-annotated .sql like liquibase or flyway database migration tools do? How does yaml help?
As someone who did migrations wrapped in XML back in the day, those parsing wrappers around sql (xml, yaml, json) add complexity for the user and create opaque error messages at times due to the wrapper format. (quoting, escaping, indenting, allowed characters, etc)
I get that as a certain fraction of your userbase may like yaml and I am not saying you shouldn't support it. And I get that the yaml parsing library can accelerate certain dev steps for you. But the impedence mismatch between yaml and sql and error handling will make your product usability worse.
You are already building a domain specific language of sorts. You will build a better product if you keep your "language" as close to the problem domain (sql-centric migrations) as possible. Yaml just adds complexity.
My two cents anyway.
sgarland 23 days ago [-]
YAML is bad, first and foremost, because the language is hideously awful. Anything that can interpret Norway’s country code as a bool, or certain numbers as sexagesimal, has not been well-thought out.
As to what else, I continually fail to see what’s wrong with SQL. It’s incredibly easy to read, in plain English, and understand what is being done. ALTER TABLE foo ADD COLUMN bar INTEGER NOT NULL DEFAULT 0… hmm, wonder what that’s doing?
For people who say you get too many .sql files doing it this way: a. Get your data models right the first time b. Build a tool to parse and combine statements, if you must.
vortex_ape 23 days ago [-]
I agree with your point about SQL's readability, however, I think there's value in having structured metadata around migrations, like dependencies between migrations, rollback instructions, and environment specific variations, beyond just the raw SQL.
This could be done with SQL comments, but having it in a structured format makes it more reliable to parse and validate programmatically. I do see why YAML's quirks could become a problem in a tool that's meant to help you make sure your database is in order, we didn't run into issues like country codes or numbers being interpreted as sexagesimal (yet).
Perhaps a middle ground would be to keep the actual migrations in .sql files for readability, while using a separate metadata file (in JSON or TOML) for the orchestration details. What do you think?
sgarland 22 days ago [-]
Re: dependencies, SQL is declarative. While there might be benefit in keeping the last modification as its own entity, I think just dumping the output of SHOW CREATE TABLE or the equivalent is the best solution. It clearly shows the current state of the table. This of course requires that the migration has already occurred, or has been faked for display purposes.
I guess my stance is that if you need some kind of config file to create the migration, then fine, but as a DBRE, it’s much easier for me to reason about the current state by seeing the declarative output than by mentally summing migration files. As a bonus, this lets you see quite easily if you’ve done something silly like creating an additional index on a column that has a UNIQUE constraint (though you’d still have to know why that’s unnecessary, I suppose).
vortex_ape 22 days ago [-]
Yep that's the reason we generate a schema.sql at the end so that you know the current state of the database after the migrations have been applied (you don't need to mentally sum those migrations). Coupled with git, you get to see the diff of the schema.sql to see what changed which is super useful, like the same additional index example you mentioned.
christiangenco 23 days ago [-]
I'm new to ClickHouse. Why is this necessary? Isn't there a native way to do ClickHouse migrations?
benpacker 23 days ago [-]
You can connect and run SQL, but there’s no built in support for versioning, testing, and rolling back DDL.
vortex_ape 23 days ago [-]
You can run DDL commands directly on ClickHouse, but after you have a lot of these commands in your codebase, it becomes a pain to manage them. We had these commands defined in a lot of ruby files named <timestamp>_create_table.rb using which we mimicked the way rails and active record manage the app database migrations. Rails and active record also give you a schema.rb which shows the latest state of the app database at all times. After having no schema.rb for a long time, and it becoming a pain to manage the standalone migration files in ruby, we decided to build this to be able to manage migrations for ClickHouse, our analytics database. We love using it and the schema.sql file (with the latest db state) it generates at the end.
owenthejumper 23 days ago [-]
Migrations is a confusing word in the context. It's a modern word for 'configuration management for database schemas'.
Context: When adding Clickhouse to Langfuse [1] to store all observability data, the migration DX was a big challenge. Some migrations need to copy/reorder data between tables which can take a long time and thus need be performed in the background/async. Many teams run Langfuse self-hosted and reducing downtime of an upgrade is important, thus, we added some custom logic to run these “background migrations” for now [2]. For regular migrations we use golang-migrate, works decently although DX isn’t as good as eg Prisma for Postgres.
[1] OSS LLM Observability, https://github.com/langfuse/langfuse
[2] https://langfuse.com/self-hosting/background-migrations
My pain comes from needing to modify the schema in such a way it requires a data migration, reprocessing data to some degree, and managing the migration of that data in a sane way.
You can't just run a simple `INSERT INTO table SELECT * FROM old_table`, or anything like that because if the data is large, it takes forever and a failure in the middle could be fairly difficult to recover from.
So what I do is I split the migration into time-based chunks, because nearly every table has a time component that is immutable after writing, but I really want a migration tool that can figure out what that column is, what those chunks are, and incrementally apply a data migration in batches so that if one batch fails I can go in there to investigate and know exactly how much progress on the data migration has been made.
For materialized columns, it's a "easy" (not really because you still need to monitor the backfill) because we can run something like `ALTER TABLE events UPDATE materialized_column = materialized_column WHERE 1`. Depending on the materialization that can bring the load up on clickhouse because it still creates a lot of background mutations that we need to monitor, because they can fail due to all sorts of reasons (memory limit or disk space errors), in which case we need to jump in and fix things by hand.
For materialized tables, it's a bit harder because we need to write custom scripts to load data in day-wise (or month, depending on the data size) chunks like you mentioned, because an `INSERT INTO table SELECT * FROM another_table` will for sure run into memory limit errors depending on the data size.
I would love to think more about this to see if it's something that would make sense to handle within Houseplant.
Would love to chat more here or there if you're keen!
On the other, I've definitely had trouble with getting an existing migrations tool to work with Clickhouse.. currently using golang-migrate but it's not painless. I think Atlas might have the right idea but I'm hesitant to lock my company into commercial tooling. Sqlmesh may also kind of do what's needed, but its focus is more on transforming tables of data, managing your ddl happens as a happy accident along the way.
As someone who did migrations wrapped in XML back in the day, those parsing wrappers around sql (xml, yaml, json) add complexity for the user and create opaque error messages at times due to the wrapper format. (quoting, escaping, indenting, allowed characters, etc)
I get that as a certain fraction of your userbase may like yaml and I am not saying you shouldn't support it. And I get that the yaml parsing library can accelerate certain dev steps for you. But the impedence mismatch between yaml and sql and error handling will make your product usability worse.
You are already building a domain specific language of sorts. You will build a better product if you keep your "language" as close to the problem domain (sql-centric migrations) as possible. Yaml just adds complexity.
My two cents anyway.
As to what else, I continually fail to see what’s wrong with SQL. It’s incredibly easy to read, in plain English, and understand what is being done. ALTER TABLE foo ADD COLUMN bar INTEGER NOT NULL DEFAULT 0… hmm, wonder what that’s doing?
For people who say you get too many .sql files doing it this way: a. Get your data models right the first time b. Build a tool to parse and combine statements, if you must.
This could be done with SQL comments, but having it in a structured format makes it more reliable to parse and validate programmatically. I do see why YAML's quirks could become a problem in a tool that's meant to help you make sure your database is in order, we didn't run into issues like country codes or numbers being interpreted as sexagesimal (yet).
Perhaps a middle ground would be to keep the actual migrations in .sql files for readability, while using a separate metadata file (in JSON or TOML) for the orchestration details. What do you think?
I guess my stance is that if you need some kind of config file to create the migration, then fine, but as a DBRE, it’s much easier for me to reason about the current state by seeing the declarative output than by mentally summing migration files. As a bonus, this lets you see quite easily if you’ve done something silly like creating an additional index on a column that has a UNIQUE constraint (though you’d still have to know why that’s unnecessary, I suppose).