>They care less about impressive benchmarks or clever algorithms, and more about whether they can operate and use a database efficiently to query, update, analyze, and persist their data with minimal headache.
Hugely important, and I would add "backup-and-restore" to that list. At risk of sounding conceited, ease of use is a primary goal of rqlite[1] -- because in the real world databases must be operated[2]. I never add a feature if it's going to measurably decrease how easy it is to operate the database.
Or, I should say, I don't add the feature until I can figure out how it can be made easy and intuitive to use. That's assuming the feature is even coherent with the existing feature set of the database.
Of course, it's easy for me to do this. I am not developing the database for commercial reasons, so can just say "no" to an idea if I want. That said, I've found that many ideas which didn't seem interesting to me when an end-user first proposed them become compelling once I think more about the operational pain (and it's almost always operational) they are experiencing.
Automatic backups to S3[1] was such a feature. I was sceptical -- "just run a script, call the backup API, and upload yourself" was my attitude. But the built-in support has become popular.
The distance between a large effort and a moderate effort is not very great; from the user's perspective, both are in the realm of the Hassle.
The chasm between a small required effort and zero effort is vast, from the user's perspective.
Any product person will tell you that. Hitting the right zero-effort target is what separates a runaway success from a tepid reaction.
hawk_ 147 days ago [-]
> The chasm between a small required effort and zero effort is vast, from the user's perspective.
Wow! I think I have intuitively/subconsciously worked this into some products I have developed but never had it articulated this way. This is very useful. Any books you'd recommend on this topic?
nine_k 147 days ago [-]
Sorry, can't recommend a good book. My opinions come from interacting with actual product people, and from watching "undercooked", "overpriced", "poorly made", "easy to replicate" things achieve enviable success and wide adoption very quickly: PHP, Docker, Heroku, Dropbox, etc. They all completely removed friction in some narrow but important areas, and this was more important than all their very real shortcomings.
spinningslate 148 days ago [-]
yes, definitely. I find it helpful to think of usability and friction as an inverse square law relation [0]. Small increases in friction (x-axis) cause dramatic drop-off in usability (y-axis) to begin with, then correspondingly less so. Specific user tolerance will vary, but adoption broadly follows a similar path - exponential drop off.
I've never seen any data to back this up in a quantitative sense (though interested if anyone has?). Nevertheless, I've still found it useful as a qualitative rule of thumb in a positive sense: shaving off small edges of friction can have non-linear return in adoption and satisfaction.
I like the electromagnetic analogy. I could say along these lines that lowering friction is like upgrading from aluminium wires to copper wires, and removing friction is like upgrading to a superconductor. The change becomes qualitative.
makmanalp 148 days ago [-]
Add schema migrations and bulk loads to this. So many systems crap out doing things like schema migrations at scale. Query latencies degrade over time due to internal structures keeping track of things, stuff runs out of buffer / log space in memory or on disk, you have a traffic spike but you can't pause or throttle a 40 hour long ALTER, things that should never lock do lock for indeterminate times during cutover, stuff craps out after cutover due to surprising behavior but there is no rollback, or even worse I've seen things just flat out crash with some random assert fail or segfault.
It's a world of pain, and there is so much scar tissue at large companies: third-party tooling doing crazy stuff to deal with a problem that really should be the DB vendor's problem.
sgarland 148 days ago [-]
If you spend time reading docs – I mean really reading them – you'll probably find out why something locked. For example, MySQL 8.? can extend metadata locks to cover foreign key relationships. This seemingly innocuous line in the docs can cause the entire DB to grind to a halt under load.
I’m not saying this is good, only that RDBMS management is very much still in the land of arcane incantations, and anyone saying otherwise is selling something. It’s also why I have a job, so from my perspective it isn’t all bad.
makmanalp 147 days ago [-]
Hi friend. I, like you, am exactly that person who reads those docs very carefully and has to know about the seemingly innocuous one-liners, as well as the one-liners that are not there but should be, that eventually make it in after the next point release. Or the bug report that's marked "S3 (non-critical)" that sneakily lies in wait for days in production until it whacks something very much critical. Or the bug that celebrated its 14th birthday. What I understand about how some of these implementations work makes me very sad sometimes. And I'm telling you that the state of things is silly.
The thing is - and I imagine you know this - that when you run tens of thousands of database instances it's not a matter of if, it's a matter of when. I can't chase after thousands of engineers to make sure they're not hitting some obscure pitfall nor can I demand that they know the ins and outs of the 8 different types of locks in FooDB. And "read the docs lol" definitely doesn't scale. So we build automated mass-guardrails and linters as best we can and use various shenanigans to deal with the real world.
In the context of schema migrations, there are tools like pt-osc, gh-ost and pgroll that are excellent for what they are, and I'm grateful to those who created them, but these are still in effect huge hacks bolted on top of the database. They're teeming with footguns. I can tell you about a time when badly conceived timeout / error handling code in one of these tools' golang database driver resulted in most of a database being dropped instantly with no fast rollback, for example. That was fun.
If a third-party bolt-on tool can implement zero(ish)-downtime migrations and cutovers, at least much, much better than the regular DDL, why the heck is the database not doing whatever that thing is doing? It's tech debt, history, cruft, lack of caring, etc. Schema migrations are just the tip of the iceberg, too; don't get me started on replication, durability settings, resource usage vs capacities ...
My point is that academics and vendors should be taking operational problems seriously and owning them, and solving them properly at the database layer.
And just like you I also make living off of the fact that the real world operational characteristics of truly busy databases are very different than what gets benchmarked and tested. I'm just saying that some of this stuff even I don't want to have to deal with. There's better problems to be working on.
pjs_ 148 days ago [-]
A huge fraction (not 100%, but maybe 80%) of my frustration in trying to get technical people to use a database is that they have such a hard time understanding JOINs.
People endlessly want hacks, workarounds, and un-normalized data structures, for a single reason - they don't want to have to think about JOIN. It's not actually for performance reasons, it's not actually for any reason other than it's easier to imagine a big table with lots of columns.
I'm actually sympathetic to that reticence but what I am not sympathetic about is this: why, in 2024, can't the computer figure out multi-table joins for me?
Unless your schema is really fucked up, there should only be one or two actually sensible ways to join across multiple tables. Like say I have apples in boxes in houses. I have an apple table, a box table, and a house table. apples have a box_id, boxes have a house_id. Now I want to find all the apples in a house. Do two joins, cool. But literally everyone seems to write this by hand, and it means that crud apps end up with thousands of nearly identical queries which are mostly just boringly spelling out the chain of joins that needs to be applied.
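For concreteness, a minimal sketch of the hand-written version, with hypothetical table and column names:
SELECT a.*
FROM apple a
JOIN box   b ON b.box_id   = a.box_id
JOIN house h ON h.house_id = b.house_id
WHERE h.house_id = 42;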
When I started using SQLAlchemy, I naively assumed that a trivial functionality of such a sophisticated ORM would be to implement `apple.join(house)`, automatically figuring out that `box` is the necessary intermediate step. SQLAlchemy even has all the additional relationship information to figure out the details of how those joins need to work. But after weeks of reading the documentation I realized that this is not a supported feature.
In the end I wrote a join tool myself, which can automatically find paths between distant tables in the schema, but it seems ludicrous to have to homebrew something like that.
I'm not a trained software engineer, but it seems like this must be a very generic problem -- is there a name for the problem? Are there accepted solutions or no-go theorems on what is possible? I have searched the internet a lot and mostly just find people saying "oh we just type out all combinatorially-many possible queries"... apologies in advance if I am very ignorant here.
arkh 147 days ago [-]
> Unless your schema is really fucked up
So, most enterprise schemas that have outlived multiple applications, usually due to time constraints, a lack of database administrators, and complex business needs. Excel is the king of databases for a reason.
big_whack 148 days ago [-]
It's not really a problem of there being combinatorially many ways to join table A to table B, but rather that unless the join is fully-specified those ways will mostly produce different results. Your tool would need to sniff out these ambiguous cases and either fail or prompt the user to specify what they mean. In either case the user isn't saved from understanding joins.
halfcat 148 days ago [-]
> ”but rather that unless the join is fully-specified those ways will mostly produce different results.”
As a SQL non-expert, I think this is why we are averse to joins, because it’s easy to end up with more rows in the result than you intended, and it’s not always clear how to verify you haven’t ended up in this scenario.
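A tiny sketch of the failure mode, with hypothetical customer/order tables, where the join fans out one row per matching child:
SELECT c.customer_id, o.order_id
FROM customer c
LEFT JOIN customer_order o ON o.customer_id = c.customer_id;
-- a customer with 3 orders shows up 3 times; comparing count(*) to
-- count(DISTINCT customer_id) is one quick way to spot the fan-out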
sa46 147 days ago [-]
I wish there were a way to specify the expected join cardinality syntactically for 1:1 joins.
SELECT * FROM order JOIN_ONE customer USING (customer_id)
Don't have a suitable database to test on but I'm pretty sure standard SQL allows you to join on a subselect that limits its result to one.
singron 147 days ago [-]
You probably want to use LATERAL JOIN, which will compute the subquery for each left hand row. Otherwise it's computed just once.
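A rough sketch of that shape (Postgres syntax, hypothetical tables):
SELECT o.*, c.name
FROM orders o
CROSS JOIN LATERAL (
    SELECT name
    FROM customers
    WHERE customers.customer_id = o.customer_id
    LIMIT 1   -- at most one customer row per order
) c;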
cess11 147 days ago [-]
With SQL there are commonly several options available, and tradeoffs might not be obvious until it's tested against a real data set. Sometimes even an EXPLAIN that looks good still has some drawback in production, though in my experience it's very rare.
I like that though, I have a preference for REPL style development and fleshing things out interactively. Enough so that I build Java like that, hacking things out directly in tests.
big_whack 147 days ago [-]
Sorry, but someone who is averse to joins is not a non-expert in SQL, they are a total novice. The answer is the same as for any other programming language: you simply must learn the fundamentals in order to use it.
halfcat 147 days ago [-]
Not sorry, I’ll stick with SQL non-expert as I’ve only worked with databases for a few decades and sometimes run into people who know more.
Working with a database you built or can control is kind of a simplistic example.
In my experience this most often arises when it’s someone else’s database or API you’re interacting with and do not control.
An upstream ERP system doesn’t have unique keys, or it changes the name of a column, or an accounting person adds a custom field, or the accounting system does claim to have unique identifiers but gets upgraded and changes the unique row identifiers that were never supposed to change, or a user deletes a record and recreates the same record with the same name, which now has a different ID so the data in your reporting database no longer has the correct foreign keys, and some of the data has to be cross-referenced from the ERP system with a CRM that only has a flaky API and the only way to get the data is by pulling a CSV report file from an email, which doesn’t have the same field names to reliably correlate the data with the ERP, and worse the CRM makes these user-editable so one of your 200 sales people decides to use their own naming scheme or makes a typo and we have 10 different ways of spelling “New York”, “new york”, “NY”, “newyork2”, “now york”, and yeah…
Turns out you can sometimes end up with extra rows despite your best efforts and that SQL isn’t always the best tool for joining data, and no I’m not interested in helping you troubleshoot your 7-page SQL query that’s stacked on top of multiple layers of nested SQL views that’s giving you too many rows. You might even say I’m averse.
consteval 147 days ago [-]
Yes, when you have duplicated data and data inconsistency/integrity issues, you might get duplicated data and data inconsistency/integrity issues in your output.
This is a problem of form, not method. JOINs are a fantastic, well-defined method to aggregate data. If the form of your data is messed up, then naturally the result may be too.
> no I’m not interested in helping you troubleshoot your 7-page SQL query that’s stacked on top of multiple layers of nested SQL views that’s giving you too many rows
People say this type of thing but SQL is an incredibly high-level language.
Yes debugging a big SQL query can suck. Debugging the equivalent of it is almost always much worse. I'd rather debug a 30-line query than a 400 line perl script that does the same thing. Because that's actually the alternative.
I have manually aggregated data, messy data, in awk, perl, python... it is much, much worse.
halfcat 146 days ago [-]
> ”I have manually aggregated data, messy data, in awk, perl, python... it is much, much worse.”
Yes, but with Python/etc you can at least do the same logic in a plain loop, which is much slower but serves as a more reliable reference of what the data is supposed to be, which can be used to validate the functionality of SQL output.
Is there an equivalent in SQL of this “slow and dumb” approach for validating? Like, I’m not sure if a lateral join is essentially doing the same thing under the hood.
rawgabbit 146 days ago [-]
Most databases have the concept of temporary tables that will automatically disappear when your session ends. For troubleshooting, I would break down each step and save it in a temp table. Validate it. Then use it as the input for the next step. Etc.
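A minimal sketch of that workflow (Postgres-flavoured, hypothetical tables):
CREATE TEMPORARY TABLE step1 AS
SELECT o.order_id, o.customer_id, o.total
FROM orders o
WHERE o.created_at >= DATE '2024-01-01';
-- validate before moving on, e.g. check for unexpected fan-out:
SELECT count(*), count(DISTINCT order_id) FROM step1;
CREATE TEMPORARY TABLE step2 AS
SELECT s.*, c.region
FROM step1 s
JOIN customers c USING (customer_id);
-- ...and so on; the temp tables vanish when the session ends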
rawgabbit 147 days ago [-]
I have run into this scenario a few times, where a multi-hour process produces an explosion of rows that no one cared to troubleshoot further. They only wanted to de-duplicate the final result.
In practice, I ended up creating a skeleton table that not only had the appropriate constraints for de-duplication, but was also pre-populated with mostly empty rows (empty except for the required key ID columns) matching the exact rows they were expecting. Then I would perform an UPDATE <skeleton> ... FROM <complicated join>. It is a hack, but if there is no desire to rewrite a process that was written years ago by teams of consultants, I can only do what I can do.
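Roughly, the shape of that hack, with hypothetical names standing in for the real tables:
CREATE TABLE skeleton (
    key_id INT PRIMARY KEY,   -- the constraint is what enforces de-duplication
    col_a  TEXT,
    col_b  NUMERIC
);
INSERT INTO skeleton (key_id)
SELECT DISTINCT key_id FROM expected_keys;   -- exactly the rows they were expecting
UPDATE skeleton s
SET col_a = j.col_a,
    col_b = j.col_b
FROM complicated_join j        -- stand-in for the multi-hour join (e.g. a view)
WHERE j.key_id = s.key_id;     -- duplicate matches collapse onto the single skeleton row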
yunolearn 147 days ago [-]
Nobody is entitled to any of this being easy. If you don't like working with badly-designed databases, why not simply work with people who know how databases work? In the meantime, I have bad news: nobody is going to do the hard work for you.
yunolearn 147 days ago [-]
To verify, you have to think through the problem and arrive at a solution, like all of programming.
sgarland 148 days ago [-]
> People endlessly want hacks, workarounds, and un-normalized data structures, for a single reason - they don't want to have to think about JOIN. It's not actually for performance reasons, it's not actually for any reason other than it's easier to imagine a big table with lots of columns.
That's because, inexplicably, devs by and large don't know SQL, and don't want to learn it. It's an absurdly simple language that should take a competent person a day to get a baseline level of knowledge, and perhaps a week to be extremely comfortable with.
As an aside, something you can do is create views (non-materialized) for whatever queries are desired. The counter-argument to this is that it slows development velocity, but then, so do devs who don't know how to write joins.
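For example, a sketch using the apple/box/house schema from upthread:
CREATE VIEW apples_in_houses AS
SELECT a.*, h.house_id
FROM apple a
JOIN box   b ON b.box_id   = a.box_id
JOIN house h ON h.house_id = b.house_id;
-- devs then query the view without spelling out the join chain:
SELECT * FROM apples_in_houses WHERE house_id = 42;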
rtpg 148 days ago [-]
it's "absurdly simple", but then you get presented with a bunch of weird things that look like abstraction ceilings like "oh you can't refer to the select clause alias you made in the filter because despite that showing up first lexicographically the ordering is different" and "oh you don't have to refer to the table name except when you do because of ambiguity issues".
I think there's a beautiful space for some SQL-like language that just operates a bit more like a general-purpose language in a more regular fashion. Bonus points for ones where you don't query tables but point at indexes or table scans and the like (resolving the "programmer writes query that is super non-performant because they assume an index is present when it's not").
I think it's still super straightforward to sit down and learn it, but it's really unfortunate that we spend a bunch of time in school learning data structures and then SQL tries really hard to hide all that, making it pretty opaque despite people intuitively understanding B-Trees or indexes.
iTokio 147 days ago [-]
SQL separates query definition from implementation because there is a planning phase between them that can be sometimes quite complex.
To choose the best path to retrieve data, you have to know what the possible paths are (using the underlying data structures and indexes, but also the different algorithms to filter, join…), and you should also know some data metrics to evaluate whether some shortcuts are worth it (a seq scan can be the best choice for a small table).
And the thing that will trip up most humans is that you need to reevaluate the plan if the underlying assumptions change (the data distribution has become something you never expected).
Note that the planner is also NOT always right, it heavily relies on heuristics and data metrics that can be skewed or not up to date.
Some databases allow the use of hints to choose an index or a specific path.
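And when the plan looks wrong, the usual first step is simply asking the database what it decided, e.g. (Postgres syntax; MySQL 8+ has a plain EXPLAIN ANALYZE):
EXPLAIN (ANALYZE, BUFFERS)
SELECT o.*
FROM orders o
WHERE o.customer_id = 42;
-- the output shows the chosen scan and join strategies plus estimated vs actual row counts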
rtpg 147 days ago [-]
My honest experience in a skilled team has been that people more or less start off thinking "OK, what indices do we need to make this performant", work off of that, and then in the end try to have queries that hit those ones.
I understand the value of full declarative planning with heuristics, but sometimes the query writers do in fact have a better understanding of the data that will go in.
And beyond that, having consistent plans is actually better in some ideologies! Instead of a query suddenly changing tactics in a data- and time-dependent way, having consistent behavior at the planning phase means that you can apply general engineering maintenance tactics: keep an eye on query perf, improve the things that need to be improved... There is still the possibility of hitting absolutely nasty issues, but the fact that every SQL debugging session starts with "well, we gotta run EXPLAIN first, after the fact" is actually kind of odd!
mike_hearn 147 days ago [-]
There's an interesting experiment in this direction here:
It maps Java objects to a scalable transactional K/V store of your choice, and handles things like indexing, schema migrations and the rest for you. You express query plans by hand using the Java collections and streams framework.
flyingsilverfin 147 days ago [-]
I wanted to jump in here and say that what we're working on at typedb.com for our 3.0 version (coming soon in alpha!) is taking our earlier database query language and making it much more programming-like: functions, errors containing stack traces, more sophisticated type inference, queries as streams/pipelines... I think it's super exciting and has a huge horizon for where it could go by meshing in more ideas from PL design :)
Incidentally, I think it also addresses what a lot of the comments here are talking about: not learning JOINs, indexing, built-in relation cardinality constraints, etc. But that's a separate point!
sgarland 147 days ago [-]
> TypeDB models are described by types, defined in a schema as templates for data instances, analogous to classes. Each user-defined type extends one of three root types: entity, relation, and attribute, or a previously user-defined type.
This sounds like an EAV table, which is generally a bad idea.
Re: types, Postgres allows you to define whatever kind of type you’d like. Also re: inheritance, again, Postgres tables can inherit a parent. Not just FK linking, but literally schema inheritance. It’s rare that you’d want to do this, but you can.
In general, my view is that the supposed impedance mismatch is a good thing, and if anything, it forces the dev to think about less complicated ways to model their data. The relational model has stuck around because it’s incredibly good, not because nothing better has come around.
EDIT: this came across as quite harsh, and I’m sorry for the tone. Making a new product is hard. I’m just very jaded about anything trying to replace SQL, because I love it, it’s near-universal, and it hasn’t (nor is likely to) gone anywhere for quite some time.
golergka 147 days ago [-]
> I think it's still super straightforward to sit down and learn it, but it's really unfortunate that we spend a bunch of time in school learning data structures and then SQL tries really hard to hide all that, making it pretty opaque despite people intuitively understanding B-Trees or indexes.
That's one of the best things about SQL, its declarative nature. I describe the end result, the data I want to receive — not the instructions on how it should be done. There's no control flow, there's no program state, which means that my mental model of it is so much simpler.
lucianbr 147 days ago [-]
Yes but for performance you need to know how it is done, which defeats the declarative point.
I've read countless articles on how to rearrange the "declaration of what you want" in order to get the database to do it in a fast way.
paulmd 147 days ago [-]
> I've read countless articles on how to rearrange the "declaration of what you want" in order to get the database to do it in a fast way.
While this is doubtlessly true, in many cases the “rearranging” also involves a subtle change in what you are asking the database to do, in ways which allow the database to actually do less work.
SELECT 1 WHERE EXISTS vs WHERE ID IN (SELECT ID FROM mytable WHERE …) is a great example. The former is a much simpler request despite functionally doing the same thing in the common use-cases.
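Roughly, the two shapes side by side (hypothetical tables):
SELECT o.id
FROM orders o
WHERE o.customer_id IN (SELECT c.id FROM customers c WHERE c.region = 'EU');
SELECT o.id
FROM orders o
WHERE EXISTS (SELECT 1 FROM customers c
              WHERE c.id = o.customer_id AND c.region = 'EU');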
sgarland 147 days ago [-]
Yes, although both MySQL and Postgres will automatically rewrite the latter to the former, in most circumstances. I still find it good practice to explicitly write the semijoin, to demonstrate the intent.
MySQL, but not Postgres, will also convert the inverse to an antijoin.
As an aside, this is also often a great example to demonstrate why you shouldn’t use SELECT *, since you can’t write a semijoin if you need the predicate column in the result.
lucianbr 146 days ago [-]
I don't see how that is supporting "SQL is declarative and therefore simple". You've kept the declarative part, but it's not simple anymore. You need to know multiple declarative ways to ask for the same thing, and the consequences of each. Might as well just tell the DB how you want to do it.
sgarland 147 days ago [-]
Sometimes you do, yes. Often times, though, the issue is that the statistics for the table are wrong, or the vacuum (for Postgres) hasn’t been able to finish. Both of these are administrative problems which can be dealt with by reading docs and applying the knowledge.
I think of RDBMS like C: they’re massively capable and performant, but only if you know what you’re doing. They’re also very eager to catch everything on fire if you don’t.
sgarland 147 days ago [-]
> "oh you don't have to refer to the table name except when you do because of ambiguity issues"
Maybe it's easier if you think of it like helpful syntactic sugar?
> I think there's a beautiful space for some SQL-like language that just operates a bit more like a general-purpose language in a more regular fashion. Bonus points for ones where you don't query tables but point at indexes or table scans and the like
That sounds like imperative programming, which is fine for most things, but [generally] not RDBMS (or IaC, but that's a completely separate topic). You can't possibly know the state of a given table or index – the cardinality of columns, the relative grouping of tuples to one another, etc. While you can hint at index usage (natively with MySQL, via extension with Postgres), that's as close as the planner will let you get, because it knows these things better than you do.
> resolving the "programmer writes query that is super non-performant because they assume an index is present when it's not"
More frequently, I see "programmer writes query that is super non-performant because they haven't read the docs, and don't know the requirements for the planner to use the index." A few examples:
* Given a table with columns foo, bar, baz, with an index on (foo, bar), a query with a predicate on `bar` alone is [generally] non-sargeable. Postgres can do this, but it's rare, and unlikely to perform as well as you'd want anyway.
* Indices on columns are unlikely to be used for aggregations like GROUP BY, except in very specific circumstances for MySQL [0] (I'm not sure what limitations Postgres has on this).
* Not knowing that a leading wildcard on a predicate (e.g. `WHERE user_name LIKE '%ara'`) will, except under two circumstances [1], skip using an index.
> despite people intuitively understanding B-Trees or indexes.
You say that, but the sheer number of devs I've talked to who are unaware that UUIDv4 is an abysmally bad choice WRT performance for indices – primary or secondary – says otherwise.
[1]: Postgres can create trigram indices, which can search with these, at the expense of the index being quite large. Both MySQL and Postgres can make use of the REVERSE() function to create a reverse index on the column, which can then be used in a query with the username also reversed.
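A rough sketch of the REVERSE() trick, assuming a hypothetical users table (Postgres expression-index syntax):
CREATE INDEX users_name_rev_idx ON users (reverse(user_name) text_pattern_ops);
-- "names ending in 'ara'" becomes a left-anchored search on the reversed column:
SELECT * FROM users WHERE reverse(user_name) LIKE 'ara%';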
rtpg 147 days ago [-]
In the universe in which I "know" what index to use (an assumption that can be contested!), you saying "well silly you, the planner is too stupid to figure this out" is not a great defense of the system!
My serious belief is that all the SQL variants are generally great, but I just want this to be augmented with some lower-level language that I can be more explicit with, from time to time. If only because sometimes there are operational needs.
The fact that the best we get with this is planner _hints_ is still to this day surprising to me. Hints! I am in control of the machine, why shouldn't it just listen to me! (and to stop the "but random data analyst could break thing", this is why we have invented permission systems)
sgarland 147 days ago [-]
> you saying "well silly you, the planner is too stupid to figure this out" is not a great defense of the system!
I don’t think that’s what I said. At least, it wasn’t what I meant.
The planner needs accurate stats on the tables to produce an optimal output. So would you, as a human, if you were writing a query to be executed precisely as written.
It also needs to know information about the system it’s operating in – the amount of RAM it can expect to use being the most important one, but also things like the relative performance of a random disk seek vs. sequential.
The planner also has a huge leg up on most people, in that it is aware of all of the kinds of joins and lookups it can do. Off the top of my head, Postgres can scan in the following ways: sequential scan, parallel sequential scan, index scan, parallel index scan, index-only scan, and bitmap scan. For joins, it can use: nested loop, merge, hash. Do you know when all of those are ideal? What about when you have a query joining multiple tables, keeping in mind that the intermediate contents of a table that has been joined once may look nothing like what it started as?
Izkata 147 days ago [-]
> Off the top of my head, Postgres can scan in the following ways: sequential scan, parallel sequential scan, index scan, parallel index scan, index-only scan, and bitmap scan.
I don't think that last one is a thing. There is both "bitmap heap scan" and "bitmap index scan" though.
sgarland 146 days ago [-]
It’s the generalized parent of both of those (since you will not see one without the other). You’re technically correct, but I’m also not the only one [0] who uses this term.
> * Given a table with columns foo, bar, baz, with an index on (foo, bar), a query with a predicate on `bar` alone is [generally] non-sargeable. Postgres can do this, but it's rare, and unlikely to perform as well as you'd want anyway.
The one I see many times more often than this is having one index on (foo) and one index on (bar), then querying on both foo and bar, and assuming the database uses both indexes for best performance. It can be done, even by ancient mysql, but it's most likely going to be a lot slower than expected so they're heavily biased against doing it and will most likely just choose one of the two indexes.
pif 147 days ago [-]
> devs by and large don't know SQL, and don't want to learn it
One huge problem is that learning SQL will take you nowhere, unless your schema consists of a few tables with a handful of rows at most.
There is no standard SQL, and there is no serious application development without having to delve into the sanctum sanctorum of your database vendor.
sgarland 147 days ago [-]
I have worked in massive DBs (tables with billions of rows) in both Postgres and MySQL. I occasionally have to look at docs to figure out if what I want to do is supported by the particular version, but that’s about it.
There is a standard SQL, and vendors are free to add to it. Generally speaking, though, you can get quite far with the standard.
> there is no serious application development without having to delve into the sanctum sanctorum of your database vendor.
OK? I could say the same thing about needing to intimately know the differences in C++ versions. Yes, you should know your tools well.
rawgabbit 148 days ago [-]
Most databases have the concept of foreign keys. You declare how the tables relate to each other. You can then write a script that queries this metadata to write the join for you. I did this sort of thing over twenty years ago.
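Something like this (Postgres flavour; MySQL exposes the referenced table directly in key_column_usage) pulls out the relationships you'd generate the joins from:
SELECT tc.table_name   AS child_table,
       kcu.column_name AS child_column,
       ccu.table_name  AS parent_table,
       ccu.column_name AS parent_column
FROM information_schema.table_constraints tc
JOIN information_schema.key_column_usage kcu
  ON kcu.constraint_name = tc.constraint_name
JOIN information_schema.constraint_column_usage ccu
  ON ccu.constraint_name = tc.constraint_name
WHERE tc.constraint_type = 'FOREIGN KEY';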
Spivak 147 days ago [-]
Most ORMs have this feature as well and sorry to say you will hit edge cases where the join is ambiguous and have to manually specify it pretty fast.
KronisLV 147 days ago [-]
> where the join is ambiguous
Join table that maps to an entity in the middle.
You can even have multiple columns that have foreign keys against various tables, like some_table_id, other_table_id, another_table_id with only the needed ones being filled out.
It's not like you have to particularly care about the fact that most of those columns will be empty in practice, as opposed to making your database hard to query or throwing constraints aside altogether.
Terr_ 147 days ago [-]
> Join table that maps to an entity in the middle.
I'm not sure what you mean. Are you saying that instead of FK/PK relations like:
user.residence_country_id = country.id
plus user.citizenship_country_id = country.id
Even then, the reverse "I have a country gimme a user" request is ambiguous.
Izkata 147 days ago [-]
GP misunderstood what an "ambiguous join" is. Rather than a join that returns multiple results, they're more like when you have relationships like this [0]:
user.residence1 -> country
user.residence2 -> country
user.join(country)
You have to specify somewhere which foreign key to follow, it can't be figured out automatically. GP's suggestion doesn't work because there's no way to automatically determine which one to use.
I'm only really familiar with Django, and it handles this type of thing by making you specify the "residence1" name instead of the table name "country".
[0] To refer back to the top of this comment chain, that user is complaining they can't do something like also have a "country->continent" relationship and do "user.join(continent)" and have the system figure out the two joins needed.
rawgabbit 145 days ago [-]
Putting on my dogmatic 3NF purist hat. Instead of denormalized
Oh definitely, I just couldn't think of a better example off the top of my head. But also not really relevant: You could also have another table and the relationships be A -> B -> D and A -> C -> D, with the desired join between A and D.
Remember this whole thread is about the system figuring out the JOIN criteria automatically.
pjs_ 147 days ago [-]
Yep, that's exactly what I did - using the ORM schema definition rather than the schema directly, but same difference
agent281 147 days ago [-]
Some databases have the concept of a natural join. It joins two tables on common columns. Not quite what I would want. I would prefer something like a key join that uses foreign key relationships. I don't know if any database has that though. (If anybody does, please let me know.)
Natural join is not jargon that I had heard before but unfortunately it does not refer to what I am talking about - I'm talking about something where the computer automatically figures out how to join across more than two tables.
yunolearn 147 days ago [-]
It's not just databases, it's everything: git, HTTP, even programming language features. The industry collectively decided you don't have to actually _know_ anything about anything anymore. And people wonder why software is so broken...
cess11 147 days ago [-]
That join you can have your ORM solve for you with some annotations or whatever implying eager loading and so on.
But that's only the trivial case, often you want something more complicated, where the order of clauses or what keywords you pick and how you want the results sorted will affect performance in your specific schema on your specific database engine.
RDBMS management and querying is a rather deep and experience demanding subject once you step outside the trivial cases. You could put more of it in application code, but it will be a lot of chatter and probably worse outcomes in terms of performance than having people on the team that are really good with databases.
paulmd 147 days ago [-]
> Unless your schema is really fucked up, there should only be one or two actually sensible ways to join across multiple tables. Like say I have apples in boxes in houses. I have an apple table, a box table, and a house table. apples have a box_id, boxes have a house_id. Now I want to find all the apples in a house. Do two joins, cool. But literally everyone seems to write this by hand, and it means that crud apps end up with thousands of nearly identical queries which are mostly just boringly spelling out the chain of joins that needs to be applied.
You are primarily describing natural join here. It won’t read your mind and join your tables automatically, but it automatically uses shared columns as join keys and removes the duplicate keys from the result.
The problem with any “auto” solution is going to be things like modified dates, which may exist in multiple tables but aren’t shared.
Even more magic is natural semi join and natural anti join.
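With the apple/box/house example from upthread, and assuming the keys are the only shared column names, it looks like:
-- apple(apple_id, box_id, ...), box(box_id, house_id, ...), house(house_id, ...)
SELECT apple.*
FROM house
NATURAL JOIN box     -- joins on the shared house_id column
NATURAL JOIN apple   -- joins on the shared box_id column
WHERE house_id = 42;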
Taikonerd 147 days ago [-]
> crud apps end up with thousands of nearly identical queries which are mostly just boringly spelling out the chain of joins that needs to be applied.
I think EdgeDB [0] is on the right track here. It's a database built on Postgres, but with a different query language. So instead of manually fiddling with JOIN statements, traversing linked tables just looks like dot notation:
select BlogPost.author.email; # get all author emails
Django is a Python web framework, over 15 years old, that's also very similar to what GP wants. For example, their apple->box->house query would be something like:
Apple.objects.get(box__house__owner = 'pjs_')
Still have to specify "box", but the actual foreign key relationships are defined on the models so you don't do the full JOIN statements anywhere.
crazygringo 148 days ago [-]
> In the end I wrote a join tool myself, which can automatically find paths between distant tables in the schema, but it seems ludicrous to have to homebrew something like that.
I mean, I think the issue is just that there are lots of possible paths so it can't be automated.
If you want to join employees to buildings, is it where their desk is assigned today, or where their desk was assigned two years ago, or where their team is based even though they work remotely, or where they last badged in?
Sure you can build a tool to find all potential joins based on foreign keys, but then how do you know which is correct unless you understand what the tables mean? And then if you understand what the tables mean, writing the join out yourself is trivial.
> Unless your schema is really fucked up, there should only be one or two actually sensible ways to join across multiple tables.
In my experience, having just one or two ways is for simple/toy projects. Lots of joins in no way means a schema is "fucked up". It probably just means it's correctly modeling actual relationships, correctly normalized.
pjs_ 147 days ago [-]
Yes, this is accurate. You often end up with many solutions and human judgement is sometimes required in the end to pick the right strategy. However, what I have found in practice is that heuristics and hinting can rapidly cut through that complexity. E.g. "always pick the shortest path between tables, and never use this set of tables as intermediate nodes on any path" rules out a ton of options, and usually will leave you with one or two, and usually those are the natural choices.
In this way you can use polynomially-many constraints or rules to avoid exponentially-many weird or exceptional routes through the schema. I am optimistic that you can build a system where by default, the autojoin solution is the natural one maybe 90% of the time. There will certainly be exceptions where you have to express the join conditions explicitly. But I think you can dramatically reduce the amount of code required.
I would also hazard the suggestion that this might produce productive backpressure on the system. If the autojoiner is struggling to find a good route through the schema, it's possible that the schema is not properly normalized or otherwise messed up.
Terr_ 147 days ago [-]
Real-world example: Someone wants Job-Application by Country, but that could mean via Applicant's Residence, the Applicant's Nationality, the Requisition Primary location, or one of the Requisition's Satellite Offices, etc.
... And god help you if someday Requisitions need to have Revisions too.
srcreigh 147 days ago [-]
You should learn about some basic database internals. In particular the data structure for indexes, the difference between primary and secondary indexes, join algorithms, page size being 8KiB, etc.
That knowledge gets you to a place to understand that most joins are very slow due to each DB page containing only 1% useful information for the query algorithm.
It will help you see reasons why ppl do things like put an array column in, or use wide tables, etc.
pjs_ 147 days ago [-]
We use the tool that I wrote on a database with tens of millions of rows. Because the keys are properly indexed, we rarely have performance issues associated with joins. Joins use the index, the Postgres planner is really good, and we can join across heaps of tables with great performance.
We have database performance issues for other reasons (loading of redundant information, too many queries) but not really because of joins.
mrgoldenbrown 146 days ago [-]
Here's one example of a common complication: Let's say your apple example is for an apple trading app. Each house is related to 2 boxes. One is available_for_trade_box_id and another is save_for_eating_box_id. How does your auto join tool know which to use?
Spivak 148 days ago [-]
I think there's a huge difference between how you design:
1. A database that's meant to be understood by programmers, make queries by the application efficient in space and time, and provide strong referential integrity.
2. A database that's meant to be played with by humans, where duplication is fine, referential integrity is a nice-to-have, every column is okay to be nullable, tables should contain complete objects as understood by the user that map 1-1ish to the real world, that eliminate as many opaque ids as feasible, and foreign keys might exist but aren't enforced in any direction.
The latter database is far more ergonomic, and you won't run up against a user frustratingly bashing their keyboard because the database just refuses to do what they want. The stakes in #2-style databases are extremely low -- let the user destroy their copy of the database; it's fine, we can reload from the last save.
The nice thing is that it seems very possible to go from #1 -> #2 mechanically, and hand that off to the users who want to play with it.
jiggawatts 148 days ago [-]
Some technical design elements can enable this style of non-programmer usage that most databases engines do not currently support.
First, treat each column separately in the physical engine (column store). Users ought to never need to worry about column count limits, row byte limits, or sparsity. Similarly, schema operators such as adding, removing, or reordering columns ought to always be instant atomic changes.
Imagine how much faster it would be possible to explore the schema design space if changes were instant and didn’t require “migration scripts” or data copies from an old table to a new one. It would make the database feel more like a spreadsheet!
Next, there ought to be Git-style forking and merging, so that creating a test environment is a lightweight operation just like creating a local branch in a Git repo. Merging changes can be either schema-only or schema+data -- the latter to support slowly-changing "master data" editing workflows. Currently, few if any databases support the equivalent of a pull request with reviews to merge data. Hence the excessively complex access controls that could all be replaced with a single "peer review" operation.
I’ve seen a fancy category-theoretic approach where forks of the database can receive live data updates from production. This would allow UAT and similar environments to be evergreen with zero infrastructure code such as nightly ETL sync jobs.
Many operational tasks can be eliminated by tying live environments of both the code and data to Git branches of a single (mono) repo. Now the code and the data schema can't go out of sync! Literally impossible. No need for scaffolding, or ORMs, or any of those layers of overcomplicated abstractions! The schema is just "there", in the repo, always representing reality.
Sprinkle on a Microsoft Access style form designer but with HTML5 and guest user support and you have a billion dollar product.
refset 148 days ago [-]
The 4GL dream hinges on figuring out how to transition from #2 -> #1 seamlessly. The spectrum is wider and more complex than "Excel -> Postgres", but even solving that journey would be a good start. It could save us all from a whole bunch of ~needless engineering work.
> it seems very possible to go from #1 -> #2 mechanically, and hand that off to the users who want to play with it
NocoDB seems like a reasonable attempt of taking this approach for Postgres.
feoren 145 days ago [-]
> The 4GL dream hinges on figuring out how to transition from #2 -> #1 seamlessly
I guarantee you there is no algorithmic way to do this. Going from "big flat file" to "nicely organized 3+NF schema" adds a lot of underlying domain knowledge and information to the system. You're asking for "enhance!" on the database schema. It will always require expert understanding.
josephg 148 days ago [-]
Something that's always bugged me about relational database modelling is how you have to use table relationships for everything. Humans have a special category for ownership (eg Order owns DeliveryAddress), which works differently from other kinds of relationships. Eg, Order references Products.
This problem is heightened by the fact that a SQL table typically can't store composite types. Like, you can't make a column of lists-of-strings. Every modelling problem is solved by using more tables. And any nontrivial application ends up with a table explosion. You have to study your database very closely, then write complex, slow, join-heavy queries, just to get the information about simple stuff - like an order.
Solving every problem with tables might seem clever from a technical perspective, but it's just not how humans think. There's a reason document databases like mongodb are popular. Even if they are worse technically, they're much easier to use and reason about.
There's no reason SQL databases couldn't support nested data. Postgres sort of does already via JSON fields. But it feels like you have to fight the database to do it.
stult 148 days ago [-]
Your argument is a variant of the object–relational impedance mismatch problem[1]. It's easier for us to reason about objects (or functions and types) than it is to think in terms of SQL tables, so it is much easier to develop complex logic or domain models in general purpose programming languages. Yet data storage and retrieval is much, much more efficient with relational databases, and it is generally a good practice to logically separate the data storage layer from the rest of the system anyway. But implementing that layer to translate from the relational model to the object model frequently involves tons of finicky, manually crafted SQL statements embedded in general purpose code where the linting/type checking/static analysis tooling often isn't great for the embedded SQL. The only other alternatives are using an ORM that automagically handles most if not all of the relational-object mapping or a NoSQL solution like Mongo that avoids the mismatch altogether. Both those alternatives typically involve sacrificing performance and scalability, however, and the limitations and quirks of each frequently contribute to bugs (e.g., schema drift in document DBs, or implementation details of the ORM that make it hard to map nested relationships).
I don't think it's just about storage and retrieval being efficient. Sometimes it _is_ easier to think in terms of tables.
I have a bunch of posts, written by various authors, on various sites. Store that in some OOP way, with ownership going in some direction (For example, sites own posts). Now look up the most recent posts for authors. Look up the most recent posts per site. Look up the most prolific authors. Do the authors work cross-site? Maybe they don't but people use the same name across sites.
There are plenty of times I have put things into a database not for perf reasons, but simply because it's easier to do queries that way than to write a bunch of bespoke logic for each variant of certain queries, with ownership arrows being all over the place.
minkles 148 days ago [-]
I think this is a poor understanding and laziness. SQL is type and schema first and people hate that because it makes things hard and complicated up front. Table explosions are rare if you know what you are doing. Many people don't any more.
As for join heavy, complexity, this is not necessarily a problem in reality. It's incredibly easy to scale this out to huge systems (relatively) cheaply.
Believe me as they scale up, they look way less hard and less complicated than arbitrary and poorly enforced schemas in document databases. I could write an essay on how to fuck up MongoDB (or any document store) because I spent nearly 2 years unfucking one.
josephg 147 days ago [-]
I’d love to read that essay. But to me sql looks like crap json - it’s like json except if (for purely historical reasons) we didn’t have objects or lists except at the top level of a json blob.
Claiming it’s a moral failing on my part for wanting nice, intuitive data structures is a bit rich. Go on, next why don’t you tell me that json is bad and we should use xml. Or c is lazy and we should use assembler. Without more detail, that’s all I hear - moral justifications for a dated design.
And the design is dated. As far as I can tell, the reason for the current design is because early sql databases had a fixed byte size for fields. So in that world, it would be impossible to embed a list in a cell. But that limitation hasn’t existed in modern sql databases for decades now.
Why can’t we have nice things? I don’t see anything in sql that precludes embedded lists and objects in table cells. Postgres even - sort of - supports it already via json fields.
cess11 147 days ago [-]
If JSON is enough for your use case, go ahead, use it. The big RDBM systems support it rather well now, so you can bring it straight into the SQL world too.
XML is for the non-trivial case, where having schemas and tooling around them for validation and code generation becomes useful. XML also has a default programming language for querying and transformations, allowing you to implement adapters from CSV, JSON or whatever into XML, and then between XML schemas.
XML can be used in a JSON-like way if that's what you want: XPath. When programming against XML it's usually not what you want though; in practice it tends to be nicer to use the schema to set up an object hierarchy native to the programming language and unmarshal into that. But there is at least one XPath-oriented database, eXist, and it's pretty ergonomic if you find yourself with a lot of XML and aren't sure what it contains or what to use it for. I find it easier to work with than the NoSQL JSON varieties I come across professionally, and when I have Questions For The Customer it craps out PDFs I can send them, which doesn't seem that easy to do with Mongo or Firestore.
As for SQL databases, they come in a lot of shapes and colours. I'm not so sure there is a JSON-first database equivalent of DuckDB, for example. Maybe I'm wrong, it's not a specialty of mine.
minkles 147 days ago [-]
It's nothing even remotely like JSON. It doesn't encode the same information or semantics at all.
Your entire argument is "old stuff is bad and I like JSON".
And yes we should use XML. At least it has competent, verifiable, consistent schemas and contracts. At the very least, you can encode xs:decimal in it unlike JSON which is "oh it's some shit that looks like a number and we might stick in a double at the other end because we don't know our butt from our elbow".
I ask you with JSON, how do you encode "add a single item to the list attached to a customer"
I am so so so tired of dealing with this level of thinking on a daily basis.
josephg 147 days ago [-]
I must admit, I'm having more of a go at SQL than I really feel. SQL nailed transactional semantics, indexes and querying.
My argument isn't that "old stuff is bad and I like JSON". Its that when I'm modelling data in my program, sometimes I have objects which reference each other (via key or shared pointer). And sometimes I have objects which contain another object. Eg, a struct with a vec inside. Real code is packed full of this stuff.
SQL lets me directly model relationships between objects. And it lets me model records with scalar fields. But it does not let me model records with variable size fields. This restriction seems totally arbitrary. There is no technical reason for it that I can think of. And it makes it needlessly complex to map data from my programming language to my SQL database.
I've named this problem 3 times in this thread now. For all your moaning about us kids being lazy, you still haven't given a single technical reason for this deficiency in SQL.
I largely agree with you about JSON. I don't actually want JSON fields in my database. I want something strongly typed, with well defined semantics and a good query system. SQL has a great track record of providing just that. And it has a perfectly serviceable type system already. So why on earth is it missing this important, obvious feature?
I just want my database to be able to express embedded lists and records, like I do in almost every programming language I've ever used, every messaging format I've used (protobuf, msgpack, json, xml, etc). And in many other databases (couchdb, mongo, foundationdb, etc).
> I ask you with JSON, how do you encode "add a single item to the list attached to a customer"
I can think of dozens of ways to express that. Postgres already supports that operation for embedded JSON fields using this ugly syntax:
UPDATE <table> SET <field-name> = <field-name> || '{"a": 1}'::jsonb
... Do you need more examples? Basically every database natively supports this except for "raw" SQL databases.
Edit: Apparently (according to other comments in this thread) some SQL databases are already moving in this direction anyway. How wonderful!
consteval 147 days ago [-]
> But it does not let me model records with variable size fields. This restriction seems totally arbitrary
Yes it does. Customers have orders which are an array. You have two tables then CUSTOMER and ORDER and you JOIN them. Why not just put the orders inside of CUSTOMER? Because now you can't query it, because you don't know how many columns will come out and you can't have disparate columns across rows.
So maybe you dump it all in one column, but obviously that has problems in terms of forming relations.
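A minimal sketch of that two-table shape, with hypothetical columns:
CREATE TABLE customer (customer_id INT PRIMARY KEY, name TEXT);
CREATE TABLE customer_order (          -- "order" itself is a reserved word
    order_id    INT PRIMARY KEY,
    customer_id INT NOT NULL REFERENCES customer,
    total       NUMERIC
);
-- "all of this customer's orders" is just:
SELECT o.* FROM customer_order o WHERE o.customer_id = 42;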
Sure, it's a different way of thinking. But it's faster, it's MUCH safer, and the invariants are actually properly specified.
Sure you can use mongodb and that will work for wild-westing your way through software. But I wouldn't dare touch a mongo instance without going through the application, because all the constraints are willy-nilly implicitly applied in the application. But I directly view and edit SQL databases daily.
josephg 147 days ago [-]
> You have two tables then CUSTOMER and ORDER and you JOIN them. Why not just put the orders inside of CUSTOMER? Because now you can't query it, because you don't know how many columns will come out and you can't have disparate columns across rows.
Why wouldn't you be able to query it? You can query JSON fields that contain lists. Why not SQL fields with strongly typed lists?
> Sure, it's a different way of thinking. But its faster, its MUCH safer, the invariants are actually properly specified.
Why would it be safer?
An embedded list has pretty clear and obvious semantics. And as ekimekim said in another comment, postgres already has partial support. Apparently this works today:
CREATE TABLE example (
height number_with_unit, -- our composite type, eg. (6, 'ft') or (180, 'cm')
known_aliases TEXT[], -- list of string
active_times TSRANGE -- time range, ie. (start, end) timestamp pair
);
An embedded list also sounds much faster to me - because you don't have to JOIN. Embedding a list promises to the database "I'll always fetch this content in the context of the containing record". Instead of (fetch row) -> (fetch referencing key) -> (fetch rows in child table), the database can simply fetch the associated field directly.
> But I wouldn't dare touch a mongo instance without going through the application, because all the constraints are willy-nilly implicitly applied in the application.
Yes, I hate mongodb as much as you do. I want explicit types and explicit invariants. But right now mongodb has useful features that are missing / unloved in SQL. How embarrassing. SQL databases should just add support for this approach to data modelling. Its nice to see that postgres is trying exactly that.
In programming, I don't have to choose between javascript and assembly. I have nice languages like rust with good type systems and good performance. We can have nice things.
housecarpenter 146 days ago [-]
As I understand it, the main advantage of having separate CUSTOMER and ORDER tables, rather than just having a field on CUSTOMER which is a list of order IDs, is that the latter structure makes it easy for someone querying the database to retrieve a list of order IDs for a given customer ID (just look up the value of the list-of-order-IDs field), but difficult to retrieve the customer ID for a given order ID (one would have to go through each customer and search through the list-of-order-IDs field for a match). The former structure, on the other hand, makes both tasks trivial to do with a JOIN.
josephg 146 days ago [-]
What you're describing is a straightforward indexing problem. There's nothing stopping the database building an index which can look up customer IDs for a given order ID. You could even, if the database allows it, build a virtual ORDER table projection that you could query directly.
Swizec 148 days ago [-]
> Like, you can't make a column of lists-of-strings.
Most relational databases have supported lists-of-strings (arrays) for about 10 years now. All of them supported lists-of-strings even before that, if you were okay doing a bit of app-level work on top of your SQL.
Relational databases these days support full JSON. And they're faster at it than many json-first databases :)
This is changing. Postgres (as you noted) supports JSON, and it also has excellent native support for arrays, range types, and composite types. For example:
CREATE TYPE number_with_unit AS (
value DOUBLE PRECISION,
unit TEXT
);
CREATE TABLE example (
height number_with_unit, -- our composite type, eg. (6, 'ft') or (180, 'cm')
known_aliases TEXT[], -- list of strings
active_times TSRANGE -- time range, ie. (start, end) timestamp pair
);
SELECT (height).unit,
known_aliases[1] AS primary_alias,
upper(active_times) - lower(active_times) AS time_active
FROM example;
globular-toast 147 days ago [-]
You need to be aware of the tradeoffs, though. For example, adding an alias in your example could easily result in a lost update: if two users add an alias at the same time the second write will overwrite the first. You can go up an isolation level or check it in your application (do an `UPDATE ... WHERE` then check it went through), of course, but with relational tables you get this for free without any worry or performance impact.
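One way to sidestep that particular race is to do the append in a single statement rather than read-modify-write in application code. A Postgres-flavoured sketch (the id column is assumed; it isn't in the example table above):

    -- both concurrent writers' aliases survive: each UPDATE appends to the row's current value
    UPDATE example
    SET known_aliases = array_append(known_aliases, 'new alias')
    WHERE id = 1;

That only covers the blind-append case, of course; the point that relational tables give you this for free still stands.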
codr7 148 days ago [-]
Easier to reason about in that they don't force you to clarify your thoughts, which will come back to bite your head off. And there's a pretty significant long term price to be paid. I would rather write Cobol than deal with MongoDB.
Spivak 147 days ago [-]
This is all nice and good but I've never once seen someone correctly clarify their thoughts ahead of time that doesn't end up being the wrong model as the code evolves.
The only schemas that survive long term are the ones that create such a tangled ball of foreign keys that migrations are infeasible. Great advice for anyone looking to leave a legacy.
codr7 147 days ago [-]
At least they exist, and force you to refactor the code to match reality.
Document databases tend to take on a life of their own; and since they encourage shortcuts, the slope is typically down.
yas_hmaheshwari 147 days ago [-]
Nicely put!
I see new grads talking about the MERN stack (with M for Mongo), and I always have an urge to tell them to replace the M with MySQL.
AdieuToLogic 147 days ago [-]
> Something thats always bugged me about relational database modelling is how you have to use table relationships for everything.
This is pretty much the definition of what a RDBMS is. The tables provide the relations via designated columns. Relation navigation is therefore possible in either direction given an established relationship, and in the pathological case it is often possible starting from any table.
If this conceptual model is not appropriate, it is best not to use it to begin with.
> This problem is heightened by the fact that a SQL table typically can't store composite types.
That is not what RDBMSs are designed to do, even though many vendors now support some form of composite data types (like JSON). Just like a moped is not designed to move a family across a continent.
Pick the persistent storage mechanism which best fits the problem at hand:
- If the problem calls for a document structure, use a document database.
- If the problem calls for a directed graph structure, use a graph database.
- If the problem calls for a relational structure, use a RDBMS.
- If the problem calls for some combination of the above, consider using multiple storage technologies.
None are perfect, but starting with the most appropriate persistent store can eliminate the weaknesses others would introduce in the same solution space.
jeeyoungk 148 days ago [-]
+1
For example, BigQuery has natural support for arrays and nested data, and it's quite nice (essential, even) for good data modeling. For instance, "tags" can be stored as `Array<Struct<Key, Value>>`, and this can be used to implement things like "search for rows carrying particular tags".
This reduces the cognitive burden of remembering which tables join with which, especially if we know that a relationship is only relevant in one context, i.e. tags can only be joined to the main table, and no other joins make sense.
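A BigQuery-flavoured sketch of that "search by tag" pattern (table and tag names are made up):

    -- events.tags is ARRAY<STRUCT<key STRING, value STRING>>
    SELECT e.id
    FROM events AS e
    WHERE EXISTS (
      SELECT 1 FROM UNNEST(e.tags) AS tag
      WHERE tag.key = 'env' AND tag.value = 'prod'
    );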
zie 147 days ago [-]
I feel like you have never bothered to fully learn SQL and databases, as if you had, you wouldn't have said most of what you have said.
Certainly some problems are not friendly to SQL databases, but many problems are, and lots of problems are being added into SQL databases too. I would not be surprised in the least if LLMs start getting stored in SQL databases eventually, as you probably can reduce their storage to tables, and SQL databases are amazing at table storage.
I agree it's not really how many humans think (though that can be said about many computer things), but that doesn't mean you can't learn how to do it.
globular-toast 147 days ago [-]
Like some siblings mention, this is the object-relational impedance mismatch. A decent ORM like SQLAlchemy goes some way to closing this gap. Less good ORMs like Django don't help much at all, though.
When you talk about "just" using nested data in your database you seem to be forgetting a whole load of problems that relational databases solve like concurrency issues, write performance, data integrity and consistency etc. The book Designing Data-Intensive Applications by Kleppmann is an essential read.
boxed 147 days ago [-]
Heh. I was just going to comment on how Django's ORM is great because it cleanly and transparently solves the number one problem of the paper: the horrible joins.
SQLAlchemy however falls flat on its face in this regard.
globular-toast 147 days ago [-]
Django makes querying easier by generating the code for the joins, but you're still joining tables.
The point about SQLAlchemy was specifically about the issue mentioned by the GP, namely having to make more tables for composites. SQLAlchemy does a better job of mapping db tables back to regular objects. With Django you get the `alias_set` stuff but it never lets you forget that this is really a database table. Django uses the active record pattern while SQLAlchemy is a data mapper.
boxed 147 days ago [-]
JSON fields are horrible though, as they aren't validated so you now have backwards compatibility problems in your SCHEMA. PER ROW. Like in Mongo.
sgarland 148 days ago [-]
Non-scalar data violates 1NF [0]. You _can_ have composite types, arrays, JSON, etc. as others have said, but you are likely giving up relational integrity in doing so, which is the raison d'être of RDBMS. The reason they're so strict is to protect you from yourself (and others). When you get a foreign key violation, that should be cause for alarm, not an annoyance to figure out how to overcome.
> Solving every problem with tables might seem clever from a technical perspective, but its just not how humans think.
Not without practice, no. You can in fact learn to do data modeling.
I’ll also add to my previous comment, that it’s fine to violate normal form if you are aware that you’re doing so, and have a good reason. There are some instances where an array happens to make excellent sense from a performance standpoint, for example, and storing it in RDBMS doesn’t magically make everything explode.
I do disagree with people’s tendency to shift towards treating RDBMS as a way to store schema in JSON, though. Not only is it massively unperformant, and indexing it is a pain, but you can quite easily get yourself into situations where there are multiple versions of your schema-within-a-schema, and that can be a nightmare.
josephg 147 days ago [-]
I don't think I was clear above. I want embedded objects. I wish there was a more "database-native" way to do it than using JSON, and all the ugliness that JSON brings. (Like dynamic typing, schema-within-a-schema, more complex indexing & querying, etc.)
Sounds like something the databases should be able to provide natively, without needing to use JSON at all.
(And apparently support for this sort of stuff is slowly improving! Thanks to all the sibling comments for details!)
jamesblonde 148 days ago [-]
Prof Murat is calling for examples of how LLMs can help make DBs more usable.
I note that Julius.ai is making data analytics easier - just upload some data and ask for some charts and it does an ok job.
Going from natural language to SQL is still far from a solved problem. The main benchmark for this is Bird-Bench and accuracy is only in the high 60s - https://bird-bench.github.io/
I have been working on making tables queryable using function calling, where you have a fixed number of canned queries based on the primary key and an event-time column. PyData talk on this - https://www.youtube.com/watch?v=dRcjTe5qgwM
jalcazar 148 days ago [-]
Gemini generating SQL queries from natural language could be an example of AI making DBs more usable. There are more people who speak natural language than SQL.
Bird-bench has Gemini at 69.03% on the test set.
That is a long way from something you can build on.
randomdata 148 days ago [-]
> The problem of going from natural language to SQL is still a far from solved problem.
Is there any improvement going from natural language to something other than SQL? Certainly SQL can be cut out of the picture if it is what confuses these systems.
zerodensity 148 days ago [-]
What would that "something other" be? If the goal is to talk to a SQL database the output from the LLM would benefit from being SQL.
randomdata 148 days ago [-]
> What would that "something other" be?
Depends. Where is the problem? Is it the quirkiness of SQL? Perhaps something like QUEL or Datalog would yield better results? Is it declarative programming that it struggles with? GPT, for example, seems to be optimized for Python in particular. Perhaps an imperative approach is easier for the LLM to understand? It doesn't even have to be a language suitable for humans. Perhaps it would fare better with something like SQLite's byte code language?
> If the goal is to talk to a SQL database
While being able to talk to an existing SQL database would no doubt simplify the problem in a lot of cases, which is of merit, I doubt that is the actual goal. The user doesn't care about the technology, as they say. Getting the expected results out of the database is undoubtedly the actual goal.
SQL as a target is all well and good if it works reliably, but the claim was that it doesn't. If some other target performs better, there is no need to cling to SQL. It is merely an implementation detail.
sgbeal 148 days ago [-]
> there is no need to cling to SQL. It is merely an implementation detail.
It is, in fact, also the interface. To use your example of SQLite bytecode: once your tool generates it, there is no way to feed that into SQLite. The bytecode is an implementation detail, with SQL being the public interface.
randomdata 148 days ago [-]
But, to stick with your example, you can then modify SQLite to accept byte code input – or straight up write your own database engine that uses said byte code. We already know how to solve that kind of problem. This is, comparatively speaking, child's play.
It is recognized that SQL as a target would theoretically provide a less labour intensive path for reasons of integrating into what already exists, but that only holds if natural language to SQL gets solved, and is not enough harder to solve than an alternative target.
A reasonable stretch goal, but if another target gets you there first, it would be foolhardy to cling to SQL. Replacing the database interface is a much simpler problem to solve.
big_whack 148 days ago [-]
I think the problem is the quirkiness on the English side, not the SQL side. You could translate datalog to SQL or vice versa, but understanding intention from arbitrary english is much harder. And often query results must be 100% accurate and reliable.
randomdata 148 days ago [-]
> I think the problem is the quirkiness on the English side
While likely, the question asked if there was any improvement shown with other targets to validate that assumption. There is no benefit in just speculating.
> And often query results must be 100% accurate and reliable.
It seems that is impossible. Even the human programmers struggle to reliably convert natural language to SQL according to the aforementioned test study. They are slightly better than the known alternatives, but far from perfect. But if another target can get closer to human-level performance, that is significant.
yuliyp 148 days ago [-]
When I find someone claiming a suspicious data analysis result I can ask them for the SQL and investigate it to see if there's a bug in it (or further investigate where the data being queried comes from). If the abstraction layer between LLM prompt and data back is removed, I'm left with (just like other LLM answers) some words but no way to know if they're correct.
randomdata 148 days ago [-]
1. How would the abstraction be removed? Language generation is what LLMs do; a language abstraction is what you are getting out, no matter what. There is no magic involved.
2. The language has to represent a valid computer program. That is as true of SQL as any other target. You can know that it is correct by reading it.
big_whack 148 days ago [-]
Once you have SQL, you have datalog. Once you have datalog, you have SQL. The problem isn't the target, it is getting sufficiently rigorous and structured output from the LLM to target anything.
randomdata 148 days ago [-]
So you already claimed, but, still, curiously we have no answer to the question. If you don't know, why not just say so?
That said, if you have ever used these tools to generate code, you will know that they are much better at some languages than others. In the general case, the target really is the problem sometimes. Does that carry into this particular narrow case? I don't know. What do the comparison results show?
7thpower 148 days ago [-]
This was basically the only reasonable way I found to create a consistent user experience. I think of them as natural language BI, where you have canned reports to answer common categories of questions.
flowerlad 148 days ago [-]
> Users cannot interact with the database directly
It is not super hard to find UIs designed for end users these days.
If you know the basics such as what a relational database is, then here’s is a good UI: https://visualdb.com
minkles 148 days ago [-]
People have forgotten we had this back in 1997 with MS Access.
I built a whole ERP system with it, single handedly including the hardware, software, networking and the workstation deployments without really breaking a sweat. We have gone off the rails somewhere.
flowerlad 148 days ago [-]
Right, but Access is outdated. Microsoft tried to make a web version of Access, but gave up. Now they recommend Power Apps, but its support for databases is pretty weak.
minkles 147 days ago [-]
It still works and it doesn't matter. Only the problem solved matters.
yunolearn 147 days ago [-]
It's old so it's bad?
flowerlad 147 days ago [-]
For starters it is not web-based, does not work on Mac, does not support LLM query generation, etc.
hi41 148 days ago [-]
How do you create the correct indexes? If we create an index for every condition in a WHERE clause, wouldn't the number of indexes become too large? I don't fully grasp how to best create indexes. Does anyone have good references and examples?
beart 147 days ago [-]
Creating good indexes is a nuanced problem to solve. It is highly dependent on the primary use case of the data. For instance, if you have a table representing varieties of beer, you are likely to present an interface for searching by brand, or type (ipa, pilsner). Those would be good targets for indexes. However, you probably won't often search by color, so it won't be worth indexing that column.
Typically I would add any indexes initially that I know will be heavy hitters, then refine as needed based on actual use. Most databases provide tools to help identify missing indexes for common queries.
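A sketch for the hypothetical beer table above:

    -- brand and style are what users actually search by, so index them
    CREATE INDEX idx_beers_brand ON beers (brand);
    CREATE INDEX idx_beers_style ON beers (style);
    -- colour rarely appears in a WHERE clause, so leave it unindexed;
    -- every extra index costs write throughput and storage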
valiant55 147 days ago [-]
I think a good baseline is just include indexes for foreign key relationships and then adjust from there.
Also, if anyone reading this is using SQL Server, do not use the recommended missing indexes it provides. They have a whole host of issues, including bad column ordering, and the tool will straight up recommend completely unnecessary indexes.
hi41 144 days ago [-]
Yes, we are using MS SQL. Will keep your advice in my mind.
magicalhippo 147 days ago [-]
Run the database profiler in production, logging which queries take a long time (say, >200 ms). Queries that come up repeatedly should be looked at.
Use the query analyzer to figure out where it's doing table scans or similar, and you'll probably quickly see which index you need to create.
Keep in mind the database server tries to use indexes to minimize the number of rows it needs to scan. So you just need enough to drastically cut down the number of rows to consider. If you have five columns in the where clause, think of which one or which combination is the most specific. This also means if you have a query with a where clause "colX = 1", yet 97% of the rows have colX equal to 1, adding an index on colX won't help. This means it can be customer-specific.
Also keep in mind that some database servers don't index null values, thus "where colX is null" means a table scan regardless. In such cases an alternative could be to add a computed column which would have a definite value, say "coalesce(colX, -1)" if colX should only have positive values. You can then index on that, and tweak your queries to filter on the computed column.
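A Postgres-flavoured sketch of that workaround (table and column names are invented; note that Postgres itself does index NULLs, so this mainly matters on engines that don't):

    -- persist coalesce(colx, -1) as a real column (Postgres 12+ syntax), then index it
    ALTER TABLE orders
      ADD COLUMN colx_nn integer GENERATED ALWAYS AS (coalesce(colx, -1)) STORED;
    CREATE INDEX idx_orders_colx_nn ON orders (colx_nn);
    -- queries filter on the computed column instead of "colx IS NULL"
    SELECT * FROM orders WHERE colx_nn = -1;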
Also keep in mind that if you have "order by", then the database server might ignore your index if it won't help in presenting the rows in the order dictated by the "order by" clause. We have some tables where we have multiple permutations of the same columns indexed due to this.
It's a skill and it's very database server dependent, just like learning to know the optimizer in your programming language of choice. Takes a fair bit of trial and error using the query analyzer.
mike_hearn 147 days ago [-]
Some databases can do it autonomously. Oracle is an example. It can notice if you keep doing inefficient queries and add the needed indexes by itself.
consteval 147 days ago [-]
Depends entirely on your application's data access patterns. What columns is your application actually using to do retrieval and insert?
golergka 147 days ago [-]
Mostly where EXPLAIN ANALYZE points to.
delifue 148 days ago [-]
> “Find all flights from Detroit to Beijing” ... which must be joined twice with flight info to express our query
Actually this can be done with a nested query, without joining. I prefer nested queries to joins, because a join is "global" while a nested query is "local" (global meaning it interferes with the whole SQL statement). The local one is easier to understand.
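A sketch of the nested-query version (the schema here is a guess, not the paper's):

    SELECT f.*
    FROM flights f
    WHERE f.from_airport IN (SELECT code FROM airports WHERE city = 'Detroit')
      AND f.to_airport   IN (SELECT code FROM airports WHERE city = 'Beijing');

Each subquery stays local to the predicate it serves, which is the locality being described.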
boxed 147 days ago [-]
For me the join problem doesn't exist anymore, because I use the excellent Django ORM for work and hobby projects.
This is a SELECT with 5 joins:
Foo.objects.filter(a__b__c__d__e=3)
Types are respected, valid relations exist and invalid relations do not. Clean.
Izkata 147 days ago [-]
I believe that's 4 joins, "e" should be a column on "d"
boxed 143 days ago [-]
Yea, my bad.
leandrod 146 days ago [-]
Those who do not learn the relational model are bound to flounder. For most use cases, there is no reducible complexity in the relational model.
kak3a 148 days ago [-]
The author talks about the usability problems of databases, so it's ironic that this site has a usability problem of its own: all the text is gray, which makes reading very painful. Had to use the "Darken Text" Chrome extension so it isn't killing my eyes!
trollied 148 days ago [-]
I think the core of this is the age old problem that people would rather invent crappy technologies that end up being a pain instead of taking some time to learn a standard, SQL.
SQL is not going to go away. It's relatively easy to learn.
nine_k 148 days ago [-]
In other words, people would rather make many tiny, seemingly easy steps than a few really impactful but seemingly arduous steps.
Teach SQL a spoonful at a time then, I suppose. Do not start by having the students read Codd's original papers.
OTOH SQL is not the best language: it's both too wordy and too terse, it's not composable in many important cases, it does not map nicely to set-theoretic operations. But a transpiler to SQL that would offer a vastly better experience is yet to take the world by storm.
morkalork 148 days ago [-]
It's shocking how many developers are afraid of SQL, it hurts my soul.
minkles 148 days ago [-]
20+ years ago no one was afraid of it. Our front end web folk were doing SQL quite happily. Now they can't even consume an API without problems.
I think the demand for staff has lowered standards considerably.
yunolearn 147 days ago [-]
20 years ago (or more like 30-40), software actually did something for the user and wasn't just a ZIRP hack to make billions courtesy of Google and friends. Once the money became the entire point, once line go up, you just need to be able to fog a mirror long enough to get acquired. The number of supposedly senior developers who send me screenshots of stack traces they haven't even read makes me want to day drink heavily. I'm thinking of starting a bakery.
minkles 147 days ago [-]
Yeah nailed it there. Totally agree.
I was going to start a grilled cheese van!
codr7 148 days ago [-]
I honestly don't get it, it's not rocket science, and compared to freakin MongoDB it's a Sunday walk in the park.
yas_hmaheshwari 147 days ago [-]
Totally agree!
I was thinking of this quote about regex, apt in this situation:
"A programmer had a problem with Sql. So he thought of using an ORM. Now he has two problems"
Choose your battles wisely
pif 147 days ago [-]
> taking some time to learn a standard, SQL.
One huge problem is that learning SQL will take you nowhere, unless your schema consists of a few tables with a handful of rows at most.
There is no standard SQL, and there is no serious application development without having to delve into the sanctum sanctorum of your database vendor.
yunolearn 147 days ago [-]
This is, quite frankly, bullshit. SQL fundamentals transcend a given RDBMS. And when they don't, boo hoo, learn the syntax of your particular database flavor. We need not be so precious.
SoftTalker 148 days ago [-]
> You youngins may not remember, but pre-2005 we had to call a travel agent to book our flights and get paper tickets in return. This sucked, we don't have any transparency in to the process, we couldn't explore options (price, convenience, dates, airports) and customize our trip. Having access to flight booking via web was really a great improvement for user experience.
I'm going to disagree. I could call my travel agent and say "I am going to Chicago on <date> I need to be there by <time> call me back with a couple of nonstop options. And I'll need a room at the Hilton and Towers for two nights."
vs. today I can spend my own time navigating travel sites, avoiding their dark patterns, wondering if I'm really getting the best prices, making sure I understand the terms (is this refundable? are any checked bags included?) etc and then do the same for the hotel booking.
If you work in a business with a travel department or assistants who can manage your travel you have an idea, compared to the hassle of doing it all yourself.
nine_k 148 days ago [-]
As usual:
* Relegate it to an agent: have to trust the agent, have to pay the agent, saves you time.
* DIY: have to spend time, have to have some expertise, saves you money.
A middle ground existed back in the day, too: you could visit a physical office of an airline and buy a physical ticket there, at the cost of some time and the narrow choice, but it saved you some money compared to an agent, and the clerk could provide a limited assistance.
ComputerGuru 148 days ago [-]
> wondering if I'm really getting the best prices
Curious why you assume you were getting the best price before?
nine_k 148 days ago [-]
But "before" you did not have an option and thus the need to wonder!
"One who has a watch always knows what time is it. One who has two watches is never certain about anything."
SoftTalker 148 days ago [-]
Yes, the paradox of having too many choices. Sounds like a good thing, but it causes anxiety.
nine_k 148 days ago [-]
Making a choice means making decisions. Making decisions can definitely be taxing, especially when yourself or your social circle insists that decisions were "good", "smart", or otherwise met a standard of excellence.
Suddenly you end up in a situation where you need to prove yourself, else the smartass cousin or your own resentful self will constantly remind you how you played the fool when you could have made a much better choice, obvious in hindsight.
An agent (human or machine) that chooses the tickets for you because you can't be bothered with such mundane trivialities may be quite desirable for some.
Hugely important, and I would add "backup-and-restore" to that list. At risk of sounding conceited, ease of use is a primary goal of rqlite[1] -- because in the real world databases must be operated[2]. I never add a feature if it's going to measurably decrease how easy it is to operate the database.
[1] https://www.rqlite.io
[2] https://docs.google.com/presentation/d/1Q8lQgCaODlecHa2hS-Oe...
Disclaimer: I'm the creator of rqlite.
Of course, it's easy for me to do this. I am not developing the database for commercial reasons, so can just say "no" to an idea if I want. That said, I've found that many ideas which didn't seem interesting to me when an end-user first proposed them become compelling once I think more about the operational pain (and it's almost always operational) they are experiencing.
Automatic backups to S3[1] was such a feature. I was sceptical -- "just run a script, call the backup API, and upload yourself" was my attitude. But the built-in support has become popular.
[1] https://www.philipotoole.com/adding-automatic-s3-backups-to-...
I've never seen any data to back this up in a quantitative sense (though interested if anyone has?). Nevertheless, I've still found it useful as a qualitative rule of thumb in a positive sense: shaving off small edges of friction can have non-linear return in adoption and satisfaction.
[0]: https://en.wikipedia.org/wiki/Inverse-square_law
It's a world of pain, and there is so much scar tissue at large companies in the form of third-party tooling doing crazy stuff to deal with this problem, stuff that really should be the DB vendor's problem.
I’m not saying this is good, only that RDBMS management is very much still in the land of arcane incantations, and anyone saying otherwise is selling something. It’s also why I have a job, so from my perspective it isn’t all bad.
The thing is - and I imagine you know this - that when you run tens of thousands of database instances it's not a matter of if, it's a matter of when. I can't chase after thousands of engineers to make sure they're not hitting some obscure pitfall nor can I demand that they know the ins and outs of the 8 different types of locks in FooDB. And "read the docs lol" definitely doesn't scale. So we build automated mass-guardrails and linters as best we can and use various shenanigans to deal with the real world.
In the context of schema migrations, there's tools like pt-osc, gh-ost and pgroll that are excellent for what they are and I'm grateful to those who created them but these are still in effect huge hacks that are bolted on top of the database. They're teeming with footguns. I can tell you about a time where badly conceived timeout / error handling code in one of these tools in their golang database driver resulted in most of a database dropped instantly with no fast rollback, for example. That was fun.
If a third party bolt on tool can implement zero(ish) downtime migrations and cutovers, at least much much much better than the regular DDL, why the heck is the database not doing whatever that thing is doing? It's tech debt, history, crud, lack of caring, etc. Schema migrations is just the tip of the iceberg too, don't get me started on replication, durability settings, resource usage vs capacities ...
My point is that academics and vendors should be taking operational problems seriously and owning them, and solving them properly at the database layer.
And just like you I also make living off of the fact that the real world operational characteristics of truly busy databases are very different than what gets benchmarked and tested. I'm just saying that some of this stuff even I don't want to have to deal with. There's better problems to be working on.
People endlessly want hacks, workarounds, and un-normalized data structures, for a single reason - they don't want to have to think about JOIN. It's not actually for performance reasons, it's not actually for any reason other than it's easier to imagine a big table with lots of columns.
I'm actually sympathetic to that reticence but what I am not sympathetic about is this: why, in 2024, can't the computer figure out multi-table joins for me?
Unless your schema is really fucked up, there should only be one or two actually sensible ways to join across multiple tables. Like say I have apples in boxes in houses. I have an apple table, a box table, and a house table. apples have a box_id, boxes have a house_id. Now I want to find all the apples in a house. Do two joins, cool. But literally everyone seems to write this by hand, and it means that crud apps end up with thousands of nearly identical queries which are mostly just boringly spelling out the chain of joins that needs to be applied.
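Spelled out, the hand-written version looks like this (a sketch):

    SELECT a.*
    FROM apples a
    JOIN boxes  b ON b.id = a.box_id
    JOIN houses h ON h.id = b.house_id
    WHERE h.id = 7;

The join chain is entirely mechanical, which is exactly the point: the schema already contains everything needed to derive it.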
When I started using SQLAlchemy, I naively assumed that a trivial functionality of such a sophisticated ORM would be to implement `apple.join(house)`, automatically figuring out that `box` is the necessary intermediate step. SQLAlchemy even has all the additional relationship information to figure out the details of how those joins need to work. But after weeks of reading the documentation I realized that this is not a supported feature.
In the end I wrote a join tool myself, which can automatically find paths between distant tables in the schema, but it seems ludicrous to have to homebrew something like that.
I'm not a trained software engineer but it seems like this must be a very generic problem -- is there a name for the problem? Are there accepted solutions or no-go-theorems on what is possible? I have searched the internet a lot and mostly just find people saying "oh we just type out all combinatorially-many possible queries"... apologies in advance if I am very ignorant here
That describes most enterprise schemas that have outlived multiple applications, usually due to time constraints, a lack of database administrators, and complex business needs. Excel is the king of databases for a reason.
As a SQL non-expert, I think this is why we are averse to joins: it's easy to end up with more rows in the result than you intended, and it's not always clear how to verify that you haven't.
I like that, though; I have a preference for REPL-style development and fleshing things out interactively. Enough so that I build Java like that, hacking things out directly in tests.
Working with a database you built or can control is kind of a simplistic example.
In my experience this most often arises when it’s someone else’s database or API you’re interacting with and do not control.
An upstream ERP system doesn’t have unique keys, or it changes the name of a column, or an accounting person adds a custom field, or the accounting system does claim to have unique identifiers but gets upgraded and changes the unique row identifiers that were never supposed to change, or a user deletes a record and recreates the same record with the same name, which now has a different ID so the data in your reporting database no longer has the correct foreign keys, and some of the data has to be cross-referenced from the ERP system with a CRM that only has a flaky API and the only way to get the data is by pulling a CSV report file from an email, which doesn’t have the same field names to reliably correlate the data with the ERP, and worse the CRM makes these user-editable so one of your 200 sales people decides to use their own naming scheme or makes a typo and we have 10 different ways of spelling “New York”, “new york”, “NY”, “newyork2”, “now york”, and yeah…
Turns out you can sometimes end up with extra rows despite your best efforts and that SQL isn’t always the best tool for joining data, and no I’m not interested in helping you troubleshoot your 7-page SQL query that’s stacked on top of multiple layers of nested SQL views that’s giving you too many rows. You might even say I’m averse.
This is a problem of form, not method. JOINs are a fantastic, well-defined method to aggregate data. If the form of your data is messed up, then naturally the result may be too.
> no I’m not interested in helping you troubleshoot your 7-page SQL query that’s stacked on top of multiple layers of nested SQL views that’s giving you too many rows
People say this type of thing but SQL is an incredibly high-level language.
Yes debugging a big SQL query can suck. Debugging the equivalent of it is almost always much worse. I'd rather debug a 30-line query than a 400 line perl script that does the same thing. Because that's actually the alternative.
I have manually aggregated data, messy data, in awk, perl, python... it is much, much worse.
Yes, but with Python/etc you can at least do the same logic in a plain loop, which is much slower but serves as a more reliable reference of what the data is supposed to be, which can be used to validate the functionality of SQL output.
Is there an equivalent in SQL of this “slow and dumb” approach for validating? Like, I’m not sure if a lateral join is essentially doing the same thing under the hood.
In practice, I ended up creating a skeleton table that not only had the appropriate constraints for de-duplication, but also contained mostly empty rows (empty except for the required key ID columns) for exactly the rows they were expecting. Then I would perform an UPDATE <skeleton> ... FROM <complicated join>. It is a hack, but when there was no desire to rewrite a process written years ago by teams of consultants, I could only do what I could do.
That's because, inexplicably, devs by and large don't know SQL, and don't want to learn it. It's an absurdly simple language that should take a competent person a day to get a baseline level of knowledge, and perhaps a week to be extremely comfortable with.
As an aside, something you can do is create views (non-materialized) for whatever queries are desired. The counter-argument to this is that it slows development velocity, but then, so do devs who don't know how to do joins.
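For instance, a non-materialized view over the apples/boxes/houses chain from upthread (names invented) lets callers skip the joins entirely:

    CREATE VIEW apples_in_houses AS
    SELECT h.id AS house_id, a.*
    FROM apples a
    JOIN boxes  b ON b.id = a.box_id
    JOIN houses h ON h.id = b.house_id;
    -- callers never write the join chain
    SELECT * FROM apples_in_houses WHERE house_id = 7;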
I think there's a beautiful space for some SQL-like language that just operates a bit more like a general-purpose language in a more regular fashion. Bonus points for ones where you don't query tables but point at indexes or table scans and the like (resolving the "programmer writes query that is super non-performant because they assume an index is present when it's not").
I think it's still super straightforward to sit down and learn it, but it's really unfortunate that we spend a bunch of time in school learning data structures and then SQL tries really hard to hide all that, making it pretty opaque despite people intuitively understanding B-Trees or indexes.
To choose the best path to retrieve data, you have to know what are the possible paths (using the underlying data structures, indexes but also different algorithms to filter, join…), but you should also know some data metrics to evaluate if some shortcuts are worth it (a seq scan can be the best choice with a small table..).
And the thing that will trip most humans, is that you need to reevaluate the plan if the underlying assumptions change (data distribution has become something that you never expected).
Note that the planner is also NOT always right, it heavily relies on heuristics and data metrics that can be skewed or not up to date. Some databases allow the use of hints to choose an index or a specific path.
I understand the value of full declarative planning with heuristics, but sometimes the query writers do in fact have a better understanding of the data that will go in.
And beyond that, having consistent plans is actually better in some ideologies! Instead of a query suddenly changing tactics in a data- and time-dependent way, having consistent behavior at the planning phase means that you can apply general engineering maintenance tactics. Keep an eye on query perf, improve things that need to be improved... there are still the possibility of hitting absolutely nasty issues, but the fact that every SQL debugging session starts with "well we gotta run EXPLAIN first after the fact" is actually kind of odd!
https://github.com/permazen/permazen
It maps Java objects to a scalable transactional K/V store of your choice, and handles things like indexing, schema migrations and the rest for you. You express query plans by hand using the Java collections and streams framework.
Incidentally I think it also addresses what a lot of the comments here are talking about: not learning JOINs, indexing, build-in relation cardinality constraints, etc, but that's a separate point!
This sounds like an EAV table, which is generally a bad idea.
Re: types, Postgres allows you to define whatever kind of type you’d like. Also re: inheritance, again, Postgres tables can inherit a parent. Not just FK linking, but literally schema inheritance. It’s rare that you’d want to do this, but you can.
In general, my view is that the supposed impedance mismatch is a good thing, and if anything, it forces the dev to think about less complicated ways to model their data. The relational model has stuck around because it’s incredibly good, not because nothing better has come around.
EDIT: this came across as quite harsh, and I’m sorry for the tone. Making a new product is hard. I’m just very jaded about anything trying to replace SQL, because I love it, it’s near-universal, and it hasn’t (nor is likely to) gone anywhere for quite some time.
That's one of the best things about SQL, it's declarative nature. I describe the end result, data I want to receive — not the instructions on how it should be done. There's no control flow, there's no program state, which means that my mental model of it is so much simpler.
I've read countless articles on how to rearrange the "declaration of what you want" in order to get the database to do it in a fast way.
While this is doubtlessly true, in many cases the “rearranging” also involves a subtle change in what you are asking the database to do, in ways which allow the database to actually do less work.
SELECT 1 WHERE EXISTS vs WHERE ID IN (SELECT ID FROM mytable WHERE …) is a great example. The former is a much simpler request despite functionally doing the same thing in the common use-cases.
MySQL, but not Postgres, will also convert the inverse to an antijoin.
As an aside, this is also often a great example to demonstrate why you shouldn’t use SELECT *, since you can’t write a semijoin if you need the predicate column in the result.
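Written out, the two shapes look like this (a sketch with invented tables):

    -- semijoin via EXISTS: the engine only has to prove a match exists
    SELECT c.id, c.name
    FROM customers c
    WHERE EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.id);

    -- the IN form asks for the same rows; some planners rewrite it to the
    -- same semijoin, others handle the subquery less efficiently
    SELECT c.id, c.name
    FROM customers c
    WHERE c.id IN (SELECT o.customer_id FROM orders o);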
I think of RDBMS like C: they’re massively capable and performant, but only if you know what you’re doing. They’re also very eager to catch everything on fire if you don’t.
Maybe it's easier if you think of it like helpful syntactic sugar?
> I think there's a beautiful space for some SQL-like language that just operates a bit more like a general-purpose language in a more regular fashion. Bonus points for ones where you don't query tables but point at indexes or table scans and the like
That sounds like imperative programming, which is fine for most things, but [generally] not RDBMS (or IaC, but that's a completely separate topic). You can't possibly know the state of a given table or index – the cardinality of columns, the relative grouping of tuples to one another, etc. While you can hint at index usage (natively with MySQL, via extension with Postgres), that's as close as the planner will let you get, because it knows these things better than you do.
> resolving the "programmer writes query that is super non-performant because they assume an index is present when it's not"
More frequently, I see "programmer writes query that is super non-performant because they haven't read the docs, and don't know the requirements for the planner to use the index." A few examples:
* Given a table with columns foo, bar, baz, with an index on (foo, bar), a query with a predicate on `bar` alone is [generally] non-sargeable. Postgres can do this, but it's rare, and unlikely to perform as well as you'd want anyway.
* Indices on columns are unlikely to be used for aggregations like GROUP BY, except in very specific circumstances for MySQL [0] (I'm not sure what limitations Postgres has on this).
* Not knowing that a leading wildcard on a predicate (e.g. `WHERE user_name LIKE '%ara'`) will, except under two circumstances [1], skip using an index.
> despite people intuitively understanding B-Trees or indexes.
You say that, but the sheer number of devs I've talked to who are unaware that UUIDv4 is an abysmally bad choice WRT performance for indices – primary or secondary – says otherwise.
[0]: https://dev.mysql.com/doc/refman/8.4/en/group-by-optimizatio...
[1]: Postgres can create trigram indices, which can search with these, at the expense of the index being quite large. Both MySQL and Postgres can make use of the REVERSE() function to create a reverse index on the column, which can then be used in a query with the username also reversed.
My serious belief is that all the SQL variants are generally great, but I just want this to be incremented with some lower-level language that I can be more explicit with, from time to time. If only because sometimes there are operational needs.
The fact that the best we get with this is planner _hints_ is still to this day surprising to me. Hints! I am in control of the machine, why shouldn't it just listen to me! (and to stop the "but random data analyst could break thing", this is why we have invented permission systems)
I don’t think that’s what I said. At least, it wasn’t what I meant.
The planner needs accurate stats on the tables to produce an optimal output. So would you, as a human, if you were writing a query to be executed precisely as written.
It also needs to know information about the system it’s operating in – the amount of RAM it can expect to use being the most important one, but also things like the relative performance of a random disk seek vs. sequential.
The planner also has a huge leg up on most people, in that it is aware of all of the kinds of joins and lookups it can do. Off the top of my head, Postgres can scan in the following ways: sequential scan, parallel sequential scan, index scan, parallel index scan, index-only scan, and bitmap scan. For joins, it can use: nested loop, merge, hash. Do you know when all of those are ideal? What about when you have a query joining multiple tables, keeping in mind that the intermediate contents of a table that has been joined once may look nothing like what it started as?
I don't think that last one is a thing. There is both "bitmap heap scan" and "bitmap index scan" though.
[0]: https://www.cybertec-postgresql.com/en/postgresql-indexing-i...
The one I see many times more often than this is having one index on (foo) and one index on (bar), then querying on both foo and bar, and assuming the database uses both indexes for best performance. It can be done, even by ancient mysql, but it's most likely going to be a lot slower than expected so they're heavily biased against doing it and will most likely just choose one of the two indexes.
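A sketch of the difference (hypothetical table):

    -- two single-column indexes: the planner typically picks just one,
    -- or has to combine them with a bitmap AND, which is often slower than expected
    CREATE INDEX idx_t_foo ON t (foo);
    CREATE INDEX idx_t_bar ON t (bar);

    -- one composite index serves the combined predicate directly
    CREATE INDEX idx_t_foo_bar ON t (foo, bar);
    SELECT * FROM t WHERE foo = 1 AND bar = 'x';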
> One huge problem is that learning SQL will take you nowhere, unless your schema consists of a few tables with a handful of rows at most.
> There is no standard SQL, and there is no serious application development without having to delve into the sanctum sanctorum of your database vendor.
There is a standard SQL, and vendors are free to add to it. Generally speaking, though, you can get quite far with the standard.
> there is no serious application development without having to delve into the sanctum sanctorum of your database vendor.
OK? I could say the same thing about needing to intimately know the differences in C++ versions. Yes, you should know your tools well.
Join table that maps to an entity in the middle.
You can even have multiple columns that have foreign keys against various tables, like some_table_id, other_table_id, another_table_id with only the needed ones being filled out.
And in practice, this will be way more manageable than the dynamic mess of the OTLT pattern (table_name and table_id): https://www.red-gate.com/simple-talk/blogs/when-the-fever-is...
It's not like you have to particularly care about the fact that most of those columns will be empty in practice, as opposed to making your database hard to query or throwing constraints aside altogether.
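A Postgres-flavoured sketch of that multiple-nullable-FK shape (names are invented; num_nonnulls enforces the "only the needed one is filled out" rule):

    CREATE TABLE comments (
        id         BIGINT PRIMARY KEY,
        body       TEXT NOT NULL,
        -- exactly one of these is set, each backed by a real foreign key
        post_id    BIGINT REFERENCES posts(id),
        photo_id   BIGINT REFERENCES photos(id),
        product_id BIGINT REFERENCES products(id),
        CHECK (num_nonnulls(post_id, photo_id, product_id) = 1)
    );

Each reference stays a real, checkable constraint, unlike the table_name/table_id pair of the OTLT pattern.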
I'm not sure what you mean. Are you saying that instead of plain FK/PK relations, you would reify every edge as its own named relationship? Even then, the reverse "I have a country, gimme a user" request is ambiguous. I'm only really familiar with Django, and it handles this type of thing by making you specify the "residence1" name instead of the table name "country".
[0] To refer back to the top of this comment chain, that user is complaining they can't do something like also have a "country->continent" relationship and do "user.join(continent)" and have the system figure out the two joins needed.
Remember this whole thread is about the system figuring out the JOIN criteria automatically.
Oracle: https://docs.oracle.com/javadb/10.8.3.0/ref/rrefsqljnaturalj...
MySQL: https://dev.mysql.com/doc/refman/8.4/en/join.html
But that's only the trivial case, often you want something more complicated, where the order of clauses or what keywords you pick and how you want the results sorted will affect performance in your specific schema on your specific database engine.
RDBMS management and querying is a rather deep and experience demanding subject once you step outside the trivial cases. You could put more of it in application code, but it will be a lot of chatter and probably worse outcomes in terms of performance than having people on the team that are really good with databases.
You are primarily describing natural join here. It won’t read your mind and join your tables automatically, but it automatically uses shared columns as join keys and removes the duplicate keys from the result.
The problem with any “auto” solution is going to be things like modified dates, which may exist in multiple tables but aren’t shared.
Even more magic is natural semi join and natural anti join.
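A minimal sketch, assuming the only column the two tables share is customer_id:

    -- equivalent to JOIN ... USING (customer_id), with the duplicate key column removed
    SELECT *
    FROM customers
    NATURAL JOIN orders;
    -- footgun: if both tables also had, say, a modified_at column,
    -- it would silently become part of the join key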
I think EdgeDB [0] is on the right track here. It's a database built on Postgres, but with a different query language. So instead of manually fiddling with JOIN statements, traversing linked tables just looks like dot notation.
[0] https://www.edgedb.com/
I mean, I think the issue is just that there are lots of possible paths so it can't be automated.
If you want to join employees to buildings, is it where their desk is assigned today, or where their desk was assigned two years ago, or where their team is based even though they work remotely, or where they last badged in?
Sure you can build a tool to find all potential joins based on foreign keys, but then how do you know which is correct unless you understand what the tables mean? And then if you understand what the tables mean, writing the join out yourself is trivial.
> Unless your schema is really fucked up, there should only be one or two actually sensible ways to join across multiple tables.
In my experience, having just one or two ways is for simple/toy projects. Lots of joins in no way means a schema is "fucked up". It probably just means it's correctly modeling actual relationships, correctly normalized.
In this way you can use polynomially-many constraints or rules to avoid exponentially-many weird or exceptional routes through the schema. I am optimistic that you can build a system where by default, the autojoin solution is the natural one maybe 90% of the time. There will certainly be exceptions where you have to express the join conditions explicitly. But I think you can dramatically reduce the amount of code required.
I would also hazard the suggestion that this might produce productive backpressure on the system. If the autojoiner is struggling to find a good route through the schema, it's possible that the schema is not properly normalized or otherwise messed up.
... And god help you if someday Requisitions need to have Revisions too.
That knowledge gets you to the point of understanding why many joins are very slow: each DB page may contain only 1% information that is actually useful to the query.
It will help you see why people do things like add an array column, use wide tables, etc.
We have database performance issues for other reasons (loading of redundant information, too many queries) but not really because of joins.
1. A database that's meant to be understood by programmers, make queries by the application efficient in space and time, and provide strong referential integrity.
2. A database that's meant to be played with by humans, where duplication is fine, referential integrity is a nice-to-have, every column is okay to be nullable, tables should contain complete objects as understood by the user that map 1-1ish to the real world, that eliminate as many opaque ids as feasible, and foreign keys might exist but aren't enforced in any direction.
The latter database is far more ergonomic and you won't run up against a user frustratingly bashing their keyboard because the database just refuses to do what they want. The stakes in #2 style databases are extremely low-- let the user destroy their copy of the database it's fine, we can reload from the last save.
The nice thing is that it seems very possible to go from #1 -> #2 mechanically, and hand that off to the users who want to play with it.
First, treat each column separately in the physical engine (column store). Users ought to never need to worry about column count limits, row byte limits, or sparsity. Similarly, schema operators such as adding, removing, or reordering columns ought to always be instant atomic changes.
Imagine how much faster it would be possible to explore the schema design space if changes were instant and didn’t require “migration scripts” or data copies from an old table to a new one. It would make the database feel more like a spreadsheet!
Next, there ought to be Git-style forking and merging so that creating a test environment should also be a lightweight operation just like creating a local branch in a Git repo. Merging changes can be either schema only or schema+data -- the latter to support slowly-changing “master data” editing workflows. Currently, few if any database support the equivalent of a pull request with reviews to merge data. Hence the excessively complex access controls that could all be replaced with a single "peer review" operation.
I’ve seen a fancy category-theoretic approach where forks of the database can receive live data updates from production. This would allow UAT and similar environments to be evergreen with zero infrastructure code such as nightly ETL sync jobs.
Many operational tasks can be eliminated by tying live environments of both the code and data to Git branches of a single (mono) repo. Now the code and data schema can’t go out of sync! Literally impossible. No need for scaffolding, or ORMs, or any of those layers of overcomplicted abstractions! The schema is just “there”, in the repo, always representing reality.
Sprinkle on a Microsoft Access style form designer but with HTML5 and guest user support and you have a billion dollar product.
> it seems very possible to go from #1 -> #2 mechanically, and hand that off to the users who want to play with it
NocoDB seems like a reasonable attempt of taking this approach for Postgres.
I guarantee you there is no algorithmic way to do this. Going from "big flat file" to "nicely organized 3+NF schema" adds a lot of underlying domain knowledge and information to the system. You're asking for "enhance!" on the database schema. It will always require expert understanding.
This problem is heightened by the fact that a SQL table typically can't store composite types. Like, you can't make a column of lists-of-strings. Every modelling problem is solved by using more tables. And any nontrivial application ends up with a table explosion. You have to study your database very closely, then write complex, slow, join-heavy queries, just to get the information about simple stuff - like an order.
Solving every problem with tables might seem clever from a technical perspective, but it's just not how humans think. There's a reason document databases like mongodb are popular. Even if they are worse technically, they're much easier to use and reason about.
There's no reason SQL databases couldn't support nested data. Postgres sort of does already via JSON fields. But it feels like you have to fight the database to do it.
[1] https://en.wikipedia.org/wiki/Object%E2%80%93relational_impe...
I have a bunch of posts, written by various authors, on various sites. Store that in some OOP way, with ownership going in some direction (For example, sites own posts). Now look up the most recent posts for authors. Look up the most recent posts per site. Look up the most prolific authors. Do the authors work cross-site? Maybe they don't but people use the same name across sites.
There are plenty of times I have put things into a database not for perf reasons but simply because it's easier to do queries that way than to write a bunch of bespoke logic at each variant of certain queries, simply due to ownership arrows being all over the place
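Once it's relational, those lookups are one-liners (a sketch with invented column names):

    -- most prolific authors
    SELECT author_id, count(*) AS post_count
    FROM posts
    GROUP BY author_id
    ORDER BY post_count DESC;

    -- most recent post per site
    SELECT site_id, max(published_at) AS latest_post
    FROM posts
    GROUP BY site_id;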
As for join-heavy complexity, this is not necessarily a problem in reality. It's incredibly easy to scale this out to huge systems (relatively) cheaply.
Believe me, as they scale up, they look way less hard and less complicated than arbitrary and poorly enforced schemas in document databases. I could write an essay on how to fuck up MongoDB (or any document store) because I spent nearly 2 years unfucking one.
Claiming it’s a moral failing on my part for wanting nice, intuitive data structures is a bit rich. Go on, next why don’t you tell me that json is bad and we should use xml. Or c is lazy and we should use assembler. Without more detail, that’s all I hear - moral justifications for a dated design.
And the design is dated. As far as I can tell, the reason for the current design is that early SQL databases had a fixed byte size for fields. So in that world, it would be impossible to embed a list in a cell. But that limitation hasn't existed in modern SQL databases for decades now.
Why can’t we have nice things? I don’t see anything in sql that precludes embedded lists and objects in table cells. Postgres even - sort of - supports it already via json fields.
XML is for the non-trivial case, where having schemas and tooling around them for validation and code generation becomes useful. XML also has a default programming language for querying and transformations, allowing you to implement adapters from CSV, JSON or whatever into XML, and then between XML schemas.
XML can be used in a JSON-like way if that's what you want: XPath. When programming against XML it's usually not what you want though; in practice it tends to be nicer to use the schema to set up an object hierarchy native to the programming language and unmarshal into that. But there is at least one XPath-oriented database, eXist, and it's pretty ergonomic if you find yourself with a lot of XML and aren't sure what it contains or what to use it for. I find it easier to work with than the NoSQL JSON varieties I come across professionally, and when I have Questions For The Customer it craps out PDFs I can send them, which doesn't seem that easy to do with Mongo or Firestore.
As for SQL databases, they come in a lot of shapes and colours. I'm not so sure there is a JSON-first database equivalent of DuckDB, for example. Maybe I'm wrong, it's not a specialty of mine.
Your entire argument is "old stuff is bad and I like JSON".
And yes we should use XML. At least it has competent, verifiable, consistent schemas and contracts. At the very least, you can encode xs:decimal in it unlike JSON which is "oh it's some shit that looks like a number and we might stick in a double at the other end because we don't know our butt from our elbow".
I ask you: with JSON, how do you encode "add a single item to the list attached to a customer"?
I am so so so tired of dealing with this level of thinking on a daily basis.
My argument isn't that "old stuff is bad and I like JSON". It's that when I'm modelling data in my program, sometimes I have objects which reference each other (via key or shared pointer). And sometimes I have objects which contain another object. E.g., a struct with a vec inside. Real code is packed full of this stuff.
SQL lets me directly model relationships between objects. And it lets me model records with scalar fields. But it does not let me model records with variable size fields. This restriction seems totally arbitrary. There is no technical reason for it that I can think of. And it makes it needlessly complex to map data from my programming language to my SQL database.
I've named this problem 3 times in this thread now. For all your moaning about us kids being lazy, you still haven't given a single technical reason for this deficiency in SQL.
I largely agree with you about JSON. I don't actually want JSON fields in my database. I want something strongly typed, with well defined semantics and a good query system. SQL has a great track record of providing just that. And it has a perfectly serviceable type system already. So why on earth is it missing this important, obvious feature?
I just want my database to be able to express embedded lists and records, like I do in almost every programming language I've ever used, every messaging format I've used (protobuf, msgpack, json, xml, etc). And in many other databases (couchdb, mongo, foundationdb, etc).
> I ask you with JSON, how do you encode "add a single item to the list attached to a customer"
I can think of dozens of ways to express that. Postgres already supports that operation for embedded JSON fields using this ugly syntax:
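Something along these lines, assuming a hypothetical customer table with a jsonb column named orders holding an array (the || operator appends a value to a jsonb array):

    -- append one element to the embedded jsonb array
    -- (table and column names here are made up for illustration)
    UPDATE customer
    SET    orders = orders || '{"item": "widget", "qty": 1}'::jsonb
    WHERE  id = 42;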
MongoDB has its own syntax for this, as does RethinkDB. Do you need more examples? Basically every database natively supports this except for "raw" SQL databases. Edit: Apparently (according to other comments in this thread) some SQL databases are already moving in this direction anyway. How wonderful!
Yes it does. Customers have orders, which are an array. You then have two tables, CUSTOMER and ORDER, and you JOIN them. Why not just put the orders inside of CUSTOMER? Because then you can't query it: you don't know how many columns will come out, and you can't have disparate columns across rows.
So maybe you dump it all in one column, but obviously that has problems in terms of forming relations.
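A minimal sketch of that shape, with made-up table and column names (ORDER is a reserved word, hence customer_order):

    CREATE TABLE customer (
        id   BIGINT PRIMARY KEY,
        name TEXT NOT NULL
    );

    CREATE TABLE customer_order (
        id          BIGINT PRIMARY KEY,
        customer_id BIGINT NOT NULL REFERENCES customer(id),
        placed_at   TIMESTAMP NOT NULL,
        total       NUMERIC(12,2) NOT NULL
    );

    -- "a customer's orders" is a join, not an embedded list
    SELECT o.placed_at, o.total
    FROM   customer c
    JOIN   customer_order o ON o.customer_id = c.id
    WHERE  c.id = 42;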
Sure, it's a different way of thinking. But it's faster, it's MUCH safer, and the invariants are actually properly specified.
Sure you can use mongodb and that will work for wild-westing your way through software. But I wouldn't dare touch a mongo instance without going through the application, because all the constraints are willy-nilly implicitly applied in the application. But I directly view and edit SQL databases daily.
Why wouldn't you be able to query it? You can query JSON fields that contain lists. Why not SQL fields with strongly typed lists?
> Sure, it's a different way of thinking. But it's faster, it's MUCH safer, and the invariants are actually properly specified.
Why would it be safer?
An embedded list has pretty clear and obvious semantics. And as ekimekim said in another comment, postgres already has partial support. Apparently this works today:
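Roughly along these lines with a Postgres array column (a sketch, not the exact snippet from that comment):

    -- a text[] column is an embedded, strongly typed list of strings
    CREATE TABLE post (
        id   BIGINT PRIMARY KEY,
        tags TEXT[] NOT NULL DEFAULT '{}'
    );

    -- append an element in place
    UPDATE post SET tags = array_append(tags, 'databases') WHERE id = 1;

    -- find rows whose embedded list contains a given element
    SELECT id FROM post WHERE 'databases' = ANY(tags);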
An embedded list also sounds much faster to me - because you don't have to JOIN. Embedding a list promises to the database "I'll always fetch this content in the context of the containing record". Instead of (fetch row) -> (fetch referencing key) -> (fetch rows in child table), the database can simply fetch the associated field directly.
> But I wouldn't dare touch a mongo instance without going through the application, because all the constraints are willy-nilly implicitly applied in the application.
Yes, I hate MongoDB as much as you do. I want explicit types and explicit invariants. But right now MongoDB has useful features that are missing / unloved in SQL. How embarrassing. SQL databases should just add support for this approach to data modelling. It's nice to see that Postgres is trying exactly that.
In programming, I don't have to choose between javascript and assembly. I have nice languages like rust with good type systems and good performance. We can have nice things.
Most relational databases have supported lists-of-strings (arrays) for about 10 years now. All of them supported lists-of-strings even before that, if you were okay doing a bit of app-level work on top of your SQL.
Relational databases these days support full JSON. And they're faster at it than many json-first databases :)
edit: Actually I'm wrong. SQL added support for arrays in 1999, a full 25 years ago. https://en.wikipedia.org/wiki/SQL:1999 https://stackoverflow.com/questions/30109111/at-what-version...
The only schemas that survive long term are the ones that create such a tangled ball of foreign keys that migrations are infeasible. Great advice for anyone looking to leave a legacy.
Document databases tend to take on a life of their own; and since they encourage shortcuts, the slope is typically down.
I see new grads talking about the MERN stack (with M for Mongo), and I always have an urge to correct them: replace the M with MySQL.
This is pretty much the definition of what a RDBMS is. The tables provide the relations via designated columns. Relation navigation is therefore possible in either direction given an established relationship and very often possible starting from any table in the pathological case.
If this conceptual model is not appropriate, it is best not to use it to begin with.
> This problem is heightened by the fact that a SQL table typically can't store composite types.
That is not what RDBMSs are designed to do, even though many vendors now support some form of composite data types (like JSON). Just like a moped is not designed to move a family across a continent.
Pick the persistent storage mechanism which best fits the problem at hand:
- If the problem calls for a document structure, use a document database.
- If the problem calls for a directed graph structure, use a graph database.
- If the problem calls for a relational structure, use a RDBMS.
- If the problem calls for some combination of the above, consider using multiple storage technologies.
None are perfect, but starting with the most appropriate persistent store can eliminate the weaknesses others would introduce in the same solution space.
For example, BigQuery has natural support for arrays and nested data, and it's quite nice / essential for good data modeling. "Tags", for instance, can be stored as `Array<Struct<Key, Value>>`, and this can be used to implement things like "search for rows with particular tags".
This reduces the cognitive burden of remembering which tables join with which, especially if we know that a relationship is solely relevant in one context. I.e. Tags can only be joined to the main table, and no other joins are sensical.
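A rough BigQuery-flavoured sketch of that pattern (dataset, table and field names are made up):

    -- tags live inside the row as ARRAY<STRUCT<...>>, no second table needed
    CREATE TABLE demo.events (
        id   INT64,
        tags ARRAY<STRUCT<key STRING, value STRING>>
    );

    -- "rows carrying a particular tag", via UNNEST of the embedded array
    SELECT e.id
    FROM   demo.events AS e, UNNEST(e.tags) AS t
    WHERE  t.key = 'env' AND t.value = 'prod';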
Certainly some problems are not friendly to SQL databases, but many problems are, and more kinds of problems are being brought into SQL databases too. I would not be surprised in the least if LLMs start getting stored in SQL databases eventually, as you probably can reduce their storage to tables, and SQL databases are amazing at table storage.
I agree it's not really how many humans think (though that can be said about many computer things), but that doesn't mean you can't learn how to do it.
When you talk about "just" using nested data in your database, you seem to be forgetting a whole load of problems that relational databases solve, like concurrency issues, write performance, data integrity and consistency, etc. The book Designing Data-Intensive Applications by Kleppmann is an essential read.
SQLAlchemy however falls flat on its face in this regard.
The point about SQLAlchemy was specifically about the issue mentioned by the GP, namely having to make more tables for composites. SQLAlchemy does a better job of mapping db tables back to regular objects. With Django you get the `alias_set` stuff but it never lets you forget that this is really a database table. Django uses the active record pattern while SQLAlchemy is a data mapper.
> Solving every problem with tables might seem clever from a technical perspective, but its just not how humans think.
Not without practice, no. You can in fact learn to do data modeling.
[0]: https://en.wikipedia.org/wiki/First_normal_form
Good discussion at https://dba.stackexchange.com/questions/2342/what-is-atomic-...
I'll also add to my previous comment that it's fine to violate normal form if you are aware that you're doing so and have a good reason. There are some instances where an array happens to make excellent sense from a performance standpoint, for example, and storing it in an RDBMS doesn't magically make everything explode.
I do disagree with people's tendency to shift towards treating an RDBMS as a way to store schema in JSON, though. Not only is it massively unperformant and a pain to index, but you can quite easily get yourself into situations where there are multiple versions of your schema-within-a-schema, and that can be a nightmare.
Sounds like something the databases should be able to provide natively, without needing to use JSON at all.
(And apparently support for this sort of stuff is slowly improving! Thanks to all the sibling comments for details!)
I note that Julius.ai is making data analytics easier - just upload some data and ask for some charts and it does an ok job.
Going from natural language to SQL is still far from a solved problem. The main benchmark for this is Bird-Bench, and accuracy is only in the high 60s - https://bird-bench.github.io/
I have been working on making tables queryable using function calling, where you get a fixed number of canned queries based on the primary key and an event-time column. PyData talk on this - https://www.youtube.com/watch?v=dRcjTe5qgwM
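The canned-query idea, very roughly: each function the model can call maps to one parameterised statement keyed on the primary key and the event-time column (table and column names here are invented):

    -- one canned query, exposed as a callable function with two parameters:
    -- a primary-key value and a lower bound on event time
    SELECT *
    FROM   orders
    WHERE  customer_id = $1
      AND  event_time  >= $2
    ORDER BY event_time DESC
    LIMIT  100;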
https://cloud.google.com/bigquery/docs/write-sql-gemini#prom...
Is there any improvement going from natural language to something other than SQL? Certainly SQL can be cut out of the picture if it is what confuses these systems.
Depends. Where is the problem? Is it the quirkiness of SQL? Perhaps something like QUEL or Datalog would yield better results? Is it declarative programming that it struggles with? GPT, for example, seems to be optimized for Python in particular. Perhaps an imperative approach is easier for the LLM to understand? It doesn't even have to be a language suitable for humans. Perhaps it would fare better with something like SQLite's byte code language?
> If the goal is to talk to a SQL database
While being able to talk to an existing SQL database would no doubt simplify the problem in a lot of cases, which is of merit, I doubt that is the actual goal. The user doesn't care about the technology, as they say. Getting the expected results out of the database is undoubtedly the actual goal.
SQL as a target is all well and good if it works reliably, but the claim was that it doesn't. If some other target performs better, there is no need to cling to SQL. It is merely an implementation detail.
It is, in fact, also the interface. To use your example of SQLite bytecode: once your tool generates it, there is no way to feed that into SQLite. The bytecode is an implementation detail, with SQL being the public interface.
It is recognized that SQL as a target would theoretically provide a less labour-intensive path, since it integrates with what already exists, but that only holds if natural language to SQL gets solved, and is not so much harder than an alternative target that the integration advantage is lost.
A reasonable stretch goal, but if another target gets you there first, it would be foolhardy to cling to SQL. Replacing the database interface is a much simpler problem to solve.
While likely, the question asked whether any improvement has been shown with other targets, to validate that assumption. There is no benefit in merely assuming.
> And often query results must be 100% accurate and reliable.
It seems that is impossible. Even human programmers struggle to reliably convert natural language to SQL, according to the aforementioned benchmark. They are slightly better than the known alternatives, but far from perfect. But if another target can get closer to human-level performance, that is significant.
2. The language has to represent a valid computer program. That is as true of SQL as any other target. You can know that it is correct by reading it.
That said, if you have ever used these tools to generate code, you will know that they are much better at some languages than others. In the general case, the target really is the problem sometimes. Does that carry into this particular narrow case? I don't know. What do the comparison results show?
It is not super hard to find UIs designed for end users these days.
If you know the basics, such as what a relational database is, then here is a good UI: https://visualdb.com
I built a whole ERP system with it single-handedly, including the hardware, software, networking and the workstation deployments, without really breaking a sweat. We have gone off the rails somewhere.
Typically I would add any indexes initially that I know will be heavy hitters, then refine as needed based on actual use. Most databases provide tools to help identify missing indexes for common queries.
Also, if anyone reading this is using SQL Server, do not use the recommended missing indexes it provides. They have a whole host of issues, including the ordering of columns, and straight up recommend completely unnecessary indexes.
Use the query analyzer to figure out where it's doing table scans or similar, and you'll probably quickly see which index you need to create.
Keep in mind the database server tries to use indexes to minimize the amount of rows it needs to scan. So you just need enough to drastically cut down the number of rows to consider. If you have five columns in the where clause, think of which one or which combination is the most specific. This also means if you have a query with a where clause "colX = 1", yet 97% of the rows have colX equal to 1, adding and index on colX won't help. This means it can be customer-specific.
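For instance, if a customer id plus a status flag is what actually narrows things down, one composite index on those two columns is usually enough (names are hypothetical):

    -- cuts the candidate rows down before any remaining predicates are applied
    CREATE INDEX orders_customer_status ON orders (customer_id, status);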
Also keep in mind that some database servers don't index null values, thus "where colX is null" means a table scan regardless. In such cases an alternative could be to add a computed column which would have a definite value, say "coalesce(colX, -1)" if colX should only have positive values. You can then index on that, and tweak your queries to filter on the computed column.
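A SQL Server-flavoured sketch of that computed-column trick (table and column names are hypothetical; check first whether your engine really skips NULLs in indexes):

    -- computed column with a definite value, then an index on it
    ALTER TABLE dbo.Widgets
        ADD colX_key AS COALESCE(colX, -1);

    CREATE INDEX IX_Widgets_colX_key ON dbo.Widgets (colX_key);

    -- filter on the computed column instead of "colX IS NULL"
    SELECT * FROM dbo.Widgets WHERE colX_key = -1;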
Also keep in mind that if you have "order by", then the database server might ignore your index if it won't help in presenting the rows in the order dictated by the "order by" clause. We have some tables where we have multiple permutations of the same columns indexed due to this.
It's a skill, and it's very database-server dependent, just like getting to know the optimizer in your programming language of choice. It takes a fair bit of trial and error with the query analyzer.
Actually this can be done with a nested query, without joining. I prefer nested queries to joins, because a join is "global" and a nested query is "local" (global meaning it interferes with the whole SQL statement). The local one is easier to understand.
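For example, "customers with at least one order", written both ways (tables are hypothetical):

    -- join: the second table participates in the whole statement
    SELECT DISTINCT c.name
    FROM   customer c
    JOIN   customer_order o ON o.customer_id = c.id;

    -- nested query: the lookup is scoped to a single predicate
    SELECT c.name
    FROM   customer c
    WHERE  EXISTS (SELECT 1 FROM customer_order o WHERE o.customer_id = c.id);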
This is a SELECT with 5 joins:
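Sketched with a made-up schema, since the exact tables aren't the point:

    SELECT o.id, c.name, a.city, p.title, cat.name
    FROM   customer_order o
    JOIN   customer   c   ON c.id       = o.customer_id
    JOIN   address    a   ON a.id       = c.address_id
    JOIN   order_line l   ON l.order_id = o.id
    JOIN   product    p   ON p.id       = l.product_id
    JOIN   category   cat ON cat.id     = p.category_id;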
Types are respected, valid relations exist and invalid relations do not. Clean.
SQL is not going to go away. It's relatively easy to learn.
Teach SQL a spoonful at a time then, I suppose. Do not start by having the students read Codd's original papers.
OTOH SQL is not the best language: it's both too wordy and too terse, it's not composable in many important cases, it does not map nicely to set-theoretic operations. But a transpiler to SQL that would offer a vastly better experience is yet to take the world by storm.
I think the demand for staff has lowered standards considerably.
I was going to start a grilled cheese van!
I was thinking of this quote about regex, apt in this situation:
"A programmer had a problem with Sql. So he thought of using an ORM. Now he has two problems"
Choose your battles wisely
One huge problem is that learning SQL will take you nowhere, unless your schema consists of a few tables with a handful of rows at most.
There is no standard SQL, and there is no serious application development without having to delve into the sanctum sanctorum of your database vendor.
I'm going to disagree. I could call my travel agent and say "I am going to Chicago on <date>. I need to be there by <time>. Call me back with a couple of nonstop options. And I'll need a room at the Hilton and Towers for two nights."
vs. today I can spend my own time navigating travel sites, avoiding their dark patterns, wondering if I'm really getting the best prices, making sure I understand the terms (is this refundable? are any checked bags included?) etc and then do the same for the hotel booking.
If you work in a business with a travel department, or assistants who can manage your travel, you have an idea of the difference compared to the hassle of doing it all yourself.
* Relegate it to an agent: have to trust the agent, have to pay the agent, saves you time.
* DIY: have to spend time, have to have some expertise, saves you money.
A middle ground existed back in the day, too: you could visit a physical office of an airline and buy a physical ticket there, at the cost of some time and a narrower choice, but it saved you some money compared to an agent, and the clerk could provide limited assistance.
Curious why you assume you were getting the best price before?
"One who has a watch always knows what time is it. One who has two watches is never certain about anything."
Suddenly you end up in a situation where you need to prove yourself, else the smartass cousin or your own resentful self will constantly remind you how you played the fool when you could have made a much better choice, obvious in hindsight.
An agent (human or machine) that chooses the tickets for you because you can't be bothered with such mundane trivialities may be quite desirable for some.