This hits home. I am helping someone analyze medical research data. When I helped before a few years ago we spent a few weeks trying to clean the data, figure out how to run the basic analysis (linear regression, etc), only to arrive at "some" results that were never repeatable because we learned as we built.
I am doing it again now. I used Claude to import the data from CSV into a database, then asked it to help me normalize it, which produced a txt file with a lot of interesting facts about the data. Next I asked it to write a "fix data" script that will fix all the issues I told it about.
Finally, I said "give me univariate analysis, output the results into CSV / PNG and then write a separate script to display everything in a jupyter notebook".
Weeks of work into about 2 hours...
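For readers curious what that last step can boil down to, here's a minimal sketch in pandas (the column names, values, and file paths are invented for illustration, not the OP's actual data):

```python
import pandas as pd

# In practice df would come from pd.read_csv("cleaned.csv") after the
# "fix data" script; a tiny stand-in frame keeps this sketch self-contained
df = pd.DataFrame({"age": [34, 51, 29, 62], "bmi": [22.1, 27.5, 24.3, 30.2]})

# Univariate summary (count, mean, std, quartiles) for every numeric column
summary = df.describe().T
summary.to_csv("univariate_summary.csv")

# The PNG step would loop over df.select_dtypes("number").columns and
# save df[col].plot.hist() figures via matplotlib; omitted here for brevity
```

The notebook script then just reads the CSV/PNG artifacts back in and displays them.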
mritchie712 33 days ago [-]
we've built a business[0] around this workflow, but in cases where the source data isn't as simple as a CSV. Think Stripe, Hubspot, Salesforce, etc. where you'd normally need to write a ton of API calls or buy something like Fivetran. The flow for Definite is:
1. Add your sources (Postgres, S3, CRM, Quickbooks, Google Sheets, etc.)
2. We deploy standard, pre-baked data models (e.g. how do you calculate ARR using Stripe data)
3. AI answers questions using the standard models and starts updating the model with SQL for anything that's not already answered.
We spin up a datalake to store all the data (similar to this one[1]) for our customers, so it's very cost effective.
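As a toy illustration of what a pre-baked "ARR from Stripe data" model can reduce to, here's a sketch against an in-memory SQLite table (the schema and numbers are invented; real Stripe exports are far messier):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE subscriptions (
    customer_id TEXT, monthly_amount_cents INTEGER, status TEXT)""")
con.executemany(
    "INSERT INTO subscriptions VALUES (?, ?, ?)",
    [("c1", 5000, "active"), ("c2", 12000, "active"), ("c3", 5000, "canceled")],
)

# ARR = 12 x sum of active monthly recurring revenue
(arr_cents,) = con.execute(
    "SELECT 12 * SUM(monthly_amount_cents) FROM subscriptions "
    "WHERE status = 'active'"
).fetchone()
print(arr_cents / 100)  # prints 2040.0
```

The value of a standard model is that edge cases (upgrades, proration, refunds) get encoded once instead of per-analyst.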
Only if the output from Claude is correct. If not...
voidhorse 33 days ago [-]
This. I get why people have started using LLMs for this and I think it's great in theory, but the black box nature and possibility of hallucination makes it a non starter for me. Having the LLM generate scripts which you can then validate for correctness seems more plausible.
I also worry that this approach will lead to a sort of further reification of data science. While things have already trended this way, data science is not about applying a few routine formulas to a data set. Done properly, it is far more exploratory and all about building an understanding of the unique properties and significance of a particular data set. I worry the use of these tools will greatly reduce the exploratory phase and lead to analyses that simply confirm biases or typical conclusions rather than yielding new insight.
huijzer 33 days ago [-]
The output is not black box. I always see myself as responsible for the output. The models give hints.
daveguy 33 days ago [-]
Definitely the right way to approach this. You already need to know what you're doing (for validation and error checking), but if you do, it can be faster. As long as P != NP, validation is faster than coming up with the solution. My only concern is how far away from a "good" expert solution the quick LLM-plus-check solution lands. It may be worth 2 weeks of human expertise rather than a validated LLM solution in 2 hours. (And I'd question good validation of traditionally 2-week work in 2 hours.)
There's going to be a lot of moving fast and breaking things coming. Hopefully less breaking than moving.
raducu 33 days ago [-]
> Only if the output from Claude is correct. If not...
Had a task at work to clear unused metrics.
Exported a whole dashboard, thought about regexes to extract metrics out of XML (bad, I know), and asked ChatGPT to produce the one-liners to extract the data.
Got 22 used metrics.
Next day I just gave ChatGPT the whole file and asked it to spit out all the used metrics.
46 used metrics.
Asked Claude, Deepseek and Gemini the same question. Only Gemini messed it up, by missing some metrics and duplicating others.
Re-checked the one-liners ChatGPT produced.
Turns out it/I messed up when I told it to generate a list of unique metrics from a file containing just the metric names, one per line. What I wanted was a script/one-liner that would print all the metric names just once (de-duplicate), and ChatGPT, ad litteram, produced a script that only prints metrics that show up exactly once in the whole file.
In the end, just asking the LLMs to simply extract the names from the Grafana dashboard worked better (parsing out expressions, producing only unique metric names and all that), but there was no way to know for sure; the fact that 3 of the 4 LLMs produced the same output just meant it was most likely correct.
I fixed the programmatic approach and got the same result, but it was a very weird feeling asking the LLMs to just give me the result of what for me was a whole process of many steps.
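The ambiguity here is easy to reproduce: "unique metrics" can mean de-duplicate the list, or it can mean keep only names that appear exactly once. A sketch of the two readings (metric names invented):

```python
from collections import Counter

metrics = ["cpu_load", "mem_used", "cpu_load", "disk_io"]

# Reading 1 (intended): print each metric name once, i.e. de-duplicate
deduplicated = sorted(set(metrics))

# Reading 2 (what the one-liner did): keep only names occurring exactly once,
# silently dropping every metric that is used more than once
exactly_once = sorted(m for m, n in Counter(metrics).items() if n == 1)
```

Same prompt, two defensible interpretations, two different answer sets.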
HumanOstrich 33 days ago [-]
Are you sure you didn't also have a bunch of typos in your prompts? ;)
raducu 33 days ago [-]
Unlike humans, LLMs seem to deal surprisingly well with typos.
Freed from the "the other human must not be up to my exquisite eloquence" worry, and given that it's a machine I'm talking to (20 years of "the compiler is never wrong"), I've learned more about my communication inadequacies through talking with LLMs in the past 2 years than in 40 years of talking to humans.
owenthejumper 33 days ago [-]
But I am not giving Claude a csv and saying 'clean it up'. I am asking it to write me a python script to clean it up. That way I can validate the script myself.
lyu07282 33 days ago [-]
Think about it logically: Are you really sure you can validate the script yourself? If it takes you weeks to do what Claude does in some hours, it seems misplaced confidence in your capabilities.
sdenton4 33 days ago [-]
There is, in fact, a large body of work studying classes of problems which are hard to solve but easy to verify. So I'm not sure why this kind of usage is a surprise to so many people.
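Subset sum is a handy toy case of that asymmetry: finding a solution is exponential in the worst case, while checking a claimed solution is cheap (this example is illustrative, not from the thread):

```python
from itertools import combinations

def find_subset(nums, target):
    """Brute-force search for a subset summing to target: exponential in len(nums)."""
    for r in range(len(nums) + 1):
        for combo in combinations(nums, r):
            if sum(combo) == target:
                return list(combo)
    return None

def verify(subset, nums, target):
    """Cheap check of a claimed solution, no search required."""
    pool = list(nums)
    for x in subset:
        if x not in pool:
            return False
        pool.remove(x)
    return sum(subset) == target

solution = find_subset([3, 9, 8, 4, 5, 7], 15)
```

Reviewing an LLM-written cleaning script is closer to `verify` than to `find_subset`, which is the whole appeal.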
abstractbeliefs 33 days ago [-]
I'm not sure source code verification is one of those problems, though. It feels like it's often easier to write code to solve a problem than to verify that code written by someone else is correct and fault-free.
throwup238 33 days ago [-]
All processes and by extension code tolerate some level of error, even our most reliable systems. Whether LLM produced output is within that tolerance is up to each practitioner to test and verify.
I think AI has revealed that there is a lot of low hanging fruit that is very tolerant of errors across many disciplines that isn’t met by our current supply of software engineers. In my own day to day that’s a lot of low impact bash scripts that automate personal things while at work it’s sales and lead gen where it’s not a big deal if a salesperson cold calls someone who couldn’t use our product (other than the temporary embarrassment it causes both parties).
yawnxyz 33 days ago [-]
It's a lot easier to check the code / check the output of the code / spot-verify than it is to do the work itself... if I wrote my own code, I'd still have to verify it (bc I trust my own coding ability even less than Claude's lol)
I’ve come to this same conclusion. I've been able to code up something with Claude in 2 hours that would’ve taken me a week back in the day. I’ve given it Canvas CSVs and seen it run analysis on them in minutes that would’ve taken me days when I used to run R scripts and throw them into slides. This is probably just the beginning too…
squigz 33 days ago [-]
What happens when that 'weeks of work' is just shifted into the future, as you find out the LLM made things up and you have to figure out what went wrong?
fifilura 33 days ago [-]
Humans make mistakes too.
I find this "LLMs can be wrong" argument a bit tiresome, and also a bit lazy.
I feel like we have been here before. With wikipedia. With stack overflow. Or with the whole debate about c/assembler vs garbage collected languages.
Yoric 33 days ago [-]
> Humans make mistakes too.
Well, yes, but fortunately, we build computers to automate things using simple algorithms to remove the risk of such mistakes.
Except when we use LLMs, in which case we increase the risk of mistakes.
> I feel like we have been here before. With wikipedia. With stack overflow. Or with the whole debate about c/assembler vs garbage collected languages.
Well, Wikipedia is a great tool, but it is permanently weaponized.
C/Assembler vs. garbage-collected languages was about decreasing the risk (at the cost of increasing the resource requirement), so, unless I misunderstand what you write, it kinda feels like you're arguing against your side?
squigz 33 days ago [-]
Funny you mention Wikipedia, since in most professional settings (particularly research roles) you can't just cite Wikipedia. Maybe in high school that's okay, but when there are actual stakes on the table, putting some effort into your research beyond reading the Wikipedia article is probably necessary.
williamcotton 33 days ago [-]
For my ETL pipelines I have not had this issue.
arscan 33 days ago [-]
“I am doing it again now” is the operative phrase here I think. I’ve found LLMs are quite good helping me build things much better and faster in this case. Maybe not so much for stuff I haven’t done before and don’t really quite know what I’m trying to accomplish or what a good solution looks like.
fermisea 33 days ago [-]
Can I ask you to beta test my product? I'm building something like this and I want to focus on medical data (from omics to RCTs)
Cheer2171 33 days ago [-]
I really don't mean this in a rude way, but if it took you a few weeks to do that on your own, you are really bad at googling for tutorials and walkthroughs. You could have watched a one hour bootcamp video and learned how to do it yourself.
What you are saying Claude helped you do is like 15 lines of python. A few weeks? 120 hours of effort?
mritchie712 33 days ago [-]
the task above is not 15 lines of python with a real world dataset.
the tutorials you reference? yes, 15 lines of python when you're starting with titanic.csv. But a real world dataset normally takes hours or days of cleaning before it's ready for any statistical analysis.
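To make that concrete: even a two-column slice of a real-world file tends to need coercion before any statistics will run. A sketch with invented values:

```python
import pandas as pd

# Typical real-world mess: stray whitespace, sentinel strings,
# mixed decimal separators, blanks (values invented for illustration)
raw = pd.DataFrame({
    "age": ["34", " 51 ", "unknown", "62"],
    "weight_kg": ["70,5", "81.2", "", "NaN"],
})

clean = pd.DataFrame({
    "age": pd.to_numeric(raw["age"].str.strip(), errors="coerce"),
    "weight_kg": pd.to_numeric(
        raw["weight_kg"].str.replace(",", ".", regex=False), errors="coerce"),
})

# Rows unusable for analysis get dropped (or, better, flagged for review)
usable = clean.dropna()
```

Multiply this by dozens of columns, each with its own quirks, and the hours-to-days estimate stops looking exaggerated.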
Cheer2171 33 days ago [-]
Data cleaning is hard. That is not what OP said they had Claude do. They just said Claude normalized it. Normalizing data does not take days unless you are learning to do both statistics and programming for the first time ever
erikgahner 34 days ago [-]
Most of these examples/walkthroughs look like they have been generated by LLMs. They might be useful for teaching purposes, but I believe they significantly underestimate the amount of domain knowledge that often goes into data extraction, data cleaning and data wrangling.
tsumnia 34 days ago [-]
I'm not against that approach (though I am a teacher so guilty as charged).
Toy examples help teach a concept and it helps when the example is relevant to the learner's interest. However at some point, we can't design real world application examples because so much additional mess has to get thrown in there. For example, a blog for learning web development isn't really useful to many but helps outline the basics of URL parameters, GET/POST requests, database management, etc.
It is on the learner to then take those skills and use them elsewhere. Or, like I would do when I was learning, ignore the blog and make your own thing while roughly following the example.
galgia 34 days ago [-]
+ I assumed that most people will ctrl+a -> ctrl+c -> ChatGPT -> ctrl+v
tsumnia 33 days ago [-]
I will admit over reliance on AI is a major issue that we're coming to terms with right now. However to invoke playing devil's advocate, a person over relying on stimulants can also be a bad thing.
In moderation, AI can be fine and help. If you're assuming AI gets to do all the work while you sit around sipping mai tais and eating bonbons, you're going to have a rough time - which is exactly what we're starting to see with students that have been Copilot and GPTing through their classes. They're finally hitting the more complex stuff that needs creative thinking and problem solving skills that just aren't trained yet.
dkarl 33 days ago [-]
An LLM would need a lot of integrations to send the emails, Slack messages, and meeting invites to find out all the required domain knowledge. They're basically a full-fledged employee who could take on a management role at that point.
galgia 34 days ago [-]
You are right! This is here to be used when your resources do not allow you to build full-blown solutions. Yes, I used LLMs to help create examples from my existing code, but they are based on things I have put in production when the client's resources were limited and they wanted to move from point 0 to test out the potential of LLMs on their data.
lmeyerov 33 days ago [-]
Afaict this skips the evals and alignment side of LLMs. We find result quality is where most of our AI time goes when helping teams across industries and building our own LLM tools. Calling an API and getting bad results is the easy part, while ensuring good results is the part we get paid for.
If you look at tools like dspy, even if you disagree with their solutions, much of their effort is on helping get good vs bad results. In practice, I find different LLM use cases to have different correctness approaches, but it's not that many. I'd encourage anyone trying to teach here to always include how to get good results for every method presented, otherwise it is teaching bad & incomplete methods.
plaidfuji 34 days ago [-]
This is where things are headed. All that ridiculous busywork that goes into ETL and modeling pipelines… it’s going to turn into “here’s a pile of data that’s useful for answering a question, here’s a prompt that describes how to structure it and what question I want answered, and here’s my oauth token to get it done.” So much data cleaning and prep code will be scrapped over the next few years…
benrutter 34 days ago [-]
I'm definitely biased because my day job is writing ETL pipelines and supporting software, and my current side project is a data contracts library for helping the above[0]. Still I'm not sure I see this happening.
80% of the focus of an ETL pipeline is on ensuring edge cases are handled appropriately (i.e. not producing models from potentially erroneous data, dead-letter queuing unknown fields, etc.).
I think an LLM would be great for "take this json and make it a pandas dataframe", but a lot less great for interact with this billing API to produce auditable payment tables.
For areas that are reliability focused, LLMs still need a lot more improvements to be useful.
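For contrast, the "easy" case really is short; pandas can flatten nested JSON in one call (field names invented for illustration):

```python
import pandas as pd

records = [
    {"id": 1, "user": {"name": "Ana", "plan": "pro"}, "total": 42.0},
    {"id": 2, "user": {"name": "Bo", "plan": "free"}, "total": 0.0},
]

# Nested dicts become dotted columns: "user.name", "user.plan"
df = pd.json_normalize(records)
```

The billing-API case is hard for exactly the opposite reason: the transformation is trivial, but the correctness and audit requirements are not.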
> I think an LLM would be great for "take this json and make it a pandas dataframe", but a lot less great for interact with this billing API to produce auditable payment tables.
Yeah, it's great....so long as you don't care that it randomly screws up the conversion 10% of the time.
My first thought, when I saw the post title, was that this is the 2025 equivalent to people using MapReduce for a 1MB dataset. LLMs certainly have good applications in data pipelines, but cleaning structured data isn't it.
galgia 34 days ago [-]
Yes, LLMs are not always the best option, they are an option. Sometimes requirements of the project are such that they are also the best option.
Inappropriate tools are always an option? I can cut a cake with a jackhammer, but....
Anyway, like I said, there are certainly good applications of LLMs, and this is probably one? I wouldn't describe "do market research on prices" as a traditional "data pipeline", but that's just me, I guess.
daxfohl 33 days ago [-]
I think you'd tell the LLM to design the pipeline, not be the pipeline. That way you can see exactly what it's done and tweak as needed. Plus should be way more cost effective.
icedchai 33 days ago [-]
Hah. I remember being forced to use MapReduce for a tiny dataset, back in the early 2010's. Hadoop was all the rage.
miningape 34 days ago [-]
"lemme just fire up a dbt workflow to analyse this CSV file"
tesch1 33 days ago [-]
You may have meant that sarcastically, but I just did that for 2 CSV files that I needed to do a bunch of cleanups and joins on to analyze. With LLM help the whole adventure was easy.
miningape 33 days ago [-]
What I really like to do for this is load it into SQLite; the CLI has built-in commands for reading/writing CSV files. And it's queryable with SQL, which makes a great jumping-off point for basic cleaning, joining and analysis.
This also, I'd argue, makes the job easier with LLMs, since you can ask for a SQL query which you can validate / reason about, rather than relying on the model to transform the data itself (which I've seen a lot under this post).
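A minimal version of that flow from Python (the sqlite3 CLI equivalent is `.mode csv` followed by `.import file.csv table`); file and column names are made up:

```python
import csv
import sqlite3

# Stand-in for an existing CSV on disk
with open("orders.csv", "w", newline="") as f:
    csv.writer(f).writerows(
        [["order_id", "amount"], ["1", "9.99"], ["2", "9.99"], ["3", "25.00"]])

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (order_id INTEGER, amount REAL)")
with open("orders.csv", newline="") as f:
    rows = list(csv.DictReader(f))
con.executemany("INSERT INTO orders VALUES (:order_id, :amount)", rows)

# From here, cleaning/joining is plain SQL you can read and validate
top = con.execute(
    "SELECT amount, COUNT(*) FROM orders "
    "GROUP BY amount ORDER BY 2 DESC").fetchall()
```

The LLM writes the query; you review three lines of SQL instead of auditing a transformed dataset row by row.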
CalRobert 33 days ago [-]
Honestly I spend ten times as much effort figuring out people's sloppy notebooks or pandas stuff than when they just use DBT and SQL. And 90% of the time SQL is all they needed.
kipukun 33 days ago [-]
For your Wimsey library, using "pipe" to validate the contracts would seem to me to drastically slow down the Polars query, because the UDF pushes the query out of Rust into Python. I think a cool direction would be a "compiler" which takes in a contract and spits out native queries for a variety of dataframe libraries (pandas/polars/pyspark). It becomes harder to define how to error on a test contract, but that can be the secret sauce.
benrutter 33 days ago [-]
Actually you're almost 100% describing how Wimsey works! It uses native dataframe code rather than a UDF of some kind. Under the hood it uses Narwhals, which converts Polars-style expressions into native pandas/polars/spark/dask code with super minimal overhead.
If you're using a lazy dataframe (via Polars, Spark, etc.) Wimsey will force collection, so that can have speed implications. Reason being that I can't yet find a cross-library way of embedding assertions that fail later down the line.
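For readers who haven't met a data contract before, the core idea reduces to something like the following generic sketch (this is not Wimsey's actual API, just an illustration with invented column names):

```python
import pandas as pd

# A contract: column -> list of predicates the data must satisfy
contract = {
    "age": [lambda s: s.notna().all(), lambda s: s.between(0, 120).all()],
    "visit_date": [lambda s: s.notna().all()],
}

def validate(df: pd.DataFrame, contract) -> list[str]:
    """Return a list of human-readable failures; empty means the frame passes."""
    failures = []
    for col, checks in contract.items():
        if col not in df.columns:
            failures.append(f"missing column: {col}")
            continue
        for i, check in enumerate(checks):
            if not check(df[col]):
                failures.append(f"{col}: check #{i} failed")
    return failures

df = pd.DataFrame({"age": [34, 130], "visit_date": ["2024-01-02", None]})
problems = validate(df, contract)
```

The interesting engineering (as discussed above) is pushing those checks into the dataframe engine's native expressions instead of Python callables.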
galgia 34 days ago [-]
I believe that LLMs will become better and better in the near future, and LLM-enriched pipelines will replace classic approaches, drastically simplifying ETL flows.
isaacremuant 33 days ago [-]
Not that I don't love LLMs and play with them and their potential, but if we don't get proper mechanisms that ensure quality and consistency, it's not really a substitute for what we have.
It's very easy to produce something that seemingly works but you can't attest to its quality. The problem is producing something resilient, that is easy to adapt and describes the domain of what you want to do.
If all these things are so great, then why do I still need to do so many things to integrate a big-tech cloud agent with a popular tool? Why is it so costly or limited?
You can't simply wish for a problem not to happen. Someone owns the troubleshooting and the modification and they need to understand the system they're trying to modify.
Replacing scrapers with LLMs is an easy and obvious thing, especially when you don't care about quality to a high degree. Other systems, such as financial ones, don't have that luxury.
benrutter 33 days ago [-]
You may be right! I guess we'll find out soon.
One thing I'd be wary of is what "LLM-enriched pipelines" look like. If it's "write a sentence and get a pipeline", then I think that does massively simplify the amount of work, but there's another reality where people use LLMs to get more features out of existing data, rather than doing the same transformations we do now. Under that one, ETL pipelines would end up taking more time and being more complex.
Yoric 34 days ago [-]
But at what cost?
We're in an energy/environmental crisis, and we're replacing simple pipelines with (unreliable) gas factories?
danielbln 33 days ago [-]
Cost per token has cratered over the last two years, and that's not just lighting VC money on fire; efficiency gains are being made left and right.
Yoric 33 days ago [-]
How much do we need to progress before it becomes comparable in terms of energy to the (often already rather energy-inefficient) data pipelines we've been using so far?
Recall that while the cost per token may decrease, CoT multiplies the number of tokens by several orders of magnitude.
galgia 33 days ago [-]
LLMs are not the most efficient way to solve the problem, but they can solve it.
Yoric 32 days ago [-]
They can do it, they're just slower, less reliable, and orders of magnitude more energy-expensive.
But yes, they're potentially easier to set up.
drunkpotato 33 days ago [-]
This is a head-scratcher of a take. Have you actually done any in-depth work on data pipelines and analytics tooling? If so, what precisely do you see LLMs making easier?
I tried using enterprise ChatGPT to write a query to load some JSON data into a data warehouse. I was impressed with how good a job it did, but it still required several rounds of refinement and hand-holding, and the end result was almost, but not quite, correct. So I'm not coming at this from the perspective of hating LLMs a priori, but I am unimpressed with the hype and over-selling of their capabilities. In the end, it was no faster than writing the query myself, but it wasn't slower either, so I can see it being somewhat helpful in limited conditions.
Unless the technology makes another quantum leap improvement at the same time the price drops like a stone, I don't see LLMs coming anywhere close to your claim.
That said, I expect to see a huge amount of snake oil and enterprise dollars wastefully burned on executive pipe dreams of "here's a pile of data now magic me a better business!" in the next few years of LLM over-hyped nonsense. There's always a quick buck to make in duping clueless execs drooling over replacing pesky, annoying, "over-paid" tech people.
robwwilliams 33 days ago [-]
Let me give you a complementary perspective. I have the same problems all of you have, but I work in a small lab team of PhD biologists who generate huge omics data sets and even larger lightsheet microscopy and MRI datasets but don't know how to do a VLOOKUP in Excel. And who do not know the exotic acronyms LIMS, QA, QC, or SQL. Yes, really.
What do we typically do in academic biomedical research in this situation?
The lead PI looks around the lab and finds a grad student or postdoc who knows how to turn on a computer and if very lucky also has had 6 months of experience noodling around with R or Python. This grad or postdoc is then charged with running some statistical analyses without any training whatsoever in data science. What is an outlier anyway, what do you mean by “normalize”, what is metadata exactly?
You get my drift: It is newbies in data science and programming (often 40-and 50-year-olds) leading novices (20- and 30-year-olds) to the slaughter. Might contribute to some lack of replicability ;-)
And it has been this way in the majority of academic labs since I started using CP/M on an Apple II in 1980 at UC Davis in an electrophysiology lab in Psychology, to the first Macs I set up at Yale in a developmental neurobiology lab in 1984, and up to the point at which I set up my own lab in neurogenetics at the University of Tennessee with a pair of Mac IIs in 1989 and $150,000 in set-up funds, just enough for me to hire one very inexperienced technician to help me do everything.
So in this context I hope all of you can appreciate that ANY help in bringing some real data science into mom-and-pop laboratories would be a huge huge boon.
And please god, let it be FOSS.
drunkpotato 33 days ago [-]
I feel you, and LLMs are no doubt a boon in tooling to help in this kind of scenario. I'm not poo-pooing LLMs in general; they are very cool! I wish they were allowed to just be very cool while we incorporate them into our tooling and workflows, rather than over-hyped.
icedchai 33 days ago [-]
You have more faith in LLMs than I do. The reality is it will probably get you 70 to 80% there, then you'll spend a ton of time debugging / fixing your pipelines, only to realize it would've been simpler, faster, and more reliable to not involve an LLM in the first place.
drunkpotato 33 days ago [-]
I believe that we'll learn how to incorporate LLMs to improve parts of data pipelines, particularly those that involve extracting unstructured or semistructured data into structured data, especially if it can provide a reliability score or confidence level with the extract. I'm much more skeptical of claims beyond that.
I also think there are unanswered questions about reliability, cost (dollar and energy), and AI business models; I don't think OpenAI can burn $2+ to make a dollar forever.
owenthejumper 33 days ago [-]
Unless you can provide some "citation", I don't think you are right. I do this every day now and it gets me 99% there with very little debugging.
icedchai 33 days ago [-]
As always, "it depends." How simple are your pipelines? Single CSV? Sensible column names that are totally unambiguous? Consistent, clean data? Then LLMs are probably fine...
miningape 34 days ago [-]
This is completely wrong. If anything, an increase in the usage of LLMs to generate small pipelines will lead to increased demand for professional pipelines to be built, because if any small thing breaks, the dashboards/features break, which is immediately noticeable. I think you'll see a big increase in the number of models a data scientist can create, but making those Python notebooks production-ready can't be done by an LLM. That is to say, as analysts create more potential use cases, there will be more demand to get those implemented.
There's so much that goes into ensuring the reliability, scalability and monitoring of production ready data pipelines. Not to mention the integration work for each use case. An LLM will give you short term wins at the cost of long term reliability - which is exactly why we already have DE teams to support DA and DS roles.
vharuck 34 days ago [-]
>This is completely wrong. If anything, an increase in the usage of LLMs to generate small pipelines will lead to increased demand for professional pipelines to be built, because if any small thing breaks, the dashboards/features break, which is immediately noticeable. I think you'll see a big increase in the number of models a data scientist can create, but making those Python notebooks production-ready can't be done by an LLM. That is to say, as analysts create more potential use cases, there will be more demand to get those implemented.
I agree. There is a lot of data people want that isn't made because of labor costs. Not just in quantity, but difficulty. If you can only afford to hire one analyst, and the analyst's time is only spent on cleaning data and generating basic sums, then that's all you'll get. But if the analyst can save a lot of time with LLMs, they'll have time to handle more complicated statistics using those counts like forecasts or other models.
benjiro 33 days ago [-]
> If you can only afford to hire one analyst, and the analyst's time is only spent on cleaning data and generating basic sums, then that's all you'll get. But if the analyst can save a lot of time with LLMs, they'll have time to handle more complicated statistics using those counts like forecasts or other models.
That applies to so many other jobs.
My productivity as a single IT developer making a rather large and complex system skyrocketed when LLMs became actually useful (around the GPT-4 era).
Work where I may have spent hours dealing with a bug becomes maybe 10 minutes, because my brain was looking past some obvious issue that an LLM instantly spotted (or gave suggestions that focused me on the issue).
Implementing features that may have taken days reduces to a few hours.
Time taken to learn things massively reduces, because you can ask for specific examples. A lot of open source projects are poorly documented, missing examples, or just badly structured; just ask the LLM and it puts you in the right direction.
Now, ... this is all from the perspective of a 25+ year experienced dev. The issue I fear for more is people who are starting out, writing code but not understanding why or how things work. I remember people, before LLMs, coming in for senior jobs who did not even have basic SQL understanding, because they non-stop used ORMs. But they forgot that some (or a lot) of this knowledge was not transferable to different companies that used SQL or other ORMs that may work differently.
I suspect that we are going to see a generation of employees so used to LLMs doing the work that they don't understand how or why specific functions or data structures are needed, and then get stuck in hours of LLM loop questioning because they cannot point the LLM to the actual issue!
At times I think, "I wish this was available 20 years ago", but then question that statement very fast. Would I be the same dev today if I had relied non-stop on LLMs and not gritted my teeth on issues to develop this specific skill set?
I see more productivity from senior devs, more code turnout from juniors (or code monkeys), but a gap where the skills are an issue. And let's not forget the potential issue of LLM poisoning, with years of data that feeds back on itself.
galgia 34 days ago [-]
I see it as a gray area - long term there will be a need for both and you will have just one tool to choose from when presented with time-budget-quality constraints.
miningape 34 days ago [-]
Yeah I can also see it very much depending on the demands - I'm definitely not saying every pipeline has to be the most reliable, scalable piece of software ever written.
If a small script works for you and your use case / constraints there's nothing I can say against it, but when you do grow past a certain point you'll need pipelines built in a proper way. This is where I see the increased demand since the scrappy pipelines are already proving their value.
galgia 33 days ago [-]
Exactly, scale after you need to.
ekianjo 34 days ago [-]
This would require massively more compute than regular pipelines...
plaidfuji 34 days ago [-]
(1) that delta will decrease quickly, and (2) corporations will gladly pay for compute over headcount to maintain fragile data pipelines
timr 34 days ago [-]
> (1) that delta will decrease quickly
Is your data pipeline O(n^3) in the number of tokens? If not, then no, it won't.
ekianjo 34 days ago [-]
The price will go down, but LLMs reaching 100% accuracy and reliability is another story. We are nowhere close right now.
galgia 34 days ago [-]
If your problem is compute, you are already optimizing. This is here for all the steps before you start thinking latency-compute. Not all use cases are made equal.
mistrial9 33 days ago [-]
No, not so simple. The simplicity of this idea exerts a gravitational pull on the human mind's mental model. Meanwhile, LLMs are like a non-reproducible cotton-candy machine. Quality will be an elusive light at the end of the tunnel, not a result, for non-trivial systems IMHO. Simple systems? Sure, but economics will assign low-skill humans to the task, and other problems emerge.
What is the intoxication that assumes the engineering disciplines are now suddenly auto-automatable?
Keyframe 33 days ago [-]
Not data pipelines, not yet at least, since those usually require a high degree of accuracy (depending on the company, of course). Where I see it (already) moving in is data exploration, which is effectively the data pipeline before the data pipelines are developed.
galgia 33 days ago [-]
Good point! LLMs are best when you are starting from point 0.
galgia 34 days ago [-]
Exactly!
fire_lake 33 days ago [-]
Big song and dance to call the OpenAI REST endpoint.
hrpnk 33 days ago [-]
What's missing in these examples are evals and any advice on creating a verification set that can be used to assert that the system continues to work as designed. Models and their prompting patterns change; one cannot just pretend that a 1-time automation will continue to work indefinitely when the environment is constantly changing.
refactor_master 34 days ago [-]
This ETL is nice, but ours is 100k LOC, and spans multiple departments and employments, and I haven’t yet been able to make an LLM write a convincing test that wasn’t already solved by strict typing.
I’m not trying to move the goal post here, but LLMs haven’t replaced a single headcount. In fact, it’s only been helping our business so far.
javierluraschi 32 days ago [-]
For those interested, you can use LLMs to process CSVs in Hal9 and also generate Streamlit apps. In addition, the code is open source, so if you want to help us improve our RAG or add new tools, you are more than welcome.
0 - https://www.definite.app/
1 - https://www.definite.app/blog/cloud-iceberg-duckdb-aws
There's going to be a lot of moving fast and breaking things coming. Hopefully less breaking than moving.
Had a task at work to clear unused metrics.
Exported a whole dashboard, thought about regexes to extract metrics out of XML (bad, I know), and asked ChatGPT to produce the one-liners that would produce the data.
Got 22 used metrics.
Next day I just gave ChatGPT the whole file and asked it to spit out all the used metrics.
46 used metrics.
Asked Claude, DeepSeek and Gemini the same question. Only Gemini messed it up by missing some and duplicating some.
Re-checked the one-liners ChatGPT produced. Turns out it (or I) messed up when I told it to generate a list of unique metrics from a file containing just the metric names, one per line. What I wanted was a script/one-liner that would print each metric name just once (de-duplicate); ChatGPT, taking me literally, produced a script that only prints metrics that show up exactly once in the whole file.
In the end, just asking the LLMs to simply extract the names from the Grafana dashboard worked better, parsing out expressions, producing only unique metric names and all that, but there was no way to know for sure; that 3 of the 4 LLMs produced the same output only meant it was most likely correct.
I fixed the programmatic approach and got the same result, but it was a very weird feeling asking the LLMs to just give me the result of what for me was a whole process of many steps.
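The ambiguity described above is easy to reproduce in shell terms: "unique" as `uniq -u` means lines occurring exactly once, while what was wanted is de-duplication. A small illustration (file name and metric names invented):

```shell
# metrics.txt: one metric name per line, with repeats
printf 'cpu_usage\nmem_free\ncpu_usage\ndisk_io\n' > metrics.txt

# What was wanted: each metric printed once (de-duplicate)
sort -u metrics.txt
# -> cpu_usage, disk_io, mem_free

# What the literal "list of unique metrics" produced: only metrics
# that appear exactly once in the whole file
sort metrics.txt | uniq -u
# -> disk_io, mem_free   (cpu_usage silently dropped)
```

Both commands run without error and look plausible, which is exactly why the 22-vs-46 discrepancy only surfaced on re-checking.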
Freed from "the other human must not be up to my exquisite eloquence," and given that it's a machine I'm talking to (20 years of "the compiler is never wrong"), I've learned more about my communication inadequacies through talking with LLMs in the past 2 years than in 40 years of talking to humans.
I think AI has revealed that there is a lot of low hanging fruit that is very tolerant of errors across many disciplines that isn’t met by our current supply of software engineers. In my own day to day that’s a lot of low impact bash scripts that automate personal things while at work it’s sales and lead gen where it’s not a big deal if a salesperson cold calls someone who couldn’t use our product (other than the temporary embarrassment it causes both parties).
I find this "LLMs can be wrong" argument a bit tiresome, and also a bit lazy.
I feel like we have been here before. With wikipedia. With stack overflow. Or with the whole debate about c/assembler vs garbage collected languages.
Well, yes, but fortunately, we build computers to automate things using simple algorithms to remove the risk of such mistakes.
Except when we use LLMs, in which case we increase the risk of mistakes.
> I feel like we have been here before. With wikipedia. With stack overflow. Or with the whole debate about c/assembler vs garbage collected languages.
Well, Wikipedia is a great tool, but it is permanently weaponized.
C/Assembler vs. garbage-collected languages was about decreasing the risk (at the cost of increasing the resource requirement), so, unless I misunderstand what you write, it kinda feels like you're arguing against your side?
What you are saying Claude helped you do is like 15 lines of python. A few weeks? 120 hours of effort?
The tutorials you reference? Yes, 15 lines of Python when you're starting with the titanic.csv. But a real-world dataset normally takes hours or days of cleaning before it's ready to run any statistical analysis on.
Toy examples help teach a concept and it helps when the example is relevant to the learner's interest. However at some point, we can't design real world application examples because so much additional mess has to get thrown in there. For example, a blog for learning web development isn't really useful to many but helps outline the basics of URL parameters, GET/POST requests, database management, etc.
It is on the learner to then take those skills and use them elsewhere. Or like it would do when I was learning, ignore the blog and make your own thing but roughly following the example.
In moderation, AI can be fine and help. If you're assuming AI gets to do all the work while you sit around sipping mai tais and eating bonbons, you're going to have a rough time - which is exactly what we're starting to see with students that have been Copilot and GPTing through their classes. They're finally hitting the more complex stuff that needs creative thinking and problem solving skills that just aren't trained yet.
If you look at tools like dspy, even if you disagree with their solutions, much of their effort is on helping get good vs bad results. In practice, I find different LLM use cases to have different correctness approaches, but it's not that many. I'd encourage anyone trying to teach here to always include how to get good results for every method presented, otherwise it is teaching bad & incomplete methods.
80% of the focus of an ETL pipeline is in ensuring edge cases are handled appropriately (i.e. not producing models from potentially erroneous data, dead-letter queuing unknown fields, etc).
I think an LLM would be great for "take this json and make it a pandas dataframe", but a lot less great for "interact with this billing API to produce auditable payment tables".
For areas that are reliability focused, LLMs still need a lot more improvements to be useful.
[0] https://github.com/benrutter/wimsey
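The dead-letter pattern mentioned above can be sketched in a few lines; the record shape and validation rules here are invented for illustration:

```python
from dataclasses import dataclass, field

@dataclass
class PipelineResult:
    clean: list = field(default_factory=list)        # records safe to load
    dead_letter: list = field(default_factory=list)  # records held for review

def validate(record: dict) -> bool:
    # Illustrative rules: required field present and amount is numeric.
    return "id" in record and isinstance(record.get("amount"), (int, float))

def run_pipeline(records: list) -> PipelineResult:
    # Route failures aside instead of letting them poison downstream models.
    result = PipelineResult()
    for record in records:
        (result.clean if validate(record) else result.dead_letter).append(record)
    return result

rows = [
    {"id": 1, "amount": 9.99},
    {"id": 2, "amount": "N/A"},  # erroneous: amount not numeric
    {"amount": 5.00},            # erroneous: unknown shape, no id
]
result = run_pipeline(rows)
print(len(result.clean), len(result.dead_letter))  # 1 2
```

The point of the pattern is that nothing is silently dropped: the dead-letter list is auditable, which is precisely the property a free-form LLM transformation does not give you.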
Yeah, it's great....so long as you don't care that it randomly screws up the conversion 10% of the time.
My first thought, when I saw the post title, was that this is the 2025 equivalent to people using MapReduce for a 1MB dataset. LLMs certainly have good applications in data pipelines, but cleaning structured data isn't it.
There is one browser-use price matching example that would be impossible to do without a full-blown data science team right now: https://github.com/Pravko-Solutions/FlashLearn/tree/main/exa...
Anyway, like I said, there are certainly good applications of LLMs, and this is probably one? I wouldn't describe "do market research on prices" as a traditional "data pipeline", but that's just me, I guess.
I'd also argue this makes the job easier with LLMs, since you can ask for a SQL query which you can validate and reason about, rather than relying on the LLM to transform the data itself (which I've seen a lot under this post).
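One way to do that validation is to run the generated query against a tiny in-memory sample before pointing it at real data; the query and schema below are invented for illustration, with the query standing in for LLM output:

```python
import sqlite3

# Pretend this query came back from the LLM; we validate it against a
# small in-memory sample before running it on real data.
generated_sql = "SELECT country, SUM(amount) AS total FROM orders GROUP BY country"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (country TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [("DE", 10.0), ("DE", 5.0), ("US", 7.5)])

# EXPLAIN catches syntax errors and missing columns without touching data.
conn.execute("EXPLAIN " + generated_sql)

# Then sanity-check the result on rows whose answer we know by hand.
rows = dict(conn.execute(generated_sql).fetchall())
print(rows)  # {'DE': 15.0, 'US': 7.5}
```

Unlike an LLM rewriting the rows directly, the query is a fixed artifact: once it passes on the sample, it behaves the same way on the full table.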
If you're using a lazy dataframe (via polars, spark, etc), Wimsey will force collection, so that can have speed implications. The reason being that I haven't yet found a cross-language way of embedding assertions that can fail later down the line.
It's very easy to produce something that seemingly works but you can't attest to its quality. The problem is producing something resilient, that is easy to adapt and describes the domain of what you want to do.
If all these things are so great, then why do I still need to do so many things to integrate a big-tech cloud agent with a popular tool? Why is it so costly or limited?
UX matters, validation matters, reliability matters, cost matters.
You can't simply wish for a problem not to happen. Someone owns the troubleshooting and the modification and they need to understand the system they're trying to modify.
Replacing scrapers with LLMs is an easy and obvious thing, especially when you don't care about quality to a high degree. Other systems, such as financial ones, don't have that luxury.
One thing I'd be wary of is what "LLM-enriched pipelines" look like. If it's "write a sentence and get a pipeline", then I think that does massively simplify the amount of work, but there's another reality where people use LLMs to get more features out of existing data, rather than doing the same transformations we do now. Under that one, ETL pipelines would end up taking more time, and being more complex.
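The "more features out of existing data" path might look like adding an LLM-derived column; `classify_ticket` here is a hypothetical stand-in for the LLM call (a keyword stub so the sketch runs offline):

```python
# Hypothetical enrichment step: derive a new column from free text.
# classify_ticket stands in for an LLM call; a keyword stub is used
# here so the sketch runs without any API access.
def classify_ticket(text: str) -> str:
    return "billing" if "invoice" in text.lower() else "other"

tickets = [
    {"id": 1, "body": "My invoice is wrong"},
    {"id": 2, "body": "App crashes on login"},
]
enriched = [{**t, "category": classify_ticket(t["body"])} for t in tickets]
print([t["category"] for t in enriched])  # ['billing', 'other']
```

Each such derived column is a new thing to test, monitor, and re-validate on model changes, which is why enrichment can make pipelines more complex rather than less.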
We're in an energy/environmental crisis, and we're replacing simple pipelines with (unreliable) gas factories?
Recall that while the cost per token may decrease, chain-of-thought (CoT) multiplies the number of tokens by several orders of magnitude.
But yes, they're potentially easier to set up.
I tried using enterprise ChatGPT to write a query to load some JSON data into a data warehouse. I was impressed with how good a job it did, but it still required several rounds of refinement and hand-holding, and the end result was almost, but not quite, correct. So I'm not coming at this from the perspective of hating LLMs a priori, but I am unimpressed with the hype and over-selling of its capabilities. In the end, it was no faster than writing the query myself, but it wasn't slower either, so I can see it being somewhat helpful in limited conditions.
Unless the technology makes another quantum leap improvement at the same time the price drops like a stone, I don't see LLMs coming anywhere close to your claim.
That said, I expect to see a huge amount of snake oil and enterprise dollars wastefully burned on executive pipe dreams of "here's a pile of data now magic me a better business!" in the next few years of LLM over-hyped nonsense. There's always a quick buck to make in duping clueless execs drooling over replacing pesky, annoying, "over-paid" tech people.
What do we typically do in academic biomedical research in this situation?
The lead PI looks around the lab and finds a grad student or postdoc who knows how to turn on a computer and if very lucky also has had 6 months of experience noodling around with R or Python. This grad or postdoc is then charged with running some statistical analyses without any training whatsoever in data science. What is an outlier anyway, what do you mean by “normalize”, what is metadata exactly?
You get my drift: It is newbies in data science and programming (often 40-and 50-year-olds) leading novices (20- and 30-year-olds) to the slaughter. Might contribute to some lack of replicability ;-)
And it has been this way in the majority of academic labs since I started using CP/M on an Apple II in 1980 at UC Davis in an electrophysiology lab in Psychology, to the first Macs I set up at Yale in a developmental neurobiology lab in 1984, and up to the point at which I set up my own lab in neurogenetics at the University of Tennessee with a pair of Mac IIs in 1989 and $150,000 in set-up funds, just enough for me to hire one very inexperienced technician to help me do everything.
So in this context I hope all of you can appreciate that ANY help in bringing some real data science into mom-and-pop laboratories would be a huge huge boon.
And please god, let it be FOSS.
I also think there are unanswered questions about reliability, cost (dollar and energy), and AI business models; I don't think OpenAI can burn $2+ to make a dollar forever.
There's so much that goes into ensuring the reliability, scalability and monitoring of production ready data pipelines. Not to mention the integration work for each use case. An LLM will give you short term wins at the cost of long term reliability - which is exactly why we already have DE teams to support DA and DS roles.
I agree. There is a lot of data people want that never gets produced because of labor costs. Not just in quantity, but difficulty. If you can only afford to hire one analyst, and the analyst's time is only spent on cleaning data and generating basic sums, then that's all you'll get. But if the analyst can save a lot of time with LLMs, they'll have time to handle more complicated statistics built on those counts, like forecasts or other models.
That applies to so many other jobs.
My productivity as a solo IT developer, building a rather large and complex system, mostly skyrocketed when LLMs became actually useful (around the GPT-4 era).
Work where I might have spent hours dealing with a bug becomes maybe 10 minutes, because my brain was looking past some obvious issue that an LLM instantly spotted (or gave suggestions that focused me on the issue).
Implementing features that might have taken days reduces to a few hours.
Time taken to learn things massively reduces because you can ask for specific examples. Where a lot of open source projects are poorly documented, missing examples, or just badly structured, just ask the LLM and it puts you in the right direction.
Now, this is all from the perspective of a dev with 25+ years of experience. The issue I fear more is people who are starting out, writing code but not understanding why or how things work. Even before LLMs, I remember people coming in for senior jobs who did not have even a basic understanding of SQL, because they non-stop used ORMs. They forgot that some (or a lot) of this knowledge was not transferable to companies that used raw SQL or other ORMs that work differently.
I suspect that we are going to see a generation of employees so used to LLMs doing the work that they don't understand how or why specific functions or data structures are needed, and then get stuck in hours of looping LLM questioning because they cannot point the LLM at the actual issue!
At times I think I wish this had been available 20 years ago, but then I question that statement very quickly. Would I be the same dev today if I had relied non-stop on LLMs and not gritted my teeth on issues to develop this specific skill set?
I see more productivity from senior devs, more code turnout from juniors (or code monkeys), but a gap where the skills are an issue. And let's not forget the potential problem of LLM poisoning, with years of data that feeds back on itself.
If a small script works for you and your use case / constraints there's nothing I can say against it, but when you do grow past a certain point you'll need pipelines built in a proper way. This is where I see the increased demand since the scrappy pipelines are already proving their value.
Is your data pipeline O(n^3) in the number of tokens? If not, then no, it won't.
- https://hal9.ai
- https://github.com/hal9ai/hal9