> By providing the beta version of FireDucks free of charge and enabling data scientists to actually use it, NEC will work to improve its functionality while verifying its effectiveness, with the aim of commercializing it within FY2024.
In other words, it's free only to trap you.
tombert 14 days ago [-]
Thanks for the warning.
I nearly made the mistake of merging Akka into a codebase recently; fortunately I double-checked the license and noticed it was the bullshit BUSL, and it would have potentially cost my employer tens of thousands of dollars a year [1]. I ended up switching everything to Vert.x, but I really hate how normalized it's become for these ostensibly open source projects to sneak scary expensive licenses into things.
[1] Yes I'm aware of Pekko now, and my stuff probably would have worked with Pekko, but I didn't really want to deal with something that by design is 3 years out of date.
cogman10 14 days ago [-]
IMO, you made a good decision ditching akka. We have an akka app from before the BUSL and it is a PITA to maintain.
Vert.x and other frameworks are far better and easier for most devs to grok.
switchbak 14 days ago [-]
> We have an akka app before the BUSL and it is a PITA to maintain
I would imagine the non-Scala use case to be less than ideal.
In Scala land, Pekko (the open source fork of Akka) is the way to go if you need compatibility. Personally, I'd avoid new versions of Akka like the plague, and just use more modern alternatives to Pekko/Akka anyway.
I'm not sure what Lightbend's target market is. Maybe they think they have enough critical mass to merit the price tag for companies like Sony/Netflix/Lyft, etc. But they've burnt their bridge right into the water with everyone else, so I see them fading into irrelevance over the next few years.
tombert 14 days ago [-]
I actually do have some decision-making power in regards to what tech I use for my job [1] at a mid-size (by tech standards) company, and my initial plan was to use Akka for the thing I was working on, since it more or less fit into the actor model perfectly.
I'm sure that Lightbend feels that their support contract is the bee's knees and worth whatever they charge for it, but it's a complete non-starter for me, and so I look elsewhere.
Vert.x's actor-ish model is a bit different, but it's not all that different, and considering that Vert.x tends to perform extremely well in benchmarks, it doesn't really feel like I'm losing a lot by using it instead of Akka, particularly since I'm not using Akka Streams.
[1] Normal disclaimer: I don't hide my employment history, and it's not hard to find, but I politely ask that you do not post it here.
wmfiv 14 days ago [-]
I've found actors (Akka specifically) to be a great model when you have concurrent access to fine grained shared state. It provides such a simple mental model of how to serialize that access. I'm not a fan as a general programming model or even as a general purpose concurrent programming model.
tombert 14 days ago [-]
Vert.x has the "Verticle" abstraction, which more or less corresponds to something like an Actor. It's close enough to where I don't feel like I'm missing much by using it instead of Akka.
Weryj 14 days ago [-]
What are your criticisms of actors as a general purpose concurrent programming model?
tombert 14 days ago [-]
Yeah, Vert.x actually ended up being pretty great. I feel like it gives me most of the cool features of Akka that I actually care about, but it allows you to gradually move into it; it can be a full-on framework, but it can also just be a decent library to handle concurrency.
Plus the license isn't stupid.
poulpy123 13 days ago [-]
>it was the bullshit BUSL
I didn't know the licence, so I had a look, but I don't see what's bullshit about it. It's not a classical open source licence, but it's pretty close, and much better than closed source.
> and it would have potentially cost my employer tens of thousands of dollars a year
If your employer is not providing its software open source, there is nothing shocking about having to pay for the software it uses.
tombert 13 days ago [-]
> I didn't know the licence and had a look, but I don't see what is bullshit with it.
I just think it's a proprietary license that is trying to LARP as an OSS license. It sneaks in language that makes it so it's unclear how much it will actually cost you to use it. It makes me terrified to import anything touching it because I don't want to risk accidentally costing my employer millions of dollars.
I don't really see how it's "pretty close" to an OSS license. Part of an OSS license is that I can use the code for whatever I want, which is decidedly not the case with BUSL. I do appreciate that stuff eventually becomes Apache, so I guess that's better than nothing, but I'd rather just avoid the stuff entirely, or only use the Apache licensed stuff.
I also don't really like the idea that I could contribute to Akka, have my contributions monetized by Lightbend, and yet not even be allowed to use my own contributions without paying them a fee. I know that CLAs aren't exactly new in the OSS world, but at least if I were to make a contribution to Ubuntu, I'm still allowed to run Ubuntu server for free, with my contributions included.
I guess the license just kind of feels "Bait and Switch" to me. It tries to get you to think that it's OSS and then smacks you with a "JK IT'S PROPRIETARY".
> If your employer is not providing its software open source, there is nothing shocking to have to pay for the software used
Sure, except in the case of Akka there's enough competition in the Java library world that I don't think that it's worth it. Vert.x is comparable, and the license is less likely to accidentally cost me lots of money.
I mostly think that Akka's licensing is way too expensive too, again especially when you consider that there's a good chunk of concurrency libraries in Java-land that have more business-friendly licenses.
TylerJewell 13 days ago [-]
I am the CEO of Akka, formerly Lightbend.
We did a long podcast and a couple of blogs that offered transparency into the rationale for why we moved from Apache to BSL, which still downgrades to Apache after 36 months. See Emily Omier for the specifics.
It came down to survival. The company faced a bankruptcy event as customers were using the software without contributing, and after exhausting alternatives we needed to change the license model to create a more sustainable approach.
The consequence of this choice was that there was less adoption from OSS and ISVs who need a flexible licensing model for embedding and redistribution. It also encouraged the Pekko fork which is a branch that is 2.5 years old. And that branch helped older projects and OSS distributions to maintain their position without financial consequences.
It is not cheap to maintain Akka, and after 15 years we have turned a profit, albeit barely. We are growing, finally, and have a prosperous future, and most of our spend goes into development. It did allow us to create Akka 3, which is a simpler model for devs within enterprises, mixed with a consumption-based model that should be significantly cheaper than the traditional libraries, and cheaper than the cost to adopt most any other framework. We can debate the merits of different business models, but we couldn't have maintained the 50 CVE fixes and created a modern version of Akka if we hadn't taken this step.
We need a better strategy on how to appeal to the OSS community once more. To appeal to startups and academics, we have free commercial licenses and subscriptions, which nearly 200 accounts have signed up for in the last 18 months.
bdangubic 13 days ago [-]
What would you say is the main difference between your and other products that do not use BSL?
Surely it is also not cheap to maintain Spring Framework either, no?
TylerJewell 13 days ago [-]
Well, Vert.x and Spring are maintained by RedHat and Broadcom. Both of those companies measure their profit and loss tied to their broader orchestration and platform sales (Kubernetes). They fund app dev frameworks only to the degree they can drive profitable adoption of their other commercial offerings. Broadcom, in particular, after the VMW acquisition has trimmed their staffing in areas that do not directly impact the Tanzu bottom line. Not all Vert.x and Spring customers need or desire that coupling, and so that poses an interesting dynamic that is different from us.
We are a pure-play app dev platform, and that gets to the heart of why the business model is different. I'd argue that we are very motivated to make sure that customers are successful with app dev, as that is our bottom line, whereas our rivals are financially incentivized by infrastructure sales, not app dev outcomes.
tombert 12 days ago [-]
Wow, thank you very much for your reply, especially for how polite it was when I was decidedly impolite. I sympathize with how hard it is to make money in the software world, and I know absolutely nothing about business so of course take whatever I say with a boulder of salt.
That said, and I realize that this is crass but it's also honest: Akka's profitability isn't my problem. When I am looking to import a library for my job, I try my best to weigh pros and cons of each (as we all do), and when I see a BUSL that's an immediate red flag; if Akka were the only cool concurrency library in the JVM world then I'd just put up with it, but when there are viable alternatives like Vert.x it's extremely hard to go to my employer and ask them to spend $5000/month + $0.15/Akka-hour [1], especially since we run thousands of individual JVMs, and running a comparable thing in Vert.x cost us nothing (albeit with having to do tech support ourselves). Whether or not it's "fair" that Vert.x is a pet project from Red Hat or VMWare and therefore doesn't have to worry about financing is sort of orthogonal to whether or not I choose it or Akka.
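(To put rough numbers on that: with an assumed fleet of 1,000 always-on JVMs, the metered part alone would be 1,000 instances × ~730 hours/month × $0.15 ≈ $109,500/month, before the $5,000/month base. The fleet size is made up for illustration, but it shows how quickly the meter runs at our scale.)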
This isn't meant to shit on Akka, it's very cool software, I'm just frustrated by the BUSL because it gives the illusion of an OSS license, the initial marketing around it looked like an OSS license, and I wasted about 15 hours writing some Akka code only to realize that I had to throw it away because there was no chance I was going to get my employer to approve a PR with BUSL-licensed libraries that would have cost us hundreds of thousands of dollars a year.
Again, apologies that this is rude, and if Akka/Lightbend/Typesafe is making a profit then of course all the best to you, but this is just my rationale.
Re-reading this, I apologize for how hostile I come off. You're not trying to sell me, you're just giving justification, which is fine even if I'm not a huge fan of the license.
ladyanita22 14 days ago [-]
Important to upvote this. If there's room for improvement for Polars (which I'm sure there is), go and support the project. But don't fall for a commercial trap when there are competent open source tools available.
binoct 14 days ago [-]
No shade to the juggernaut of the open source software movement and everything it has enabled and will enable, but why the hate for a project that required people's time and knowledge to create something useful to a segment of users, and that then expects to charge for use in the future? "Commercial trap" seems to imply this is some sort of evil machination, but it seems like they are being quite upfront with that language.
floatrock 14 days ago [-]
It's not hate for the project, it's hate for the deceptive rollout.
Basically it's a debate about how many dark patterns can you squeeze next to that "upfront language" before "marketing" slides into "bait-n-switch."
papichulo2023 14 days ago [-]
Not sure if it's evil or not, but it is unprofessional to use a tool when you don't know how much it will cost your company in the future.
maleldil 14 days ago [-]
While I agree, it's worth noting that this project is a drop-in replacement (they claim that, at least), whereas Polars has a very different API. I much prefer Polars's API, but it's still a non-trivial cost to switch to, which is why many people would rather explore drop-in Pandas alternatives.
BostonEnginerd 14 days ago [-]
I thought I saw on the documentation that it was released under the modified BSD license. I guess they could take future versions closed source, but the current version should be available for folks to use and further develop.
OutOfHere 14 days ago [-]
It's just the binary that's BSD, not the source code. The source code is unavailable.
ritchie46 13 days ago [-]
I don't trust their benchmarks. I ran their benchmark source locally on my machine, TPCH scale 10. Polars was orders of magnitude faster and didn't SIGABRT at query 10 (I wasn't OOM).
(.venv) [fireducks] ritchie46 /home/ritchie46/Downloads/deleteme/polars-tpch[SIGINT] $ SCALE_FACTOR=10.0 make run-polars
.venv/bin/python -m queries.polars
{"scale_factor":10.0,"paths":{"answers":"data/answers","tables":"data/tables","timings":"output/run","timings_filename":"timings.csv","plots":"output/plot"},"plot":{"show":false,"n_queries":7,"y_limit":null},"run":{"io_type":"parquet","log_timings":false,"show_results":false,"check_results":false,"polars_show_plan":false,"polars_eager":false,"polars_streaming":false,"modin_memory":8000000000,"spark_driver_memory":"2g","spark_executor_memory":"1g","spark_log_level":"ERROR","include_io":true},"dataset_base_dir":"data/tables/scale-10.0"}
Code block 'Run polars query 1' took: 1.47103 s
Code block 'Run polars query 2' took: 0.09870 s
Code block 'Run polars query 3' took: 0.53556 s
Code block 'Run polars query 4' took: 0.38394 s
Code block 'Run polars query 5' took: 0.69058 s
Code block 'Run polars query 6' took: 0.25951 s
Code block 'Run polars query 7' took: 0.79158 s
Code block 'Run polars query 8' took: 0.82241 s
Code block 'Run polars query 9' took: 1.67873 s
Code block 'Run polars query 10' took: 0.74836 s
Code block 'Run polars query 11' took: 0.18197 s
Code block 'Run polars query 12' took: 0.63084 s
Code block 'Run polars query 13' took: 1.26718 s
Code block 'Run polars query 14' took: 0.94258 s
Code block 'Run polars query 15' took: 0.97508 s
Code block 'Run polars query 16' took: 0.25226 s
Code block 'Run polars query 17' took: 2.21445 s
Code block 'Run polars query 18' took: 3.67558 s
Code block 'Run polars query 19' took: 1.77616 s
Code block 'Run polars query 20' took: 1.96116 s
Code block 'Run polars query 21' took: 6.76098 s
Code block 'Run polars query 22' took: 0.32596 s
Code block 'Overall execution of ALL polars queries' took: 34.74840 s
(.venv) [fireducks] ritchie46 /home/ritchie46/Downloads/deleteme/polars-tpch$ SCALE_FACTOR=10.0 make run-fireducks
.venv/bin/python -m queries.fireducks
{"scale_factor":10.0,"paths":{"answers":"data/answers","tables":"data/tables","timings":"output/run","timings_filename":"timings.csv","plots":"output/plot"},"plot":{"show":false,"n_queries":7,"y_limit":null},"run":{"io_type":"parquet","log_timings":false,"show_results":false,"check_results":false,"polars_show_plan":false,"polars_eager":false,"polars_streaming":false,"modin_memory":8000000000,"spark_driver_memory":"2g","spark_executor_memory":"1g","spark_log_level":"ERROR","include_io":true},"dataset_base_dir":"data/tables/scale-10.0"}
Code block 'Run fireducks query 1' took: 5.35801 s
Code block 'Run fireducks query 2' took: 8.51291 s
Code block 'Run fireducks query 3' took: 7.04319 s
Code block 'Run fireducks query 4' took: 19.60374 s
Code block 'Run fireducks query 5' took: 28.53868 s
Code block 'Run fireducks query 6' took: 4.86551 s
Code block 'Run fireducks query 7' took: 28.03717 s
Code block 'Run fireducks query 8' took: 52.17197 s
Code block 'Run fireducks query 9' took: 58.59863 s
terminate called after throwing an instance of 'std::length_error'
what(): vector::_M_default_append
Code block 'Overall execution of ALL fireducks queries' took: 249.06256 s
Traceback (most recent call last):
File "/home/ritchie46/miniconda3/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/ritchie46/miniconda3/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/home/ritchie46/Downloads/deleteme/polars-tpch/queries/fireducks/__main__.py", line 39, in <module>
execute_all("fireducks")
File "/home/ritchie46/Downloads/deleteme/polars-tpch/queries/fireducks/__main__.py", line 22, in execute_all
run(
File "/home/ritchie46/miniconda3/lib/python3.10/subprocess.py", line 526, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['/home/ritchie46/Downloads/deleteme/polars-tpch/.venv/bin/python', '-m', 'fireducks.imhook', 'queries/fireducks/q10.py']' died with <Signals.SIGABRT: 6>.
make: *** [Makefile:52: run-fireducks] Error 1
(.venv) [fireducks] ritchie46 /home/ritchie46/Downloads/deleteme/polars-tpch[2] $
mushufasa 14 days ago [-]
If it's good, then why not just fork it when (if) the license changes? It is 3-clause BSD.
In fact, what's stopping the pandas library from incorporating fireducks code into the mainline branch? pandas itself is BSD.
nicce 14 days ago [-]
There is no code. The binary blob is licensed.
rich_sasha 14 days ago [-]
It's a bit sad for me. I find the biggest issue for me with pandas is the API, not the speed.
So many foot guns, poorly thought through functions, 10s of keyword arguments instead of good abstractions, 1d and 2d structures being totally different objects (and no higher-order structures). I'd take 50% of the speed for a better API.
I looked at Polars, which looks neat, but seems made for a different purpose (data pipelines rather than building models semi-interactively).
To be clear, this library might be great, it's just a shame for me that there seems no effort to make a Pandas-like thing with better API. Maybe time to roll up my sleeves...
stared 14 days ago [-]
Yes, every time I write df[df.sth == val], a tiny part of me dies.
For comparison, dplyr offers a lot of elegant functionality, and the functional approach in Pandas often feels like an afterthought. If R is cleaner than Python, that tells you a lot (as a side note: the same story goes for ggplot2 and matplotlib).
Another surprise for friends coming from non-Python backgrounds is the lack of column-level type enforcement. You write df.loc[:, "col1"] and hope it works, with all checks happening at runtime. It would be amazing if Pandas integrated something like Pydantic out of the box.
I still remember when Pandas first came out—it was fantastic to have a tool that replaced hand-rolled data structures using NumPy arrays and column metadata. But that was quite a while ago, and the ecosystem has evolved rapidly since then, including Python’s gradual shift toward type checking.
oreilles 14 days ago [-]
> Yes, every time I write df[df.sth == val], a tiny part of me dies.
That's because it's a bad way to use Pandas, even though it is the most popular and often times recommended way. But the thing is, you can just write "safe" immutable Pandas code with method chaining and lambda expressions, resulting in very Polars-like code. For example:
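A minimal sketch of the style (file and column names are made up):

  import pandas as pd

  result = (
      pd.read_csv("sales.csv")
      .assign(margin=lambda d: d["revenue"] - d["cost"])   # derive a column
      .loc[lambda d: d["margin"] > 0]                      # filter on the current frame
      .groupby("region", as_index=False)["margin"]
      .sum()
  )

No mutation, no intermediate variables, and each step sees the frame produced by the previous one.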
Plus nowadays with the latest Pandas versions supporting Arrow datatypes, Polars performance improvements over Pandas are considerably less impressive.
Column-level name checking would be awesome, but unfortunately no python library supports that, and it will likely never be possible unless some big changes are made in the Python type hint system.
wodenokoto 14 days ago [-]
I’m not really sure why you think
.loc[lambda d: d["y"] > 0.5]
Is stylistically superior to
[df.y > 0.5]
I agree it comes in handy quite often, but that still doesn't make it great to write compared to what SQL or dplyr offers in terms of choosing columns to filter on (`where y > 0.5` for SQL, and `filter(y > 0.5)` for dplyr).
oreilles 14 days ago [-]
It is superior because you don't need to assign your dataframe to a variable ('df'), then update that variable or create a new one every time you need to do that operation. Which means it is both safer (you're guaranteed to filter on the current version of the dataframe) and more concise.
For the rest of your comment: it's the best you can do in python. Sure, you could write SQL, but then you're mixing text queries with python data manipulation, and I would dread that. And SQL-only scripting is really out of the question.
chaps 14 days ago [-]
Eh, SQL and python can still work together very well where SQL takes the place of pandas. Doing things in waves/batch helps.
Big problem with pandas is that you still have to load the dataframe into memory to work with it. My data's too big for that and postgres makes that problem go away almost entirely.
__mharrison__ 14 days ago [-]
It's superior because it is safer, not because the API (or the requirement to use a lambda) looks better. The lambda allows the operation to work on the current state of the dataframe in the chained operation rather than the original dataframe. Alternatively, you could use .query("y > 0.5"), which also works on the current state of the dataframe.
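For instance (an illustrative sketch, assuming df has an "x" column):

  out = (
      df.assign(y=lambda d: d["x"] * 2)
        .loc[lambda d: d["y"] > 0.5]   # filters the frame that has the new "y"
  )
  # df[df.y > 0.5] at this point would consult the original df,
  # which doesn't even have a "y" column yet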
(I'm the first to complain about the many warts in Pandas. Have written multiple books about it. This is annoying, but it is much better than [df.y > 0.5].)
OutOfHere 14 days ago [-]
Using `lambda` without care is dangerous because it risks being not vectorized at all. It risks being super slow, operating one row at a time. Is `d` a single row or the entire series or the entire dataframe?
rogue7 14 days ago [-]
In this case `d` is the entire dataframe. It's just a way of "piping" the object without having to rename it.
You are probably thinking about `df.apply(lambda row: ..., axis=1)` which operates on each row at a time and is indeed very slow since it's not vectorized. Here this is different and vectorized.
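Roughly (made-up column names):

  df.loc[lambda d: d["y"] > 0.5]                # d is the whole frame; one vectorized comparison

  df.apply(lambda row: row["y"] > 0.5, axis=1)  # called once per row; this is the slow pattern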
almostkorean 14 days ago [-]
Appreciate the explanation, this is something I should know by now but don't
OutOfHere 14 days ago [-]
That's excellent.
rogue7 14 days ago [-]
Agreed 100%. I am using this method-chaining style all the time and it works like a charm.
moomin 14 days ago [-]
I mean, yes, there are arrow data types, but it's got a long way to go before it reaches full parity with the numpy version.
doctorpangloss 14 days ago [-]
All I want is for the IDE and Python to correctly infer types and column names for all of these array objects. 99% of the pain for me is navigating around SQL return values and CSVs as pieces of text instead of code.
bdjsiqoocwk 13 days ago [-]
Nonsense: if you understand why df[df.sh == val] works, you'll see it's great. If you don't, you can also do df.query("sh == val").
stared 12 days ago [-]
If you type df[df2.sh == val] you will understand why it is not great.
bdjsiqoocwk 12 days ago [-]
That might or might not make sense depending on what df and df2 contain.
But what are you saying: that if you type the wrong thing, you might get wrong results? Yes, coding is like that.
What's your point? Make a point.
movpasd 14 days ago [-]
I started using Polars for the "rapid iteration" usecase you describe, in notebooks and such, and haven't looked back — there are a few ergonomic wrinkles that I mostly attribute to the newness of the library, but I found that polars forces me to structure my thought process and ask myself "what am I actually trying to do here?".
I find I basically never write myself into a corner with initially expedient but ultimately awkward data structures like I often did with pandas, the expression API makes the semantics a lot clearer, and I don't have to "guess" the API nearly as much.
So even for this usecase, I would recommend trying out polars for anyone reading this and seeing how it feels after the initial learning phase is over.
ljosifov 14 days ago [-]
+1 Seconding this. My limited experience with pandas had a non-trivial number of moments of "?? Is it really like this? Nah, I'm mistaken for sure, this can not be, no one would do something insane like that". And yet and yet... Fwiw I've found that numpy is a must (ofc), but pandas is mostly optional. So I stick to numpy for my own writing, and keep pandas read-only (I just execute someone else's).
Have you tried polars? It’s a much more regular syntax. The regular syntax fits well with the lazy execution. It’s very composable for programmatically building queries. And then it’s super fast
bionhoward 14 days ago [-]
I found the biggest benefit of polars is ironically the loss of the thing I thought I would miss most, the index; with pandas there are columns, indices, and multi-indices, whereas with polars, everything is a column, it’s all the same so you can delete a lot of conditionals.
However, I still find myself using pandas for the timestamps, timedeltas, and date offsets. And even then, I need a whole extra column just to hold time zones: since polars maps everything to a UTC storage zone, you lose the origin/local TZ, which screws up heterogeneous time zone datasets. (And I learned you really need to enforce careful, manual, thoughtful consideration of time zone replacement vs offsetting at the API level.)
Had to write a ton of code to deal with this, I wish polars had explicit separation of local vs storage zones on the Datetime data type
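The workaround looks something like this (a sketch, assuming a tz-aware "ts" column; the zone name is made up):

  import polars as pl

  df = df.with_columns(
      pl.col("ts").dt.convert_time_zone("UTC").alias("ts_utc"),  # storage zone
      pl.lit("America/New_York").alias("tz"),                    # origin zone, tracked by hand
  )
  # converting back means handling each zone group separately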
paddy_m 14 days ago [-]
I think pandas was so ambitious syntax-wise and concept-wise, but it got to be a bit of a jumble. The index idea in particular is so cool, particularly multi-indexes; watching people who really understand it do multi-index operations is very cool.
IMO Polars sets a different goal: what's the most pandas-like thing we can build that is fast (and leaves open the possibility for more optimization) and clean?
Polars feels like you are obviously manipulating an advanced query engine. Pandas feels like manipulating a squishy data structure that should be super useful and friendly, but sometimes does something dumb and slow.
amelius 14 days ago [-]
Yes. Pandas turns 10x developers into .1x developers.
berkes 14 days ago [-]
It does to me. Well, a 1x developer into a .01x dev in my case.
My conclusion was that pandas is not for developers. But for one-offs by managers, data-scientists, scientists, and so on. And maybe for "hackers" who kludge together stuff 'till it works and then hopefully never touch it.
Which made me realize such thoughts can come across as smug, patronizing or belittling. But they do show how software can be optimized for different use-cases.
The danger then lies in not recognizing these use-cases when you pull in something like pandas. "Maybe using pandas to map and reduce the CSVs that our users upload into insert batches isn't a good idea at all."
This is often worsened by the tools/platforms/libs' devs or communities not advertising these sweet spots and limitations. Not in the case of Pandas though: it's really clear about this not being a lib or framework for devs, but a tool(kit) to do data analysis with. Kudos for that.
analog31 14 days ago [-]
I'm one of those people myself, and have whittled my Pandas use down to displaying pretty tables in Jupyter. Everything else I do in straight Numpy.
theLiminator 14 days ago [-]
Imo numpy is not better than pandas for the things you'd use pandas for, though polars is far superior.
fastasucan 14 days ago [-]
> My conclusion was that pandas is not for developers. But for one-offs by managers, data-scientists, scientists, and so on. And maybe for "hackers" who kludge together stuff 'till it works and then hopefully never touch it.
It doesn't work for me so it can't work for anyone?
berkes 12 days ago [-]
No. "It doesn't work for me. Why is that?" "well, turns out Panda's has a clear and well-defined use-case. So using it outside that use-case will bring problems, pain, friction, etc. Using it for something its not intended for, is why it doesn't work for me"
sega_sai 14 days ago [-]
Great point that I completely share. I tend to avoid pandas at all costs except for very simple things, as I have been bitten by many issues related to indexing. For anything complicated I tend to switch to duckdb instead.
bravura 14 days ago [-]
Can you explain your use-case and why DuckDB is better?
Considering switching from pandas and want to understand what is my best bet. I am just processing feature vectors that are too large for memory, and need an initial simple JOIN to aggregate them.
rapatel0 14 days ago [-]
Look into [Ibis](https://ibis-project.org/). It's a dataframe library built on duckdb. It supports lazy execution, larger-than-memory data structures, and remote S3 data, and it is insanely fast. It also works with basically any backend (postgres, mysql, parquet/csv files, etc), though there are some implementation gaps in places.
I previously had a pandas+sklearn transformation stack that would take up to 8 hours. Converted it to ibis and it executes in about 4 minutes now and doesn't fill up RAM.
It's not a perfect apples to apples pandas replacement but really a nice layer on top of sql. after learning it, I'm almost as fast as I was on pandas with expressions.
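To give a flavor (paths and column names are placeholders):

  import ibis

  con = ibis.duckdb.connect()                  # in-process DuckDB backend
  t = con.read_parquet("features/*.parquet")   # larger-than-memory is fine
  expr = (
      t.filter(t.score > 0.5)
       .group_by("user_id")
       .aggregate(mean_score=t.score.mean())
  )
  df = expr.to_pandas()                        # the lazily built query executes here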
techwizrd 14 days ago [-]
I made the switch to Ibis a few months ago and have been really enjoying it. It works with all the plotting libraries, including seaborn and plotnine. And it makes switching from testing on a CSV to running on a SQL/Spark backend a one-line change. It's just really handy for analysis (similar to the tidyverse).
sega_sai 14 days ago [-]
I am not necessarily saying duckdb is better. I personally just found it easier and clearer to write a SQL query for any complicated set of joins/group-by processing than to try to do that in pandas.
martinsmit 14 days ago [-]
Check out redframes[1] which provides a dplyr-like syntax and is fully interoperable with pandas.
Building on top of Pandas feels like you're only escaping part of the problems. In addition to the API, the datatypes in Pandas are a mess, with multiple confusing (and none of them good) options for e.g. dates/datetimes. Does redframes do anything there?
h14h 14 days ago [-]
If you wanna try a different API, take a look at Elixir Explorer:
It runs on top of Polars so you get those speed gains, but uses the Elixir programming language. This gives the benefit of a simple functional syntax w/ pipelines & whatnot.
It also benefits from the excellent Livebook (a Jupyter alternative specific to Elixir) ecosystem, which provides all kinds of benefits.
faizshah 14 days ago [-]
Pandas is a commonly known DSL at this point, so lots of data scientists know pandas like the back of their hand, and that's why a lot of "pandas but for X" libraries have become popular.
I agree that pandas does not have the best-designed API in comparison to, say, dplyr, but it also has a lot of functionality like pivot, melt, and unstack that is often not implemented by other libraries. It's also existed for more than a decade at this point, so there's a plethora of resources and stackoverflow questions.
On top of that, these days I just use ChatGPT to generate some of my pandas tasks. ChatGPT and other coding assistants know pandas really well so it’s super easy.
But I think if you get to know Pandas, after a while you just learn all the weird quirks, and you gain huge benefits from all the things it can do and all the other libraries you can use with it.
rich_sasha 14 days ago [-]
I've been living in the shadow of pandas for about a decade now, and the only thing I learned is to avoid using it.
I 100% agree that pandas addresses all the pain points of data analysis in the wild, and this is precisely why it is so popular. My point is, it doesn't address them well. It seems like a conglomerate of special cases, written for a specific problem its author was facing, with little concern for consistency, generality or other use cases that might arise.
In my usage, any time saved by its (very useful) methods tends to be lost on fixing subtle bugs introduced by strange pandas behaviours.
In my use cases, I reindex the data using pandas and get it to numpy arrays as soon as I can, and work with those, with a small library of utilities I wrote over the years. I'd gladly use a "sane pandas" instead.
specproc 14 days ago [-]
Aye, but we've learned it, we've got code bases written in it, many of us are much more data kids than "real devs".
I get it doesn't follow best practices, but it does do what it needs to. Speed has been an issue, and it's exciting seeing that problem being solved.
Interesting to see so many people recently saying "polars looks great, but no way I'll rewrite". This library seems to give a lot of people, myself included, exactly what we want. I look forward to trying it.
omnicognate 14 days ago [-]
What about the polars API doesn't work well for your use case?
short_sells_poo 14 days ago [-]
Polars is missing a crucial feature for replacing pandas in Finance: first class timeseries handling. Pandas allows me to easily do algebra on timeseries. I can easily resample data with the resample(...) method, I can reason about the index frequency, I can do algebra between timeseries, etc.
You can do the same with Polars, but you have to start messing about with datetimes and convert the simple problem "I want to calculate a monthly sum anchored on the last business day of the month" to SQL-like operations.
Pandas grew a large and obtuse API because it provides specialized functions for 99% of the tasks one needs to do on timeseries. If I want to calculate an exponential weighted covariance between two time series, I can trivially do this with pandas: series1.ewm(...).cov(series2). I welcome people to try and do this with Polars. It'll be a horrible and barely readable contraption.
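Concretely, the pandas version is a one-liner (the halflife parameter is illustrative):

  ewm_cov = series1.ewm(halflife=20).cov(series2)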
YC is mostly populated by technologists, and technologists are often completely ignorant about what makes pandas useful and popular. It was built by quants/scientists, for doing (interactive) research. In this respect it is similar to R, which is not a language well liked by technologists, but it is (surprise) deeply loved by many scientists.
n8henrie 14 days ago [-]
I don't know what exponential weighted covariance is, but I've had pretty good luck converting time series-based analyses from pandas to polars (for patient presentations to my emergency department -- patients per hour, per day, per shift, etc.). Resample has a direct (and easier IMO) replacement in polars, and there is group_by_dynamic.
I've had trouble determining whether one timestamp falls between two others across tens of thousands of rows (with the polars team suggesting I use a massive cross product and filter, which worked, if you exclude the memory requirement), whereas in pandas I was able to sort the timestamps and thereby only needed to compare against the preceding/following few based on the index of the last match.
The other issue I've had with resampling is with polars automatically dropping time periods with zero events, giving me a null instead of zero for the count of events in certain time periods (which then gets dropped from aggregations). This has caught me a few times.
But other than that I've had good luck.
short_sells_poo 14 days ago [-]
I'm curious how is polars group_by_dynamic easier than resample in pandas. In pandas if I want to resample to a monthly frequency anchored to the last business day of the month, I'd write:
> my_df.resample("BME").apply(...)
Done. I don't think it gets any easier than this. Every time I tried something similar with polars, I got bogged down in calendar treatment hell and large and obscure SQL like contraptions.
Edit: original tone was unintentionally combative - apologies.
n8henrie 13 days ago [-]
Totally fair. And thank you for the rewording (sincerely). I haven't used polars for anything business or finance related, so this is likely one of many blind spots for me.
Reviewing my work, I only needed an hourly aggregation, which was similarly easy in polars and pandas (I misspoke about it being easier); what I found easier was grouping by time data that wasn't amenable to `resample`.
In polars I had no problems using a regular group_by with a pl.col.dt object, whereas in pandas I remember struggling to do so, even though it seemed straightforward.
Sorry, I wish I could remember more details; this was probably 5 years ago that I was writing the pandas code and just converted it to polars about a year ago, so it's possible that I just got better at python in the meantime (though I was writing much more python back then). And of course a rewrite is likely to feel easier the second time.
The other confounding issue is that the eager pandas code crashed with OOM regularly and took several minutes to run, whereas polars handles it very well (which I'm sure to some degree is it optimizing things that I could have done manually), but this made iterating on this codebase feel much less onerous.
"""
Calculate monthly sums anchored to the last business day of each month
Parameters:
df: DataFrame with dates and values
date_column: name of date column
value_column: name of value column to sum
Returns:
DataFrame with sums anchored to last business day
"""
# Ensure date column is datetime
df[date_column] = pd.to_datetime(df[date_column])
# Group by end of business month and sum
monthly_sum = df.groupby(pd.Grouper(
key=date_column,
freq='BME' # Business Month End frequency
))[value_column].sum().reset_index()
return monthly_sum
It's actually much simpler than that. Assuming the index of the dataframe DF is composed of timestamps (which is normal for timeseries):
df.resample("BME").sum()
Done. One line of code and it is quite obvious what it is doing - with perhaps the small exception of BME, but if you want max readability you could do:
df.resample(pd.offsets.BusinessMonthEnd()).sum()
This is why people use pandas.
short_sells_poo 14 days ago [-]
Answered the child comment but let me copy paste here too. It's literally one (short) line:
> df.resample("BME").sum()
Assuming `df` is a dataframe (ie table) indexed by a timestamp index, which is usual for timeseries analysis.
"BME" stands for BusinessMonthEnd, which you can type out if you want the code to be easier to read by someone not familiar with pandas.
ies7 14 days ago [-]
This one-liner example is one of the reasons why some people use pandas and some people despise it.
It's so easy for my analyst team because they use it daily, but my developers would probably never have thought of/known about BME and would have decided to implement the code again.
tomrod 14 days ago [-]
A bit from memory as in transit, but something like df.groupby(df[date_col]+pd.offsets.MonthEnd(0))[agg_col].sum()
bobbylarrybobby 13 days ago [-]
Is LazyFrame.group_by_dynamic not basically the same thing?
otsaloma 14 days ago [-]
Agreed, never had a problem with the speed of anything NumPy or Arrow based.
Planning to switch to NumPy 2.0 strings soon. Other than that I feel all the basic operations are fine and solid.
Note for anyone else rolling up their sleeves: You can get quite far with pure Python when building on top of NumPy (or maybe Arrow). The only thing I found needing more performance was group-by-aggregate, where Numba seems to work OK, although a bit difficult as a dependency.
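The Numba route looks something like this (a sketch, assuming groups are pre-coded as integers 0..n_groups-1):

  import numba
  import numpy as np

  @numba.njit
  def group_sum(group_ids, values, n_groups):
      # one pass over the data, no per-group intermediate allocations
      out = np.zeros(n_groups)
      for i in range(values.shape[0]):
          out[group_ids[i]] += values[i]
      return out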
epistasis 14 days ago [-]
Have you examined siuba at all? It promises to be more similar to the R tidyverse, which IMHO has a much better API. And I personally prefer dplyr/tidyverse to Polars for exploratory analysis.
I have not yet used siuba, but would be interested in others' opinions. The activation energy to learn a new set of tools is so large that I rarely have the time to fully examine this space...
otsaloma 14 days ago [-]
I think the choice of using functions instead of classes + methods doesn't really fit well into Python. Either you need to do a huge amount of imports or use the awful `from siuba import *`. This feels like shoehorning the dplyr syntax into Python when method chaining would be more natural and would still retain the idea.
Also, having (already a while ago) looked at the implementation of the magic `_` object, it seemed like an awful hack that will serve only a part of use cases. Maybe someone can correct me if I'm wrong, but I get the impression you can do e.g. `summarize(x=_.x.mean())` but not `summarize(x=median(_.x))`. I'm guessing you don't get autocompletion in your editor or useful error messages, and it can then get painful using this kind of magic.
Bootvis 14 days ago [-]
The lack of non-standard evaluation still forces you to write `_.`, so this might be a better Pandas but not a better tidyverse.
A pity their comparisons don't include tidyverse or R's data.table. I think R would look simpler, but as it stands it remains unclear.
kussenverboten 14 days ago [-]
Agree with this. My favorite syntax is the elegance of data.table API in R. This should be possible in Python too someday.
te_chris 14 days ago [-]
Pandas' best feature for me is the df format being readable by duckdb. The filtering API is a nightmare.
fluorinerocket 14 days ago [-]
Thank you, I don't know why people think it's so amazing. Sometimes I end up just extracting the numpy arrays from the data frame and doing things the way I know how, because the pandas way is so difficult.
I fell on dark days when they changed the multiindex reference level=N, which worked perfectly, was so logical, and could be passed alongside the axis; it was swapped out in favor of a separate groupby call.
wodenokoto 14 days ago [-]
In that case I'd recommend dplyr in R. It also integrates with a better plotting library, ggplot2, which not only gives you a better API than matplotlib but also prettier plots (unless you really get to work at your matplotlib code).
adolph 14 days ago [-]
> So many foot guns, poorly thought through functions, 10s of keyword arguments instead of good abstractions
Yeah, Pandas has that early PHP feel to it, probably out of being a successful first mover.
nathan_compton 14 days ago [-]
Yeah. Pandas is the worst.
Polars is better in some ways but so verbose!
Kalanos 14 days ago [-]
The pandas API makes a lot more sense if you are familiar with numpy.
Writing pandas code is a bit redundant. So what?
Who is to say that fireducks won't make their own API?
omnicognate 14 days ago [-]
> Then came along Polars (written in Rust, btw!) which shook the ground of Python ecosystem due to its speed and efficiency
Polars rocked my world by having a sane API, not by being fast. I can see the value in this approach if, like the author, you have a large amount of pandas code you don't want to rewrite, but personally I'm extremely glad to be leaving the pandas API behind.
ralegh 14 days ago [-]
I personally found the polars API much clunkier, especially for rapid prototyping. I use it only for cemented processes where I could do with a speedup/memory reduction.
Is there anything specific you prefer moving from the pandas API to polars?
benrutter 14 days ago [-]
Not OP but the ability to natively implement complex groupby logic is a huge plus for me at least.
Say you want to take an aggregation like "the mean of all values over the 75th percentile" alongside a few other aggregations. In pandas, this means you're gonna be in for a bunch of hoops and messing around with stuff, because you can't express it via the API. Polars' API lets you express this directly, without having to implement any kind of workaround.
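For example, something like this (column names assumed):

  import polars as pl

  df.group_by("g").agg(
      mean_over_p75=pl.col("x").filter(pl.col("x") > pl.col("x").quantile(0.75)).mean(),
      mean=pl.col("x").mean(),
  )

The filter-inside-agg expression is the part pandas can't express directly.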
> FireDucks is not a open source library at this moment.
> You can get it installed freely using pip and use under BSD-3 license and of course can look into the python part of the source code.
I don't understand what it means. It looks like a contradiction. Does it have a BSD-3 licence or not?
abcalphabet 14 days ago [-]
From the above link:
> While the wheel packages are available at https://pypi.org/project/fireducks/#files, and while they do contain Python files, most of the magic happens inside a (BSD-3-licensed) shared object library, for which source code is not provided.
_flux 14 days ago [-]
They provide BSD-3-licensed Python files but the interesting bit happens in the shared object library, which is only provided in binary form (but is also BSD-3-licensed it seems, so you can distribute it freely).
joshuaissac 14 days ago [-]
Since it is under the BSD 3 licence, users would also be permitted to decompile and modify the shared object under the licence terms.
jlokier 14 days ago [-]
Nice insight!
sampo 14 days ago [-]
BSD license gives you the permission to use and to redistribute. In this case you may use and redistribute the binaries.
Edit: To use, redistribute, modify, and distribute modified versions.
japhyr 14 days ago [-]
"Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met..."
Such a crazy distortion of the meaning of the license.
Imagine being like "the project is GPL - just the compiled machine code".
PittleyDunkin 14 days ago [-]
This is pretty common for binary blobs where the source code has been lost.
Y_Y 14 days ago [-]
Wouldn't it be nice if GitHub was just for source code, and you couldn't just slap up a README that's an ad for some proprietary shitware with a vague promise of source some day in the glorious future?
diggan 14 days ago [-]
> Wouldn't it be nice if GitHub was just for source code
GitHub has always been a platform for "We love to host FOSS but we won't be 100% FOSS ourselves", so it makes sense they allow that kind of usage for others too.
I think what you want, is something like Codeberg instead, which is explicitly for FOSS and 100% FOSS themselves.
rad_gruchalski 14 days ago [-]
You'd slap that in a comment then?
thecopy 14 days ago [-]
>proprietary shitware
Is this shitware? It seems to be very high quality code
yupyupyups 14 days ago [-]
I think the anger comes from the fact that we expect Github repositories to host the actual source code and not be a dead-end with a single README.md file.
ori_b 14 days ago [-]
How can you tell?
sbarre 14 days ago [-]
I mean, based on the claims and the benchmarks, it seems to provide massive speedups to a very popular tool.
How would you define "quality" in this context?
echoangle 14 days ago [-]
High quality code isn't just code that performs well when executed, but also is readable, understandable and maintainable. You can't judge code quality by looking at the compiled result, just because it works well.
sbarre 14 days ago [-]
That's certainly one opinion about it.
One could also say that quality is related to the functional output.
echoangle 14 days ago [-]
> One could also say that quality is related to the functional output.
Right, I said nothing that contradicts that ("High quality code isn't just code that performs well when executed, but also ..."). High quality functional output is a necessary requirement, but it isn't sufficient to determine if code is high quality.
sbarre 14 days ago [-]
Sure, I guess it depends on what matters to you or to your evaluation criteria.
My point was that it's all subjective in the end.
echoangle 14 days ago [-]
It's not really subjective if you're at all reasonable about it.
Imagine writing a very good program, running it through an obfuscator, and throwing away the original code. Is the obfuscated code "high quality code" now, because the output of the compilation still works as before?
sbarre 10 days ago [-]
Again it depends what you mean by "high quality code".
Do you mean how well it was written, or do you mean how well it performs? Or do both matter? Equally, or one more/less than the other?
It probably depends on whether you're the developer taking over the codebase, or the customer running the code in production..
Take video games. A lot of it is messy spaghetti C++ code, not modular or well structured, full of hacks and manual optimizations, to give the best possible performance on available hardware.
It might be impossible to parse or maintain, but it does the job about as well as possible, which is really all that matters to the end user. I would call that high quality code.
So again, subjective...
ori_b 14 days ago [-]
Written so that it's easy to maintain, well tested, correct in its handling of edge cases, easy to debug, and easy to iterate on.
imranq 14 days ago [-]
This presentation does a good job distilling why FireDucks is so fast:
* rewriting base pandas functions like dropna in c++
* in-built compiler to remove unused code
Pretty impressive, especially given that you just import fireducks.pandas as pd instead of import pandas as pd and you are good to go.
However, I think if you are using a pandas function that wasn't rewritten, you might not see the speedups.
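i.e. the migration really is just the import line; everything below it is ordinary pandas code (file and column names made up):

  import fireducks.pandas as pd   # was: import pandas as pd

  df = pd.read_csv("data.csv")
  print(df.groupby("key")["value"].mean())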
faizshah 14 days ago [-]
It’s not clear to me why this would be faster than polars, duckdb, vaex or clickhouse. They seem to be taking the same approach of multithreading, optimizing the plan, using arrow, optimizing the core functions like group by.
maleldil 14 days ago [-]
None of those are drop-in replacements for Pandas. The main draw is "faster without changing your code".
faizshah 14 days ago [-]
I'm asking more about what techniques they used to get the performance improvements in the slides.
They are showing a 20-30% improvement over Polars, Clickhouse and Duckdb. But those 3 tools are SOTA in this area and generally rank near each other in every benchmark.
So 20-30% improvement over that cluster makes me interested to know what techniques they are using to achieve that over their peers.
mettamage 14 days ago [-]
Maybe it isn’t? Maybe they just want a fast pandas api?
geysersam 14 days ago [-]
According to their benchmarks they are faster. Not by a lot, but still significantly.
ayhanfuat 14 days ago [-]
In its essence it is a commercial product which has a free trial.
> Future Plans
By providing the beta version of FireDucks free of charge and enabling data scientists to actually use it, NEC will work to improve its functionality while verifying its effectiveness, with the aim of commercializing it within FY2024.
graemep 14 days ago [-]
It's BSD licensed. They do not say what the plans are, but most likely a proprietary version with added support or features.
ori_b 14 days ago [-]
It's a BSD licensed binary blob. There's no code provided.
graemep 14 days ago [-]
Wow! That is so weird.
It's freeware under an open source license. Really misleading.
It looks like something you should stay away from unless you need it REALLY badly. It's a proprietary product with unknown pricing and no indication of what their plans are.
Does the fact that the binary is BSD licensed allow reverse-engineering?
captn3m0 14 days ago [-]
> Redistribution and use in source and binary forms, with or without modification, are permitted
Reversing and re-compiling should count as modification?
ayhanfuat 14 days ago [-]
They say the source code for the part “where the magic happens” is not available so I am not sure what BSD implies there.
HelloNurse 14 days ago [-]
It serves as a ninja's smoke bomb until the "BSD" binary blob is suddenly obsoleted by a proprietary binary blob.
safgasCVS 14 days ago [-]
I'm sad that R's tidy syntax is not copied more widely in the python world. Dplyr is incredibly intuitive; most don't ever bother reading the instructions, since you can look at a handful of examples and you've got the gist of it. Polars, despite its speed, is still verbose and inconsistent, while pandas is seemingly a collection of random spells.
dr_kiszonka 13 days ago [-]
I don't know why, but it seems pipes are hard to do in Python without extra parentheses, etc.
__mharrison__ 14 days ago [-]
Lots of Pandas hate in this thread. However, for folks with lots of lines of Pandas in production, Fireducks can be a lifesaver.
I've had the chance to play with it on some of my code; queries that ran in 8+ minutes came down to 20 seconds.
Re-writing in Polars involves more code changes.
However, with Pandas 2.2+ and arrow, you can use .pipe to move data to Polars, run the slow computation there, and then zero copy back to Pandas. Like so...
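Something in this spirit (a sketch; `slow_part` stands in for whatever the expensive query is):

  import pandas as pd
  import polars as pl

  def slow_part(pldf: pl.DataFrame) -> pl.DataFrame:
      # placeholder for the heavy computation
      return pldf.group_by("key").agg(pl.col("value").sum())

  result = (
      df.pipe(pl.from_pandas)   # pandas -> Polars, cheap with Arrow-backed dtypes
        .pipe(slow_part)        # do the heavy lifting in Polars
        .to_pandas()            # and back to pandas
  )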
Looks very cool, BUT: it's closed source? That's an immediate deal breaker for me as a quant. I'm happy to pay for my tools, but not being able to look and modify the source code of a crucial library like this makes it a non-starter.
flakiness 14 days ago [-]
> FireDucks is released on pypi.org under the 3-Clause BSD License (the Modified BSD License).
Where can I find the code? I don't see it on GitHub.
> contact@fireducks.jp.nec.com
So it's from NEC (a major Japanese computer company), presumably a research artifact?
Setting aside complaints about the Pandas API, it's frustrating that we might see the community of a popular "standard" tool fragment into two or even three ecosystems (for libraries with slightly incompatible APIs) -- seemingly all with the value proposition of "making it faster". Based on the machine learning experience over the last decade, this kind of churn in tooling is somewhat exhausting.
I wonder how much of this is fundamental to the common approach of writing libraries in Python with the processing-heavy parts delegated to C/C++ -- that the expressive parts cannot be fast and the fast parts cannot be expressive. Also, whether Rust (for polars, and other newer generation of libraries) changes this tradeoff substantially enough.
tgtweak 14 days ago [-]
I think it's a natural path of software life that compatibility often stands in the way of improving the API.
This really does seem like a rare thing: everything speeds up without breaking compatibility. If you want a fast revised API for your new project (or to rework your existing one), you have a solution for that with Polars. If you just want your existing code/workloads to work faster, you have a solution for that now.
It's OK to have a slow, compatible, static codebase to build things on, then optimize as needed.
Trying to "fix" the api would break a ton of existing code, including existing plugins. Orphaning those projects and codebases would be the wrong move, those things take a decade to flesh out.
This really doesn't seem like the worst outcome, and doesn't seem to be creating a huge fragmented mess.
SiempreViernes 14 days ago [-]
> Based on the machine learning experience over the last decade, this kind of churn in tooling is somewhat exhausting.
Don't come to old web-devs with those complaints; every single one of them had to write at least one open source javascript library just to create their linkedin account!
viraptor 14 days ago [-]
> 100% compatibility with existing Pandas code: check.
Is it actually? Do people see that level of compatibility in practice?
Lots of people have mentioned Polars' sane API as the main reason to favor it, but the other crucial reason for us is that it's based on Apache Arrow. That allows us to use it where it's the best tool and then switch to whatever else we need when it isn't.
DonHopkins 14 days ago [-]
FireDucks FAQ:
Q: Why do ducks have big flat feet?
A: So they can stomp out forest fires.
Q: Why do elephants have big flat feet?
A: So they can stomp out flaming ducks.
adrian17 14 days ago [-]
Any explanation what makes it faster than pandas and polars would be nice (at least something more concrete than "leverage the C engine").
My easy guess is that compared to pandas, it's multi-threaded by default, which makes for an easy perf win. But even then, 130-200x feels extreme for a simple sum/mean benchmark. I see they are also doing lazy evaluation and some MLIR/LLVM based JIT work, which is probably enough to get an edge over polars; though its wins over DuckDB _and_ Clickhouse are also surprising out of nowhere.
Also, I thought one of the reasons for Polars's API was that Pandas API is way harder to retrofit lazy evaluation to, so I'm curious how they did that.
breakds 14 days ago [-]
I understand `pandas` is widely used in finance and quantitative trading, but it does not seem to be the best fit especially when you want your research code to be quickly ported to production.
We found `numpy` and `jax` to be a good trade-off between "too high level to optimize" and "too low level to understand". Therefore in our hedge fund we just build data structures and helper functions on top of them. The downside of the above combination is on sparse data, for which we call wrapped c++/rust code in python.
uptownfunk 14 days ago [-]
If they could just make a dplyr for py it would be so awesome. But sadly I don’t think the python language semantics will support such a tool. It all comes down to managing the namespace I guess
Great work, but I will hold my adoption until c++ source is available.
rcarmo 14 days ago [-]
The killer app for Polars in my day-to-day work is its direct Parquet export. It's become indispensable for cleaning up stuff that goes into Spark or similar engines.
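e.g. (illustrative paths):

  import polars as pl

  pl.read_csv("raw_export.csv").write_parquet("clean.parquet")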
__mharrison__ 14 days ago [-]
Many of the complaints about Pandas here (and around the internet) are about the weird API. However, if you follow a few best practices, you never run into the issue folks are complaining about.
I wrote a nice article about chaining for Ponder. (Sadly, it looks like the Snowflake acquisition has removed that. My book, Effective Pandas 2, goes deep into my best practices.)
otsaloma 14 days ago [-]
I don't quite agree, but if this was true, what would you tell a junior colleague in a code review? You can't use this function/argument/convention/etc you found in the official API documentation because...I don't like it? I think any team-maintained Pandas codebase will unavoidably drift into the inconsistent and bad. If you're always working alone, then it can of course be a bit better.
__mharrison__ 14 days ago [-]
I have strong opinions about Pandas. I've used it since it came out and have coalesced on patterns that make it easy to use.
(Disclaimer: I'm a corporate trainer and feed my family teaching folks how to work with their data using Pandas.)
When I teach about "readable" code, I caveat that it should be "readable for a specific audience". I hold that if you are a professional, that audience is other professionals. You should write code for professionals and not for newbies. Newbies should be trained up to write professional code. YMMV, but that is my bias based on experience seeing this work at some of the biggest companies in the world.
Would be nice to know what the internals of FireDucks are.
EDIT: I've found some benchmarks: https://fireducks-dev.github.io/docs/benchmarks/
Kalanos 14 days ago [-]
Regarding compatibility, fireducks appears to be using the same column dtypes:
```
>>> import numpy as np
>>> df['year'].dtype == np.dtype('int32')
True
```
cmcconomy 14 days ago [-]
Every time I see a new better pandas, I check to see if it has geopandas compatibility
thecleaner 14 days ago [-]
Sure, but it's single-node performance. This makes it not very useful IMO, since quite a few data science folks work with Hadoop clusters or Snowflake clusters or Databricks, where data is distributed and querying is handled by Spark executors.
chaxor 14 days ago [-]
The comparison is to pandas, so single node performance is understood in the scope.
This is for people running small tasks that may only take a couple days on a single node with a 32 core CPU or something, not tasks that take 3 months using thousands of cores.
My understanding for the latter is that pyspark is a decent option, while ballista is the more promising option to look forward to. Perhaps using bastion-rs as a backend could be useful for an upcoming system as well. Databricks et al. are cloud trash IMO, as is anything that isn't meant to be run on a local single-node system or a local HPC cluster with zero code change and a single line of config change.
While for most of my jobs I ended up being able to avoid HPC by simply being smarter and discovering better algorithms to process information, I recall liking pyspark decently well, though I preferred ballista due to the simpler installation of Rust compared to managing Java and JVM junk.
The constant problems caused by anything with a JVM backend, and the environment config that comes with it, were terrible to deal with every time I set up a new system.
In this regard, ballista is an enormous improvement. Anything that is a one-line install via pip on any new system, runs local-first without any cloud or telemetry, and requires no code change to run on a laptop vs. HPC is the only kind of option worth even beginning to look into and use.
Kalanos 14 days ago [-]
Hadoop hasn't been relevant for a long time, which is telling.
Unless I had thousands of files to work with, I would be loath to use cluster computing. There's so much overhead, cost, waiting for nodes to spin up, and cloud architecture nonsense.
My "single node" computer is a refurbished tower server with 256GB RAM and 50 threads.
Most of these distributed computing solutions arose before data processing tools started taking multi-threading seriously.
markhahn 12 days ago [-]
understood: big facilities get shared; sharing requires arbitration and queueing.
an interesting angle on 50 threads and 256G: your data is probably pretty cool (cache-friendly). if your threads are merely HT, that's only 25 real cores, and might be only a single socket. implying probably <100 GB/s memory bandwidth. so a best-case touch-all-memory operation would take several seconds. for non-sequential patterns, effective rates would be much lower, and keep cores even less busy.
so cache-friendliness is really the determining feature in this context. I wonder how much these packages are oriented towards cache tuning. it affects basic strategy, such as how filtering is implemented in an expression graph...
benrutter 14 days ago [-]
Anyone here tried using FireDucks?
The promise of a 100x speedup with 0 changes to your codebase is pretty huge, but even a few correctness / incompatibility issues would probably make it a no-go for a bunch of potential users.
i_love_limes 14 days ago [-]
I have never heard of FireDucks! I'm curious if anyone else here has used it. Polars is nice, but it's not totally compatible. It would be interesting to see how much faster it is for more complex calculations.
softwaredoug 14 days ago [-]
The biggest advantage of pandas is its extensibility. If you care about that, it’s (relatively) easy to add your own extension array type.
I haven’t seen that in other systems like Polars, but maybe I’m wrong.
caycep 14 days ago [-]
Just because I haven't jumped into the data ecosystem for a while - is Polars basically the same as Pandas but accelerated? Is Wes still involved in either?
PhasmaFelis 14 days ago [-]
"FireDucks: Pandas but Faster" sounds like it's about something much more interesting than a Python library. I'd like to read that article.
dkga 14 days ago [-]
Reading all the pandas vs. polars debate reminded me of the tidyverse vs. data.table discussion some 10 years ago.
hinkley 14 days ago [-]
TIL that NEC still exists. Now there’s a name I have not heard in a long, long time.
insane_dreamer 14 days ago [-]
surprised not to see any mention of numpy (our go-to) here
edit: I know pandas uses numpy under the hood, but "raw" numpy is typically faster (and more flexible), so curious as to why it's not mentioned
Gepsens 14 days ago [-]
It'll be polars and datafusion for me thanks
E_Bfx 14 days ago [-]
Very impressive, the Python ecosystem is slowly getting very good.
BiteCode_dev 14 days ago [-]
Spent the last 20 years hearing that.
At some point I think it's more honest to say "the python ecosystem keeps getting more awesome".
Kalanos 14 days ago [-]
Continues to be the best by far
gigatexal 14 days ago [-]
On average only 1.5x faster than polars. That’s kinda crazy.
geysersam 14 days ago [-]
Why is that crazy? (I think the crazy thing is that they are faster at all. Taking an existing api and making it fast is harder than creating the api from scratch with performance in mind)
gigatexal 13 days ago [-]
100x faster than pandas yet 1.5x faster than polars. Polars is stupid fast. It’s got a far better api too — much more ergonomic.
Plus the license isn't stupid.
I didn't know the licence, so I had a look, but I don't see what's bullshit about it. It's not a classical open source licence, but it's pretty close, and much better than closed source.
> and it would have potentially cost my employer tens of thousands of dollars a year
If your employer is not providing its software open source, there is nothing shocking about having to pay for the software it uses.
I just think it's a proprietary license that is trying to LARP as an OSS license. It sneaks in language that makes it so it's unclear how much it will actually cost you to use it. It makes me terrified to import anything touching it because I don't want to risk accidentally costing my employer millions of dollars.
I don't really see how it's "pretty close" to an OSS license. Part of an OSS license is that I can use the code for whatever I want, which is decidedly not the case with BUSL. I do appreciate that stuff eventually becomes Apache, so I guess that's better than nothing, but I'd rather just avoid the stuff entirely, or only use the Apache licensed stuff.
I also don't really like the idea that I could contribute to Akka and have my contributions monetized by Lightbend, while not even being allowed to use my own contributions without paying them a fee. I know that CLAs aren't exactly new in the OSS world, but at least if I were to make a contribution to Ubuntu, I'm still allowed to run Ubuntu Server for free, with my contributions included.
I guess the license just kind of feels "Bait and Switch" to me. It tries to get you to think that it's OSS and then smacks you with a "JK IT'S PROPRIETARY".
> If your employer is not providing its software open source, there is nothing shocking to have to pay for the software used
Sure, except in the case of Akka there's enough competition in the Java library world that I don't think that it's worth it. Vert.x is comparable, and the license is less likely to accidentally cost me lots of money.
I mostly think that Akka's licensing is way too expensive too, again especially when you consider that there's a good chunk of concurrency libraries in Java-land that have more business-friendly licenses.
We did a long podcast and a couple of blogs that offered transparency into the rationale for why we moved from Apache to BSL, which still downgrades to Apache after 36 months. See Emily Omier for the specifics.
It came down to survival. The company faced a bankruptcy event as customers were using the software without contributing, and after exhausting alternatives we needed to change the license model to create a more sustainable approach.
The consequence of this choice was less adoption from OSS and ISVs who need a flexible licensing model for embedding and redistribution. It also encouraged the Pekko fork, which is a branch that is 2.5 years old. And that branch helped older projects and OSS distributions to maintain their position without financial consequences.
It is not cheap to maintain Akka, and after 15 years we have turned a profit, albeit barely. We are growing, finally, and have a prosperous future, and most of our spend goes into development. It did allow us to create Akka 3, which is a simpler model for devs within enterprises, mixed with a consumption-based model that should be significantly cheaper than the traditional libraries, and cheaper than the cost to adopt most any other framework. We can debate the merits of different business models, but we couldn't have maintained the 50 CVE fixes and created a modern version of Akka if we hadn't taken this step.
We need a better strategy on how to appeal to the OSS community once more. To appeal to startups and academics, we have free commercial licenses and subscriptions, which nearly 200 accounts have signed up for in the last 18 months.
Surely it is also not cheap to maintain Spring Framework either, no?
We are a pure-play app dev platform, and that gets to the heart of why the business model is different. I'd argue that we are very motivated to make sure that customers are successful with app dev, as that is our bottom line, whereas our rivals are financially incentivized by infrastructure sales, not app dev outcomes.
That said, and I realize that this is crass but it's also honest: Akka's profitability isn't my problem. When I am looking to import a library for my job, I try my best to weigh the pros and cons of each (as we all do), and when I see a BUSL that's an immediate red flag; if Akka were the only cool concurrency library in the JVM world then I'd just put up with it, but when there are viable alternatives like Vert.x it's extremely hard to go to my employer and ask them to spend $5000/month + $0.15/Akka-hour [1], especially since we run thousands of individual JVMs, and running a comparable thing in Vert.x costs us nothing (albeit with having to do tech support ourselves). Whether or not it's "fair" that Vert.x is a pet project from Red Hat or VMware and therefore doesn't have to worry about financing is sort of orthogonal to whether or not I choose it or Akka.
This isn't meant to shit on Akka, it's very cool software, I'm just frustrated by the BUSL because it gives the illusion of an OSS license, the initial marketing around it looked like an OSS license, and I wasted about 15 hours writing some Akka code only to realize that I had to throw it away because there was no chance I was going to get my employer to approve a PR with BUSL-licensed libraries that would have cost us hundreds of thousands of dollars a year.
Again, apologies that this is rude, and if Akka/Lightbend/Typesafe is making a profit then of course all the best to you, but this is just my rationale.
[1] https://akka.io/pricing
ETA:
Re-reading this, I apologize for how hostile I come off. You're not trying to sell me, you're just giving justification, which is fine even if I'm not a huge fan of the license.
Basically it's a debate about how many dark patterns you can squeeze next to that "upfront language" before "marketing" slides into "bait-and-switch."
In fact, what's stopping the pandas library from incorporating fireducks code into the mainline branch? pandas itself is BSD.
So many foot guns, poorly thought through functions, 10s of keyword arguments instead of good abstractions, 1d and 2d structures being totally different objects (and no higher-order structures). I'd take 50% of the speed for a better API.
I looked at Polars, which looks neat, but seems made for a different purpose (data pipelines rather than building models semi-interactively).
To be clear, this library might be great, it's just a shame for me that there seems to be no effort to make a Pandas-like thing with a better API. Maybe time to roll up my sleeves...
For a comparison, dplyr offers a lot of elegant functionality, and the functional approach in Pandas often feels like an afterthought. If R is cleaner than Python, it tells a lot (as a side note: the same story for ggplot2 and matplotlib).
Another surprise for friends coming from non-Python backgrounds is the lack of column-level type enforcement. You write df.loc[:, "col1"] and hope it works, with all checks happening at runtime. It would be amazing if Pandas integrated something like Pydantic out of the box.
I still remember when Pandas first came out—it was fantastic to have a tool that replaced hand-rolled data structures using NumPy arrays and column metadata. But that was quite a while ago, and the ecosystem has evolved rapidly since then, including Python’s gradual shift toward type checking.
That's because it's a bad way to use Pandas, even though it is the most popular and often times recommended way. But the thing is, you can just write "safe" immutable Pandas code with method chaining and lambda expressions, resulting in very Polars-like code. For example:
Plus, nowadays, with the latest Pandas versions supporting Arrow datatypes, Polars' performance improvements over Pandas are considerably less impressive.
Column-level name checking would be awesome, but unfortunately no Python library supports that, and it will likely never be possible unless some big changes are made in the Python type hint system.
For the rest of your comment: it's the best you can do in Python. Sure, you could write SQL, but then you're mixing text queries with Python data manipulation, and I would dread that. And SQL-only scripting is really out of the question.
Big problem with pandas is that you still have to load the dataframe into memory to work with it. My data's too big for that and postgres makes that problem go away almost entirely.
(I'm the first to complain about the many warts in Pandas. Have written multiple books about it. This is annoying, but it is much better than [df.y > 0.5].)
You are probably thinking about `df.apply(lambda row: ..., axis=1)` which operates on each row at a time and is indeed very slow since it's not vectorized. Here this is different and vectorized.
But what are you saying, that if you type the wrong thing you might get wrong results? Yes, coding is like that.
What's your point? Make a point.
I find I basically never write myself into a corner with initially expedient but ultimately awkward data structures like I often did with pandas, the expression API makes the semantics a lot clearer, and I don't have to "guess" the API nearly as much.
So even for this usecase, I would recommend trying out polars for anyone reading this and seeing how it feels after the initial learning phase is over.
However, I still find myself using pandas for the timestamps, timedeltas, and date offsets, and even then I need a whole extra column just to hold time zones: since polars maps everything to a UTC storage zone, you lose the original/local TZ, which screws up heterogeneous-time-zone datasets. (And I learned you really need to enforce careful, manual, thoughtful consideration of time zone replacement vs. offsetting at the API level.)
Had to write a ton of code to deal with this; I wish polars had an explicit separation of local vs. storage zones in the Datetime data type.
IMO Polars sets a different goal: what's the most pandas-like thing we can build that is fast (and leaves open the possibility of more optimization) and clean?
Polars feels like you are obviously manipulating an advanced query engine. Pandas feels like manipulating this squishy datastructure that should be super useful and friendly, but sometimes it does something dumb and slow
My conclusion was that pandas is not for developers. But for one-offs by managers, data-scientists, scientists, and so on. And maybe for "hackers" who cludge together stuff 'till it works and then hopefully never touch it.
Which made me realize such thoughts can come across as smug, patronizing, or belittling. But they do show how software can be optimized for different use cases.
The danger then lies in not recognizing these use cases when you pull in something like pandas. "Maybe using pandas to map and reduce the CSVs that our users upload to insert batches isn't a good idea at all".
This is often worsened by the tools/platforms/libs' devs or communities not advertising these sweet spots and limitations. Not in the case of Pandas, though: it's really clear about not being a lib or framework for devs, but a tool(kit) to do data analysis with. Kudos for that.
It doesn't work for me so it can't work for anyone?
Considering switching from pandas and want to understand what is my best bet. I am just processing feature vectors that are too large for memory, and need an initial simple JOIN to aggregate them.
I previously had a pandas+sklearn transformation stack that would take up to 8 hours. Converted it to ibis and it executes in about 4 minutes now and doesn't fill up RAM.
It's not a perfect apples to apples pandas replacement but really a nice layer on top of sql. after learning it, I'm almost as fast as I was on pandas with expressions.
[1]: https://github.com/maxhumber/redframes
https://hexdocs.pm/explorer/exploring_explorer.html
It runs on top of Polars so you get those speed gains, but uses the Elixir programming language. This gives the benefit of a simple functional syntax w/ pipelines & whatnot.
It also benefits from the excellent Livebook (a Jupyter alternative specific to Elixir) ecosystem, which provides all kinds of benefits.
I agree that pandas does not have the best-designed API in comparison to, say, dplyr, but it also has a lot of functionality, like pivot, melt, and unstack, that is often not implemented by other libraries. It's also existed for more than a decade at this point, so there's a plethora of resources and Stack Overflow questions.
On top of that, these days I just use ChatGPT to generate some of my pandas tasks. ChatGPT and other coding assistants know pandas really well so it’s super easy.
But I think if you get to know Pandas after a while you just learn all the weird quirks but gain huge benefits from all the things it can do and all the other libraries you can use with it.
I 100% agree that pandas addresses all the pain points of data analysis in the wild, and this is precisely why it is so popular. My point is, it doesn't address them well. It seems like a conglomerate of special cases, written for a specific problem its author was facing, with little concern for consistency, generality, or other use cases that might arise.
In my usage, any time saved by its (very useful) methods tends to be lost on fixing subtle bugs introduced by strange pandas behaviours.
In my use cases, I reindex the data using pandas and get it to numpy arrays as soon as I can, and work with those, with a small library of utilities I wrote over the years. I'd gladly use a "sane pandas" instead.
I get it doesn't follow best practices, but it does do what it needs to. Speed has been an issue, and it's exciting seeing that problem being solved.
Interesting to see so many people recently saying "polars looks great, but no way I'll rewrite". This library seems to give a lot of people, myself included, exactly what we want. I look forward to trying it.
You can do the same with Polars, but you have to start messing about with datetimes and convert the simple problem "I want to calculate a monthly sum anchored on the last business day of the month" to SQL-like operations.
Pandas grew a large and obtuse API because it provides specialized functions for 99% of the tasks one needs to do on timeseries. If I want to calculate an exponential weighted covariance between two time series, I can trivially do this with pandas: series1.ewm(...).cov(series2). I welcome people to try and do this with Polars. It'll be a horrible and barely readable contraption.
YC is mostly populated by technologists, and technologists are often completely ignorant about what makes pandas useful and popular. It was built by quants/scientists, for doing (interactive) research. In this respect it is similar to R, which is not a language well liked by technologists, but it is (surprise) deeply loved by many scientists.
I've had trouble determining whether one timestamp falls between two others across tens of thousands of rows (with the polars team suggesting I use a massive cross product and filter, which worked but exploded the memory requirement), whereas in pandas I was able to sort the timestamps and thereby only needed to compare against the preceding/following few based on the index of the last match.
The other issue I've had with resampling is with polars automatically dropping time periods with zero events, giving me a null instead of zero for the count of events in certain time periods (which then gets dropped from aggregations). This has caught me a few times.
But other than that I've had good luck.
> my_df.resample("BME").apply(...)
Done. I don't think it gets any easier than this. Every time I tried something similar with polars, I got bogged down in calendar-treatment hell and large, obscure SQL-like contraptions.
Edit: original tone was unintentionally combative - apologies.
Reviewing my work, I only needed an hourly aggregation, which was similarly easy in polars and pandas (I misspoke about it being easier); what I found easier was grouping by time data that wasn't amenable to `resample`.
In polars I had no problems using a regular group_by with a pl.col.dt object, whereas in pandas I remember struggling to do so, even though it seemed straightforward.
Sorry, I wish I could remember more details; this was probably 5 years ago that I was writing the pandas code and just converted it to polars about a year ago, so it's possible that I just got better at python in the meantime (though I was writing much more python back then). And of course a rewrite is likely to feel easier the second time.
The other confounding issue is that the eager pandas code crashed with OOM regularly and took several minutes to run, whereas polars handles it very well (which I'm sure to some degree is it optimizing things that I could have done manually), but this made iterating on this codebase feel much less onerous.
`.join_where()`[1] was also added recently.
[1]: https://docs.pola.rs/api/python/stable/reference/dataframe/a...
But I'm guessing it's something like this:
import pandas as pd
def calculate_monthly_business_sum(df, date_column, value_column):
    # (the function body was lost in the original comment; this is a minimal
    # reconstruction that sums value_column per business-month-end,
    # assuming pandas >= 2.2 for the "BME" alias)
    df = df.assign(**{date_column: pd.to_datetime(df[date_column])})
    return df.set_index(date_column)[value_column].resample("BME").sum()

# Example usage:
df = pd.DataFrame({'date': ['2024-01-01', '2024-01-31', '2024-02-29'], 'amount': [100, 200, 300]})
result = calculate_monthly_business_sum(df, 'date', 'amount')
print(result)
Which you can run here => https://python-fiddle.com/examples/pandas?checkpoint=1732114...
df.resample("BME").sum()
Done. One line of code and it is quite obvious what it is doing - with perhaps the small exception of BME, but if you want max readability you could do:
df.resample(pd.offsets.BusinessMonthEnd()).sum()
This is why people use pandas.
> df.resample("BME").sum()
Assuming `df` is a dataframe (ie table) indexed by a timestamp index, which is usual for timeseries analysis.
"BME" stands for BusinessMonthEnd, which you can type out if you want the code to be easier to read by someone not familiar with pandas.
It's so easy for my analyst team because they use it daily, but my developers probably would never have thought of or known about BME and would have implemented the logic again themselves.
Here's my alternative: https://github.com/otsaloma/dataiter https://dataiter.readthedocs.io/en/latest/_static/comparison...
Planning to switch to NumPy 2.0 strings soon. Other than that I feel all the basic operations are fine and solid.
Note for anyone else rolling up their sleeves: You can get quite far with pure Python when building on top of NumPy (or maybe Arrow). The only thing I found needing more performance was group-by-aggregate, where Numba seems to work OK, although a bit difficult as a dependency.
https://siuba.org
I have not yet used siuba, but would be interested in others' opinions. The activation energy to learn a new set of tools is so large that I rarely have the time to fully examine this space...
Also, having (already a while ago) looked at the implementation of the magic `_` object, it seemed like an awful hack that will serve only part of the use cases. Maybe someone can correct me if I'm wrong, but I get the impression you can do e.g. `summarize(x=_.x.mean())` but not `summarize(x=median(_.x))`. I'm guessing you don't get autocompletion in your editor or useful error messages, and it can then get painful using this kind of magic.
A pity their comparisons don’t include tidyverse or R’s data.table. I think R would look simpler, but as it stands it remains unclear.
Yeah, Pandas has that early PHP feel to it, probably out of being a successful first mover.
Writing pandas code is a bit redundant. So what?
Who is to say that fireducks won't make their own API?
Polars rocked my world by having a sane API, not by being fast. I can see the value in this approach if, like the author, you have a large amount of pandas code you don't want to rewrite, but personally I'm extremely glad to be leaving the pandas API behind.
Is there anything specific you prefer moving from the pandas API to polars?
Say you want to take an aggregation like "the mean of all values over the 75th percentile" alongside a few other aggregations. In pandas, this means you're in for a bunch of hoops and messing around, because you can't express it via the API. Polars' API lets you express this directly without having to implement any kind of workaround.
Nice article on it here: https://labs.quansight.org/blog/dataframe-group-by
I don't understand what it means. It looks like a contradiction. Does it have a BSD-3 licence or not?
> While the wheel packages are available at https://pypi.org/project/fireducks/#files, and while they do contain Python files, most of the magic happens inside a (BSD-3-licensed) shared object library, for which source code is not provided.
Edit: To use, redistribute, modify, and distribute modified versions.
https://opensource.org/license/bsd-3-clause
Imagine being like "the project is GPL - just the compiled machine code".
GitHub has always been a platform for "we love to host FOSS but we won't be 100% FOSS ourselves", so it makes sense they allow that kind of usage for others too.
I think what you want, is something like Codeberg instead, which is explicitly for FOSS and 100% FOSS themselves.
Is this shitware? It seems to be very high quality code
How would you define "quality" in this context?
One could also say that quality is related to the functional output.
Right, I said nothing that contradicts that ("High quality code isn't just code that performs well when executed, but also ..."). High quality functional output is a necessary requirement, but it isn't sufficient to determine if code is high quality.
My point was that it's all subjective in the end.
Imagine writing a very good program, running it through an obfuscator, and throwing away the original code. Is the obfuscated code "high quality code" now, because the output of the compilation still works as before?
Do you mean how well it was written, or do you mean how well it performs? Or do both matter? Equally, or one more/less than the other?
It probably depends on whether you're the developer taking over the codebase, or the customer running the code in production..
Take video games: a lot of it is messy spaghetti C++ code, not modular or well structured, full of hacks and manual optimizations, to give the best possible performance on available hardware.
It might be impossible to parse or maintain, but it does the job about as well as possible, which is really all that matters to the end user. I would call that high quality code.
So again, subjective...
https://fireducks-dev.github.io/files/20241003_PyConZA.pdf
The main reasons are
* multithreading
* rewriting base pandas functions like dropna in C++
* in-built compiler to remove unused code
Pretty impressive, especially given that you just `import fireducks.pandas as pd` instead of `import pandas as pd`, and you are good to go.
However I think if you are using a pandas function that wasn't rewritten, you might not see the speedups
They are showing a 20-30% improvement over Polars, ClickHouse, and DuckDB. But those three tools are SOTA in this area and generally rank near each other in every benchmark.
So 20-30% improvement over that cluster makes me interested to know what techniques they are using to achieve that over their peers.
> Future Plans By providing the beta version of FireDucks free of charge and enabling data scientists to actually use it, NEC will work to improve its functionality while verifying its effectiveness, with the aim of commercializing it within FY2024.
It's freeware under an open source license. Really misleading.
It looks like something you should stay away from unless you need it REALLY badly. It's a proprietary product with unknown pricing and no indication of what their plans are.
Does the fact that the binary is BSD licensed allow reverse-engineering?
Reversing and re-compiling should count as modification?
I've had the chance to play with it on some of my code; queries that ran in 8+ minutes came down to 20 seconds.
Re-writing in Polars involves more code changes.
However, with Pandas 2.2+ and arrow, you can use .pipe to move data to Polars, run the slow computation there, and then zero copy back to Pandas. Like so...
Where can I find the code? I don't see it on GitHub.
> contact@fireducks.jp.nec.com
So it's from NEC (a major Japanese computer company), presumably a research artifact?
> https://fireducks-dev.github.io/docs/about-us/ Looks like so.
I wonder how much of this is fundamental to the common approach of writing libraries in Python with the processing-heavy parts delegated to C/C++ -- that the expressive parts cannot be fast and the fast parts cannot be expressive. Also, whether Rust (for polars, and other newer generation of libraries) changes this tradeoff substantially enough.