Cool, I’m interested to see where you go with this.
I found the operator precedence unnatural, and it looks like a lot of other folks in this thread did too. I would naturally assume `cat:dog` would be equivalent to `(cat):(dog)` rather than `ca(t:d)og`.
bane 36 days ago [-]
This is a very interesting idea for a few reasons.
> I would naturally assume `cat:dog` would be equivalent to `(cat):(dog)` rather than `ca(t:d)og`
It was confusing to me too until I remembered that we all kind of use regexes sort of wrong. They're "really" supposed to be considered as generators and not matchers. So IIRC, cat|dog as a "regular expression" (not a regex) is supposed to formally expand to
{catog,cadog}
For matching, this set of strings can then be substring matched against some larger text.
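Under the tight-binding reading this comment describes (where the alternation applies only to the adjacent characters, i.e. ca(t|d)og), the expansion step can be sketched in plain Python. This is a toy illustration, not how any production regex engine works:

```python
# Toy "regex as generator" sketch: a pattern is a sequence of parts, each a
# literal string or a set of single-character alternatives.
from itertools import product

def expand(parts):
    choices = [[p] if isinstance(p, str) else sorted(p) for p in parts]
    return {''.join(combo) for combo in product(*choices)}

# ca(t|d)og as a string-set generator:
generated = expand(['ca', {'t', 'd'}, 'og'])
print(sorted(generated))  # ['cadog', 'catog']
```

Matching a string against the pattern then reduces to membership in (or substring search over) the generated set.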
The problem is that almost no regex matching engine actually does this, and so now they'll do all kinds of strange things either to meet our expectations, or for efficiency or something.
If you go and try a bunch of different regex tools you'll get variations that treat it as (cat)|(dog) or (cat)|(dog)|(ca[td]og) or something else.
So from a more formal conceptualization I think cat:dog should produce ca(t:d)og not (cat):(dog). But our experience with "regex" tools has subverted that formalization and now everybody just puts parens around expressions they want to alternate.
My real, if minor, issue with this proposal, as interesting and well thought out as it is, is that it feels like it's just trying to get back to regular expressions as generators, which they actually are, coming from the other side of a few decades of abusing them as regexes shaped by user expectations. In other words, the problem is the tooling, not the syntax.
source: I've worked adjacent to this space in the past and if you've never thought of regexes as string set generators you can toy with the idea here
but again, how these generator tools work is also very specific. The ones I used to work with had a variety of ways to specify constraints on closures and such to restrict the generators.
Timwi 36 days ago [-]
There’s no reason to say that “ca(t:d)og” is a “more correct” parsing than “(cat):(dog)”. You did hit the nail on the head insofar as you realized that we as programmers have built strong habits and make assumptions on the basis of those habits. But you didn’t take it to its logical conclusion: having a text-based syntax to represent regexes is also such a habit/assumption.
In pure theoretical computer science, regular expressions exist as an abstract concept independent from syntax or parsers. They are an “algebra”, which means they are composed of elements connected with operators, but they are not inherently tied to a syntax. In the most fundamental formulation of regular expressions (the one in the Chomsky hierarchy), the only operators are alternation (which modern syntaxes express as “|”), the Kleene star (“*”) and — notably — concatenation, which modern syntaxes simply omit, in a way comparable to how modern mathematics notation omits the multiplication operator when you write “2x”.
In the same way that maths needs rules to define whether “2x²” means “(2x)²” or “2(x²)”, regex syntax needs such rules too. This is called operator precedence. I’m sure you’ve heard that before, but you just might not have realized that the regular expression “ab” has an operator in it because it is typically not written.
Now I’m not going to argue that the operator precedence in maths notation is haphazard or without reason — but it is arbitrary. It was arbitrarily chosen to be the most useful to mathematicians using the notation. And it turns out that giving exponentiation higher precedence than (invisible) multiplication (meaning: “2x²” means “2(x²)” rather than “(2x)²”) is more useful.
So coming back to the original example, whether “cat:dog” means “ca(t:d)og” or “(cat):(dog)” is simply a matter of defining the precedence of the “:” operator relative to the concatenation operator. You can argue (and I would agree with you) that one is more useful than the other, and therefore preferable (in the same way that “(cat)|(dog)” is more useful than “ca(t|d)og”), but neither of them is more fundamentally correct or primal or, as you put it, “supposed to formally expand to”.
c0nstantine 36 days ago [-]
I agree with the point that precedence is arbitrary. The current version looks like this:
1 Escaped characters
2 []
3 ()
4 * + ? {m,n}
5 :
6 . (implicit concatenation)
7 |
I have some reasons to put it that way. I want ':' to be somewhat 'atomic'. If you think about '*' or '+', they could be lower in the table as well. Anyway, I will try to put ':' lower in the next version and see how it goes.
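To make the precedence question concrete, here is a toy recursive-descent parser (my own sketch, not trre's actual grammar) handling literals, '(...)', ':' and '|', with ':' binding tighter than concatenation as in the table above. Parsing 'cat:dog' then yields the ca(t:d)og reading: only the adjacent 't' and 'd' pair up.

```python
def parse_alt(s, i=0):               # lowest precedence: '|'
    node, i = parse_concat(s, i)
    while i < len(s) and s[i] == '|':
        rhs, i = parse_concat(s, i + 1)
        node = ('|', node, rhs)
    return node, i

def parse_concat(s, i):              # implicit concatenation
    node, i = parse_colon(s, i)
    while i < len(s) and s[i] not in '|)':
        rhs, i = parse_colon(s, i)
        node = ('.', node, rhs)
    return node, i

def parse_colon(s, i):               # ':' binds tighter than concatenation
    node, i = parse_atom(s, i)
    if i < len(s) and s[i] == ':':
        rhs, i = parse_atom(s, i + 1)
        node = (':', node, rhs)
    return node, i

def parse_atom(s, i):                # a literal character or a '(...)' group
    if s[i] == '(':
        node, i = parse_alt(s, i + 1)
        return node, i + 1           # consume ')'
    return s[i], i + 1

tree, _ = parse_alt('cat:dog')
# the ':' node pairs only 't' with 'd': (':', 't', 'd') appears in the tree
```

Swapping the bodies of parse_concat and parse_colon (so ':' sits below concatenation) would instead give the (cat):(dog) reading most people in the thread expect.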
c0nstantine 37 days ago [-]
Thank you for the feedback. Yes, the precedence is a question for me. Maybe I will change this.
If I shift it below concatenation there could be another problem: with a non-associative ':', such chains should be illegal, and I am not sure how to treat this:
cat:dog:mouse
In the current version I inject the epsilon (empty string). It looks natural. E.g. to remove every second letter I could run '..:', which is technically '.(.:eps)':
echo 'abcde' | ./trre '..:'
result: 'ace'
Actually, ':' associativity could have a meaning as composition of regular relations, but I found it too complicated for now.
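For comparison, the same every-second-letter effect can be had with an ordinary regex replacement (standard Python re, shown here only to mirror the trre '..:' one-liner above):

```python
# Match two characters at a time, keep the first, drop the second.
import re

def drop_every_second(text):
    return re.sub(r'(.).', r'\1', text)

print(drop_every_second('abcde'))  # ace
```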
kccqzy 36 days ago [-]
I do not understand the rules by which you inject the epsilon and I think this is a source of confusion for many people. I had thought that an epsilon could be injected anywhere REGEX can appear (effectively allowing epsilon as a REGEX) but of course that just leads to infinite number of parses. Manually injecting epsilon is a highly hacky thing to do; better to consider that when you design the grammar.
I would not worry about "cat:dog:mouse" because intuitively it is clearly correct and it means replacing cat with mouse. With parentheses it could be written as "((cat:dog):mouse)".
c0nstantine 35 days ago [-]
Epsilon injection appears whenever right or left side of ':' has no operand. E.g.
(:a)
(a:)
a:|b
a|b:
etc
I will try to change the precedence and see how it works.
Btw, what do you think about explicit operators '>' and '<', where '<' works as a usual regex matcher and '>' as a generator? For example, to change 'cat' to 'dog' there could be something like '<cat>dog', where the '<cat' part is a parser and '>dog' is a generator. Thanks.
kccqzy 34 days ago [-]
I think your epsilon injection rule is trying to achieve this kind of production:
I think this would work better, but ':a:' is still ambiguous: it has two parse trees.
twiss 37 days ago [-]
Yeah. Similarly, for the range transformations, instead of `[a:A-z:Z]`, I would suggest `[a-z:A-Z]`; and instead of `[a:b-y:zz:a]`, something like `[a-y:b-z;z:a]`, perhaps.
kazinator 37 days ago [-]
I would suggest simply [a-z]:[A-Z], inspired by tr.
Then there is no syntactic special case. This is just EXPR:EXPR; the special case is that both EXPR are character class syntax, and so the tr-like range mapping applies.
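As a rough Python analogy (str.translate as a stand-in for the proposed [a-z]:[A-Z] syntax, not trre itself), tr's pairwise range mapping looks like:

```python
# str.maketrans builds exactly this kind of position-by-position range map.
import string

table = str.maketrans(string.ascii_lowercase, string.ascii_uppercase)
print('cat dog'.translate(table))  # CAT DOG
```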
c0nstantine 36 days ago [-]
[a-z] is equivalent to 'a|b|...|z' in the normal regex language.
So if we do [a-z]:[A-Z] it should be expanded to:
(a|b|...|z):(A|B|...|Z)
which is pretty legal in trre but has a different meaning: mapping any of a-z to ALL of A-Z (generating A-Z on each occurrence of a lowercase letter).
kazinator 36 days ago [-]
[a-z] is a semantically equivalent regex to a|b|..|z, but the two are not equivalent syntactic forms.
Distinct syntactic forms can be given distinct semantics, as long as there is rhyme and reason.
Moreover, the right side of the colon is not the normal regex language, it only borrows its syntax. So there, we may be justified in denying that the usual equivalence holds between character class syntax and a disjunction of the symbols denoted by the class.
c0nstantine 36 days ago [-]
The right side is a normal regex language syntactically. Semantically it is a generator instead of a parser (consumer).
But I got your point. Maybe there are some ways to do it in a consistent way. A straight tr-like syntax won't work though, e.g. I really want something like this to be valid:
[a-b]:(x|y) (pairs a:x, b:x, a:y, b:y)
and I prefer not handle these in some ad-hoc way.
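The cross-product pairing wanted for [a-b]:(x|y) can be sketched with itertools (variable names here are mine, not trre's):

```python
# Every input character pairs with every alternative on the right.
from itertools import product

pairs = set(product('ab', 'xy'))
print(sorted(pairs))  # [('a', 'x'), ('a', 'y'), ('b', 'x'), ('b', 'y')]
```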
kazinator 36 days ago [-]
I also got your point. The right side is a regular expression because it denotes a regular set.
languagehacker 36 days ago [-]
For folks interested in finite-state transducers and other kinds of tooling available, check out XFST (Xerox Finite-State Transducer), which has been used in computational linguistics applications for a good 20 years now.
I remember a Finnish researcher from PARC coming to one of my classes at UT to show how you can use FSTs for handling Finnish morphology, which is, on its face, quite a feat.
http://hfst.github.io/ is the modern open source version of XFST; it subsumes foma and openfst, pretty sure it does all of what trre does and more.
woodson 36 days ago [-]
There’s also k2 (https://github.com/k2-fsa/k2) which implements a lot of FSA and FST algorithms in CUDA, with PyTorch bindings.
mcyc 36 days ago [-]
People may also be interested in Pynini [1], a python wrapper (+ a lot of additional ease-of-use functionality) of OpenFst [2] (a really great library for transducers).
There are some good tutorials in the form of homework assignments (from like Johns Hopkins and some others) that go through Pynini use cases.
If you are looking for an alternative to standard regex, especially if you have trouble with group logic and are looking for something maintainable, you might like the Rosie Pattern Language.
sweet, I did my "Diplom" CS thesis around 1997 on finite state transducers. It was much less trivial than I'd thought. The task was to implement composition, and DFAs where possible, also for composed transducers: an "algebra of finite state transducers". The use case was morphology. The topic was heavily underestimated and I had to finish half way through. so: chapeau :-)
anyhow
wrt syntax, are you sure you want ':' to bind stronger than concatenation 'ab' ?
username223 37 days ago [-]
I used OpenFST for bioinformatics back in the early 2000s. They were fun to play with, but never ended up being useful for the things I was working on. It's cool to see that the project is still going 20 years later: https://www.openfst.org/twiki/bin/view/FST/WebHome
woodson 35 days ago [-]
It was used quite a bit in WFST-based speech recognition decoding, before so-called end-to-end approaches (CTC, RNN-T, AED, etc) took over.
Etheryte 37 days ago [-]
I have to say, incredibly bold of you to essentially hinge your graduation on whether you can regex hard enough or not.
c0nstantine 37 days ago [-]
yeah. Transducers are a very old topic. For some reason they were never connected to a specific language the way regexes are.
> wrt syntax, are you sure you want ':' to bind stronger than concatenation 'ab' ?
That's something I am still not sure about. I took a hundred examples and it looked more natural this way (':' binding tighter than '.'). But I can change it with the change of one digit in the code, literally. That's why I'm posting here. I need some real feedback.
froh 37 days ago [-]
hm maybe juxtapose a number of examples in one precedence and then the other? and share them to gather feedback?
also,
* colon as member of a character class (and dash and how they interact)
* posix character classes (and the colon in the syntax)
* word boundaries are really useful in practice
* think of ünicøde early and look at what russ cox did there
boundaries, what do you decide to exclude? for example back references and grouping have fun with DFAs (again see russ cox and re2)
composition is fantastic, and a shortcut to exponential land because grammars (as in Chomsky hierarchy) can easily be expressed with transducers, yay.
boundaries will also clarify use cases and allow to state: "if you want to do xyz please use re2 and a bit of code instead"
and one "technicality" that hit me once with re2 was a static buffer for resumable iteration. I'd have loved to reallocate the buffer, drop its beginning and extend it at the end, but alas, the state of re2 has hard pointers into the buffer, not relative ones. I think this was when re2 hit the end of the buffer mid-expression during an incremental parse. So you can't reallocate; instead you have to drop the hard-earned state and resume.
anyhow, it's been quite a while so I'm no longer in the thicket of this :-D
what's your driver? curiosity? or some itch?
but I really enjoy seeing your project :-)
c0nstantine 36 days ago [-]
Thank you for the detailed comment.
So Unicode is at the top of my TODO list. Boundaries are a very interesting topic. Maybe I'll extend the doc to include the details.
> what's your driver? curiosity? or some itch?
It's an itch :). 8 years ago I explored automata theory and found finite transducers to be a handy and natural way to transform texts. And as regex corresponds to FSA (finite state acceptors), I wanted to create a language that corresponds to FST (finite state transducers). There is a lesser-known algorithm for FST determinization and I wanted to apply it to make the transformation fast and efficient. It turned out to be not as simple as I initially expected.
layer8 37 days ago [-]
This doesn’t seem sufficient as soon as you want to perform some kind of structural substitution, for example doing the equivalent of s/"([^"]*)"/'$1'/. If it could do that and also somehow be able to replace any of the [^"] that match ['] by \', that would seem more useful.
More generally speaking, since regular expressions effectively define a parse tree on their matches, being able to perform more general transformations of those trees would be useful.
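For reference, the structural substitution described above can be done with an ordinary regex and a replacement callback (standard Python re, not trre): rewrite "..." as '...' while escaping any single quotes inside the matched text.

```python
import re

def requote(text):
    # With a callable replacement, re.sub uses the returned string verbatim,
    # so we can escape inner single quotes before wrapping.
    return re.sub(r'"([^"]*)"',
                  lambda m: "'" + m.group(1).replace("'", r"\'") + "'",
                  text)

print(requote('say "hello world"'))  # say 'hello world'
```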
johnnymellor 36 days ago [-]
If I understand correctly, the following trre expression does what you're asking for:
":'(':(\\')|[^"'])*":'
tomashubelbauer 36 days ago [-]
So this is how it feels to read a regex when you don't know how to regex.
layer8 36 days ago [-]
Good point, I didn't think of that solution.
c0nstantine 36 days ago [-]
Thank you for doing my work! :)
c0nstantine 36 days ago [-]
Hi,
If I understand it correctly, you want to change something inside the "..." block and change the double quotes to single quotes.
So I substitute the text inside "" by the symbol '-' using the expression ".+?:-", simultaneously changing the surrounding quotes.
The question mark means non-greedy mode.
crazygringo 37 days ago [-]
> Regular expressions is a great tool for searching patterns in text. But I always found it unnatural for text editing.
The entire purpose of this project seems to hinge on this assertion, but there isn't a single example.
I don't understand what makes regex unnatural for editing? What is meant by editing? Why do people struggle with groups?
There are lots of examples of the syntax for this project, but why is it better than regular regex?
If there were a few examples showing "here's the vanilla regex version, here's my version, here's why my version makes this easier" I might be able to understand this project.
Etheryte 37 days ago [-]
I would say regex is usually pretty write once edit never. Building a prototype that tries to look beyond that is a great way to see what a better future might look like in this regard.
psychoslave 37 days ago [-]
Sure, though you can do commented regexps in Perl and Ruby and probably others, using the x flag if my memory serves well, though in practice, for maintainability, you'll probably better avoid this kind of "clever" trick.
Just skimming through the readme, I'm also in doubt that the paradigm shift would really be any better than folk regexp.
That doesn't make the existence of the project any less cool, of course. Congratulations to the author, and don't let such a comment narrow down your motivation, as it's really not the point. If you enjoy doing it, great. If you learn something in the process, awesome. If it can lead to something that I can't envision, how marvelous. And even if it doesn't, it was still a great epic and you can be proud.
c0nstantine 36 days ago [-]
Oh, I've learnt a lot. Initially I wanted to complete the whole project in two weeks and it took a few months. The hardest part was the design of the DFT determinization algorithm.
Thanks for your feedback!
c0nstantine 37 days ago [-]
Fair point. The most explicit example is when you need to change something in context. For example, if we need to change 'y' to 'Y' only when it occurs between 'x' and 'z', you would do something like this in Python:
import re
pattern = r'(x)y(z)'
replacement = r'\1Y\2'
result = re.sub(pattern, replacement, text)
I would like to replace it with 'xy:Yz' pattern:
result = re.trre('xy:Yz', text)
If you need your x, z to be more complicated patterns or even regex themselves it can be more handy using this approach.
crazygringo 37 days ago [-]
Thanks!
I guess I'm still struggling to see how it's simpler overall.
Most of the examples on your page don't involve groups at all, e.g.:
That already seems a lot more complicated than just:
re.sub('cat', 'dog', 'catcatcat')
I don't need to use groups that often in regex replacements, and when I do I'm already trying to do something complicated, and it's not clear to me why the colon syntax is easier to write, easier to understand, or if it's as flexible.
Not trying to criticize the project, just genuinely trying to understand the specific strengths and limitations of the proposed syntax. E.g. what if I want to turn xyz into zYx?
c0nstantine 37 days ago [-]
>> E.g. what if I want to turn xyz into zYx?
echo 'zyx' | ./trre 'xy:Yz|zy:Yx'
It is still a regular language. I do not introduce references.
You are right in the sense that `sed` is a far superior editor. But here I see some advantages:
- the current implementation is super small; it is direct translation to an automaton
- complex patterns may be compiled in a more efficient way using a deterministic transducer. I can't defend this claim now but I have some evidence
- there are some tricks you can do using the 'generative' part of it, e.g. you can even find a Levenshtein distance of 1 between two strings just by generating substitutions/insertions/deletions, and implement a simple spell checker.
Overall, I think you have a good point. Maybe it is just marginal improvement (if any). It was more comfortable to write in this style instead of group usage. I used it for some time and found it handy (especially as extended `tr`).
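The distance-1 spell-checker trick mentioned in the list above is essentially the classic "edits at distance 1" construction. A plain-Python sketch (function names are mine, not part of trre):

```python
# Generate every string one edit away from `word` (deletion, substitution,
# insertion), then test membership to check Levenshtein distance <= 1.
import string

def edits1(word, alphabet=string.ascii_lowercase):
    splits  = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {l + r[1:] for l, r in splits if r}
    subs    = {l + c + r[1:] for l, r in splits if r for c in alphabet}
    inserts = {l + c + r for l, r in splits for c in alphabet}
    return deletes | subs | inserts

print('cot' in edits1('cat'))  # True: one substitution away
```

A transducer does the same thing lazily, as an automaton, instead of materializing the whole set.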
rendaw 37 days ago [-]
I think it's an interesting project, but is this a functional replacement for regex substitutions? It seems like you can only replace text with something else at the same location. Separating the replacement from the matching can be unwieldy, but you can also reorder, repeat, and insert text between things.
IIUC
zokier 37 days ago [-]
Well, I think it is fair to say that regexes alone do not provide any facilities for editing. You have groups, but you have to use some other language (like sed) to compose those groups.
everdimension 37 days ago [-]
Come on, it's about replacements. They're easier to express (meaning literally easier to type out) with the author's syntax
Great project
wfn 37 days ago [-]
What a pleasure your C code is :) very nice (currently reading it)
My only quick comment is - the link to `theory.pdf` in README is broken (your pdf is in docs/ dir, so just need to change the url to include docs/).
c0nstantine 37 days ago [-]
Thank you! For the feedback and for pointing out the typo. Fixed. Actually my C is very rusty and I am a bit uncomfortable about this.
zoogeny 37 days ago [-]
> Avoid using * or + in the right part, as it can cause infinite loops
Why not just disallow this? I understand it would make the grammar more difficult to specify - but I don't see any good reason to keep it.
c0nstantine 37 days ago [-]
Fair point. I agree. For now it is better to disable it.
The rationale was to implement a fun operation called transducer composition. It is possible to do simple operation on strings and compose trre's like filters. But I haven't finished it yet. So again, a fair point.
shawnz 37 days ago [-]
Another question about this issue: in the case of `:a*` for example, why doesn't it just pick empty string as the replacement text and immediately exit? Why would it be an infinite loop if `-g` isn't specified?
And then if `-g` were specified, you could use this to create infinite generators, which could be a useful construct in some cases -- like `yes` but more powerful.
EDIT: Another interesting use case, if I am understanding correctly: if this worked, then you could use `:(<regex>)` to have it output an example of a string that matches any given regex. `-g :(<regex>)` produces a generator of every string in the language matched by that regex. `-g :(<regex>) | head -n 100` would give you 100 examples.
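The ':(<regex>)' generator idea can be approximated with a brute-force sketch in Python: enumerate strings over an alphabet in length order and keep those the pattern fully matches. (Illustrative only; a real transducer would walk the automaton instead, and an unsatisfiable pattern would make this loop forever.)

```python
import re
from itertools import count, islice, product

def language(pattern, alphabet='ab'):
    for n in count():                       # lengths 0, 1, 2, ...
        for tup in product(alphabet, repeat=n):
            s = ''.join(tup)
            if re.fullmatch(pattern, s):
                yield s

print(list(islice(language(r'a*b'), 3)))  # ['b', 'ab', 'aab']
```

Piping through `islice` (or `head` in the shell version) is what keeps the infinite enumeration usable.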
c0nstantine 36 days ago [-]
The '-g' flag is obsolete. Somehow it got into my new docs. The right way is to use '-ma' flags where '-m' is for matching the whole string and '-a' stands for all the outputs.
You got the idea correctly. E.g. to generate all strings of length 5 over a 3-letter alphabet (truncated to the first 1000) you can do:
echo '' | ./trre -am ':[a-c]{5}' | head -n 1000
The docs are fixed now. Thanks for pointing this out.
Infinite generators are something nice to have, I agree. I just haven't wrapped my head around how to do it in an ergonomically correct way.
groby_b 37 days ago [-]
It's a cool exploration, but I'm missing examples of why it's actually better. (Which, TBF, might just be an indicator I spent too much time with regexps :)
I.e - why is trre "(cat):(dog)" an improvement over s/cat/dog? What's the improvement of "(x:)or" over s/xor/or? And so on. Pretty much all the examples match (at least in my head) to relatively easy regexps.
I think the core advantage, if there is one, would be in group logic, so maybe the examples should lean into that - even before explaining the basics. I'd explain why it's actually a better choice before explaining the full syntax, or hello-world use cases.
For the caesar cipher example, it screams for a "apply this in reverse" - it's a pretty common request for a lot of text replacements, but it's super clear with this one. (Because programmer brain immediately screams "why express logic twice")
I don't know if it's useful (yet), but I think it's great you're trying to explore alternatives to a long-established status quo. (Caveat: That usually means there's a good chance it won't land. But it's still great to see an exploration)
andrewla 37 days ago [-]
I feel like this is very underspecified. The very first example:
$ echo 'cat' | trre 'c:da:ot:g'
dog
Feels strange. What is happening here; the grammar says
What is the parse tree here? Why is "c" not being replaced with "da"? Or why isn't c being removed and "da" being replaced by "ot"?
I do like the idea of having search/replace semantics that are more intuitive than grouping operators; back in MS-DOS days you could do "ren *.log *.txt" and it would work, which feels bananas to my modern bash-minded way of thinking, but it's very obvious at a glance what it is supposed to do.
danielparks 37 days ago [-]
This is a matter of operator precedence and tokenization. Tokens are single characters in this language, and there is an invisible operator between them.
If the operator were explicit (let’s call it ~), the example would look like this:
$ echo 'cat' | trre 'c:d~a:o~t:g'
dog
With unnecessary parentheses:
$ echo 'cat' | trre '(c:d)~(a:o)~(t:g)'
dog
c0nstantine 36 days ago [-]
That's true. Thank you for elaborating.
There is a hidden concatenation operator, as in usual regular expressions. In the code I denote it with a lower dot '.' (as in the old Thompson implementation).
c0nstantine 36 days ago [-]
The grammar is underspecified; the full grammar is more complex. I guess I should just remove the current version from the docs. Right now it is indeed confusing.
> Why is "c" not being replaced with "da"?
It is all about precedence. According to the discussion I think I've chosen a wrong one and it raises confusion. Current version of precedence table is this:
So ':' binds stronger than '.' (implicit concatenation).
kccqzy 37 days ago [-]
Yes it is underspecified. The deletion example shows that an empty string is possibly a REGEX. So you can essentially treat any position as containing as many empty-string regexes as you want, and there are indeed an infinite number of parses.
If we instead require regexes to be non-empty (breaking the deletion examples), then the ambiguity becomes that of concatenation: whether it's '(((c:d)(a:o))(t:g))' or '((c:d)((a:o)(t:g)))'. Assuming associativity, this would not matter.
Imustaskforhelp 37 days ago [-]
From the way it seems to work, it looks like it splits as 'c:d', then 'a:' (nothing), then 'ot:g'.
But yes, now that I read it, it is also confusing. Theoretically your point is valid; after reading the repo I would also expect 'c' to be replaced by 'da'. I am not sure...
i3oi3 37 days ago [-]
Are the examples all actual outputs of the program? It's entirely possible that my understanding of the grammar is off, but it looks like these examples are wrong:
$ echo 'cat dog' | trre 'c:bat|d:hog'
bat hog
$ echo '' | trre ':a*' # <- do NOT do this
dog
$ echo '' | trre ':(repeat-10-times){10}'
dog
c0nstantine 37 days ago [-]
The second line actually is an output. I've modified the README. The last example is a typo. Fixed. Thanks!
meonkeys 37 days ago [-]
And the first one? Wouldn't the output be
batat hogog
c0nstantine 37 days ago [-]
Can't reproduce.
I have the following:
> echo 'cat dog' | ./trre 'c:bat|d:hog'
bat hog
robertlutece 36 days ago [-]
it reads like '(c:b)at|(d:h)og' and not 'c:(bat)|d:(hog)'
skykooler 36 days ago [-]
Not sure the range operator is fully specified, this works and seems like it shouldn't:
In fact, "[a:A-z:x]" seems to do the same thing as "[a:A-z:Z]" for all x.
larodi 37 days ago [-]
Replacing text in place while parsing is a very powerful approach. Fingers crossed this flies.
This, combined with a probabilistic approach, could be even more interesting. Probabilistic regexes have existed since the 70s if memory serves right.
c0nstantine 33 days ago [-]
Thank you for your feedback. There is a bunch of deterministic methods to infer a regex from samples (positive and negative). There are ML-based ones as well. But it is a different story.
kemiller 36 days ago [-]
This feels like it would be interesting in the Helix editor. They don't have a good solution for traditional regexp search-and-replace, but this would fill that niche in a more elegant way.
c0nstantine 34 days ago [-]
Hi. Missed your message initially. Helix is a great project. Let me know if/how I can help. The trre is a bit raw. But hope I can polish it within a month or two.
kemiller 33 days ago [-]
Oh, I'm not directly associated with it, and I don't use it mostly because that specific feature is missing. And they're in rust, so I'm not sure how well c bindings work. Still, really cool project.
synthc 36 days ago [-]
Interesting! I did an internship where I tried to use transducers for fast information extraction.
In theory, you can use FSTs for fast approximate parsing.
It didn't really work out, but I had lots of fun implementing a library for composing FSTs and exploring cool composition algorithms. Not much business value was delivered, but I learned a lot.
sabellito 37 days ago [-]
I love the idea, wanna see where it goes. Gives me the same vibe as when jq came out all those years ago.
c0nstantine 36 days ago [-]
Thank you! Still a lot of work to do. I really like the jq style.
jll29 36 days ago [-]
Check out Xerox XFST and its open source clone FOMA for finite state transducers as described by extended regular expressions, both of which describe the language of regular relations:
The Xerox FST book
https://press.uchicago.edu/ucp/books/book/distributed/F/bo36...
(Xerox XRCE's finite-state-tools comprising the compilers lexc, xfst and twolc, the languages they compile and the formal language finite state transducers describe including linguistic applications)
Elsewhere, the readme says that ":" is "non-associative", and I had a look at the language grammar but haven't figured out how to parse a sequence of ":".
mordechai9000 37 days ago [-]
I would probably use this in a text editor if it was available. I don't struggle with group logic, but I often forget which tools use \ to reference capture groups, and which use $. (If it's Microsoft or Microsoft-adjacent, it's probably $.)
c0nstantine 34 days ago [-]
Let me know if you need any help. It is still raw now, but I hope I'll polish it soon.
rjh29 37 days ago [-]
and of course Perl supports both!
YeGoblynQueenne 36 days ago [-]
Cool project. Here's my use of Finite State Transducers (det and nondet) for navigation agents:
This doesn't seem to have anything to do with regular expressions.
YeGoblynQueenne 33 days ago [-]
Not regular expressions but regular automata. The project uses Finite State Transducers to build autonomous agents.
CyberDildonics 33 days ago [-]
This thread is about regular expressions.
Finite State Transducers to build autonomous agents
That's extremely general, and it's terminology from the days of tape drives that describes techniques considered trivial now, like parsing strings into individual words.
Oh, apologies, I didn't realise you were asking for more specific information. You can find that in the link I posted above. There is a copy of a paper in the repo, here's a direct link:
The paper was accepted for publication but it's still in review and anyway I prefer to point people to the repo with the code so they can reproduce the experiments if they want.
CyberDildonics 30 days ago [-]
Did you mean to reply somewhere else? This doesn't seem like a coherent reply and still doesn't have anything about regular expressions.
YeGoblynQueenne 29 days ago [-]
No I meant to reply to you. Weren't you asking for more details about my work? It's about Finite State Transducers, not regular expressions.
kazinator 37 days ago [-]
This is nifty --- and small.
You could port this to like V7 Unix from 1979 or earlier; why didn't they get this idea? :)
Tools like sed build a transducer around the whole automaton: s/this/that/g.
c0nstantine 36 days ago [-]
I guess folks are generally more interested in searching for the pattern than in modifying it.
> Tools like sed build a transducer around the whole automaton: s/this/that/g.
That sounds reasonable. Could you provide any links on sed internals? Thanks.
aqueueaqueue 36 days ago [-]
Another alternative is a regex with replace-all. Then you can replace the regex cat with dog (or use a group for wrapping with spaces).
This is a common modus operandi for me in VS Code.
the_arun 37 days ago [-]
> $ echo 'xor' | '(x:)or' 'xor'
> cat
I got lost here.
c0nstantine 37 days ago [-]
Typo. Fixed. Thanks. Too many cats in examples.
trashburger 37 days ago [-]
Also too many dogs it seems, as the infinite and finite loop examples also produce "dog".
layer8 37 days ago [-]
I would give the colon operator lower precedence than concatenation and repetition.
simlevesque 37 days ago [-]
I ran the installation lines and got this error:
make && sh test.sh
cc -std=c99 -Wall -Wextra -Wpedantic -o2 trre_nft.c -o trre
cc -std=c99 -Wall -Wextra -Wpedantic -o2 trre_dft.c -o trre_dft
test.sh: 14: Syntax error: "(" unexpected
For the second part, it is a bug in the README. Thank you for pointing this out! I should have been more careful before the publication. Fixed. Try the '-ma' flags instead.
$echo '' | trre -ma ':(0|1){,3}?'
simlevesque 36 days ago [-]
I'm using Linux.
c0nstantine 34 days ago [-]
Did it solve the problem? I guess the issue is the process substitution construction of bash "<()". Not all shells support this.
ngruhn 36 days ago [-]
you should call it `trex` :D
agumonkey 36 days ago [-]
very inspiring idea, makes me wanna start projects I had in mind related to that. thanks and good luck
metadat 37 days ago [-]
When I saw this headline, I got excited about the prospect of a new innovation in the application of Regular Expressions. After reading, I was scratching my head because trre doesn't provide any new capabilities and is essentially just yet another flavor of regex. Additional complexity without a significant upside.
Trre seems like an arbitrarily different version of `sed -e /a/b/ ..`. This method of search and replace is essentially ported everywhere else, from awk to vim to IntelliJ, and has always gotten the job done adequately and feels as natural as anything else I've learned.
Am I missing something?
p.s I just realized I've been regex'ing heavily for 21 (!) years now, time flies.
c0nstantine 37 days ago [-]
Hey, I didn't claim it is something groundbreaking. The idea is quite old, indeed. You do not need AI or LLMs here.
Actually, sed is superior. I do not cover all the functions sed provides. I think of it more like 'tr' + regexp. But it has a different underlying engine and might be faster and more expressive for some use cases (e.g. tokenization, morphology).
metadat 37 days ago [-]
Thanks for the clarification, totally on me to set expectations inappropriately based on only a headline. Take care.
Thank you for the link. I think I came across it some years ago. They implement weighted transducers. A nice tool for things like morphology from the era before LLMs. I implemented something similar 8 years ago: https://github.com/c0stya/fslib
teknopaul 37 days ago [-]
My two penneth: I find the replacement part the easiest, and escaping all the characters which mean something in regexp to be the most annoying part.
Adding a new char to be escaped seems like another annoyance.
I try to avoid tools that make the hard bit harder, and the easy bit easier
I found the operator precedence unnatural, and it looks like a lot of other folks in this thread did too. I would naturally assume `cat:dog` would be equivalent to `(cat):(dog)` rather than `ca(t:d)og`.
> I would naturally assume `cat:dog` would be equivalent to `(cat):(dog)` rather than `ca(t:d)og`
It was confusing to me too until I remembered that we all kind of use regexes sort of wrong. They're "really" supposed to be considered as generators and not matchers. So IIRC cat|dog as a "regular expression" (not a regex) is supposed to formally expand to
{catog,cadog}
For matching, this set of strings can then be substring matched against some larger text.
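The generator view is easy to play with in a few lines of Python. This is just a toy sketch of the idea; the `lit`/`alt`/`concat` names are mine, not from any library:

```python
from itertools import product

# Toy model: a "regular expression" denotes a set of strings (its language).
def lit(s):
    return {s}

def alt(a, b):
    return a | b

def concat(a, b):
    return {x + y for x, y in product(a, b)}

# ca(t|d)og under "tight" alternation precedence:
tight = concat(lit("ca"), concat(alt(lit("t"), lit("d")), lit("og")))
# (cat)|(dog) under the usual "loose" precedence:
loose = alt(lit("cat"), lit("dog"))

print(sorted(tight))   # ['cadog', 'catog']
print(sorted(loose))   # ['cat', 'dog']

# "Matching" is then a substring test against the generated set:
def matches(lang, text):
    return any(w in text for w in lang)

print(matches(loose, "hotdog and cats"))  # True
```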
The problem is that almost no regex matching engine actually does this, and so now they'll do all kinds of strange things either to meet our expectations, or for efficiency or something.
If you go and try a bunch of different regex tools you'll get variations that handle either (cat)|(dog) or (cat)|(dog)|(ca[td]og) or something else.
So from a more formal conceptualization I think cat:dog should produce ca(t:d)og not (cat):(dog). But our experience with "regex" tools has subverted that formalization and now everybody just puts parens around expressions they want to alternate.
My real minor issue with this proposal, as interesting and well thought out as it is, is that it feels like it's just trying to get back at regular expressions as generators, which they actually are and it's coming from a place on the other side of a few decades of how we've been abusing them as regexes for user expectations. In other words, the problem is the tooling, not the syntax.
source: I've worked adjacent to this space in the past and if you've never thought of regexes as string set generators you can toy with the idea here
https://onlinestringtools.com/generate-string-from-regex
but again, how these generator tools work is also very specific. The ones I used to work with had a variety of ways to specify constraints on closures and such to restrict the generators.
In pure theoretical computer science, regular expressions exist as an abstract concept independent from syntax or parsers. They are an “algebra”, which means they are composed of elements connected with operators, but they are not inherently tied to a syntax. In the most fundamental formulation of regular expressions (the one in the Chomsky hierarchy), the only operators are alteration (which modern syntaxes express as “|”), the Kleene star (“*”) and — notably — concatenation, which modern syntaxes simply omit, in a way comparable to how modern mathematics notation omits the multiplication operator when you write “2x”.
In the same way that maths needs rules to define whether “2x²” means “(2x)²” or “2(x²)”, regex syntax needs such rules too. This is called operator precedence. I’m sure you’ve heard that before, but you just might not have realized that the regular expression “ab” has an operator in it because it is typically not written.
Now I’m not going to argue that the operator precedence in maths notation is haphazard or without reason — but it is arbitrary. It was arbitrarily chosen to be the most useful to mathematicians using the notation. And it turns out that giving exponentiation higher precedence than (invisible) multiplication (meaning: “2x²” means “2(x²)” rather than “(2x)²”) is more useful.
So coming back to the original example, whether “cat:dog” means “ca(t:d)og” or “(cat):(dog)” is simply a matter of defining the precedence of the “:” operator relative to the concatenation operator. You can argue (and I would agree with you) that one is more useful than the other, and therefore preferable (in the same way that “(cat)|(dog)” is more useful than “ca(t|d)og”), but neither of them is more fundamentally correct or primal or, as you put it, “supposed to formally expand to”.
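The two readings can be made concrete by modeling a trre expression as a relation, i.e. a set of (input, output) string pairs. This is only an illustrative sketch; the `lit`/`tr`/`conc` helpers are invented for it:

```python
from itertools import product

# Model a trre expression as a set of (input, output) string pairs.
def lit(s):       # a literal maps itself to itself
    return {(s, s)}

def tr(a, b):     # the transduction a:b reads a and writes b
    return {(a, b)}

def conc(r1, r2): # concatenation pairs up inputs and outputs
    return {(i1 + i2, o1 + o2) for (i1, o1), (i2, o2) in product(r1, r2)}

# ca(t:d)og -- ':' binding tighter than concatenation:
tight = conc(lit("ca"), conc(tr("t", "d"), lit("og")))
# (cat):(dog) -- ':' binding looser than concatenation:
loose = tr("cat", "dog")

print(tight)  # {('catog', 'cadog')}
print(loose)  # {('cat', 'dog')}
```

So under the tight reading the expression rewrites "catog" to "cadog", while under the loose reading it rewrites "cat" to "dog"; neither is forced by the theory, only by the chosen precedence.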
1 Escaped characters
2 []
3 ()
4 * + ? {m,n}
5 :
6 . (implicit concatenation)
7 |
I have some reasons to put it that way. I want ':' to be somewhat 'atomic'. If you think about '*' or '+', they could be lower in the table as well. Anyway, I will try to put ':' lower in the next version and see how it goes.
If I shift it below concatenation there could be another problem. E.g. with a non-associative ':' this should be illegal. And I am not sure how to treat this:
cat:dog:mouse
In the current version I inject the epsilon (empty string). It looks natural. E.g. to remove every second letter I could run '..:' which is technically '.(.:eps)':
echo 'abcde' | ./trre '..:'
result: 'ace'
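For comparison, the same every-second-letter deletion in vanilla regex (here Python's `re`) needs a group and a backreference:

```python
import re

# '..:' in trre keeps the first character of each pair and deletes the second.
# The classic-regex counterpart captures one char and drops the next:
out = re.sub(r'(.).', r'\1', 'abcde')
print(out)  # ace
```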
Actually, ':' associativity could have a meaning as composition of regular relations; but I found it too complicated for now.
I would not worry about "cat:dog:mouse" because intuitively it is clearly correct and it means replacing cat with mouse. With parentheses it could be written as "((cat:dog):mouse)".
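Composition of string relations is easy to sketch on finite sets of pairs, which shows why the left-associated reading of "cat:dog:mouse" ends up replacing cat with mouse (a toy sketch, not trre's engine):

```python
# Composition of string relations: (R ; S) relates x to z
# whenever some y has (x, y) in R and (y, z) in S.
def compose(r, s):
    return {(x, z) for (x, y1) in r for (y2, z) in s if y1 == y2}

cat_dog = {("cat", "dog")}
dog_mouse = {("dog", "mouse")}
print(compose(cat_dog, dog_mouse))  # {('cat', 'mouse')}
```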
(:a)
(a:)
a:|b
a|b:
etc
I will try to change the precedence and see how it works. Btw what do you think about explicit operators '>' '<' where '<' works as usual regex matcher, and '>' as a generator. For example to change 'cat' to 'dog' there could be something like '<cat>dog' where '<cat' part is a parser and '>dog' is a generator. Thanks.
Then there is no syntactic special case. This is just EXPR:EXPR; the special case is that both EXPRs use character class syntax, and so the tr-like range mapping applies.
So if we do [a-z]:[A-Z] it should be expanded to:
(a|b|...|z):(A|B|...|Z)
which is perfectly legal in trre but has a different meaning: mapping any lowercase letter to ALL of A-Z (generating the whole of A-Z on each occurrence of a lowercase letter).
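The difference between the tr-like positional pairing and the literal alternation expansion is easy to see in a Python sketch (purely illustrative):

```python
import string

lower, upper = string.ascii_lowercase, string.ascii_uppercase

# tr-like positional pairing: 'a'->'A', 'b'->'B', ...
paired = dict(zip(lower, upper))
# Literal expansion (a|...|z):(A|...|Z): every lowercase maps to EVERY uppercase
cross = {(l, u) for l in lower for u in upper}

print(paired["c"])  # C
print(len(cross))   # 676
```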
Distinct syntactic forms can be given distinct semantics, as long as there is rhyme and reason.
Moreover, the right side of the colon is not the normal regex language, it only borrows its syntax. So there, we may be justified in denying that the usual equivalence holds between character class syntax and a disjunction of the symbols denoted by the class.
But I got your point. Maybe there is a way to do it consistently. Straight tr-like syntax won't work, though; e.g. I really want something like this to be valid:
[a-b]:(x|y) (pairs a:x, b:x, a:y, b:y)
and I prefer not to handle these in some ad-hoc way.
I remember a Finnish researcher from PARC coming to one of my classes at UT to show how you can use FSTs for handling Finnish morphology, which is, on its face, quite a feat.
There are some good tutorials in the form of homework assignments (from like Johns Hopkins and some others) that go through Pynini use cases.
[1] https://www.openfst.org/twiki/bin/view/GRM/Pynini
[2] https://www.openfst.org/
https://gitlab.com/rosie-pattern-language/rosie/-/blob/maste...
https://rosie-lang.org/about/
anyhow
wrt syntax, are you sure you want ':' to bind stronger than concatenation 'ab' ?
> wrt syntax, are you sure you want ':' to bind stronger than concatenation 'ab' ?
That's something I am still not sure about. I took a hundred examples and it looked more natural this way (':' binding tighter than '.'). But I can change it with the change of one digit in the code, literally. That's why I'm posting here. I need some real feedback.
also,
* colon as member of a character class (and dash and how they interact)
* posix character classes (and the colon in the syntax)
* word boundaries are really useful in practice
* think of ünicøde early and look at what russ cox did there
boundaries: what do you decide to exclude? For example, back references and grouping have fun with DFAs (again see Russ Cox and RE2)
composition is fantastic, and a shortcut to exponential land because grammars (as in Chomsky hierarchy) can easily be expressed with transducers, yay.
boundaries will also clarify use cases and allow to state: "if you want to do xyz please use re2 and a bit of code instead"
and one "technicality" that hit me once with re2 was a static buffer for resumable iteration. I'd loved to reallocate the buffer and drop it's beginning and extend it at the end but alas, the state of re2 has hard pointers into the buffer, not relative ones. I think this was when re2 hit the end of the buffer mid-expression during incremental parse. so you can't reallocate and instead you have to drop the hard earned state and resume.
anyhow, it's been quite a while so I'm no longer in the thicket of this :-D
what's your driver? curiosity? or some itch?
but I really enjoy seeing your project :-)
So Unicode is something at the top of my TODO list. Boundaries are a very interesting topic. Maybe I'll extend the doc to include the details.
> what's your driver? curiosity? or some itch?
It's an itch :). 8 years ago I explored automata theory and found finite transducers to be a handy and natural way to transform texts. And as regex corresponds to FSA (finite state acceptors), I wanted to create a language that corresponds to FST (finite state transducers). There is a lesser-known algorithm for FST determinization and I want to apply it to make the transformation fast and efficient. It turned out to be not as simple as I initially expected.
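For readers unfamiliar with FSTs: a deterministic transducer is a small thing to sketch. This toy runner is my own simplification, not trre's engine; it rewrites "cat" to "dog" character by character:

```python
# transitions[(state, input_char)] = (next_state, output_string)
def run_fst(transitions, finals, start, text):
    state, out = start, []
    for ch in text:
        if (state, ch) not in transitions:
            return None                    # no transition: reject the input
        state, emitted = transitions[(state, ch)]
        out.append(emitted)
    return "".join(out) if state in finals else None

# A toy transducer for 'c:d a:o t:g' -- rewrites "cat" to "dog"
t = {(0, "c"): (1, "d"), (1, "a"): (2, "o"), (2, "t"): (3, "g")}
print(run_fst(t, {3}, 0, "cat"))  # dog
print(run_fst(t, {3}, 0, "cow"))  # None
```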
More generally speaking, since regular expressions effectively define a parse tree on their matches, being able to perform more general transformations of those trees would be useful.
If I understand it correctly, you want to change something inside the "..." block and change the double quotes to single quotes.
It can be done by this expression:
echo '"hello world" "hello again!"' | ./trre "\":'.+?:-\":'"
'-' '-'
So I substitute the text inside the double quotes with the symbol '-' using the expression ".+?:-", while simultaneously changing the surrounding quotes.
Question mark means non-greedy mode.
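For comparison, a vanilla-regex version of the same idea (here Python's `re`, replacing each double-quoted block with '-') also relies on the non-greedy quantifier:

```python
import re

text = '"hello world" "hello again!"'
# Non-greedy ".+?" keeps each match inside one pair of double quotes.
collapsed = re.sub(r'".+?"', "'-'", text)
print(collapsed)  # '-' '-'
```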
The entire purpose of this project seems to hinge on this assertion, but there isn't a single example.
I don't understand what makes regex unnatural for editing? What is meant by editing? Why do people struggle with groups?
There are lots of examples of the syntax for this project, but why is it better than regular regex?
If there were a few examples showing "here's the vanilla regex version, here's my version, here's why my version makes this easier" I might be able to understand this project.
Just skimming through the readme, I'm also in doubt that the paradigm shift would really be any better than folk regexp.
That doesn't make the existence of the project any less cool, of course. Congratulations to the author, and don't let such a comment dampen your motivation, as that's really not the point. If you enjoy doing it, great. If you learn something in the process, awesome. If it can lead to something I can't envision, how marvelous. And even if it doesn't, it was still a great epic and you can be proud.
Thanks for your feedback!
import re

pattern = r'(x)y(z)'
replacement = r'\1Y\2'
result = re.sub(pattern, replacement, text)
I would like to replace it with 'xy:Yz' pattern:
result = re.trre('xy:Yz', text)
If you need your x and z to be more complicated patterns, or even regexes themselves, this approach can be handier.
I guess I'm still struggling to see how it's simpler overall.
Most of the examples on your page don't involve groups at all, e.g.:
That already seems a lot more complicated than just:
I don't need to use groups that often in regex replacements, and when I do I'm already trying to do something complicated, and it's not clear to me why the colon syntax is easier to write, easier to understand, or if it's as flexible.
Not trying to criticize the project, just genuinely trying to understand the specific strengths and limitations of the proposed syntax. E.g. what if I want to turn xyz into zYx?
echo 'zyx' | ./trre 'xy:Yz|zy:Yx'
It is still a regular language. I do not introduce references.
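For contrast, the vanilla-regex version of the xyz → zYx swap does need backreferences (which take the result outside regular languages in general):

```python
import re

# Swapping the captured ends requires backreferences in classic regex:
swapped = re.sub(r'(x)y(z)', r'\2Y\1', 'xyz')
print(swapped)  # zYx
```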
You are right in the sense that `sed` is a far superior editor. But here I see some advantages:
- the current implementation is super small; it is a direct translation to an automaton
- complex patterns may be compiled in a more efficient way using a deterministic transducer (I can't defend this claim yet, but I have some evidence)
- there are some tricks you can do using the 'generative' part, e.g. you can even find a Levenshtein distance of 1 between two strings just by generating substitutions/insertions/deletions, and implement a simple spell checker.
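The spell-checker trick mentioned above can be sketched in plain Python. This enumerates the distance-1 neighborhood directly rather than via a transducer, but the generate-then-intersect idea is the same:

```python
import string

# Enumerate every string at Levenshtein distance <= 1 from `word`
# via deletions, substitutions and insertions.
def edits1(word):
    alphabet = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {a + b[1:] for a, b in splits if b}
    subs = {a + c + b[1:] for a, b in splits if b for c in alphabet}
    inserts = {a + c + b for a, b in splits for c in alphabet}
    return deletes | subs | inserts

# Intersect the generated neighborhood with a lexicon: a toy spell checker.
lexicon = {"dog", "dot", "cat"}
print(sorted(edits1("dig") & lexicon))  # ['dog']
```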
Overall, I think you have a good point. Maybe it is just a marginal improvement (if any). I found it more comfortable to write in this style instead of using groups. I used it for some time and found it handy (especially as an extended `tr`).
IIUC
Great project
My only quick comment is - the link to `theory.pdf` in README is broken (your pdf is in docs/ dir, so just need to change the url to include docs/).
Why not just disallow this? I understand it would make the grammar more difficult to specify - but I don't see any good reason to keep it.
The rationale was to implement a fun operation called transducer composition. It is possible to do simple operation on strings and compose trre's like filters. But I haven't finished it yet. So again, a fair point.
And then if `-g` were specified, you could use this to create infinite generators, which could be a useful construct in some cases -- like `yes` but more powerful.
EDIT: Another interesting use case, if I am understanding correctly: if this worked, then you could use `:(<regex>)` to have it output an example of a string that matches any given regex. `-g :(<regex>)` produces a generator of every string in the language matched by that regex. `-g :(<regex>) | head -n 100` would give you 100 examples.
You got the idea correctly. E.g. to generate all strings of length 5 over the alphabet {a,b,c} (truncated to the first 1000) you can do:
echo '' | ./trre -am ':[a-c]{5}' | head -n 1000
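The same enumeration is easy to mimic in Python, which also shows how quickly these generated languages grow:

```python
from itertools import product

# All strings of length 5 over {a, b, c}, like ':[a-c]{5}' in generator mode
strings = ("".join(p) for p in product("abc", repeat=5))
first = [next(strings) for _ in range(3)]
print(first)       # ['aaaaa', 'aaaab', 'aaaac']
print(3 ** 5)      # 243 strings in total
```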
The docs are fixed now. Thanks for pointing this out.
Infinite generators are something nice to have, I agree. I just haven't wrapped my head around how to do this in an ergonomically correct way.
I.e - why is trre "(cat):(dog)" an improvement over s/cat/dog? What's the improvement of "(x:)or" over s/xor/or? And so on. Pretty much all the examples match (at least in my head) to relatively easy regexps.
I think the core advantage, if there is one, would be in group logic, so maybe the examples should lean into that - even before explaining the basics. I'd explain why it's actually a better choice before explaining the full syntax, or hello-world use cases.
For the Caesar cipher example, it screams for an "apply this in reverse" mode. That's a pretty common request for a lot of text replacements, but it's especially clear with this one. (Because programmer brain immediately screams "why express the logic twice?")
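The reverse-application idea is visible even in a plain Python sketch of a Caesar shift, with `str.maketrans` standing in for the character transduction; the inverse is just the swapped table:

```python
import string

lower = string.ascii_lowercase
shift = 3
# Forward table: a Caesar shift as a character-to-character mapping
table = str.maketrans(lower, lower[shift:] + lower[:shift])
enc = "hello".translate(table)
# "Apply this in reverse" is simply the table with its sides swapped:
inv = str.maketrans(lower[shift:] + lower[:shift], lower)
print(enc, enc.translate(inv))  # khoor hello
```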
I don't know if it's useful (yet), but I think it's great you're trying to explore alternatives to a long-established status quo. (Caveat: That usually means there's a good chance it won't land. But it's still great to see an exploration)
I do like the idea of having a search/replace semantic that is more intuitive than grouping operators; back in MS-DOS days you could do "ren *.log *.txt" and it would work, which feels bananas to my modern bash-minded way of thinking, but it's very obvious looking at it what it is supposed to do.
If the operator were explicit (let’s call it ~), the example would look like this:
With unnecessary parentheses:
There is a hidden concatenation operator, as in usual regular expressions. In the code I denote it with a low dot '.' (as in the old Thompson implementation).
> Why is "c" not being replaced with "da"?
It is all about precedence. According to the discussion I think I've chosen a wrong one and it raises confusion. Current version of precedence table is this:
| 1 | Escaped characters | \<special character> |
| 2 | Bracket expression | [] |
| 3 | Grouping | () |
| 4 | Single-character-ERE duplication | * + ? {m,n} |
| 5 | Transduction | : |
| 6 | Concatenation | . (implicit) |
| 7 | Alternation | \| |
So ':' binds stronger than '.' (implicit concatenation).
If we instead require the regex to be non-empty (breaking the deletion examples), then the ambiguity becomes that of concatenation: whether it's '(((c:d)(a:o))(t:g))' or '((c:d)((a:o)(t:g)))'. Assuming associativity, this would not matter.
But yes, now that I read it, it also causes confusion. Theoretically your point is valid, and after reading the repo I also believed that 'c' should be replaced by 'da'. I am not sure...
$ echo 'cat dog' | trre 'c:bat|d:hog'
bat hog
$ echo '' | trre ':a*' # <- do NOT do this
dog
$ echo '' | trre ':(repeat-10-times){10}'
dog
batat hogog
I have the following:
> echo 'cat dog' | ./trre 'c:bat|d:hog'
bat hog
This, combined with a probabilistic approach, can be even more interesting. Probabilistic regexes have existed since the 70s, if memory serves right.
The Xerox FST book https://press.uchicago.edu/ucp/books/book/distributed/F/bo36... (Xerox XRCE's finite-state-tools comprising the compilers lexc, xfst and twolc, the languages they compile and the formal language finite state transducers describe including linguistic applications)
The XFST book https://press.uchicago.edu/ucp/books/book/distributed/F/bo36...
FOMA: the open source clone https://fomafst.github.io/
FOMA: the paper (Hulden, M. (2009) Proc. EACL) https://aclanthology.org/E09-2008.pdf
Elsewhere, the readme says that ":" is "non-associative", and I had a look at the language grammar but haven't figured out how to parse a sequence of ":".
https://github.com/stassa/ijclr_2024_experiments
Finite State Transducers to build autonomous agents
That's extremely general and is terminology from the days of tape drives that describes techniques that are considered trivial now, like parsing strings into individual words.
https://en.wikipedia.org/wiki/Finite-state_transducer
https://github.com/stassa/ijclr_2024_experiments/tree/master...
The paper was accepted for publication but it's still in review and anyway I prefer to point people to the repo with the code so they can reproduce the experiments if they want.