The goal is to achieve the same quality and style that a skilled Rust developer would produce.
This would be quite a feat. I imagine a lot of really unintelligible Rust will be generated from this, if for no other reason than Rust’s complexity and semantics compared to C++, so to me attention to output style will be particularly interesting.
I also wonder how the problem space compares for C vs C++, since C is such a small language.
I think automatic translation from C++ is nearly impossible due to Rust lacking most of C++’s features – no syntax-based templates, no data inheritance, no move constructors, and much simpler and less configurable standard library. The impedance mismatch is pretty bad for non-trivial C++, and needs large rearchitecting to fit Rust.
However, C programs converted to idiomatic Rust generally can keep a similar architecture, and become simpler.
Drop and ? take care of cleanup and error-handling boilerplate. Slices, Vec, and iterators replace pointer arithmetic and raw malloc wrangling. Libstd handles strings and containers (you can delete a lot of DIY code). There’s easier portability, and a simpler build system and dependency management (fewer #ifdefs, and less of the Windows vs Unix schism).
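To make that concrete, here’s a minimal sketch (a hypothetical function, not taken from any of the codebases discussed) of what Drop and ? buy you over the C pattern of goto cleanup labels and manual error checks:

```rust
use std::fs::File;
use std::io::{self, Read};

// In C this function would need an explicit fclose() on every exit path
// and manual errno checks; here `?` propagates errors as early returns,
// and File's Drop impl closes the handle whenever it goes out of scope.
fn read_config(path: &str) -> io::Result<String> {
    let mut file = File::open(path)?; // early return on error via `?`
    let mut contents = String::new();
    file.read_to_string(&mut contents)?;
    Ok(contents) // file closed here by Drop, on this path and all error paths
}
```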
The difficulty in coming from C is that ownership in C is entirely outside the type system. C++ has things like unique and shared pointers that can map to Rust constructs, but C has per-codebase conventions on who is responsible for deallocating objects, when aliasing is permitted, what locks must be held, and so on. The FreeBSD kernel has a set of conventions for describing which locks must be held to access each field in a program, but they’re in comments and so are not mechanically checked and may be wrong. A lot of other codebases don’t even have regular conventions for these things.
Converting any of that to idiomatic Rust is hard. In a lot of cases, it will require either large unsafe blocks to try to capture the properties that the C programmer hopes are there and expose safe interfaces, or it will require completely redesigning data structures.
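As a hedged illustration of the lock-convention point: when the convention does fit Rust’s model, a FreeBSD-style “field X is protected by lock Y” comment becomes a Mutex that owns the field, so the compiler enforces what the comment only promised (the struct and method here are made up):

```rust
use std::sync::Mutex;

// C convention, enforced only by a comment:
//   /* `counter` is protected by `stats_lock` */
// Rust encodes the rule in the type: the data lives *inside* the Mutex,
// so it is unreachable without taking the lock first.
struct Stats {
    counter: Mutex<u64>,
}

impl Stats {
    fn bump(&self) -> u64 {
        let mut c = self.counter.lock().unwrap(); // must lock to touch the field
        *c += 1;
        *c
    } // guard dropped here; lock released automatically
}
```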
I’d consider mechanical translation from Java to idiomatic Haskell to be an easier problem and that’s not one I’d have much hope of anyone managing.
Remember, this is DARPA. They fund research that is known to be very hard and which has a very high probability of failure. In one of the DARPA programs that funded some of the CHERI work, every other performer dropped out in the first round because they didn’t reach their objectives. In the first program that funded CHERI, no one dropped out but only one other performer has come close to commercial adoption, and then only in some niche DoD workloads.
And that’s fine. If people knew it would work, industry would fund the development.
I have semi-manually converted lodepng, pngquant, and dssim from C, and created a comprehensive wrapper for lcms2. They all happened to have mostly Rust-compatible memory management, and a library handle that neatly translated to self.
Separation of mutexes from data is similar to separation of length from data pointer. That’s the bit I expect LLMs will be needed for: to “read” the docs and comments.
I wonder how it will work in practice. In almost all codebases, the ownership patterns will be obvious from the code. There may be bugs and exceptions of course, but with enough fuzzy logic and data flow analysis you should be able to say “I’m really sure this function allocates its result space and the caller is responsible for freeing”. We can create those hypotheses while translating and test them in practice. Worst case they’re wrong and the resulting app will not compile.
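For example (a hypothetical translation, purely to illustrate the hypothesis-testing idea): the guess “this function allocates its result and the caller frees it” maps to returning an owned value, and if the guess was wrong the borrow checker rejects the translation rather than silently miscompiling it:

```rust
// C: char *make_greeting(const char *name);  /* caller must free() */
// The "caller frees" hypothesis maps to returning an owned String:
// ownership transfers to the caller, and Drop frees it; no free() needed.
fn make_greeting(name: &str) -> String {
    format!("hello, {}", name)
}

// If the hypothesis were wrong (say the C function actually returned a
// pointer into a caller-provided buffer), a translation that tried to
// return a reference to a local would simply fail to compile.
```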
Ah, I mistook the C++ mention at the top of the notice to mean TRACTOR targeted both C and C++. Yeah I wonder how possible it would even be for C++ depending on the features. I haven’t written C++ since school and even then it was very basic.
Automated translation research has been ongoing for decades. I would anticipate that one of the old companies there with a well-established engine will pick up this proposal and deliver some good work.
At first I was as skeptical of this as the other comments were, but then I read a much more in-depth story on The Register and saw they’re working to specifically train LLMs to do the conversions as well as to avoid the now-infamous hallucination problem. I think this project might have some real merit to it. I can see an LLM being trained to detect specific patterns in C and Rust and convert between the two quite easily, so long as you have a good domain expert in both fields to actually provide quality sample data.
Seems cool. I’ve wondered about doing the “C -> Rust” direct translation, which yields lots of unsafe, but then doing an “unsafe Rust -> safe Rust + assertions” translation. After that I suppose you could generate test cases, and then an LLM could make it pretty, and as long as the test cases pass you’d have your job done.
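A toy sketch of that two-step pipeline (illustrative only; sum_raw stands in for transpiler-style output): the safe rewrite can be validated against the unsafe original by exactly the kind of generated test cases described above:

```rust
// Step 1: transpiler-style output, raw pointer arithmetic, callers must
// uphold the invariant that `ptr` points to at least `len` i32 values.
unsafe fn sum_raw(ptr: *const i32, len: usize) -> i32 {
    let mut total = 0;
    for i in 0..len {
        total += *ptr.add(i);
    }
    total
}

// Step 2: the safe rewrite. A generated test suite comparing the two
// versions on the same inputs is what lets a tool trust the refactor.
fn sum_safe(xs: &[i32]) -> i32 {
    xs.iter().sum()
}
```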
Let’s assume I’m the maintainer of some moderately sized and moderately prevalent C code base, and someone, as part of distributing my C library, uses this to translate my C code to Rust as a compilation step while building the package or whatever the artefact is (because safe). If there are problems / issues with the generated Rust code at usage time / runtime, will the responsibility of debugging these issues be pushed down on me as the author of the original C code?
Let’s say I’m a one-man show and it’s some open source lib that sits in apt, rpm or whatever and is expressed as a dep by other packages. I’ve not got the resources or time (or backing) to port to Rust and relearn my entire codebase in another language, let alone know where the warts could be. That feels kinda scary.
I like the idea and I do like Rust, but it’s kind of scary that if this were generally made available to people, you could suddenly one day find yourself forcibly becoming a Rust developer because of the ecosystem you distribute into, or changing C code for optimal Rust generation instead of optimal use / performance.
Even if someone forks your stuff and transpiles it to Rust, suddenly the project you spent all this time on is gone elsewhere? And if someone transpiles your code, can they relicense it, as it’s a different implementation in another language? Like, could some commercial operation convert your open source code, relicense it, and distribute it under a more commercially oriented license?
Let’s assume I’m the maintainer of some moderately sized and moderately prevalent C code base […] If there are problems / issues with the generated Rust code at usage time / runtime, will the responsibility of debugging these issues be pushed down on me as the author of the original C code?
If the bug is present in the C version, then you might be expected to help debug / fix it. If the bug is only present in the Rust version, then no reasonable person would try to put it on your plate.
And regardless of where the bug is, the author of open-source software has no responsibility to debug or maintain it.
Let’s say I’m a one-man show and it’s some open source lib that sits in apt, rpm or whatever and is expressed as a dep by other packages. I’ve not got the resources or time (or backing) to port to Rust and relearn my entire codebase in another language, let alone know where the warts could be. That feels kinda scary.
A similar situation already exists with regard to vendor patches. Many Linux distributions will modify upstream sources as part of their packaging process, whether minor changes (backporting stable bugfixes) or major (making the OpenSSH random number generator non-random). The normal policy is that only unmodified upstream sources get support from upstream.
Consider also the Linux kernel, which is almost never used in its pure upstream form. The LKML bug-reporting process requires bugs to be reproduced with a “vanilla” kernel, to exclude the possibility that it’s caused by a patch.
I like the idea and I do like Rust, but it’s kind of scary that if this were generally made available to people, you could suddenly one day find yourself forcibly becoming a Rust developer because of the ecosystem you distribute into, or changing C code for optimal Rust generation instead of optimal use / performance.
I’m not exactly sure what scenario you’re imagining here, but it doesn’t exist. The existence of transpilers doesn’t imply any changes to the upstream development process.
Even if someone forks your stuff and transpiles it to Rust, suddenly the project you spent all this time on is gone elsewhere?
Also not sure what you mean by this, since your project hasn’t been changed or affected in any way. Forks of open-source software are common, but unless there’s a strong motivation (e.g. irreconcilable technical differences, personal animosity) they usually don’t compete directly with the original project.
And if someone transpiles your code, can they relicense it, as it’s a different implementation in another language? Like, could some commercial operation convert your open source code, relicense it, and distribute it under a more commercially oriented license?
No, that wouldn’t be permitted under copyright law.
I don’t know Rust, but allow me to play devil’s advocate here for a moment.
If you have a tool to translate C to Rust, and the resulting Rust code compiles, does that mean the C code is safe? Could you then keep the source in C, use this tool as a linter, retain the compilation speed and portability, and get Rust-level memory safety?
More seriously, I wonder whether such a translation tool would produce Rust code that doesn’t compile, or fail the translation step for unsafe code. If the latter, what’s to stop people from adopting those techniques in their C compilers to get memory-safe C?
No, the idea is that the ‘translation’ also somehow alters the program to be safe. So the Rust translation behaves differently to the C original. And the assumption is that you can do this in a way that only changes the ‘bad’ behavior without changing the ‘good behavior’ - that is, that the new program will essentially be exactly the same but without ‘security problems’ somehow. Which would be very cool, if it is indeed possible.
More likely, you get a tool that gets you some of the way there: either partial conversion of the codebase, with programmers getting you the rest of the way, or identifying sections that can be automatically split out into components and oxidised, leaving a smaller amount of C behind. But that is still probably much cheaper than a big-bang rewrite.
CHERI had similar things, where you did generally still need some amount of rewriting to get a real C codebase working under CHERI, but it would be a very small percentage of the codebase that needed to be rewritten.
CHERI had similar things, where you did generally still need some amount of rewriting to get a real C codebase working under CHERI, but it would be a very small percentage of the codebase that needed to be rewritten.
Depending on the codebase. Often things just work (for example, a lot of the KDE apps, once the libraries were ported, required no code changes).
We designed CHERI specifically with the C abstract machine in mind though. CHERI isn’t some weird architecture that came from nowhere, it’s a weird architecture that was designed to be a compilation target for C-like languages. For example, CHERI C can cast a pointer to a uintptr_t, do some arithmetic on it, cast it back, and then dereference it. As long as the result is in bounds, it’s fine.
That same operation, though memory safe, may violate type safety, lifetime safety, and a bunch of other properties that Rust enforces.
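For instance, the uintptr_t round trip described above has to go through unsafe when written in Rust, even though CHERI would dynamically bounds-check it (a contrived example, just to show where the two models diverge):

```rust
// The pointer/integer round trip that CHERI C permits (with dynamic
// bounds enforcement) requires `unsafe` in Rust, because Rust tracks
// pointer provenance and lifetimes statically, not in hardware.
fn second_element(xs: &[i32; 4]) -> i32 {
    let addr = xs.as_ptr() as usize; // pointer -> integer
    let back = (addr + std::mem::size_of::<i32>()) as *const i32; // arithmetic, then integer -> pointer
    unsafe { *back } // in bounds here, but the compiler can't prove it
}
```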
CHERI effectively gives you a 1-bit dynamic type system. Every pointer-sized (and aligned) memory address contains either a pointer (to an unspecified type, but with dynamically known and enforced bounds) or data. That’s enough to compile C in such a way that you can protect code in other languages from bugs in C libraries (which was my goal), but it’s a much weaker set of guarantees than safe Rust gives.
The other answers are saying “no”, but it’s more like “maybe”.
You could imagine a C codebase that had annotations and macros that plug into the transpiler, such that the output is valid safe Rust. In that case, compiling the Rust would act as a proof-of-correctness for the memory safety of the original C code. There are existing static analysis tools for C that do something similar, though without the intermediate Rust representation (they just directly check the annotations). The resulting C code is verbose and extremely non-idiomatic, but it won’t crash, which might be a good tradeoff if you’re writing safety-critical code.
However, in general, C codebases are not written with Rust-style memory safety in mind. If you try to take idiomatic C code and transpile it directly to Rust, the whole thing will need to be unsafe. This is the approach used by c2rust, and it takes some serious effort to clean up the transpiler output.
An interesting approach is to separate the transpilation from the safety-analysis. You use something like c2rust to generate Rust code with lots of unsafe, then a separate tool (with custom-written rules) that will do things like convert (int32_t *a) to (a: &i32), (a: Option<&i32>), (a: Option<&mut i32>), etc, depending on project-specific context.
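A rough sketch of what those rule-driven signature refinements might look like (hypothetical functions; real tooling would pick each variant from call-site analysis):

```rust
// c2rust-style signature: nothing about nullability or mutation survives
// the trip through raw pointers, so the whole thing stays unsafe.
unsafe fn add_one_raw(a: *mut i32) {
    if !a.is_null() {
        *a += 1;
    }
}

// Refined signatures a safety-analysis pass might choose, depending on
// how the parameter is actually used at every call site:
fn read_it(a: &i32) -> i32 { // never null, never written
    *a
}
fn add_one(a: Option<&mut i32>) { // nullable and written through
    if let Some(a) = a {
        *a += 1;
    }
}
```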
There’s also automatic runtime checking in Rust that doesn’t occur in C. For example, if thing is a heap-allocated array/pointer/vector, thing[4] may read out of bounds in C but causes a panic at runtime in Rust due to bounds checks.
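A small illustration (hypothetical function name): indexing with thing[4] panics when the slice is too short, while the checked get form surfaces the out-of-range case as a value the caller has to handle:

```rust
// In C, reading thing[4] from a 3-element heap array silently reads out
// of bounds. In Rust, `thing[4]` panics at runtime; `thing.get(4)`
// returns None instead, making the out-of-range case explicit.
fn checked_fifth(thing: &[i32]) -> Option<i32> {
    thing.get(4).copied()
}
```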
“Safe” doesn’t mean “runs without error if it compiles”; it means “fail instead of doing anything unsafe”. Ideally fail before compiling, but failing at runtime is acceptable.
I vaguely remember seeing someone blog about doing exactly that to a C program. They replaced it one .o file at a time, testing as they went.
Not necessarily, because the tool could be generating unsafe Rust, which lacks the memory-safety guarantees of safe Rust.