[v1,2/2] arm64: Enable BTI for main executable as well as the interpreter

Message ID 20210521144621.9306-3-broonie@kernel.org
State Superseded
Series
  • arm64: Enable BTI for the executable as well as the interpreter

Commit Message

Siddhesh Poyarekar via Libc-alpha May 21, 2021, 2:46 p.m.
Currently for dynamically linked ELF executables we only enable BTI for
the interpreter, expecting the interpreter to do this for the main
executable. This is a bit inconsistent since we do map the main executable,
and it is causing issues with systemd's MemoryDenyWriteExecute feature.
That feature is implemented using a seccomp filter which prevents setting
PROT_EXEC on already mapped memory and lacks the context to be able to
detect that the memory is already mapped with PROT_EXEC.

Resolve this by checking the BTI property for the main executable and
enabling BTI if it is present when doing the initial mapping. This does
mean that we may get more code with BTI enabled if running on a system
without BTI support in the dynamic linker; this is expected to be a safe
configuration and testing seems to confirm that. It also reduces the
flexibility userspace has to disable BTI, but it is expected that in cases
where problems require BTI to be disabled, it is more likely that it will
need to be disabled at a system level.

Signed-off-by: Mark Brown <broonie@kernel.org>

---
 arch/arm64/include/asm/elf.h | 14 ++++++++++----
 arch/arm64/kernel/process.c  | 18 ++++++------------
 2 files changed, 16 insertions(+), 16 deletions(-)

-- 
2.20.1

Comments

Siddhesh Poyarekar via Libc-alpha June 3, 2021, 3:40 p.m. | #1
On Fri, May 21, 2021 at 03:46:21PM +0100, Mark Brown wrote:
> Currently for dynamically linked ELF executables we only enable BTI for
> the interpreter, expecting the interpreter to do this for the main
> executable. This is a bit inconsistent since we do map the main executable,
> and it is causing issues with systemd's MemoryDenyWriteExecute feature.
> That feature is implemented using a seccomp filter which prevents setting
> PROT_EXEC on already mapped memory and lacks the context to be able to
> detect that the memory is already mapped with PROT_EXEC.

It's hard to know whether this is an extensibility fail in the
semantics of mprotect() (and so we were wrong to add PROT_BTI there in
line with my original proposal), or whether this is a case of systemd
doing something that is broken by design (if well-intentioned).  Since
there have been wacky arch-specific mprotect flags around for a fair
while I'd be tempted to argue the latter -- but then I am biased.


Anyway, although I'm a bit queasy about the cause of this patch, the
patch itself looks perfectly reasonable.  If nothing else, it makes
sense as a cleanup or optimisation, so that ld.so doesn't have to do a
bunch of mprotect() calls every time it loads a program.

Do we know how libcs will detect that they don't need to do the
mprotect() calls?  Do we need a detection mechanism at all?

Ignoring certain errors from mprotect() when ld.so is trying to set
PROT_BTI on the main executable's code pages is probably a reasonable,
backwards-compatible compromise here, but it seems a bit wasteful.
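
To make the "try it and ignore the error" idea concrete, here is a minimal C sketch. It is not glibc's actual code: the helper name `enable_bti_best_effort`, the function-pointer indirection and the two stand-in functions are hypothetical and exist only so that both outcomes can be exercised on any host. The `EPERM` error code is likewise an assumption about how a filter might refuse the call.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

#ifndef PROT_BTI
#define PROT_BTI 0x10	/* arm64 uapi value, repeated so this builds on other hosts */
#endif

/* Signature shared by mprotect() and the stand-ins below. */
typedef int (*mprotect_fn)(void *addr, size_t len, int prot);

/* Hypothetical stand-in: a filtered mprotect() that refuses (EPERM assumed). */
static int refusing_mprotect(void *addr, size_t len, int prot)
{
	(void)addr; (void)len; (void)prot;
	errno = EPERM;
	return -1;
}

/* Hypothetical stand-in: an mprotect() that accepts the request. */
static int accepting_mprotect(void *addr, size_t len, int prot)
{
	(void)addr; (void)len; (void)prot;
	return 0;
}

/*
 * Ask for PROT_BTI on an executable segment. Returns 1 if the call
 * succeeded, 0 if it was skipped or refused. A refusal is deliberately
 * not treated as an error: the segment simply keeps its current
 * protections, which may already include PROT_BTI.
 */
static int enable_bti_best_effort(mprotect_fn do_mprotect, void *addr,
				  size_t len, int prot)
{
	assert(do_mprotect != NULL);

	if (!(prot & PROT_EXEC))
		return 0;	/* BTI only applies to executable pages */

	if (do_mprotect(addr, len, prot | PROT_BTI) == 0)
		return 1;

	return 0;		/* ignore the failure and carry on */
}
```

In a real loader the first argument would simply be `mprotect` itself; the indirection is only there for demonstration.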

> Resolve this by checking the BTI property for the main executable and
> enabling BTI if it is present when doing the initial mapping. This does
> mean that we may get more code with BTI enabled if running on a system
> without BTI support in the dynamic linker; this is expected to be a safe
> configuration and testing seems to confirm that. It also reduces the

Ack, plus IIUC the architecture is designed so that everything works
providing that PROT_BTI is never set on non-BTI-aware code pages.  For
BTI-aware code, the sooner we set PROT_BTI the better I guess.

> flexibility userspace has to disable BTI, but it is expected that in cases
> where problems require BTI to be disabled, it is more likely that it will
> need to be disabled at a system level.

There's no flexibility impact unless MemoryDenyWriteExecute is in force,
right?

Self-modifying programs (JITs etc.) already can't use that IIUC, so
shouldn't be affected.  That seems the main scenario where people are
likely to be twiddling PROT_{EXEC,WRITE,BTI} on existing pages.
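
As an illustration of why the filter gets in the way, the following C sketch models the decision such a filter can make. It is a simplified model under stated assumptions, not systemd's actual seccomp program: the function names are hypothetical, and the only thing assumed is that the filter refuses any mprotect() whose requested protection includes PROT_EXEC, because it sees the syscall arguments but not the protections the mapping already has.

```c
#include <assert.h>
#include <stdbool.h>
#include <sys/mman.h>

#ifndef PROT_BTI
#define PROT_BTI 0x10	/* arm64 uapi value, repeated so this builds on other hosts */
#endif

/*
 * Simplified model: the decision depends only on the requested
 * protection bits, never on the mapping's current protections.
 */
static bool mdwe_model_allows_mprotect(int requested_prot)
{
	return !(requested_prot & PROT_EXEC);
}

static void mdwe_model_examples(void)
{
	/* Adding BTI to an already-executable page still carries PROT_EXEC: refused. */
	assert(!mdwe_model_allows_mprotect(PROT_READ | PROT_EXEC | PROT_BTI));

	/* Dropping BTI while keeping the page executable: also refused. */
	assert(!mdwe_model_allows_mprotect(PROT_READ | PROT_EXEC));

	/* Dropping execute permission altogether: allowed. */
	assert(mdwe_model_allows_mprotect(PROT_READ));
}
```

Under this model, once the kernel has applied PROT_BTI at the initial mapping, a filtered process can no longer change the BTI state of an executable page without also giving up PROT_EXEC.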

If the main binary is marked as supporting BTI but breaks with
PROT_BTI, then that almost certainly means the toolchain, system
libraries or hardware are broken -- so it would be pointless to have an
elegant workaround.  A big global kill switch seems adequate to me.

>
> Signed-off-by: Mark Brown <broonie@kernel.org>
> ---
>  arch/arm64/include/asm/elf.h | 14 ++++++++++----
>  arch/arm64/kernel/process.c  | 18 ++++++------------
>  2 files changed, 16 insertions(+), 16 deletions(-)
>
> diff --git a/arch/arm64/include/asm/elf.h b/arch/arm64/include/asm/elf.h
> index c8678a8c36d5..a6e9032b951a 100644
> --- a/arch/arm64/include/asm/elf.h
> +++ b/arch/arm64/include/asm/elf.h
> @@ -253,7 +253,8 @@ struct arch_elf_state {
>  	int flags;
>  };
>
> -#define ARM64_ELF_BTI		(1 << 0)
> +#define ARM64_ELF_INTERP_BTI		(1 << 0)
> +#define ARM64_ELF_EXEC_BTI		(1 << 1)
>
>  #define INIT_ARCH_ELF_STATE {			\
>  	.flags = 0,				\
> @@ -274,9 +275,14 @@ static inline int arch_parse_elf_property(u32 type, const void *data,
>  		if (datasz != sizeof(*p))
>  			return -ENOEXEC;
>
> -		if (system_supports_bti() && is_interp &&
> -		    (*p & GNU_PROPERTY_AARCH64_FEATURE_1_BTI))
> -			arch->flags |= ARM64_ELF_BTI;
> +		if (system_supports_bti() &&
> +		    (*p & GNU_PROPERTY_AARCH64_FEATURE_1_BTI)) {
> +			if (is_interp) {

Nit: can we drop the extra curlies?

> +				arch->flags |= ARM64_ELF_INTERP_BTI;
> +			} else {
> +				arch->flags |= ARM64_ELF_EXEC_BTI;
> +			}
> +		}
>  	}
>
>  	return 0;
> diff --git a/arch/arm64/kernel/process.c b/arch/arm64/kernel/process.c
> index b4bb67f17a2c..f7fff4a4c99f 100644
> --- a/arch/arm64/kernel/process.c
> +++ b/arch/arm64/kernel/process.c
> @@ -744,19 +744,13 @@ asmlinkage void __sched arm64_preempt_schedule_irq(void)
>  int arch_elf_adjust_prot(int prot, const struct arch_elf_state *state,
>  			 bool has_interp, bool is_interp)
>  {
> -	/*
> -	 * For dynamically linked executables the interpreter is
> -	 * responsible for setting PROT_BTI on everything except
> -	 * itself.
> -	 */
> -	if (is_interp != has_interp)
> -		return prot;
> +	if (prot & PROT_EXEC) {
> +		if (state->flags & ARM64_ELF_INTERP_BTI && is_interp)
> +			prot |= PROT_BTI;
>
> -	if (!(state->flags & ARM64_ELF_BTI))
> -		return prot;
> -
> -	if (prot & PROT_EXEC)
> -		prot |= PROT_BTI;
> +		if (state->flags & ARM64_ELF_EXEC_BTI && !is_interp)

Merge these ifs together somehow?  I'm happy either way, though.

> +			prot |= PROT_BTI;
> +	}

Since is_interp and has_interp were only needed for this logic in the
first place, I think we can probably drop those, maybe in a subsequent
patch.  Probably better to do it now before too much dust settles on
them.

Again, Cc Yu-cheng Yu if doing that, since it might affect his patches.

Reviewed-by: Dave Martin <Dave.Martin@arm.com>


(though if some of the suggested changes are made elsewhere, this will
probably need a minor respin).
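
For reference, one possible way of merging the two branches is sketched below as standalone userspace C, so that it can be checked against the logic as posted. This is only an illustration and not necessarily what a respin would look like: the function names are hypothetical, the flag values are copied from the patch, and `has_interp` is omitted on the assumption that it is no longer needed.

```c
#include <assert.h>
#include <stdbool.h>
#include <sys/mman.h>	/* PROT_READ, PROT_EXEC */

#ifndef PROT_BTI
#define PROT_BTI 0x10	/* arm64 uapi value, repeated so this builds on other hosts */
#endif

/* Flag values copied from the patch. */
#define ARM64_ELF_INTERP_BTI	(1 << 0)
#define ARM64_ELF_EXEC_BTI	(1 << 1)

/* The two-branch logic as posted, minus the unused has_interp argument. */
static int adjust_prot_posted(int prot, int flags, bool is_interp)
{
	if (prot & PROT_EXEC) {
		if ((flags & ARM64_ELF_INTERP_BTI) && is_interp)
			prot |= PROT_BTI;

		if ((flags & ARM64_ELF_EXEC_BTI) && !is_interp)
			prot |= PROT_BTI;
	}

	return prot;
}

/* One possible merge: first pick the flag that applies to this object. */
static int adjust_prot_merged(int prot, int flags, bool is_interp)
{
	int wanted = is_interp ? ARM64_ELF_INTERP_BTI : ARM64_ELF_EXEC_BTI;

	if ((prot & PROT_EXEC) && (flags & wanted))
		prot |= PROT_BTI;

	return prot;
}

/* Compare both versions over every flag combination, with and without PROT_EXEC. */
static void check_merged_matches_posted(void)
{
	const int prots[] = { PROT_READ, PROT_READ | PROT_EXEC };

	for (int p = 0; p < 2; p++)
		for (int flags = 0; flags < 4; flags++)
			for (int interp = 0; interp < 2; interp++)
				assert(adjust_prot_merged(prots[p], flags, interp) ==
				       adjust_prot_posted(prots[p], flags, interp));
}
```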

Cheers
---Dave
Siddhesh Poyarekar via Libc-alpha June 3, 2021, 4:51 p.m. | #2
On Thu, Jun 03, 2021 at 04:40:35PM +0100, Dave Martin wrote:

> Do we know how libcs will detect that they don't need to do the
> mprotect() calls?  Do we need a detection mechanism at all?
>
> Ignoring certain errors from mprotect() when ld.so is trying to set
> PROT_BTI on the main executable's code pages is probably a reasonable,
> backwards-compatible compromise here, but it seems a bit wasteful.

I think the theory was that they would just do the mprotect() calls and
ignore any errors as they currently do, or declare that they depend on a
new enough kernel version I guess (not an option for glibc but might be
for others which didn't do BTI yet).

> > flexibility userspace has to disable BTI, but it is expected that in cases
> > where problems require BTI to be disabled, it is more likely that it will
> > need to be disabled at a system level.
>
> There's no flexibility impact unless MemoryDenyWriteExecute is in force,
> right?

Right, or some other mechanism that has the same effect.
Siddhesh Poyarekar via Libc-alpha June 3, 2021, 6:04 p.m. | #3
On Thu, Jun 03, 2021 at 05:51:34PM +0100, Mark Brown wrote:
> On Thu, Jun 03, 2021 at 04:40:35PM +0100, Dave Martin wrote:
> > Do we know how libcs will detect that they don't need to do the
> > mprotect() calls?  Do we need a detection mechanism at all?
> >
> > Ignoring certain errors from mprotect() when ld.so is trying to set
> > PROT_BTI on the main executable's code pages is probably a reasonable,
> > backwards-compatible compromise here, but it seems a bit wasteful.
>
> I think the theory was that they would just do the mprotect() calls and
> ignore any errors as they currently do, or declare that they depend on a
> new enough kernel version I guess (not an option for glibc but might be
> for others which didn't do BTI yet).

I think we discussed the possibility of an AT_FLAGS bit. Until recently,
this field was 0 but it gained a new bit now. If we are to expose this
to arch-specific things, it may need some reservations. Anyway, that's
an optimisation that can be added subsequently.
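
Purely as a sketch of what such a check could look like from userspace: the bit name and value below are hypothetical, since no AT_FLAGS bit for this exists, and the helper names are invented for illustration. Only `getauxval()` and `AT_FLAGS` are real interfaces.

```c
#include <assert.h>
#include <elf.h>	/* AT_FLAGS */
#include <stdbool.h>
#include <sys/auxv.h>	/* getauxval() */

/* HYPOTHETICAL: no such bit is defined by any kernel; chosen for illustration. */
#define AT_FLAGS_EXEC_BTI_HYPOTHETICAL	(1UL << 8)

/* Decode an AT_FLAGS value; kept separate so it can be checked in isolation. */
static bool at_flags_say_exec_bti(unsigned long at_flags)
{
	return (at_flags & AT_FLAGS_EXEC_BTI_HYPOTHETICAL) != 0;
}

/* What a loader might call to decide whether its own mprotect() is needed. */
static bool kernel_mapped_exec_with_bti(void)
{
	return at_flags_say_exec_bti(getauxval(AT_FLAGS));
}

static void at_flags_examples(void)
{
	/* An all-zero AT_FLAGS, as on older kernels, means "unknown, so try mprotect()". */
	assert(!at_flags_say_exec_bti(0UL));

	/* With the hypothetical bit set, the loader could skip the call. */
	assert(at_flags_say_exec_bti(AT_FLAGS_EXEC_BTI_HYPOTHETICAL));
}
```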

-- 
Catalin
Siddhesh Poyarekar via Libc-alpha June 7, 2021, 11:25 a.m. | #4
On Thu, Jun 03, 2021 at 07:04:31PM +0100, Catalin Marinas via Libc-alpha wrote:
> On Thu, Jun 03, 2021 at 05:51:34PM +0100, Mark Brown wrote:
> > On Thu, Jun 03, 2021 at 04:40:35PM +0100, Dave Martin wrote:
> > > Do we know how libcs will detect that they don't need to do the
> > > mprotect() calls?  Do we need a detection mechanism at all?
> > >
> > > Ignoring certain errors from mprotect() when ld.so is trying to set
> > > PROT_BTI on the main executable's code pages is probably a reasonable,
> > > backwards-compatible compromise here, but it seems a bit wasteful.
> >
> > I think the theory was that they would just do the mprotect() calls and
> > ignore any errors as they currently do, or declare that they depend on a
> > new enough kernel version I guess (not an option for glibc but might be
> > for others which didn't do BTI yet).
>
> I think we discussed the possibility of an AT_FLAGS bit. Until recently,
> this field was 0 but it gained a new bit now. If we are to expose this
> to arch-specific things, it may need some reservations. Anyway, that's
> an optimisation that can be added subsequently.

I suppose so, but AT_FLAGS doesn't seem appropriate somehow.

I wonder why we suddenly start considering adding a flag to AT_FLAGS
every few months, when it had sat empty for decades.  This may say
something about the current health of the kernel ABI, but I'm not sure
exactly what.

I think having mprotect() fail in a predictable way may be preferable
for now: glibc still only needs to probe with a single call and could
cache the knowledge after that.  Code outside libc / ld.so seems quite
unlikely to care about this.

Since only the executable segment(s) of the main binary need to be
protected, this should require only a very small number of mprotect()
calls in normal situations.  Although it feels a bit cruddy as a design,
cost-wise I think that extra overhead would be swamped by other noise in
realistic scenarios.  Often, there is just a single executable segment,
so the common case would probably require just one mprotect() call.  I
don't know if it ever gets much more complicated when using the
standard linker scripts.
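
One way to check the "usually just one executable segment" assumption on a given binary is to walk the main program's own program headers, which the kernel exposes through the auxiliary vector. The sketch below does that; the helper names are hypothetical, while `getauxval()`, `AT_PHDR`, `AT_PHNUM` and the `ElfW()` macro are existing glibc/ELF interfaces.

```c
#include <assert.h>
#include <elf.h>	/* PT_LOAD, PF_X, AT_PHDR, AT_PHNUM */
#include <link.h>	/* ElfW() */
#include <stddef.h>
#include <sys/auxv.h>	/* getauxval() */

/* Count PT_LOAD program headers that carry the execute flag. */
static int count_exec_load_segments(const ElfW(Phdr) *phdr, size_t phnum)
{
	int count = 0;

	assert(phdr != NULL || phnum == 0);

	for (size_t i = 0; i < phnum; i++)
		if (phdr[i].p_type == PT_LOAD && (phdr[i].p_flags & PF_X))
			count++;

	return count;
}

/* Number of executable segments in the main program, or 0 if unavailable. */
static int main_program_exec_segments(void)
{
	const ElfW(Phdr) *phdr = (const ElfW(Phdr) *)getauxval(AT_PHDR);
	size_t phnum = (size_t)getauxval(AT_PHNUM);

	if (phdr == NULL)
		return 0;

	return count_exec_load_segments(phdr, phnum);
}
```

The returned count is an upper bound on the number of mprotect() calls a loader would need to set PROT_BTI on the main executable.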

Any ideas on how we would document this behaviour?  The kernel and libc
behaviour are 100% clear: you _are_ allowed to twiddle PROT_BTI on
executable mappings, and there is no legitimate (or even useful) reason
to disallow this.  It's only systemd deliberately breaking the API that
causes the behaviour seen by "userspace" to vary.

Cheers
---Dave
Siddhesh Poyarekar via Libc-alpha June 7, 2021, 6:12 p.m. | #5
On Mon, Jun 07, 2021 at 12:25:38PM +0100, Dave P Martin wrote:
> On Thu, Jun 03, 2021 at 07:04:31PM +0100, Catalin Marinas via Libc-alpha wrote:
> > On Thu, Jun 03, 2021 at 05:51:34PM +0100, Mark Brown wrote:
> > > On Thu, Jun 03, 2021 at 04:40:35PM +0100, Dave Martin wrote:
> > > > Do we know how libcs will detect that they don't need to do the
> > > > mprotect() calls?  Do we need a detection mechanism at all?
> > > >
> > > > Ignoring certain errors from mprotect() when ld.so is trying to set
> > > > PROT_BTI on the main executable's code pages is probably a reasonable,
> > > > backwards-compatible compromise here, but it seems a bit wasteful.
> > >
> > > I think the theory was that they would just do the mprotect() calls and
> > > ignore any errors as they currently do, or declare that they depend on a
> > > new enough kernel version I guess (not an option for glibc but might be
> > > for others which didn't do BTI yet).
> >
> > I think we discussed the possibility of an AT_FLAGS bit. Until recently,
> > this field was 0 but it gained a new bit now. If we are to expose this
> > to arch-specific things, it may need some reservations. Anyway, that's
> > an optimisation that can be added subsequently.
>
> I suppose so, but AT_FLAGS doesn't seem appropriate somehow.
>
> I wonder why we suddenly start considering adding a flag to AT_FLAGS
> every few months, when it had sat empty for decades.  This may say
> something about the current health of the kernel ABI, but I'm not sure
> exactly what.
>
> I think having mprotect() fail in a predictable way may be preferable
> for now: glibc still only needs to probe with a single call and could
> cache the knowledge after that.  Code outside libc / ld.so seems quite
> unlikely to care about this.

I think that's the expected approach for now. If anyone complains about
an extra syscall, we can look into options but I wouldn't rush on doing
something.

> Any ideas on how we would document this behaviour?  The kernel and libc
> behaviour are 100% clear: you _are_ allowed to twiddle PROT_BTI on
> executable mappings, and there is no legitimate (or even useful) reason
> to disallow this.  It's only systemd deliberately breaking the API that
> causes the behaviour seen by "userspace" to vary.

I don't think we can document all the filters that can be added on top
of various syscalls, so I'd leave it undocumented (or part of the systemd
documentation). It was a user space program (systemd) breaking another
user space program (well, anything with a new enough glibc). The kernel
ABI was still valid when /sbin/init started ;).

-- 
Catalin
Siddhesh Poyarekar via Libc-alpha June 8, 2021, 11:33 a.m. | #6
On Mon, Jun 07, 2021 at 07:12:13PM +0100, Catalin Marinas wrote:

> I don't think we can document all the filters that can be added on top
> of various syscalls, so I'd leave it undocumented (or part of the systemd
> documentation). It was a user space program (systemd) breaking another
> user space program (well, anything with a new enough glibc). The kernel
> ABI was still valid when /sbin/init started ;).

Indeed.  I think from a kernel point of view the main thing is to look
at why userspace feels the need to do things like this and see if
there's anything we can improve or do better with in future APIs.  Part
of the original discussion here was figuring out that there aren't really
any other reasonable options for userspace to implement this check at
the minute.
Siddhesh Poyarekar via Libc-alpha June 8, 2021, 3:19 p.m. | #7
On Tue, Jun 08, 2021 at 12:33:18PM +0100, Mark Brown via Libc-alpha wrote:
> On Mon, Jun 07, 2021 at 07:12:13PM +0100, Catalin Marinas wrote:
>
> > I don't think we can document all the filters that can be added on top
> > of various syscalls, so I'd leave it undocumented (or part of the systemd
> > documentation). It was a user space program (systemd) breaking another
> > user space program (well, anything with a new enough glibc). The kernel
> > ABI was still valid when /sbin/init started ;).
>
> Indeed.  I think from a kernel point of view the main thing is to look
> at why userspace feels the need to do things like this and see if
> there's anything we can improve or do better with in future APIs.  Part
> of the original discussion here was figuring out that there aren't really
> any other reasonable options for userspace to implement this check at
> the minute.

Ack, that would be my policy -- just wanted to make it explicit.
It would be good if there were better dialogue between the systemd
and kernel folks on this kind of thing.

SECCOMP makes it rather easy to (attempt to) paper over kernel/user API
design problems, which probably reduces the chance of the API ever being
fixed properly, if we're not careful...

Cheers
---Dave
Siddhesh Poyarekar via Libc-alpha June 8, 2021, 3:42 p.m. | #8
On 6/8/21 10:19 AM, Dave Martin wrote:
> On Tue, Jun 08, 2021 at 12:33:18PM +0100, Mark Brown via Libc-alpha wrote:
>> On Mon, Jun 07, 2021 at 07:12:13PM +0100, Catalin Marinas wrote:
>>
>>> I don't think we can document all the filters that can be added on top
>>> of various syscalls, so I'd leave it undocumented (or part of the systemd
>>> documentation). It was a user space program (systemd) breaking another
>>> user space program (well, anything with a new enough glibc). The kernel
>>> ABI was still valid when /sbin/init started ;).
>>
>> Indeed.  I think from a kernel point of view the main thing is to look
>> at why userspace feels the need to do things like this and see if
>> there's anything we can improve or do better with in future APIs.  Part
>> of the original discussion here was figuring out that there aren't really
>> any other reasonable options for userspace to implement this check at
>> the minute.
>
> Ack, that would be my policy -- just wanted to make it explicit.
> It would be good if there were better dialogue between the systemd
> and kernel folks on this kind of thing.
>
> SECCOMP makes it rather easy to (attempt to) paper over kernel/user API
> design problems, which probably reduces the chance of the API ever being
> fixed properly, if we're not careful...

Well IMHO the problem is larger than just BTI here. What systemd is
trying to do by fixing the exec state of a service is admirable, but it's
a 90% solution without the entire linker/loader being in a more
privileged context. While BTI makes finding a generic gadget that can
call mprotect harder, it still seems like it might just be a little too
easy. The seccomp filter is providing a nice bonus by removing the
ability to disable BTI via mprotect without also disabling X. So without
moving more of the linker into the kernel it's hard to see how one can
really lock down X only pages.

Anyway, I'm testing this on rawhide now.

Thanks!
Siddhesh Poyarekar via Libc-alpha June 10, 2021, 10:33 a.m. | #9
On Tue, Jun 08, 2021 at 10:42:41AM -0500, Jeremy Linton wrote:
> On 6/8/21 10:19 AM, Dave Martin wrote:
> >On Tue, Jun 08, 2021 at 12:33:18PM +0100, Mark Brown via Libc-alpha wrote:
> >>On Mon, Jun 07, 2021 at 07:12:13PM +0100, Catalin Marinas wrote:
> >>
> >>>I don't think we can document all the filters that can be added on top
> >>>of various syscalls, so I'd leave it undocumented (or part of the systemd
> >>>documentation). It was a user space program (systemd) breaking another
> >>>user space program (well, anything with a new enough glibc). The kernel
> >>>ABI was still valid when /sbin/init started ;).
> >>
> >>Indeed.  I think from a kernel point of view the main thing is to look
> >>at why userspace feels the need to do things like this and see if
> >>there's anything we can improve or do better with in future APIs.  Part
> >>of the original discussion here was figuring out that there aren't really
> >>any other reasonable options for userspace to implement this check at
> >>the minute.
> >
> >Ack, that would be my policy -- just wanted to make it explicit.
> >It would be good if there were better dialogue between the systemd
> >and kernel folks on this kind of thing.
> >
> >SECCOMP makes it rather easy to (attempt to) paper over kernel/user API
> >design problems, which probably reduces the chance of the API ever being
> >fixed properly, if we're not careful...
>
> Well IMHO the problem is larger than just BTI here. What systemd is trying
> to do by fixing the exec state of a service is admirable, but it's a 90%
> solution without the entire linker/loader being in a more privileged
> context. While BTI makes finding a generic gadget that can call mprotect
> harder, it still seems like it might just be a little too easy. The seccomp
> filter is providing a nice bonus by removing the ability to disable BTI via
> mprotect without also disabling X. So without moving more of the linker into
> the kernel it's hard to see how one can really lock down X only pages.
>
> Anyway, I'm testing this on rawhide now.
>
> Thanks!

Well, I agree that there are larger issues here.  But we need to be
realistic and try not to do too much damage to future maintainability.

Note, your "bonus" is really a feature-like bug.  This is what we
should be trying to avoid IMHO: if it's important, it needs to be
designed and guaranteed.  Something that works by accident is likely to
get broken again by accident in the future.

Cheers
---Dave

Patch

diff --git a/arch/arm64/include/asm/elf.h b/arch/arm64/include/asm/elf.h
index c8678a8c36d5..a6e9032b951a 100644
--- a/arch/arm64/include/asm/elf.h
+++ b/arch/arm64/include/asm/elf.h
@@ -253,7 +253,8 @@  struct arch_elf_state {
 	int flags;
 };
 
-#define ARM64_ELF_BTI		(1 << 0)
+#define ARM64_ELF_INTERP_BTI		(1 << 0)
+#define ARM64_ELF_EXEC_BTI		(1 << 1)
 
 #define INIT_ARCH_ELF_STATE {			\
 	.flags = 0,				\
@@ -274,9 +275,14 @@  static inline int arch_parse_elf_property(u32 type, const void *data,
 		if (datasz != sizeof(*p))
 			return -ENOEXEC;
 
-		if (system_supports_bti() && is_interp &&
-		    (*p & GNU_PROPERTY_AARCH64_FEATURE_1_BTI))
-			arch->flags |= ARM64_ELF_BTI;
+		if (system_supports_bti() &&
+		    (*p & GNU_PROPERTY_AARCH64_FEATURE_1_BTI)) {
+			if (is_interp) {
+				arch->flags |= ARM64_ELF_INTERP_BTI;
+			} else {
+				arch->flags |= ARM64_ELF_EXEC_BTI;
+			}
+		}
 	}
 
 	return 0;
diff --git a/arch/arm64/kernel/process.c b/arch/arm64/kernel/process.c
index b4bb67f17a2c..f7fff4a4c99f 100644
--- a/arch/arm64/kernel/process.c
+++ b/arch/arm64/kernel/process.c
@@ -744,19 +744,13 @@  asmlinkage void __sched arm64_preempt_schedule_irq(void)
 int arch_elf_adjust_prot(int prot, const struct arch_elf_state *state,
 			 bool has_interp, bool is_interp)
 {
-	/*
-	 * For dynamically linked executables the interpreter is
-	 * responsible for setting PROT_BTI on everything except
-	 * itself.
-	 */
-	if (is_interp != has_interp)
-		return prot;
+	if (prot & PROT_EXEC) {
+		if (state->flags & ARM64_ELF_INTERP_BTI && is_interp)
+			prot |= PROT_BTI;
 
-	if (!(state->flags & ARM64_ELF_BTI))
-		return prot;
-
-	if (prot & PROT_EXEC)
-		prot |= PROT_BTI;
+		if (state->flags & ARM64_ELF_EXEC_BTI && !is_interp)
+			prot |= PROT_BTI;
+	}
 
 	return prot;
 }