RFC: Add posix_spawn_file_actions_closefrom

Message ID 20190521184859.29249-1-adhemerval.zanella@linaro.org
State New
Headers show
Series
  • RFC: Add posix_spawn_file_actions_closefrom
Related show

Commit Message

Adhemerval Zanella May 21, 2019, 6:48 p.m.
This is WIP patch with an RFC to add the way to close a range of file
descriptors on posix_spawn as a file action.  The API is similar to
the one provided by Solaris11, where the file action causes the
all open file descriptors greater than or equal to input on to
closed when the new process is spawned.

There are some discussions on BZ#10353 [2], although the bug itself
asks for a generic solution (similar to the closeall provided by
some BSD).  The posix_spawn is safe to be implemented by interacting
over /proc/self/fd, the Linux spawni does not use CLONE_FILES, so
the helper process has its own file descriptor table and any failure
(in /proc operation) aborts the process creation and returns an
error to the caller.

I am aware that this file action might be redundant to the current
approach of POSIX in promoting O_CLOEXEC in more interfaces. However
O_CLOEXEC is still not the default and for some specific usages, the caller
needs to close all possible file descriptors to avoid them leaking.  Some
examples are CPython (discussed in BZ#10353) and OpenJDK jspawnhelper [3]
(where OpenJDK spawns a helper process to exactly closes all file
descriptors).  Most likely any environment which calls functions that
might open file descriptor under the hoods and aim to use posix_spawn
might face the same requirement.

Regarding implementation, to avoid relying on malloc call on child
helper I added an internal opendir that receives a previously allocated
buffer.  The allocated buffer should be large enough to a sane file
descriptor usage, and I think it would be hardly a hotspot (so I think
there is little point in increasing the buffer and thus stack usage).

The patch is incomplete, most lacking the abifiles.  If the idea to
indeed add such posix_spawn file action I will resend a complete version.

[1] https://docs.oracle.com/cd/E36784_01/html/E36874/posix-spawn-file-actions-addclosefrom-np-3c.html
[2] https://sourceware.org/bugzilla/show_bug.cgi?id=10353
[3] https://github.com/openjdk/jdk/blob/master/src/java.base/unix/native/libjava/childproc.c
---
 include/dirent.h                   |   6 +-
 posix/Makefile                     |   5 +-
 posix/Versions                     |   1 +
 posix/spawn.h                      |   6 +
 posix/spawn_faction_addclosefrom.c |  51 +++++++++
 posix/spawn_faction_destroy.c      |   1 +
 posix/spawn_int.h                  |   5 +
 posix/tst-spawn5.c                 | 176 +++++++++++++++++++++++++++++
 sysdeps/posix/closedir.c           |   3 +-
 sysdeps/posix/dirstream.h          |   1 +
 sysdeps/posix/fdopendir.c          |   2 +-
 sysdeps/posix/opendir.c            |  94 +++++++++++----
 sysdeps/unix/sysv/linux/spawni.c   |  52 ++++++++-
 13 files changed, 373 insertions(+), 30 deletions(-)
 create mode 100644 posix/spawn_faction_addclosefrom.c
 create mode 100644 posix/tst-spawn5.c

-- 
2.17.1

Comments

Florian Weimer May 24, 2019, 11:34 a.m. | #1
* Adhemerval Zanella:

> diff --git a/posix/spawn.h b/posix/spawn.h

> index 471dbea022..095ee67a26 100644

> --- a/posix/spawn.h

> +++ b/posix/spawn.h

> @@ -213,6 +213,12 @@ extern int posix_spawn_file_actions_addchdir_np (posix_spawn_file_actions_t *

>  extern int posix_spawn_file_actions_addfchdir_np (posix_spawn_file_actions_t *,

>  						  int __fd)

>       __THROW __nonnull ((1));

> +

> +/* Add an action to close all file descriptor greater than FROM durint

> +   spawn.  This affects the subsequent file actions.  */

> +extern int posix_spawn_file_actions_addclosefrom_np (posix_spawn_file_actions_t *,

> +						     int __from);

> +


Missing __THROW __nonnull ((11)), I think.

> diff --git a/posix/spawn_faction_addclosefrom.c b/posix/spawn_faction_addclosefrom.c

> new file mode 100644

> index 0000000000..24e6c62dbc

> --- /dev/null

> +++ b/posix/spawn_faction_addclosefrom.c


Can we implement this on Hurd fairly quickly?  Otherwise, this should
probable be a stub that returns ENOSYS.

> diff --git a/posix/tst-spawn5.c b/posix/tst-spawn5.c

> new file mode 100644

> index 0000000000..7ed03810e4


> +static int

> +do_test (int argc, char *argv[])

> +{

> +  /* We must have one or four parameters left if called initially:

> +       + path for ld.so		optional

> +       + "--library-path"	optional

> +       + the library path	optional

> +       + the application name

> +

> +     Plus one parameter to indicate which test to execute through

> +     re-execution.

> +

> +     So for default usage without --enable-hardcoded-path-in-tests, it

> +     will be called initially with 5 arguments and later with 2.  For

> +     --enable-hardcoded-path-in-tests it will be called with 2 arguments

> +     regardless.  */


You could run this as a container test, so that this would become
unnecessary.  (Or link statically.)

The test doesn't exercise the gaps case.

> diff --git a/sysdeps/posix/opendir.c b/sysdeps/posix/opendir.c

> index 8e3ba480b7..b576b3233d 100644

> --- a/sysdeps/posix/opendir.c

> +++ b/sysdeps/posix/opendir.c


> +DIR *

> +__alloc_dir (int fd, bool close_fd, const struct stat64 *statp,

> +	     void *buffer, size_t size)

>  {

>    /* We have to set the close-on-exit flag if the user provided the

>       file descriptor.  */

> @@ -114,31 +147,44 @@ __alloc_dir (int fd, bool close_fd, int flags, const struct stat64 *statp)

>      allocation = MIN (MAX ((size_t) statp->st_blksize, default_allocation),

>  		      MAX_DIR_BUFFER_SIZE);

>  #endif

> -

> -  DIR *dirp = (DIR *) malloc (sizeof (DIR) + allocation);

> -  if (dirp == NULL)

> +  DIR *dirp;

> +  if (!buffer)


Style issue, I think: buffer != NULL.

Otherwise, this is a fairly nice hack to get an async-signal-safe
readdir.  But I'm not sure that it's what we need here.  See below.

> diff --git a/sysdeps/unix/sysv/linux/spawni.c b/sysdeps/unix/sysv/linux/spawni.c

> index c1abf3f960..fe0ff95825 100644

> --- a/sysdeps/unix/sysv/linux/spawni.c

> +++ b/sysdeps/unix/sysv/linux/spawni.c


> +/* Close all file descriptor up to FROM by interacting /proc/self/fd.

> +   Any failure should */

> +static bool

> +spawn_closefrom (int from)

> +{

> +  /* Increasing the buffer size incurs in less getdents syscalls from

> +     readdir, however it would require more stack size to be allocated

> +     on __spawnix.  */

> +  char buffer[sizeof (struct __dirstream) + 2 * sizeof (struct dirent)];


We could allocate this on the heap, in the parent.  Maybe we could
opendir in the parent, and play with the underlying descriptor in the
child?  Then you wouldn't need to add __opendir_inplace at all.  Given
that we know what our implementation looks like, this should be fairly
safe.

> +  DIR *dp;

> +  if ((dp = __opendir_inplace ("/proc/self/fd", buffer, sizeof buffer))

> +      == NULL)

> +    return false;


This could check for ENFILE/EMFILE/ENOMEM and try closing descriptors
directly in case of that error, to make room for the new descriptor.
But perhaps that's not worth the complexity.

> +  bool ret = true;

> +  struct dirent *dirp;

> +  while ((dirp = __readdir (dp)) != NULL)


Should this be __readdir64?

> +    {

> +      if (dirp->d_name[0] == '.')

> +        continue;

> +

> +      char *endptr;

> +      long int fd = strtol (dirp->d_name, &endptr, 10);

> +      if (*endptr != '\0' || fd < 0 || fd > INT_MAX)

> +	{

> +	  ret = false;

> +	  break;

> +	}

> +

> +      if (fd == dirfd (dp) || fd < from)

> +        continue;

> +

> +      __close (fd);

> +    }

> +  __closedir (dp);

> +

> +  return ret;

> +}


I'm not sure if this is entirely correct.  If we close some descriptors,
and then readdir calls getdents64, what will the kernel return?  Will
there be a gap in the descriptor list?  (Curiously, it's the same issue
we have the the fork handler list. 8-)

If you share my concerns, maybe we should call getdents64 directly and
parse the buffer?  And restart after the buffer has been exhausted?
(I have a patch with such parser functions and planned to submit it
after the getdents64 syscall wrapper went in.)

> +

>  /* Function used in the clone call to setup the signals mask, posix_spawn

>     attributes, and file actions.  It run on its own stack (provided by the

>     posix_spawn call).  */

> @@ -280,6 +321,11 @@ __spawni_child (void *arguments)

>  	      if (__fchdir (action->action.fchdir_action.fd) != 0)

>  		goto fail;

>  	      break;

> +

> +	    case spawn_do_closefrom:

> +	      if (!spawn_closefrom (action->action.closefrom_action.from))

> +		goto fail;

> +	      break;

>  	    }

>  	}


The Hurd and generic implementations will need to be updated to handle
spawn_do_closefrom as well, otherwise they will fail to build.

Thanks,
Florian
Adhemerval Zanella May 24, 2019, 2:06 p.m. | #2
On 24/05/2019 08:34, Florian Weimer wrote:
> * Adhemerval Zanella:

> 

>> diff --git a/posix/spawn.h b/posix/spawn.h

>> index 471dbea022..095ee67a26 100644

>> --- a/posix/spawn.h

>> +++ b/posix/spawn.h

>> @@ -213,6 +213,12 @@ extern int posix_spawn_file_actions_addchdir_np (posix_spawn_file_actions_t *

>>  extern int posix_spawn_file_actions_addfchdir_np (posix_spawn_file_actions_t *,

>>  						  int __fd)

>>       __THROW __nonnull ((1));

>> +

>> +/* Add an action to close all file descriptor greater than FROM durint

>> +   spawn.  This affects the subsequent file actions.  */

>> +extern int posix_spawn_file_actions_addclosefrom_np (posix_spawn_file_actions_t *,

>> +						     int __from);

>> +

> 

> Missing __THROW __nonnull ((11)), I think.


Ack.

> 

>> diff --git a/posix/spawn_faction_addclosefrom.c b/posix/spawn_faction_addclosefrom.c

>> new file mode 100644

>> index 0000000000..24e6c62dbc

>> --- /dev/null

>> +++ b/posix/spawn_faction_addclosefrom.c

> 

> Can we implement this on Hurd fairly quickly?  Otherwise, this should

> probable be a stub that returns ENOSYS.


Not sure, I will need to check how to accomplish it on Hurd.  I think it would
be better to add a ENOSYS wrapper for now.

> 

>> diff --git a/posix/tst-spawn5.c b/posix/tst-spawn5.c

>> new file mode 100644

>> index 0000000000..7ed03810e4

> 

>> +static int

>> +do_test (int argc, char *argv[])

>> +{

>> +  /* We must have one or four parameters left if called initially:

>> +       + path for ld.so		optional

>> +       + "--library-path"	optional

>> +       + the library path	optional

>> +       + the application name

>> +

>> +     Plus one parameter to indicate which test to execute through

>> +     re-execution.

>> +

>> +     So for default usage without --enable-hardcoded-path-in-tests, it

>> +     will be called initially with 5 arguments and later with 2.  For

>> +     --enable-hardcoded-path-in-tests it will be called with 2 arguments

>> +     regardless.  */

> 

> You could run this as a container test, so that this would become

> unnecessary.  (Or link statically.)


Right, I think maybe container could be an option.  Another possibility is
to add some API to make this process re-spawn less bloated.

> 

> The test doesn't exercise the gaps case.


Do you mean gaps in file descriptor initial set before posix_spawn?

> 

>> diff --git a/sysdeps/posix/opendir.c b/sysdeps/posix/opendir.c

>> index 8e3ba480b7..b576b3233d 100644

>> --- a/sysdeps/posix/opendir.c

>> +++ b/sysdeps/posix/opendir.c

> 

>> +DIR *

>> +__alloc_dir (int fd, bool close_fd, const struct stat64 *statp,

>> +	     void *buffer, size_t size)

>>  {

>>    /* We have to set the close-on-exit flag if the user provided the

>>       file descriptor.  */

>> @@ -114,31 +147,44 @@ __alloc_dir (int fd, bool close_fd, int flags, const struct stat64 *statp)

>>      allocation = MIN (MAX ((size_t) statp->st_blksize, default_allocation),

>>  		      MAX_DIR_BUFFER_SIZE);

>>  #endif

>> -

>> -  DIR *dirp = (DIR *) malloc (sizeof (DIR) + allocation);

>> -  if (dirp == NULL)

>> +  DIR *dirp;

>> +  if (!buffer)

> 

> Style issue, I think: buffer != NULL.

> 

> Otherwise, this is a fairly nice hack to get an async-signal-safe

> readdir.  But I'm not sure that it's what we need here.  See below.

> 

>> diff --git a/sysdeps/unix/sysv/linux/spawni.c b/sysdeps/unix/sysv/linux/spawni.c

>> index c1abf3f960..fe0ff95825 100644

>> --- a/sysdeps/unix/sysv/linux/spawni.c

>> +++ b/sysdeps/unix/sysv/linux/spawni.c

> 

>> +/* Close all file descriptor up to FROM by interacting /proc/self/fd.

>> +   Any failure should */

>> +static bool

>> +spawn_closefrom (int from)

>> +{

>> +  /* Increasing the buffer size incurs in less getdents syscalls from

>> +     readdir, however it would require more stack size to be allocated

>> +     on __spawnix.  */

>> +  char buffer[sizeof (struct __dirstream) + 2 * sizeof (struct dirent)];

> 

> We could allocate this on the heap, in the parent.  Maybe we could

> opendir in the parent, and play with the underlying descriptor in the

> child?  Then you wouldn't need to add __opendir_inplace at all.  Given

> that we know what our implementation looks like, this should be fairly

> safe.


I don't have a strong opinion here, it would add some complexity on parent
helper which would need to transverse all file actions, call opendir, and
deallocate after helper process returns.  My idea is to keep the required 
logic more in place, so its more obvious where things are initiated.

> 

>> +  DIR *dp;

>> +  if ((dp = __opendir_inplace ("/proc/self/fd", buffer, sizeof buffer))

>> +      == NULL)

>> +    return false;

> 

> This could check for ENFILE/EMFILE/ENOMEM and try closing descriptors

> directly in case of that error, to make room for the new descriptor.

> But perhaps that's not worth the complexity.


Hum, this could be an enhancement indeed.  However the main issue is 
to find which is the lower opened file descriptor greater than FROM
without polling /proc/self/fd or by using close with random file
descriptors.

> 

>> +  bool ret = true;

>> +  struct dirent *dirp;

>> +  while ((dirp = __readdir (dp)) != NULL)

> 

> Should this be __readdir64?


It should indeed, we should definitely move away from non-LFS calls.

> 

>> +    {

>> +      if (dirp->d_name[0] == '.')

>> +        continue;

>> +

>> +      char *endptr;

>> +      long int fd = strtol (dirp->d_name, &endptr, 10);

>> +      if (*endptr != '\0' || fd < 0 || fd > INT_MAX)

>> +	{

>> +	  ret = false;

>> +	  break;

>> +	}

>> +

>> +      if (fd == dirfd (dp) || fd < from)

>> +        continue;

>> +

>> +      __close (fd);

>> +    }

>> +  __closedir (dp);

>> +

>> +  return ret;

>> +}

> 

> I'm not sure if this is entirely correct.  If we close some descriptors,

> and then readdir calls getdents64, what will the kernel return?  Will

> there be a gap in the descriptor list?  (Curiously, it's the same issue

> we have the the fork handler list. 8-)


It does not seems to be case with my experiments.  I hack opendir to 
allocate the minimum workable buffer (__dirstream plus a 
struct dirent, about 40 bytes on x86_64) to force each readdir to
call getdents.  A simple testcase shows:

--
#include <unistd.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <assert.h>
#include <dirent.h>
#include <limits.h>

int main (int argc, char *argv[])
{ 
  int fd1 = open ("/dev/null", O_WRONLY);
  assert (fd1 != -1);
  int fd2 = open ("/dev/null", O_WRONLY); 
  assert (fd2 != -1);
  int fd3 = open ("/dev/null", O_WRONLY);
  assert (fd3 != -1);
  int fd4 = open ("/dev/null", O_WRONLY);
  assert (fd4 != -1);
  printf ("fd1=%d fd2=%d fd3=%d fd4=%d\n", fd1, fd2, fd3, fd4);
  
  system ("ls /proc/self/fd");
  
  int from = fd1 - 1;

  DIR *dp = opendir ("/proc/self/fd");
  assert (dp != NULL);
  
  struct dirent *dirp;
  while ((dirp = readdir (dp)) != NULL)
    { 
      if (dirp->d_name[0] == '.')
        continue;

      char *endptr;
      long int fd = strtol (dirp->d_name, &endptr, 10);
      if (*endptr != '\0' || fd < 0 || fd > INT_MAX)
        break;

      if (fd == dirfd (dp) || fd < from)
        continue;

      close (fd);
      system ("ls /proc/self/fd");
    }
  closedir (dp);

  return 0;
}

$ ./testrun.sh /tmp/test
fd1=3 fd2=4 fd3=5 fd4=6
0  1  2  3  4  5  6  7
0  1  2  3  4  5  6
0  1  2  4  5  6
0  1  2  5  6
0  1  2  6
0  1  2
$ ./testrun.sh --tool=strace /tmp/test 2>&1 | grep getdents64
getdents64(7, /* 1 entries */, 40)      = 24
getdents64(7, /* 1 entries */, 40)      = 24
getdents64(7, /* 1 entries */, 40)      = 24
getdents64(7, /* 1 entries */, 40)      = 24
getdents64(7, /* 1 entries */, 40)      = 24
getdents64(7, /* 1 entries */, 40)      = 24
getdents64(7, /* 1 entries */, 40)      = 24
getdents64(7, /* 1 entries */, 40)      = 24
getdents64(7, /* 1 entries */, 40)      = 24
getdents64(7, /* 1 entries */, 40)      = 24
getdents64(7, /* 0 entries */, 40)      = 0
--

I tried also to filter out an additional file descriptor by setting the
condition:

      if (fd == dirfd (dp) || fd < from || fd == fd2)
        continue;

And the result seems what is expected:

$ ./testrun.sh /tmp/test
fd1=3 fd2=4 fd3=5 fd4=6
0  1  2  3  4  5  6  7
0  1  2  3  4  5  6
0  1  2  4  5  6
0  1  2  4  6
0  1  2  4

I haven't actually read the kernel source to check what exactly getdents
does for /proc/self/fd updates neither if it handles it differently.

> 

> If you share my concerns, maybe we should call getdents64 directly and

> parse the buffer?  And restart after the buffer has been exhausted?

> (I have a patch with such parser functions and planned to submit it

> after the getdents64 syscall wrapper went in.)


I though about calling getdents64 directly, but I think we would just ended up 
replicating readdir code on closefrom.  The __opendir_inplace has the extra
advantage to work with readdir.

> 

>> +

>>  /* Function used in the clone call to setup the signals mask, posix_spawn

>>     attributes, and file actions.  It run on its own stack (provided by the

>>     posix_spawn call).  */

>> @@ -280,6 +321,11 @@ __spawni_child (void *arguments)

>>  	      if (__fchdir (action->action.fchdir_action.fd) != 0)

>>  		goto fail;

>>  	      break;

>> +

>> +	    case spawn_do_closefrom:

>> +	      if (!spawn_closefrom (action->action.closefrom_action.from))

>> +		goto fail;

>> +	      break;

>>  	    }

>>  	}

> 

> The Hurd and generic implementations will need to be updated to handle

> spawn_do_closefrom as well, otherwise they will fail to build.


Ack.
Florian Weimer May 24, 2019, 2:37 p.m. | #3
* Adhemerval Zanella:

>> The test doesn't exercise the gaps case.

>

> Do you mean gaps in file descriptor initial set before posix_spawn?


Yes, where the directory descriptor is in the middle of the closefrom
range.

>>> +/* Close all file descriptor up to FROM by interacting /proc/self/fd.

>>> +   Any failure should */

>>> +static bool

>>> +spawn_closefrom (int from)

>>> +{

>>> +  /* Increasing the buffer size incurs in less getdents syscalls from

>>> +     readdir, however it would require more stack size to be allocated

>>> +     on __spawnix.  */

>>> +  char buffer[sizeof (struct __dirstream) + 2 * sizeof (struct dirent)];

>> 

>> We could allocate this on the heap, in the parent.  Maybe we could

>> opendir in the parent, and play with the underlying descriptor in the

>> child?  Then you wouldn't need to add __opendir_inplace at all.  Given

>> that we know what our implementation looks like, this should be fairly

>> safe.

>

> I don't have a strong opinion here, it would add some complexity on parent

> helper which would need to transverse all file actions, call opendir, and

> deallocate after helper process returns.  My idea is to keep the required 

> logic more in place, so its more obvious where things are initiated.


You could turn one of the padding elements in posix_spawn_file_actions_t
into a flag and have posix_spawn_file_actions_addclosefrom_np set the
flag.  Then the second iteration isn't necessary.

>>> +  DIR *dp;

>>> +  if ((dp = __opendir_inplace ("/proc/self/fd", buffer, sizeof buffer))

>>> +      == NULL)

>>> +    return false;

>> 

>> This could check for ENFILE/EMFILE/ENOMEM and try closing descriptors

>> directly in case of that error, to make room for the new descriptor.

>> But perhaps that's not worth the complexity.

>

> Hum, this could be an enhancement indeed.  However the main issue is 

> to find which is the lower opened file descriptor greater than FROM

> without polling /proc/self/fd or by using close with random file

> descriptors.


You can do close (from), close (from + 1), etc., up to a certain limit,
and retry if one of the close calls doesn't return EBADF.  The magic
limit is needed in case the closefrom does not overlap with any file
descriptors.

>>> +    {

>>> +      if (dirp->d_name[0] == '.')

>>> +        continue;

>>> +

>>> +      char *endptr;

>>> +      long int fd = strtol (dirp->d_name, &endptr, 10);

>>> +      if (*endptr != '\0' || fd < 0 || fd > INT_MAX)

>>> +	{

>>> +	  ret = false;

>>> +	  break;

>>> +	}

>>> +

>>> +      if (fd == dirfd (dp) || fd < from)

>>> +        continue;

>>> +

>>> +      __close (fd);

>>> +    }

>>> +  __closedir (dp);

>>> +

>>> +  return ret;

>>> +}

>> 

>> I'm not sure if this is entirely correct.  If we close some descriptors,

>> and then readdir calls getdents64, what will the kernel return?  Will

>> there be a gap in the descriptor list?  (Curiously, it's the same issue

>> we have the the fork handler list. 8-)

>

> It does not seems to be case with my experiments.  I hack opendir to 

> allocate the minimum workable buffer (__dirstream plus a 

> struct dirent, about 40 bytes on x86_64) to force each readdir to

> call getdents.  A simple testcase shows:


It's still looks very implementation-defined to me.  proc_readfd_common
does this:

	for (fd = ctx->pos - 2;
	     fd < files_fdtable(files)->max_fds;
	     fd++, ctx->pos++) {

And I think ctx->pos somehow corresponds to d_off.  But I don't see a
1:1 correspondence between descriptors and offsets.  I wonder whether
the single-entry case is indeed the worst-possible test case for this.

Thanks,
Florian
Adhemerval Zanella May 24, 2019, 2:55 p.m. | #4
On 24/05/2019 11:37, Florian Weimer wrote:
> * Adhemerval Zanella:

> 

>>> The test doesn't exercise the gaps case.

>>

>> Do you mean gaps in file descriptor initial set before posix_spawn?

> 

> Yes, where the directory descriptor is in the middle of the closefrom

> range.


Ok, I would add a test to check for it.

> 

>>>> +/* Close all file descriptor up to FROM by interacting /proc/self/fd.

>>>> +   Any failure should */

>>>> +static bool

>>>> +spawn_closefrom (int from)

>>>> +{

>>>> +  /* Increasing the buffer size incurs in less getdents syscalls from

>>>> +     readdir, however it would require more stack size to be allocated

>>>> +     on __spawnix.  */

>>>> +  char buffer[sizeof (struct __dirstream) + 2 * sizeof (struct dirent)];

>>>

>>> We could allocate this on the heap, in the parent.  Maybe we could

>>> opendir in the parent, and play with the underlying descriptor in the

>>> child?  Then you wouldn't need to add __opendir_inplace at all.  Given

>>> that we know what our implementation looks like, this should be fairly

>>> safe.

>>

>> I don't have a strong opinion here, it would add some complexity on parent

>> helper which would need to transverse all file actions, call opendir, and

>> deallocate after helper process returns.  My idea is to keep the required 

>> logic more in place, so its more obvious where things are initiated.

> 

> You could turn one of the padding elements in posix_spawn_file_actions_t

> into a flag and have posix_spawn_file_actions_addclosefrom_np set the

> flag.  Then the second iteration isn't necessary.

> 

>>>> +  DIR *dp;

>>>> +  if ((dp = __opendir_inplace ("/proc/self/fd", buffer, sizeof buffer))

>>>> +      == NULL)

>>>> +    return false;

>>>

>>> This could check for ENFILE/EMFILE/ENOMEM and try closing descriptors

>>> directly in case of that error, to make room for the new descriptor.

>>> But perhaps that's not worth the complexity.

>>

>> Hum, this could be an enhancement indeed.  However the main issue is 

>> to find which is the lower opened file descriptor greater than FROM

>> without polling /proc/self/fd or by using close with random file

>> descriptors.

> 

> You can do close (from), close (from + 1), etc., up to a certain limit,

> and retry if one of the close calls doesn't return EBADF.  The magic

> limit is needed in case the closefrom does not overlap with any file

> descriptors.


Yeah, this is exactly the random close calls I would like to avoid. But
I also don't see a better option.

> 

>>>> +    {

>>>> +      if (dirp->d_name[0] == '.')

>>>> +        continue;

>>>> +

>>>> +      char *endptr;

>>>> +      long int fd = strtol (dirp->d_name, &endptr, 10);

>>>> +      if (*endptr != '\0' || fd < 0 || fd > INT_MAX)

>>>> +	{

>>>> +	  ret = false;

>>>> +	  break;

>>>> +	}

>>>> +

>>>> +      if (fd == dirfd (dp) || fd < from)

>>>> +        continue;

>>>> +

>>>> +      __close (fd);

>>>> +    }

>>>> +  __closedir (dp);

>>>> +

>>>> +  return ret;

>>>> +}

>>>

>>> I'm not sure if this is entirely correct.  If we close some descriptors,

>>> and then readdir calls getdents64, what will the kernel return?  Will

>>> there be a gap in the descriptor list?  (Curiously, it's the same issue

>>> we have the the fork handler list. 8-)

>>

>> It does not seems to be case with my experiments.  I hack opendir to 

>> allocate the minimum workable buffer (__dirstream plus a 

>> struct dirent, about 40 bytes on x86_64) to force each readdir to

>> call getdents.  A simple testcase shows:

> 

> It's still looks very implementation-defined to me.  proc_readfd_common

> does this:

> 

> 	for (fd = ctx->pos - 2;

> 	     fd < files_fdtable(files)->max_fds;

> 	     fd++, ctx->pos++) {

> 

> And I think ctx->pos somehow corresponds to d_off.  But I don't see a

> 1:1 correspondence between descriptors and offsets.  I wonder whether

> the single-entry case is indeed the worst-possible test case for this.

> 


I will check with different permutations by changing the opendir buffer
and a file descriptor set with different set of gaps.
Adhemerval Zanella May 27, 2019, 9:02 p.m. | #5
On 24/05/2019 11:55, Adhemerval Zanella wrote:
> 

> 

> On 24/05/2019 11:37, Florian Weimer wrote:

>> * Adhemerval Zanella:

>>

>>>> The test doesn't exercise the gaps case.

>>>

>>> Do you mean gaps in file descriptor initial set before posix_spawn?

>>

>> Yes, where the directory descriptor is in the middle of the closefrom

>> range.

> 

> Ok, I would add a test to check for it.

> 

>>

>>>>> +/* Close all file descriptor up to FROM by interacting /proc/self/fd.

>>>>> +   Any failure should */

>>>>> +static bool

>>>>> +spawn_closefrom (int from)

>>>>> +{

>>>>> +  /* Increasing the buffer size incurs in less getdents syscalls from

>>>>> +     readdir, however it would require more stack size to be allocated

>>>>> +     on __spawnix.  */

>>>>> +  char buffer[sizeof (struct __dirstream) + 2 * sizeof (struct dirent)];

>>>>

>>>> We could allocate this on the heap, in the parent.  Maybe we could

>>>> opendir in the parent, and play with the underlying descriptor in the

>>>> child?  Then you wouldn't need to add __opendir_inplace at all.  Given

>>>> that we know what our implementation looks like, this should be fairly

>>>> safe.

>>>

>>> I don't have a strong opinion here, it would add some complexity on parent

>>> helper which would need to transverse all file actions, call opendir, and

>>> deallocate after helper process returns.  My idea is to keep the required 

>>> logic more in place, so its more obvious where things are initiated.

>>

>> You could turn one of the padding elements in posix_spawn_file_actions_t

>> into a flag and have posix_spawn_file_actions_addclosefrom_np set the

>> flag.  Then the second iteration isn't necessary.


Another possible advantage of using a inplace buffer for opendir is we
can tune memory usage based on expected filename entries.  For /proc/self/fds
the names are expected for be at most sizeof (int) * 3 + 1, so with:

  enum {
    dirent_base_size  = offsetof (struct dirent, d_name),
    d_name_max_length = sizeof (int) * 3 + 1,
    dirent_max_size   = ALIGN_UP (dirent_base_size + d_name_max_length,
                                 sizeof (long))
  };
  char buffer[sizeof (struct __dirstream) + 10 * dirent_max_size]

We can obtain 10 entries for each getdents calls at cost of about just
432 stack size for spawn_closefrom instead of allocate 4*BUFSIZ/BUFSIZ
(32728 in most cases).

>>

>>>>> +  DIR *dp;

>>>>> +  if ((dp = __opendir_inplace ("/proc/self/fd", buffer, sizeof buffer))

>>>>> +      == NULL)

>>>>> +    return false;

>>>>

>>>> This could check for ENFILE/EMFILE/ENOMEM and try closing descriptors

>>>> directly in case of that error, to make room for the new descriptor.

>>>> But perhaps that's not worth the complexity.

>>>

>>> Hum, this could be an enhancement indeed.  However the main issue is 

>>> to find which is the lower opened file descriptor greater than FROM

>>> without polling /proc/self/fd or by using close with random file

>>> descriptors.

>>

>> You can do close (from), close (from + 1), etc., up to a certain limit,

>> and retry if one of the close calls doesn't return EBADF.  The magic

>> limit is needed in case the closefrom does not overlap with any file

>> descriptors.

> 

> Yeah, this is exactly the random close calls I would like to avoid. But

> I also don't see a better option.

> 

>>

>>>>> +    {

>>>>> +      if (dirp->d_name[0] == '.')

>>>>> +        continue;

>>>>> +

>>>>> +      char *endptr;

>>>>> +      long int fd = strtol (dirp->d_name, &endptr, 10);

>>>>> +      if (*endptr != '\0' || fd < 0 || fd > INT_MAX)

>>>>> +	{

>>>>> +	  ret = false;

>>>>> +	  break;

>>>>> +	}

>>>>> +

>>>>> +      if (fd == dirfd (dp) || fd < from)

>>>>> +        continue;

>>>>> +

>>>>> +      __close (fd);

>>>>> +    }

>>>>> +  __closedir (dp);

>>>>> +

>>>>> +  return ret;

>>>>> +}

>>>>

>>>> I'm not sure if this is entirely correct.  If we close some descriptors,

>>>> and then readdir calls getdents64, what will the kernel return?  Will

>>>> there be a gap in the descriptor list?  (Curiously, it's the same issue

>>>> we have the the fork handler list. 8-)

>>>

>>> It does not seems to be case with my experiments.  I hack opendir to 

>>> allocate the minimum workable buffer (__dirstream plus a 

>>> struct dirent, about 40 bytes on x86_64) to force each readdir to

>>> call getdents.  A simple testcase shows:

>>

>> It's still looks very implementation-defined to me.  proc_readfd_common

>> does this:

>>

>> 	for (fd = ctx->pos - 2;

>> 	     fd < files_fdtable(files)->max_fds;

>> 	     fd++, ctx->pos++) {

>>

>> And I think ctx->pos somehow corresponds to d_off.  But I don't see a

>> 1:1 correspondence between descriptors and offsets.  I wonder whether

>> the single-entry case is indeed the worst-possible test case for this.

>>

> 

> I will check with different permutations by changing the opendir buffer

> and a file descriptor set with different set of gaps.


I checked a permutation of 3 tests:

  - Issue 10 open calls and call closefrom with minimum file descriptor
    value;
  - Issue 10 open calls, close first one, and call closefrom;
  - Issue 10 open calls, close first and last one, and call closefrom.

With a buffer for getdents starting with just sizeof (struct __dirstream)
plus ALIGN_UP (offsetof (struct dirent, d_name) + sizeof (int) * 3 + 1)
up to sizeof (struct __dirstream) plus 10 * ALIGN_UP value.  In the end
the idea is to vary the maximum entry amount getdents will return from
1 to 10 (the number of files to test).

I saw no issue, the opendir interaction close all the file without any
leakage (I can send you the testcase to check if I get something wrong).

And I think it would be expected behaviour imho: if we get all the
file descriptor information with a getdents call, holes does not really
matter. If holes does exist we have two scenarios: a hole in the obtained
buffer or a hole in a subsequent call. Former should be handled as first
case, while latter kernel removes it from next getdents call (as I am
seeing in my experiments).

Patch

diff --git a/include/dirent.h b/include/dirent.h
index 400835eefe..7642fdaba0 100644
--- a/include/dirent.h
+++ b/include/dirent.h
@@ -16,6 +16,8 @@  struct scandir_cancel_struct
 
 /* Now define the internal interfaces.  */
 extern DIR *__opendir (const char *__name) attribute_hidden;
+extern DIR *__opendir_inplace (const char *__name, void *buffer, size_t size)
+  attribute_hidden;
 extern DIR *__opendirat (int dfd, const char *__name) attribute_hidden;
 extern DIR *__fdopendir (int __fd) attribute_hidden;
 extern int __closedir (DIR *__dirp) attribute_hidden;
@@ -44,8 +46,8 @@  extern int __alphasort64 (const struct dirent64 **a, const struct dirent64 **b)
 extern int __versionsort64 (const struct dirent64 **a,
 			    const struct dirent64 **b)
      __attribute_pure__;
-extern DIR *__alloc_dir (int fd, bool close_fd, int flags,
-			 const struct stat64 *statp) attribute_hidden;
+extern DIR *__alloc_dir (int fd, bool close_fd, const struct stat64 *statp,
+			 void *buffer, size_t size) attribute_hidden;
 extern __typeof (rewinddir) __rewinddir;
 extern __typeof (seekdir) __seekdir;
 extern __typeof (dirfd) __dirfd;
diff --git a/posix/Makefile b/posix/Makefile
index 8ac6743ad7..1ac41ad85a 100644
--- a/posix/Makefile
+++ b/posix/Makefile
@@ -57,6 +57,7 @@  routines :=								      \
 	spawn_faction_init spawn_faction_destroy spawn_faction_addclose	      \
 	spawn_faction_addopen spawn_faction_adddup2 spawn_valid_fd	      \
 	spawn_faction_addchdir spawn_faction_addfchdir			      \
+	spawn_faction_addclosefrom					      \
 	spawnattr_init spawnattr_destroy				      \
 	spawnattr_getdefault spawnattr_setdefault			      \
 	spawnattr_getflags spawnattr_setflags				      \
@@ -100,7 +101,8 @@  tests		:= test-errno tstgetopt testfnm runtests runptests \
 		   tst-posix_fadvise tst-posix_fadvise64 \
 		   tst-sysconf-empty-chroot tst-glob_symlinks tst-fexecve \
 		   tst-glob-tilde test-ssize-max tst-spawn4 bug-regex37 \
-		   bug-regex38 tst-regcomp-truncated tst-spawn-chdir
+		   bug-regex38 tst-regcomp-truncated tst-spawn-chdir \
+		   tst-spawn5
 tests-internal	:= bug-regex5 bug-regex20 bug-regex33 \
 		   tst-rfc3484 tst-rfc3484-2 tst-rfc3484-3 \
 		   tst-glob_lstat_compat tst-spawn4-compat
@@ -254,6 +256,7 @@  tst-exec-static-ARGS = $(tst-exec-ARGS)
 tst-execvpe5-ARGS = -- $(host-test-program-cmd)
 tst-spawn-ARGS = -- $(host-test-program-cmd)
 tst-spawn-static-ARGS = $(tst-spawn-ARGS)
+tst-spawn5-ARGS = -- $(host-test-program-cmd)
 tst-dir-ARGS = `pwd` `cd $(common-objdir)/$(subdir); pwd` `cd $(common-objdir); pwd` $(objpfx)tst-dir
 tst-chmod-ARGS = $(objdir)
 tst-vfork3-ARGS = --test-dir=$(objpfx)
diff --git a/posix/Versions b/posix/Versions
index 7d06a6d0c0..34fe242072 100644
--- a/posix/Versions
+++ b/posix/Versions
@@ -144,6 +144,7 @@  libc {
   GLIBC_2.29 {
     posix_spawn_file_actions_addchdir_np;
     posix_spawn_file_actions_addfchdir_np;
+    posix_spawn_file_actions_addclosefrom_np;
   }
   GLIBC_2.30 {
   }
diff --git a/posix/spawn.h b/posix/spawn.h
index 471dbea022..095ee67a26 100644
--- a/posix/spawn.h
+++ b/posix/spawn.h
@@ -213,6 +213,12 @@  extern int posix_spawn_file_actions_addchdir_np (posix_spawn_file_actions_t *
 extern int posix_spawn_file_actions_addfchdir_np (posix_spawn_file_actions_t *,
 						  int __fd)
      __THROW __nonnull ((1));
+
+/* Add an action to close all file descriptor greater than FROM durint
+   spawn.  This affects the subsequent file actions.  */
+extern int posix_spawn_file_actions_addclosefrom_np (posix_spawn_file_actions_t *,
+						     int __from);
+
 #endif
 
 __END_DECLS
diff --git a/posix/spawn_faction_addclosefrom.c b/posix/spawn_faction_addclosefrom.c
new file mode 100644
index 0000000000..24e6c62dbc
--- /dev/null
+++ b/posix/spawn_faction_addclosefrom.c
@@ -0,0 +1,51 @@ 
+/* Add a closefrom to a file action list for posix_spawn.
+   Copyright (C) 2019 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#include <errno.h>
+#include <spawn.h>
+#include <unistd.h>
+
+#include "spawn_int.h"
+
+int
+__posix_spawn_file_actions_addclosefrom (posix_spawn_file_actions_t
+					 *file_actions, int from)
+{
+  struct __spawn_action *rec;
+
+  if (!__spawn_valid_fd (from))
+    return EBADF;
+
+  /* Allocate more memory if needed.  */
+  if (file_actions->__used == file_actions->__allocated
+      && __posix_spawn_file_actions_realloc (file_actions) != 0)
+    /* This can only mean we ran out of memory.  */
+    return ENOMEM;
+
+  /* Add the new value.  */
+  rec = &file_actions->__actions[file_actions->__used];
+  rec->tag = spawn_do_closefrom;
+  rec->action.closefrom_action.from = from;
+
+  /* Account for the new entry.  */
+  ++file_actions->__used;
+
+  return 0;
+}
+weak_alias (__posix_spawn_file_actions_addclosefrom,
+	    posix_spawn_file_actions_addclosefrom_np)
diff --git a/posix/spawn_faction_destroy.c b/posix/spawn_faction_destroy.c
index 51fab13585..b45d1cd889 100644
--- a/posix/spawn_faction_destroy.c
+++ b/posix/spawn_faction_destroy.c
@@ -39,6 +39,7 @@  __posix_spawn_file_actions_destroy (posix_spawn_file_actions_t *file_actions)
 	case spawn_do_close:
 	case spawn_do_dup2:
 	case spawn_do_fchdir:
+	case spawn_do_closefrom:
 	  /* No cleanup required.  */
 	  break;
 	}
diff --git a/posix/spawn_int.h b/posix/spawn_int.h
index 93b7597f90..2abefe0f57 100644
--- a/posix/spawn_int.h
+++ b/posix/spawn_int.h
@@ -32,6 +32,7 @@  struct __spawn_action
     spawn_do_open,
     spawn_do_chdir,
     spawn_do_fchdir,
+    spawn_do_closefrom
   } tag;
 
   union
@@ -60,6 +61,10 @@  struct __spawn_action
     {
       int fd;
     } fchdir_action;
+    struct
+    {
+      int from;
+    } closefrom_action;
   } action;
 };
 
diff --git a/posix/tst-spawn5.c b/posix/tst-spawn5.c
new file mode 100644
index 0000000000..7ed03810e4
--- /dev/null
+++ b/posix/tst-spawn5.c
@@ -0,0 +1,176 @@ 
+/* Tests for posix_spawn signal handling.
+   Copyright (C) 2019 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <getopt.h>
+#include <spawn.h>
+#include <fcntl.h>
+#include <sys/wait.h>
+#include <dirent.h>
+#include <stdbool.h>
+#include <errno.h>
+#include <limits.h>
+
+#include <support/check.h>
+#include <support/xunistd.h>
+#include <support/support.h>
+#include <array_length.h>
+
+/* Nonzero if the program gets called via `exec'.  */
+static int restart;
+#define CMDLINE_OPTIONS \
+  { "restart", no_argument, &restart, 1 },
+
+/* Number of lingering opened file descriptors to opened and checked on
+   the spawned process.  */
+static const int num_fd_to_open = 10;
+
+/* Called on process re-execution.  */
+static int
+handle_restart (int from)
+{
+  DIR *fds = opendir ("/proc/self/fd");
+  if (fds == NULL)
+    FAIL_EXIT1 ("opendir (\"/proc/self/fd\"): %m");
+
+  while (true)
+    {
+      errno = 0;
+      struct dirent64 *e = readdir64 (fds);
+      if (e == NULL)
+        {
+          if (errno != 0)
+            FAIL_EXIT1 ("readdir: %m");
+          break;
+        }
+
+      if (e->d_name[0] == '.')
+        continue;
+
+      char *endptr;
+      long int fd = strtol (e->d_name, &endptr, 10);
+      if (*endptr != '\0' || fd < 0 || fd > INT_MAX)
+        FAIL_EXIT1 ("readdir: invalid file descriptor name: /proc/self/fd/%s",
+                    e->d_name);
+
+      /* Skip the descriptor which is used to enumerate the
+         descriptors.  */
+      if (fd == dirfd (fds))
+        continue;
+
+      struct stat64 st;
+      if (fstat64 (fd, &st) != 0)
+        FAIL_EXIT1 ("readdir: fstat64 (%ld) failed: %m", fd);
+
+      if (fd >= from)
+	FAIL_EXIT1 ("error: fd (%ld) greater than from (%d)", fd, from);
+    }
+
+  closedir (fds);
+
+  return 0;
+}
+
+/* Common argument used for process re-execution.  */
+static char *initial_spargv[5];
+static size_t initial_spargv_size;
+
+/* Re-execute the test process with both '--direct', '--restart', and the
+   TEST (as integer value) as arguments.  */
+static void
+reexecute (int fd, const posix_spawn_file_actions_t *fa)
+{
+  char *spargv[8];
+  int i;
+
+  for (i = 0; i < initial_spargv_size; i++)
+    spargv[i] = initial_spargv[i];
+  /* Three digits per byte plus null terminator.  */
+  char teststr[3 * sizeof (fd) + 1];
+  snprintf (teststr, array_length (teststr), "%d", fd);
+  spargv[i++] = teststr;
+  spargv[i] = NULL;
+  TEST_VERIFY (i < 8);
+
+  pid_t pid;
+  int status;
+
+  TEST_COMPARE (posix_spawn (&pid, spargv[0], fa, NULL, spargv, environ),
+		0);
+  TEST_COMPARE (xwaitpid (pid, &status, 0), pid);
+  TEST_VERIFY (WIFEXITED (status));
+  TEST_VERIFY (!WIFSIGNALED (status));
+  TEST_COMPARE (WEXITSTATUS (status), 0);
+}
+
+static void
+do_test_closefrom (void)
+{
+  int from = xopen ("/dev/null", O_WRONLY, 0);
+  for (int i = 1; i < num_fd_to_open; i++)
+    xopen ("/dev/null", O_WRONLY, 0);
+
+  posix_spawn_file_actions_t fa;
+  /* posix_spawn_file_actions_init does not fail.  */
+  posix_spawn_file_actions_init (&fa);
+
+  TEST_COMPARE (posix_spawn_file_actions_addclosefrom_np (&fa, from), 0);
+
+  reexecute (from, &fa);
+}
+
+static int
+do_test (int argc, char *argv[])
+{
+  /* We must have one or four parameters left if called initially:
+       + path for ld.so		optional
+       + "--library-path"	optional
+       + the library path	optional
+       + the application name
+
+     Plus one parameter to indicate which test to execute through
+     re-execution.
+
+     So for default usage without --enable-hardcoded-path-in-tests, it
+     will be called initially with 5 arguments and later with 2.  For
+     --enable-hardcoded-path-in-tests it will be called with 2 arguments
+     regardless.  */
+
+  if (argc != (restart ? 2 : 5) && argc != 2)
+    FAIL_EXIT1 ("wrong number of arguments (%d)", argc);
+
+  if (restart)
+    return handle_restart (atoi (argv[1]));
+
+  {
+    int i;
+    for (i = 0; i < (argc == 5 ? 4 : 1); i++)
+      initial_spargv[i] = argv[i + 1];
+    initial_spargv[i++] = (char *) "--direct";
+    initial_spargv[i++] = (char *) "--restart";
+    initial_spargv_size = i;
+  }
+
+  do_test_closefrom ();
+
+  return 0;
+}
+
+#define TEST_FUNCTION_ARGV do_test
+#include <support/test-driver.c>
diff --git a/sysdeps/posix/closedir.c b/sysdeps/posix/closedir.c
index fb7da86218..0e55659485 100644
--- a/sysdeps/posix/closedir.c
+++ b/sysdeps/posix/closedir.c
@@ -47,7 +47,8 @@  __closedir (DIR *dirp)
   __libc_lock_fini (dirp->lock);
 #endif
 
-  free ((void *) dirp);
+  if (dirp->allocated)
+    free ((void *) dirp);
 
   return __close_nocancel (fd);
 }
diff --git a/sysdeps/posix/dirstream.h b/sysdeps/posix/dirstream.h
index 109fcc1609..861932abd6 100644
--- a/sysdeps/posix/dirstream.h
+++ b/sysdeps/posix/dirstream.h
@@ -34,6 +34,7 @@  struct __dirstream
     __libc_lock_define (, lock) /* Mutex lock for this structure.  */
 
     size_t allocation;		/* Space allocated for the block.  */
+    bool allocated;		/* Space was allocated by opendir.  */
     size_t size;		/* Total valid data in the block.  */
     size_t offset;		/* Current offset into the block.  */
 
diff --git a/sysdeps/posix/fdopendir.c b/sysdeps/posix/fdopendir.c
index 4b4e218115..13fe3f14b1 100644
--- a/sysdeps/posix/fdopendir.c
+++ b/sysdeps/posix/fdopendir.c
@@ -47,6 +47,6 @@  __fdopendir (int fd)
       return NULL;
     }
 
-  return __alloc_dir (fd, false, flags, &statbuf);
+  return __alloc_dir (fd, false, &statbuf, NULL, 0);
 }
 weak_alias (__fdopendir, fdopendir)
diff --git a/sysdeps/posix/opendir.c b/sysdeps/posix/opendir.c
index 8e3ba480b7..b576b3233d 100644
--- a/sysdeps/posix/opendir.c
+++ b/sysdeps/posix/opendir.c
@@ -21,6 +21,8 @@ 
 #include <stdio.h>	/* For BUFSIZ.  */
 #include <sys/param.h>	/* For MIN and MAX.  */
 
+#include <dirstream.h>
+
 #include <not-cancel.h>
 
 /* The st_blksize value of the directory is used as a hint for the
@@ -46,27 +48,47 @@  invalid_name (const char *name)
   return false;
 }
 
-static DIR *
-opendir_tail (int fd)
+static bool
+opendir_tail_common (int fd, struct stat64 *statbuf)
 {
   if (__glibc_unlikely (fd < 0))
-    return NULL;
+    return false;
 
   /* Now make sure this really is a directory and nothing changed since the
      `stat' call.  The S_ISDIR check is superfluous if O_DIRECTORY works,
      but it's cheap and we need the stat call for st_blksize anyway.  */
-  struct stat64 statbuf;
-  if (__glibc_unlikely (__fxstat64 (_STAT_VER, fd, &statbuf) < 0))
-    goto lose;
-  if (__glibc_unlikely (! S_ISDIR (statbuf.st_mode)))
+  if (__glibc_unlikely (__fxstat64 (_STAT_VER, fd, statbuf) < 0))
+    {
+      __close_nocancel_nostatus (fd);
+      return false;
+    }
+  if (__glibc_unlikely (! S_ISDIR (statbuf->st_mode)))
     {
       __set_errno (ENOTDIR);
-    lose:
       __close_nocancel_nostatus (fd);
-      return NULL;
+      return false;
     }
+  return true;
+}
+
+static DIR *
+opendir_tail (int fd)
+{
+  struct stat64 statbuf;
+  if (!opendir_tail_common (fd, &statbuf))
+    return NULL;
+
+  return __alloc_dir (fd, true, &statbuf, NULL, 0);
+}
 
-  return __alloc_dir (fd, true, 0, &statbuf);
+static DIR *
+opendir_tail_inplace (int fd, void *buffer, size_t size)
+{
+  struct stat64 statbuf;
+  if (!opendir_tail_common (fd, &statbuf))
+    return NULL;
+
+  return __alloc_dir (fd, true, &statbuf, buffer, size);
 }
 
 
@@ -94,7 +116,18 @@  __opendir (const char *name)
 weak_alias (__opendir, opendir)
 
 DIR *
-__alloc_dir (int fd, bool close_fd, int flags, const struct stat64 *statp)
+__opendir_inplace (const char *name, void *buffer, size_t size)
+{
+  if (__glibc_unlikely (invalid_name (name)))
+    return NULL;
+
+  return opendir_tail_inplace (__open_nocancel (name, opendir_oflags),
+			       buffer, size);
+}
+
+DIR *
+__alloc_dir (int fd, bool close_fd, const struct stat64 *statp,
+	     void *buffer, size_t size)
 {
   /* We have to set the close-on-exit flag if the user provided the
      file descriptor.  */
@@ -114,31 +147,44 @@  __alloc_dir (int fd, bool close_fd, int flags, const struct stat64 *statp)
     allocation = MIN (MAX ((size_t) statp->st_blksize, default_allocation),
 		      MAX_DIR_BUFFER_SIZE);
 #endif
-
-  DIR *dirp = (DIR *) malloc (sizeof (DIR) + allocation);
-  if (dirp == NULL)
+  DIR *dirp;
+  if (!buffer)
     {
-      allocation = small_allocation;
-      dirp = (DIR *) malloc (sizeof (DIR) + allocation);
-
+      dirp = malloc (sizeof (DIR) + allocation);
       if (dirp == NULL)
-      lose:
 	{
-	  if (close_fd)
+	  allocation = small_allocation;
+	  dirp = (DIR *) malloc (sizeof (DIR) + allocation);
+
+	  if (dirp == NULL)
 	    {
-	      int save_errno = errno;
-	      __close_nocancel_nostatus (fd);
-	      __set_errno (save_errno);
+	      if (close_fd)
+	      lose:
+		{
+		  int save_errno = errno;
+		  __close_nocancel_nostatus (fd);
+		  __set_errno (save_errno);
+		}
+	      return NULL;
 	    }
-	  return NULL;
 	}
+      dirp->allocated = true;
+      dirp->allocation = allocation;
+    }
+  else
+    {
+      dirp = buffer;
+      if (allocation < sizeof (struct __dirstream)
+		       + sizeof (struct dirent))
+	goto lose;
+      dirp->allocated = false;
+      dirp->allocation = size;
     }
 
   dirp->fd = fd;
 #if IS_IN (libc)
   __libc_lock_init (dirp->lock);
 #endif
-  dirp->allocation = allocation;
   dirp->size = 0;
   dirp->offset = 0;
   dirp->filepos = 0;
diff --git a/sysdeps/unix/sysv/linux/spawni.c b/sysdeps/unix/sysv/linux/spawni.c
index c1abf3f960..fe0ff95825 100644
--- a/sysdeps/unix/sysv/linux/spawni.c
+++ b/sysdeps/unix/sysv/linux/spawni.c
@@ -31,6 +31,7 @@ 
 #include <dl-sysdep.h>
 #include <libc-pointer-arith.h>
 #include <ldsodefs.h>
+#include <dirent.h>
 #include "spawn_int.h"
 
 /* The Linux implementation of posix_spawn{p} uses the clone syscall directly
@@ -114,6 +115,46 @@  maybe_script_execute (struct posix_spawn_args *args)
     }
 }
 
+/* Close all file descriptor up to FROM by interacting /proc/self/fd.
+   Any failure should */
+static bool
+spawn_closefrom (int from)
+{
+  /* Increasing the buffer size incurs in less getdents syscalls from
+     readdir, however it would require more stack size to be allocated
+     on __spawnix.  */
+  char buffer[sizeof (struct __dirstream) + 2 * sizeof (struct dirent)];
+
+  DIR *dp;
+  if ((dp = __opendir_inplace ("/proc/self/fd", buffer, sizeof buffer))
+      == NULL)
+    return false;
+
+  bool ret = true;
+  struct dirent *dirp;
+  while ((dirp = __readdir (dp)) != NULL)
+    {
+      if (dirp->d_name[0] == '.')
+        continue;
+
+      char *endptr;
+      long int fd = strtol (dirp->d_name, &endptr, 10);
+      if (*endptr != '\0' || fd < 0 || fd > INT_MAX)
+	{
+	  ret = false;
+	  break;
+	}
+
+      if (fd == dirfd (dp) || fd < from)
+        continue;
+
+      __close (fd);
+    }
+  __closedir (dp);
+
+  return ret;
+}
+
 /* Function used in the clone call to setup the signals mask, posix_spawn
    attributes, and file actions.  It run on its own stack (provided by the
    posix_spawn call).  */
@@ -280,6 +321,11 @@  __spawni_child (void *arguments)
 	      if (__fchdir (action->action.fchdir_action.fd) != 0)
 		goto fail;
 	      break;
+
+	    case spawn_do_closefrom:
+	      if (!spawn_closefrom (action->action.closefrom_action.from))
+		goto fail;
+	      break;
 	    }
 	}
     }
@@ -339,8 +385,12 @@  __spawnix (pid_t * pid, const char *file,
   int prot = (PROT_READ | PROT_WRITE
 	     | ((GL (dl_stack_flags) & PF_X) ? PROT_EXEC : 0));
 
+  /* The __spawni_child uses about 768 on x86_64 (including the stack for
+     opendir_inplace), so add some extra space.  */
+  const size_t stack_slack = 1024;
+
   /* Add a slack area for child's stack.  */
-  size_t argv_size = (argc * sizeof (void *)) + 512;
+  size_t argv_size = (argc * sizeof (void *)) + stack_slack;
   /* We need at least a few pages in case the compiler's stack checking is
      enabled.  In some configs, it is known to use at least 24KiB.  We use
      32KiB to be "safe" from anything the compiler might do.  Besides, the