[00/12] CTF symbol functionality

Message ID 20201025141413.363381-1-nick.alcock@oracle.com

Nick Alcock via Binutils Oct. 25, 2020, 2:14 p.m.
This patch series lets you look up the types of symbols in ELF objects in
the func info / data object CTF sections (hereafter, "symtypetab sections")
by just compiling with -gt with a suitable compiler (see below) and linking
them as usual, then opening the CTF dict as usual with ctf_open and
calling ctf_arc_lookup_symbol with a dynamic symbol index: you get a
ctf_id_t back which you can chase around the type graph and look up
properties of.  Function symbols and data symbols that are function pointers
both yield types of kind CTF_K_FUNCTION.

Symbols that are of types with ambiguous definitions are handled properly: the
type you get back has been looked up in the appropriate dict for the type the
compiler actually had in scope when it emitted the original definition of the
symbol in any given object file.


The representation is quite efficient: for symtypetabs where most symbols have
types, it's an array associated 1:1 [1] with functions or objects in the ELF
(dynamic) symtab, in the same order: each element contains a type ID, with
untyped symbols represented as padding of type 0.  This representation avoids
storing the symbol name or number at all at the cost of some theoretical
fragility (the producer and consumer had better be using the exact same .dynsym).
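
As a rough sketch of that lookup (not the libctf implementation): assuming a
simplified symbol table in which each entry only records whether it is a
function, a data object, or something else, the slot for a symbol is simply
the number of same-kind symbols that precede it.  All ex_* names are
hypothetical, and the sketch ignores the never-typed symbols that the real
format leaves out entirely.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified symbol: only the kind matters for this sketch.  */
enum ex_sym_kind { EX_SYM_FUNC, EX_SYM_OBJECT, EX_SYM_OTHER };

struct ex_sym
{
  enum ex_sym_kind kind;
};

/* Look up the type ID of symbol SYMIDX in an unindexed table.  TAB holds one
   type ID per symbol of kind KIND, in symtab order; 0 is padding for an
   untyped symbol.  Returns 0 when the symbol has no type in this table.  */
uint32_t
ex_dense_lookup (const struct ex_sym *symtab, size_t nsyms,
                 const uint32_t *tab, size_t ntab,
                 size_t symidx, enum ex_sym_kind kind)
{
  size_t slot = 0;

  assert (symtab != NULL && tab != NULL);

  if (symidx >= nsyms || symtab[symidx].kind != kind)
    return 0;

  /* The slot is the number of earlier symbols of the same kind.  */
  for (size_t i = 0; i < symidx; i++)
    if (symtab[i].kind == kind)
      slot++;

  return slot < ntab ? tab[slot] : 0;
}
```

No names are stored anywhere in the table, which is why producer and consumer
must agree on the exact symbol order.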

For symtypetabs where most symbols do not have types (common for child dicts
with only a few ambiguous types in them), we'd be wasting a lot of space
with padding, so an "indexed" representation is used where the symtypetab
section is stored in ASCIIbetical order by symbol name, with a corresponding
"index" section in 1:1 correspondence with it which records the name (not
number!) of each symbol for bsearching, as a strtab offset, usually into the
ELF dynstrtab.  This uses more space per entry because it has to record a
name offset, but less space overall because it doesn't have to record untyped
symbols at all.  When generating CTF dicts, we use a crude guess of the
cutoff point to switch from the one representation to the other: I'll switch
to something cleverer (but slower) later.
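
For illustration only, a lookup in such an indexed table could look like the
following.  It assumes two parallel arrays (name offsets and type IDs) that
are already sorted by symbol name in byte order, which is what strcmp
compares; the struct layout and every ex_* name are invented for this sketch
and are not the libctf internals.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical in-memory view of an indexed symtypetab.  */
struct ex_indexed_tab
{
  const char *strtab;        /* String table the name offsets point into.  */
  const uint32_t *name_offs; /* Index section: strtab offsets, sorted by name.  */
  const uint32_t *type_ids;  /* Symtypetab section: type ID of the same entry.  */
  size_t nentries;
};

/* Search key: the wanted name plus the strtab needed to decode offsets.  */
struct ex_key
{
  const char *name;
  const char *strtab;
};

/* bsearch comparator: compare the wanted name with the name at an offset.  */
static int
ex_cmp (const void *key_, const void *elem_)
{
  const struct ex_key *key = key_;
  const uint32_t *off = elem_;

  return strcmp (key->name, key->strtab + *off);
}

/* Return the type ID recorded for NAME, or 0 if NAME is not in the table.  */
uint32_t
ex_indexed_lookup (const struct ex_indexed_tab *tab, const char *name)
{
  struct ex_key key = { name, tab->strtab };
  const uint32_t *hit;

  assert (tab != NULL && name != NULL);

  hit = bsearch (&key, tab->name_offs, tab->nentries,
                 sizeof (uint32_t), ex_cmp);
  if (hit == NULL)
    return 0;

  /* The index and the symtypetab are in 1:1 correspondence.  */
  return tab->type_ids[hit - tab->name_offs];
}
```

Untyped symbols simply never appear in either array, so nothing is spent on
padding.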

The compiler always emits the indexed representation (but doesn't sort it) since
it has no idea what order the symbols will be in in the final symtab, or even
whether they'll be public there at all. ld -r does the same thing.

From the libctf user's viewpoint, there is no distinction between indexed and
unindexed symtypetabs: some symtypetabs the user uses will be indexed, some
won't be, and the API acts just the same: you give it a symbol index, you get
back a type.

This series finally gets us to parity with the original Solaris libctf
implementation, and better: Solaris didn't have indexed sections, so it in
practice could not have symtypetabs in child dicts at all without taking a
dreadful space cost.  Solaris also had no way to look up a symbol in more
than one dict at once (it had no equivalent of the ctf_archive_t), no way to
deduplicate function symbols of identical types, and you had to use
different functions to look up function and data symbols.


In related matters in the same series, we also fix deduplication of function
types, slightly improving deduplication efficiency (how much depends on how
many functions with identical type signatures you've got), and fix a
terrible bug with the connection between libctf and the ELF symtab and
strtab: see below.


Things that need review:

Patches 1 and 2 touch things that use a ctf_file_t to adapt to some type
renamings, but nothing more: they are purely mechanical.  Patch 3 touches
objdump and readelf, even if only the CTF parts.

Patch 4 is a fairly invasive patch (written only yesterday) that redoes the
CTF symtab hookery in bfd and ld: we were, it turns out, using the wrong
symtab and strtab for CTF the whole time, but we must not use .symtab and
.strtab because they are stripped out!  This change switches over to use
.dynsym and .dynstr (which is what Solaris did), but this is a bit more
fiddly because .dynstr is built up in pieces in many places and is never
present in a single unified lump like the symtab: so every place that swaps
symbols out to the dynsym needs a hook call added before it.  (I'd prefer to
have hooked in inside swap_symbol_out, but that doesn't know if its symbol
is going into the dynsym or not and has no access to the
bfd_link_callbacks).  We keep a hook in place for hooking into .symtab
additions as well as .dynstr, but that hook is NULL for now because we don't
need it.

The underlying CTF ctf_link* functions have changed correspondingly: you can
now call ctf_link_add_linker_symbol repeatedly to notify libctf about one
symbol at a time, possibly with a strtab offset rather than a name, then
call ctf_link_shuffle_syms at some point after you've called both that and
ctf_link_add_strtab to shuffle things into place.  It shouldn't be that hard
to review the bfd and ld bits because they are very similar to what was
there before, just hooked in a few more places and calling libctf
differently.

There are now testcases to make sure we do not regress (in particular, if we
do, ld/testsuite/ld-ctf/data-func-conflicted.d will fail).

There is a new header flag in the preamble whose absence means to use .strtab,
so that old CTF written before this change can still be read (as long as the
.strtab hasn't been stripped out, sigh). (There are other new header flags too:
see below.)
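
A reader honouring that flag might choose its string table roughly as
follows.  The flag name EX_F_DYNSTR and its bit value are placeholders
invented for this sketch (the real name and value are whatever include/ctf.h
defines), and treating the flags as a single byte is likewise an assumption.

```c
#include <stdint.h>

/* Placeholder flag bit for this sketch only, not the real definition.  */
#define EX_F_DYNSTR 0x8u

enum ex_strtab { EX_USE_STRTAB, EX_USE_DYNSTR };

/* Old CTF lacks the flag, so it falls back to .strtab; new CTF sets it and
   is read against .dynstr.  */
enum ex_strtab
ex_pick_strtab (uint8_t flags)
{
  return (flags & EX_F_DYNSTR) ? EX_USE_DYNSTR : EX_USE_STRTAB;
}
```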



(Testing status: cross-built to a lot of targets and did a make check-ld, no new
failures: also tested 32-bit and 64-bit pc-linux-gnu and FreeBSD 12.  All the
CTF tests, including the new ones, pass everywhere I've tried them.  All my
usual test matrix is done and happy other than cross-endianness tests from
SPARC/Linux and Solaris because the remote machine I was testing on panicked
while I was doing a GCC bootstrap.  I'll get my compile farm account activated
and look at it soon.)


Possibly contentious parts of the libctf side:

- We do a bunch of renaming (keeping the old names for compatibility):

  - ctf_file_t -> ctf_dict_t, struct ctf_file -> struct ctf_dict
  - ctf_arc_open_by_name* -> ctf_dict_open*.

  The new naming is, I think, clearer (a ctf_file_t is not a file at all but a
  component of a ctf_archive_t, ctf_arc_open_by_name doesn't open an archive),
  and lets us reserve the ctf_arc_* namespace for functions that operate on
  entire archives.  This is not an ABI break, and I don't think this is an API
  break because old code will continue to compile (albeit with harmless warnings
  for functions that used to take a ctf_file_t and now take a ctf_dict_t).

- The struct ctf_link_sym, in the public API, has changed incompatibly: but
  before this patch series there were no users of this type (the functions
  taking a ctf_link_sym in libctf were stubbed out), so the only people this
  might break are people using struct ctf_link_sym for their own purposes
  outside libctf.  This seems too unlikely to be concerned about.

  Several ctf_link_sym-taking functions (like ctf_link_shuffle_syms) have
  changed too. None of the affected functions are in use outside ld in any case,
  and most of them were stubbed-out and did nothing before this series.

- The file format has (technically) changed: a revised compiler is in place at
  https://github.com/oracle/gcc, branch oracle/ctf-gen-newfuncinfo. The func
  info section format emitted by the old compiler was never accepted by any
  libctf, so we can change it without worrying about backward compatibility.
  There is another new preamble flag CTF_F_NEWFUNCINFO to stop new libctf from
  trying to use the func info format from an old compiler, and vice versa.
  (Some ld-ctf tests will of course fail if you use such a compiler.)

  This compiler push is just for test purposes: the final compiler will not be
  based on this at all (though it will emit the same format), but it is enough
  to try out this libctf branch with.  The older CTF-capable compiler will
  trigger test failures in many ld-ctf tests, because the tests now assume
  correct population of the func info and data object sections.

- The code to size the emitted symtypetab sections, the code to actually emit
  them, and the code to emit the index, are distinct but fairly similar and must
  be kept in sync or you get a buffer overrun (well, actually, you get an
  assertion failure so as to avoid a buffer overrun).  The code is adjacent
  (symtypetab_density, emit_symtypetab, and emit_symtypetab_index) but this is
  still not ideal.  I tried to implement the three in terms of a common function
  and the result was unbearably complex: it's pretty complex as it is.
  (The section sizing also needs to be done in advance of the actual emission,
  because we need it to determine whether to emit an indexed section or not.)

  I'll probably be looking at this bit again and seeing if I can unify them some
  more in the future.
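
To make the sizing decision in the last item concrete, here is a minimal
sketch of comparing the two layouts before emitting anything.  It assumes
4-byte type IDs (consistent with the four bytes per symbol mentioned in [1])
and 4-byte name offsets; the ex_* names and the simple break-even rule are
illustrative assumptions, not the heuristic libctf actually uses.

```c
#include <stddef.h>
#include <stdint.h>

/* Unindexed: one type ID for every function (or object) symbol, typed or not.  */
size_t
ex_dense_size (size_t nsyms)
{
  return nsyms * sizeof (uint32_t);
}

/* Indexed: one type ID plus one name offset, but only for typed symbols.  */
size_t
ex_indexed_size (size_t ntyped)
{
  return ntyped * 2 * sizeof (uint32_t);
}

/* Nonzero if the indexed form would be strictly smaller.  */
int
ex_use_indexed (size_t nsyms, size_t ntyped)
{
  return ex_indexed_size (ntyped) < ex_dense_size (nsyms);
}
```

Under these assumptions the indexed form only wins when fewer than half of
the symbols have types.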

Not *entirely* tested yet: all the bits needed for linking and dumping work, but
automated testing of the lookup side requires a whole new testsuite I have yet
to write (and will write soon).

API/ABI changes (all more or less certain not to affect any real users outside
binutils, particularly given that no released binutils has a working CTF linker
at all yet):

New functions:
  ctf_dict_open
  ctf_dict_open_sections
  ctf_dict_close
  ctf_parent_dict

  ctf_symbol_next
  ctf_add_objt_sym
  ctf_add_func_sym

  ctf_link_add_linker_symbol

  ctf_arc_lookup_symbol
  ctf_arc_flush_caches

  ctf_getsymsect
  ctf_getstrsect

Changed functions (ignoring the ctf_file_t -> ctf_dict_t change):
  ctf_link_shuffle_syms

Changed types:
  ctf_file_t renamed to ctf_dict_t (compatibility name provided).
  ctf_link_sym_t (new fields st_nameidx, st_nameidx_set, st_symidx).

Removed types:
  ctf_link_iter_symbol_f

[1] the mapping is not quite 1:1: some symbols that cannot ever be typed are
    omitted and not encoded even as padding. This set can only be changed when
    the file format version is bumped, so the time to revise it is now! See
    libctf/ctf-create.c:ctf_symtab_skippable, which is totally Solarisy right
    now and surely needs something done to it. I wish there was a way to say
    "skip symbols in crtstuff" but there is no way to identify these reliably
    that I can see.  So, if anyone has more ideas for ways to spot sorts of
    function or data symbols that cannot ever have types from inside the linker,
    please do say, you'll save four bytes per symbol :)

Nick Alcock (12):
  libctf, include, binutils, gdb, ld: rename ctf_file_t to ctf_dict_t
  libctf, include, binutils, gdb: rename CTF-opening functions
  objdump, readelf: Report errors from CTF archive iteration
  bfd, include, ld, binutils, libctf: CTF should use the dynstr/sym
  libctf: symbol type linking support
  libctf: adjust dumper for symtypetab changes
  ld, ctf: new and adjusted CTF tests due to func info / object data
    sections
  libctf, ld: properly deduplicate function types
  libctf, include: CTF-archive-wide symbol lookup
  libctf, include: add ctf_getsymsect and ctf_getstrsect
  libctf: error-handling fixes
  libctf: do not crash when CTF symbol or variable linking fails

 bfd/elf.c                                     |   14 +-
 bfd/elflink.c                                 |   38 +-
 binutils/objdump.c                            |   19 +-
 binutils/readelf.c                            |   24 +-
 gdb/ctfread.c                                 |   40 +-
 include/bfdlink.h                             |   16 +-
 include/ctf-api.h                             |  288 ++---
 include/ctf.h                                 |   60 +-
 ld/emultempl/aix.em                           |    3 +-
 ld/emultempl/armcoff.em                       |    3 +-
 ld/emultempl/beos.em                          |    3 +-
 ld/emultempl/elf-generic.em                   |    3 +-
 ld/emultempl/elf.em                           |    3 +-
 ld/emultempl/generic.em                       |    3 +-
 ld/emultempl/linux.em                         |    3 +-
 ld/emultempl/msp430.em                        |    3 +-
 ld/emultempl/pe.em                            |    3 +-
 ld/emultempl/pep.em                           |    3 +-
 ld/emultempl/ticoff.em                        |    3 +-
 ld/emultempl/vanilla.em                       |    3 +-
 ld/ldelfgen.c                                 |  110 +-
 ld/ldelfgen.h                                 |   11 +-
 ld/ldemul.c                                   |   18 +-
 ld/ldemul.h                                   |   30 +-
 ld/ldlang.c                                   |   41 +-
 ld/ldlang.h                                   |    6 +-
 ld/ldmain.c                                   |    4 +-
 ld/testsuite/ld-ctf/array.d                   |   11 +-
 ld/testsuite/ld-ctf/conflicting-cycle-1.B-1.d |    5 +-
 ld/testsuite/ld-ctf/conflicting-cycle-1.B-2.d |    5 +-
 .../ld-ctf/conflicting-cycle-1.parent.d       |    4 +-
 ld/testsuite/ld-ctf/conflicting-cycle-2.A-1.d |    1 +
 ld/testsuite/ld-ctf/conflicting-cycle-2.A-2.d |    1 +
 .../ld-ctf/conflicting-cycle-2.parent.d       |    6 +-
 ld/testsuite/ld-ctf/conflicting-cycle-3.C-1.d |    1 +
 ld/testsuite/ld-ctf/conflicting-cycle-3.C-2.d |    1 +
 .../ld-ctf/conflicting-cycle-3.parent.d       |    1 +
 ld/testsuite/ld-ctf/cross-tu-noncyclic.d      |    4 +-
 ld/testsuite/ld-ctf/cycle-1.d                 |    4 +-
 ld/testsuite/ld-ctf/cycle-2.A.d               |    4 +-
 ld/testsuite/ld-ctf/cycle-2.B.d               |    4 +-
 ld/testsuite/ld-ctf/cycle-2.C.d               |    4 +-
 ld/testsuite/ld-ctf/data-func-1.c             | 1031 +++++++++++++++++
 ld/testsuite/ld-ctf/data-func-2.c             |    5 +
 ld/testsuite/ld-ctf/data-func-conflicted.d    |   63 +
 ld/testsuite/ld-ctf/diag-cttname-null.d       |    5 +-
 ld/testsuite/ld-ctf/diag-cuname.d             |   11 +-
 ld/testsuite/ld-ctf/diag-parlabel.d           |   12 +-
 .../ld-ctf/diag-wrong-magic-number-mixed.d    |    1 +
 ld/testsuite/ld-ctf/function.d                |    8 +-
 ld/testsuite/ld-ctf/slice.d                   |   12 +-
 ld/testsuite/ld-ctf/super-sub-cycles.d        |    1 +
 libctf/ctf-archive.c                          |  449 +++++--
 libctf/ctf-create.c                           |  939 +++++++++++++--
 libctf/ctf-decl.c                             |    2 +-
 libctf/ctf-dedup.c                            |  200 ++--
 libctf/ctf-dump.c                             |  228 ++--
 libctf/ctf-error.c                            |    2 +-
 libctf/ctf-hash.c                             |   13 +-
 libctf/ctf-impl.h                             |  241 ++--
 libctf/ctf-inlines.h                          |    6 +-
 libctf/ctf-labels.c                           |    8 +-
 libctf/ctf-link.c                             |  531 +++++++--
 libctf/ctf-lookup.c                           |  582 +++++++---
 libctf/ctf-open-bfd.c                         |   46 +-
 libctf/ctf-open.c                             |  391 ++++---
 libctf/ctf-string.c                           |   79 +-
 libctf/ctf-subr.c                             |   14 +-
 libctf/ctf-types.c                            |  136 ++-
 libctf/ctf-util.c                             |   56 +-
 libctf/libctf.ver                             |   20 +
 71 files changed, 4474 insertions(+), 1429 deletions(-)
 create mode 100644 ld/testsuite/ld-ctf/data-func-1.c
 create mode 100644 ld/testsuite/ld-ctf/data-func-2.c
 create mode 100644 ld/testsuite/ld-ctf/data-func-conflicted.d

-- 
2.29.0.249.g249b51256f

Comments

Nick Alcock via Binutils Nov. 3, 2020, 4:19 p.m. | #1
On 25 Oct 2020, Nick Alcock said:

> This patch series lets you look up the types of symbols in ELF objects in
> the func info / data object CTF sections (hereafter, "symtypetab sections")
> by just compiling with -gt with a suitable compiler (see below) and linking
> them as usual, then opening the CTF dict as usual with ctf_open and
> calling ctf_arc_lookup_symbol with a dynamic symbol index: you get a
> ctf_id_t back which you can chase around the type graph and look up
> properties of.  Function symbols and data symbols that are function pointers
> both yield types of kind CTF_K_FUNCTION.

[...]
> Things that need review:
>
> Patches 1 and 2 touch things that use a ctf_file_t to adapt to some type
> renamings, but nothing more: they are purely mechanical.  Patch 3 touches
> objdump and readelf, even if only the CTF parts.
>
> Patch 4 is a fairly invasive patch (written only yesterday) that redoes the
> CTF symtab hookery in bfd and ld: we were, it turns out, using the wrong
> symtab and strtab for CTF the whole time, but we must not use .symtab and
> .strtab because they are stripped out!  This change switches over to use
> .dynsym and .dynstr (which is what Solaris did), but this is a bit more
> fiddly because .dynstr is built up in pieces in many places and is never
> present in a single unified lump like the symtab: so every place that swaps
> symbols out to the dynsym needs a hook call added before it.  (I'd prefer to
> have hooked in inside swap_symbol_out, but that doesn't know if its symbol
> is going into the dynsym or not and has no access to the
> bfd_link_callbacks).  We keep a hook in place for hooking into .symtab
> additions as well as .dynstr, but that hook is NULL for now because we don't
> need it.


Gentle ping?

(Again, patches 1 and 2 (possibly) and 4 (definitely) need review.)
Nick Alcock via Binutils Nov. 16, 2020, 3:38 p.m. | #2
On 25 Oct 2020, Nick Alcock via Binutils said:

> This patch series lets you look up the types of symbols in ELF objects in
> the func info / data object CTF sections (hereafter, "symtypetab sections")

[...]

> Things that need review:
>
> Patches 1 and 2 touch things that use a ctf_file_t to adapt to some type
> renamings, but nothing more: they are purely mechanical.  Patch 3 touches
> objdump and readelf, even if only the CTF parts.
>
> Patch 4 is a fairly invasive patch (written only yesterday) that redoes the
> CTF symtab hookery in bfd and ld: we were, it turns out, using the wrong
> symtab and strtab for CTF the whole time, but we must not use .symtab and
> .strtab because they are stripped out!  This change switches over to use
> .dynsym and .dynstr (which is what Solaris did), but this is a bit more
> fiddly because .dynstr is built up in pieces in many places and is never
> present in a single unified lump like the symtab: so every place that swaps
> symbols out to the dynsym needs a hook call added before it.  (I'd prefer to
> have hooked in inside swap_symbol_out, but that doesn't know if its symbol
> is going into the dynsym or not and has no access to the
> bfd_link_callbacks).  We keep a hook in place for hooking into .symtab
> additions as well as .dynstr, but that hook is NULL for now because we don't
> need it.


Ping^2?

(Alan, I suspect you're the only person that could review patch 4. Rest
assured that I was kicking myself for adding to your review burden like
this :( )