[3/5] i386: Align branches within a fixed boundary

Message ID 20191112161905.10048-4-hjl.tools@gmail.com
State Superseded
Series
  • i386: Optimize for Jump Conditional Code Erratum

Commit Message

H.J. Lu Nov. 12, 2019, 4:19 p.m.
Add 3 command-line options to align branches within a fixed boundary
with segment prefixes or NOPs:

1. -malign-branch-boundary=NUM aligns branches within a NUM-byte boundary.
2. -malign-branch=TYPE[+TYPE...] specifies types of branches to align.
The supported branches are:
  a. Conditional jump.
  b. Fused conditional jump.
  c. Unconditional jump.
  d. Call.
  e. Ret.
  f. Indirect jump and call.
3. -malign-branch-prefix-size=NUM aligns branches using at most NUM segment
prefixes per instruction.
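
For example, "-malign-branch-boundary=32 -malign-branch=jcc+fused+jmp
-malign-branch-prefix-size=5" pads the selected branches so that they
neither cross nor end on a 32-byte boundary, adding at most 5 segment
prefixes to any one instruction before falling back to NOPs.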

3 new rs_machine_dependent frag types are added:

1. BRANCH_PADDING.  A variable-size frag used to insert NOPs before a branch.
2. BRANCH_PREFIX.  A variable-size frag used to add segment prefixes to
an instruction.  The prefix is chosen as follows:
   a. Use the existing segment prefix if there is one.
   b. Use CS segment prefix in 64-bit mode.
   c. In 32-bit mode, use SS segment prefix with ESP/EBP base register
   and use DS segment prefix without ESP/EBP base register.
3. FUSED_JCC_PADDING.  A variable-size frag used to insert NOPs before a
fused conditional jump.
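
A minimal, self-contained sketch of the default prefix choice in 2.b/2.c
(illustration only, not the patch code; the helper name and its boolean
parameters are made up, and case 2.a, reusing an existing segment prefix,
is not shown):

/* Segment override prefix bytes: CS = 0x2e, SS = 0x36, DS = 0x3e.  */
static unsigned char
default_padding_prefix (int code64, int base_is_esp_or_ebp)
{
  if (code64)
    return 0x2e;   /* CS is ignored in 64-bit mode.  */
  if (base_is_esp_or_ebp)
    return 0x36;   /* SS is already the default segment for ESP/EBP.  */
  return 0x3e;     /* DS is the default segment otherwise.  */
}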

The new rs_machine_dependent frags aren't inserted if the previous item
is a prefix or a constant directive, which may be used to hardcode an
instruction, since there is no clear instruction boundary.  NOP padding
is disabled before tls_get_addr calls to keep the TLS instruction sequence
unchanged.

md_estimate_size_before_relax() and i386_generic_table_relax_frag() are
used to handle BRANCH_PADDING, BRANCH_PREFIX and FUSED_JCC_PADDING frags.
i386_generic_table_relax_frag() grows or shrinks the segment prefix and
NOP sizes to align the next branch frag:

1. First try to add segment prefixes to instructions before a branch.
2. If there is not sufficient room to add segment prefixes, NOPs will be
inserted before the branch.
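
As a self-contained sketch of the rule above (illustration only, not the
patch code; the structure and function names are made up):

struct padding_plan
{
  unsigned int prefix_bytes;  /* segment prefixes added to earlier insns */
  unsigned int nop_bytes;     /* NOPs inserted just before the branch */
};

/* Padding needed so that a branch of SIZE bytes starting at ADDRESS
   neither crosses nor ends on a 2**ALIGN_POWER byte boundary, split
   between segment prefixes (at most PREFIX_ROOM bytes) and NOPs.  */
static struct padding_plan
plan_branch_padding (unsigned int address, unsigned int size,
                     unsigned int align_power, unsigned int prefix_room)
{
  struct padding_plan plan = { 0, 0 };
  unsigned int boundary = 1U << align_power;
  unsigned int offset = address & (boundary - 1);

  if (offset + size >= boundary)
    {
      unsigned int needed = boundary - offset;

      /* 1. Prefer segment prefixes on the preceding instructions.  */
      plan.prefix_bytes = needed < prefix_room ? needed : prefix_room;
      /* 2. Whatever the prefixes cannot cover becomes NOP padding.  */
      plan.nop_bytes = needed - plan.prefix_bytes;
    }
  return plan;
}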

	* config/tc-i386.c (_i386_insn): Add has_gotpc_reloc.
	(tls_get_addr): New.
	(last_insn): New.
	(align_branch_power): New.
	(align_branch_kind): New.
	(align_branch): New.
	(MAX_FUSED_JCC_PADDING_SIZE): New.
	(align_branch_prefix_size): New.
	(BRANCH_PADDING): New.
	(BRANCH_PREFIX): New.
	(FUSED_JCC_PADDING): New.
	(i386_generate_nops): Support BRANCH_PADDING and FUSED_JCC_PADDING.
	(md_begin): Abort if align_branch_prefix_size <
	MAX_FUSED_JCC_PADDING_SIZE.
	(md_assemble): Set last_insn.
	(maybe_fused_with_jcc_p): New.
	(add_fused_jcc_padding_frag_p): New.
	(add_branch_prefix_frag_p): New.
	(add_branch_padding_frag_p): New.
	(output_insn): Generate a BRANCH_PADDING, FUSED_JCC_PADDING or
	BRANCH_PREFIX frag and terminate each frag to align branches.
	(output_disp): Set i.has_gotpc_reloc to TRUE for GOTPC
	relocation.
	(output_imm): Likewise.
	(i386_next_non_empty_frag): New.
	(i386_next_jcc_frag): New.
	(i386_classify_machine_dependent_frag): New.
	(i386_branch_padding_size): New.
	(i386_generic_table_relax_frag): New.
	(md_estimate_size_before_relax): Handle BRANCH_PADDING,
	FUSED_JCC_PADDING and BRANCH_PREFIX frags.
	(md_convert_frag): Handle BRANCH_PADDING, BRANCH_PREFIX and
	FUSED_JCC_PADDING frags.
	(OPTION_MALIGN_BRANCH_BOUNDARY): New.
	(OPTION_MALIGN_BRANCH_PREFIX_SIZE): New.
	(OPTION_MALIGN_BRANCH): New.
	(md_longopts): Add -malign-branch-boundary=,
	-malign-branch-prefix-size= and -malign-branch=.
	(md_parse_option): Handle -malign-branch-boundary=,
	-malign-branch-prefix-size= and -malign-branch=.
	(md_show_usage): Display -malign-branch-boundary=,
	-malign-branch-prefix-size= and -malign-branch=.
	(i386_target_format): Set tls_get_addr.
	(i386_cons_worker): New.
	* config/tc-i386.h (i386_cons_worker): New.
	(md_cons_worker): New.
	(i386_generic_table_relax_frag): New.
	(md_generic_table_relax_frag): New.
	(i386_tc_frag_data): Add u, padding_address, length,
	max_prefix_length, prefix_length, default_prefix, cmp_size,
	classified and branch_type.
	(TC_FRAG_INIT): Initialize u, padding_address, base_opcode,
	length, max_prefix_length, prefix_length, default_prefix,
	cmp_size and classified.
	* doc/c-i386.texi: Document -malign-branch-boundary=,
	-malign-branch-prefix-size= and -malign-branch=.
---
 gas/config/tc-i386.c | 955 ++++++++++++++++++++++++++++++++++++++++++-
 gas/config/tc-i386.h |  31 ++
 gas/doc/c-i386.texi  |  26 ++
 3 files changed, 1008 insertions(+), 4 deletions(-)

-- 
2.21.0

Comments

Jan Beulich Nov. 13, 2019, 1:15 p.m. | #1
On 12.11.2019 17:19,  H.J. Lu  wrote:
> @@ -632,6 +652,31 @@ static enum check_kind

>    }

>  sse_check, operand_check = check_warning;

>  

> +/* Non-zero if branches should be aligned within power of 2 boundary.  */

> +static int align_branch_power = 0;

> +

> +/* Types of branches to align.  */

> +enum align_branch_kind

> +  {

> +    align_branch_none = 0,

> +    align_branch_jcc = 1 << 0,

> +    align_branch_fused = 1 << 1,

> +    align_branch_jmp = 1 << 2,

> +    align_branch_call = 1 << 3,

> +    align_branch_indirect = 1 << 4,

> +    align_branch_ret = 1 << 5

> +  };

> +

> +static unsigned int align_branch = (align_branch_jcc

> +				    | align_branch_fused

> +				    | align_branch_jmp);

> +

> +/* The maximum padding size for fused jcc.  */

> +#define MAX_FUSED_JCC_PADDING_SIZE 20


What is this number derived from?

> @@ -3012,6 +3070,11 @@ md_begin (void)

>        x86_dwarf2_return_column = 8;

>        x86_cie_data_alignment = -4;

>      }

> +

> +  /* NB: FUSED_JCC_PADDING frag must have sufficient room so that it

> +     can be turned into BRANCH_PREFIX frag.  */

> +  if (align_branch_prefix_size > MAX_FUSED_JCC_PADDING_SIZE)

> +    abort ();


Wouldn't such a value better be rejected at command line option
handling time?

> @@ -8178,11 +8252,181 @@ encoding_length (const fragS *start_frag, offsetT start_off,

>    return len - start_off + (frag_now_ptr - frag_now->fr_literal);

>  }

>  

> +/* Return 1 for test, and, cmp, add, sub, inc and dec which may

> +   be macro-fused with conditional jumps.  */

> +

> +static int

> +maybe_fused_with_jcc_p (void)

> +{

> +  /* No RIP address.  */

> +  if (i.base_reg && i.base_reg->reg_num == RegIP)

> +    return 0;

> +

> +  /* and, add, sub with destination register.  */

> +  if (!strcmp (i.tm.name, "and")

> +      || !strcmp (i.tm.name, "add")

> +      || !strcmp (i.tm.name, "sub"))

> +    return i.types[1].bitfield.class == Reg;

> +

> +  /* test, cmp with any register.  */

> +  if (!strcmp (i.tm.name, "test") || !strcmp (i.tm.name, "cmp"))

> +    return (i.types[0].bitfield.class == Reg

> +	    || i.types[1].bitfield.class == Reg);

> +

> +  /* inc, dec with 16/32/64-bit register.   */

> +  if (!strcmp (i.tm.name, "inc") || !strcmp (i.tm.name, "dec"))

> +    return i.types[0].bitfield.class == Reg;


Code and comment don't match up: Either you also mean 8-bit
registers and comment is wrong, or you need to extend the
condition / return value expression.

Furthermore with this being executed quite frequently, I
think using strcmp() isn't a very performant approach here.
Like elsewhere I think you want to use i.tm checks here. This
would also eliminate my worry of the code not catching cases
where optimization changes the opcode in i.tm without also
changing the name.

> +static int

> +add_fused_jcc_padding_frag_p (void)

> +{

> +  if (!align_branch_power

> +      || now_seg == absolute_section

> +      || !cpu_arch_flags.bitfield.cpui386


Why the i386 dependency?

> +      || !(align_branch & align_branch_fused))

> +    return 0;

> +

> +  if (maybe_fused_with_jcc_p ())

> +    {

> +      if (last_insn.kind != last_insn_other

> +	  && last_insn.seg == now_seg)

> +	{

> +	  if (flag_debug)

> +	    as_warn_where (last_insn.file, last_insn.line,

> +			   _("`%s` skips -malign-branch-boundary on `%s`"),

> +			   last_insn.name, i.tm.name);

> +	  return 0;

> +	}

> +      return 1;

> +    }

> +

> +  return 0;


Maybe slightly better (less indentation and a folded return) as

  if (maybe_fused_with_jcc_p ())
    {
      if (last_insn.kind == last_insn_other
	  || last_insn.seg != now_seg)
	return 1;
      if (flag_debug)
	as_warn_where (last_insn.file, last_insn.line,
		       _("`%s` skips -malign-branch-boundary on `%s`"),
		       last_insn.name, i.tm.name);
    }

  return 0;

?

> +}

> +

> +/* Return 1 if a BRANCH_PREFIX frag should be generated.  */

> +

> +static int

> +add_branch_prefix_frag_p (void)

> +{

> +  if (!align_branch_power

> +      || now_seg == absolute_section

> +      || i.tm.cpu_flags.bitfield.cpupadlock

> +      || !cpu_arch_flags.bitfield.cpui386)


Why the PadLock and i386 dependencies?

> +    return 0;

> +

> +  /* Don't add prefix if it is a prefix or there is no operand.  */

> +  if (!i.operands || i.tm.opcode_modifier.isprefix)

> +    return 0;

> +

> +  if (last_insn.kind != last_insn_other

> +      && last_insn.seg == now_seg)

> +    {

> +      if (flag_debug)

> +	as_warn_where (last_insn.file, last_insn.line,

> +		       _("`%s` skips -malign-branch-boundary on `%s`"),

> +		       last_insn.name, i.tm.name);

> +      return 0;

> +    }

> +

> +  return 1;


Like above this too can be had with less indentation.

> +static int

> +add_branch_padding_frag_p (enum align_branch_kind *branch_p)

> +{

> +  int add_padding;

> +

> +  if (!align_branch_power

> +      || now_seg == absolute_section

> +      || !cpu_arch_flags.bitfield.cpui386)


Same question as above.

> +    return 0;

> +

> +  add_padding = 0;

> +

> +  /* Check for jcc and direct jmp.  */

> +  if (i.tm.opcode_modifier.jump)

> +    {

> +      if (i.tm.base_opcode == JUMP_PC_RELATIVE)

> +	{

> +	  *branch_p = align_branch_jmp;

> +	  add_padding = align_branch & align_branch_jmp;

> +	}

> +      else

> +	{

> +	  *branch_p = align_branch_jcc;

> +	  if ((align_branch & align_branch_jcc))

> +	    add_padding = 1;

> +	}

> +    }

> +  else if (i.tm.base_opcode == 0xc2

> +	   || i.tm.base_opcode == 0xc3

> +	   || i.tm.base_opcode == 0xca

> +	   || i.tm.base_opcode == 0xcb)


These will also match various VEX/XOP/EVEX encoded templates. Perhaps
exclude any such right away in the first if()?

Also you handle near and far returns here, but ...

> +    {

> +      *branch_p = align_branch_ret;

> +      if ((align_branch & align_branch_ret))

> +	add_padding = 1;

> +    }

> +  else

> +    {

> +      if (i.tm.base_opcode == 0xe8)

> +	{

> +	  *branch_p = align_branch_call;

> +	  if ((align_branch & align_branch_call))

> +	    add_padding = 1;

> +	}

> +      else if (i.tm.base_opcode == 0xff

> +	       && (i.rm.reg == 2 || i.rm.reg == 4))


... only near indirect call/jmp here?

It would also seem more consistent to be if you used
i.tm.extension_opcode instead of i.rm.reg for the checks here.

> +	{

> +	  *branch_p = align_branch_indirect;

> +	  if ((align_branch & align_branch_indirect))

> +	    add_padding = 1;

> +	}

> +

> +      /* Check for indirect jmp, direct and indirect calls.  */


I guess this comment really belongs several lines up from here (and
a similar one is missing ahead of the RET related conditional)?

> @@ -8279,6 +8523,30 @@ output_insn (void)

>    insn_start_frag = frag_now;

>    insn_start_off = frag_now_fix ();

>  

> +  if (add_branch_padding_frag_p (&branch))

> +    {

> +      char *p;

> +      unsigned int max_branch_padding_size = 14;


Where is this number coming from? Like above (and I think also in
one more case below) any such seemingly arbitrary numbers could do
with a comment.

> @@ -8317,6 +8585,44 @@ output_insn (void)

>  	  i.prefix[LOCK_PREFIX] = 0;

>  	}

>  

> +      if (branch)

> +	/* Skip if this is a branch.  */

> +	;

> +      else if (add_fused_jcc_padding_frag_p ())

> +	{

> +	  unsigned int max_fused_padding_size

> +	    = MAX_FUSED_JCC_PADDING_SIZE;


Why the local variable? You could as well use the manifest constant
at all the use sites below as it seems. If it was just for the
identifier length, then I guess the variable could do with further
shortening.

> @@ -8461,12 +8767,81 @@ output_insn (void)

>        if (now_seg != absolute_section)

>  	{

>  	  j = encoding_length (insn_start_frag, insn_start_off, frag_more (0));

> -	  if (j > 15)

> +	  if (fragP)

> +	    {

> +	      /* NB: Don't add prefix with GOTPC relocation since

> +		 output_disp() above depends on the fixed encoding

> +		 length.  */

> +	      unsigned int max = i.has_gotpc_reloc ? 0 : 15 - j;

> +	      /* Prefix count on the current instruction.  */

> +	      unsigned int count = !!is_any_vex_encoding (&i.tm);

> +	      unsigned int k;

> +	      for (k = 0; k < ARRAY_SIZE (i.prefix); k++)

> +		if (i.prefix[k])

> +		  count++;

> +

> +	      /* NB: prefix count + instruction size must be <= 15.  */

> +	      if (j > 15)

> +		as_fatal (_("instruction length of %u bytes exceeds the limit of 15"),

> +			  j);


I think you'd better keep the if (j > 15) as the first check above
and make the new code an else-if to it. You shouldn't fail assembly
in this case, whereas the warning implying no padding attempt to
get done seems quite reasonable a behavior to me.

> +	      if (TYPE_FROM_RELAX_STATE (fragP->fr_subtype)

> +		  == BRANCH_PREFIX)

> +		{

> +		  /* Set the maximum prefix size in BRANCH_PREFIX

> +		     frag.  */

> +		  if (fragP->tc_frag_data.max_bytes > max)

> +		    fragP->tc_frag_data.max_bytes = max;

> +		  if (fragP->tc_frag_data.max_bytes > count)

> +		    fragP->tc_frag_data.max_bytes -= count;

> +		  else

> +		    fragP->tc_frag_data.max_bytes = 0;

> +		}

> +	      else

> +		{

> +		  /* Remember the maximum prefix size in FUSED_JCC_PADDING

> +		     frag.  */

> +		  unsigned int max_prefix_size;

> +		  if (align_branch_prefix_size > max)

> +		    max_prefix_size = max;

> +		  else

> +		    max_prefix_size = align_branch_prefix_size;

> +		  if (max_prefix_size > count)

> +		    fragP->tc_frag_data.max_prefix_length

> +		      = max_prefix_size - count;

> +		}

> +

> +	      /* Use existing segment prefix if possible.  Use CS

> +		 segment prefix in 64-bit mode.  In 32-bit mode, use SS

> +		 segment prefix with ESP/EBP base register and use DS

> +		 segment prefix without ESP/EBP base register.  */


Is there a reason for the preference of CS: in 64-bit mode?

> +static void

> +i386_classify_machine_dependent_frag (fragS *fragP)

> +{

> +  fragS *cmp_fragP;

> +  fragS *pad_fragP;

> +  fragS *branch_fragP;

> +  fragS *next_fragP;

> +  unsigned int max_prefix_length;

> +

> +  if (fragP->tc_frag_data.classified)

> +    return;

> +

> +  /* First scan for BRANCH_PADDING and FUSED_JCC_PADDING.  Convert

> +     FUSED_JCC_PADDING and merge BRANCH_PADDING.  */

> +  for (next_fragP = fragP;

> +       next_fragP != NULL;

> +       next_fragP = next_fragP->fr_next)

> +    {

> +      next_fragP->tc_frag_data.classified = 1;

> +      if (next_fragP->fr_type == rs_machine_dependent)

> +	switch (TYPE_FROM_RELAX_STATE (next_fragP->fr_subtype))

> +	  {

> +	  case BRANCH_PADDING:

> +	    /* The BRANCH_PADDING frag must be followed by a branch

> +	       frag.  */

> +	    branch_fragP = i386_next_non_empty_frag (next_fragP);

> +	    next_fragP->tc_frag_data.u.branch_fragP = branch_fragP;

> +	    break;

> +	  case FUSED_JCC_PADDING:

> +	    /* Check if this is a fused jcc:

> +	       FUSED_JCC_PADDING

> +	       CMP


Here and in a number of other places I wonder why you say CMP when
the set of macro-fusion candidate insns is quite a bit larger.

> +	       BRANCH_PADDING

> +	       COND_JUMP

> +	       */

> +	    cmp_fragP = i386_next_non_empty_frag (next_fragP);

> +	    pad_fragP = i386_next_non_empty_frag (cmp_fragP);

> +	    branch_fragP = i386_next_jcc_frag (pad_fragP);

> +	    if (branch_fragP)

> +	      {

> +		/* The BRANCH_PADDING frag is merged with the

> +		   FUSED_JCC_PADDING frag.  */

> +		next_fragP->tc_frag_data.u.branch_fragP = branch_fragP;

> +		/* CMP instruction size.  */

> +		next_fragP->tc_frag_data.cmp_size = cmp_fragP->fr_fix;

> +		frag_wane (pad_fragP);

> +		/* Skip to branch_fragP.  */

> +		next_fragP = branch_fragP;

> +	      }

> +	    else if (next_fragP->tc_frag_data.max_prefix_length)

> +	      {

> +		/* Turn FUSED_JCC_PADDING into BRANCH_PREFIX if it isn't

> +		   a fused jcc.  */

> +		next_fragP->fr_subtype

> +		  = ENCODE_RELAX_STATE (BRANCH_PREFIX, 0);

> +		next_fragP->tc_frag_data.max_bytes

> +		  = next_fragP->tc_frag_data.max_prefix_length;

> +		/* This will be updated in the BRANCH_PREFIX scan.  */

> +		next_fragP->tc_frag_data.max_prefix_length = 0;

> +	      }

> +	    else

> +	      frag_wane (next_fragP);

> +	    break;

> +	  }

> +    }

> +

> +  /* Scan for BRANCH_PREFIX.  */

> +  for (; fragP != NULL; fragP = fragP->fr_next)

> +    if (fragP->fr_type == rs_machine_dependent

> +	&& (TYPE_FROM_RELAX_STATE (fragP->fr_subtype)

> +	    == BRANCH_PREFIX))

> +      {

> +	/* Count all BRANCH_PREFIX frags before BRANCH_PADDING and

> +	   COND_JUMP_PREFIX.  */

> +	max_prefix_length = 0;

> +	for (next_fragP = fragP;

> +	     next_fragP != NULL;

> +	     next_fragP = next_fragP->fr_next)

> +	  {

> +	    if (next_fragP->fr_type == rs_fill)

> +	      /* Skip rs_fill frags.  */

> +	      ;

> +	    else if (next_fragP->fr_type == rs_machine_dependent)

> +	      {

> +		if (TYPE_FROM_RELAX_STATE (next_fragP->fr_subtype)

> +		    == BRANCH_PREFIX)

> +		  {

> +		    /* Count BRANCH_PREFIX frags.  */

> +		    if (max_prefix_length >= MAX_FUSED_JCC_PADDING_SIZE)

> +		      {

> +			max_prefix_length = MAX_FUSED_JCC_PADDING_SIZE;

> +			frag_wane (next_fragP);

> +		      }

> +		    else

> +		      max_prefix_length

> +			+= next_fragP->tc_frag_data.max_bytes;

> +		  }

> +		else if ((TYPE_FROM_RELAX_STATE (next_fragP->fr_subtype)

> +			  == BRANCH_PADDING)

> +			 || (TYPE_FROM_RELAX_STATE (next_fragP->fr_subtype)

> +			     == FUSED_JCC_PADDING))

> +		  {

> +		    /* Stop at BRANCH_PADDING and FUSED_JCC_PADDING.  */

> +		    fragP->tc_frag_data.u.padding_fragP = next_fragP;

> +		    break;

> +		  }

> +		else

> +		  /* Stop for other rs_machine_dependent frags.  */

> +		  break;

> +	      }

> +	    else

> +	      /* Stop for all other frags.  */

> +	      break;


Some indentation depth could again be saved by re-arranging the code.

> +static int

> +i386_branch_padding_size (fragS *fragP, offsetT address)

> +{

> +  unsigned int offset, size, padding_size;

> +  fragS *branch_fragP = fragP->tc_frag_data.u.branch_fragP;

> +

> +  /* The start address of the BRANCH_PADDING or FUSED_JCC_PADDING frag.  */

> +  if (!address)

> +    address = fragP->fr_address;

> +  address += fragP->fr_fix;

> +

> +  /* CMP instrunction size.  */

> +  size = fragP->tc_frag_data.cmp_size;

> +

> +  /* The base size of the branch frag.  */

> +  size += branch_fragP->fr_fix;

> +

> +  /* Add opcode and displacement bytes for the rs_machine_dependent

> +     branch frag.  */

> +  if (branch_fragP->fr_type == rs_machine_dependent)

> +    size += md_relax_table[branch_fragP->fr_subtype].rlx_length;

> +

> +  /* Check if branch is within boundary and doesn't end at the last

> +     byte.  */

> +  offset = address & ((1U << align_branch_power) - 1);

> +  if ((offset + size) >= (1U << align_branch_power))

> +    /* Padding needed to avoid crossing boundary.  */

> +    padding_size = (1 << align_branch_power) - offset;


Shouldn't command line option handling make sure the shifts here
won't become undefined due to too large a shift count? Also isn't
the last 1 missing a U suffix?

> +  else

> +    /* No padding needed.  */

> +    padding_size = 0;

> +

> +  if (!fits_in_signed_byte (padding_size))

> +    abort ();


Why? The function's return type is int (should probably be
unsigned int).

> @@ -11483,6 +12323,78 @@ md_parse_option (int c, const char *arg)

>          as_fatal (_("invalid -mrelax-relocations= option: `%s'"), arg);

>        break;

>  

> +    case OPTION_MALIGN_BRANCH_BOUNDARY:

> +      {

> +	char *end;

> +	int align = strtoul (arg, &end, 0);

> +	if (*end == '\0')

> +	  {

> +	    if (align == 0)

> +	      {

> +		align_branch_power = 0;

> +		break;

> +	      }

> +	    else if (align >= 32)


Why the enforcement of 32 as the lower boundary? I.e. why would 16
not be an option as well?

> +	      {

> +		int align_power;

> +		for (align_power = 0;

> +		     (align & 1) == 0;

> +		     align >>= 1, align_power++)

> +		  continue;

> +		if (align == 1)

> +		  {

> +		    align_branch_power = align_power;

> +		    break;

> +		  }

> +	      }

> +	  }

> +	as_fatal (_("invalid -malign-branch-boundary= value: %s"), arg);

> +      }

> +      break;

> +

> +    case OPTION_MALIGN_BRANCH_PREFIX_SIZE:

> +      {

> +	char *end;

> +	int align = strtoul (arg, &end, 0);

> +	if (*end == '\0' && align >= 0 && align < 6)


Similarly here - why is 5 the upper boundary on the accepted values?

> +	  {

> +	    align_branch_prefix_size = align;

> +	    break;

> +	  }

> +	as_fatal (_("invalid -malign-branch-prefix-size= value: %s"),

> +		  arg);

> +      }

> +      break;

> +

> +    case OPTION_MALIGN_BRANCH:

> +      align_branch = 0;

> +      saved = xstrdup (arg);

> +      type = saved;

> +      do

> +	{

> +	  next = strchr (type, '+');

> +	  if (next)

> +	    *next++ = '\0';

> +	  if (strcasecmp (type, "jcc") == 0)

> +	    align_branch |= align_branch_jcc;

> +	  else if (strcasecmp (type, "fused") == 0)

> +	    align_branch |= align_branch_fused;

> +	  else if (strcasecmp (type, "jmp") == 0)

> +	    align_branch |= align_branch_jmp;

> +	  else if (strcasecmp (type, "call") == 0)

> +	    align_branch |= align_branch_call;

> +	  else if (strcasecmp (type, "ret") == 0)

> +	    align_branch |= align_branch_ret;

> +	  else if (strcasecmp (type, "indirect") == 0)

> +	    align_branch |= align_branch_indirect;


Aren't command line options generally case sensitive, i.e. wouldn't
you better use strcmp() here?

> @@ -11943,6 +12875,21 @@ s_bss (int ignore ATTRIBUTE_UNUSED)

>  

>  #endif

>  

> +/* Remember constant diretive.  */

> +

> +void

> +i386_cons_worker (int ignore ATTRIBUTE_UNUSED)

> +{

> +  if (last_insn.kind != last_insn_directive

> +      && (bfd_section_flags (now_seg) & SEC_CODE))

> +    {

> +      last_insn.seg = now_seg;

> +      last_insn.kind = last_insn_directive;

> +      last_insn.name = "constant diretive";


Nit: Missing 'c' in directive.

> @@ -250,10 +257,24 @@ extern i386_cpu_flags cpu_arch_isa_flags;

>  

>  struct i386_tc_frag_data

>  {

> +  union

> +    {

> +      fragS *padding_fragP;

> +      fragS *branch_fragP;

> +    } u;

> +  addressT padding_address;

>    enum processor_type isa;

>    i386_cpu_flags isa_flags;

>    enum processor_type tune;

>    unsigned int max_bytes;

> +  signed char length;

> +  signed char last_length;

> +  signed char max_prefix_length;

> +  signed char prefix_length;

> +  signed char default_prefix;

> +  signed char cmp_size;


Why signed char for all of these? Most if not all look to be
holding unsigned quantities only.

> +  unsigned int classified : 1;

> +  unsigned int branch_type : 7;


This doesn't require 7 bits, does it?

Jan
H.J. Lu Nov. 14, 2019, 11:21 p.m. | #2
On Wed, Nov 13, 2019 at 5:15 AM Jan Beulich <jbeulich@suse.com> wrote:
>

> On 12.11.2019 17:19,  H.J. Lu  wrote:

> > @@ -632,6 +652,31 @@ static enum check_kind

> >    }

> >  sse_check, operand_check = check_warning;

> >

> > +/* Non-zero if branches should be aligned within power of 2 boundary.  */

> > +static int align_branch_power = 0;

> > +

> > +/* Types of branches to align.  */

> > +enum align_branch_kind

> > +  {

> > +    align_branch_none = 0,

> > +    align_branch_jcc = 1 << 0,

> > +    align_branch_fused = 1 << 1,

> > +    align_branch_jmp = 1 << 2,

> > +    align_branch_call = 1 << 3,

> > +    align_branch_indirect = 1 << 4,

> > +    align_branch_ret = 1 << 5

> > +  };

> > +

> > +static unsigned int align_branch = (align_branch_jcc

> > +                                 | align_branch_fused

> > +                                 | align_branch_jmp);

> > +

> > +/* The maximum padding size for fused jcc.  */

> > +#define MAX_FUSED_JCC_PADDING_SIZE 20

>

> What is this number derived from?


CMP can be 9 bytes and jcc can be 6 bytes.   I leave some room
for prefixes.

> > @@ -3012,6 +3070,11 @@ md_begin (void)

> >        x86_dwarf2_return_column = 8;

> >        x86_cie_data_alignment = -4;

> >      }

> > +

> > +  /* NB: FUSED_JCC_PADDING frag must have sufficient room so that it

> > +     can be turned into BRANCH_PREFIX frag.  */

> > +  if (align_branch_prefix_size > MAX_FUSED_JCC_PADDING_SIZE)

> > +    abort ();

>

> Wouldn't such a value better be rejected at command line option

> handling time?


There is

/* The maximum number of prefixes added for an instruction.  */
static unsigned int align_branch_prefix_size = 5;

I'd like to check it even if it isn't changed.

> > @@ -8178,11 +8252,181 @@ encoding_length (const fragS *start_frag, offsetT start_off,

> >    return len - start_off + (frag_now_ptr - frag_now->fr_literal);

> >  }

> >

> > +/* Return 1 for test, and, cmp, add, sub, inc and dec which may

> > +   be macro-fused with conditional jumps.  */

> > +

> > +static int

> > +maybe_fused_with_jcc_p (void)

> > +{

> > +  /* No RIP address.  */

> > +  if (i.base_reg && i.base_reg->reg_num == RegIP)

> > +    return 0;

> > +

> > +  /* and, add, sub with destination register.  */

> > +  if (!strcmp (i.tm.name, "and")

> > +      || !strcmp (i.tm.name, "add")

> > +      || !strcmp (i.tm.name, "sub"))

> > +    return i.types[1].bitfield.class == Reg;

> > +

> > +  /* test, cmp with any register.  */

> > +  if (!strcmp (i.tm.name, "test") || !strcmp (i.tm.name, "cmp"))

> > +    return (i.types[0].bitfield.class == Reg

> > +         || i.types[1].bitfield.class == Reg);

> > +

> > +  /* inc, dec with 16/32/64-bit register.   */

> > +  if (!strcmp (i.tm.name, "inc") || !strcmp (i.tm.name, "dec"))

> > +    return i.types[0].bitfield.class == Reg;

>

> Code and comment don't match up: Either you also mean 8-bit

> registers and comment is wrong, or you need to extend the

> condition / return value expression.


Fixed.

> Furthermore with this being executed quite frequently, I

> think using strcmp() isn't a very performant approach here.


I timed a big assembly file:

[hjl@gnu-cfl-1 jcc-11]$ ls -lh torque_base_jumbo_2.s
-rw-rw-r-- 1 hjl hjl 154M Nov 13 08:13 torque_base_jumbo_2.s
[hjl@gnu-cfl-1 jcc-11]$ time ./as -o torque_base_jumbo_2.o
torque_base_jumbo_2.s -mbranches-within-32B-boundaries

real 0m7.534s
user 0m7.204s
sys 0m0.318s
[hjl@gnu-cfl-1 jcc-11]$ time ./as -o torque_base_jumbo_2.o torque_base_jumbo_2.s

real 0m7.472s
user 0m7.152s
sys 0m0.305s
[hjl@gnu-cfl-1 jcc-11]$

Time difference is very small.  We can improve it later.

> Like elsewhere I think you want to use i.tm checks here. This

> would also eliminate my worry of the code not catching cases

> where optimization changes the opcode in i.tm without also

> changing the name.


I don't think CMP instructions can be replaced by totally different
instructions.

> > +static int

> > +add_fused_jcc_padding_frag_p (void)

> > +{

> > +  if (!align_branch_power

> > +      || now_seg == absolute_section

> > +      || !cpu_arch_flags.bitfield.cpui386

>

> Why the i386 dependency?


It doesn't work with COND_JUMP86 without i386.

> > +      || !(align_branch & align_branch_fused))

> > +    return 0;

> > +

> > +  if (maybe_fused_with_jcc_p ())

> > +    {

> > +      if (last_insn.kind != last_insn_other

> > +       && last_insn.seg == now_seg)

> > +     {

> > +       if (flag_debug)

> > +         as_warn_where (last_insn.file, last_insn.line,

> > +                        _("`%s` skips -malign-branch-boundary on `%s`"),

> > +                        last_insn.name, i.tm.name);

> > +       return 0;

> > +     }

> > +      return 1;

> > +    }

> > +

> > +  return 0;

>

> Maybe slightly better (less indentation and a folded return) as

>

>   if (maybe_fused_with_jcc_p ())

>     {

>       if (last_insn.kind == last_insn_other

>           || last_insn.seg != now_seg)

>         return 1;

>       if (flag_debug)

>         as_warn_where (last_insn.file, last_insn.line,

>                        _("`%s` skips -malign-branch-boundary on `%s`"),

>                        last_insn.name, i.tm.name);

>     }

>

>   return 0;

>

> ?


Fixed.

> > +}

> > +

> > +/* Return 1 if a BRANCH_PREFIX frag should be generated.  */

> > +

> > +static int

> > +add_branch_prefix_frag_p (void)

> > +{

> > +  if (!align_branch_power

> > +      || now_seg == absolute_section

> > +      || i.tm.cpu_flags.bitfield.cpupadlock

> > +      || !cpu_arch_flags.bitfield.cpui386)

>

> Why the PadLock and i386 dependencies?


Since PadLock instructions include prefixes in opcode, I don't
want to mess with them.

> > +    return 0;

> > +

> > +  /* Don't add prefix if it is a prefix or there is no operand.  */

> > +  if (!i.operands || i.tm.opcode_modifier.isprefix)

> > +    return 0;

> > +

> > +  if (last_insn.kind != last_insn_other

> > +      && last_insn.seg == now_seg)

> > +    {

> > +      if (flag_debug)

> > +     as_warn_where (last_insn.file, last_insn.line,

> > +                    _("`%s` skips -malign-branch-boundary on `%s`"),

> > +                    last_insn.name, i.tm.name);

> > +      return 0;

> > +    }

> > +

> > +  return 1;

>

> Like above this too can be had with less indentation.


Fixed.

> > +static int

> > +add_branch_padding_frag_p (enum align_branch_kind *branch_p)

> > +{

> > +  int add_padding;

> > +

> > +  if (!align_branch_power

> > +      || now_seg == absolute_section

> > +      || !cpu_arch_flags.bitfield.cpui386)

>

> Same question as above.

>

> > +    return 0;

> > +

> > +  add_padding = 0;

> > +

> > +  /* Check for jcc and direct jmp.  */

> > +  if (i.tm.opcode_modifier.jump)

> > +    {

> > +      if (i.tm.base_opcode == JUMP_PC_RELATIVE)

> > +     {

> > +       *branch_p = align_branch_jmp;

> > +       add_padding = align_branch & align_branch_jmp;

> > +     }

> > +      else

> > +     {

> > +       *branch_p = align_branch_jcc;

> > +       if ((align_branch & align_branch_jcc))

> > +         add_padding = 1;

> > +     }

> > +    }

> > +  else if (i.tm.base_opcode == 0xc2

> > +        || i.tm.base_opcode == 0xc3

> > +        || i.tm.base_opcode == 0xca

> > +        || i.tm.base_opcode == 0xcb)

>

> These will also match various VEX/XOP/EVEX encoded templates. Perhaps

> exclude any such right away in the first if()?

>

> Also you handle near and far returns here, but ...


I will change it to

  else if (is_any_vex_encoding (&i.tm))
    return 0;
  else if (i.tm.base_opcode == 0xc2 || i.tm.base_opcode == 0xc3)
    {
      /* Near ret.  */
      *branch_p = align_branch_ret;
      if ((align_branch & align_branch_ret))
        add_padding = 1;
    }

> > +    {

> > +      *branch_p = align_branch_ret;

> > +      if ((align_branch & align_branch_ret))

> > +     add_padding = 1;

> > +    }

> > +  else

> > +    {

> > +      if (i.tm.base_opcode == 0xe8)

> > +     {

> > +       *branch_p = align_branch_call;

> > +       if ((align_branch & align_branch_call))

> > +         add_padding = 1;

> > +     }

> > +      else if (i.tm.base_opcode == 0xff

> > +            && (i.rm.reg == 2 || i.rm.reg == 4))

>

> ... only near indirect call/jmp here?


Far ret will be removed.

> It would also seem more consistent to be if you used

> i.tm.extension_opcode instead of i.rm.reg for the checks here.


Fixed.

> > +     {

> > +       *branch_p = align_branch_indirect;

> > +       if ((align_branch & align_branch_indirect))

> > +         add_padding = 1;

> > +     }

> > +

> > +      /* Check for indirect jmp, direct and indirect calls.  */

>

> I guess this comment really belongs several lines up from here (and

> a similar one is missing ahead of the RET related conditional)?


Fixed.

> > @@ -8279,6 +8523,30 @@ output_insn (void)

> >    insn_start_frag = frag_now;

> >    insn_start_off = frag_now_fix ();

> >

> > +  if (add_branch_padding_frag_p (&branch))

> > +    {

> > +      char *p;

> > +      unsigned int max_branch_padding_size = 14;

>

> Where is this number coming from? Like above (and I think also in

> one more case below) any such seemingly arbitrary numbers could do

> with a comment.


Branch can be 8 bytes.  I leave some room for prefixes.

> > @@ -8317,6 +8585,44 @@ output_insn (void)

> >         i.prefix[LOCK_PREFIX] = 0;

> >       }

> >

> > +      if (branch)

> > +     /* Skip if this is a branch.  */

> > +     ;

> > +      else if (add_fused_jcc_padding_frag_p ())

> > +     {

> > +       unsigned int max_fused_padding_size

> > +         = MAX_FUSED_JCC_PADDING_SIZE;

>

> Why the local variable? You could as well use the manifest constant

> at all the use sites below as it seems. If it was just for the

> identifier length, then I guess the variable could do with further

> shortening.


I will remove max_fused_padding_size and use
MAX_FUSED_JCC_PADDING_SIZE.

> > @@ -8461,12 +8767,81 @@ output_insn (void)

> >        if (now_seg != absolute_section)

> >       {

> >         j = encoding_length (insn_start_frag, insn_start_off, frag_more (0));

> > -       if (j > 15)

> > +       if (fragP)

> > +         {

> > +           /* NB: Don't add prefix with GOTPC relocation since

> > +              output_disp() above depends on the fixed encoding

> > +              length.  */

> > +           unsigned int max = i.has_gotpc_reloc ? 0 : 15 - j;

> > +           /* Prefix count on the current instruction.  */

> > +           unsigned int count = !!is_any_vex_encoding (&i.tm);

> > +           unsigned int k;

> > +           for (k = 0; k < ARRAY_SIZE (i.prefix); k++)

> > +             if (i.prefix[k])

> > +               count++;

> > +

> > +           /* NB: prefix count + instruction size must be <= 15.  */

> > +           if (j > 15)

> > +             as_fatal (_("instruction length of %u bytes exceeds the limit of 15"),

> > +                       j);

>

> I think you'd better keep the if (j > 15) as the first check above

> and make the new code an else-if to it. You shouldn't fail assembly

> in this case, whereas the warning implying no padding attempt to

> get done seems quite reasonable a behavior to me.


Fixed.

> > +           if (TYPE_FROM_RELAX_STATE (fragP->fr_subtype)

> > +               == BRANCH_PREFIX)

> > +             {

> > +               /* Set the maximum prefix size in BRANCH_PREFIX

> > +                  frag.  */

> > +               if (fragP->tc_frag_data.max_bytes > max)

> > +                 fragP->tc_frag_data.max_bytes = max;

> > +               if (fragP->tc_frag_data.max_bytes > count)

> > +                 fragP->tc_frag_data.max_bytes -= count;

> > +               else

> > +                 fragP->tc_frag_data.max_bytes = 0;

> > +             }

> > +           else

> > +             {

> > +               /* Remember the maximum prefix size in FUSED_JCC_PADDING

> > +                  frag.  */

> > +               unsigned int max_prefix_size;

> > +               if (align_branch_prefix_size > max)

> > +                 max_prefix_size = max;

> > +               else

> > +                 max_prefix_size = align_branch_prefix_size;

> > +               if (max_prefix_size > count)

> > +                 fragP->tc_frag_data.max_prefix_length

> > +                   = max_prefix_size - count;

> > +             }

> > +

> > +           /* Use existing segment prefix if possible.  Use CS

> > +              segment prefix in 64-bit mode.  In 32-bit mode, use SS

> > +              segment prefix with ESP/EBP base register and use DS

> > +              segment prefix without ESP/EBP base register.  */

>

> Is there a reason for the preference of CS: in 64-bit mode?


CS is ignored in 64-bit mode.

> > +static void

> > +i386_classify_machine_dependent_frag (fragS *fragP)

> > +{

> > +  fragS *cmp_fragP;

> > +  fragS *pad_fragP;

> > +  fragS *branch_fragP;

> > +  fragS *next_fragP;

> > +  unsigned int max_prefix_length;

> > +

> > +  if (fragP->tc_frag_data.classified)

> > +    return;

> > +

> > +  /* First scan for BRANCH_PADDING and FUSED_JCC_PADDING.  Convert

> > +     FUSED_JCC_PADDING and merge BRANCH_PADDING.  */

> > +  for (next_fragP = fragP;

> > +       next_fragP != NULL;

> > +       next_fragP = next_fragP->fr_next)

> > +    {

> > +      next_fragP->tc_frag_data.classified = 1;

> > +      if (next_fragP->fr_type == rs_machine_dependent)

> > +     switch (TYPE_FROM_RELAX_STATE (next_fragP->fr_subtype))

> > +       {

> > +       case BRANCH_PADDING:

> > +         /* The BRANCH_PADDING frag must be followed by a branch

> > +            frag.  */

> > +         branch_fragP = i386_next_non_empty_frag (next_fragP);

> > +         next_fragP->tc_frag_data.u.branch_fragP = branch_fragP;

> > +         break;

> > +       case FUSED_JCC_PADDING:

> > +         /* Check if this is a fused jcc:

> > +            FUSED_JCC_PADDING

> > +            CMP

>

> Here and in a number of other places I wonder why you say CMP when

> the set of macro-fusion candidate insns is quite a bit larger.


I use CMP to refer to all these instructions.

> > +            BRANCH_PADDING

> > +            COND_JUMP

> > +            */

> > +         cmp_fragP = i386_next_non_empty_frag (next_fragP);

> > +         pad_fragP = i386_next_non_empty_frag (cmp_fragP);

> > +         branch_fragP = i386_next_jcc_frag (pad_fragP);

> > +         if (branch_fragP)

> > +           {

> > +             /* The BRANCH_PADDING frag is merged with the

> > +                FUSED_JCC_PADDING frag.  */

> > +             next_fragP->tc_frag_data.u.branch_fragP = branch_fragP;

> > +             /* CMP instruction size.  */

> > +             next_fragP->tc_frag_data.cmp_size = cmp_fragP->fr_fix;

> > +             frag_wane (pad_fragP);

> > +             /* Skip to branch_fragP.  */

> > +             next_fragP = branch_fragP;

> > +           }

> > +         else if (next_fragP->tc_frag_data.max_prefix_length)

> > +           {

> > +             /* Turn FUSED_JCC_PADDING into BRANCH_PREFIX if it isn't

> > +                a fused jcc.  */

> > +             next_fragP->fr_subtype

> > +               = ENCODE_RELAX_STATE (BRANCH_PREFIX, 0);

> > +             next_fragP->tc_frag_data.max_bytes

> > +               = next_fragP->tc_frag_data.max_prefix_length;

> > +             /* This will be updated in the BRANCH_PREFIX scan.  */

> > +             next_fragP->tc_frag_data.max_prefix_length = 0;

> > +           }

> > +         else

> > +           frag_wane (next_fragP);

> > +         break;

> > +       }

> > +    }

> > +

> > +  /* Scan for BRANCH_PREFIX.  */

> > +  for (; fragP != NULL; fragP = fragP->fr_next)

> > +    if (fragP->fr_type == rs_machine_dependent

> > +     && (TYPE_FROM_RELAX_STATE (fragP->fr_subtype)

> > +         == BRANCH_PREFIX))

> > +      {

> > +     /* Count all BRANCH_PREFIX frags before BRANCH_PADDING and

> > +        COND_JUMP_PREFIX.  */

> > +     max_prefix_length = 0;

> > +     for (next_fragP = fragP;

> > +          next_fragP != NULL;

> > +          next_fragP = next_fragP->fr_next)

> > +       {

> > +         if (next_fragP->fr_type == rs_fill)

> > +           /* Skip rs_fill frags.  */

> > +           ;

> > +         else if (next_fragP->fr_type == rs_machine_dependent)

> > +           {

> > +             if (TYPE_FROM_RELAX_STATE (next_fragP->fr_subtype)

> > +                 == BRANCH_PREFIX)

> > +               {

> > +                 /* Count BRANCH_PREFIX frags.  */

> > +                 if (max_prefix_length >= MAX_FUSED_JCC_PADDING_SIZE)

> > +                   {

> > +                     max_prefix_length = MAX_FUSED_JCC_PADDING_SIZE;

> > +                     frag_wane (next_fragP);

> > +                   }

> > +                 else

> > +                   max_prefix_length

> > +                     += next_fragP->tc_frag_data.max_bytes;

> > +               }

> > +             else if ((TYPE_FROM_RELAX_STATE (next_fragP->fr_subtype)

> > +                       == BRANCH_PADDING)

> > +                      || (TYPE_FROM_RELAX_STATE (next_fragP->fr_subtype)

> > +                          == FUSED_JCC_PADDING))

> > +               {

> > +                 /* Stop at BRANCH_PADDING and FUSED_JCC_PADDING.  */

> > +                 fragP->tc_frag_data.u.padding_fragP = next_fragP;

> > +                 break;

> > +               }

> > +             else

> > +               /* Stop for other rs_machine_dependent frags.  */

> > +               break;

> > +           }

> > +         else

> > +           /* Stop for all other frags.  */

> > +           break;

>

> Some indentation depth could again be saved by re-arranging the code.


I will try it.

> > +static int

> > +i386_branch_padding_size (fragS *fragP, offsetT address)

> > +{

> > +  unsigned int offset, size, padding_size;

> > +  fragS *branch_fragP = fragP->tc_frag_data.u.branch_fragP;

> > +

> > +  /* The start address of the BRANCH_PADDING or FUSED_JCC_PADDING frag.  */

> > +  if (!address)

> > +    address = fragP->fr_address;

> > +  address += fragP->fr_fix;

> > +

> > +  /* CMP instrunction size.  */

> > +  size = fragP->tc_frag_data.cmp_size;

> > +

> > +  /* The base size of the branch frag.  */

> > +  size += branch_fragP->fr_fix;

> > +

> > +  /* Add opcode and displacement bytes for the rs_machine_dependent

> > +     branch frag.  */

> > +  if (branch_fragP->fr_type == rs_machine_dependent)

> > +    size += md_relax_table[branch_fragP->fr_subtype].rlx_length;

> > +

> > +  /* Check if branch is within boundary and doesn't end at the last

> > +     byte.  */

> > +  offset = address & ((1U << align_branch_power) - 1);

> > +  if ((offset + size) >= (1U << align_branch_power))

> > +    /* Padding needed to avoid crossing boundary.  */

> > +    padding_size = (1 << align_branch_power) - offset;

>

> Shouldn't command line option handling make sure the shifts here

> won't become undefined due to too large a shift count? Also isn't

> the last 1 missing a U suffix?


I will add 1U and limit align_branch_power to 31.

> > +  else

> > +    /* No padding needed.  */

> > +    padding_size = 0;

> > +

> > +  if (!fits_in_signed_byte (padding_size))

> > +    abort ();

>

> Why? The function's return type is int (should probably be

> unsigned int).


It must fit in tc_frag_data.length.

> > @@ -11483,6 +12323,78 @@ md_parse_option (int c, const char *arg)

> >          as_fatal (_("invalid -mrelax-relocations= option: `%s'"), arg);

> >        break;

> >

> > +    case OPTION_MALIGN_BRANCH_BOUNDARY:

> > +      {

> > +     char *end;

> > +     int align = strtoul (arg, &end, 0);

> > +     if (*end == '\0')

> > +       {

> > +         if (align == 0)

> > +           {

> > +             align_branch_power = 0;

> > +             break;

> > +           }

> > +         else if (align >= 32)

>

> Why the enforcement of 32 as the lower boundary? I.e. why would 16

> not be an option as well?


CMP can be 9 bytes and jcc can be 6 bytes.  15 bytes is too close to
16 bytes.

> > +           {

> > +             int align_power;

> > +             for (align_power = 0;

> > +                  (align & 1) == 0;

> > +                  align >>= 1, align_power++)

> > +               continue;

> > +             if (align == 1)

> > +               {

> > +                 align_branch_power = align_power;

> > +                 break;

> > +               }

> > +           }

> > +       }

> > +     as_fatal (_("invalid -malign-branch-boundary= value: %s"), arg);

> > +      }

> > +      break;

> > +

> > +    case OPTION_MALIGN_BRANCH_PREFIX_SIZE:

> > +      {

> > +     char *end;

> > +     int align = strtoul (arg, &end, 0);

> > +     if (*end == '\0' && align >= 0 && align < 6)

>

> Similarly here - why is 5 the upper boundary on the accepted values?


I am told that some processors only support 5 prefixes.

> > +       {

> > +         align_branch_prefix_size = align;

> > +         break;

> > +       }

> > +     as_fatal (_("invalid -malign-branch-prefix-size= value: %s"),

> > +               arg);

> > +      }

> > +      break;

> > +

> > +    case OPTION_MALIGN_BRANCH:

> > +      align_branch = 0;

> > +      saved = xstrdup (arg);

> > +      type = saved;

> > +      do

> > +     {

> > +       next = strchr (type, '+');

> > +       if (next)

> > +         *next++ = '\0';

> > +       if (strcasecmp (type, "jcc") == 0)

> > +         align_branch |= align_branch_jcc;

> > +       else if (strcasecmp (type, "fused") == 0)

> > +         align_branch |= align_branch_fused;

> > +       else if (strcasecmp (type, "jmp") == 0)

> > +         align_branch |= align_branch_jmp;

> > +       else if (strcasecmp (type, "call") == 0)

> > +         align_branch |= align_branch_call;

> > +       else if (strcasecmp (type, "ret") == 0)

> > +         align_branch |= align_branch_ret;

> > +       else if (strcasecmp (type, "indirect") == 0)

> > +         align_branch |= align_branch_indirect;

>

> Aren't command line options generally case sensitive, i.e. wouldn't

> you better use strcmp() here?


I'd like to leave strcasecmp there just in case.

> > @@ -11943,6 +12875,21 @@ s_bss (int ignore ATTRIBUTE_UNUSED)

> >

> >  #endif

> >

> > +/* Remember constant diretive.  */

> > +

> > +void

> > +i386_cons_worker (int ignore ATTRIBUTE_UNUSED)

> > +{

> > +  if (last_insn.kind != last_insn_directive

> > +      && (bfd_section_flags (now_seg) & SEC_CODE))

> > +    {

> > +      last_insn.seg = now_seg;

> > +      last_insn.kind = last_insn_directive;

> > +      last_insn.name = "constant diretive";

>

> Nit: Missing 'c' in directive.


I will fix it.

> > @@ -250,10 +257,24 @@ extern i386_cpu_flags cpu_arch_isa_flags;

> >

> >  struct i386_tc_frag_data

> >  {

> > +  union

> > +    {

> > +      fragS *padding_fragP;

> > +      fragS *branch_fragP;

> > +    } u;

> > +  addressT padding_address;

> >    enum processor_type isa;

> >    i386_cpu_flags isa_flags;

> >    enum processor_type tune;

> >    unsigned int max_bytes;

> > +  signed char length;

> > +  signed char last_length;

> > +  signed char max_prefix_length;

> > +  signed char prefix_length;

> > +  signed char default_prefix;

> > +  signed char cmp_size;

>

> Why signed char for all of these? Most if not all look to be

> holding unsigned quantities only.


I will change it to unsigned char.

> > +  unsigned int classified : 1;

> > +  unsigned int branch_type : 7;

>

> This doesn't require 7 bits, does it?

>


I will change it to

unsigned int branch_type : 6;

Thanks.

-- 
H.J.
Jan Beulich Nov. 15, 2019, 9:16 a.m. | #3
On 15.11.2019 00:21,  H.J. Lu  wrote:
> On Wed, Nov 13, 2019 at 5:15 AM Jan Beulich <jbeulich@suse.com> wrote:

>> On 12.11.2019 17:19,  H.J. Lu  wrote:

>>> @@ -632,6 +652,31 @@ static enum check_kind

>>>    }

>>>  sse_check, operand_check = check_warning;

>>>

>>> +/* Non-zero if branches should be aligned within power of 2 boundary.  */

>>> +static int align_branch_power = 0;

>>> +

>>> +/* Types of branches to align.  */

>>> +enum align_branch_kind

>>> +  {

>>> +    align_branch_none = 0,

>>> +    align_branch_jcc = 1 << 0,

>>> +    align_branch_fused = 1 << 1,

>>> +    align_branch_jmp = 1 << 2,

>>> +    align_branch_call = 1 << 3,

>>> +    align_branch_indirect = 1 << 4,

>>> +    align_branch_ret = 1 << 5

>>> +  };

>>> +

>>> +static unsigned int align_branch = (align_branch_jcc

>>> +                                 | align_branch_fused

>>> +                                 | align_branch_jmp);

>>> +

>>> +/* The maximum padding size for fused jcc.  */

>>> +#define MAX_FUSED_JCC_PADDING_SIZE 20

>>

>> What is this number derived from?

> 

> CMP can be 9 bytes and jcc can be 6 bytes.   I leave some room

> for prefixes.


Can you extend the comment to this effect please?

>>> @@ -3012,6 +3070,11 @@ md_begin (void)

>>>        x86_dwarf2_return_column = 8;

>>>        x86_cie_data_alignment = -4;

>>>      }

>>> +

>>> +  /* NB: FUSED_JCC_PADDING frag must have sufficient room so that it

>>> +     can be turned into BRANCH_PREFIX frag.  */

>>> +  if (align_branch_prefix_size > MAX_FUSED_JCC_PADDING_SIZE)

>>> +    abort ();

>>

>> Wouldn't such a value better be rejected at command line option

>> handling time?

> 

> There is

> 

> /* The maximum number of prefixes added for an instruction.  */

> static unsigned int align_branch_prefix_size = 5;

> 

> I'd like to check it even if it isn't changed.


Ok, okay.

>> Like elsewhere I think you want to use i.tm checks here. This

>> would also eliminate my worry of the code not catching cases

>> where optimization changes the opcode in i.tm without also

>> changing the name.

> 

> It don't think CMP instructions can be replaced by totally different

> instructions.


But there's also e.g. AND in the set, which I would expect
optimization to be able to switch _to_ (not from, as you were
implying).

>>> +static int

>>> +add_fused_jcc_padding_frag_p (void)

>>> +{

>>> +  if (!align_branch_power

>>> +      || now_seg == absolute_section

>>> +      || !cpu_arch_flags.bitfield.cpui386

>>

>> Why the i386 dependency?

> 

> It doesn't work with COND_JUMP86 without i386.


Why would it not work? People using the option may need to
re-code some Jcc to keep the targets reachable despite the
inserted padding, but that's not a reason to suppress the
adjustments altogether.

>>> +static int

>>> +add_branch_prefix_frag_p (void)

>>> +{

>>> +  if (!align_branch_power

>>> +      || now_seg == absolute_section

>>> +      || i.tm.cpu_flags.bitfield.cpupadlock

>>> +      || !cpu_arch_flags.bitfield.cpui386)

>>

>> Why the PadLock and i386 dependencies?

> 

> Since PadLock instructions include prefixes in opcode, I don't

> want to mess with them.


How are PadLock insns different in this regard from e.g. SSE insns?

>>> +  else if (i.tm.base_opcode == 0xc2

>>> +        || i.tm.base_opcode == 0xc3

>>> +        || i.tm.base_opcode == 0xca

>>> +        || i.tm.base_opcode == 0xcb)

>>

>> These will also match various VEX/XOP/EVEX encoded templates. Perhaps

>> exclude any such right away in the first if()?

>>

>> Also you handle near and far returns here, but ...

> 

> I will change it to

> 

>   else if (is_any_vex_encoding (&i.tm))

>     return 0;


Why not move this even earlier in the function?

>   else if (i.tm.base_opcode == 0xc2 || i.tm.base_opcode == 0xc3)


Maybe (i.tm.base_opcode | 1) == 0xc3?

>>> @@ -8279,6 +8523,30 @@ output_insn (void)

>>>    insn_start_frag = frag_now;

>>>    insn_start_off = frag_now_fix ();

>>>

>>> +  if (add_branch_padding_frag_p (&branch))

>>> +    {

>>> +      char *p;

>>> +      unsigned int max_branch_padding_size = 14;

>>

>> Where is this number coming from? Like above (and I think also in

>> one more case below) any such seemingly arbitrary numbers could do

>> with a comment.

> 

> Branch can be 8 bytes.  I leave some room for prefixes.


Again please explain this in a comment.

>>> +           /* Use existing segment prefix if possible.  Use CS

>>> +              segment prefix in 64-bit mode.  In 32-bit mode, use SS

>>> +              segment prefix with ESP/EBP base register and use DS

>>> +              segment prefix without ESP/EBP base register.  */

>>

>> Is there a reason for the preference of CS: in 64-bit mode?

> 

> CS is ignored in 64-bit mode.


As are DS, ES, and SS. I.e. my question could also be put as "Why
different rules for 64- and 32-bit code?"

>>> +static void

>>> +i386_classify_machine_dependent_frag (fragS *fragP)

>>> +{

>>> +  fragS *cmp_fragP;

>>> +  fragS *pad_fragP;

>>> +  fragS *branch_fragP;

>>> +  fragS *next_fragP;

>>> +  unsigned int max_prefix_length;

>>> +

>>> +  if (fragP->tc_frag_data.classified)

>>> +    return;

>>> +

>>> +  /* First scan for BRANCH_PADDING and FUSED_JCC_PADDING.  Convert

>>> +     FUSED_JCC_PADDING and merge BRANCH_PADDING.  */

>>> +  for (next_fragP = fragP;

>>> +       next_fragP != NULL;

>>> +       next_fragP = next_fragP->fr_next)

>>> +    {

>>> +      next_fragP->tc_frag_data.classified = 1;

>>> +      if (next_fragP->fr_type == rs_machine_dependent)

>>> +     switch (TYPE_FROM_RELAX_STATE (next_fragP->fr_subtype))

>>> +       {

>>> +       case BRANCH_PADDING:

>>> +         /* The BRANCH_PADDING frag must be followed by a branch

>>> +            frag.  */

>>> +         branch_fragP = i386_next_non_empty_frag (next_fragP);

>>> +         next_fragP->tc_frag_data.u.branch_fragP = branch_fragP;

>>> +         break;

>>> +       case FUSED_JCC_PADDING:

>>> +         /* Check if this is a fused jcc:

>>> +            FUSED_JCC_PADDING

>>> +            CMP

>>

>> Here and in a number of other places I wonder why you say CMP when

>> the set of macro-fusion candidate insns is quite a bit larger.

> 

> I use CMP to refer all these instructions.


May I ask that at least in comments you clarify it's not just CMP.
This doesn't mean to re-enumerate all the insns, but at least say
something like "et al".

>>> +  if (!fits_in_signed_byte (padding_size))

>>> +    abort ();

>>

>> Why? The function's return type is int (should probably be

>> unsigned int).

> 

> It must fit tc_frag_data.length.


And how would one recognize the connection when e.g. widening the
field?

>>> @@ -11483,6 +12323,78 @@ md_parse_option (int c, const char *arg)

>>>          as_fatal (_("invalid -mrelax-relocations= option: `%s'"), arg);

>>>        break;

>>>

>>> +    case OPTION_MALIGN_BRANCH_BOUNDARY:

>>> +      {

>>> +     char *end;

>>> +     int align = strtoul (arg, &end, 0);

>>> +     if (*end == '\0')

>>> +       {

>>> +         if (align == 0)

>>> +           {

>>> +             align_branch_power = 0;

>>> +             break;

>>> +           }

>>> +         else if (align >= 32)

>>

>> Why the enforcement of 32 as the lower boundary? I.e. why would 16

>> not be an option as well?

> 

> CMP can be 9 bytes and jcc can be 6 bytes.  15 bytes is too close to

> 16 bytes.


I don't understand: Making such a pair fit within a single aligned
16-byte block doesn't seem impossible at all.

>>> +           {

>>> +             int align_power;

>>> +             for (align_power = 0;

>>> +                  (align & 1) == 0;

>>> +                  align >>= 1, align_power++)

>>> +               continue;

>>> +             if (align == 1)

>>> +               {

>>> +                 align_branch_power = align_power;

>>> +                 break;

>>> +               }

>>> +           }

>>> +       }

>>> +     as_fatal (_("invalid -malign-branch-boundary= value: %s"), arg);

>>> +      }

>>> +      break;

>>> +

>>> +    case OPTION_MALIGN_BRANCH_PREFIX_SIZE:

>>> +      {

>>> +     char *end;

>>> +     int align = strtoul (arg, &end, 0);

>>> +     if (*end == '\0' && align >= 0 && align < 6)

>>

>> Similarly here - why is 5 the upper boundary on the accepted values?

> 

> I am told that some processors only support 5 prefixes.


Now that very certainly requires a comment to explain. Furthermore -
what would happen on such processors if there was a 6th one? Is this
merely a performance concern, or worse?

>>> +  unsigned int classified : 1;

>>> +  unsigned int branch_type : 7;

>>

>> This doesn't require 7 bits, does it?

>>

> 

> I will change it to

> 

> unsigned int branch_type : 6;


Isn't 6 still too high? I realize the enumerators stored into here
aren't truly enumerators, but add_branch_padding_frag_p() only ever
stores one of the values (i.e. they don't get OR-ed together).

Jan
H.J. Lu Nov. 15, 2019, 11:11 p.m. | #4
On Fri, Nov 15, 2019 at 1:16 AM Jan Beulich <jbeulich@suse.com> wrote:
>

> On 15.11.2019 00:21,  H.J. Lu  wrote:

> > On Wed, Nov 13, 2019 at 5:15 AM Jan Beulich <jbeulich@suse.com> wrote:

> >> On 12.11.2019 17:19,  H.J. Lu  wrote:

> >>> @@ -632,6 +652,31 @@ static enum check_kind

> >>>    }

> >>>  sse_check, operand_check = check_warning;

> >>>

> >>> +/* Non-zero if branches should be aligned within power of 2 boundary.  */

> >>> +static int align_branch_power = 0;

> >>> +

> >>> +/* Types of branches to align.  */

> >>> +enum align_branch_kind

> >>> +  {

> >>> +    align_branch_none = 0,

> >>> +    align_branch_jcc = 1 << 0,

> >>> +    align_branch_fused = 1 << 1,

> >>> +    align_branch_jmp = 1 << 2,

> >>> +    align_branch_call = 1 << 3,

> >>> +    align_branch_indirect = 1 << 4,

> >>> +    align_branch_ret = 1 << 5

> >>> +  };

> >>> +

> >>> +static unsigned int align_branch = (align_branch_jcc

> >>> +                                 | align_branch_fused

> >>> +                                 | align_branch_jmp);

> >>> +

> >>> +/* The maximum padding size for fused jcc.  */

> >>> +#define MAX_FUSED_JCC_PADDING_SIZE 20

> >>

> >> What is this number derived from?

> >

> > CMP can be 9 bytes and jcc can be 6 bytes.   I leave some room

> > for prefixes.

>

> Can you extend the comment to this effect please?


I did:

/* The maximum padding size for fused jcc.  CMP can be 9 bytes and jcc
   can be 6 bytes.  Leave room just in case for prefixes.   */
#define MAX_FUSED_JCC_PADDING_SIZE 20
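
(For reference, the arithmetic behind the 20: up to 9 bytes for the
CMP-like instruction plus up to 6 bytes for the jcc gives 15 bytes for
the fused pair, and the remaining 5 bytes of slack match the default
-malign-branch-prefix-size of 5.)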

> >>> @@ -3012,6 +3070,11 @@ md_begin (void)

> >>>        x86_dwarf2_return_column = 8;

> >>>        x86_cie_data_alignment = -4;

> >>>      }

> >>> +

> >>> +  /* NB: FUSED_JCC_PADDING frag must have sufficient room so that it

> >>> +     can be turned into BRANCH_PREFIX frag.  */

> >>> +  if (align_branch_prefix_size > MAX_FUSED_JCC_PADDING_SIZE)

> >>> +    abort ();

> >>

> >> Wouldn't such a value better be rejected at command line option

> >> handling time?

> >

> > There is

> >

> > /* The maximum number of prefixes added for an instruction.  */

> > static unsigned int align_branch_prefix_size = 5;

> >

> > I'd like to check it even if it isn't changed.

>

> Ok, okay.

>

> >> Like elsewhere I think you want to use i.tm checks here. This

> >> would also eliminate my worry of the code not catching cases

> >> where optimization changes the opcode in i.tm without also

> >> changing the name.

> >

> > I don't think CMP instructions can be replaced by totally different

> > instructions.

>

> But there's also e.g. AND in the set, which I would expect

> optimization to be able to switch _to_ (not from, as you were

> implying).


What other instructions can be encoded as AND?

> >>> +static int

> >>> +add_fused_jcc_padding_frag_p (void)

> >>> +{

> >>> +  if (!align_branch_power

> >>> +      || now_seg == absolute_section

> >>> +      || !cpu_arch_flags.bitfield.cpui386

> >>

> >> Why the i386 dependency?

> >

> > It doesn't work with COND_JUMP86 without i386.

>

> Why would it not work? People using the option may need to

> re-code some Jcc to keep the targets reachable despite the

> inserted padding, but that's not a reason to suppress the

> adjustments altogether.


This option isn't intended for processors older than the i386.

> >>> +static int

> >>> +add_branch_prefix_frag_p (void)

> >>> +{

> >>> +  if (!align_branch_power

> >>> +      || now_seg == absolute_section

> >>> +      || i.tm.cpu_flags.bitfield.cpupadlock

> >>> +      || !cpu_arch_flags.bitfield.cpui386)

> >>

> >> Why the PadLock and i386 dependencies?

> >

> > Since PadLock instructions include prefixes in opcode, I don't

> > want to mess with them.

>

> How are PadLock insns different in this regard from e.g. SSE insns?


There are

             if (i.tm.base_opcode & 0xff000000)
                {
                  prefix = (i.tm.base_opcode >> 24) & 0xff;
                  if (!i.tm.cpu_flags.bitfield.cpupadlock
                      || prefix != REPE_PREFIX_OPCODE
                      || (i.prefix[REP_PREFIX] != REPE_PREFIX_OPCODE))
                    add_prefix (prefix);

I don't want to change it for PadLock.

> >>> +  else if (i.tm.base_opcode == 0xc2

> >>> +        || i.tm.base_opcode == 0xc3

> >>> +        || i.tm.base_opcode == 0xca

> >>> +        || i.tm.base_opcode == 0xcb)

> >>

> >> These will also match various VEX/XOP/EVEX encoded templates. Perhaps

> >> exclude any such right away in the first if()?

> >>

> >> Also you handle near and far returns here, but ...

> >

> > I will change it to

> >

> >   else if (is_any_vex_encoding (&i.tm))

> >     return 0;

>

> Why not move this even earlier in the function?


I'd like to delay the is_any_vex_encoding check until after

if (i.tm.opcode_modifier.jump)

> >   else if (i.tm.base_opcode == 0xc2 || i.tm.base_opcode == 0xc3)

>

> Maybe (i.tm.base_opcode | 1) == 0xc3?


I put it in.

> >>> @@ -8279,6 +8523,30 @@ output_insn (void)

> >>>    insn_start_frag = frag_now;

> >>>    insn_start_off = frag_now_fix ();

> >>>

> >>> +  if (add_branch_padding_frag_p (&branch))

> >>> +    {

> >>> +      char *p;

> >>> +      unsigned int max_branch_padding_size = 14;

> >>

> >> Where is this number coming from? Like above (and I think also in

> >> one more case below) any such seemingly arbitrary numbers could do

> >> with a comment.

> >

> > Branch can be 8 bytes.  I leave some room for prefixes.

>

> Again please explain this in a comment.


I did:

     /* Branch can be 8 bytes.  Leave some room for prefixes.  */
      unsigned int max_branch_padding_size = 14;
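
(For reference: with up to 8 bytes for the branch itself, the 14-byte
maximum leaves 6 bytes of room for prefixes.)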

> >>> +           /* Use existing segment prefix if possible.  Use CS

> >>> +              segment prefix in 64-bit mode.  In 32-bit mode, use SS

> >>> +              segment prefix with ESP/EBP base register and use DS

> >>> +              segment prefix without ESP/EBP base register.  */

> >>

> >> Is there a reason for the preference of CS: in 64-bit mode?

> >

> > CS is ignored in 64-bit mode.

>

> As are DS, ES, and SS. I.e. my question could also be put as "Why


We pick CS.

> different rules for 64- and 32-bit code?"


Since CS isn't ignored in 32-bit mode, we can't use CS.  We pick the
default segment register in 32-bit mode.

> >>> +static void

> >>> +i386_classify_machine_dependent_frag (fragS *fragP)

> >>> +{

> >>> +  fragS *cmp_fragP;

> >>> +  fragS *pad_fragP;

> >>> +  fragS *branch_fragP;

> >>> +  fragS *next_fragP;

> >>> +  unsigned int max_prefix_length;

> >>> +

> >>> +  if (fragP->tc_frag_data.classified)

> >>> +    return;

> >>> +

> >>> +  /* First scan for BRANCH_PADDING and FUSED_JCC_PADDING.  Convert

> >>> +     FUSED_JCC_PADDING and merge BRANCH_PADDING.  */

> >>> +  for (next_fragP = fragP;

> >>> +       next_fragP != NULL;

> >>> +       next_fragP = next_fragP->fr_next)

> >>> +    {

> >>> +      next_fragP->tc_frag_data.classified = 1;

> >>> +      if (next_fragP->fr_type == rs_machine_dependent)

> >>> +     switch (TYPE_FROM_RELAX_STATE (next_fragP->fr_subtype))

> >>> +       {

> >>> +       case BRANCH_PADDING:

> >>> +         /* The BRANCH_PADDING frag must be followed by a branch

> >>> +            frag.  */

> >>> +         branch_fragP = i386_next_non_empty_frag (next_fragP);

> >>> +         next_fragP->tc_frag_data.u.branch_fragP = branch_fragP;

> >>> +         break;

> >>> +       case FUSED_JCC_PADDING:

> >>> +         /* Check if this is a fused jcc:

> >>> +            FUSED_JCC_PADDING

> >>> +            CMP

> >>

> >> Here and in a number of other places I wonder why you say CMP when

> >> the set of macro-fusion candidate insns is quite a bit larger.

> >

> > I use CMP to refer to all these instructions.

>

> May I ask that at least in comments you clarify it's not just CMP.

> This doesn't mean to re-enumerate all the insns, but at least say

> something like "et al".


I replaced CMP with "CMP like instruction".

> >>> +  if (!fits_in_signed_byte (padding_size))

> >>> +    abort ();

> >>

> >> Why? The function's return type is int (should probably be

> >> unsigned int).

> >

> > It must fit tc_frag_data.length.

>

> And how would one recognize the connection when e.g. widening the

> field?


I did:

  /* The return value may be saved in tc_frag_data.length which is
     unsigned byte.  */
  if (!fits_in_unsigned_byte (padding_size))
    abort ();

> >>> @@ -11483,6 +12323,78 @@ md_parse_option (int c, const char *arg)

> >>>          as_fatal (_("invalid -mrelax-relocations= option: `%s'"), arg);

> >>>        break;

> >>>

> >>> +    case OPTION_MALIGN_BRANCH_BOUNDARY:

> >>> +      {

> >>> +     char *end;

> >>> +     int align = strtoul (arg, &end, 0);

> >>> +     if (*end == '\0')

> >>> +       {

> >>> +         if (align == 0)

> >>> +           {

> >>> +             align_branch_power = 0;

> >>> +             break;

> >>> +           }

> >>> +         else if (align >= 32)

> >>

> >> Why the enforcement of 32 as the lower boundary? I.e. why would 16

> >> not be an option as well?

> >

> > CMP can be 9 bytes and jcc can be 6 bytes.  15 bytes is too close to

> > 16 bytes.

>

> I don't understand: Making such a pair fit within a single aligned

> 16-byte block doesn't seem impossible at all.


It can be 16 bytes:

[hjl@gnu-cfl-1 jcc-1]$ cat f.s
cmpw  -12000(%eax,%ebx,8), %r8w
je foo
[hjl@gnu-cfl-1 jcc-1]$ gcc -c f.s
[hjl@gnu-cfl-1 jcc-1]$ objdump -dw f.o

f.o:     file format elf64-x86-64


Disassembly of section .text:

0000000000000000 <.text>:
   0: 67 66 44 3b 84 d8 20 d1 ff ff cmp    -0x2ee0(%eax,%ebx,8),%r8w
   a: 0f 84 00 00 00 00    je     0x10
[hjl@gnu-cfl-1 jcc-1]$

I prefer not to lower it to 16 bytes unless there is a good reason.
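
(For illustration, assuming the option spelling from this patch, the test
above could be assembled with

  as -malign-branch-boundary=32 f.s -o f.o

so that the 16-byte fused pair is padded only when it would otherwise
cross a 32-byte boundary.)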

> >>> +           {

> >>> +             int align_power;

> >>> +             for (align_power = 0;

> >>> +                  (align & 1) == 0;

> >>> +                  align >>= 1, align_power++)

> >>> +               continue;

> >>> +             if (align == 1)

> >>> +               {

> >>> +                 align_branch_power = align_power;

> >>> +                 break;

> >>> +               }

> >>> +           }

> >>> +       }

> >>> +     as_fatal (_("invalid -malign-branch-boundary= value: %s"), arg);

> >>> +      }

> >>> +      break;

> >>> +

> >>> +    case OPTION_MALIGN_BRANCH_PREFIX_SIZE:

> >>> +      {

> >>> +     char *end;

> >>> +     int align = strtoul (arg, &end, 0);

> >>> +     if (*end == '\0' && align >= 0 && align < 6)

> >>

> >> Similarly here - why is 5 the upper boundary on the accepted values?

> >

> > I am told that some processors only support 5 prefixes.

>

> Now that very certainly requires a comment to explain. Furthermore -

> what would happen on such processors if there was a 6th one? Is this

> merely a performance concern, or worse?


I don't know exactly what will happen.  I just cap it to 5.

> >>> +  unsigned int classified : 1;

> >>> +  unsigned int branch_type : 7;

> >>

> >> This doesn't require 7 bits, does it?

> >>

> >

> > I will change it to

> >

> > unsigned int branch_type : 6;

>

> Isn't 6 still too high? I realize the enumerators stored into here

> aren't truly enumerators, but add_branch_padding_frag_p() only ever

> stores one of the values (i.e. they don't get OR-ed together).

>


I changed it to 3 bits with:

/* Types of branches to align.  */
enum align_branch_kind
  {
    align_branch_none = 0,
    align_branch_jcc = 1,
    align_branch_fused = 2,
    align_branch_jmp = 3,
    align_branch_call = 4,
    align_branch_indirect = 5,
    align_branch_ret = 6
  };

/* Type bits of branches to align.  */
enum align_branch_bit
  {
    align_branch_jcc_bit = 1 << align_branch_jcc,
    align_branch_fused_bit = 1 << align_branch_fused,
    align_branch_jmp_bit = 1 << align_branch_jmp,
    align_branch_call_bit = 1 << align_branch_call,
    align_branch_indirect_bit = 1 << align_branch_indirect,
    align_branch_ret_bit = 1 << align_branch_ret
  };
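
To make the connection between the two enums explicit, here is a minimal
sketch of the intended use (illustrative only, not code from the patch):
the frag's 3-bit branch_type field stores the small kind value, and the
-malign-branch option mask is tested via the matching bit, since
align_branch_jcc_bit == 1 << align_branch_jcc and so on.

/* Illustrative sketch only, assuming the two enums above.  Kind values
   0..6 fit in 3 bits; the option mask can be tested directly from the
   stored kind.  */
static int
branch_kind_selected (unsigned int align_branch_mask,
                      enum align_branch_kind kind)
{
  return (align_branch_mask & (1u << kind)) != 0;
}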

Thanks.

-- 
H.J.
Jan Beulich Nov. 18, 2019, 10:50 a.m. | #5
On 16.11.2019 00:11, H.J. Lu wrote:
> On Fri, Nov 15, 2019 at 1:16 AM Jan Beulich <jbeulich@suse.com> wrote:

>> On 15.11.2019 00:21,  H.J. Lu  wrote:

>>> On Wed, Nov 13, 2019 at 5:15 AM Jan Beulich <jbeulich@suse.com> wrote:

>>>> On 12.11.2019 17:19,  H.J. Lu  wrote:

>>>> Like elsewhere I think you want to use i.tm checks here. This

>>>> would also eliminate my worry of the code not catching cases

>>>> where optimization changes the opcode in i.tm without also

>>>> changing the name.

>>>

>>> I don't think CMP instructions can be replaced by totally different

>>> instructions.

>>

>> But there's also e.g. AND in the set, which I would expect

>> optimization to be able to switch _to_ (not from, as you were

>> implying).

> 

> What other instructions can be encoded as AND?


I didn't say there _are_ such optimizations, but I wouldn't exclude
there might be in the future. Other aspects of the patch already
suggested that the combination of these new options and the
optimization ones may not have been fully taken into consideration
in this first version of the patch. I'd simply be afraid of such
an aspect being overlooked later on (when adding further optimizations)
as well. Yet the changes here are useful only if they indeed manage
to cover _all_ problematic code sequences, not just most of them.

>>>>> +static int

>>>>> +add_fused_jcc_padding_frag_p (void)

>>>>> +{

>>>>> +  if (!align_branch_power

>>>>> +      || now_seg == absolute_section

>>>>> +      || !cpu_arch_flags.bitfield.cpui386

>>>>

>>>> Why the i386 dependency?

>>>

>>> It doesn't work with COND_JUMP86 without i386.

>>

>> Why would it not work? People using the option may need to

>> re-code some Jcc to keep the targets reachable despite the

>> inserted padding, but that's not a reason to suppress the

>> adjustments altogether.

> 

> This option isn't intended for processors older than the i386.


Can this be stated then in the documentation you add?

>>>>> +static int

>>>>> +add_branch_prefix_frag_p (void)

>>>>> +{

>>>>> +  if (!align_branch_power

>>>>> +      || now_seg == absolute_section

>>>>> +      || i.tm.cpu_flags.bitfield.cpupadlock

>>>>> +      || !cpu_arch_flags.bitfield.cpui386)

>>>>

>>>> Why the PadLock and i386 dependencies?

>>>

>>> Since PadLock instructions include prefixes in opcode, I don't

>>> want to mess with them.

>>

>> How are PadLock insns different in this regard from e.g. SSE insns?

> 

> There are

> 

>              if (i.tm.base_opcode & 0xff000000)

>                 {

>                   prefix = (i.tm.base_opcode >> 24) & 0xff;

>                   if (!i.tm.cpu_flags.bitfield.cpupadlock

>                       || prefix != REPE_PREFIX_OPCODE

>                       || (i.prefix[REP_PREFIX] != REPE_PREFIX_OPCODE))

>                     add_prefix (prefix);

> 

> I don't want to change it for PadLock.


I understand that you don't want to do this, but you didn't
answer my question. Next to the code fragment you quote there
is similar code dealing with prefixes on (as said) e.g. SSE
insns. Why is it okay to leave SSE insns as they are, but
treat PadLock ones specially?

>>>>> +           /* Use existing segment prefix if possible.  Use CS

>>>>> +              segment prefix in 64-bit mode.  In 32-bit mode, use SS

>>>>> +              segment prefix with ESP/EBP base register and use DS

>>>>> +              segment prefix without ESP/EBP base register.  */

>>>>

>>>> Is there a reason for the preference of CS: in 64-bit mode?

>>>

>>> CS is ignored in 64-bit mode.

>>

>> As are DS, ES, and SS. I.e. my question could also be put as "Why

> 

> We pick CS.

> 

>> different rules for 64- and 32-bit code?"

> 

> Since CS isn't ignored in 32-bit mode, we can't use CS.  We pick the

> default segment register in 32-bit mode.


Which again doesn't answer my question: Code would be simpler if
there didn't need to be a distinction between 32- and 64-bit
modes. Hence the tighter constraints of 32-bit mode could be
applied to 64-bit mode as well. I certainly understand that the
more relaxed constraints of 64-bit mode can't be applied to
32-bit mode handling.

>>>>> +           {

>>>>> +             int align_power;

>>>>> +             for (align_power = 0;

>>>>> +                  (align & 1) == 0;

>>>>> +                  align >>= 1, align_power++)

>>>>> +               continue;

>>>>> +             if (align == 1)

>>>>> +               {

>>>>> +                 align_branch_power = align_power;

>>>>> +                 break;

>>>>> +               }

>>>>> +           }

>>>>> +       }

>>>>> +     as_fatal (_("invalid -malign-branch-boundary= value: %s"), arg);

>>>>> +      }

>>>>> +      break;

>>>>> +

>>>>> +    case OPTION_MALIGN_BRANCH_PREFIX_SIZE:

>>>>> +      {

>>>>> +     char *end;

>>>>> +     int align = strtoul (arg, &end, 0);

>>>>> +     if (*end == '\0' && align >= 0 && align < 6)

>>>>

>>>> Similarly here - why is 5 the upper boundary on the accepted values?

>>>

>>> I am told that some processors only support 5 prefixes.

>>

>> Now that very certainly requires a comment to explain. Furthermore -

>> what would happen on such processors if there was a 6th one? Is this

>> merely a performance concern, or worse?

> 

> I don't know exactly what will happen.  I just cap it to 5.


That's unfortunate then. Five years from now, how will anyone
understand why this seemingly artificial limit is the way it is
(without digging out this mail thread)?

Thanks for all the other adjustments you did.

Jan
Jan Beulich Nov. 20, 2019, 9:07 a.m. | #6
On 18.11.2019 11:50, Jan Beulich wrote:
> On 16.11.2019 00:11, H.J. Lu wrote:

>> On Fri, Nov 15, 2019 at 1:16 AM Jan Beulich <jbeulich@suse.com> wrote:

>>> On 15.11.2019 00:21,  H.J. Lu  wrote:

>>>> On Wed, Nov 13, 2019 at 5:15 AM Jan Beulich <jbeulich@suse.com> wrote:

>>>>> On 12.11.2019 17:19,  H.J. Lu  wrote:

>>>>> Like elsewhere I think you want to use i.tm checks here. This

>>>>> would also eliminate my worry of the code not catching cases

>>>>> where optimization changes the opcode in i.tm without also

>>>>> changing the name.

>>>>

>>>> I don't think CMP instructions can be replaced by totally different

>>>> instructions.

>>>

>>> But there's also e.g. AND in the set, which I would expect

>>> optimization to be able to switch _to_ (not from, as you were

>>> implying).

>>

>> What other instructions can be encoded as AND?

> 

> I didn't say there _are_ such optimizations, but I wouldn't exclude

> there might be in the future. Other aspects of the patch already

> suggested that the combination of these new options and the

> optimization ones may not have been fully taken into consideration

> in this first version of the patch. I'd simply be afraid of such

> an aspect being overlooked later on (when adding further optimizations)

> as well. Yet the changes here are useful only if they indeed manage

> to cover _all_ problematic code sequences, not just most of them.


Actually AND isn't the best example. TEST is even better. We
already have such an optimization: We translate various OR and
AND to TEST. If you want to catch these TESTs, then you either
need to make optimize_encoding() update the insn's name, or
(preferable imo) recognize insns here by looking at the opcodes.
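
For illustration only, such an opcode-based check might look roughly like
the sketch below.  This is not code from the patch; the values are the
TEST base opcodes from the SDM (84/85 for TEST r/m,r, A8/A9 for
TEST acc,imm, F6/F7 /0 for TEST r/m,imm), and a real check would follow
the conventions of gas's opcode table, which may fold operand-size
variants.

/* Sketch: recognize TEST by opcode rather than by name, so that insns
   rewritten by optimize_encoding() (e.g. OR/AND turned into TEST) are
   still caught.  */
static int
is_test_by_opcode (void)
{
  switch (i.tm.base_opcode)
    {
    case 0x84: case 0x85:
    case 0xa8: case 0xa9:
      return 1;
    case 0xf6: case 0xf7:
      /* /0 is the TEST extension of the F6/F7 opcode group.  */
      return i.tm.extension_opcode == 0;
    default:
      return 0;
    }
}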

Jan
H.J. Lu Dec. 3, 2019, 7:10 p.m. | #7
On Wed, Nov 20, 2019 at 1:07 AM Jan Beulich <jbeulich@suse.com> wrote:
>

> On 18.11.2019 11:50, Jan Beulich wrote:

> > On 16.11.2019 00:11, H.J. Lu wrote:

> >> On Fri, Nov 15, 2019 at 1:16 AM Jan Beulich <jbeulich@suse.com> wrote:

> >>> On 15.11.2019 00:21,  H.J. Lu  wrote:

> >>>> On Wed, Nov 13, 2019 at 5:15 AM Jan Beulich <jbeulich@suse.com> wrote:

> >>>>> On 12.11.2019 17:19,  H.J. Lu  wrote:

> >>>>> Like elsewhere I think you want to use i.tm checks here. This

> >>>>> would also eliminate my worry of the code not catching cases

> >>>>> where optimization changes the opcode in i.tm without also

> >>>>> changing the name.

> >>>>

> >>>> I don't think CMP instructions can be replaced by totally different

> >>>> instructions.

> >>>

> >>> But there's also e.g. AND in the set, which I would expect

> >>> optimization to be able to switch _to_ (not from, as you were

> >>> implying).

> >>

> >> What other instructions can be encoded as AND?

> >

> > I didn't say there _are_ such optimizations, but I wouldn't exclude

> > there might be in the future. Other aspects of the patch already

> > suggested that the combination of these new options and the

> > optimization ones may not have been fully taken into consideration

> > in this first version of the patch. I'd simply be afraid of such

> > an aspect being overlooked later on (when adding further optimizations)

> > as well. Yet the changes here are useful only if they indeed manage

> > to cover _all_ problematic code sequences, not just most of them.

>

> Actually AND isn't the best example. TEST is even better. We

> already have such an optimization: We translate various OR and

> AND to TEST. If you want to catch these TESTs, then you either

> need to make optimize_encoding() update the insn's name, or

> (preferable imo) recognize insns here by looking at the opcodes.

>

> Jan


I submitted a new set of patches:

https://sourceware.org/ml/binutils/2019-12/msg00035.html
https://sourceware.org/ml/binutils/2019-12/msg00032.html
https://sourceware.org/ml/binutils/2019-12/msg00038.html
https://sourceware.org/ml/binutils/2019-12/msg00039.html
https://sourceware.org/ml/binutils/2019-12/msg00037.html


-- 
H.J.

Patch

diff --git a/gas/config/tc-i386.c b/gas/config/tc-i386.c
index c55904165a..cb96f8b348 100644
--- a/gas/config/tc-i386.c
+++ b/gas/config/tc-i386.c
@@ -365,6 +365,9 @@  struct _i386_insn
     /* Has ZMM register operands.  */
     bfd_boolean has_regzmm;
 
+    /* Has GOTPC relocation.  */
+    bfd_boolean has_gotpc_reloc;
+
     /* RM and SIB are the modrm byte and the sib byte where the
        addressing modes of this insn are encoded.  */
     modrm_byte rm;
@@ -559,6 +562,8 @@  static enum flag_code flag_code;
 static unsigned int object_64bit;
 static unsigned int disallow_64bit_reloc;
 static int use_rela_relocations = 0;
+/* __tls_get_addr/___tls_get_addr symbol for TLS.  */
+static const char *tls_get_addr;
 
 #if ((defined (OBJ_MAYBE_COFF) && defined (OBJ_MAYBE_AOUT)) \
      || defined (OBJ_ELF) || defined (OBJ_MAYBE_ELF) \
@@ -619,6 +624,21 @@  static int omit_lock_prefix = 0;
    "lock addl $0, (%{re}sp)".  */
 static int avoid_fence = 0;
 
+/* Type of the previous instruction.  */
+static struct
+  {
+    segT seg;
+    const char *file;
+    const char *name;
+    unsigned int line;
+    enum last_insn_kind
+      {
+	last_insn_other = 0,
+	last_insn_directive,
+	last_insn_prefix
+      } kind;
+  } last_insn;
+
 /* 1 if the assembler should generate relax relocations.  */
 
 static int generate_relax_relocations
@@ -632,6 +652,31 @@  static enum check_kind
   }
 sse_check, operand_check = check_warning;
 
+/* Non-zero if branches should be aligned within power of 2 boundary.  */
+static int align_branch_power = 0;
+
+/* Types of branches to align.  */
+enum align_branch_kind
+  {
+    align_branch_none = 0,
+    align_branch_jcc = 1 << 0,
+    align_branch_fused = 1 << 1,
+    align_branch_jmp = 1 << 2,
+    align_branch_call = 1 << 3,
+    align_branch_indirect = 1 << 4,
+    align_branch_ret = 1 << 5
+  };
+
+static unsigned int align_branch = (align_branch_jcc
+				    | align_branch_fused
+				    | align_branch_jmp);
+
+/* The maximum padding size for fused jcc.  */
+#define MAX_FUSED_JCC_PADDING_SIZE 20
+
+/* The maximum number of prefixes added for an instruction.  */
+static unsigned int align_branch_prefix_size = 5;
+
 /* Optimization:
    1. Clear the REX_W bit with register operand if possible.
    2. Above plus use 128bit vector instruction to clear the full vector
@@ -735,12 +780,19 @@  int x86_cie_data_alignment;
 /* Interface to relax_segment.
    There are 3 major relax states for 386 jump insns because the
    different types of jumps add different sizes to frags when we're
-   figuring out what sort of jump to choose to reach a given label.  */
+   figuring out what sort of jump to choose to reach a given label.
+
+   BRANCH_PADDING, BRANCH_PREFIX and FUSED_JCC_PADDING are used to align
+   branches which are handled by md_estimate_size_before_relax() and
+   i386_generic_table_relax_frag().  */
 
 /* Types.  */
 #define UNCOND_JUMP 0
 #define COND_JUMP 1
 #define COND_JUMP86 2
+#define BRANCH_PADDING 3
+#define BRANCH_PREFIX 4
+#define FUSED_JCC_PADDING 5
 
 /* Sizes.  */
 #define CODE16	1
@@ -1381,6 +1433,12 @@  i386_generate_nops (fragS *fragP, char *where, offsetT count, int limit)
     case rs_fill_nop:
     case rs_align_code:
       break;
+    case rs_machine_dependent:
+      /* Allow NOP padding for jumps and calls.  */
+      if (TYPE_FROM_RELAX_STATE (fragP->fr_subtype) == BRANCH_PADDING
+	  || TYPE_FROM_RELAX_STATE (fragP->fr_subtype) == FUSED_JCC_PADDING)
+	break;
+      /* Fall through.  */
     default:
       return;
     }
@@ -1525,7 +1583,7 @@  i386_generate_nops (fragS *fragP, char *where, offsetT count, int limit)
 	  return;
 	}
     }
-  else
+  else if (fragP->fr_type != rs_machine_dependent)
     fragP->fr_var = count;
 
   if ((count / max_single_nop_size) > max_number_of_nops)
@@ -3012,6 +3070,11 @@  md_begin (void)
       x86_dwarf2_return_column = 8;
       x86_cie_data_alignment = -4;
     }
+
+  /* NB: FUSED_JCC_PADDING frag must have sufficient room so that it
+     can be turned into BRANCH_PREFIX frag.  */
+  if (align_branch_prefix_size > MAX_FUSED_JCC_PADDING_SIZE)
+    abort ();
 }
 
 void
@@ -4538,6 +4601,17 @@  md_assemble (char *line)
 
   /* We are ready to output the insn.  */
   output_insn ();
+
+  last_insn.seg = now_seg;
+
+  if (i.tm.opcode_modifier.isprefix)
+    {
+      last_insn.kind = last_insn_prefix;
+      last_insn.name = i.tm.name;
+      last_insn.file = as_where (&last_insn.line);
+    }
+  else
+    last_insn.kind = last_insn_other;
 }
 
 static char *
@@ -8178,11 +8252,181 @@  encoding_length (const fragS *start_frag, offsetT start_off,
   return len - start_off + (frag_now_ptr - frag_now->fr_literal);
 }
 
+/* Return 1 for test, and, cmp, add, sub, inc and dec which may
+   be macro-fused with conditional jumps.  */
+
+static int
+maybe_fused_with_jcc_p (void)
+{
+  /* No RIP address.  */
+  if (i.base_reg && i.base_reg->reg_num == RegIP)
+    return 0;
+
+  /* and, add, sub with destination register.  */
+  if (!strcmp (i.tm.name, "and")
+      || !strcmp (i.tm.name, "add")
+      || !strcmp (i.tm.name, "sub"))
+    return i.types[1].bitfield.class == Reg;
+
+  /* test, cmp with any register.  */
+  if (!strcmp (i.tm.name, "test") || !strcmp (i.tm.name, "cmp"))
+    return (i.types[0].bitfield.class == Reg
+	    || i.types[1].bitfield.class == Reg);
+
+  /* inc, dec with 16/32/64-bit register.   */
+  if (!strcmp (i.tm.name, "inc") || !strcmp (i.tm.name, "dec"))
+    return i.types[0].bitfield.class == Reg;
+
+  return 0;
+}
+
+/* Return 1 if a FUSED_JCC_PADDING frag should be generated.  */
+
+static int
+add_fused_jcc_padding_frag_p (void)
+{
+  if (!align_branch_power
+      || now_seg == absolute_section
+      || !cpu_arch_flags.bitfield.cpui386
+      || !(align_branch & align_branch_fused))
+    return 0;
+
+  if (maybe_fused_with_jcc_p ())
+    {
+      if (last_insn.kind != last_insn_other
+	  && last_insn.seg == now_seg)
+	{
+	  if (flag_debug)
+	    as_warn_where (last_insn.file, last_insn.line,
+			   _("`%s` skips -malign-branch-boundary on `%s`"),
+			   last_insn.name, i.tm.name);
+	  return 0;
+	}
+      return 1;
+    }
+
+  return 0;
+}
+
+/* Return 1 if a BRANCH_PREFIX frag should be generated.  */
+
+static int
+add_branch_prefix_frag_p (void)
+{
+  if (!align_branch_power
+      || now_seg == absolute_section
+      || i.tm.cpu_flags.bitfield.cpupadlock
+      || !cpu_arch_flags.bitfield.cpui386)
+    return 0;
+
+  /* Don't add prefix if it is a prefix or there is no operand.  */
+  if (!i.operands || i.tm.opcode_modifier.isprefix)
+    return 0;
+
+  if (last_insn.kind != last_insn_other
+      && last_insn.seg == now_seg)
+    {
+      if (flag_debug)
+	as_warn_where (last_insn.file, last_insn.line,
+		       _("`%s` skips -malign-branch-boundary on `%s`"),
+		       last_insn.name, i.tm.name);
+      return 0;
+    }
+
+  return 1;
+}
+
+/* Return 1 if a BRANCH_PADDING frag should be generated.  */
+
+static int
+add_branch_padding_frag_p (enum align_branch_kind *branch_p)
+{
+  int add_padding;
+
+  if (!align_branch_power
+      || now_seg == absolute_section
+      || !cpu_arch_flags.bitfield.cpui386)
+    return 0;
+
+  add_padding = 0;
+
+  /* Check for jcc and direct jmp.  */
+  if (i.tm.opcode_modifier.jump)
+    {
+      if (i.tm.base_opcode == JUMP_PC_RELATIVE)
+	{
+	  *branch_p = align_branch_jmp;
+	  add_padding = align_branch & align_branch_jmp;
+	}
+      else
+	{
+	  *branch_p = align_branch_jcc;
+	  if ((align_branch & align_branch_jcc))
+	    add_padding = 1;
+	}
+    }
+  else if (i.tm.base_opcode == 0xc2
+	   || i.tm.base_opcode == 0xc3
+	   || i.tm.base_opcode == 0xca
+	   || i.tm.base_opcode == 0xcb)
+    {
+      *branch_p = align_branch_ret;
+      if ((align_branch & align_branch_ret))
+	add_padding = 1;
+    }
+  else
+    {
+      if (i.tm.base_opcode == 0xe8)
+	{
+	  *branch_p = align_branch_call;
+	  if ((align_branch & align_branch_call))
+	    add_padding = 1;
+	}
+      else if (i.tm.base_opcode == 0xff
+	       && (i.rm.reg == 2 || i.rm.reg == 4))
+	{
+	  *branch_p = align_branch_indirect;
+	  if ((align_branch & align_branch_indirect))
+	    add_padding = 1;
+	}
+
+      /* Check for indirect jmp, direct and indirect calls.  */
+      if (add_padding
+	  && i.disp_operands
+	  && tls_get_addr
+	  && (i.op[0].disps->X_op == O_symbol
+	      || (i.op[0].disps->X_op == O_subtract
+		  && i.op[0].disps->X_op_symbol == GOT_symbol)))
+	{
+	  symbolS *s = i.op[0].disps->X_add_symbol;
+	  /* No padding to call to global or undefined tls_get_addr.  */
+	  if ((S_IS_EXTERNAL (s) || !S_IS_DEFINED (s))
+	      && strcmp (S_GET_NAME (s), tls_get_addr) == 0)
+	    return 0;
+	}
+    }
+
+  if (add_padding
+      && last_insn.kind != last_insn_other
+      && last_insn.seg == now_seg)
+    {
+      if (flag_debug)
+	as_warn_where (last_insn.file, last_insn.line,
+		       _("`%s` skips -malign-branch-boundary on `%s`"),
+		       last_insn.name, i.tm.name);
+      return 0;
+    }
+
+  return add_padding;
+}
+
 static void
 output_insn (void)
 {
   fragS *insn_start_frag;
   offsetT insn_start_off;
+  fragS *fragP = NULL;
+  enum align_branch_kind branch = align_branch_none;
 
 #if defined (OBJ_ELF) || defined (OBJ_MAYBE_ELF)
   if (IS_ELF && x86_used_note)
@@ -8279,6 +8523,30 @@  output_insn (void)
   insn_start_frag = frag_now;
   insn_start_off = frag_now_fix ();
 
+  if (add_branch_padding_frag_p (&branch))
+    {
+      char *p;
+      unsigned int max_branch_padding_size = 14;
+
+      /* Align section to boundary.  */
+      record_alignment (now_seg, align_branch_power);
+
+      /* Make room for padding.  */
+      frag_grow (max_branch_padding_size);
+
+      /* Start of the padding.  */
+      p = frag_more (0);
+
+      fragP = frag_now;
+
+      frag_var (rs_machine_dependent, max_branch_padding_size, 0,
+		ENCODE_RELAX_STATE (BRANCH_PADDING, 0),
+		NULL, 0, p);
+
+      fragP->tc_frag_data.branch_type = branch;
+      fragP->tc_frag_data.max_bytes = max_branch_padding_size;
+    }
+
   /* Output jumps.  */
   if (i.tm.opcode_modifier.jump)
     output_branch ();
@@ -8317,6 +8585,44 @@  output_insn (void)
 	  i.prefix[LOCK_PREFIX] = 0;
 	}
 
+      if (branch)
+	/* Skip if this is a branch.  */
+	;
+      else if (add_fused_jcc_padding_frag_p ())
+	{
+	  unsigned int max_fused_padding_size
+	    = MAX_FUSED_JCC_PADDING_SIZE;
+
+	  /* Make room for padding.  */
+	  frag_grow (max_fused_padding_size);
+	  p = frag_more (0);
+
+	  fragP = frag_now;
+
+	  frag_var (rs_machine_dependent, max_fused_padding_size, 0,
+		    ENCODE_RELAX_STATE (FUSED_JCC_PADDING, 0),
+		    NULL, 0, p);
+
+	  fragP->tc_frag_data.branch_type = align_branch_fused;
+	  fragP->tc_frag_data.max_bytes = max_fused_padding_size;
+	}
+      else if (add_branch_prefix_frag_p ())
+	{
+	  unsigned int max_prefix_size = align_branch_prefix_size;
+
+	  /* Make room for padding.  */
+	  frag_grow (max_prefix_size);
+	  p = frag_more (0);
+
+	  fragP = frag_now;
+
+	  frag_var (rs_machine_dependent, max_prefix_size, 0,
+		    ENCODE_RELAX_STATE (BRANCH_PREFIX, 0),
+		    NULL, 0, p);
+
+	  fragP->tc_frag_data.max_bytes = max_prefix_size;
+	}
+
       /* Since the VEX/EVEX prefix contains the implicit prefix, we
 	 don't need the explicit prefix.  */
       if (!i.tm.opcode_modifier.vex && !i.tm.opcode_modifier.evex)
@@ -8461,12 +8767,81 @@  output_insn (void)
       if (now_seg != absolute_section)
 	{
 	  j = encoding_length (insn_start_frag, insn_start_off, frag_more (0));
-	  if (j > 15)
+	  if (fragP)
+	    {
+	      /* NB: Don't add prefix with GOTPC relocation since
+		 output_disp() above depends on the fixed encoding
+		 length.  */
+	      unsigned int max = i.has_gotpc_reloc ? 0 : 15 - j;
+	      /* Prefix count on the current instruction.  */
+	      unsigned int count = !!is_any_vex_encoding (&i.tm);
+	      unsigned int k;
+	      for (k = 0; k < ARRAY_SIZE (i.prefix); k++)
+		if (i.prefix[k])
+		  count++;
+
+	      /* NB: prefix count + instruction size must be <= 15.  */
+	      if (j > 15)
+		as_fatal (_("instruction length of %u bytes exceeds the limit of 15"),
+			  j);
+
+	      if (TYPE_FROM_RELAX_STATE (fragP->fr_subtype)
+		  == BRANCH_PREFIX)
+		{
+		  /* Set the maximum prefix size in BRANCH_PREFIX
+		     frag.  */
+		  if (fragP->tc_frag_data.max_bytes > max)
+		    fragP->tc_frag_data.max_bytes = max;
+		  if (fragP->tc_frag_data.max_bytes > count)
+		    fragP->tc_frag_data.max_bytes -= count;
+		  else
+		    fragP->tc_frag_data.max_bytes = 0;
+		}
+	      else
+		{
+		  /* Remember the maximum prefix size in FUSED_JCC_PADDING
+		     frag.  */
+		  unsigned int max_prefix_size;
+		  if (align_branch_prefix_size > max)
+		    max_prefix_size = max;
+		  else
+		    max_prefix_size = align_branch_prefix_size;
+		  if (max_prefix_size > count)
+		    fragP->tc_frag_data.max_prefix_length
+		      = max_prefix_size - count;
+		}
+
+	      /* Use existing segment prefix if possible.  Use CS
+		 segment prefix in 64-bit mode.  In 32-bit mode, use SS
+		 segment prefix with ESP/EBP base register and use DS
+		 segment prefix without ESP/EBP base register.  */
+	      if (i.prefix[SEG_PREFIX])
+		fragP->tc_frag_data.default_prefix = i.prefix[SEG_PREFIX];
+	      else if (flag_code == CODE_64BIT)
+		fragP->tc_frag_data.default_prefix = CS_PREFIX_OPCODE;
+	      else if (i.base_reg
+		       && (i.base_reg->reg_num == 4
+			   || i.base_reg->reg_num == 5))
+		fragP->tc_frag_data.default_prefix = SS_PREFIX_OPCODE;
+	      else
+		fragP->tc_frag_data.default_prefix = DS_PREFIX_OPCODE;
+	    }
+	  else if (j > 15)
 	    as_warn (_("instruction length of %u bytes exceeds the limit of 15"),
 		     j);
 	}
     }
 
+  if (align_branch_power
+      && now_seg != absolute_section
+      && cpu_arch_flags.bitfield.cpui386)
+    {
+      /* Terminate each frag so that we can add prefix and check for
+         fused jcc.  */
+      frag_wane (frag_now);
+      frag_new (0);
+    }
+
 #ifdef DEBUG386
   if (flag_debug)
     {
@@ -8576,6 +8951,7 @@  output_disp (fragS *insn_start_frag, offsetT insn_start_off)
 		  if (!object_64bit)
 		    {
 		      reloc_type = BFD_RELOC_386_GOTPC;
+		      i.has_gotpc_reloc = TRUE;
 		      i.op[n].imms->X_add_number +=
 			encoding_length (insn_start_frag, insn_start_off, p);
 		    }
@@ -8728,6 +9104,7 @@  output_imm (fragS *insn_start_frag, offsetT insn_start_off)
 		    reloc_type = BFD_RELOC_X86_64_GOTPC32;
 		  else if (size == 8)
 		    reloc_type = BFD_RELOC_X86_64_GOTPC64;
+		  i.has_gotpc_reloc = TRUE;
 		  i.op[n].imms->X_add_number +=
 		    encoding_length (insn_start_frag, insn_start_off, p);
 		}
@@ -10353,6 +10730,355 @@  elf_symbol_resolved_in_segment_p (symbolS *fr_symbol, offsetT fr_var)
 }
 #endif
 
+/* Return the next non-empty frag.  */
+
+static fragS *
+i386_next_non_empty_frag (fragS *fragP)
+{
+  /* There may be a frag with a ".fill 0" when there is no room in
+     the current frag for frag_grow in output_insn.  */
+  for (fragP = fragP->fr_next;
+       (fragP != NULL
+	&& fragP->fr_type == rs_fill
+	&& fragP->fr_fix == 0);
+       fragP = fragP->fr_next)
+    ;
+  return fragP;
+}
+
+/* Return the next jcc frag after BRANCH_PADDING.  */
+
+static fragS *
+i386_next_jcc_frag (fragS *fragP)
+{
+  if (!fragP)
+    return NULL;
+
+  if (fragP->fr_type == rs_machine_dependent
+      && (TYPE_FROM_RELAX_STATE (fragP->fr_subtype)
+	  == BRANCH_PADDING))
+    {
+      fragP = i386_next_non_empty_frag (fragP);
+      if (fragP->fr_type != rs_machine_dependent)
+	return NULL;
+      if (TYPE_FROM_RELAX_STATE (fragP->fr_subtype) == COND_JUMP)
+	return fragP;
+    }
+
+  return NULL;
+}
+
+/* Classify BRANCH_PADDING, BRANCH_PREFIX and FUSED_JCC_PADDING frags.  */
+
+static void
+i386_classify_machine_dependent_frag (fragS *fragP)
+{
+  fragS *cmp_fragP;
+  fragS *pad_fragP;
+  fragS *branch_fragP;
+  fragS *next_fragP;
+  unsigned int max_prefix_length;
+
+  if (fragP->tc_frag_data.classified)
+    return;
+
+  /* First scan for BRANCH_PADDING and FUSED_JCC_PADDING.  Convert
+     FUSED_JCC_PADDING and merge BRANCH_PADDING.  */
+  for (next_fragP = fragP;
+       next_fragP != NULL;
+       next_fragP = next_fragP->fr_next)
+    {
+      next_fragP->tc_frag_data.classified = 1;
+      if (next_fragP->fr_type == rs_machine_dependent)
+	switch (TYPE_FROM_RELAX_STATE (next_fragP->fr_subtype))
+	  {
+	  case BRANCH_PADDING:
+	    /* The BRANCH_PADDING frag must be followed by a branch
+	       frag.  */
+	    branch_fragP = i386_next_non_empty_frag (next_fragP);
+	    next_fragP->tc_frag_data.u.branch_fragP = branch_fragP;
+	    break;
+	  case FUSED_JCC_PADDING:
+	    /* Check if this is a fused jcc:
+	       FUSED_JCC_PADDING
+	       CMP
+	       BRANCH_PADDING
+	       COND_JUMP
+	       */
+	    cmp_fragP = i386_next_non_empty_frag (next_fragP);
+	    pad_fragP = i386_next_non_empty_frag (cmp_fragP);
+	    branch_fragP = i386_next_jcc_frag (pad_fragP);
+	    if (branch_fragP)
+	      {
+		/* The BRANCH_PADDING frag is merged with the
+		   FUSED_JCC_PADDING frag.  */
+		next_fragP->tc_frag_data.u.branch_fragP = branch_fragP;
+		/* CMP instruction size.  */
+		next_fragP->tc_frag_data.cmp_size = cmp_fragP->fr_fix;
+		frag_wane (pad_fragP);
+		/* Skip to branch_fragP.  */
+		next_fragP = branch_fragP;
+	      }
+	    else if (next_fragP->tc_frag_data.max_prefix_length)
+	      {
+		/* Turn FUSED_JCC_PADDING into BRANCH_PREFIX if it isn't
+		   a fused jcc.  */
+		next_fragP->fr_subtype
+		  = ENCODE_RELAX_STATE (BRANCH_PREFIX, 0);
+		next_fragP->tc_frag_data.max_bytes
+		  = next_fragP->tc_frag_data.max_prefix_length;
+		/* This will be updated in the BRANCH_PREFIX scan.  */
+		next_fragP->tc_frag_data.max_prefix_length = 0;
+	      }
+	    else
+	      frag_wane (next_fragP);
+	    break;
+	  }
+    }
+
+  /* Scan for BRANCH_PREFIX.  */
+  for (; fragP != NULL; fragP = fragP->fr_next)
+    if (fragP->fr_type == rs_machine_dependent
+	&& (TYPE_FROM_RELAX_STATE (fragP->fr_subtype)
+	    == BRANCH_PREFIX))
+      {
+	/* Count all BRANCH_PREFIX frags before BRANCH_PADDING and
+	   COND_JUMP_PREFIX.  */
+	max_prefix_length = 0;
+	for (next_fragP = fragP;
+	     next_fragP != NULL;
+	     next_fragP = next_fragP->fr_next)
+	  {
+	    if (next_fragP->fr_type == rs_fill)
+	      /* Skip rs_fill frags.  */
+	      ;
+	    else if (next_fragP->fr_type == rs_machine_dependent)
+	      {
+		if (TYPE_FROM_RELAX_STATE (next_fragP->fr_subtype)
+		    == BRANCH_PREFIX)
+		  {
+		    /* Count BRANCH_PREFIX frags.  */
+		    if (max_prefix_length >= MAX_FUSED_JCC_PADDING_SIZE)
+		      {
+			max_prefix_length = MAX_FUSED_JCC_PADDING_SIZE;
+			frag_wane (next_fragP);
+		      }
+		    else
+		      max_prefix_length
+			+= next_fragP->tc_frag_data.max_bytes;
+		  }
+		else if ((TYPE_FROM_RELAX_STATE (next_fragP->fr_subtype)
+			  == BRANCH_PADDING)
+			 || (TYPE_FROM_RELAX_STATE (next_fragP->fr_subtype)
+			     == FUSED_JCC_PADDING))
+		  {
+		    /* Stop at BRANCH_PADDING and FUSED_JCC_PADDING.  */
+		    fragP->tc_frag_data.u.padding_fragP = next_fragP;
+		    break;
+		  }
+		else
+		  /* Stop for other rs_machine_dependent frags.  */
+		  break;
+	      }
+	    else
+	      /* Stop for all other frags.  */
+	      break;
+	  }
+
+	fragP->tc_frag_data.max_prefix_length = max_prefix_length;
+
+	/* Skip to the next frag.  */
+	fragP = next_fragP;
+      }
+}
+
+/* Compute padding size for
+
+	FUSED_JCC_PADDING
+	CMP
+	BRANCH_PADDING
+	COND_JUMP/UNCOND_JUMP
+
+   or
+
+	BRANCH_PADDING
+	COND_JUMP/UNCOND_JUMP
+ */
+
+static int
+i386_branch_padding_size (fragS *fragP, offsetT address)
+{
+  unsigned int offset, size, padding_size;
+  fragS *branch_fragP = fragP->tc_frag_data.u.branch_fragP;
+
+  /* The start address of the BRANCH_PADDING or FUSED_JCC_PADDING frag.  */
+  if (!address)
+    address = fragP->fr_address;
+  address += fragP->fr_fix;
+
+  /* CMP instruction size.  */
+  size = fragP->tc_frag_data.cmp_size;
+
+  /* The base size of the branch frag.  */
+  size += branch_fragP->fr_fix;
+
+  /* Add opcode and displacement bytes for the rs_machine_dependent
+     branch frag.  */
+  if (branch_fragP->fr_type == rs_machine_dependent)
+    size += md_relax_table[branch_fragP->fr_subtype].rlx_length;
+
+  /* Check if branch is within boundary and doesn't end at the last
+     byte.  */
+  offset = address & ((1U << align_branch_power) - 1);
+  if ((offset + size) >= (1U << align_branch_power))
+    /* Padding needed to avoid crossing boundary.  */
+    padding_size = (1 << align_branch_power) - offset;
+  else
+    /* No padding needed.  */
+    padding_size = 0;
+
+  if (!fits_in_signed_byte (padding_size))
+    abort ();
+
+  return padding_size;
+}
+
+/* i386_generic_table_relax_frag()
+
+   Handle BRANCH_PADDING, BRANCH_PREFIX and FUSED_JCC_PADDING frags to
+   grow/shrink padding to align branch frags.  Hand others to
+   relax_frag().  */
+
+long
+i386_generic_table_relax_frag (segT segment, fragS *fragP, long stretch)
+{
+  if (TYPE_FROM_RELAX_STATE (fragP->fr_subtype) == BRANCH_PADDING
+      || TYPE_FROM_RELAX_STATE (fragP->fr_subtype) == FUSED_JCC_PADDING)
+    {
+      long padding_size = i386_branch_padding_size (fragP, 0);
+      long grow = padding_size - fragP->tc_frag_data.length;
+
+      /* When the BRANCH_PREFIX frag is used, the computed address
+         must match the actual address and there should be no padding.  */
+      if (fragP->tc_frag_data.padding_address
+	  && (fragP->tc_frag_data.padding_address != fragP->fr_address
+	      || padding_size))
+	abort ();
+
+      /* Update the padding size.  */
+      if (grow)
+	fragP->tc_frag_data.length = padding_size;
+
+      return grow;
+    }
+  else if (TYPE_FROM_RELAX_STATE (fragP->fr_subtype) == BRANCH_PREFIX)
+    {
+      fragS *padding_fragP, *next_fragP;
+      long padding_size, left_size, last_size;
+
+      padding_fragP = fragP->tc_frag_data.u.padding_fragP;
+      if (!padding_fragP)
+	/* Use the padding set by the leading BRANCH_PREFIX frag.  */
+	return (fragP->tc_frag_data.length
+		- fragP->tc_frag_data.last_length);
+
+      /* Compute the relative address of the padding frag the very
+        first time, when the BRANCH_PREFIX frag sizes are zero.  */
+      if (!fragP->tc_frag_data.padding_address)
+	fragP->tc_frag_data.padding_address
+	  = padding_fragP->fr_address - (fragP->fr_address - stretch);
+
+      /* First update the last length from the previous iteration.  */
+      left_size = fragP->tc_frag_data.prefix_length;
+      for (next_fragP = fragP;
+	   next_fragP != padding_fragP;
+	   next_fragP = next_fragP->fr_next)
+	if (next_fragP->fr_type == rs_machine_dependent
+	    && (TYPE_FROM_RELAX_STATE (next_fragP->fr_subtype)
+		== BRANCH_PREFIX))
+	  {
+	    if (left_size)
+	      {
+		int max = next_fragP->tc_frag_data.max_bytes;
+		if (max)
+		  {
+		    int size;
+		    if (max > left_size)
+		      size = left_size;
+		    else
+		      size = max;
+		    left_size -= size;
+		    next_fragP->tc_frag_data.last_length = size;
+		  }
+	      }
+	    else
+	      next_fragP->tc_frag_data.last_length = 0;
+	  }
+
+      /* Check the padding size for the padding frag.  */
+      padding_size = i386_branch_padding_size
+	(padding_fragP, (fragP->fr_address
+			 + fragP->tc_frag_data.padding_address));
+
+      last_size = fragP->tc_frag_data.prefix_length;
+      /* Check if there is change from the last iteration.  */
+      if (padding_size == last_size)
+	{
+	  /* Update the expected address of the padding frag.  */
+	  padding_fragP->tc_frag_data.padding_address
+	    = (fragP->fr_address + padding_size
+	       + fragP->tc_frag_data.padding_address);
+	  return 0;
+	}
+
+      if (padding_size > fragP->tc_frag_data.max_prefix_length)
+	{
+	  /* No padding if there is no sufficient room.  Clear the
+	     expected address of the padding frag.  */
+	  padding_fragP->tc_frag_data.padding_address = 0;
+	  padding_size = 0;
+	}
+      else
+	/* Store the expected address of the padding frag.  */
+	padding_fragP->tc_frag_data.padding_address
+	  = (fragP->fr_address + padding_size
+	     + fragP->tc_frag_data.padding_address);
+
+      fragP->tc_frag_data.prefix_length = padding_size;
+
+      /* Update the length for the current iteration.  */
+      left_size = padding_size;
+      for (next_fragP = fragP;
+	   next_fragP != padding_fragP;
+	   next_fragP = next_fragP->fr_next)
+	if (next_fragP->fr_type == rs_machine_dependent
+	    && (TYPE_FROM_RELAX_STATE (next_fragP->fr_subtype)
+		== BRANCH_PREFIX))
+	  {
+	    if (left_size)
+	      {
+		int max = next_fragP->tc_frag_data.max_bytes;
+		if (max)
+		  {
+		    int size;
+		    if (max > left_size)
+		      size = left_size;
+		    else
+		      size = max;
+		    left_size -= size;
+		    next_fragP->tc_frag_data.length = size;
+		  }
+	      }
+	    else
+	      next_fragP->tc_frag_data.length = 0;
+	  }
+
+      return (fragP->tc_frag_data.length
+	      - fragP->tc_frag_data.last_length);
+    }
+  return relax_frag (segment, fragP, stretch);
+}
+
 /* md_estimate_size_before_relax()
 
    Called just before relax() for rs_machine_dependent frags.  The x86
@@ -10369,6 +11095,14 @@  elf_symbol_resolved_in_segment_p (symbolS *fr_symbol, offsetT fr_var)
 int
 md_estimate_size_before_relax (fragS *fragP, segT segment)
 {
+  if (TYPE_FROM_RELAX_STATE (fragP->fr_subtype) == BRANCH_PADDING
+      || TYPE_FROM_RELAX_STATE (fragP->fr_subtype) == BRANCH_PREFIX
+      || TYPE_FROM_RELAX_STATE (fragP->fr_subtype) == FUSED_JCC_PADDING)
+    {
+      i386_classify_machine_dependent_frag (fragP);
+      return fragP->tc_frag_data.length;
+    }
+
   /* We've already got fragP->fr_subtype right;  all we have to do is
      check for un-relaxable symbols.  On an ELF system, we can't relax
      an externally visible symbol, because it may be overridden by a
@@ -10502,6 +11236,106 @@  md_convert_frag (bfd *abfd ATTRIBUTE_UNUSED, segT sec ATTRIBUTE_UNUSED,
   unsigned int extension = 0;
   offsetT displacement_from_opcode_start;
 
+  if (TYPE_FROM_RELAX_STATE (fragP->fr_subtype) == BRANCH_PADDING
+      || TYPE_FROM_RELAX_STATE (fragP->fr_subtype) == FUSED_JCC_PADDING
+      || TYPE_FROM_RELAX_STATE (fragP->fr_subtype) == BRANCH_PREFIX)
+    {
+      /* Generate nop padding.  */
+      unsigned int size = fragP->tc_frag_data.length;
+      if (size)
+	{
+	  if (size > fragP->tc_frag_data.max_bytes)
+	    abort ();
+
+	  if (flag_debug)
+	    {
+	      const char *msg;
+	      const char *branch = "branch";
+	      const char *prefix = "";
+	      fragS *padding_fragP;
+	      if (TYPE_FROM_RELAX_STATE (fragP->fr_subtype)
+		  == BRANCH_PREFIX)
+		{
+		  padding_fragP = fragP->tc_frag_data.u.padding_fragP;
+		  switch (fragP->tc_frag_data.default_prefix)
+		    {
+		    default:
+		      abort ();
+		      break;
+		    case CS_PREFIX_OPCODE:
+		      prefix = " cs";
+		      break;
+		    case DS_PREFIX_OPCODE:
+		      prefix = " ds";
+		      break;
+		    case ES_PREFIX_OPCODE:
+		      prefix = " es";
+		      break;
+		    case FS_PREFIX_OPCODE:
+		      prefix = " fs";
+		      break;
+		    case GS_PREFIX_OPCODE:
+		      prefix = " gs";
+		      break;
+		    case SS_PREFIX_OPCODE:
+		      prefix = " ss";
+		      break;
+		    }
+		  if (padding_fragP)
+		    msg = _("%s:%u: add %d%s at 0x%llx to align "
+			    "%s within %d-byte boundary\n");
+		  else
+		    msg = _("%s:%u: add additional %d%s at 0x%llx to "
+			    "align %s within %d-byte boundary\n");
+		}
+	      else
+		{
+		  padding_fragP = fragP;
+		  msg = _("%s:%u: add %d%s-byte nop at 0x%llx to align "
+			  "%s within %d-byte boundary\n");
+		}
+
+	      if (padding_fragP)
+		switch (padding_fragP->tc_frag_data.branch_type)
+		  {
+		  case align_branch_jcc:
+		    branch = "jcc";
+		    break;
+		  case align_branch_fused:
+		    branch = "fused jcc";
+		    break;
+		  case align_branch_jmp:
+		    branch = "jmp";
+		    break;
+		  case align_branch_call:
+		    branch = "call";
+		    break;
+		  case align_branch_indirect:
+		    branch = "indirect branch";
+		    break;
+		  case align_branch_ret:
+		    branch = "ret";
+		    break;
+		  default:
+		    break;
+		  }
+
+	      fprintf (stdout, msg,
+		       fragP->fr_file, fragP->fr_line, size, prefix,
+		       (long long) fragP->fr_address, branch,
+		       1 << align_branch_power);
+	    }
+	  if (TYPE_FROM_RELAX_STATE (fragP->fr_subtype) == BRANCH_PREFIX)
+	    memset (fragP->fr_opcode,
+		    fragP->tc_frag_data.default_prefix, size);
+	  else
+	    i386_generate_nops (fragP, (char *) fragP->fr_opcode,
+				size, 0);
+	  fragP->fr_fix += size;
+	}
+      return;
+    }
+
   opcode = (unsigned char *) fragP->fr_opcode;
 
   /* Address we want to reach in file space.  */
@@ -11060,6 +11894,9 @@  const char *md_shortopts = "qnO::";
 #define OPTION_MFENCE_AS_LOCK_ADD (OPTION_MD_BASE + 24)
 #define OPTION_X86_USED_NOTE (OPTION_MD_BASE + 25)
 #define OPTION_MVEXWIG (OPTION_MD_BASE + 26)
+#define OPTION_MALIGN_BRANCH_BOUNDARY (OPTION_MD_BASE + 27)
+#define OPTION_MALIGN_BRANCH_PREFIX_SIZE (OPTION_MD_BASE + 28)
+#define OPTION_MALIGN_BRANCH (OPTION_MD_BASE + 29)
 
 struct option md_longopts[] =
 {
@@ -11095,6 +11932,9 @@  struct option md_longopts[] =
   {"mfence-as-lock-add", required_argument, NULL, OPTION_MFENCE_AS_LOCK_ADD},
   {"mrelax-relocations", required_argument, NULL, OPTION_MRELAX_RELOCATIONS},
   {"mevexrcig", required_argument, NULL, OPTION_MEVEXRCIG},
+  {"malign-branch-boundary", required_argument, NULL, OPTION_MALIGN_BRANCH_BOUNDARY},
+  {"malign-branch-prefix-size", required_argument, NULL, OPTION_MALIGN_BRANCH_PREFIX_SIZE},
+  {"malign-branch", required_argument, NULL, OPTION_MALIGN_BRANCH},
   {"mamd64", no_argument, NULL, OPTION_MAMD64},
   {"mintel64", no_argument, NULL, OPTION_MINTEL64},
   {NULL, no_argument, NULL, 0}
@@ -11105,7 +11945,7 @@  int
 md_parse_option (int c, const char *arg)
 {
   unsigned int j;
-  char *arch, *next, *saved;
+  char *arch, *next, *saved, *type;
 
   switch (c)
     {
@@ -11483,6 +12323,78 @@  md_parse_option (int c, const char *arg)
         as_fatal (_("invalid -mrelax-relocations= option: `%s'"), arg);
       break;
 
+    case OPTION_MALIGN_BRANCH_BOUNDARY:
+      {
+	char *end;
+	int align = strtoul (arg, &end, 0);
+	if (*end == '\0')
+	  {
+	    if (align == 0)
+	      {
+		align_branch_power = 0;
+		break;
+	      }
+	    else if (align >= 32)
+	      {
+		int align_power;
+		for (align_power = 0;
+		     (align & 1) == 0;
+		     align >>= 1, align_power++)
+		  continue;
+		if (align == 1)
+		  {
+		    align_branch_power = align_power;
+		    break;
+		  }
+	      }
+	  }
+	as_fatal (_("invalid -malign-branch-boundary= value: %s"), arg);
+      }
+      break;
+
+    case OPTION_MALIGN_BRANCH_PREFIX_SIZE:
+      {
+	char *end;
+	int align = strtoul (arg, &end, 0);
+	if (*end == '\0' && align >= 0 && align < 6)
+	  {
+	    align_branch_prefix_size = align;
+	    break;
+	  }
+	as_fatal (_("invalid -malign-branch-prefix-size= value: %s"),
+		  arg);
+      }
+      break;
+
+    case OPTION_MALIGN_BRANCH:
+      align_branch = 0;
+      saved = xstrdup (arg);
+      type = saved;
+      do
+	{
+	  next = strchr (type, '+');
+	  if (next)
+	    *next++ = '\0';
+	  if (strcasecmp (type, "jcc") == 0)
+	    align_branch |= align_branch_jcc;
+	  else if (strcasecmp (type, "fused") == 0)
+	    align_branch |= align_branch_fused;
+	  else if (strcasecmp (type, "jmp") == 0)
+	    align_branch |= align_branch_jmp;
+	  else if (strcasecmp (type, "call") == 0)
+	    align_branch |= align_branch_call;
+	  else if (strcasecmp (type, "ret") == 0)
+	    align_branch |= align_branch_ret;
+	  else if (strcasecmp (type, "indirect") == 0)
+	    align_branch |= align_branch_indirect;
+	  else
+	    as_fatal (_("invalid -malign-branch= option: `%s'"), arg);
+	  type = next;
+	}
+      while (next != NULL);
+      free (saved);
+      break;
+
     case OPTION_MAMD64:
       intel64 = 0;
       break;
@@ -11735,6 +12647,17 @@  md_show_usage (FILE *stream)
   fprintf (stream, _("\
                           generate relax relocations\n"));
   fprintf (stream, _("\
+  -malign-branch-boundary=NUM (default: 0)\n\
+                          align branches within NUM byte boundary\n"));
+  fprintf (stream, _("\
+  -malign-branch=TYPE[+TYPE...] (default: jcc+fused+jmp)\n\
+                          TYPE is combination of jcc, fused, jmp, call, ret,\n\
+                           indirect\n\
+                          specify types of branches to align\n"));
+  fprintf (stream, _("\
+  -malign-branch-prefix-size=NUM (default: 5)\n\
+                          align branches with NUM prefixes per instruction\n"));
+  fprintf (stream, _("\
   -mamd64                 accept only AMD64 ISA [default]\n"));
   fprintf (stream, _("\
   -mintel64               accept only Intel64 ISA\n"));
@@ -11818,15 +12741,24 @@  i386_target_format (void)
 	  {
 	  default:
 	    format = ELF_TARGET_FORMAT;
+#ifndef TE_SOLARIS
+	    tls_get_addr = "___tls_get_addr";
+#endif
 	    break;
 	  case X86_64_ABI:
 	    use_rela_relocations = 1;
 	    object_64bit = 1;
+#ifndef TE_SOLARIS
+	    tls_get_addr = "__tls_get_addr";
+#endif
 	    format = ELF_TARGET_FORMAT64;
 	    break;
 	  case X86_64_X32_ABI:
 	    use_rela_relocations = 1;
 	    object_64bit = 1;
+#ifndef TE_SOLARIS
+	    tls_get_addr = "__tls_get_addr";
+#endif
 	    disallow_64bit_reloc = 1;
 	    format = ELF_TARGET_FORMAT32;
 	    break;
@@ -11943,6 +12875,21 @@  s_bss (int ignore ATTRIBUTE_UNUSED)
 
 #endif
 
+/* Remember constant directive.  */
+
+void
+i386_cons_worker (int ignore ATTRIBUTE_UNUSED)
+{
+  if (last_insn.kind != last_insn_directive
+      && (bfd_section_flags (now_seg) & SEC_CODE))
+    {
+      last_insn.seg = now_seg;
+      last_insn.kind = last_insn_directive;
+      last_insn.name = "constant directive";
+      last_insn.file = as_where (&last_insn.line);
+    }
+}
+
 void
 i386_validate_fix (fixS *fixp)
 {
diff --git a/gas/config/tc-i386.h b/gas/config/tc-i386.h
index b02a25671f..5c87dad866 100644
--- a/gas/config/tc-i386.h
+++ b/gas/config/tc-i386.h
@@ -210,12 +210,19 @@  if ((n)									\
 
 #define MAX_MEM_FOR_RS_ALIGN_CODE  (alignment ? ((1 << alignment) - 1) : 1)
 
+extern void i386_cons_worker (int);
+#define md_cons_worker(nbytes) i386_cons_worker (nbytes)
+
 void i386_print_statistics (FILE *);
 #define tc_print_statistics i386_print_statistics
 
 extern unsigned int i386_frag_max_var (fragS *);
 #define md_frag_max_var i386_frag_max_var
 
+extern long i386_generic_table_relax_frag (segT, fragS *, long);
+#define md_generic_table_relax_frag(segment, fragP, stretch) \
+  i386_generic_table_relax_frag (segment, fragP, stretch)
+
 #define md_number_to_chars number_to_chars_littleendian
 
 enum processor_type
@@ -250,10 +257,24 @@  extern i386_cpu_flags cpu_arch_isa_flags;
 
 struct i386_tc_frag_data
 {
+  union
+    {
+      fragS *padding_fragP;
+      fragS *branch_fragP;
+    } u;
+  addressT padding_address;
   enum processor_type isa;
   i386_cpu_flags isa_flags;
   enum processor_type tune;
   unsigned int max_bytes;
+  signed char length;
+  signed char last_length;
+  signed char max_prefix_length;
+  signed char prefix_length;
+  signed char default_prefix;
+  signed char cmp_size;
+  unsigned int classified : 1;
+  unsigned int branch_type : 7;
 };
 
 /* We need to emit the right NOP pattern in .align frags.  This is
@@ -264,10 +285,20 @@  struct i386_tc_frag_data
 #define TC_FRAG_INIT(FRAGP, MAX_BYTES)				\
  do								\
    {								\
+     (FRAGP)->tc_frag_data.u.padding_fragP = NULL;		\
+     (FRAGP)->tc_frag_data.padding_address = 0;			\
      (FRAGP)->tc_frag_data.isa = cpu_arch_isa;			\
      (FRAGP)->tc_frag_data.isa_flags = cpu_arch_isa_flags;	\
      (FRAGP)->tc_frag_data.tune = cpu_arch_tune;		\
      (FRAGP)->tc_frag_data.max_bytes = (MAX_BYTES);		\
+     (FRAGP)->tc_frag_data.length = 0;				\
+     (FRAGP)->tc_frag_data.last_length = 0;			\
+     (FRAGP)->tc_frag_data.max_prefix_length = 0;		\
+     (FRAGP)->tc_frag_data.prefix_length = 0;			\
+     (FRAGP)->tc_frag_data.default_prefix = 0;			\
+     (FRAGP)->tc_frag_data.cmp_size = 0;			\
+     (FRAGP)->tc_frag_data.classified = 0;			\
+     (FRAGP)->tc_frag_data.branch_type = 0;			\
    }								\
  while (0)
 
diff --git a/gas/doc/c-i386.texi b/gas/doc/c-i386.texi
index 589b4260f0..9b86ee3485 100644
--- a/gas/doc/c-i386.texi
+++ b/gas/doc/c-i386.texi
@@ -423,6 +423,32 @@  R_X86_64_REX_GOTPCRELX, in 64-bit mode.
 relocations.  The default can be controlled by a configure option
 @option{--enable-x86-relax-relocations}.
 
+@cindex @samp{-malign-branch-boundary=} option, i386
+@cindex @samp{-malign-branch-boundary=} option, x86-64
+@item -malign-branch-boundary=@var{NUM}
+This option controls how the assembler should align branches with segment
+prefixes or NOPs.  @var{NUM} must be a power of 2.  It must be either 0
+or no less than 32.  Branches will be aligned within a @var{NUM}-byte
+boundary.  @option{-malign-branch-boundary=0}, which is the default,
+doesn't align branches.
+
+@cindex @samp{-malign-branch=} option, i386
+@cindex @samp{-malign-branch=} option, x86-64
+@item -malign-branch=@var{TYPE}[+@var{TYPE}...]
+This option specifies the types of branches to align.  @var{TYPE} is a
+combination of @samp{jcc}, which aligns conditional jumps,
+@samp{fused}, which aligns fused conditional jumps, @samp{jmp},
+which aligns unconditional jumps, @samp{call}, which aligns calls,
+@samp{ret}, which aligns returns, and @samp{indirect}, which aligns
+indirect jumps and calls.  The default is @option{-malign-branch=jcc+fused+jmp}.
+
+@cindex @samp{-malign-branch-prefix-size=} option, i386
+@cindex @samp{-malign-branch-prefix-size=} option, x86-64
+@item -malign-branch-prefix-size=@var{NUM}
+This option specifies the maximum number of prefixes on an instruction
+to align branches.  @var{NUM} must be between 0 and 5.  The default
+@var{NUM} is 5.
+
 @cindex @samp{-mx86-used-note=} option, i386
 @cindex @samp{-mx86-used-note=} option, x86-64
 @item -mx86-used-note=@var{no}