[RFC] SHIFT_COUNT_TRUNCATED and shift_truncation_mask

Message ID 6f3a6c31-5a02-86bb-a659-13f541350616@linux.ibm.com
State New

Commit Message

Robin Dapp May 9, 2019, 11:52 a.m.
Hi,

while trying to improve s390 code generation for rotate and shift I
noticed superfluous subregs for shift count operands. In our backend we
already have quite cumbersome patterns that would need to be duplicated
(or complicated further by more subst patterns) in order to get rid of
the subregs.

I had already finished all the patterns when I realized that
SHIFT_COUNT_TRUNCATED and the target hook shift_truncation_mask already
exist and could do what is needed without extra patterns.  Just defining
shift_truncation_mask was not enough though, as most of the additional
insns get introduced by combine.

Even SHIFT_COUNT_TRUNCATED is not a perfect match to what our hardware
does, because we only ever consider the last 6 bits of a shift operand.

Despite all the warnings in the other backends, most notably
SHIFT_COUNT_TRUNCATED being "discouraged" as mentioned in riscv.h, I
wrote the attached tentative patch.  It's a little ad hoc: it uses the
SHIFT_COUNT_TRUNCATED paths only if shift_truncation_mask () != 0 and,
instead of truncating via & (GET_MODE_BITSIZE (mode) - 1), applies
the mask returned by shift_truncation_mask.  Doing so, both
"methods" actually reduce to a single mechanism.

I assume both were originally intended for different purposes but
without knowing the history the separation seems artificial to me.  A
quick look at other backends showed that at least some (e.g. ARM) do not
use SHIFT_COUNT_TRUNCATED because the general behavior is not
fine-grained enough, e.g. the masks for shift and rotate differ.

While the attached patch will probably work for s390, it will likely
not work for other targets.  In addition to what my patch does, would it be
useful to unify both truncation methods in a target hook that takes the
operation (shift, rotate, zero_extract, ...) as well as the mode as
arguments?  Thus, we would let the target decide what to do with the
specific combination of both.  Maybe this would also allow us to
distinguish bit test operations from the rest.

Of course, when getting everything right in the backend, there will be
no difference in result, but in my experience it's easily possible to
forget a subreg/... somewhere and end up with worse code by accident.
Maybe there is another reason why SHIFT_COUNT_TRUNCATED is discouraged
that I missed entirely?

Regards
 Robin

Comments

Jeff Law May 9, 2019, 3:59 p.m. | #1
On 5/9/19 5:52 AM, Robin Dapp wrote:
> Hi,
>
> while trying to improve s390 code generation for rotate and shift I
> noticed superfluous subregs for shift count operands. In our backend we
> already have quite cumbersome patterns that would need to be duplicated
> (or complicated further by more subst patterns) in order to get rid of
> the subregs.
>
> I had already finished all the patterns when I realized that
> SHIFT_COUNT_TRUNCATED and the target hook shift_truncation_mask already
> exist and could do what is needed without extra patterns.  Just defining
> shift_truncation_mask was not enough though, as most of the additional
> insns get introduced by combine.
>
> Even SHIFT_COUNT_TRUNCATED is not a perfect match to what our hardware
> does, because we only ever consider the last 6 bits of a shift operand.
>
> Despite all the warnings in the other backends, most notably
> SHIFT_COUNT_TRUNCATED being "discouraged" as mentioned in riscv.h, I
> wrote the attached tentative patch.  It's a little ad hoc: it uses the
> SHIFT_COUNT_TRUNCATED paths only if shift_truncation_mask () != 0 and,
> instead of truncating via & (GET_MODE_BITSIZE (mode) - 1), applies
> the mask returned by shift_truncation_mask.  Doing so, both
> "methods" actually reduce to a single mechanism.

The main reason it's discouraged is that some targets have insns
where the count would be truncated and others where it would not.  So,
for example, normal shifts might truncate but vector shifts might not
(mips), or shifts might truncate but bit tests do not (x86).

I don't know enough about the s390 architecture to know if there are any
corner cases.  You'd have to look at every pattern in your machine
description with a shift and verify that it's going to DTRT if the count
hasn't been truncated.


It would really help if you could provide testcases which show the
suboptimal code and any analysis you've done.


Jeff
Segher Boessenkool May 9, 2019, 4:44 p.m. | #2
[ Please Cc: me on combine patches.  Thanks! ]

On Thu, May 09, 2019 at 01:52:58PM +0200, Robin Dapp wrote:
> I had already finished all the patterns when I realized that
> SHIFT_COUNT_TRUNCATED and the target hook shift_truncation_mask already
> exist and could do what is needed without extra patterns.  Just defining
> shift_truncation_mask was not enough though, as most of the additional
> insns get introduced by combine.


SHIFT_COUNT_TRUNCATED says that *all* of your (variable) shifts truncate
the shift count in a particular way.  targetm.shift_truncation_mask says
that *all* of your variable shifts in a particular mode are done in a
particular way.  Neither is true, for some (many?) targets.

For example, on Power, if done in the integer registers, simple shifts
use one bit more than the datum rotated (this works for rotates as well
of course).  In the vector registers, you get a mask of just the size
of the datum however.

If it is not a simple shift, but a shift-and-mask, *nothing* works: for
example, a DImode shift-and-mask with certain arguments is actually
implemented with a 32-bit instruction (but many use a 64-bit instruction).

It really depends on the machine instruction what the behaviour is and
what the mask has to be; for that you need to know at a minimum the icode,
but many patterns have different machine insns in their alternatives, too.

> While the attached patch will probably work for s390, it will likely
> not work for other targets.  In addition to what my patch does, would it be
> useful to unify both truncation methods in a target hook that takes the
> operation (shift, rotate, zero_extract, ...) as well as the mode as
> arguments?  Thus, we would let the target decide what to do with the
> specific combination of both.  Maybe this would also allow us to
> distinguish bit test operations from the rest.


Will this be useful for many targets?  That's the question for all new
hooks that aren't strictly needed, that are just a nicety.  It's a cost for
everyone, not just your target, which is a bad tradeoff unless it helps
enough targets, and helps them enough.

> @@ -12295,7 +12293,7 @@ simplify_comparison (enum rtx_code code, rtx *pop0, rtx *pop1)
>  	     between the position and the location of the single bit.  */
>  	  /* Except we can't if SHIFT_COUNT_TRUNCATED is set, since we might
>  	     have already reduced the shift count modulo the word size.  */
> -	  if (!SHIFT_COUNT_TRUNCATED
> +	  if ((!SHIFT_COUNT_TRUNCATED || !targetm.shift_truncation_mask (mode))


Please make the (default) hook implementation use the macro, instead?

> diff --git a/gcc/config/s390/s390.c b/gcc/config/s390/s390.c
> index ad8eacdf4dc..1d723f29e1e 100644
> --- a/gcc/config/s390/s390.c
> +++ b/gcc/config/s390/s390.c
> @@ -2320,6 +2320,7 @@ s390_single_part (rtx op,
>  	    part = i;
>  	}
>      }
> +


No spurious whitespace changes please.  There is a lot that can be
improved or should be fixed, but let's not do that one by one, and not
in random patches.


Segher
Segher Boessenkool May 9, 2019, 5:05 p.m. | #3
On Thu, May 09, 2019 at 09:59:55AM -0600, Jeff Law wrote:
> The main reason it's discouraged is that some targets have insns
> where the count would be truncated and others where it would not.  So,
> for example, normal shifts might truncate but vector shifts might not
> (mips), or shifts might truncate but bit tests do not (x86).


And don't forget about zero_extract, on targets that are unfortunate
enough to have that.

> I don't know enough about the s390 architecture to know if there are any
> corner cases.  You'd have to look at every pattern in your machine
> description with a shift and verify that it's going to DTRT if the count
> hasn't been truncated.


Yeah.

> It would really help if you could provide testcases which show the
> suboptimal code and any analysis you've done.


I'd love to see this, too!


Segher
Richard Biener May 10, 2019, 7:08 a.m. | #4
On Thu, May 9, 2019 at 6:00 PM Jeff Law <law@redhat.com> wrote:
> On 5/9/19 5:52 AM, Robin Dapp wrote:
> > [... Robin's original message quoted in full snipped ...]
>
> The main reason it's discouraged is that some targets have insns
> where the count would be truncated and others where it would not.  So,
> for example, normal shifts might truncate but vector shifts might not
> (mips), or shifts might truncate but bit tests do not (x86).
>
> I don't know enough about the s390 architecture to know if there are any
> corner cases.  You'd have to look at every pattern in your machine
> description with a shift and verify that it's going to DTRT if the count
> hasn't been truncated.
>
> It would really help if you could provide testcases which show the
> suboptimal code and any analysis you've done.

The main reason I dislike SHIFT_COUNT_TRUNCATED is that it
changes the meaning of the IL.  We generally want these kinds
of things to be explicit.

Richard.

Jeff Law May 10, 2019, 2:08 p.m. | #5
On 5/10/19 1:08 AM, Richard Biener wrote:
> On Thu, May 9, 2019 at 6:00 PM Jeff Law <law@redhat.com> wrote:
>> [... earlier messages quoted in full snipped ...]
>
> The main reason I dislike SHIFT_COUNT_TRUNCATED is that it
> changes the meaning of the IL.  We generally want these kinds
> of things to be explicit.

Understood and agreed.  So I think this argues that Robin should look at
a different approach and we should start pushing ports away from
SHIFT_COUNT_TRUNCATED a bit more aggressively.

Jeff
Robin Dapp May 15, 2019, 2:30 p.m. | #6
> It would really help if you could provide testcases which show the
> suboptimal code and any analysis you've done.

I tried introducing a define_subst pattern that substitutes something
that one of two other subst patterns has already changed.

The first subst pattern helps remove a superfluous "and" on the shift
count operand, by accepting both variants of the insn pattern, with and
without the "and".

(define_subst "masked_op_subst"
  [(set (match_operand:DSI 0 "" "")
        (shift:DSI (match_operand:DSI 1 "" "")
                   (match_operand:SI 2 "" "")))]
  ""
  [(set (match_dup 0)
        (shift:DSI (match_dup 1)
                   (and:SI (match_dup 2)
                           (match_operand:SI 3 "const_int_6bitset_operand" "jm6"))))])

The second subst helps encode a shift count addition like $r1 + 1 as an
address-style operand 1($r1) that is directly supported by the shift
instruction.

(define_subst "addr_style_op_subst"
  [(set (match_operand:DSI_VI 0 "" "")
        (shift:DSI_VI (match_operand:DSI_VI 1 "" "")
                      (match_operand:SI 2 "" "")))]
  ""
  [(set (match_dup 0)
        (shift:DSI_VI (match_dup 1)
                      (plus:SI (match_operand:SI 2 "register_operand" "a")
                               (match_operand 3 "const_int_operand" "n"))))])

Both of these are also used in combination.

Now, in order to get rid of the subregs in the pattern combine creates,
I would need to be able to do something like

(define_subst "subreg_subst"
  [(set (match_operand:DI 0 "" "")
        (shift:DI (match_operand:DI 1 "" "")
                  (subreg:SI (match_dup:DI 2) 0)))]

where the (match_dup:DI 2) would capture both (and:SI ...) [with the
first argument being either a register or an already substituted
(plus:SI ...)] as well as a simple (plus:SI ...).

As far as I can tell match_dup:mode can be used to change the mode of
the top-level operation but the operands will remain the same.  For
this, a match_dup_deep or whatever would be useful.  I'm pretty sure we
don't want to open this can of worms, though :)

To get rid of this, I explicitly duplicated all three subst combinations
resulting in a lot of additional code.  This is not necessary when the
subregs are eliminated by the middle end via SHIFT_COUNT_TRUNCATED.
Maybe there is a much more obvious way that I missed?

Regards
 Robin
Jeff Law May 21, 2019, 10:58 p.m. | #7
On 5/15/19 8:30 AM, Robin Dapp wrote:
>> It would really help if you could provide testcases which show the
>> suboptimal code and any analysis you've done.
>
> [... define_subst details quoted in full snipped ...]
>
> To get rid of this, I explicitly duplicated all three subst combinations
> resulting in a lot of additional code.  This is not necessary when the
> subregs are eliminated by the middle end via SHIFT_COUNT_TRUNCATED.
> Maybe there is a much more obvious way that I missed?

Painful.  I doubt that exposing the masking during the RTL expansion phase
and hoping the standard optimizers will eliminate it would work better
-- though perhaps it could if the expanders queried the global range
information and elided the masking when the shift count was known to be
in range.

jeff
Robin Dapp June 4, 2019, 11:14 a.m. | #8
>> Now, in order to get rid of the subregs in the pattern combine creates,
>> I would need to be able to do something like
>>
>> (define_subst "subreg_subst"
>>   [(set (match_operand:DI 0 "" "")
>>         (shift:DI (match_operand:DI 1 "" "")
>>                   (subreg:SI (match_dup:DI 2) 0)))]
>>
>> where the (match_dup:DI 2) would capture both (and:SI ...) [with the
>> first argument being either a register or an already substituted
>> (plus:SI ...)] as well as a simple (plus:SI ...).
>>
>> As far as I can tell match_dup:mode can be used to change the mode of
>> the top-level operation but the operands will remain the same.  For
>> this, a match_dup_deep or whatever would be useful.  I'm pretty sure we
>> don't want to open this can of worms, though :)

[..]

> Painful.  I doubt that exposing the masking during the RTL expansion phase
> and hoping the standard optimizers will eliminate it would work better
> -- though perhaps it could if the expanders queried the global range
> information and elided the masking when the shift count was known to be
> in range.


I went for another approach: creating a dedicated big predicate and a
constraint that capture the full (subreg (and (plus ...))) block.  This,
however, led to reload problems: when needing to spill the shift-count
register, reload would not be able to reload the full operation
including the and/plus etc. and would just bail out/ICE.  I guess that's
expected behavior since the predicate is now too generic for it to
handle.  Andreas suggested a secondary reload for that but we're not
sure whether we really want that.

Thinking back to the mode-changing "match_dup_deep", I still cannot
imagine how to define this in a sane way, or rather where to stop with
mode changing (e.g. what about mems).

What I would need it to do is

 (and:SI (reg:SI) const_int) -->
 (and:DI (reg:DI) const_int)
 (plus:SI (reg:SI) const_int) -->
 (plus:DI (reg:DI) const_int)
 (and:SI (plus:SI (reg:SI) const_int) const_int) -->
 (and:DI (plus:DI (reg:DI) const_int) const_int)

Any other ideas how to achieve that without tons of boilerplate code?

Regards
 Robin

Patch

diff --git a/gcc/combine.c b/gcc/combine.c
index 91e32c88c88..d2a659f929b 100644
--- a/gcc/combine.c
+++ b/gcc/combine.c
@@ -6445,14 +6445,12 @@  combine_simplify_rtx (rtx x, machine_mode op0_mode, int in_dest,
 	return simplify_shift_const (x, code, mode, XEXP (x, 0),
 				     INTVAL (XEXP (x, 1)));
 
-      else if (SHIFT_COUNT_TRUNCATED && !REG_P (XEXP (x, 1)))
+      else if (SHIFT_COUNT_TRUNCATED
+	  && targetm.shift_truncation_mask (mode)
+	  && !REG_P (XEXP (x, 1)))
 	SUBST (XEXP (x, 1),
 	       force_to_mode (XEXP (x, 1), GET_MODE (XEXP (x, 1)),
-			      (HOST_WIDE_INT_1U
-			       << exact_log2 (GET_MODE_UNIT_BITSIZE
-					      (GET_MODE (x))))
-			      - 1,
-			      0));
+		 targetm.shift_truncation_mask (mode), 0));
       break;
 
     default:
@@ -10594,8 +10592,8 @@  simplify_shift_const_1 (enum rtx_code code, machine_mode result_mode,
   /* Make sure and truncate the "natural" shift on the way in.  We don't
      want to do this inside the loop as it makes it more difficult to
      combine shifts.  */
-  if (SHIFT_COUNT_TRUNCATED)
-    orig_count &= GET_MODE_UNIT_BITSIZE (mode) - 1;
+  if (SHIFT_COUNT_TRUNCATED && targetm.shift_truncation_mask (mode))
+    orig_count &= targetm.shift_truncation_mask (mode);
 
   /* If we were given an invalid count, don't do anything except exactly
      what was requested.  */
@@ -12295,7 +12293,7 @@  simplify_comparison (enum rtx_code code, rtx *pop0, rtx *pop1)
 	     between the position and the location of the single bit.  */
 	  /* Except we can't if SHIFT_COUNT_TRUNCATED is set, since we might
 	     have already reduced the shift count modulo the word size.  */
-	  if (!SHIFT_COUNT_TRUNCATED
+	  if ((!SHIFT_COUNT_TRUNCATED || !targetm.shift_truncation_mask (mode))
 	      && CONST_INT_P (XEXP (op0, 0))
 	      && XEXP (op0, 1) == const1_rtx
 	      && equality_comparison_p && const_op == 0
diff --git a/gcc/config/s390/s390.c b/gcc/config/s390/s390.c
index ad8eacdf4dc..1d723f29e1e 100644
--- a/gcc/config/s390/s390.c
+++ b/gcc/config/s390/s390.c
@@ -2320,6 +2320,7 @@  s390_single_part (rtx op,
 	    part = i;
 	}
     }
+
   return part == -1 ? -1 : n_parts - 1 - part;
 }
 
@@ -2702,6 +2703,7 @@  s390_logical_operator_ok_p (rtx *operands)
   return true;
 }
 
+
 /* Narrow logical operation CODE of memory operand MEMOP with immediate
    operand IMMOP to switch from SS to SI type instructions.  */
 
@@ -16294,6 +16296,13 @@  s390_case_values_threshold (void)
   return default_case_values_threshold ();
 }
 
+static unsigned HOST_WIDE_INT
+s390_shift_truncation_mask (machine_mode mode)
+{
+  return (mode == DImode || mode == SImode
+	  || mode == HImode || mode == QImode) ? 63 : 0;
+}
+
 /* Initialize GCC target structure.  */
 
 #undef  TARGET_ASM_ALIGNED_HI_OP
@@ -16585,6 +16594,9 @@  s390_case_values_threshold (void)
 #undef TARGET_CASE_VALUES_THRESHOLD
 #define TARGET_CASE_VALUES_THRESHOLD s390_case_values_threshold
 
+#undef TARGET_SHIFT_TRUNCATION_MASK
+#define TARGET_SHIFT_TRUNCATION_MASK s390_shift_truncation_mask
+
 /* Use only short displacement, since long displacement is not available for
    the floating point instructions.  */
 #undef TARGET_MAX_ANCHOR_OFFSET
diff --git a/gcc/config/s390/s390.h b/gcc/config/s390/s390.h
index 969f58a2ba0..d85bfcc5e04 100644
--- a/gcc/config/s390/s390.h
+++ b/gcc/config/s390/s390.h
@@ -1188,5 +1188,6 @@  struct GTY(()) machine_function
 
 #define TARGET_INDIRECT_BRANCH_TABLE s390_indirect_branch_table
 
+#define SHIFT_COUNT_TRUNCATED 1
 
 #endif /* S390_H */
diff --git a/gcc/cse.c b/gcc/cse.c
index 6c9cda16a98..f7bf287e9ef 100644
--- a/gcc/cse.c
+++ b/gcc/cse.c
@@ -3649,10 +3649,11 @@  fold_rtx (rtx x, rtx_insn *insn)
 		  && (INTVAL (const_arg1) >= GET_MODE_UNIT_PRECISION (mode)
 		      || INTVAL (const_arg1) < 0))
 		{
-		  if (SHIFT_COUNT_TRUNCATED)
+		  if (SHIFT_COUNT_TRUNCATED
+		      && targetm.shift_truncation_mask (mode))
 		    canon_const_arg1 = gen_int_shift_amount
 		      (mode, (INTVAL (const_arg1)
-			      & (GET_MODE_UNIT_BITSIZE (mode) - 1)));
+			      & targetm.shift_truncation_mask (mode)));
 		  else
 		    break;
 		}
@@ -3698,10 +3699,11 @@  fold_rtx (rtx x, rtx_insn *insn)
 		  && (INTVAL (inner_const) >= GET_MODE_UNIT_PRECISION (mode)
 		      || INTVAL (inner_const) < 0))
 		{
-		  if (SHIFT_COUNT_TRUNCATED)
+		  if (SHIFT_COUNT_TRUNCATED
+		      && targetm.shift_truncation_mask (mode))
 		    inner_const = gen_int_shift_amount
 		      (mode, (INTVAL (inner_const)
-			      & (GET_MODE_UNIT_BITSIZE (mode) - 1)));
+			      & targetm.shift_truncation_mask (mode)));
 		  else
 		    break;
 		}
diff --git a/gcc/expmed.c b/gcc/expmed.c
index d7f8e9a5d76..d8eebac5f08 100644
--- a/gcc/expmed.c
+++ b/gcc/expmed.c
@@ -2476,11 +2476,14 @@  expand_shift_1 (enum tree_code code, machine_mode mode, rtx shifted,
 	      (unsigned HOST_WIDE_INT) GET_MODE_BITSIZE (scalar_mode)))
 	op1 = gen_int_shift_amount (mode,
 				    (unsigned HOST_WIDE_INT) INTVAL (op1)
-				    % GET_MODE_BITSIZE (scalar_mode));
+				    % (targetm.shift_truncation_mask (scalar_mode) + 1));
       else if (GET_CODE (op1) == SUBREG
 	       && subreg_lowpart_p (op1)
 	       && SCALAR_INT_MODE_P (GET_MODE (SUBREG_REG (op1)))
-	       && SCALAR_INT_MODE_P (GET_MODE (op1)))
+	       && SCALAR_INT_MODE_P (GET_MODE (op1))
+	       && targetm.shift_truncation_mask (mode)
+		== (unsigned HOST_WIDE_INT) GET_MODE_BITSIZE (scalar_mode) - 1
+	       )
 	op1 = SUBREG_REG (op1);
     }
 
diff --git a/gcc/simplify-rtx.c b/gcc/simplify-rtx.c
index 89a46a933fa..88064e4ac57 100644
--- a/gcc/simplify-rtx.c
+++ b/gcc/simplify-rtx.c
@@ -3525,9 +3525,10 @@  simplify_binary_operation_1 (enum rtx_code code, machine_mode mode,
 	  return lowpart_subreg (int_mode, tmp, inner_mode);
 	}
 
-      if (SHIFT_COUNT_TRUNCATED && CONST_INT_P (op1))
+      if (SHIFT_COUNT_TRUNCATED && targetm.shift_truncation_mask (mode)
+	  && CONST_INT_P (op1))
 	{
-	  val = INTVAL (op1) & (GET_MODE_UNIT_PRECISION (mode) - 1);
+	  val = INTVAL (op1) & targetm.shift_truncation_mask (mode);
 	  if (val != INTVAL (op1))
 	    return simplify_gen_binary (code, mode, op0,
 					gen_int_shift_amount (mode, val));
@@ -4347,8 +4348,9 @@  simplify_const_binary_operation (enum rtx_code code, machine_mode mode,
 	case ASHIFT:
 	  {
 	    wide_int wop1 = pop1;
-	    if (SHIFT_COUNT_TRUNCATED)
-	      wop1 = wi::umod_trunc (wop1, GET_MODE_PRECISION (int_mode));
+	    if (SHIFT_COUNT_TRUNCATED && targetm.shift_truncation_mask (int_mode))
+	      wop1 = wi::umod_trunc (wop1,
+		  targetm.shift_truncation_mask (int_mode) + 1);
 	    else if (wi::geu_p (wop1, GET_MODE_PRECISION (int_mode)))
 	      return NULL_RTX;
 
@@ -4426,8 +4428,9 @@  simplify_const_binary_operation (enum rtx_code code, machine_mode mode,
 	  if (CONST_SCALAR_INT_P (op1))
 	    {
 	      wide_int shift = rtx_mode_t (op1, mode);
-	      if (SHIFT_COUNT_TRUNCATED)
-		shift = wi::umod_trunc (shift, GET_MODE_PRECISION (int_mode));
+	      if (SHIFT_COUNT_TRUNCATED && targetm.shift_truncation_mask (int_mode))
+		shift = wi::umod_trunc (shift,
+		    targetm.shift_truncation_mask (int_mode) + 1);
 	      else if (wi::geu_p (shift, GET_MODE_PRECISION (int_mode)))
 		return NULL_RTX;
 	      result = wi::to_poly_wide (op0, mode) << shift;