Prefer simple case changes in spelling suggestions

Message ID 20200529165443.24395-1-tromey@adacore.com
State New
Headers show
Series
  • Prefer simple case changes in spelling suggestions
Related show

Commit Message

Tom Tromey May 29, 2020, 4:54 p.m.
I got this error message when editing gcc and recompiling:

../../gcc/gcc/ada/gcc-interface/decl.c:7714:39: error: ‘DWARF_GNAT_ENCODINGS_all’ was not declared in this scope; did you mean ‘DWARF_GNAT_ENCODINGS_GDB’?
 7714 |     = debug_info && gnat_encodings == DWARF_GNAT_ENCODINGS_all;
      |                                       ^~~~~~~~~~~~~~~~~~~~~~~~
      |                                       DWARF_GNAT_ENCODINGS_GDB

This suggestion could be improved -- what happened here is that I
failed to upper-case the word, and DWARF_GNAT_ENCODINGS_ALL was the
correct spelling.

This patch changes gcc's spell checker to prefer simple case changes
when possible.

I tested this using the self-tests.  A new self-test is also included.

gcc/ChangeLog:

	* spellcheck.c (CASE_COST): New define.
	(BASE_COST): New define.
	(get_edit_distance): Recognize case changes.
	(get_edit_distance_cutoff): Update.
	(test_edit_distances): Update.
	(get_old_cutoff): Update.
	(test_find_closest_string): Add case sensitivity test.
---
 gcc/spellcheck.c | 114 ++++++++++++++++++++++++++++++-----------------
 1 file changed, 74 insertions(+), 40 deletions(-)

-- 
2.21.3

Comments

Kewen.Lin via Gcc-patches May 29, 2020, 6:21 p.m. | #1
On Fri, May 29, 2020 at 6:02 PM Tom Tromey <tromey@adacore.com> wrote:
> This patch changes gcc's spell checker to prefer simple case changes

> when possible.

>

> I tested this using the self-tests.  A new self-test is also included.


I think that's great, but it should be mentioned in the comment that
the distance function used is not the Damerau-Levenshtein distance,
and not a metric.

(No triangle inequality. For example, the distance between "aB" and
"ba" is 4, but the distance between "aB" and "ab" is 1 and that
between "ab" and "ba" is 2, unless I missed something very clever in
your code.)

IIRC, minimum string alignment does not satisfy the triangle
inequality anyway, so test_metric_conditions should probably not
pretend to test it...
Kewen.Lin via Gcc-patches May 30, 2020, 1:40 p.m. | #2
On Fri, May 29, 2020 at 6:21 PM Pip Cet <pipcet@gmail.com> wrote:
> IIRC, minimum string alignment does not satisfy the triangle

> inequality anyway, so test_metric_conditions should probably not

> pretend to test it...


I did remember correctly, though of course that should have been
"optimal string alignment" :-). If you change the spellcheck.c
test_data array to include "abc", "ac", and "ca", the self-test will
fail, even without your patch.

I think we should just omit the triangle inequality test from the
self-test, as in the attached patch.
From bbb8b5cd7368f471bcbdbe451591e74315a5adcd Mon Sep 17 00:00:00 2001
From: Pip Cet <pipcet@gmail.com>
Date: Sat, 30 May 2020 13:39:09 +0000
Subject: [PATCH] Don't test the triangle inequality in the spellcheck.c
 self-test.

* spellcheck.c (test_data): Add problematic strings.
(test_metric_conditions): Don't test the triangle inequality
condition, which our distance function does not satisfy.
---
 gcc/ChangeLog    |  6 ++++++
 gcc/spellcheck.c | 19 +++++--------------
 2 files changed, 11 insertions(+), 14 deletions(-)

diff --git a/gcc/ChangeLog b/gcc/ChangeLog
index 4fc37369d39..e565d8ecadf 100644
--- a/gcc/ChangeLog
+++ b/gcc/ChangeLog
@@ -1,3 +1,9 @@
+2020-05-30  Pip Cet  <pipcet@gmail.com>
+
+	* spellcheck.c (test_data): Add problematic strings.
+	(test_metric_conditions): Don't test the triangle inequality
+	condition, which our distance function does not satisfy.
+
 2020-05-28  Nicolas Bértolo  <nicolasbertolo@gmail.com>
 
 	* Makefile.in: don't look for libiberty in the "pic" subdirectory
diff --git a/gcc/spellcheck.c b/gcc/spellcheck.c
index 7891260a258..1d2df070445 100644
--- a/gcc/spellcheck.c
+++ b/gcc/spellcheck.c
@@ -438,13 +438,14 @@ static const char * const test_data[] = {
   "food",
   "boo",
   "1234567890123456789012345678901234567890123456789012345678901234567890"
+  "abc",
+  "ac",
+  "ca",
 };
 
 /* Verify that get_edit_distance appears to be a sane distance function,
-   i.e. the conditions for being a metric.  This is done directly for a
-   small set of examples, using test_data above.  This is O(N^3) in the size
-   of the array, due to the test for the triangle inequality, so we keep the
-   array small.  */
+   even though it doesn't satisfy the conditions for being a metric.  This
+   is done directly for a small set of examples, using test_data above.  */
 
 static void
 test_metric_conditions ()
@@ -468,16 +469,6 @@ test_metric_conditions ()
 	  edit_distance_t dist_ji
 	    = get_edit_distance (test_data[j], test_data[i]);
 	  ASSERT_EQ (dist_ij, dist_ji);
-
-	  /* Triangle inequality.  */
-	  for (int k = 0; k < num_test_cases; k++)
-	    {
-	      edit_distance_t dist_ik
-		= get_edit_distance (test_data[i], test_data[k]);
-	      edit_distance_t dist_jk
-		= get_edit_distance (test_data[j], test_data[k]);
-	      ASSERT_TRUE (dist_ik <= dist_ij + dist_jk);
-	    }
 	}
     }
 }
Kewen.Lin via Gcc-patches May 30, 2020, 5:03 p.m. | #3
On Fri, 2020-05-29 at 10:54 -0600, Tom Tromey wrote:
> I got this error message when editing gcc and recompiling:

> 

> ../../gcc/gcc/ada/gcc-interface/decl.c:7714:39: error:

> ‘DWARF_GNAT_ENCODINGS_all’ was not declared in this scope; did you

> mean ‘DWARF_GNAT_ENCODINGS_GDB’?

>  7714 |     = debug_info && gnat_encodings ==

> DWARF_GNAT_ENCODINGS_all;

>       |                                       ^~~~~~~~~~~~~~~~~~~~~~~

> ~

>       |                                       DWARF_GNAT_ENCODINGS_GD

> B

> 

> This suggestion could be improved -- what happened here is that I

> failed to upper-case the word, and DWARF_GNAT_ENCODINGS_ALL was the

> correct spelling.

> 

> This patch changes gcc's spell checker to prefer simple case changes

> when possible.


Thanks.  I like the overall idea.

> I tested this using the self-tests.  A new self-test is also

> included.


Did the full DejaGnu testsuite get run?  There are a lot of tests in it
that make use of this code.

> gcc/ChangeLog:

> 

> 	* spellcheck.c (CASE_COST): New define.

> 	(BASE_COST): New define.

> 	(get_edit_distance): Recognize case changes.

> 	(get_edit_distance_cutoff): Update.

> 	(test_edit_distances): Update.

> 	(get_old_cutoff): Update.

> 	(test_find_closest_string): Add case sensitivity test.

> ---

>  gcc/spellcheck.c | 114 ++++++++++++++++++++++++++++++---------------

> --

>  1 file changed, 74 insertions(+), 40 deletions(-)


> diff --git a/gcc/spellcheck.c b/gcc/spellcheck.c

> index 7891260a258..9002617453f 100644

> --- a/gcc/spellcheck.c

> +++ b/gcc/spellcheck.c


[...snip...]

The patch should probably update the leading comment to
get_edit_distance.

> @@ -228,47 +241,50 @@ test_get_edit_distance_both_ways (const char

> *a, const char *b,

>  static void

>  test_edit_distances ()

>  {

> -  test_get_edit_distance_both_ways ("", "nonempty", strlen

> ("nonempty"));

> -  test_get_edit_distance_both_ways ("saturday", "sunday", 3);

> -  test_get_edit_distance_both_ways ("foo", "m_foo", 2);

> -  test_get_edit_distance_both_ways ("hello_world", "HelloWorld", 3);

> +  test_get_edit_distance_both_ways ("", "nonempty",

> +				    BASE_COST * strlen ("nonempty"));

> +  test_get_edit_distance_both_ways ("saturday", "sunday",

> +				    BASE_COST * 3);

> +  test_get_edit_distance_both_ways ("foo", "m_foo", BASE_COST * 2);

> +  test_get_edit_distance_both_ways ("hello_world", "HelloWorld", 4);

>    test_get_edit_distance_both_ways

> -    ("the quick brown fox jumps over the lazy dog", "dog", 40);

> +    ("the quick brown fox jumps over the lazy dog", "dog", BASE_COST

> * 40);

>    test_get_edit_distance_both_ways

>      ("the quick brown fox jumps over the lazy dog",

>       "the quick brown dog jumps over the lazy fox",

> -     4);

> +     BASE_COST * 4);

>    test_get_edit_distance_both_ways

>      ("Lorem ipsum dolor sit amet, consectetur adipiscing elit,",

>       "All your base are belong to us",

> -     44);

> +     BASE_COST * 44);

>    test_get_edit_distance_both_ways ("foo", "FOO", 3);

> -  test_get_edit_distance_both_ways ("fee", "deed", 2);

> -  test_get_edit_distance_both_ways ("coorzd1", "coordx1", 2);

> -  test_get_edit_distance_both_ways ("assert", "sqrt", 3);

> -  test_get_edit_distance_both_ways ("PATH_MAX", "INT8_MAX", 3);

> -  test_get_edit_distance_both_ways ("time", "nice", 2);

> -  test_get_edit_distance_both_ways ("bar", "carg", 2);

> +  test_get_edit_distance_both_ways ("fee", "deed", BASE_COST * 2);

> +  test_get_edit_distance_both_ways ("coorzd1", "coordx1", BASE_COST

> * 2);

> +  test_get_edit_distance_both_ways ("assert", "sqrt", BASE_COST *

> 3);

> +  test_get_edit_distance_both_ways ("PATH_MAX", "INT8_MAX",

> BASE_COST * 3);

> +  test_get_edit_distance_both_ways ("time", "nice", BASE_COST * 2);

> +  test_get_edit_distance_both_ways ("bar", "carg", BASE_COST * 2);

>    test_get_edit_distance_both_ways ("gtk_widget_show_all",

> -				    "GtkWidgetShowAll", 7);

> -  test_get_edit_distance_both_ways ("m_bar", "bar", 2);

> -  test_get_edit_distance_both_ways ("MACRO", "MACRAME", 3);

> -  test_get_edit_distance_both_ways ("ab", "ac", 1);

> -  test_get_edit_distance_both_ways ("ab", "a", 1);

> -  test_get_edit_distance_both_ways ("a", "b", 1);

> -  test_get_edit_distance_both_ways ("nanl", "name", 2);

> -  test_get_edit_distance_both_ways ("char", "bar", 2);

> -  test_get_edit_distance_both_ways ("-optimize", "fsanitize", 5);

> -  test_get_edit_distance_both_ways ("__DATE__", "__i386__", 4);

> +				    "GtkWidgetShowAll", 10);

> +  test_get_edit_distance_both_ways ("m_bar", "bar", BASE_COST * 2);

> +  test_get_edit_distance_both_ways ("MACRO", "MACRAME", BASE_COST *

> 3);

> +  test_get_edit_distance_both_ways ("ab", "ac", BASE_COST * 1);

> +  test_get_edit_distance_both_ways ("ab", "a", BASE_COST * 1);

> +  test_get_edit_distance_both_ways ("a", "b", BASE_COST * 1);

> +  test_get_edit_distance_both_ways ("nanl", "name", BASE_COST * 2);

> +  test_get_edit_distance_both_ways ("char", "bar", BASE_COST * 2);

> +  test_get_edit_distance_both_ways ("-optimize", "fsanitize",

> BASE_COST * 5);

> +  test_get_edit_distance_both_ways ("__DATE__", "__i386__",

> BASE_COST * 4);

>  

>    /* Examples where transposition helps.  */

> -  test_get_edit_distance_both_ways ("ab", "ba", 1);

> -  test_get_edit_distance_both_ways ("ba", "abc", 2);

> -  test_get_edit_distance_both_ways ("coorzd1", "coordz1", 1);

> +  test_get_edit_distance_both_ways ("ab", "ba", BASE_COST * 1);

> +  test_get_edit_distance_both_ways ("ba", "abc", BASE_COST * 2);

> +  test_get_edit_distance_both_ways ("coorzd1", "coordz1", BASE_COST

> * 1);

>    test_get_edit_distance_both_ways ("abcdefghijklmnopqrstuvwxyz",

> -				    "bacdefghijklmnopqrstuvwxzy", 2);

> -  test_get_edit_distance_both_ways ("saturday", "sundya", 4);

> -  test_get_edit_distance_both_ways ("signed", "singed", 1);

> +				    "bacdefghijklmnopqrstuvwxzy",

> +				    BASE_COST * 2);

> +  test_get_edit_distance_both_ways ("saturday", "sundya", BASE_COST

> * 4);

> +  test_get_edit_distance_both_ways ("signed", "singed", BASE_COST *

> 1);


If I'm reading things correctly, the patch here updates the existing
tests to apply the BASE_COST scale factor, but I don't think it adds
any direct checks of the cost of case-conversion.  It would be good to
add those.

[...snip...]

Dave
Kewen.Lin via Gcc-patches May 30, 2020, 5:06 p.m. | #4
On Sat, 2020-05-30 at 13:40 +0000, Pip Cet via Gcc-patches wrote:
> On Fri, May 29, 2020 at 6:21 PM Pip Cet <pipcet@gmail.com> wrote:

> > IIRC, minimum string alignment does not satisfy the triangle

> > inequality anyway, so test_metric_conditions should probably not

> > pretend to test it...

> 

> I did remember correctly, though of course that should have been

> "optimal string alignment" :-). If you change the spellcheck.c

> test_data array to include "abc", "ac", and "ca", the self-test will

> fail, even without your patch.

> 

> I think we should just omit the triangle inequality test from the

> self-test, as in the attached patch.


I like the idea, but can you update the comment so that it explains why
it's not a metric? (better to capture that in a comment in the source
rather than just in the mailing list archives).

Thanks
Dave
Kewen.Lin via Gcc-patches May 30, 2020, 6:51 p.m. | #5
On Sat, May 30, 2020 at 5:06 PM David Malcolm <dmalcolm@redhat.com> wrote:
> On Sat, 2020-05-30 at 13:40 +0000, Pip Cet via Gcc-patches wrote:

> > I think we should just omit the triangle inequality test from the

> > self-test, as in the attached patch.

>

> I like the idea,


Thanks!

> but can you update the comment so that it explains why

> it's not a metric? (better to capture that in a comment in the source

> rather than just in the mailing list archives).


How's this?
From 2a759aba3639dd782c28c727746c0e6913390fa6 Mon Sep 17 00:00:00 2001
From: Pip Cet <pipcet@gmail.com>
Date: Sat, 30 May 2020 13:39:09 +0000
Subject: [PATCH] Don't test the triangle inequality in the spellcheck.c
 self-test.

* spellcheck.c (test_data): Add problematic strings.
(test_metric_conditions): Don't test the triangle inequality
condition, which our distance function does not satisfy.
---
 gcc/ChangeLog    |  6 ++++++
 gcc/spellcheck.c | 22 ++++++++--------------
 2 files changed, 14 insertions(+), 14 deletions(-)

diff --git a/gcc/ChangeLog b/gcc/ChangeLog
index 4fc37369d39..e565d8ecadf 100644
--- a/gcc/ChangeLog
+++ b/gcc/ChangeLog
@@ -1,3 +1,9 @@
+2020-05-30  Pip Cet  <pipcet@gmail.com>
+
+	* spellcheck.c (test_data): Add problematic strings.
+	(test_metric_conditions): Don't test the triangle inequality
+	condition, which our distance function does not satisfy.
+
 2020-05-28  Nicolas Bértolo  <nicolasbertolo@gmail.com>
 
 	* Makefile.in: don't look for libiberty in the "pic" subdirectory
diff --git a/gcc/spellcheck.c b/gcc/spellcheck.c
index 7891260a258..ab17657e8dc 100644
--- a/gcc/spellcheck.c
+++ b/gcc/spellcheck.c
@@ -438,13 +438,17 @@ static const char * const test_data[] = {
   "food",
   "boo",
   "1234567890123456789012345678901234567890123456789012345678901234567890"
+  "abc",
+  "ac",
+  "ca",
 };
 
 /* Verify that get_edit_distance appears to be a sane distance function,
-   i.e. the conditions for being a metric.  This is done directly for a
-   small set of examples, using test_data above.  This is O(N^3) in the size
-   of the array, due to the test for the triangle inequality, so we keep the
-   array small.  */
+   even though it doesn't satisfy the conditions for being a metric.  (This
+   is because the triangle inequality fails to hold: the distance between
+   "ca" and "ac" is 1, and so is the distance between "abc" and "ac", but
+   the distance between "abc" and "ca" is 3. Algorithms that calculate the
+   true Levenshtein-Damerau metric are much more expensive.)  */
 
 static void
 test_metric_conditions ()
@@ -468,16 +472,6 @@ test_metric_conditions ()
 	  edit_distance_t dist_ji
 	    = get_edit_distance (test_data[j], test_data[i]);
 	  ASSERT_EQ (dist_ij, dist_ji);
-
-	  /* Triangle inequality.  */
-	  for (int k = 0; k < num_test_cases; k++)
-	    {
-	      edit_distance_t dist_ik
-		= get_edit_distance (test_data[i], test_data[k]);
-	      edit_distance_t dist_jk
-		= get_edit_distance (test_data[j], test_data[k]);
-	      ASSERT_TRUE (dist_ik <= dist_ij + dist_jk);
-	    }
 	}
     }
 }
Tom Tromey June 1, 2020, 4:26 p.m. | #6
>>>>> "David" == David Malcolm <dmalcolm@redhat.com> writes:


>> I tested this using the self-tests.  A new self-test is also

>> included.


> Did the full DejaGnu testsuite get run?  There are a lot of tests in it

> that make use of this code.


I didn't try it, but I can.

> The patch should probably update the leading comment to

> get_edit_distance.


Will do.

>> test_get_edit_distance_both_ways ("foo", "FOO", 3);

[...]

> If I'm reading things correctly, the patch here updates the existing

> tests to apply the BASE_COST scale factor, but I don't think it adds

> any direct checks of the cost of case-conversion.  It would be good to

> add those.


It isn't obvious but the foo/FOO test did change.

Tom
Tom Tromey June 1, 2020, 8:11 p.m. | #7
> Did the full DejaGnu testsuite get run?  There are a lot of tests in it

> that make use of this code.


I did "make check" and only saw some XFAILs.

Here's v2 of the patch, which I think addresses your comments.  I did
not add a new test of get_edit_distance, because as I mentioned earlier,
an existing test already does what you asked for.

Tom

commit e897a99dada8d3935343ebf7b14ad7ec36515b3d
Author: Tom Tromey <tromey@adacore.com>
Date:   Fri May 29 10:46:57 2020 -0600

    Prefer simple case changes in spelling suggestions
    
    I got this error message when editing gcc and recompiling:
    
    ../../gcc/gcc/ada/gcc-interface/decl.c:7714:39: error: ‘DWARF_GNAT_ENCODINGS_all’ was not declared in this scope; did you mean ‘DWARF_GNAT_ENCODINGS_GDB’?
     7714 |     = debug_info && gnat_encodings == DWARF_GNAT_ENCODINGS_all;
          |                                       ^~~~~~~~~~~~~~~~~~~~~~~~
          |                                       DWARF_GNAT_ENCODINGS_GDB
    
    This suggestion could be improved -- what happened here is that I
    failed to upper-case the word, and DWARF_GNAT_ENCODINGS_ALL was the
    correct spelling.
    
    This patch changes gcc's spell checker to prefer simple case changes
    when possible.
    
    I tested this using the self-tests.  A new self-test is also included.
    
    gcc/ChangeLog:
    
            * spellcheck.c (CASE_COST): New define.
            (BASE_COST): New define.
            (get_edit_distance): Recognize case changes.
            (get_edit_distance_cutoff): Update.
            (test_edit_distances): Update.
            (get_old_cutoff): Update.
            (test_find_closest_string): Add case sensitivity test.

diff --git a/gcc/spellcheck.c b/gcc/spellcheck.c
index 7891260a258..9f7351f364f 100644
--- a/gcc/spellcheck.c
+++ b/gcc/spellcheck.c
@@ -25,14 +25,22 @@ along with GCC; see the file COPYING3.  If not see
 #include "spellcheck.h"
 #include "selftest.h"
 
+/* Cost of a case transformation.  */
+#define CASE_COST 1
+
+/* Cost of another kind of edit.  */
+#define BASE_COST 2
+
 /* Get the edit distance between the two strings: the minimal
    number of edits that are needed to change one string into another,
    where edits can be one-character insertions, removals, or substitutions,
    or transpositions of two adjacent characters (counting as one "edit").
 
-   This implementation uses the Wagner-Fischer algorithm for the
-   Damerau-Levenshtein distance; specifically, the "optimal string alignment
-   distance" or "restricted edit distance" variant.  */
+   This implementation uses a modified variant of the Wagner-Fischer
+   algorithm for the Damerau-Levenshtein distance; specifically, the
+   "optimal string alignment distance" or "restricted edit distance"
+   variant.  This implementation has been further modified to take
+   case into account.  */
 
 edit_distance_t
 get_edit_distance (const char *s, int len_s,
@@ -47,9 +55,9 @@ get_edit_distance (const char *s, int len_s,
     }
 
   if (len_s == 0)
-    return len_t;
+    return BASE_COST * len_t;
   if (len_t == 0)
-    return len_s;
+    return BASE_COST * len_s;
 
   /* We effectively build a matrix where each (i, j) contains the
      distance between the prefix strings s[0:j] and t[0:i].
@@ -67,7 +75,7 @@ get_edit_distance (const char *s, int len_s,
   /* The first row is for the case of an empty target string, which
      we can reach by deleting every character in the source string.  */
   for (int i = 0; i < len_s + 1; i++)
-    v_one_ago[i] = i;
+    v_one_ago[i] = i * BASE_COST;
 
   /* Build successive rows.  */
   for (int i = 0; i < len_t; i++)
@@ -83,21 +91,28 @@ get_edit_distance (const char *s, int len_s,
       /* The initial column is for the case of an empty source string; we
 	 can reach prefixes of the target string of length i
 	 by inserting i characters.  */
-      v_next[0] = i + 1;
+      v_next[0] = (i + 1) * BASE_COST;
 
       /* Build the rest of the row by considering neighbors to
 	 the north, west and northwest.  */
       for (int j = 0; j < len_s; j++)
 	{
-	  edit_distance_t cost = (s[j] == t[i] ? 0 : 1);
-	  edit_distance_t deletion     = v_next[j] + 1;
-	  edit_distance_t insertion    = v_one_ago[j + 1] + 1;
+	  edit_distance_t cost;
+
+	  if (s[j] == t[i])
+	    cost = 0;
+	  else if (TOLOWER (s[j]) == TOLOWER (t[i]))
+	    cost = CASE_COST;
+	  else
+	    cost = BASE_COST;
+	  edit_distance_t deletion     = v_next[j] + BASE_COST;
+	  edit_distance_t insertion    = v_one_ago[j + 1] + BASE_COST;
 	  edit_distance_t substitution = v_one_ago[j] + cost;
 	  edit_distance_t cheapest = MIN (deletion, insertion);
 	  cheapest = MIN (cheapest, substitution);
 	  if (i > 0 && j > 0 && s[j] == t[i - 1] && s[j - 1] == t[i])
 	    {
-	      edit_distance_t transposition = v_two_ago[j - 1] + 1;
+	      edit_distance_t transposition = v_two_ago[j - 1] + BASE_COST;
 	      cheapest = MIN (cheapest, transposition);
 	    }
 	  v_next[j + 1] = cheapest;
@@ -185,11 +200,11 @@ get_edit_distance_cutoff (size_t goal_len, size_t candidate_len)
   /* If the lengths are close, then round down.  */
   if (max_length - min_length <= 1)
     /* ...but allow an edit distance of at least 1.  */
-    return MAX (max_length / 3, 1);
+    return BASE_COST * MAX (max_length / 3, 1);
 
   /* Otherwise, round up (thus giving a little extra leeway to some cases
      involving insertions/deletions).  */
-  return (max_length + 2) / 3;
+  return BASE_COST * (max_length + 2) / 3;
 }
 
 #if CHECKING_P
@@ -228,47 +243,50 @@ test_get_edit_distance_both_ways (const char *a, const char *b,
 static void
 test_edit_distances ()
 {
-  test_get_edit_distance_both_ways ("", "nonempty", strlen ("nonempty"));
-  test_get_edit_distance_both_ways ("saturday", "sunday", 3);
-  test_get_edit_distance_both_ways ("foo", "m_foo", 2);
-  test_get_edit_distance_both_ways ("hello_world", "HelloWorld", 3);
+  test_get_edit_distance_both_ways ("", "nonempty",
+				    BASE_COST * strlen ("nonempty"));
+  test_get_edit_distance_both_ways ("saturday", "sunday",
+				    BASE_COST * 3);
+  test_get_edit_distance_both_ways ("foo", "m_foo", BASE_COST * 2);
+  test_get_edit_distance_both_ways ("hello_world", "HelloWorld", 4);
   test_get_edit_distance_both_ways
-    ("the quick brown fox jumps over the lazy dog", "dog", 40);
+    ("the quick brown fox jumps over the lazy dog", "dog", BASE_COST * 40);
   test_get_edit_distance_both_ways
     ("the quick brown fox jumps over the lazy dog",
      "the quick brown dog jumps over the lazy fox",
-     4);
+     BASE_COST * 4);
   test_get_edit_distance_both_ways
     ("Lorem ipsum dolor sit amet, consectetur adipiscing elit,",
      "All your base are belong to us",
-     44);
+     BASE_COST * 44);
   test_get_edit_distance_both_ways ("foo", "FOO", 3);
-  test_get_edit_distance_both_ways ("fee", "deed", 2);
-  test_get_edit_distance_both_ways ("coorzd1", "coordx1", 2);
-  test_get_edit_distance_both_ways ("assert", "sqrt", 3);
-  test_get_edit_distance_both_ways ("PATH_MAX", "INT8_MAX", 3);
-  test_get_edit_distance_both_ways ("time", "nice", 2);
-  test_get_edit_distance_both_ways ("bar", "carg", 2);
+  test_get_edit_distance_both_ways ("fee", "deed", BASE_COST * 2);
+  test_get_edit_distance_both_ways ("coorzd1", "coordx1", BASE_COST * 2);
+  test_get_edit_distance_both_ways ("assert", "sqrt", BASE_COST * 3);
+  test_get_edit_distance_both_ways ("PATH_MAX", "INT8_MAX", BASE_COST * 3);
+  test_get_edit_distance_both_ways ("time", "nice", BASE_COST * 2);
+  test_get_edit_distance_both_ways ("bar", "carg", BASE_COST * 2);
   test_get_edit_distance_both_ways ("gtk_widget_show_all",
-				    "GtkWidgetShowAll", 7);
-  test_get_edit_distance_both_ways ("m_bar", "bar", 2);
-  test_get_edit_distance_both_ways ("MACRO", "MACRAME", 3);
-  test_get_edit_distance_both_ways ("ab", "ac", 1);
-  test_get_edit_distance_both_ways ("ab", "a", 1);
-  test_get_edit_distance_both_ways ("a", "b", 1);
-  test_get_edit_distance_both_ways ("nanl", "name", 2);
-  test_get_edit_distance_both_ways ("char", "bar", 2);
-  test_get_edit_distance_both_ways ("-optimize", "fsanitize", 5);
-  test_get_edit_distance_both_ways ("__DATE__", "__i386__", 4);
+				    "GtkWidgetShowAll", 10);
+  test_get_edit_distance_both_ways ("m_bar", "bar", BASE_COST * 2);
+  test_get_edit_distance_both_ways ("MACRO", "MACRAME", BASE_COST * 3);
+  test_get_edit_distance_both_ways ("ab", "ac", BASE_COST * 1);
+  test_get_edit_distance_both_ways ("ab", "a", BASE_COST * 1);
+  test_get_edit_distance_both_ways ("a", "b", BASE_COST * 1);
+  test_get_edit_distance_both_ways ("nanl", "name", BASE_COST * 2);
+  test_get_edit_distance_both_ways ("char", "bar", BASE_COST * 2);
+  test_get_edit_distance_both_ways ("-optimize", "fsanitize", BASE_COST * 5);
+  test_get_edit_distance_both_ways ("__DATE__", "__i386__", BASE_COST * 4);
 
   /* Examples where transposition helps.  */
-  test_get_edit_distance_both_ways ("ab", "ba", 1);
-  test_get_edit_distance_both_ways ("ba", "abc", 2);
-  test_get_edit_distance_both_ways ("coorzd1", "coordz1", 1);
+  test_get_edit_distance_both_ways ("ab", "ba", BASE_COST * 1);
+  test_get_edit_distance_both_ways ("ba", "abc", BASE_COST * 2);
+  test_get_edit_distance_both_ways ("coorzd1", "coordz1", BASE_COST * 1);
   test_get_edit_distance_both_ways ("abcdefghijklmnopqrstuvwxyz",
-				    "bacdefghijklmnopqrstuvwxzy", 2);
-  test_get_edit_distance_both_ways ("saturday", "sundya", 4);
-  test_get_edit_distance_both_ways ("signed", "singed", 1);
+				    "bacdefghijklmnopqrstuvwxzy",
+				    BASE_COST * 2);
+  test_get_edit_distance_both_ways ("saturday", "sundya", BASE_COST * 4);
+  test_get_edit_distance_both_ways ("signed", "singed", BASE_COST * 1);
 }
 
 /* Subroutine of test_get_edit_distance_cutoff, for emulating the
@@ -277,7 +295,7 @@ test_edit_distances ()
 static edit_distance_t
 get_old_cutoff (size_t goal_len, size_t candidate_len)
 {
-  return MAX (goal_len, candidate_len) / 2;
+  return BASE_COST * MAX (goal_len, candidate_len) / 2;
 }
 
 /* Verify that the cutoff for "meaningfulness" of suggestions is at least as
@@ -428,6 +446,24 @@ test_find_closest_string ()
   candidates.safe_push("coordy1");
   candidates.safe_push("coordz1");
   ASSERT_STREQ ("coordz1", find_closest_string ("coorzd1", &candidates));
+
+  candidates.truncate (0);
+  candidates.safe_push ("DWARF_GNAT_ENCODINGS_GDB");
+  candidates.safe_push ("DWARF_GNAT_ENCODINGS_ALL");
+  candidates.safe_push ("DWARF_GNAT_ENCODINGS_MINIMAL");
+  ASSERT_STREQ ("DWARF_GNAT_ENCODINGS_ALL",
+		find_closest_string ("DWARF_GNAT_ENCODINGS_all",
+				     &candidates));
+
+  /* The same as the previous test, but with a different order of
+     candidates.  */
+  candidates.truncate (0);
+  candidates.safe_push ("DWARF_GNAT_ENCODINGS_ALL");
+  candidates.safe_push ("DWARF_GNAT_ENCODINGS_GDB");
+  candidates.safe_push ("DWARF_GNAT_ENCODINGS_MINIMAL");
+  ASSERT_STREQ ("DWARF_GNAT_ENCODINGS_ALL",
+		find_closest_string ("DWARF_GNAT_ENCODINGS_all",
+				     &candidates));
 }
 
 /* Test data for test_metric_conditions.  */
Kewen.Lin via Gcc-patches June 2, 2020, 6:31 p.m. | #8
On Mon, 2020-06-01 at 14:11 -0600, Tom Tromey wrote:
> > Did the full DejaGnu testsuite get run?  There are a lot of tests

> > in it

> > that make use of this code.

> 

> I did "make check" and only saw some XFAILs.

> 

> Here's v2 of the patch, which I think addresses your comments.  I did

> not add a new test of get_edit_distance, because as I mentioned

> earlier,

> an existing test already does what you asked for.

> 

> Tom

> 

> commit e897a99dada8d3935343ebf7b14ad7ec36515b3d

> Author: Tom Tromey <tromey@adacore.com>

> Date:   Fri May 29 10:46:57 2020 -0600

> 

>     Prefer simple case changes in spelling suggestions

>     

>     I got this error message when editing gcc and recompiling:

>     

>     ../../gcc/gcc/ada/gcc-interface/decl.c:7714:39: error:

> ‘DWARF_GNAT_ENCODINGS_all’ was not declared in this scope; did you

> mean ‘DWARF_GNAT_ENCODINGS_GDB’?

>      7714 |     = debug_info && gnat_encodings ==

> DWARF_GNAT_ENCODINGS_all;

>           |                                       ^~~~~~~~~~~~~~~~~~~

> ~~~~~

>           |                                       DWARF_GNAT_ENCODING

> S_GDB

>     

>     This suggestion could be improved -- what happened here is that I

>     failed to upper-case the word, and DWARF_GNAT_ENCODINGS_ALL was

> the

>     correct spelling.

>     

>     This patch changes gcc's spell checker to prefer simple case

> changes

>     when possible.

>     

>     I tested this using the self-tests.  A new self-test is also

> included.

>     

>     gcc/ChangeLog:

>     

>             * spellcheck.c (CASE_COST): New define.

>             (BASE_COST): New define.

>             (get_edit_distance): Recognize case changes.

>             (get_edit_distance_cutoff): Update.

>             (test_edit_distances): Update.

>             (get_old_cutoff): Update.

>             (test_find_closest_string): Add case sensitivity test.


Thanks; looks good to me.

Dave
Kewen.Lin via Gcc-patches June 2, 2020, 6:33 p.m. | #9
On Sat, 2020-05-30 at 18:51 +0000, Pip Cet wrote:
> On Sat, May 30, 2020 at 5:06 PM David Malcolm <dmalcolm@redhat.com>

> wrote:

> > On Sat, 2020-05-30 at 13:40 +0000, Pip Cet via Gcc-patches wrote:

> > > I think we should just omit the triangle inequality test from the

> > > self-test, as in the attached patch.

> > 

> > I like the idea,

> 

> Thanks!

> 

> > but can you update the comment so that it explains why

> > it's not a metric? (better to capture that in a comment in the

> > source

> > rather than just in the mailing list archives).

> 

> How's this?


Thanks; looks good to me.  Hopefully this doesn't clash with Tom's
patch.


Dave
Kewen.Lin via Gcc-patches June 5, 2020, 6:23 p.m. | #10
David Malcolm <dmalcolm@redhat.com> writes:

> On Sat, 2020-05-30 at 18:51 +0000, Pip Cet wrote:

>> How's this?

>

> Thanks; looks good to me.  Hopefully this doesn't clash with Tom's

> patch.


It doesn't, but I hope I got the commit message right this time.

(I don't have git access, so if someone could commit this if it's
approved, that would be great).
From 7e89a6be601b268b134e20c2702bd55945388a22 Mon Sep 17 00:00:00 2001
From: Pip Cet <pipcet@gmail.com>

Date: Sat, 30 May 2020 13:39:09 +0000
Subject: [PATCH] Don't test the triangle inequality in the spellcheck.c
 self-test

The variant of editing distance we use doesn't satisfy the triangle
inequality.

gcc/ChangeLog:

	* spellcheck.c (test_data): Add problematic strings.
	(test_metric_conditions): Don't test the triangle inequality
	condition, which our distance function does not satisfy.
---
 gcc/spellcheck.c | 22 ++++++++--------------
 1 file changed, 8 insertions(+), 14 deletions(-)

diff --git a/gcc/spellcheck.c b/gcc/spellcheck.c
index 9f7351f364f..45c41d7cef9 100644
--- a/gcc/spellcheck.c
+++ b/gcc/spellcheck.c
@@ -474,13 +474,17 @@ static const char * const test_data[] = {
   "food",
   "boo",
   "1234567890123456789012345678901234567890123456789012345678901234567890"
+  "abc",
+  "ac",
+  "ca",
 };
 
 /* Verify that get_edit_distance appears to be a sane distance function,
-   i.e. the conditions for being a metric.  This is done directly for a
-   small set of examples, using test_data above.  This is O(N^3) in the size
-   of the array, due to the test for the triangle inequality, so we keep the
-   array small.  */
+   even though it doesn't satisfy the conditions for being a metric.  (This
+   is because the triangle inequality fails to hold: the distance between
+   "ca" and "ac" is 1, and so is the distance between "abc" and "ac", but
+   the distance between "abc" and "ca" is 3. Algorithms that calculate the
+   true Levenshtein-Damerau metric are much more expensive.)  */
 
 static void
 test_metric_conditions ()
@@ -504,16 +508,6 @@ test_metric_conditions ()
 	  edit_distance_t dist_ji
 	    = get_edit_distance (test_data[j], test_data[i]);
 	  ASSERT_EQ (dist_ij, dist_ji);
-
-	  /* Triangle inequality.  */
-	  for (int k = 0; k < num_test_cases; k++)
-	    {
-	      edit_distance_t dist_ik
-		= get_edit_distance (test_data[i], test_data[k]);
-	      edit_distance_t dist_jk
-		= get_edit_distance (test_data[j], test_data[k]);
-	      ASSERT_TRUE (dist_ik <= dist_ij + dist_jk);
-	    }
 	}
     }
 }
-- 
2.27.0.rc0
Kewen.Lin via Gcc-patches July 1, 2020, 9 p.m. | #11
On Fri, 2020-06-05 at 18:23 +0000, Pip Cet via Gcc-patches wrote:
> David Malcolm <dmalcolm@redhat.com> writes:

> 

> > On Sat, 2020-05-30 at 18:51 +0000, Pip Cet wrote:

> > > How's this?

> > 

> > Thanks; looks good to me.  Hopefully this doesn't clash with Tom's

> > patch.

> 

> It doesn't, but I hope I got the commit message right this time.

> 

> (I don't have git access, so if someone could commit this if it's

> approved, that would be great).

Thanks.  Pushed.
jeff

Patch

diff --git a/gcc/spellcheck.c b/gcc/spellcheck.c
index 7891260a258..9002617453f 100644
--- a/gcc/spellcheck.c
+++ b/gcc/spellcheck.c
@@ -25,6 +25,12 @@  along with GCC; see the file COPYING3.  If not see
 #include "spellcheck.h"
 #include "selftest.h"
 
+/* Cost of a case transformation.  */
+#define CASE_COST 1
+
+/* Cost of another kind of edit.  */
+#define BASE_COST 2
+
 /* Get the edit distance between the two strings: the minimal
    number of edits that are needed to change one string into another,
    where edits can be one-character insertions, removals, or substitutions,
@@ -47,9 +53,9 @@  get_edit_distance (const char *s, int len_s,
     }
 
   if (len_s == 0)
-    return len_t;
+    return BASE_COST * len_t;
   if (len_t == 0)
-    return len_s;
+    return BASE_COST * len_s;
 
   /* We effectively build a matrix where each (i, j) contains the
      distance between the prefix strings s[0:j] and t[0:i].
@@ -67,7 +73,7 @@  get_edit_distance (const char *s, int len_s,
   /* The first row is for the case of an empty target string, which
      we can reach by deleting every character in the source string.  */
   for (int i = 0; i < len_s + 1; i++)
-    v_one_ago[i] = i;
+    v_one_ago[i] = i * BASE_COST;
 
   /* Build successive rows.  */
   for (int i = 0; i < len_t; i++)
@@ -83,21 +89,28 @@  get_edit_distance (const char *s, int len_s,
       /* The initial column is for the case of an empty source string; we
 	 can reach prefixes of the target string of length i
 	 by inserting i characters.  */
-      v_next[0] = i + 1;
+      v_next[0] = (i + 1) * BASE_COST;
 
       /* Build the rest of the row by considering neighbors to
 	 the north, west and northwest.  */
       for (int j = 0; j < len_s; j++)
 	{
-	  edit_distance_t cost = (s[j] == t[i] ? 0 : 1);
-	  edit_distance_t deletion     = v_next[j] + 1;
-	  edit_distance_t insertion    = v_one_ago[j + 1] + 1;
+	  edit_distance_t cost;
+
+	  if (s[j] == t[i])
+	    cost = 0;
+	  else if (TOLOWER (s[j]) == TOLOWER (t[i]))
+	    cost = CASE_COST;
+	  else
+	    cost = BASE_COST;
+	  edit_distance_t deletion     = v_next[j] + BASE_COST;
+	  edit_distance_t insertion    = v_one_ago[j + 1] + BASE_COST;
 	  edit_distance_t substitution = v_one_ago[j] + cost;
 	  edit_distance_t cheapest = MIN (deletion, insertion);
 	  cheapest = MIN (cheapest, substitution);
 	  if (i > 0 && j > 0 && s[j] == t[i - 1] && s[j - 1] == t[i])
 	    {
-	      edit_distance_t transposition = v_two_ago[j - 1] + 1;
+	      edit_distance_t transposition = v_two_ago[j - 1] + BASE_COST;
 	      cheapest = MIN (cheapest, transposition);
 	    }
 	  v_next[j + 1] = cheapest;
@@ -185,11 +198,11 @@  get_edit_distance_cutoff (size_t goal_len, size_t candidate_len)
   /* If the lengths are close, then round down.  */
   if (max_length - min_length <= 1)
     /* ...but allow an edit distance of at least 1.  */
-    return MAX (max_length / 3, 1);
+    return BASE_COST * MAX (max_length / 3, 1);
 
   /* Otherwise, round up (thus giving a little extra leeway to some cases
      involving insertions/deletions).  */
-  return (max_length + 2) / 3;
+  return BASE_COST * (max_length + 2) / 3;
 }
 
 #if CHECKING_P
@@ -228,47 +241,50 @@  test_get_edit_distance_both_ways (const char *a, const char *b,
 static void
 test_edit_distances ()
 {
-  test_get_edit_distance_both_ways ("", "nonempty", strlen ("nonempty"));
-  test_get_edit_distance_both_ways ("saturday", "sunday", 3);
-  test_get_edit_distance_both_ways ("foo", "m_foo", 2);
-  test_get_edit_distance_both_ways ("hello_world", "HelloWorld", 3);
+  test_get_edit_distance_both_ways ("", "nonempty",
+				    BASE_COST * strlen ("nonempty"));
+  test_get_edit_distance_both_ways ("saturday", "sunday",
+				    BASE_COST * 3);
+  test_get_edit_distance_both_ways ("foo", "m_foo", BASE_COST * 2);
+  test_get_edit_distance_both_ways ("hello_world", "HelloWorld", 4);
   test_get_edit_distance_both_ways
-    ("the quick brown fox jumps over the lazy dog", "dog", 40);
+    ("the quick brown fox jumps over the lazy dog", "dog", BASE_COST * 40);
   test_get_edit_distance_both_ways
     ("the quick brown fox jumps over the lazy dog",
      "the quick brown dog jumps over the lazy fox",
-     4);
+     BASE_COST * 4);
   test_get_edit_distance_both_ways
     ("Lorem ipsum dolor sit amet, consectetur adipiscing elit,",
      "All your base are belong to us",
-     44);
+     BASE_COST * 44);
   test_get_edit_distance_both_ways ("foo", "FOO", 3);
-  test_get_edit_distance_both_ways ("fee", "deed", 2);
-  test_get_edit_distance_both_ways ("coorzd1", "coordx1", 2);
-  test_get_edit_distance_both_ways ("assert", "sqrt", 3);
-  test_get_edit_distance_both_ways ("PATH_MAX", "INT8_MAX", 3);
-  test_get_edit_distance_both_ways ("time", "nice", 2);
-  test_get_edit_distance_both_ways ("bar", "carg", 2);
+  test_get_edit_distance_both_ways ("fee", "deed", BASE_COST * 2);
+  test_get_edit_distance_both_ways ("coorzd1", "coordx1", BASE_COST * 2);
+  test_get_edit_distance_both_ways ("assert", "sqrt", BASE_COST * 3);
+  test_get_edit_distance_both_ways ("PATH_MAX", "INT8_MAX", BASE_COST * 3);
+  test_get_edit_distance_both_ways ("time", "nice", BASE_COST * 2);
+  test_get_edit_distance_both_ways ("bar", "carg", BASE_COST * 2);
   test_get_edit_distance_both_ways ("gtk_widget_show_all",
-				    "GtkWidgetShowAll", 7);
-  test_get_edit_distance_both_ways ("m_bar", "bar", 2);
-  test_get_edit_distance_both_ways ("MACRO", "MACRAME", 3);
-  test_get_edit_distance_both_ways ("ab", "ac", 1);
-  test_get_edit_distance_both_ways ("ab", "a", 1);
-  test_get_edit_distance_both_ways ("a", "b", 1);
-  test_get_edit_distance_both_ways ("nanl", "name", 2);
-  test_get_edit_distance_both_ways ("char", "bar", 2);
-  test_get_edit_distance_both_ways ("-optimize", "fsanitize", 5);
-  test_get_edit_distance_both_ways ("__DATE__", "__i386__", 4);
+				    "GtkWidgetShowAll", 10);
+  test_get_edit_distance_both_ways ("m_bar", "bar", BASE_COST * 2);
+  test_get_edit_distance_both_ways ("MACRO", "MACRAME", BASE_COST * 3);
+  test_get_edit_distance_both_ways ("ab", "ac", BASE_COST * 1);
+  test_get_edit_distance_both_ways ("ab", "a", BASE_COST * 1);
+  test_get_edit_distance_both_ways ("a", "b", BASE_COST * 1);
+  test_get_edit_distance_both_ways ("nanl", "name", BASE_COST * 2);
+  test_get_edit_distance_both_ways ("char", "bar", BASE_COST * 2);
+  test_get_edit_distance_both_ways ("-optimize", "fsanitize", BASE_COST * 5);
+  test_get_edit_distance_both_ways ("__DATE__", "__i386__", BASE_COST * 4);
 
   /* Examples where transposition helps.  */
-  test_get_edit_distance_both_ways ("ab", "ba", 1);
-  test_get_edit_distance_both_ways ("ba", "abc", 2);
-  test_get_edit_distance_both_ways ("coorzd1", "coordz1", 1);
+  test_get_edit_distance_both_ways ("ab", "ba", BASE_COST * 1);
+  test_get_edit_distance_both_ways ("ba", "abc", BASE_COST * 2);
+  test_get_edit_distance_both_ways ("coorzd1", "coordz1", BASE_COST * 1);
   test_get_edit_distance_both_ways ("abcdefghijklmnopqrstuvwxyz",
-				    "bacdefghijklmnopqrstuvwxzy", 2);
-  test_get_edit_distance_both_ways ("saturday", "sundya", 4);
-  test_get_edit_distance_both_ways ("signed", "singed", 1);
+				    "bacdefghijklmnopqrstuvwxzy",
+				    BASE_COST * 2);
+  test_get_edit_distance_both_ways ("saturday", "sundya", BASE_COST * 4);
+  test_get_edit_distance_both_ways ("signed", "singed", BASE_COST * 1);
 }
 
 /* Subroutine of test_get_edit_distance_cutoff, for emulating the
@@ -277,7 +293,7 @@  test_edit_distances ()
 static edit_distance_t
 get_old_cutoff (size_t goal_len, size_t candidate_len)
 {
-  return MAX (goal_len, candidate_len) / 2;
+  return BASE_COST * MAX (goal_len, candidate_len) / 2;
 }
 
 /* Verify that the cutoff for "meaningfulness" of suggestions is at least as
@@ -428,6 +444,24 @@  test_find_closest_string ()
   candidates.safe_push("coordy1");
   candidates.safe_push("coordz1");
   ASSERT_STREQ ("coordz1", find_closest_string ("coorzd1", &candidates));
+
+  candidates.truncate (0);
+  candidates.safe_push ("DWARF_GNAT_ENCODINGS_GDB");
+  candidates.safe_push ("DWARF_GNAT_ENCODINGS_ALL");
+  candidates.safe_push ("DWARF_GNAT_ENCODINGS_MINIMAL");
+  ASSERT_STREQ ("DWARF_GNAT_ENCODINGS_ALL",
+		find_closest_string ("DWARF_GNAT_ENCODINGS_all",
+				     &candidates));
+
+  /* The same as the previous test, but with a different order of
+     candidates.  */
+  candidates.truncate (0);
+  candidates.safe_push ("DWARF_GNAT_ENCODINGS_ALL");
+  candidates.safe_push ("DWARF_GNAT_ENCODINGS_GDB");
+  candidates.safe_push ("DWARF_GNAT_ENCODINGS_MINIMAL");
+  ASSERT_STREQ ("DWARF_GNAT_ENCODINGS_ALL",
+		find_closest_string ("DWARF_GNAT_ENCODINGS_all",
+				     &candidates));
 }
 
 /* Test data for test_metric_conditions.  */