#17055 closed enhancement (fixed)
[Patch] Validator should complain about simple typos in tag values
Reported by: | GerdP | Owned by: | team |
---|---|---|---|
Priority: | normal | Milestone: | 19.03 |
Component: | Core validator | Version: | |
Keywords: | template_report | Cc: |
Description
What steps will reproduce the problem?
- Create way with typo like highway=residental (missing 2nd i)
- Run validator
What is the expected result?
Warning Misspelled property value - Value 'residental' for key 'highway' looks like 'residential'. (1)
What happens instead?
Only an informational message is produced:
Presets do not contain property value - Value 'residental' for key 'highway' not in presets. (1)
Please provide any additional information below. Attach a screenshot if possible.
See also https://josm.openstreetmap.de/ticket/15203#comment:31
The attached patch uses Utils.getLevenshteinDistance() to detect typical typos and propose a fix.
I found one possible problem so far:
When you type highway=servics it suggests highway=services
Build-Date:2018-12-01 17:14:34 Revision:14478 Is-Local-Build:true Identification: JOSM/1.5 (14478 SVN en) Windows 10 64-Bit OS Build number: Windows 10 Home 1803 (17134) Memory Usage: 1437 MB / 3641 MB (388 MB allocated, but free) Java version: 1.8.0_191-b12, Oracle Corporation, Java HotSpot(TM) 64-Bit Server VM Screen: \Display0 1920x1080 Maximum Screen Size: 1920x1080 VM arguments: [-agentlib:jdwp=transport=dt_socket,suspend=y,address=localhost:51821, -ea, -Dfile.encoding=UTF-8] Dataset consistency test: No problems found Plugins: + OpeningHoursEditor (34535) + apache-commons (34506) + buildings_tools (34724) + download_along (34503) + ejml (34389) + geotools (34513) + jaxb (34506) + jts (34524) + merge-overlap (34664) + o5m (34405) + opendata (34698) + pbf (34576) + poly (34546) + reverter (34552) + undelete (34568) + utilsplugin2 (34506) Last errors/warnings: - W: Update plugins - org.openstreetmap.josm.plugins.PluginHandler$UpdatePluginsMessagePanel[,0,0,0x0,invalid,layout=java.awt.GridBagLayout,alignmentX=0.0,alignmentY=0.0,border=,flags=9,maximumSize=,minimumSize=,preferredSize=] - W: No configuration settings found. Using hardcoded default values for all pools. - W: Failed to locate resource '/README'. - W: Failed to locate resource '/CONTRIBUTION'. - W: Failed to locate resource '/LICENSE'.
Attachments (11)
Change History (74)
by , 6 years ago
Attachment: | 17055-v1.patch added |
---|
comment:3 by , 6 years ago
Replying to Don-vip:
If we go this way we should clean the
words.cfg
file (TagChecker.SPELL_FILE
) and make sure there is no regression.
Remove it? That has not be maintained for ages. I think this is from the time before we had real presets. The approach of this tickets sounds more like a modern approach comparing the input to valid data.
comment:4 by , 6 years ago
Removing it is the ultimate cleaning :) We just need to check we don't loose functionality (by sampling values).
comment:5 by , 6 years ago
My understanding is that words.cfg is only used for tags with misspelled keys, not for wrong tag values with known keys.
Try with e.g. highway=residentail
by , 6 years ago
Attachment: | filter_words.patch added |
---|
small hack to create a reduced version of words.cfg
by , 6 years ago
Attachment: | shrink-words.cfg.patch added |
---|
patch to reduce and update words.cfg (some values added, many ignored values removed)
by , 6 years ago
Attachment: | 17055-v2.patch added |
---|
comment:6 by , 6 years ago
17055-v2.patch improves:
- reduced words.cfg
- reduced memory footprint because harmonizedKeys no longer contains pairs where key and value are equal
- if more than one alternative for a tag value is found, all are listed and auto fix is disabled
- updated unit tests
My my gut feeling is that Utils.getLevenshteinDistance() will not work for the keys, but I'll try to code it later.
comment:7 by , 6 years ago
As expected we will get many false positives when we simply use all tag keys in the presets in combination with getLevenshteinDistance():
Key 'draft' looks like 'craft'. (3) Key 'link' looks like 'line'. (1) Key 'ref_name' looks like 'reg_name'. (59) Key 'sat' looks like 'salt'. (49) Key 'type2' looks like 'type'. (1)
Also performance is poor compared to the current code because we have to calculate the Levenshtein distance ~600 times for each key of each primitive when we use all keys from the presets.
I see possibilities:
1) Don't use getLevenshteinDistance() with keys
2) Specify a list of keys for which the getLevenshteinDistance() can be used. That could be a simple list or maybe an attribute in the presets xml.
I think I still prefer the solution implemented with 17055-v2.patch
comment:10 by , 6 years ago
How to disable clear false positives like Key 'ref_name' looks like 'reg_name'. (59)
? Simply add ref_name as correctly spelled key to words.cfg?
comment:11 by , 6 years ago
These wrong positives were reported when I tried to ignore words.cfg (did not post a patch for that). You should not see them with r14490.
On the other hand I just noticed that typo highway=residentail is still not flagged :-(
by , 6 years ago
Attachment: | 17055-enh-v1.patch added |
---|
follow-up: 13 comment:12 by , 6 years ago
Resolution: | fixed |
---|---|
Status: | closed → reopened |
With 17055-enh-v2.patch I tried to improve TagChecker so that it also suggests a fix when the harmonized tag value looks like a truncated value listed in the presets.
Examples:
highway=u -> highway=unclassified
Problem: Many values are ambiguous, e.g.
highway=res could be residential or rest_area
highway=ser could be serivce or services
highway=tr could be track, trunk, or trunk_link
In those cases a warning with "Ambiguous truncated property value" is produced.
My goal here is not to find the most likely value but to produce a warning instead of information.
What do you think about that?
There is still a problem with the getLevenshteinDistance(). I see possibly wrong warnings like
Value '125' for key 'fire_hydrant:diameter' looks like one of [100, 150, 250].
Value 'none' for key 'sidewalk' looks like 'no'.
It is probably not a good idea to use that for short values.
comment:13 by , 6 years ago
Replying to GerdP:
There is still a problem with the getLevenshteinDistance(). I see possibly wrong warnings like
Value '125' for key 'fire_hydrant:diameter' looks like one of [100, 150, 250].
fire_hydrant:diameter=125 is in the presets. The getLevenshteinDistance() should only be tested on tags which are not in the presets.
Value 'none' for key 'sidewalk' looks like 'no'.
none
is not "officially deprecated", but no
is the preferred value, so this warning would be ok in my opinion.
follow-up: 15 comment:14 by , 6 years ago
125 is not in the defaultpresets.xml. See
<combo key="fire_hydrant:diameter" text="Diameter (in mm)" values="50,80,100,150,200,250,300,400" />
Taginfo says that sidewalk=no and sidewalk=none are nearly equally used. Maybe we can have both in the presets?
comment:15 by , 6 years ago
Replying to GerdP:
125 is not in the defaultpresets.xml.
Right, sorry, I mixed it up with 150.
Taginfo says that sidewalk=no and sidewalk=none are nearly equally used. Maybe we can have both in the presets?
Looking at the history of the sidewalk wiki page the preferred tag changed several times over the years. We should have only one in our preset and stay with the current no
.
comment:17 by , 6 years ago
I didn't commit the changes reg. truncated values, they produced more noise than helpful warnings.
What I'd really like to have would be a flag that says if the list of preset values is quite complete and therefore other values are probably wrong.
This would allow to produce a warning when the value is not in the presets.
I simply don't like that we only produce an information for well known tag keys like highway, e.g. highway=dirt, highway=footpath or highway=u when we do all kinds of complex checks for highways like "Way end node near other highway".
An alternative woulld be to write a MapCSS check which lists all the values that appear in the presets, but I think that would mean a duplication of the lists.
I think a MapCSS rule is needed for something like highway=driveway, where the validator should suggest to replace it with
highway=service + service=driveway?
comment:18 by , 6 years ago
I am unsure if I am the only person who sometimes press Enter too quickly (before auto-completing a value with tab), but I usually have wrong surface=a
or surface=u
values in my data set.
For surface=u
it's suggesting mud
, while it most probably should be unpaved
Is it possible to have surface=a
→ surface=asphalt
too?
comment:19 by , 6 years ago
I coded something like that but was not satisfied with the results.
Can you explain why you have to press tab for autocompletion? I don't need to do that.
comment:20 by , 6 years ago
Next time I promise I will test exactly what I do before writing :-)
Indeed, I don't use tab but somehow I end up with surface=a
and surface=u
objects.
(and I will create some local rules to catch this an error)
Rectifying my previous message then, probably surface=u
shouldn't be understood as surface=mud
(nor offered an auto-fix for this specific case)
comment:21 by , 6 years ago
OK, I think JOSM should not suggest to fix a value with only one character with one that has three.
Reg. surface=a : You are right, it is probably always asphalt, but presets also contain artificial_turf, so JOSM would only show a list of these possible candidates, but it would not enable the Fix button. That's why I wasn't satisfied with my approach.
See also comment:12.
follow-up: 24 comment:23 by , 6 years ago
A way with sidewalk=none
is giving a warning saying that none
seems to be no
, but https://wiki.openstreetmap.org/wiki/Sidewalks#Sidewalk_as_refinement_to_a_highway says that "none
is also used synonymously to no
"
Probably there shouldn't be a warning for this case too.
comment:24 by , 6 years ago
Replying to naoliv:
A way with
sidewalk=none
is giving a warning saying thatnone
seems to beno
, but https://wiki.openstreetmap.org/wiki/Sidewalks#Sidewalk_as_refinement_to_a_highway says that "none
is also used synonymously tono
"
Probably there shouldn't be a warning for this case too.
I also noticed this, see comment:13 to comment:15. I also don't like the warning in this case, but I am unsure how to handle it.
I want to see the warning when one types e.g. oneway=none so one possible solution would be to add a line
K:sidewalk:no
to the end of file ignoretags.cfg.
comment:26 by , 6 years ago
shop=gas is warning with Value 'gas' for key 'shop' looks like one of [bag, car, yes].
Gerd, will this warning magically disappear if a preset is created for shop=gas
?
comment:27 by , 6 years ago
Yes, I think so. Still, this should also be handled in the code. Looking at it...
comment:29 by , 6 years ago
I wonder if we should flag all tag values with just one character (not a digit), so that we produce a warning instead of information for
e.g. highway=u or surface=a, but only for known preset keys.
We have a similar test that says "uncommon short key" for short unknown keys, e.g. for s=asphalt
comment:30 by , 6 years ago
shop=pasta
is giving a warning saying that it looks like pastry
, but they are different:
https://wiki.openstreetmap.org/wiki/Tag:shop%3Dpasta
https://wiki.openstreetmap.org/wiki/Tag:shop%3Dpastry
shop=party
is giving a warning saying that it looks like [art, pastry]
while it also shouldn't
follow-up: 36 comment:31 by , 6 years ago
I fear you will always find more here because the presets will probably never cover all cases.
Maybe we could create a list of "good" values for which we don't have a preset using taginfo? I think we already have a script TagInfoExtract.groovy
to find possibly missing presets, but it seems the result is not stored anywhere.
comment:32 by , 6 years ago
The only thing that I fear is people pressing Fix
without actually taking a look at the message and at the value (in this example, changing pasta
to pastry
)
Maybe it's safer to disable the autofix and only issue a warning?
Or, I don't know if it's possible, to have a fix option only for cases where there is no doubt at all.
follow-up: 39 comment:35 by , 6 years ago
Maybe I should also change the message text to something like
"Value '...' for key '...' is not in presets, did you mean '...'?"
follow-up: 37 comment:36 by , 6 years ago
Replying to GerdP:
I fear you will always find more here because the presets will probably never cover all cases.
Maybe we could create a list of "good" values for which we don't have a preset using taginfo? I think we already have a scriptTagInfoExtract.groovy
to find possibly missing presets, but it seems the result is not stored anywhere.
"Good" values which don't have a preset yet (e.g. due to low usage numbers like shop=pasta) can be added to source:/josm/trunk/data/validator/ignoretags.cfg Does your misspelled values test consider this list?
I think TagInfoExtract.groovy
is used only for the integration test. (Currently tags without wiki page are ignored too, but this constraint may be removed after adding some tags to the presets and feeding the ignore list a bit, see #16900.)
follow-up: 38 comment:37 by , 6 years ago
Replying to Klumbumbus:
"Good" values which don't have a preset yet (e.g. due to low usage numbers like shop=pasta) can be added to source:/josm/trunk/data/validator/ignoretags.cfg Does your misspelled values test consider this list?
Well, it is not executed for the values in this list. Might not be the best way to consider the list.
I think
TagInfoExtract.groovy
is used only for the integration test. (Currently tags without wiki page are ignored too, but this constraint may be removed after adding some tags to the presets and feeding the ignore list a bit, see #16900.)
The list doesn't contain the examples shop=party
or shop=pastry
, so we'd probably need a longer list.
comment:38 by , 6 years ago
Replying to GerdP:
Replying to Klumbumbus:
"Good" values which don't have a preset yet (e.g. due to low usage numbers like shop=pasta) can be added to source:/josm/trunk/data/validator/ignoretags.cfg Does your misspelled values test consider this list?
Well, it is not executed for the values in this list. Might not be the best way to consider the list.
Yes, with "consider" I meant that the test is not executed for the values in this list.
I think
TagInfoExtract.groovy
is used only for the integration test. (Currently tags without wiki page are ignored too, but this constraint may be removed after adding some tags to the presets and feeding the ignore list a bit, see #16900.)
The list doesn't contain the examples
shop=party
orshop=pastry
, so we'd probably need a longer list.
Yes these two (and probably some more we don't know atm) should be added to the list to remove the false positive warnings.
comment:39 by , 6 years ago
Replying to GerdP:
Maybe I should also change the message text to something like
"Value '...' for key '...' is not in presets, did you mean '...'?"
That sounds better. (Maybe ...in the presets...)
comment:40 by , 6 years ago
OK, will change it. Reg. the list I think about a different approach:
We could generate a list of all frequently used tags and run them through TagChecker. All false positives should be added to
a special list which would than be used to suggest correct spelling. The test is not so much about correct tagging, so I would not care if e.g. sidewalk=none or sidewalk=no is in the presets as long as both values are often used.
How does that sound?
comment:41 by , 6 years ago
Sounds good, however it might be hard to find an proper automatic rule. How to define "often used"?
follow-up: 45 comment:42 by , 6 years ago
Probably there is no simple answer. As a first approach I'd say all tags used > 1000 times are often used. Let me code something to process such a list, I probably cannot help filling it.
Thought about the warning text again. The "you" is not so good when someone else created the tag.
Next try:
single value in {2} : "Value ''{0}'' for key ''{1}'' is not in the presets, maybe ''{2}'' is meant?" multiple values in {2}: "Value ''{0}'' for key ''{1}'' is not in the presets, maybe one of {2} is meant?"
by , 6 years ago
Attachment: | 17055-use-ignore.patch added |
---|
use values with prefix K: for checks with Levenshtein distance, too
comment:43 by , 6 years ago
With the 17055-use-ignore.patch the lines from ignoretags.cfg starting with K: are treated as "often used".
So, if you add K:shop=party to it
and run the validator with shop=party it will be quiet, for shop=panty it will show
Unknown property value - Value 'panty' for key 'shop' is not in the presets, maybe 'party' is meant? (1)
and for shop=pasty you get
Unknown property value - Value 'pasty' for key 'shop' is not in the presets, maybe one of [party, pastry] is meant? (1)
open questions:
1) Should we use another file for the often used values? If yes, should we use the "K:" prefix?
2) Should we always raise a warning when a tag key is found in that list but not the value?
Depending on this the message texts would change again, I think "not in the presets" would then be "unknown".
comment:44 by , 6 years ago
Milestone: | → 19.01 |
---|
comment:45 by , 6 years ago
To determine frequently used tags, you could query Taginfo. For instance /api/4/tags/popular: https://taginfo.openstreetmap.org/api/4/tags/popular?page=1&rp=999&sortname=count_all&sortorder=desc
comment:46 by , 6 years ago
Yes, it is close to what we need, but it stops for value count < 10000. I think that's too early.
comment:47 by , 6 years ago
Alternatively, you could download the entire SQLite database from https://taginfo.openstreetmap.org/download and perform custom queries.
by , 6 years ago
Attachment: | frequent-tags.zip added |
---|
comment:48 by , 6 years ago
Thanks for the hint and sorry for the late answer. I've downloaded the complete DB (1.1GB) and played with a few SQL to extract frequently used tags.
My current thinking is this:
Execute a rather simple sql like
'select * from tags where count_all > 1000';
or maybe a bit more complex:
select * from tags where count_all > 1000 and key not like '%name%' and key not like '%source%' and key not like 'tiger%' and key not like 'yh:%' and key not like 'addr:%' and key not like 'is_in%' and key not like 'contact:%' and key not like 'turn:lanes%' and key not in ('circumference','color','colour','created_by','description','ele','fixme','height','width','level','opening_hours','website');
to extract a list of frequently used tags, lets name it frequent-tags.txt. Attached is the result for the latter.
The table tags looks like this:
CREATE TABLE tags ( key VARCHAR, value VARCHAR, count_all INTEGER DEFAULT 0, count_nodes INTEGER DEFAULT 0, count_ways INTEGER DEFAULT 0, count_relations INTEGER DEFAULT 0, in_wiki INTEGER DEFAULT 0, in_wiki_en INTEGER DEFAULT 0 );
The result contains the columns separated by |
.
My current thinking is this: I could either try to improve the sql so that it filters more tags which are not usefull for the Levenshtein text ("val looks like ...") so that we can add that result directly to the data\validator directory and parse it on each start of JOSM like we do with ignoretags.cfg
. The result could be some HashSets, maybe separated for nodes, ways, and relations, which would be used to suggest different spelling.
I'll try to code that now. Question is how the input file is updated.
I think it would be best if that file would be created on the taginfo server once we know exactly how the format should look like?
comment:49 by , 6 years ago
I've played with this for a while now. I have a list of frequently used tags where the values are strings. Many of them are also in the presets, BUT also many are not.
I don't think that I can simply use the list from taginfo without review, as it sometimes contains bad data from imports.
Example: leaf_type=broadl-leaved appears > 27000 times but is simply wrong (should be broadleaved)
I don't want to produce a hint like
Unknown property value - Value 'broadlleaved' for key 'leaf_type' is unknown, maybe one of [broadl-leaved, broadleaved] is meant? (1)
So the list has to be checked carefully.
I fear this will take some more days to produce a good start.
I think about involving the community here.
follow-up: 53 comment:52 by , 6 years ago
I think about adding these keys to ignoredtags.cfg (with prefix K:)
building=shop building=chalet building=clinic crop=rape crossing=marked crossing=controlled highway=trail highway=ford indoor=door natural=land natural=hill natural=arete natural=slope piste:grooming=classic+skating power=sub_station product=wine railway=derail railway=radio resource=sand shop=trade sidewalk=none sport=skating surface=hard
They all produce false warnings:
Value 'classic+skating' for key 'piste:grooming' is unknown, maybe 'classic;skating' is meant? (1) Value 'clinic' for key 'building' is unknown, maybe 'civic' is meant? (1) Value 'controlled' for key 'crossing' is unknown, maybe 'uncontrolled' is meant? (1) Value 'door' for key 'indoor' is unknown, maybe 'room' is meant? (1) Value 'ford' for key 'highway' is unknown, maybe 'road' is meant? (1) Value 'hard' for key 'surface' is unknown, maybe 'sand' is meant? (1) Value 'none' for key 'sidewalk' is unknown, maybe 'no' is meant? (1) Value 'radio' for key 'railway' is unknown, maybe 'rail' is meant? (1) Value 'rape' for key 'crop' is unknown, maybe 'grape' is meant? (1) Value 'sand' for key 'resource' is unknown, maybe 'salt' is meant? (1) Value 'skating' for key 'sport' is unknown, maybe 'karting' is meant? (1) Value 'slope' for key 'natural' is unknown, maybe 'stone' is meant? (1) Value 'sub_station' for key 'power' is unknown, maybe 'substation' is meant? (1) Value 'trade' for key 'shop' is unknown, maybe 'frame' is meant? (1) Value 'wine' for key 'product' is unknown, maybe 'lime' is meant? (1)
Any problems with that?
by , 6 years ago
Attachment: | extract_1000.sql added |
---|
by , 6 years ago
Attachment: | extract.script added |
---|
comment:55 by , 6 years ago
To find this list I did this:
1) install Sqlite3: (unzip sqlite-tools-win32-x86-3260000.zip and sqlite-dll-win64-x64-3260000.zip)
2) download and uncopress taginfo db.db.bz2 from https://taginfo.openstreetmap.org/download/taginfo-db.db.bz2(uncompressed size is ~6.5GB)
3) create attached simple sql and script. I also created a much more complex sql to filter all kinds of tags like names etc but decided that it would be difficult to maintain.
4) run Sqlite3 with sqlite3 -readonly -init extract.script taginfo-db.db
5) type ".quit" followed by ENTER to get out of Sqlite3 (I did not yet dind a way to automate that)
6) copy the result file frequent-tags.cfg (~ 2.6 MB) to \josm\nodist\data
7) execute unit test TaginfoTestIT with attached patch TaginfoTestIT.patch
This produces a list of possible false positives which has to be checked by someone who really understands the meaning of the tags,
I see no way to automate that last step. One has to know the tags and maybe also the results produced by other JOSM tests to decide what to do with this list. Maybe we could feed that list to a web page for review by the community?
The rest might be automated, esp. with a little help from taginfo to avoid the download of the full db.
comment:56 by , 6 years ago
Milestone: | 19.01 → 19.02 |
---|
comment:57 by , 6 years ago
Milestone: | 19.02 → 19.03 |
---|
comment:61 by , 6 years ago
I left it open for possible false positives. We can also close this ticket and maybe create a new one which collects the false positives.
comment:62 by , 6 years ago
Resolution: | → fixed |
---|---|
Status: | reopened → closed |
Yes. The main implementation was done. If new false positives pop up we can use new tickets or simply reference this one.
by , 5 years ago
Attachment: | TaginfoTestIT.patch added |
---|
If we go this way we should clean the
words.cfg
file (TagChecker.SPELL_FILE
) and make sure there is no regression.