#17595 closed defect (fixed)
False positive "Tag value contains non-printing character" for Persian script
Reported by: | Claudius | Owned by: | Don-vip |
---|---|---|---|
Priority: | normal | Milestone: | 19.04 |
Component: | Core validator | Version: | |
Keywords: | template_report persian unicode regression | Cc: |
Description (last modified by )
What steps will reproduce the problem?
- Load osmwww:node/3305202081 (Tested with node version v8 which contains the tag wikipedia=fa:منجیلآباد )
- Run validator
What is the expected result?
No "Tag value contains non-printing character" warning to be shown
What happens instead?
"Tag value contains non-printing character" is shown. The name Manjilabad written in Persian script منجیلآباد are perfectly valid and printable characters.
Relative:URL: ^/trunk
Repository:UUID: 0c6e7542-c601-0410-84e7-c038aed88b3b
Last:Changed Date: 2019-04-11 21:18:16 +0200 (Thu, 11 Apr 2019)
Revision:14986
Build-Date:2019-04-12 01:30:51
URL:https://josm.openstreetmap.de/svn/trunk

Identification: JOSM/1.5 (14986 de) Windows 10 64-Bit
OS Build number: Windows 10 Home 1809 (17763)
Memory Usage: 653 MB / 2048 MB (142 MB allocated, but free)
Java version: 11.0.1+13, AdoptOpenJDK, OpenJDK 64-Bit Server VM
Screen: \Display0 2736x1824
Maximum Screen Size: 2736x1824
Dataset consistency test: No problems found

Plugins:
+ OpeningHoursEditor (34867)

Tagging presets:
+ https://josm.openstreetmap.de/josmfile?page=Presets/OpenPisteMap&zip=1
+ https://josm.openstreetmap.de/josmfile?page=Presets/OneClick&zip=1
+ https://josm.openstreetmap.de/josmfile?page=Presets/Iranian_Preset&zip=1

Map paint styles:
+ https://josm.openstreetmap.de/josmfile?page=Styles/MaxspeedIcons&zip=1
+ https://josm.openstreetmap.de/josmfile?page=Styles/Lane_and_Road_Attributes&zip=1
+ https://josm.openstreetmap.de/josmfile?page=Styles/LayerChecker&zip=1
+ https://josm.openstreetmap.de/josmfile?page=Styles/AdvertisingStyle&zip=1
+ https://raw.githubusercontent.com/species/josm-preset-traffic_sign_direction/master/direction.mapcss
+ https://github.com/osmlab/appledata/archive/josm_paint_inline_validation.zip

Last errors/warnings:
- W: No configuration settings found. Using hardcoded default values for all pools.
Attachments (0)
Change History (20)
comment:1 by , 6 years ago
comment:2 by , 6 years ago
Keywords: | unicode block persian added |
---|---|
Owner: | changed from | to
Status: | new → assigned |
There is no autofix for this check. It is impossible to guess if the string contains an extra character that must be deleted or if someone used a wrong character instead of the correct one.
comment:3 by , 6 years ago
There is an autofix. I am just not sure whether it works for this case, where left-to-right and right-to-left text are mixed.
comment:4 by , 6 years ago
Description: | modified (diff) |
---|
comment:5 by , 6 years ago
Keywords: | unicode block removed |
---|
Ah I confused with another check. With the autofix "fa:منجیلآباد" becomes "fa:منجیلآباد".
comment:6 by , 6 years ago
I guess the fix corrupts data?
before autofix: https://fa.wikipedia.org/wiki/%D9%85%D9%86%D8%AC%DB%8C%D9%84%E2%80%8C%D8%A2%D8%A8%D8%A7%D8%AF exists
after autofix: https://fa.wikipedia.org/wiki/%D9%85%D9%86%D8%AC%DB%8C%D9%84%D8%A2%D8%A8%D8%A7%D8%AF does not
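The two links differ by a single percent-encoded sequence, `%E2%80%8C`, which is the UTF-8 encoding of U+200C (zero-width non-joiner). A minimal Python sketch to confirm this, using only the standard library; the variable names are illustrative and this is not part of JOSM:

```python
from urllib.parse import unquote
import unicodedata

# The two URLs quoted in this comment
before = ("https://fa.wikipedia.org/wiki/"
          "%D9%85%D9%86%D8%AC%DB%8C%D9%84%E2%80%8C%D8%A2%D8%A8%D8%A7%D8%AF")
after = ("https://fa.wikipedia.org/wiki/"
         "%D9%85%D9%86%D8%AC%DB%8C%D9%84%D8%A2%D8%A8%D8%A7%D8%AF")

# Decode the percent-encoding and keep only the article title
title_before = unquote(before).rsplit("/", 1)[1]
title_after = unquote(after).rsplit("/", 1)[1]

# Characters present in the original title but missing after the autofix
removed = [c for c in title_before if c not in title_after]
for c in removed:
    print(f"U+{ord(c):04X} {unicodedata.name(c)}")
```

The only character reported is the zero-width non-joiner, so the two titles are different strings even though they can look alike on screen.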
comment:7 by , 6 years ago
Please double check. For me both links work and I think they both link to the same page.
follow-up: 9 comment:8 by , 6 years ago
Please double check. For me both links work and I think they both link to the same page.
For me the first links to https://fa.wikipedia.org/wiki/%D9%85%D9%86%D8%AC%DB%8C%D9%84%E2%80%8C%D8%A2%D8%A8%D8%A7%D8%AF with the article and second to https://fa.wikipedia.org/wiki/%D9%85%D9%86%D8%AC%DB%8C%D9%84%D8%A2%D8%A8%D8%A7%D8%AF with "There is currently no text in this page."
comment:9 by , 6 years ago
Replying to mkoniecz:
Please double check. For me both links work and I think they both link to the same page.
For me the first links to https://fa.wikipedia.org/wiki/%D9%85%D9%86%D8%AC%DB%8C%D9%84%E2%80%8C%D8%A2%D8%A8%D8%A7%D8%AF with the article and second to https://fa.wikipedia.org/wiki/%D9%85%D9%86%D8%AC%DB%8C%D9%84%D8%A2%D8%A8%D8%A7%D8%AF with "There is currently no text in this page."
Yes, sorry, my fault. I think now I understand better what dyslexics means :O
For me both pages looked similar and I did not even try to find out what the words mean :(
comment:10 by , 6 years ago
Hi guys, the problem is with the "zero-width non-joiner" (ZWNJ) character, also known as "half space", in the 'wikipedia' tag. Please read this wiki page about the character: https://en.wikipedia.org/wiki/Zero-width_non-joiner
We do not always use this character when writing (typing) Persian, but I think it is a valid character in Farsi.
PS: I speak Persian/Farsi, so feel free to ask any questions about it.
Regards
comment:11 by , 6 years ago
I don't speak or write English very well, so I'm very sorry.
but, in Iran and other country(like Afghanistan and etc.), we use "Zero-width non-joiner" in writing . usually Zero-width non-joiner used in names. ex: "محمودآباد". but sometimes we used Zero-width non-joiner in many words like "ثبتنام" or "میخواهم" or other words.
Please fix it.
Thanks.
comment:12 by , 6 years ago
@Vincent: Maybe we can change the test so that it only creates the error when the "fixed" value contains only latin1 characters?
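A rough Python sketch of the suggested restriction, assuming "latin1" means code points up to U+00FF. The function names and the warning policy are hypothetical and do not reflect JOSM's actual validator code:

```python
def is_latin1_only(value: str) -> bool:
    """Return True if every character lies in the Latin-1 range U+0000..U+00FF."""
    return all(ord(ch) <= 0xFF for ch in value)


def should_warn(original: str, fixed: str) -> bool:
    """Hypothetical policy: warn only if the autofix changed the value
    and the fixed value consists solely of Latin-1 characters."""
    return fixed != original and is_latin1_only(fixed)


# A Latin-1 value with a stray zero-width space (U+200B) would still warn
print(should_warn("caf\u200be", "cafe"))

# The Persian tag value from this ticket, written with escapes so the
# zero-width non-joiner (U+200C) is visible in the source
persian = "fa:\u0645\u0646\u062c\u06cc\u0644\u200c\u0622\u0628\u0627\u062f"
print(should_warn(persian, persian.replace("\u200c", "")))
```

Under this policy the Persian value no longer triggers the warning, because its "fixed" form still contains characters outside Latin-1.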
follow-up: 16 comment:13 by , 6 years ago
Hi,
In practice, ZWNJ is like a space with a width of zero: it prevents adjacent characters from joining each other. Although it is a valid character in Persian, there are cases where it appears at an invalid position in a word, so it would be nice if we could keep the warnings for those invalid cases.
As an example, think of "aa*aaa" as a Persian word containing a ZWNJ, where the asterisk stands for the ZWNJ.
The only valid case is aa*aaa
(between two adjacent letters that can join each other).
More common cases that are invalid:
- doubled ZWNJ or more (like a doubled space): aa**aaa
- at the start or end of a word: *aa*aaa or aa*aaa*
- immediately before/after a space character: aa* aaa or aa *aaa (this can happen within a word, because we normally type ZWNJ with Shift+Space)
- maybe a trickier one:
- We have seven letters (و, ژ, ز, ر, ذ, د, ا) which do not connect to a following letter, so writing a ZWNJ after them is useless and not needed. Assume b is one of them; then this is invalid: ab*aaa (this case could also occur in other languages with similar but not identical letters)
thanks
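The rules listed in this comment could be sketched roughly as follows in Python. This is only an illustration under stated assumptions: the function name, the reason strings, and the order in which the rules are checked are made up here, and the set of non-joining letters contains just the seven letters named above.

```python
from typing import List, Tuple

ZWNJ = "\u200c"
# The seven letters named above that do not connect to a following letter:
# alef, dal, thal, reh, zain, jeh, waw
NON_JOINING = set("\u0627\u062f\u0630\u0631\u0632\u0698\u0648")


def invalid_zwnj_positions(text: str) -> List[Tuple[int, str]]:
    """Return (index, reason) for every ZWNJ at a position described as invalid."""
    problems = []
    for i, ch in enumerate(text):
        if ch != ZWNJ:
            continue
        prev = text[i - 1] if i > 0 else ""
        nxt = text[i + 1] if i + 1 < len(text) else ""
        if prev == "" or nxt == "":
            problems.append((i, "at start or end of text"))
        elif prev == ZWNJ or nxt == ZWNJ:
            problems.append((i, "doubled"))
        elif prev.isspace() or nxt.isspace():
            problems.append((i, "next to a space"))
        elif prev in NON_JOINING:
            problems.append((i, "after a non-joining letter"))
    return problems


def z(pattern: str) -> str:
    """Helper for the examples: replace '*' with a real ZWNJ."""
    return pattern.replace("*", ZWNJ)


print(invalid_zwnj_positions(z("aa*aaa")))   # the valid case
print(invalid_zwnj_positions(z("aa**aaa")))  # doubled
print(invalid_zwnj_positions(z("aa* aaa")))  # before a space
```

A ZWNJ at the start or end of a word inside a longer string is caught by the "next to a space" rule, since a word boundary there is a space.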
comment:14 by , 6 years ago
The test was introduced to fix #17521, that's why I would try to limit it to those cases. I see no way to handle all the special cases of hundreds of languages.
comment:15 by , 6 years ago
No problem, maybe we should check it using some special quality assurance tools.
comment:16 by , 6 years ago
Replying to iman:
In practice, ZWNJ is like a space with a width of zero: it prevents adjacent characters from joining each other. Although it is a valid character in Persian, there are cases where it appears at an invalid position in a word, so it would be nice if we could keep the warnings for those invalid cases.
As an example, think of "aa*aaa" as a Persian word containing a ZWNJ, where the asterisk stands for the ZWNJ.
The only valid case is aa*aaa
(between two adjacent letters that can join each other).
More common cases that are invalid:
- doubled ZWNJ or more (like a doubled space): aa**aaa
- at the start or end of a word: *aa*aaa or aa*aaa*
- immediately before/after a space character: aa* aaa or aa *aaa (this can happen within a word, because we normally type ZWNJ with Shift+Space)
- maybe a trickier one:
- We have seven letters (و, ژ, ز, ر, ذ, د, ا) which do not connect to a following letter, so writing a ZWNJ after them is useless and not needed. Assume b is one of them; then this is invalid: ab*aaa (this case could also occur in other languages with similar but not identical letters)
Thank you for this explanation! I will detect and fix the simplest case (start or end of the complete string) to fix this ticket and open a new one to see if we can do better.
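The "simplest case" mentioned here, a ZWNJ at the start or end of the complete string, might look like the following in Python. This is a sketch only: the function names are hypothetical, and the actual JOSM change may detect and fix the case differently.

```python
ZWNJ = "\u200c"


def has_outer_zwnj(value: str) -> bool:
    """Detect a ZWNJ at the very start or very end of the complete string."""
    return value.startswith(ZWNJ) or value.endswith(ZWNJ)


def strip_outer_zwnj(value: str) -> str:
    """Remove leading and trailing ZWNJ characters, keeping any inside the string."""
    return value.strip(ZWNJ)


# A value with ZWNJ at both ends and one legitimate ZWNJ in the middle
value = "\u200cab\u200ccd\u200c"
if has_outer_zwnj(value):
    value = strip_outer_zwnj(value)
print(repr(value))
```

Interior ZWNJ characters, such as the one in the tag value from this ticket, are left untouched by this approach.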
comment:17 by , 6 years ago
Keywords: | unicode added |
---|---|
Milestone: | → 19.04 |
comment:19 by , 6 years ago
Component: | Core → Core validator |
---|---|
Keywords: | regression added |
This message is not about persian script characters. Please try the Fix button and check the result.