Re: [oXygen-user] Feature request: Improvement of Japanese search for WebHelp

Hi,
What would the custom Japanese user dictionary add to the build-time indexing process:
- a list of domain-specific words
This will be the most important.
- or some generic tweaks that would allow matching a client-side search
Presumably that will be the role of the client-side JavaScript, as its wordsStartsWith() function (I guess) does currently.
the generic tweaks added by a custom user dictionary would not be needed anymore. Could you give a short example please?
Suppose there is a Japanese sentence whose structure resembles "OVERPAYPALEBAYBERRY."
Kuromoji or any other analyzer may segment it as "over,paypal,ebay,berry", "overpay,pale,bayberry", "over,pay,pal,e,bay,berry" and so on, depending on its built-in dictionary and method. There is no single authoritative answer.
With one of these imperfect indexes, the user may search for 'over', 'overpay', 'paypal', 'pale', 'ale', 'ebay', 'bay', 'bayberry', 'verpaypalebaybe' or even 'e'... (The last two may look ridiculous, but such a sequence of characters can be a self-contained, atomic, well-known word in the Japanese language.)
Kuromoji is clever enough to return multiple terms such as "(paypal|pay,pal),(ebay|e,bay)", but that will never be perfect. Hence the user dictionary can play a critical role by letting the analyzer know the novel word "verpaypalebaybe" or by boosting the priority of "ale".
Thanks, T. Hatanaka
________________________________________
From: oxygen-user-bounces@oxygenxml.com <oxygen-user-bounces@oxygenxml.com> on behalf of Support Oxygen XML Editor (Sorin Ristache) <support@oxygenxml.com>
Sent: Wednesday, April 15, 2015 21:14
To: T. Hatanaka; oXygen-user@oxygenxml.com
Subject: Re: [oXygen-user] Feature request: Improvement of Japanese search for WebHelp
Hello,
Thank you for telling us. We will try to integrate the Kuromoji analyzer into the Apache Lucene system that indexes the WebHelp pages.
What would the custom Japanese user dictionary add to the build-time indexing process:
- a list of domain-specific words that are relevant for the domain of the current DITA map and that are missing in the generic dictionary that comes with the Kuromoji analyzer,
- or some generic tweaks that would allow matching a client-side search term with a partial match in the index built based on the WebHelp pages?
I thought Kuromoji was a morphological analyzer that builds the index so that a client-side search term will be matched with an indexed term picked up from a WebHelp page even though the indexed term is only a partial match, which means the generic tweaks added by a custom user dictionary would not be needed anymore. Could you give a short example please?
Thank you, Sorin
<oXygen/> XML Editor
http://www.oxygenxml.com
On 4/15/2015 2:19 PM, T. Hatanaka wrote:
Sorin,
That will be of great help! When you have a test or experimental build in the future, let me know and I'll be happy to test it, though I'm no expert.
In the meantime, I took a look at the result of the following Lucene/Kuromoji code with Japanese inputs.
    UserDictionary userDic = new UserDictionary(
        new FileReader( new File( "userdic.txt" ) ) );
    Analyzer analyzer = new JapaneseAnalyzer(
        userDic,
        JapaneseTokenizer.Mode.SEARCH,
        JapaneseAnalyzer.getDefaultStopSet(),
        JapaneseAnalyzer.getDefaultStopTags() );
These default parameters work fairly well even without UserDictionary(). However, a user dictionary at build-time would be a strong plus, considering that the current client-side JavaScript, with only its small tweak for partial matching, would miss critical keywords. So, when you design this feature, please also consider making the UserDictionary() file path configurable via a WebHelp transformation parameter.
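For readers who have not seen one, here is a rough idea of what a file such as "userdic.txt" might contain. As far as I understand the Lucene Kuromoji user dictionary format, it is a UTF-8 CSV file with one entry per line: the surface form, its segmentation (space-separated), the readings of the segments (space-separated), and a part-of-speech label. Lines starting with # are comments. The entries below are illustrative only and are not taken from any real WebHelp project.

```text
# surface form, segmentation, readings, part-of-speech
関西国際空港,関西 国際 空港,カンサイ コクサイ クウコウ,カスタム名詞
日本経済新聞,日本 経済 新聞,ニホン ケイザイ シンブン,カスタム名詞
# an entry kept as a single token, with a custom reading
朝青龍,朝青龍,アサショウリュウ,カスタム人名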
Thanks, T. Hatanaka
________________________________________
From: oxygen-user-bounces@oxygenxml.com <oxygen-user-bounces@oxygenxml.com> on behalf of Support Oxygen XML Editor (Sorin Ristache) <support@oxygenxml.com>
Sent: Wednesday, April 15, 2015 00:17
To: T. Hatanaka; oXygen-user@oxygenxml.com
Subject: Re: [oXygen-user] Feature request: Improvement of Japanese search for WebHelp
Hello,
On 4/12/2015 5:26 AM, T. Hatanaka wrote:
That being said, it would still be very beneficial to integrate sophisticated analyzers, even if only at build-time. Run-time ones are less necessary, I guess. When Japanese people search the Web, they usually perform a kind of tokenization and normalization themselves. That is, they do not usually enter "BROWNFOXJUMPS" in the search text box; in most cases we can expect them to type "BROWN FOX JUMP". In fact, "Please enter keywords separated by spaces" has long been a common instruction in Japanese search UIs, and people have got used to it. So I guess that if the index were created at build-time by a sophisticated analyzer with custom dictionaries, we could improve the search experience a lot with relatively minor tweaks to the run-time JavaScript.
Thank you for letting us know. In a future version we will integrate the Kuromoji analyzer in our Apache Lucene customization that runs on the generated WebHelp pages to build the WebHelp search index. This index will offer relevant search results in the WebHelp pages only for Japanese search terms entered in the browser that are properly separated with space characters.
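The space-separated lookup described above could be sketched roughly as follows. This is not the WebHelp search code (which is client-side JavaScript); it is a plain-Java illustration under the assumption that the build-time index is simply a set of terms and that a query term hits only when it equals an indexed term. The class and method names are hypothetical. Note that Japanese users often type the full-width ideographic space (U+3000), so the split handles it explicitly.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class SpaceSeparatedQuery {

    /** Splits a query on runs of whitespace, including the ideographic space U+3000. */
    static List<String> terms(String query) {
        List<String> out = new ArrayList<>();
        for (String t : query.split("[\\s\\u3000]+")) {
            if (!t.isEmpty()) {
                out.add(t.toLowerCase(Locale.ROOT));
            }
        }
        return out;
    }

    /** Returns the query terms that are present in the index as whole terms. */
    static List<String> hits(String query, Set<String> indexedTerms) {
        List<String> found = new ArrayList<>();
        for (String t : terms(query)) {
            if (indexedTerms.contains(t)) {
                found.add(t);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        // Hypothetical index, as if produced by a build-time analyzer.
        Set<String> index = Set.of("brown", "fox", "jump");
        System.out.println(hits("BROWN FOX JUMP", index));  // all three terms hit
        System.out.println(hits("BROWN\u3000FOX", index));  // full-width space also separates
        System.out.println(hits("BROWNFOXJUMPS", index));   // unsegmented input: no hit
    }
}
```

The last call shows why this approach depends on the user separating the keywords: an unsegmented string matches nothing unless the query is tokenized at run-time as well.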
Here's another piece of news: Kuromoji has been ported to JavaScript: https://github.com/takuyaa/kuromoji.js I haven't tried it, but I expect some difficulties. I heard it requires a 17 MB dictionary.
That is too large for a client-side JavaScript operation. The tokenization of the search string entered by the user may take forever based on a 17 MB JavaScript dictionary. The search will rely on properly separated search terms entered by the user, as you suggested above.
Thanks, T. Hatanaka
Best regards, Sorin
<oXygen/> XML Editor
http://www.oxygenxml.com
_______________________________________________
oXygen-user mailing list
oXygen-user@oxygenxml.com
http://www.oxygenxml.com/mailman/listinfo/oxygen-user

Hi,
I did some research on the user dictionaries and my understanding is that setting a user dictionary is a great enhancement for morphological Lucene analyzers like the ones for the CJK languages, for example the JapaneseAnalyzer, and also for indexing domain-specific content in any language. So we should add a new parameter to the WebHelp page generation process for setting a user dictionary which makes sense for any language of the page content. The user dictionary has a simple and direct use for a morphological analyzer like the JapaneseAnalyzer because it is just a parameter of the class constructor, but is a little more complicated to integrate into the sequence of Lucene filters that follows an initial Lucene tokenizer in the typical Lucene processing pipeline for non-CJK languages.
Thank you for your suggestions, Sorin
<oXygen/> XML Editor
http://www.oxygenxml.com
On 4/15/2015 6:13 PM, T. Hatanaka wrote:
Hi,
What would the custom Japanese user dictionary add to the build-time indexing process:
- a list of domain-specific words
This will be the most important.
- or some generic tweaks that would allow matching a client-side search
Presumably that will be the role of the client-side JavaScript, as its wordsStartsWith() function (I guess) does currently.
the generic tweaks added by a custom user dictionary would not be needed anymore. Could you give a short example please?
Suppose there is a Japanese sentence whose structure resembles "OVERPAYPALEBAYBERRY."
Kuromoji or any other analyzer may segment it as "over,paypal,ebay,berry", "overpay,pale,bayberry", "over,pay,pal,e,bay,berry" and so on, depending on its built-in dictionary and method. There is no single authoritative answer.
With one of these imperfect indexes, the user may search for 'over', 'overpay', 'paypal', 'pale', 'ale', 'ebay', 'bay', 'bayberry', 'verpaypalebaybe' or even 'e'... (The last two may look ridiculous, but such a sequence of characters can be a self-contained, atomic, well-known word in the Japanese language.)
Kuromoji is clever enough to return multiple terms such as "(paypal|pay,pal),(ebay|e,bay)", but that will never be perfect. Hence the user dictionary can play a critical role by letting the analyzer know the novel word "verpaypalebaybe" or by boosting the priority of "ale".
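To make the ambiguity in this example concrete, here is a minimal sketch in plain Java. It is not Kuromoji or Lucene code and it does no scoring; it simply enumerates every way a string can be split into words from a given dictionary. The dictionary contents, the class name and the extra entries "o" and "rry" are assumptions made up for the illustration.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class SegmentationDemo {

    /** Returns every way to split text into words that all appear in dict. */
    static List<String> segmentations(String text, Set<String> dict) {
        List<String> out = new ArrayList<>();
        split(text, 0, dict, new ArrayList<>(), out);
        return out;
    }

    // Depth-first search: try each dictionary word that starts at 'start'.
    private static void split(String text, int start, Set<String> dict,
                              List<String> current, List<String> out) {
        if (start == text.length()) {
            out.add(String.join(",", current));
            return;
        }
        for (int end = start + 1; end <= text.length(); end++) {
            String word = text.substring(start, end);
            if (dict.contains(word)) {
                current.add(word);
                split(text, end, dict, current, out);
                current.remove(current.size() - 1);
            }
        }
    }

    public static void main(String[] args) {
        String text = "overpaypalebayberry";

        // Toy stand-in for an analyzer's built-in dictionary.
        Set<String> dict = new HashSet<>(List.of(
            "over", "overpay", "pay", "pal", "paypal", "pale",
            "e", "bay", "ebay", "berry", "bayberry"));

        List<String> builtIn = segmentations(text, dict);
        builtIn.forEach(System.out::println);
        System.out.println(builtIn.size() + " candidate segmentations");  // 13 with this toy dictionary

        // Toy stand-in for user dictionary entries that introduce a novel word.
        dict.addAll(List.of("o", "verpaypalebaybe", "rry"));
        System.out.println(segmentations(text, dict).contains("o,verpaypalebaybe,rry"));  // true
    }
}
```

Even this small dictionary yields 13 valid segmentations, and a real analyzer has to pick among them. Without the added entries, no candidate contains "verpaypalebaybe" as a token, so a search for it cannot hit whichever segmentation is chosen.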
Thanks, T. Hatanaka
________________________________________
From: oxygen-user-bounces@oxygenxml.com <oxygen-user-bounces@oxygenxml.com> on behalf of Support Oxygen XML Editor (Sorin Ristache) <support@oxygenxml.com>
Sent: Wednesday, April 15, 2015 21:14
To: T. Hatanaka; oXygen-user@oxygenxml.com
Subject: Re: [oXygen-user] Feature request: Improvement of Japanese search for WebHelp
Hello,
Thank you for telling us. We will try to integrate the Kuromoji analyzer into the Apache Lucene system that indexes the WebHelp pages.
What would the custom Japanese user dictionary add to the build-time indexing process:
- a list of domain-specific words that are relevant for the domain of the current DITA map and that are missing in the generic dictionary that comes with the Kuromoji analyzer,
- or some generic tweaks that would allow matching a client-side search term with a partial match in the index built based on the WebHelp pages?
I thought Kuromoji was a morphological analyzer that builds the index so that a client-side search term will be matched with an indexed term picked up from a WebHelp page even though the indexed term is only a partial match, which means the generic tweaks added by a custom user dictionary would not be needed anymore. Could you give a short example please?
Thank you, Sorin
<oXygen/> XML Editor
participants (2)
- Support Oxygen XML Editor (Sorin Ristache)
- T. Hatanaka