Re: [oXygen-user] Feature request: Improvement of Japanese search for WebHelp

Let me revive this old thread. Could you reconsider this?

As already said, Japanese text has no explicit word breaks. It's something like "THEQUICKBROWNFOXJUMPSOVERTHELAZYDOG." The current WebHelp implementation tokenizes it into "TH" "HE" "EQ" "QU" "UI" "IC" "CK"... both at build-time and run-time. It works to some extent, but it is not quite comfortable. That's why analyzers such as Kuromoji are most welcome. And, to deliver a fully capable search experience, WebHelp requires both build-time and run-time analyzers, as you pointed out.

That being said, it would still benefit a lot to integrate sophisticated analyzers, even if it's only at build-time. Run-time ones are less required, I guess. When the Japanese people search Web, they, human beings, usually perform a kind of tokenization and normalization by themselves. i.e. They do not usually enter "BROWNFOXJUMPS" in the search text box. In most cases we can expect them to type "BROWN FOX JUMP". Actually "Please enter keywords separated by spaces" has been a common instruction found on the Japanese search UI. People have got used to it. So I guess that if the index were created by a sophisticated analyzer at build-time with custom dictionaries, we could improve search experience a lot with relatively minor tweaks in run-time JavaScript.

Here's another piece of news: Kuromoji has been ported to JavaScript: https://github.com/takuyaa/kuromoji.js I haven't tried it, but expect some difficulties. I heard it required a 17MB dictionary.

Thanks,
T. Hatanaka

-----Original Message-----
From: oxygen-user-bounces@oxygenxml.com [mailto:oxygen-user-bounces@oxygenxml.com] On Behalf Of Sorin Ristache
Sent: Tuesday, November 19, 2013 6:09 PM
To: Naoki Hirai
Cc: oXygen-user@oxygenxml.com
Subject: Re: [oXygen-user] Feature request: Improvement of Japanese search for WebHelp

Dear Naoki-san,

The Webhelp content indexer is indeed based on the Lucene engine just like the Kuromoji morphological analyzer, so delegating the task of indexing any Japanese content at build time (when the Webhelp pages are created by the Oxygen Webhelp transformation) to the Kuromoji analyzer is doable. However, the Webhelp search is performed at runtime on the client side, with JavaScript code running on the machine where the Webhelp search is executed in the browser, not on the server side, where the Webhelp pages are stored.

The difficulty in integrating an analyzer that deals with a specific language's sentence morphology, like the Kuromoji analyzer, comes from the lack of an equivalent JavaScript analyzer that is able to split the search string entered by the user into the morphological components recognized by the Lucene-based morphological analyzer that built the index database at build time. I did a Google search but I could not identify a client-side JavaScript solution for a Japanese morphological analyzer. If you can suggest such a solution we would surely consider it as a future improvement for the Webhelp search.

Kind regards,
Sorin

Naoki Hirai wrote:
Hi,
I like Oxygen WebHelp very much and recommend it to Japanese users. WebHelp is a sophisticated online manual solution, but one issue still remains for Japanese users: Japanese search. In Japanese it is difficult to extract words from sentences, because the words are not separated by spaces. Therefore, in general, a morphological analyzer is used to extract the words from the sentences. Recently, an open source Japanese morphological analyzer called "Kuromoji" has become popular. Apache Solr has introduced Kuromoji as its morphological analyzer.
So, my feature request is that the Oxygen WebHelp plug-in incorporate Kuromoji as the morphological analyzer, and add a parameter that selects a stemmer when generating WebHelp output. I can help with the development and the evaluation.
Please consider it.
Best regards,
Naoki
_______________________________________________ oXygen-user mailing list oXygen-user@oxygenxml.com http://www.oxygenxml.com/mailman/listinfo/oxygen-user
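
The bigram tokenization described in T. Hatanaka's message can be illustrated with a minimal Python sketch. This is not the actual WebHelp code: the function name `bigrams` is hypothetical, and the real indexer may treat punctuation, case, and mixed scripts differently. It only shows how overlapping two-character tokens arise when text has no word breaks.

```python
def bigrams(text: str) -> list[str]:
    """Split text into overlapping two-character tokens, skipping whitespace.

    Illustrative only; assumed behavior, not the WebHelp implementation.
    """
    chars = [c for c in text if not c.isspace()]
    return ["".join(chars[i:i + 2]) for i in range(len(chars) - 1)]


# The example from the message: no word boundaries, so every adjacent
# pair of characters becomes a token.
print(bigrams("THEQUICK"))  # ['TH', 'HE', 'EQ', 'QU', 'UI', 'IC', 'CK']

# A Japanese case: the three-character word for "Tokyo Metropolis" yields
# the bigram for "Kyoto" as a side effect, so a search for Kyoto can match it.
print(bigrams("東京都"))  # ['東京', '京都']
```

Because the tokens carry no word-level meaning, matches like the second one are the kind of result that a morphological analyzer such as Kuromoji is meant to avoid.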

Hello,

On 4/12/2015 5:26 AM, T. Hatanaka wrote:
That being said, it would still benefit a lot to integrate sophisticated analyzers, even if it's only at build-time. Run-time ones are less required, I guess. When the Japanese people search Web, they, human beings, usually perform a kind of tokenization and normalization by themselves. i.e. They do not usually enter "BROWNFOXJUMPS" in the search text box. In most cases we can expect them to type "BROWN FOX JUMP". Actually "Please enter keywords separated by spaces" has been a common instruction found on the Japanese search UI. People have got used to it. So I guess that if the index were created by a sophisticated analyzer at build-time with custom dictionaries, we could improve search experience a lot with relatively minor tweaks in run-time JavaScript.
Thank you for letting us know. In a future version we will integrate the Kuromoji analyzer into our Apache Lucene customization that runs on the generated WebHelp pages to build the WebHelp search index. This index will offer relevant search results in the WebHelp pages only for Japanese search terms entered in the browser that are properly separated with space characters.
Here's another piece of news: Kuromoji has been ported to JavaScript: https://github.com/takuyaa/kuromoji.js I haven't tried it, but expect some difficulties. I heard it required a 17MB dictionary.
That is too large for a client-side JavaScript operation. With a 17 MB JavaScript dictionary, tokenizing the search string entered by the user could take far too long. The search will rely on properly separated search terms entered by the user, as you suggested above.
Thanks, T. Hatanaka
Best regards,
Sorin

<oXygen/> XML Editor
http://www.oxygenxml.com
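
The division of labor discussed in this reply (a word-level index produced at build time, and a run-time search that only splits the query on spaces) can be sketched in Python as follows. Everything here is an assumption for illustration: the page names, the token lists, and the helper names `build_index` and `search` are hypothetical, and the token lists merely stand in for the output of a build-time analyzer such as Kuromoji. The actual run-time code would be JavaScript in the browser.

```python
from collections import defaultdict

# Hypothetical build-time analyzer output: each page already segmented
# (and normalized) into word tokens. Page names and tokens are made up.
ANALYZED_PAGES = {
    "page1.html": ["THE", "QUICK", "BROWN", "FOX"],
    "page2.html": ["THE", "LAZY", "DOG"],
    "page3.html": ["BROWN", "DOG", "JUMP"],
}


def build_index(pages: dict[str, list[str]]) -> dict[str, set[str]]:
    """Build an inverted index mapping each word token to the pages containing it."""
    index: dict[str, set[str]] = defaultdict(set)
    for page, tokens in pages.items():
        for token in tokens:
            index[token].add(page)
    return index


def search(index: dict[str, set[str]], query: str) -> set[str]:
    """Return pages containing every whitespace-separated term of the query.

    No run-time morphological analysis: the user is expected to separate
    keywords with spaces. str.split() with no argument also splits on the
    full-width space (U+3000).
    """
    terms = query.split()
    if not terms:
        return set()
    results = set(index.get(terms[0], set()))
    for term in terms[1:]:
        results &= index.get(term, set())
    return results


index = build_index(ANALYZED_PAGES)
print(search(index, "BROWN FOX"))  # {'page1.html'}
print(search(index, "BROWNFOX"))   # set()
```

The second query returns nothing because an unseparated string is not a token in the word-level index, which is why this design relies on users typing space-separated keywords.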
participants (2)
- Support Oxygen XML Editor (Sorin Ristache)
- T. Hatanaka