| 1 | = ONION |
| 2 | |
| 3 | == Description |
| 4 | ONION (ONe Instance ONly) is a tool for removing duplicate parts from large collections of texts. The tool is implemented in Python, licensed under the New BSD License, and released as open-source software (available for download, including the source code, at http://code.google.com/p/onion/). It is being successfully used for cleaning large textual corpora at the Natural Language Processing Centre at the Faculty of Informatics, Masaryk University, Brno, and by its industry partners. The research leading to this software was published in the author's Ph.D. thesis "Removing Boilerplate and Duplicate Content from Web Corpora". The deduplication algorithm is based on comparing word n-grams of the text. The author's algorithm has been shown to be more suitable for deduplicating textual corpora than competing algorithms (Broder, Charikar): in addition to detecting identical or very similar (95 %) duplicates, it can detect even partially similar duplicates (50 %) while still achieving good performance (described further in the author's Ph.D. thesis). The unique deduplication capabilities and scalability of the algorithm have been demonstrated while building corpora of American Spanish, Arabic, Czech, French, Japanese, Russian, Tajik, and six Turkic languages: several TB of text documents were deduplicated, resulting in corpora of 70 billion tokens altogether. |
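
To make the n-gram comparison described above more concrete, here is a minimal sketch in Python. It is not onion's actual implementation: the function names `word_ngrams` and `mark_duplicates` are hypothetical, tokens are split on whitespace, n-grams are stored unhashed, no smoothing is applied, and the `>=` threshold comparison is an assumption. It only illustrates the idea of flagging a paragraph once a given share of its word n-grams has already been seen.

```python
from typing import List, Set, Tuple

NGram = Tuple[str, ...]


def word_ngrams(tokens: List[str], n: int = 5) -> List[NGram]:
    """Return all contiguous word n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def mark_duplicates(paragraphs: List[str], n: int = 5,
                    threshold: float = 0.5) -> List[bool]:
    """Flag each paragraph whose share of already-seen n-grams
    reaches the threshold (illustrative assumption, not onion's code)."""
    seen: Set[NGram] = set()
    flags: List[bool] = []
    for text in paragraphs:
        grams = word_ngrams(text.split(), n)
        if grams:
            dup_ratio = sum(g in seen for g in grams) / len(grams)
        else:
            # Paragraphs shorter than n tokens have no n-grams to compare.
            dup_ratio = 0.0
        flags.append(dup_ratio >= threshold)
        # Assumption: n-grams of every paragraph are remembered,
        # whether or not the paragraph was flagged.
        seen.update(grams)
    return flags


if __name__ == "__main__":
    sample = [
        "the quick brown fox jumps over the lazy dog",
        "the quick brown fox jumps over the lazy dog",
        "a completely different sentence with no shared material here",
    ]
    print(mark_duplicates(sample))
```

The defaults `n=5` and `threshold=0.5` mirror the defaults of the `-n` and `-t` options of the tool; lowering the threshold makes the sketch flag paragraphs that overlap only partially with earlier text.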
| 5 | |
| 6 | == How to use the tool |
| 7 | {{{onion [OPTIONS] [FILE]}}} |
| 8 | |
| 9 | Mark duplicate text parts in the input vertical file. |
| 10 | {{{ |
| 11 | -f FILE  hashes of duplicate n-grams |
| 12 | -n NUM   n-gram length (default: 5) |
| 13 | -t NUM   duplicate content threshold (default: 0.5) |
| 14 | -d STR   document tag (default: doc) |
| 15 | -p STR   paragraph tag (default: p) |
| 16 | -s       strip duplicate parts (rather than mark) |
| 17 | -m       no smoothing |
| 18 | -T NUM   trim n-gram hashes to NUM bits (default: 64) |
| 19 | -l NUM   max stub length (default: 20) |
| 20 | -b NUM   buffer size, in bytes (default: 16777216) |
| 21 | -q       quiet; suppress all output except for errors |
| 22 | -V       print version information and exit |
| 23 | -h       display this help and exit |
| 24 | }}} |
| 25 | With no FILE, or when FILE is -, read standard input. Output is written to standard output. |
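
As a usage illustration, the following Python sketch builds an `onion` command line from the options listed above and runs it on text passed through standard input. The helper names `build_onion_command` and `run_onion` are hypothetical, and the sketch assumes that an `onion` executable is installed and available on `PATH`; only flags documented in the help text above (`-n`, `-t`, `-s`) are used.

```python
import shutil
import subprocess
from typing import List, Optional


def build_onion_command(input_path: Optional[str] = None,
                        ngram_length: int = 5,
                        threshold: float = 0.5,
                        strip: bool = False) -> List[str]:
    """Assemble the argument list for onion (hypothetical helper)."""
    cmd = ["onion", "-n", str(ngram_length), "-t", str(threshold)]
    if strip:
        # -s strips duplicate parts instead of only marking them.
        cmd.append("-s")
    # "-" tells onion to read the vertical file from standard input.
    cmd.append(input_path if input_path is not None else "-")
    return cmd


def run_onion(vertical_text: str, **options) -> str:
    """Pipe vertical text through onion and return its standard output."""
    if shutil.which("onion") is None:
        raise FileNotFoundError("onion executable not found on PATH")
    result = subprocess.run(
        build_onion_command(None, **options),
        input=vertical_text,
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    return result.stdout


if __name__ == "__main__":
    print(" ".join(build_onion_command("corpus.vert", strip=True)))
```

Calling `build_onion_command("corpus.vert", strip=True)` yields the equivalent of `onion -n 5 -t 0.5 -s corpus.vert`, where `corpus.vert` is a placeholder file name.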
| 26 | |
| 27 | == Source |
| 28 | [https://corpus.tools/wiki/Onion] |
| 29 | |
| 30 | == Acknowledgements |
| 31 | This software has been developed at the [http://nlp.fi.muni.cz/en/nlpc Natural Language Processing Centre] of [http://www.muni.cz/ Masaryk University in Brno] with financial support from [http://presemt.eu PRESEMT] and [http://www.sketchengine.co.uk Lexical Computing Ltd.] It also relates to Jan Pomikálek's [http://is.muni.cz/th/45523/fi_d/phdthesis.pdf PhD research]. |
| 32 | |
| 33 | If you use the system, please cite the related publication as well as the LINDAT/CLARIAH infrastructure: [http://hdl.handle.net/11858/00-097C-0000-000D-F67B-7] |
| 34 | |
| 35 | {{{ |
| 36 | @phdthesis{pomikalek2011removing, |
| 37 | title={Removing boilerplate and duplicate content from web corpora}, |
| 38 | author={Pomik{\'a}lek, Jan}, |
| 39 | school={Masaryk University, Faculty of Informatics, Brno, Czech Republic}, |
| 40 | year={2011} |
| 41 | } |
| 42 | }}} |
| 43 | |
| 44 | == License |
| 45 | Onion is licensed under the [http://opensource.org/licenses/BSD-3-Clause BSD 3-Clause License]. |