Changes between Version 1 and Version 2 of JusText


Timestamp:
Oct 26, 2024, 10:27:57 PM (8 months ago)
Author:
qstengl
Comment:

--

  • JusText

    v1 v2  
    1 = ONION
     1= JusText
    22
    33== Description
    4 ONION (ONe Instance ONly) is a tool for removing duplicate parts from large collections of texts. The tool is implemented in Python, licensed under the New BSD License, and released as open-source software (available for download, including the source code, at http://code.google.com/p/onion/). It is being successfully used for cleaning large textual corpora at the Natural Language Processing Centre at the Faculty of Informatics, Masaryk University, Brno, and by its industry partners. The research leading to this software was published in the author's Ph.D. thesis "Removing Boilerplate and Duplicate Content from Web Corpora". The deduplication algorithm is based on comparing word n-grams of the text. The author's algorithm has been shown to be more suitable for deduplicating textual corpora than competing algorithms (Broder, Charikar): in addition to detecting identical or very similar (95 %) duplicates, it can detect even partially similar duplicates (50 %) while still achieving great performance (described further in the author's Ph.D. thesis). The unique deduplication capabilities and scalability of the algorithm have been demonstrated while building corpora of American Spanish, Arabic, Czech, French, Japanese, Russian, Tajik, and six Turkic languages: several TB of text documents were deduplicated, resulting in corpora of 70 billion tokens altogether.
     4JusText is a heuristic-based boilerplate removal tool useful for cleaning documents in large textual corpora. The tool is implemented in Python, licensed under the New BSD License, and released as open-source software (available for download, including the source code, at http://code.google.com/p/justext/). It is successfully used for cleaning large textual corpora at the Natural Language Processing Centre at the Faculty of Informatics, Masaryk University, Brno, and by its industry partners. The research leading to this software was published in the author's Ph.D. thesis "Removing Boilerplate and Duplicate Content from Web Corpora". The boilerplate removal algorithm is able to remove most non-grammatical sentences from a web page, such as navigation, advertisements, tables, short notes, and so on. It has been shown to outperform, or at least keep up with, its competitors (according to a comparison with participants of the Cleaneval competition in the author's Ph.D. thesis). The precise removal of unwanted content and the scalability of the algorithm have been demonstrated while building corpora of American Spanish, Arabic, Czech, French, Japanese, Russian, Tajik, and six Turkic languages: over 20 TB of HTML pages were processed, resulting in corpora of 70 billion tokens altogether. and PRESEMT, Lexical Computing Ltd
    55
    6 == How to use the tool
    7 {{{onion [OPTIONS] [FILE]}}}
    8 
    9 Mark duplicate text parts in the input vertical file.
    10 {{{
    11  -f FILE   hashes of duplicate n-grams
    12  -n NUM    n-gram length (default: 5)
    13  -t NUM    duplicate content threshold (default: 0.5)
    14  -d STR    document tag (default: doc)
    15  -p STR    paragraph tag (default: p)
    16  -s        strip duplicate parts (rather than mark)
    17  -m        no smoothing
    18  -T NUM    trim n-gram hashes to NUM bits (default: 64)
    19  -l NUM    max stub length (default: 20)
    20  -b NUM    buffer size, in bytes (default: 16777216)
    21  -q        quiet; suppress all output except for errors
    22  -V        print version information and exit
    23  -h        display this help and exit
    24 }}}
    25 With no FILE, or when FILE is -, read standard input. Output is written to standard output.
     6== Example
     7See what is kept and what is discarded from a [attachment:https://corpus.tools/raw-attachment/wiki/Justext/nlp_jusText_fi.jpg typical web page].
    268
    279== Source
    28 [https://corpus.tools/wiki/Onion]
     10[https://corpus.tools/wiki/Justext]
    2911
    3012== Acknowledgements
    31 This software has been developed at the [http://nlp.fi.muni.cz/en/nlpc Natural Language Processing Centre] of [http://www.muni.cz/ Masaryk University in Brno] with financial support from [http://presemt.eu PRESEMT] and [http://www.sketchengine.co.uk Lexical Computing Ltd.] It also relates to Jan Pomikálek's [http://is.muni.cz/th/45523/fi_d/phdthesis.pdf PhD research].
     13This software has been developed at the [https://nlp.fi.muni.cz/en/nlpc Natural Language Processing Centre] of [https://www.muni.cz/ Masaryk University in Brno] with financial support from [https://presemt.eu PRESEMT] and [https://www.sketchengine.eu Lexical Computing Ltd.] It also relates to Jan Pomikálek's [https://is.muni.cz/th/45523/fi_d/phdthesis.pdf PhD research].
    3214
    33 If you use the system, please cite the related publication as well as the LINDAT/CLARIAH infrastructure: [http://hdl.handle.net/11858/00-097C-0000-000D-F67B-7]
    34 
     15If you use the system, please cite the related publication as well as the LINDAT/CLARIAH infrastructure: [http://hdl.handle.net/11858/00-097C-0000-000D-F696-9]
    3516{{{
    3617@phdthesis{pomikalek2011removing,
     
    3920  school={Masaryk university, Faculty of informatics, Brno, Czech Republic},
    4021  year={2011}
     22}
     23
     24@misc{Zamazal2024thesis,
     25  author = {Zamazal, Kryštof},
     26  title = {Evaluation of web page cleaning tool Justext},
     27  year = {2024},
     28  type = {Bachelor's thesis},
     29  school = {Masaryk university, Faculty of informatics, Brno, Czech Republic},
     30  supervisor = {Vít Suchomel}
     31}
     32
    4133
    4234}}}
    4335
    4436== License
    45 Onion is licensed under the [http://opensource.org/licenses/BSD-3-Clause BSD 3-Clause License] 
     37Justext is licensed under the [https://opensource.org/licenses/BSD-3-Clause BSD 3-Clause License]
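
The Description above characterises JusText as a heuristic-based tool that removes navigation, advertisements, short notes and similar non-grammatical content. As a rough illustration of what such a heuristic can look like, the sketch below classifies a text block using three signals: its length, the share of its characters inside links, and the share of stopwords. This is not the JusText implementation: the `Block` class, the `is_boilerplate` function, the tiny stopword list and all thresholds are hypothetical and chosen only for the example.

```python
from dataclasses import dataclass

# Hypothetical, deliberately tiny English stopword list (assumption for this sketch).
STOPWORDS = {
    "the", "a", "an", "and", "of", "to", "in", "is", "it",
    "that", "for", "on", "with", "as", "was",
}


@dataclass
class Block:
    """A text block extracted from a web page (hypothetical structure)."""
    text: str
    chars_in_links: int = 0  # number of characters that were inside <a> elements


def is_boilerplate(block, min_length=40, max_link_density=0.2,
                   min_stopword_density=0.3):
    """Return True if the block looks like boilerplate.

    All thresholds are illustrative assumptions, not values taken from JusText.
    """
    text = block.text.strip()
    # Very short blocks (menu items, captions, short notes) are discarded.
    if len(text) < min_length:
        return True
    # Blocks that consist mostly of link text are typically navigation.
    if block.chars_in_links / len(text) > max_link_density:
        return True
    # Grammatical text contains many function words; lists and ads do not.
    words = text.lower().split()
    stopword_count = sum(w.strip(".,;:!?") in STOPWORDS for w in words)
    return stopword_count / len(words) < min_stopword_density


if __name__ == "__main__":
    blocks = [
        Block("Home | About | Contact | Login", chars_in_links=24),
        Block("The tool is able to remove most of the navigation and the "
              "advertisements that appear on a typical web page."),
    ]
    for block in blocks:
        label = "discard" if is_boilerplate(block) else "keep"
        print(f"{label}: {block.text[:40]}")
```

A real cleaner would use per-language stopword lists and would also look at neighbouring blocks before deciding; the sketch treats every block in isolation to stay short.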