author | Lars-Dominik Braun <lars@6xq.net> | 2019-11-06 19:18:08 +0100 |
---|---|---|
committer | Lars-Dominik Braun <lars@6xq.net> | 2019-11-08 21:34:11 +0100 |
commit | e31d8731531b41a909bfe33ddc134de07f0a7bab | |
tree | 56da3225cfdd4e239c78173803412c1f9e1b5e36 /doc | |
parent | 43ad3e898a28798ac2f928041999997c24e7bf3c | |
Add United Nations Parallel Corpus v1.0
See issue #5.
Diffstat (limited to 'doc')
-rw-r--r-- | doc/index.html | 18 |
1 files changed, 11 insertions, 7 deletions
diff --git a/doc/index.html b/doc/index.html
index 6749647..19151b0 100644
--- a/doc/index.html
+++ b/doc/index.html
@@ -120,16 +120,18 @@
 The corpus used for the following analysis consists of
 </p>
 <ul>
-<li>547,110 articles from
-<a href="https://www.aljazeera.net/">aljazeera.net</a>, an
-Arabic-language news site</li>
-<li>149,901 articles from <a href="http://www.bbc.com/arabic">BBC
-Arabic</a>, another Arabic-language news site</li>
 <li><a href="https://dumps.wikimedia.org/arwiki/20190701/">a dump</a> of the
 <a href="https://ar.wikipedia.org/">Arabic Wikipedia</a> as of July 2019,
 extracted using
 <a href="https://github.com/attardi/wikiextractor/tree/3162bb6c3c9ebd2d15be507aa11d6fa818a454ac">wikiextractor</a>
 containing 857,386 articles</li>
+<li>547,110 articles from
+<a href="https://www.aljazeera.net/">aljazeera.net</a>, an
+Arabic-language news site</li>
+<li>149,901 articles from <a href="http://www.bbc.com/arabic">BBC
+Arabic</a>, another Arabic-language news site</li>
+<li>116,754 documents from the
+<a href="https://conferences.unite.un.org/UNCorpus/en/DownloadOverview">United Nations Parallel Corpus v1.0</a></li>
 <li>1,709 ebooks from <a
 href="https://www.hindawi.org/books">hindawi.org</a></li>
 <li>and a plain-text copy of the Quran from <a
@@ -137,12 +139,14 @@
 options Simple Enhanced and Text (for inclusion of diacritics)</li>
 </ul>
 <p>
-summing up to roughly two billion characters.
+summing up to roughly
+825 million words or
+5.5 billion characters.
 <!-- == combined button presses -->
 <!-- -->
 The plot below shows <bdo dir="ltr" lang="ar">ا ل ي م و ن</bdo> can be considered the most frequently used letters in the Arabic language.
 <!-- -->
-Together they account for more than 50% of all letters in the corpus.
+Together they account for more than 55% of all letters in the corpus.
 </p>
 </div>
 </div>