author Lars-Dominik Braun <lars@6xq.net> 2019-11-06 19:18:08 +0100
committer Lars-Dominik Braun <lars@6xq.net> 2019-11-08 21:34:11 +0100
commit e31d8731531b41a909bfe33ddc134de07f0a7bab
tree 56da3225cfdd4e239c78173803412c1f9e1b5e36 /doc
parent 43ad3e898a28798ac2f928041999997c24e7bf3c
Add United Nations Parallel Corpus v1.0
See issue #5.
Diffstat (limited to 'doc')
-rw-r--r-- doc/index.html 18
1 files changed, 11 insertions, 7 deletions
diff --git a/doc/index.html b/doc/index.html
index 6749647..19151b0 100644
--- a/doc/index.html
+++ b/doc/index.html
@@ -120,16 +120,18 @@
The corpus used for the following analysis consists of
</p>
<ul>
- <li>547,110 articles from
- <a href="https://www.aljazeera.net/">aljazeera.net</a>, an
- Arabic-language news site</li>
- <li>149,901 articles from <a href="http://www.bbc.com/arabic">BBC
- Arabic</a>, another Arabic-language news site</li>
<li><a href="https://dumps.wikimedia.org/arwiki/20190701/">a
dump</a> of the <a href="https://ar.wikipedia.org/">Arabic
Wikipedia</a> as of July 2019, extracted using
<a href="https://github.com/attardi/wikiextractor/tree/3162bb6c3c9ebd2d15be507aa11d6fa818a454ac">wikiextractor</a>
containing 857,386 articles</li>
+ <li>547,110 articles from
+ <a href="https://www.aljazeera.net/">aljazeera.net</a>, an
+ Arabic-language news site</li>
+ <li>149,901 articles from <a href="http://www.bbc.com/arabic">BBC
+ Arabic</a>, another Arabic-language news site</li>
+ <li>116,754 documents from the
+ <a href="https://conferences.unite.un.org/UNCorpus/en/DownloadOverview">United Nations Parallel Corpus v1.0</a></li>
<li>1,709 ebooks from <a
href="https://www.hindawi.org/books">hindawi.org</a></li>
<li>and a plain-text copy of the Quran from <a
@@ -137,12 +139,14 @@
options Simple Enhanced and Text (for inclusion of diacritics)</li>
</ul>
<p>
- summing up to roughly two billion characters.
+ summing up to roughly
+ 825 million words or
+ 5.5 billion characters. <!-- == combined button presses -->
<!-- -->
The plot below shows <bdo dir="ltr" lang="ar">ا ل ي م و ن</bdo> can be
considered the most frequently used letters in the Arabic language.
<!-- -->
- Together they account for more than 50% of all letters in the corpus.
+ Together they account for more than 55% of all letters in the corpus.
</p>
</div>
</div>
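
The updated text claims that six letters together account for more than 55% of all letters in the corpus. A figure of this kind could be checked roughly as in the sketch below. It is not the project's own analysis code: the function names `letter_shares` and `top_share` are hypothetical, and it assumes that "letter" means any character in a Unicode letter category, so combining diacritics are skipped while the tatweel (U+0640) would still be counted. The comment in the diff suggests the project counts button presses, which may give different numbers.

```python
from collections import Counter
import unicodedata


def letter_shares(text):
    """Return (letter, share) pairs, most frequent letter first.

    Assumption: a "letter" is any character whose Unicode general
    category starts with "L". Spaces, digits, punctuation and
    combining marks (category "Mn", e.g. Arabic diacritics) are ignored.
    """
    counts = Counter(
        ch for ch in text if unicodedata.category(ch).startswith("L")
    )
    total = sum(counts.values())
    if total == 0:
        return []
    return [(ch, n / total) for ch, n in counts.most_common()]


def top_share(text, k=6):
    """Combined share of the k most frequent letters in text."""
    return sum(share for _, share in letter_shares(text)[:k])
```

Calling `top_share(corpus_text)` on the concatenated corpus text would return a fraction between 0 and 1 that can be compared against the 55% stated in the document.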