From e31d8731531b41a909bfe33ddc134de07f0a7bab Mon Sep 17 00:00:00 2001
From: Lars-Dominik Braun <lars@6xq.net>
Date: Wed, 6 Nov 2019 19:18:08 +0100
Subject: Add United Nations Parallel Corpus v1.0

See issue #5.
---
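The "more than 55%" figure changed in the second hunk is a share of letter counts. Below is a minimal sketch of that arithmetic, not the project's actual analysis code. `letter_share` and `TOP_LETTERS` are hypothetical names that do not exist in this repository. The sketch assumes that "letter" means any character in a Unicode `L*` category, so diacritics, digits, whitespace and punctuation are skipped. How the real analysis normalises hamza forms or other variants is not shown in this patch and may differ.

```python
from collections import Counter
import unicodedata

# The six letters named in doc/index.html (alif, lam, ya, mim, waw, nun).
# Hypothetical constant, introduced only for this illustration.
TOP_LETTERS = "اليمون"


def letter_share(text, letters=TOP_LETTERS):
    """Return the fraction of all letters in `text` that belong to `letters`.

    Assumption: a "letter" is any character whose Unicode general category
    starts with "L". Combining marks such as Arabic diacritics (category Mn)
    are therefore not counted, and variant code points (e.g. alif with
    hamza) are treated as distinct letters.
    """
    counts = Counter(
        ch for ch in text if unicodedata.category(ch).startswith("L")
    )
    total = sum(counts.values())
    if total == 0:
        return 0.0  # avoid division by zero on empty or letter-free input
    return sum(counts[ch] for ch in letters) / total
```

Calling `letter_share(corpus_text)` on the concatenated corpus text would give the fraction that the page reports as a percentage.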
 doc/index.html | 18 +++++++++++-------
 1 file changed, 11 insertions(+), 7 deletions(-)

diff --git a/doc/index.html b/doc/index.html
index 6749647..19151b0 100644
--- a/doc/index.html
+++ b/doc/index.html
@@ -120,16 +120,18 @@
 		The corpus used for the following analysis consists of
 		</p>
 		<ul>
-			<li>547,110 articles from
-			<a href="https://www.aljazeera.net/">aljazeera.net</a>, an
-			Arabic-language news site</li>
-			<li>149,901 articles from <a href="http://www.bbc.com/arabic">BBC
-			Arabic</a>, another Arabic-language news site</li>
 			<li><a href="https://dumps.wikimedia.org/arwiki/20190701/">a
 			dump</a> of the <a href="https://ar.wikipedia.org/">Arabic
 			Wikipedia</a> as of July 2019, extracted using
 			<a href="https://github.com/attardi/wikiextractor/tree/3162bb6c3c9ebd2d15be507aa11d6fa818a454ac">wikiextractor</a>
 			containing 857,386 articles</li>
+			<li>547,110 articles from
+			<a href="https://www.aljazeera.net/">aljazeera.net</a>, an
+			Arabic-language news site</li>
+			<li>149,901 articles from <a href="http://www.bbc.com/arabic">BBC
+			Arabic</a>, another Arabic-language news site</li>
+			<li>116,754 documents from the
+			<a href="https://conferences.unite.un.org/UNCorpus/en/DownloadOverview">United Nations Parallel Corpus v1.0</a></li>
 			<li>1,709 ebooks from <a
 			href="https://www.hindawi.org/books">hindawi.org</a></li>
 			<li>and a plain-text copy of the Quran from <a
@@ -137,12 +139,14 @@
 			options Simple Enhanced and Text (for inclusion of diacritics)</li>
 		</ul>
 		<p>
-		summing up to roughly two billion characters.
+		summing up to roughly
+		825 million words or
+		5.5 billion characters. <!-- == combined button presses -->
 		<!-- -->
 		The plot below shows <bdo dir="ltr" lang="ar">ا ل ي م و ن</bdo> can be
 		considered the most frequently used letters in the Arabic language.
 		<!-- -->
-		Together they account for more than 50% of all letters in the corpus.
+		Together they account for more than 55% of all letters in the corpus.
 		</p>
 	</div>
 	</div>
-- 
cgit v1.2.3