From e31d8731531b41a909bfe33ddc134de07f0a7bab Mon Sep 17 00:00:00 2001
From: Lars-Dominik Braun
Date: Wed, 6 Nov 2019 19:18:08 +0100
Subject: Add United Nations Parallel Corpus v1.0
See issue #5.
---
doc/index.html | 18 +++++++++++-------
1 file changed, 11 insertions(+), 7 deletions(-)
diff --git a/doc/index.html b/doc/index.html
index 6749647..19151b0 100644
--- a/doc/index.html
+++ b/doc/index.html
@@ -120,16 +120,18 @@
The corpus used for the following analysis consists of
- - 547,110 articles from
- aljazeera.net, an
- Arabic-language news site
- - 149,901 articles from BBC
- Arabic, another Arabic-language news site
- a
dump of the Arabic
Wikipedia as of July 2019, extracted using
wikiextractor
containing 857,386 articles
+ - 547,110 articles from
+ aljazeera.net, an
+ Arabic-language news site
+ - 149,901 articles from BBC
+ Arabic, another Arabic-language news site
+ - 116,754 documents from the
+ United Nations Parallel Corpus v1.0
- 1,709 ebooks from hindawi.org
- and a plain-text copy of the Quran from
- summing up to roughly two billion characters.
+ summing up to roughly
+ 825 million words or
+ 5.5 billion characters.
The plot below shows ا ل ي م و ن can be
considered the most frequently used letters in the Arabic language.
- Together they account for more than 50% of all letters in the corpus.
+ Together they account for more than 55% of all letters in the corpus.
--
cgit v1.2.3
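
Note on the changed figure: the patch raises the share of the six most frequent letters from "more than 50%" to "more than 55%" after the corpus is extended. The sketch below shows one way such a share could be recomputed from plain text. It is not the project's own analysis code: the function names are hypothetical, and the definition of a letter (code points U+0621 to U+064A, excluding the tatweel U+0640) is an assumption that may differ from what the project counts.

```python
from collections import Counter

# Assumption: a "letter" is any code point in the basic Arabic letter
# range U+0621..U+064A, excluding the tatweel (U+0640), which is only
# an elongation mark. The project may use a different definition.
TATWEEL = 0x0640


def is_arabic_letter(ch):
    """Return True if ch falls into the assumed Arabic letter range."""
    cp = ord(ch)
    return 0x0621 <= cp <= 0x064A and cp != TATWEEL


def letter_counts(text):
    """Count Arabic letters in text, ignoring everything else."""
    return Counter(ch for ch in text if is_arabic_letter(ch))


def top_share(counts, n=6):
    """Return the n most frequent letters and their combined share."""
    total = sum(counts.values())
    if total == 0:
        return [], 0.0
    top = counts.most_common(n)
    letters = [ch for ch, _ in top]
    share = sum(count for _, count in top) / total
    return letters, share
```

For a corpus of several billion characters the text would be read in chunks and the per-chunk counters added together with `Counter.update`; the share is then computed once at the end with `top_share(counts, 6)`.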