From e31d8731531b41a909bfe33ddc134de07f0a7bab Mon Sep 17 00:00:00 2001
From: Lars-Dominik Braun
Date: Wed, 6 Nov 2019 19:18:08 +0100
Subject: Add United Nations Parallel Corpus v1.0

See issue #5.
---
 doc/index.html | 18 +++++++++++-------
 1 file changed, 11 insertions(+), 7 deletions(-)

diff --git a/doc/index.html b/doc/index.html
index 6749647..19151b0 100644
--- a/doc/index.html
+++ b/doc/index.html
@@ -120,16 +120,18 @@ The corpus used for the following analysis consists of
- summing up to roughly two billion characters.
+ summing up to roughly
+ 825 million words or
+ 5.5 billion characters.
 The plot below shows ا ل ي م و ن can be considered the most frequently used letters in the Arabic language.
- Together they account for more than 50% of all letters in the corpus.
+ Together they account for more than 55% of all letters in the corpus.

-- cgit v1.2.3