From 2d45ef655f8791037373ab83174fc6c3596227b0 Mon Sep 17 00:00:00 2001
From: Lars-Dominik Braun
Date: Thu, 3 Oct 2019 17:23:53 +0200
Subject: text: Add epub reader and hindawi corpus

See issue #5.
---
 doc/index.html | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

(limited to 'doc/index.html')

diff --git a/doc/index.html b/doc/index.html
index f9daf88..6749647 100644
--- a/doc/index.html
+++ b/doc/index.html
@@ -129,13 +129,15 @@
  dump of the Arabic Wikipedia as of July 2019, extracted using wikiextractor
- containing 857386 articles
+ containing 857,386 articles
+
  • 1,709 ebooks from hindawi.org
  • and a plain-text copy of the Quran from tanzil.net using the options Simple Enhanced and Text (for inclusion of diacritics)
- summing up to roughly 1.5 billion characters.
+ summing up to roughly two billion characters.
  The plot below shows ا ل ي م و ن can be considered the most frequently used letters in the Arabic language.
--
cgit v1.2.3