author: Lars-Dominik Braun <lars@6xq.net> 2019-10-03 17:23:53 +0200
committer: Lars-Dominik Braun <lars@6xq.net> 2019-10-03 17:23:53 +0200
commit: 2d45ef655f8791037373ab83174fc6c3596227b0
tree: a05d506928fcc16f8dfdddb860c6ce4c5193bfc4 /doc/index.html
parent: 8048f6351fb4611134c2f6e2d9129ec025376914
text: Add epub reader and hindawi corpus
See issue #5.
Diffstat (limited to 'doc/index.html')
-rw-r--r-- doc/index.html | 6
1 files changed, 4 insertions, 2 deletions
diff --git a/doc/index.html b/doc/index.html
index f9daf88..6749647 100644
--- a/doc/index.html
+++ b/doc/index.html
@@ -129,13 +129,15 @@
dump</a> of the <a href="https://ar.wikipedia.org/">Arabic
Wikipedia</a> as of July 2019, extracted using
<a href="https://github.com/attardi/wikiextractor/tree/3162bb6c3c9ebd2d15be507aa11d6fa818a454ac">wikiextractor</a>
- containing 857386 articles</li>
+ containing 857,386 articles</li>
+ <li>1,709 ebooks from <a
+ href="https://www.hindawi.org/books">hindawi.org</a></li>
<li>and a plain-text copy of the Quran from <a
href="http://tanzil.net/docs/download">tanzil.net</a> using the
options Simple Enhanced and Text (for inclusion of diacritics)</li>
</ul>
<p>
- summing up to roughly 1.5 billion characters.
+ summing up to roughly two billion characters.
<!-- -->
The plot below shows <bdo dir="ltr" lang="ar">ا ل ي م و ن</bdo> can be
considered the most frequently used letters in the Arabic language.