Diffstat (limited to 'doc')
-rw-r--r-- doc/index.html 6
1 files changed, 4 insertions, 2 deletions
diff --git a/doc/index.html b/doc/index.html
index 19151b0..a390ddf 100644
--- a/doc/index.html
+++ b/doc/index.html
@@ -132,6 +132,8 @@
Arabic</a>, another Arabic-language news site</li>
<li>116,754 documents from the
<a href="https://conferences.unite.un.org/UNCorpus/en/DownloadOverview">United Nations Parallel Corpus v1.0</a></li>
+ <li>subtitles from 94,093 movies based on a
+ <a href="http://opus.nlpl.eu/OpenSubtitles-v2018.php">2018 OpenSubtitles dump</a></li>
<li>1,709 ebooks from <a
href="https://www.hindawi.org/books">hindawi.org</a></li>
<li>and a plain-text copy of the Quran from <a
@@ -140,8 +142,8 @@
</ul>
<p>
summing up to roughly
- 825 million words or
- 5.5 billion characters. <!-- == combined button presses -->
+ 1.2 billion words or
+ 7.6 billion characters. <!-- == combined button presses -->
<!-- -->
The plot below shows <bdo dir="ltr" lang="ar">ا ل ي م و ن</bdo> can be
considered the most frequently used letters in the Arabic language.