author Lars-Dominik Braun <lars@6xq.net> 2019-11-08 16:06:37 +0100
committer Lars-Dominik Braun <lars@6xq.net> 2019-11-08 21:34:15 +0100
commit 38c9ed5b042ae488ee12287bf8c19457189889aa
tree d4f49039eec711aa7c9ee21c691f46bc89316e48 /doc/index.html
parent e31d8731531b41a909bfe33ddc134de07f0a7bab
Add OpenSubtitles corpus
See issue #5.
Diffstat (limited to 'doc/index.html')
-rw-r--r-- doc/index.html 6
1 files changed, 4 insertions, 2 deletions
diff --git a/doc/index.html b/doc/index.html
index 19151b0..a390ddf 100644
--- a/doc/index.html
+++ b/doc/index.html
@@ -132,6 +132,8 @@
Arabic</a>, another Arabic-language news site</li>
<li>116,754 documents from the
<a href="https://conferences.unite.un.org/UNCorpus/en/DownloadOverview">United Nations Parallel Corpus v1.0</a></li>
+ <li>subtitles from 94,093 movies based on a
+ <a href="http://opus.nlpl.eu/OpenSubtitles-v2018.php">2018 OpenSubtitles dump</a></li>
<li>1,709 ebooks from <a
href="https://www.hindawi.org/books">hindawi.org</a></li>
<li>and a plain-text copy of the Quran from <a
@@ -140,8 +142,8 @@
</ul>
<p>
summing up to roughly
- 825 million words or
- 5.5 billion characters. <!-- == combined button presses -->
+ 1.2 billion words or
+ 7.6 billion characters. <!-- == combined button presses -->
<!-- -->
The plot below shows <bdo dir="ltr" lang="ar">ا ل ي م و ن</bdo> can be
considered the most frequently used letters in the Arabic language.