author Lars-Dominik Braun <lars@6xq.net> 2019-11-10 09:44:35 +0100
committer Lars-Dominik Braun <lars@6xq.net> 2019-11-16 10:17:26 +0100
commit 14daa5644598836fd6321038c6b0a496c7874374
tree 2309443ffec0acea7b8eda095dba17436c0deb0a /doc/index.html
parent 38c9ed5b042ae488ee12287bf8c19457189889aa
doc: Auto-generate corpus table
Diffstat (limited to 'doc/index.html')
-rw-r--r-- doc/index.html 33
1 files changed, 5 insertions, 28 deletions
diff --git a/doc/index.html b/doc/index.html
index a390ddf..e930892 100644
--- a/doc/index.html
+++ b/doc/index.html
@@ -7,9 +7,8 @@
<meta name="viewport" content="width=device-width, initial-scale=1">
<link href="https://fonts.googleapis.com/css?family=IBM+Plex+Mono|IBM+Plex+Sans:100,400&display=swap" rel="stylesheet">
- <link rel="stylesheet" href="https://unpkg.com/purecss@1.0.1/build/base-min.css" crossorigin="anonymous">
- <link rel="stylesheet" href="https://unpkg.com/purecss@1.0.1/build/grids-min.css" crossorigin="anonymous">
- <link rel="stylesheet" href="https://unpkg.com/purecss@1.0.1/build/grids-responsive-min.css" crossorigin="anonymous">
+ <link rel="stylesheet" href="https://unpkg.com/purecss@1.0.1/build/pure-min.css" integrity="sha384-oAOxQR6DkCoMliIh8yFnu25d7Eq/PHS21PClpwjOTeU2jRSq11vu66rf90/cZr47" crossorigin="anonymous">
+ <link rel="stylesheet" href="https://unpkg.com/purecss@1.0.1/build/grids-responsive-min.css">
<script src="https://cdn.pydata.org/bokeh/release/bokeh-1.3.4.min.js"></script>
<link rel="stylesheet" href="style.css">
</head>
@@ -119,32 +118,10 @@
<!-- -->
The corpus used for the following analysis consists of
</p>
- <ul>
- <li><a href="https://dumps.wikimedia.org/arwiki/20190701/">a
- dump</a> of the <a href="https://ar.wikipedia.org/">Arabic
- Wikipedia</a> as of July 2019, extracted using
- <a href="https://github.com/attardi/wikiextractor/tree/3162bb6c3c9ebd2d15be507aa11d6fa818a454ac">wikiextractor</a>
- containing 857,386 articles</li>
- <li>547,110 articles from
- <a href="https://www.aljazeera.net/">aljazeera.net</a>, an
- Arabic-language news site</li>
- <li>149,901 articles from <a href="http://www.bbc.com/arabic">BBC
- Arabic</a>, another Arabic-language news site</li>
- <li>116,754 documents from the
- <a href="https://conferences.unite.un.org/UNCorpus/en/DownloadOverview">United Nations Parallel Corpus v1.0</a></li>
- <li>subtitles from 94,093 movies based on a
- <a href="http://opus.nlpl.eu/OpenSubtitles-v2018.php">2018 OpenSubtitles dump</a></li>
- <li>1,709 ebooks from <a
- href="https://www.hindawi.org/books">hindawi.org</a></li>
- <li>and a plain-text copy of the Quran from <a
- href="http://tanzil.net/docs/download">tanzil.net</a> using the
- options Simple Enhanced and Text (for inclusion of diacritics)</li>
- </ul>
+
+ #include "corpus.html"
+
<p>
- summing up to roughly
- 1.2 billion words or
- 7.6 billion characters. <!-- == combined button presses -->
- <!-- -->
The plot below shows <bdo dir="ltr" lang="ar">ا ل ي م و ن</bdo> can be
considered the most frequently used letters in the Arabic language.
<!-- -->