Merge in from ETS Updates. (#129)
Add in new ETS Updates.
DrLynch authored Oct 25, 2024
2 parents 6f12321 + e42b451 commit 1b6de21
Showing 140 changed files with 4,297 additions and 1,545 deletions.
6 changes: 3 additions & 3 deletions .readthedocs.yaml
Original file line number Diff line number Diff line change
@@ -7,9 +7,9 @@ version: 2

# Set the version of Python and other tools you might need
build:
os: ubuntu-22.04
os: ubuntu-24.04
tools:
python: "3.10"
python: "3.11"
# You can also specify other tool versions:
# nodejs: "19"
# rust: "1.64"
@@ -26,4 +26,4 @@ sphinx:
# Optionally declare the Python requirements required to build your docs
python:
install:
- requirements: requirements.txt
- requirements: autodocs/requirements.txt
1 change: 1 addition & 0 deletions CONTRIBUTORS.TXT
@@ -1,3 +1,4 @@
Piotr Mitros
Oren Livne
Paul Deane
Bradley Erickson
14 changes: 12 additions & 2 deletions Makefile
@@ -3,7 +3,7 @@ PACKAGES ?= wo,awe
run:
# If you haven't done so yet, run: make install
# we need to make sure we are on the virtual env when we do this
cd learning_observer && python learning_observer --watchdog=restart
cd learning_observer && python learning_observer

venv:
# This is unnecessary since LO installs requirements on install.
@@ -34,6 +34,7 @@ install-packages: venv
pip install -e learning_observer/[${PACKAGES}]

# Just a little bit of dependency hell...

# The AWE Components are built using a specific version of
# `spacy`. This requires an out-of-date `typing-extensions`
# package. There are a few other dependencies that require a
@@ -42,7 +43,16 @@ install-packages: venv
# components.
# TODO remove this extra step after AWE Component's `spacy`
# is no longer version locked.
pip install -U typing-extensions
# This is no longer an issue, but we will leave this until all
# dependencies can be resolved in the appropriate locations.
# pip install -U typing-extensions

# On Python3.11 with tensorflow, we get some odd errors
# regarding compatibility with `protobuf`. Some installation
# files are missing from the protobuf binary on pip.
# Using the `--no-binary` option includes all files.
pip uninstall -y protobuf
pip install --no-binary=protobuf protobuf==4.25

# testing commands
test:
3 changes: 2 additions & 1 deletion autodocs/.gitignore
@@ -1,2 +1,3 @@
_build/
generated/
generated/
apidocs/
9 changes: 2 additions & 7 deletions autodocs/api.rst
@@ -1,11 +1,6 @@
API
===

.. autosummary::
:recursive:
:toctree: generated/

learning_observer
writing_observer
lo_dash_react_components
.. toctree::

apidocs/index
9 changes: 6 additions & 3 deletions autodocs/conf.py
@@ -17,12 +17,15 @@
sys.path.insert(0, os.path.abspath('../'))

extensions = [
'sphinx.ext.autodoc',
'sphinx.ext.autosummary',
'sphinx.ext.viewcode',
'autodoc2',
'myst_parser',
]

autodoc2_packages = [
'../learning_observer/learning_observer',
'../modules/writing_observer/writing_observer'
]

source_suffix = {
'.rst': 'restructuredtext',
'.md': 'markdown',
3 changes: 3 additions & 0 deletions autodocs/requirements.txt
@@ -0,0 +1,3 @@
myst_parser
sphinx
sphinx-autodoc2
3 changes: 3 additions & 0 deletions awe_requirements.txt
@@ -1,3 +1,6 @@
spacy==3.4.4
pydantic==1.10
spacytextblob==3.0.1
AWE_SpellCorrect @ git+https://github.com/ETS-Next-Gen/AWE_SpellCorrect.git
AWE_Components @ git+https://github.com/ETS-Next-Gen/AWE_Components.git
AWE_Lexica @ git+https://github.com/ETS-Next-Gen/AWE_Lexica.git
3 changes: 2 additions & 1 deletion devops/tasks/config/postuploads
@@ -1,7 +1,8 @@
sudo hostnamectl set-hostname {hostname}
sudo rm -f /etc/nginx/sites-available/default
sudo rm -f /etc/nginx/sites-enabled/default
sudo ln -f /etc/nginx/sites-available/{hostname} /etc/nginx/sites-enabled/{hostname}
if [ -f /etc/nginx/sites-available/{hostname} ]; then sudo ln -f /etc/nginx/sites-available/{hostname} /etc/nginx/sites-enabled/{hostname}; else echo "WARNING: Failed to make symlink in /etc/nginx/sites-available (config/postupload)"; fi

sudo chown -R ubuntu:ubuntu /home/ubuntu/writing_observer
sudo systemctl daemon-reload
sudo service learning_observer stop
116 changes: 116 additions & 0 deletions docs/scaling.md
@@ -0,0 +1,116 @@
# Scaling Architecture

The goal is for the Learning Observer to be:

* Fully horizontally scalable in large-scale settings
* Simple to run in small-scale settings

It is worth noting that some uses of Learning Observer require
long-running processes (e.g. NLP), but the vast majority are small,
simple reducers of the type which would work fine on an 80386
(e.g. event count, time-on-task, or logging scores / submission).

## Basic use case

In the basic use case, there is a single Learning Observer process
running. It is either using redis or, if unavailable, disk/memory as a
storage back-end.
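As a sketch of that fallback (the class and function names here are illustrative, not the real KVS layer), the small-scale path can be as simple as an in-memory dict behind the same async interface a redis back-end would present:

```python
import asyncio


class InMemoryKVS:
    """Minimal in-memory stand-in for the redis-backed KVS.

    Suitable for a single Learning Observer process in small-scale
    settings, where no cross-process sharing is needed.
    """
    def __init__(self):
        self._data = {}

    async def get(self, key):
        return self._data.get(key)

    async def set(self, key, value):
        self._data[key] = value


def choose_backend(redis_url=None):
    """Pick a storage back-end: redis when configured, memory otherwise.

    The redis branch is omitted; this only illustrates the fallback.
    """
    if redis_url is None:
        return InMemoryKVS()
    raise NotImplementedError('redis-backed KVS not sketched here')
```

Because both back-ends share the same async `get`/`set` surface, reducers do not need to know which one is in use.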

## Horizontally-scalable use-case

LO needs to handle a high volume of incoming data. Fortunately,
reducers are sharded on a key. In the present system, the key is
always a student. However, in the future, we may have per-resource,
per-class, etc. reducers.

A network round trip is typically around 30 ms, which we would like to
avoid. We would therefore like reducers to keep their state in memory,
simply writing it out to our KVS either with each event or
periodically (e.g. every second). This requires a fixed process per
key, so that reducers can run without reads.
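A minimal sketch of such a reducer (illustrative names, assuming a mapping-like KVS; the real reducers are defined per module): state lives in a per-key dict and is written out on an interval, never read back on the event path:

```python
import time


class KeyedReducer:
    """Holds per-key reducer state in memory and periodically flushes
    it to a KVS (here, any mapping), so event handling needs no reads."""
    def __init__(self, kvs, flush_interval=1.0):
        self.kvs = kvs
        self.flush_interval = flush_interval  # seconds between write-outs
        self.state = {}        # key (e.g. student id) -> reducer state
        self._last_flush = {}  # key -> monotonic time of last write-out

    def on_event(self, key, event):
        s = self.state.setdefault(key, {'event_count': 0})
        s['event_count'] += 1  # a trivial reducer: count events
        now = time.monotonic()
        if now - self._last_flush.get(key, 0.0) >= self.flush_interval:
            self.kvs[key] = dict(s)  # periodic write-out to the KVS
            self._last_flush[key] = now
        return s
```

Setting `flush_interval=0` recovers the write-on-every-event behavior; a larger interval trades durability for fewer KVS writes.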

Our eventual architecture here is:

```
incoming event --> load balancer routing based on key --> process pool
```

Events for the same key (typically, the same student) should always
land on the same process.

Eventually, we will likely want a custom load balancer / router, but
this can likely be accomplished off-the-shelf, for example by
including the key in an HTTP header or in the URL.
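For instance, a stable hash of the key is enough to pin every event for a given student to one process (a sketch of the routing idea, not the production router):

```python
import hashlib


def route(key, n_processes):
    """Map a routing key (e.g. a student id) to a process index.

    The hash is stable across runs and machines, so events for the
    same key always land on the same process."""
    digest = hashlib.sha256(key.encode('utf-8')).hexdigest()
    return int(digest, 16) % n_processes
```

A load balancer applying the same hash to a header or URL component achieves the equivalent routing without custom code.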

**HACK**: At present, if several web sockets hit the same server, even
within a common process, they may not share the same in-memory
storage. We should fix this.

## Task-scalable use-case

A second issue is that we would like to be able to split work by
reducer, module, or similar (e.g. incoming data versus dashboards).

Our eventual architecture here is:

```
incoming event --> load balancer routing based on module / reducer --> process pool
```

The key reason for this is robustness. We expect to have many modules
at different levels of performance and maturity. If one module is
unstable, uses excessive resources, etc., we would like it not to be
able to take down the rest of the system.

This is also true for different views. For example, we might want to
have servers dedicated to:

* Archiving events into the Merkle tree (must be 100% reliable)
* Other reducers
* Dashboards

## Rerouting

In the future, we expect modules to be able to send messages to each
other.

## Implementation path

At some point, we will likely need to implement our own router. For
now, however, we hope to use the sticky routing and content-based
routing in existing load balancers. This may involve communication
protocol changes, such as:

- Moving auth information from the websocket stream to the header
- Moving information into the URL (e.g. `http://server/in#uid=1234`)

Note that these are short-term solutions, as in the long-term, only
the server will know which modules handle a particular event. Once we
route on modules, an event might need to go to several servers. At
that point, we will likely need our own custom router / load balancer.
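A sketch of how a server-side router might extract the key under those protocol changes (the header and parameter names are illustrative, not the real protocol; the sketch reads a query parameter rather than a `#fragment`, since browsers do not transmit fragments to servers):

```python
from urllib.parse import parse_qs, urlparse


def routing_key(headers, url):
    """Prefer an explicit routing header; fall back to a `uid`
    query parameter in the URL; return None if neither is present."""
    if 'X-LO-Route-Key' in headers:
        return headers['X-LO-Route-Key']
    params = parse_qs(urlparse(url).query)
    if 'uid' in params:
        return params['uid'][0]
    return None
```

Either location is visible to off-the-shelf load balancers, which is what makes content-based routing work before a custom router exists.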

In the short-term:

* [Amazon](https://aws.amazon.com/elasticloadbalancing/application-load-balancer/?nc=sn&loc=2&dn=2)
supports sticky sessions and content-based routing. This can work on data in the headers.
* nginx can be configured to route to different servers based on headers and URLs. This is slightly manual, but would work as well.

## Homogeneous servers

Our goal is to continue to maintain homogeneous servers as much as
possible. The same process can handle incoming sockets of data, render
dashboards, etc. The division is handled in devops and in the load
balancer, e.g. by:

- Installing LO modules only on specific servers
- Routing events to specific servers

The goal is to continue to support the single server use-case.

## To do

We need to further think through:

- Long-running processes (e.g. NLP)
- Batched tasks (e.g. nightly processes)
1 change: 0 additions & 1 deletion docs/workshop.md
@@ -59,7 +59,6 @@ git clone [email protected]:ETS-Next-Gen/writing_observer.git lo_workshop

```bash
cd lo_workshop/
git checkout berickson/workshop # This is a branch we set up with some extra things for this workshop!
```

NOTE: All future commands should be run starting from the repository's root directory. The command will specify if changing directories is needed.
10 changes: 5 additions & 5 deletions extension/writing-process/src/background.js
@@ -9,7 +9,7 @@ var RAW_DEBUG = false;
/* This variable must be manually updated to specify the server that
* the data will be sent to.
*/
var WEBSOCKET_SERVER_URL = "wss://learning-observer.org/wsapi/in/"
var WEBSOCKET_SERVER_URL = "wss://learning-observer.org/wsapi/in/";

import { googledocs_id_from_url } from './writing_common';

@@ -35,7 +35,7 @@ const loggers = [

loEvent.init('org.mitros.writing_analytics', '0.01', loggers, loEventDebug.LEVEL.SIMPLE);
loEvent.setFieldSet([loEventUtils.getBrowserInfo(), loEventUtils.fetchDebuggingIdentifier()]);
loEvent.go()
loEvent.go();

// Function to serve as replacement for
// chrome.extension.getBackgroundPage().console.log(event); because it is not allowed in V3
@@ -157,7 +157,7 @@ chrome.webRequest.onBeforeRequest.addListener(
'bundles': JSON.parse(formdata.bundles),
'rev': formdata.rev,
'timestamp': parseInt(request.timeStamp, 10)
}
};
logFromServiceWorker(event);
loEvent.logEvent('google_docs_save', event);
} catch(err) {
@@ -170,7 +170,7 @@
'formdata': formdata,
'rev': formdata.rev,
'timestamp': parseInt(request.timeStamp, 10)
}
};
loEvent.logEvent('google_docs_save_extra', event);
}
} else if(this_a_google_docs_bind(request)) {
@@ -181,7 +181,7 @@
},
{ urls: ["*://docs.google.com/*"] },
['requestBody']
)
);

// re-injected scripts when chrome extension is reloaded, upgraded or re-installed
// https://stackoverflow.com/questions/10994324/chrome-extension-content-script-re-injection-after-upgrade-or-install
14 changes: 7 additions & 7 deletions extension/writing-process/src/writing.js
@@ -192,7 +192,7 @@ function google_docs_version_history(token) {
}
*/

const metainfo_url = "https://docs.google.com/document/d/"+doc_id()+"/revisions/tiles?id="+doc_id()+"&start=1&showDetailedRevisions=false&filterNamed=false&token="+token+"&includes_info_params=true"
const metainfo_url = "https://docs.google.com/document/d/"+doc_id()+"/revisions/tiles?id="+doc_id()+"&start=1&showDetailedRevisions=false&filterNamed=false&token="+token+"&includes_info_params=true";

fetch(metainfo_url).then(function(response) {
response.text().then(function(text) {
@@ -354,7 +354,7 @@ function generic_eventlistener(event_type, frameindex) {
if (event_type=='attention') {
refresh_stream_view_listeners();
}
}
};
}

function refresh_stream_view_listeners() {
@@ -393,13 +393,13 @@ var editor = document.querySelector('.kix-appview-editor');
var frames = Array.from(document.getElementsByTagName("iframe"));

// TODO: We should really make a list of documents instead of a fake iframe....
frames.push({'contentDocument': document})
frames.push({'contentDocument': document});

// Add an event listener to each iframe in the iframes under frames.
for(var event_type in EVENT_LIST) {
for(var event_idx in EVENT_LIST[event_type]['events']) {
const js_event = EVENT_LIST[event_type]['events'][event_idx];
const target = EVENT_LIST[event_type]['target']
const target = EVENT_LIST[event_type]['target'];
if(target === 'document') {
for(var iframe in frames) {
if(frames[iframe].contentDocument) {
@@ -608,7 +608,7 @@ function prepare_mutation_observer() {
*/
var observer = new MutationObserver(function (mutations) {
mutations.forEach(function (mutation) {
const event = {}
const event = {};

// This list guarantees that we'll have the information we need
// to understand what happened in a change event.
@@ -718,8 +718,8 @@ function writing_onload() {
if(this_is_a_google_doc()) {
log_event("document_loaded", {
"partial_text": google_docs_partial_text()
})
execute_on_page_space("_docs_flag_initialData.info_params.token")
});
execute_on_page_space("_docs_flag_initialData.info_params.token");
const handleFromWeb = async (event) => {
if (event.data.from && event.data.from === "inject.js") {
const data = event.data.data;
11 changes: 10 additions & 1 deletion learning_observer/learning_observer/adapters/adapter.py
@@ -51,11 +51,20 @@ def dash_to_underscore(event):

return event


common_transformers = [
dash_to_underscore
]

def add_common_migrator(migrator, file):
'''Add a migrator to the common transformers list.
TODO
We ought to check each module on startup for migrators
and import them instead of using this function to
add them to the transformations.
'''
print('Adding migrator', migrator, 'from', file)
common_transformers.append(migrator)


class EventAdapter:
def __init__(self, metadata=None):
18 changes: 5 additions & 13 deletions learning_observer/learning_observer/auth/events.py
@@ -289,14 +289,6 @@ async def test_case_identify(request, headers, first_event, source):
}


@register_event_auth("http_auth")
async def http_auth_identify(request, headers, first_event, source):
'''
TODO: Allow events to be authorized by HTTP basic authentication
'''
raise NotImplementedError("Not yet built; sorry")


async def authenticate(request, headers, first_event, source):
'''
Authenticate an event stream.
@@ -311,12 +303,12 @@ async def authenticate(request, headers, first_event, source):
type (e.g. require auth for writing, but not for dynamic assessment)
Our thoughts are that the auth metadata ought to contain:
1. Whether the user was authenticated (`sec` field):
* `authenticated` -- we trust who they are
* `unauthenticated` -- we think we know who they are, without security
* `guest` -- we don't know who they are
2. Provenance: How they were authenticated (if at all), or how we believe
they are who they are.
* `authenticated` -- we trust who they are
* `unauthenticated` -- we think we know who they are, without security
* `guest` -- we don't know who they are
2. Provenance: How they were authenticated (if at all), or how we believe they are who they are.
3. `user_id` -- a unique user identifier
'''
for auth_method in learning_observer.settings.settings['event_auth']: