From 5177d6fed82fc3448e4fd57c43539a1585dcbf8a Mon Sep 17 00:00:00 2001
From: MatMatt
Date: Sun, 22 Dec 2024 07:23:02 +0000
Subject: [PATCH] deploy: 8a47403e4e809bb269aac7a91e784e1fda32f168

---
 ANNEX IV IT Principles.html | 1275 +++++++++++++++++++++++++++++++++++
 CheatSheet.html             |  392 +++--------
 sitemap.xml                 |   20 +-
 3 files changed, 1364 insertions(+), 323 deletions(-)
 create mode 100644 ANNEX IV IT Principles.html

diff --git a/ANNEX IV IT Principles.html b/ANNEX IV IT Principles.html
new file mode 100644
index 0000000..3fe2c9c
--- /dev/null
+++ b/ANNEX IV IT Principles.html
@@ -0,0 +1,1275 @@
+ + + + + + + + + + + +
+Copernicus Land Monitoring Service IT Architecture Principles and Implementation Guidelines
+ + + + + + + + + + + + + + + + + + +
+ +
+ +
+
+

Copernicus Land Monitoring Service IT Architecture Principles and Implementation Guidelines

+

ANNEX IV

+
+ + + +
+ +
+
Author
+
+

European Environment Agency (EEA)

+
+
+ +
+
Published
+
+

November 8, 2024

+
+
+ + +
+ + +
+
+
Keywords
+

Copernicus Land Monitoring Service, CLMS IT Architecture, European Environment Agency, IT Principles and Guidelines, IT Ecosystem, IT Security, EUPL Licensing, Reproducibility, Reusability, Transparency, Scalability, Maintainability, Resilient IT Solutions, Modular IT Architecture, Continuous Integration

+
+
+ +
+ + +
+

1 Preface

+

The EEA (European Environment Agency) CLMS IT architecture principles are indicative and must be evaluated in all IT deliverables.

+
+
+

2 Introduction

+

The IT architecture principles set the overall framework for the EEA CLMS IT landscape. They are designed to ensure consistency in deliverables while supporting the CLMS program’s IT vision and strategy. They are also designed to ensure that IT solutions are coherent, can be further developed and operated efficiently, and support business needs and security requirements.

+

A uniform approach is required to achieve this coherence. The EEA CLMS program’s IT applications may depend on and interact with each other. It is therefore important that IT solutions focus on connectivity and potential synergy effects to ensure continued coherence in the IT landscape.

+

Any application provided may be developed, operated, maintained, and further developed by a supplier different from the supplier who delivered the initial application. Therefore, efforts must always be made to be supplier independent. Other suppliers must be able to continue working from where the previous supplier left off.

+
+
+

3 Scope and key terms

+

The scope of the EEA CLMS IT architecture principles is IT solutions to be delivered to the CLMS. The solutions delivered will include functionalities required to support the program (for example, maintaining and operating the CLC+ Core multi-use grid-based Land Cover/Land Use hybrid data repository). The scope also includes the dependencies of these IT solutions on other internal or external systems.

+

Definition of the key terms used within this document:

+
    +
  • Application Programming Interface (API) - is a set of protocols, tools, and definitions that allow different software applications to communicate with each other. It defines the methods and data structures that developers can use to interact with a service or application, facilitating the exchange of data and functionality.

  • +
  • Automation scripts – refers to sets of instructions, written in scripting languages designed to automate repetitive tasks and processes. These scripts streamline workflows, reduce the need for manual intervention and ensure consistency and efficiency in performing tasks.

  • +
  • Client specific software/IT solution – is a custom-designed software/solution that is tailored to meet the unique needs, requirements, and preferences of a particular client or organization.

  • +
  • Commercial software – software products that are developed, marketed, and sold for profit by software companies or developers. Commercial software is typically licensed to end-users, who must purchase it or pay a subscription fee.

  • +
  • Continuous Integration and Continuous Deployment (CI/CD) – are closely related methodologies designed to streamline and automate software development. Together, they ensure that code changes are continuously tested, integrated, and deployed to production environments, enabling teams to deliver updates more rapidly, reliably, and with minimal manual intervention.

  • +
  • Deliverables – specific outputs, products, or results provided as part of the contract or a contractual agreement.

  • +
  • End of life (EOL) – refers to the date after which a product will no longer be sold or renewed (though it might still receive some form of support, such as security patches).

  • +
  • End of support (EOS) – refers to the date of complete cessation of all support services for the product, including new patches, updates or fixes.

  • +
  • Infrastructure-as-a-code – is an IT practice that involves managing and provisioning computing infrastructure through machine-readable definition files, rather than physical hardware configuration or interactive configuration tools.

  • +
  • IT ecosystem – is a network of interconnected technologies, software, hardware, and services that work together to support an organization’s digital operations. It includes applications, infrastructure, cloud services, and networks, all integrated to ensure seamless communication, scalability, security, and efficiency.

  • +
  • IT solution continuity - involves a collection of strategies and plans focused on maintaining an organization’s essential operations during and after disruptions. The goal is to minimize downtime, limit financial losses, and safeguard the organization’s reputation when faced with interruptions.

  • +
  • IT solutions – services, products and processes that use information technology to solve business problems, improve operational efficiency, and enhance overall performance. IT solutions are typically composed of various parts, for example the codebase, CI/CD routines, documentation, etc.

  • +
  • Open-ended – refers to components or features that are flexible, adaptable, and capable of evolving over time to meet a wide range of needs and requirements.

  • +
  • Pre-processing - refers to the series of operations performed on raw data to prepare it for further analysis and processing.

  • +
  • REST API service - is a type of web service that allows systems to communicate over HTTP by accessing and manipulating resources using standard methods such as GET, POST, PUT, and DELETE. RESTful APIs are stateless, meaning each request from a client to the server must contain all the necessary information, and they typically return data in formats like JSON or XML. This approach is widely used for building scalable, lightweight web services and enabling seamless integration between different systems.

  • +
  • Software development tools – applications, frameworks, and utilities that software developers use to create, debug, maintain, or support software.

  • +
  • Software product – a collection of computer programs, procedures, and documentation that performs a specific task or function or provides a comprehensive solution to a particular problem.

  • +
  • Source code – a set of instructions and statements written by a programmer using human-readable programming language. It is the original code written and saved in files before being compiled or interpreted into executable programs. Source code serves as the blueprint for creating software applications.

  • +
  • Workflow – systematic sequence of processes and activities.

  • +
+
+
+

4 Principles

+

The principles are grouped into 7 overarching IT architectural themes:

+
    +
  1. Reproducibility - the overarching goal is to ensure that any deliverable in the form of an IT solution may be reproduced given sufficient time.

  2. Reusability - the services and products provided through the EEA CLMS are to be reused by the end users as the foundation of their further work.

  3. Transparency - the EEA CLMS program is funded by the EU and supports its community with data and products. These products are to be part of the foundation for further work in the field and hence transparency is key to ensuring that extended work can be carried out.

  4. Scalability - the delivered IT solutions must be scalable to ensure future enhancements can fulfil coming demands and requirements.

  5. Maintainability - the EEA aims in the CLMS program to be able to provide updated products when new data becomes available, e.g. on a yearly basis. To reduce the time to market, the principle of maintainability is to be followed.

  6. IT security – IT solutions of the EEA CLMS are utilized by multiple stakeholders and thus IT security is paramount. This is especially critical for the open-ended aspects such as the use of APIs and outward facing web solutions.

  7. Resilience – IT solutions of the EEA CLMS are designed to withstand and recover from disruptions by remaining operational during unforeseen events.
+

Together, these principles guide the design, development, and evolution of the IT solutions in the EEA CLMS program. The principles should be periodically reviewed and updated to ensure alignment with the latest technological advancements and emerging best practices. This ongoing evaluation will help maintain the relevance, effectiveness, and security of the IT solutions.

+
+

4.1 Reproducibility

+

The overarching principle of reproducibility is further unfolded below in the following sub-principles:

+ ++++ + + + + + + + + + + + + + + + + + + + + + + + + +
Reproducibility 1: Description of workflows must be provided
What: Deliverables that result from pre-processing of data must be provided with a description of the pre-processing workflow
Why: To ensure that the deliverable can be reproduced, details must be provided on how this can be achieved
Consequence: Descriptions of pre-processing workflow steps are to be provided with deliverables. Ideally, the workflows are delivered as scripts or similar. At a minimum, documentation of how the workflows are to be set up is to be provided
Example: A delivery that includes a web application shall include a description of the build process, such as the compilation of source code, packaging of the application, and deployment steps. This could, for instance, include details on the specific versions of tools used (e.g. Node.js, Docker, etc.)
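One way to record the tool versions mentioned in the example is a small machine-readable manifest shipped with the deliverable. The following is a minimal sketch, not a prescribed CLMS format: the function name `build_manifest`, the manifest keys, and the version numbers are all assumptions made for illustration.

```python
import json
import platform
import sys


def build_manifest(tools: dict[str, str]) -> str:
    """Return a JSON manifest describing the build environment.

    `tools` maps a tool name to the version used in the build. The names and
    versions are supplied by the caller; nothing here is a mandated schema.
    """
    manifest = {
        "python": platform.python_version(),  # interpreter that ran the build
        "platform": sys.platform,             # e.g. "linux"
        "tools": dict(sorted(tools.items())),
    }
    return json.dumps(manifest, indent=2, sort_keys=True)


if __name__ == "__main__":
    # Hypothetical versions, for illustration only.
    print(build_manifest({"node": "20.11.0", "docker": "25.0.3"}))
```

Storing such a manifest next to the workflow scripts lets a later supplier see which tool versions a build assumed.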
+ ++++ + + + + + + + + + + + + + + + + + + + + + + + + +
Reproducibility 2: Data sources to be supplied with deliverables
What: IT solutions which utilize data sources must supply the data sources
Why: To ensure that the deliverable can be reproduced, details are required on the data sources used, along with any enrichment that has been applied to them
Consequence: Data source location must be provided if data are publicly available. If data are not accessible to the CLMS, the data are to be provided as part of the deliverable
Example: If the software relies on a proprietary weather data API that is not publicly accessible, the data, or at least a sample dataset, should be provided with the delivery. If the API is publicly available, detailed instructions on how to access it (e.g., API keys, endpoint URLs) must be included
+ ++++ + + + + + + + + + + + + + + + + + + + + + + + + +
Reproducibility 3: List of software used in development of IT solution to be provided
What: The software products which have been used in the development of the software are to be listed as part of the deliverable
Why: To ensure that the IT solution can be further developed, details are required of the software components/products that were used in the development
Consequence: A list of the software development tools used in production is to be provided. Furthermore, for client specific developments the source code must also be provided
Example: A system consisting of several building blocks, such as user interface, backend, importer, and exporter modules, shall be provided with a list of the software development tools used to produce these building blocks and modules
+ ++++ + + + + + + + + + + + + + + + + + + + + + + + + +
Reproducibility 4: Automation tools/scripts used in the production of the IT solution must be provided
What: IT solutions which include automation scripts/workflows in the development must supply these scripts as part of the deliverable
Why: Automation scripts used in development are viewed as part of the deliverable and are required for reproduction of the solution
Consequence: Automation scripts, whether stand-alone scripts or a configuration of standard/commercial software, must be provided as part of the deliverable
Example: If the IT deliverable includes an automatic backup that generates full backups at certain intervals, then the automation scripts behind the backup generation must be provided as part of the deliverable, so that the backup set-up can be recreated
+ ++++ + + + + + + + + + + + + + + + + + + + + + + + + +
Reproducibility 5: If a solution includes outcomes of pre-executed algorithms, the prerequisites for running the algorithms must be provided
What: To ensure reproducibility, the algorithms must be provided either as pseudo code or as source code
Why: The foundation of the IT solution must be reproducible to ensure that future enhancements are possible, for instance if new insights or data become available after the end of the contractual agreement
Consequence: The supplier must, as part of the deliverable, also detail any algorithms which form the basis of the solution
Example: A spatial product providing a detailed pan-European wall-to-wall raster at 10-meter spatial resolution, based on a supervised classification of satellite image time-series. The supplier must provide a detailed description of the algorithm that was used for classifying the satellite image time-series
+
+
+

4.2 Reusability

+

The principle of reusability is detailed in the following sub-principles:

+ ++++ + + + + + + + + + + + + + + + + + + + + + + + + +
Reusability 1: IT solutions should be open-ended and equipped with APIs
What: IT solutions should be open-ended and equipped with APIs through which functionality or data key for the end user may be accessed
Why: To ensure that further work benefits from existing solutions, it is paramount that delivered systems are open-ended. Future work can thereby utilize and benefit from previous work. Delivered IT solutions should form part of the overall IT ecosystem of the EEA CLMS program so that the “whole is greater than the sum of its parts”
Consequence: IT solutions should be provided with APIs which access key functionality of the IT solutions
Example: A web service which publishes geospatial data has a REST API service which grants users direct access to the data
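The REST example above can be sketched with the Python standard library alone. This is an illustrative sketch under stated assumptions, not an EEA interface: the `/layers/<id>` route, the `LAYERS` catalogue, and its field names are hypothetical. The routing logic lives in a plain function so it can be exercised without opening a socket.

```python
import json
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

# Hypothetical in-memory catalogue standing in for published geospatial layers.
LAYERS = {
    "demo-layer": {"id": "demo-layer", "crs": "EPSG:3035", "resolution_m": 10},
}


def route(method: str, path: str) -> tuple[int, dict]:
    """Map a request to (HTTP status, JSON-serialisable payload)."""
    if method != "GET":
        return 405, {"error": "method not allowed"}
    parts = path.strip("/").split("/")
    if parts == ["layers"]:
        return 200, {"layers": sorted(LAYERS)}
    if len(parts) == 2 and parts[0] == "layers" and parts[1] in LAYERS:
        return 200, LAYERS[parts[1]]
    return 404, {"error": "not found"}


class LayerHandler(BaseHTTPRequestHandler):
    """Thin HTTP wrapper around route(); each request is handled statelessly."""

    def do_GET(self):
        status, payload = route("GET", self.path)
        body = json.dumps(payload).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


if __name__ == "__main__":
    # Port 8000 is an arbitrary choice for local experimentation.
    ThreadingHTTPServer(("127.0.0.1", 8000), LayerHandler).serve_forever()
```

Keeping the routing separate from the HTTP wrapper also makes the API easy to cover with automated tests.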
+ ++++ + + + + + + + + + + + + + + + + + + + + + + + + +
Reusability 2: Scripts used in production must be delivered as part of IT solutions
What: Scripts should be delivered with the code so that end users may use them as templates for further development
Why: Data, conditions, or requirements may change for an IT solution. To ensure that such changes can be accommodated, it must be possible to modify the underlying scripts to reflect and support them
Consequence: Scripts used in production form part of the final deliverable
Example: An IT delivery consisting of several building blocks shall be provided with scripts, included with the final delivery of the code, so that the end users of the system can modify, expand, or adapt the building blocks/modules to suit specific needs or add new features
+ ++++ + + + + + + + + + + + + + + + + + + + + + + + + +
Reusability 3: Documentation of IT solutions is to be provided
What: Documentation of the developed IT solutions must be provided. The requested documentation shall also be provided in Quarto Markdown format on the dedicated EEA GitHub repository
Why: For the further use and improvement of the IT solution, technical documentation is paramount.
Consequence: Documentation including, but not limited to, the System Description Document (SDD), the System Deployment Document and examples must be provided with IT solution deliverables. The requested documents shall also be provided in Quarto Markdown format.
Example: An IT delivery consisting of several building blocks shall be provided with an SDD, user guidelines, and detailed documentation of system deployment, including, but not limited to, system and storage architecture, infrastructure setup, provisioning, monitoring, disaster recovery, accessibility, scalability options and performance. If requested, this documentation shall be provided in Quarto Markdown format on the dedicated EEA GitHub repository
+
+
+

4.3 Transparency

+

The CLMS is funded by the EU and supports its community with data and services. As such, these products and services are to be part of the foundation for further work in the field and accessible to the community. To support this, the principle of transparency is detailed in the following sub-principles:

+ + + + + + + + + + + + + + + + + + + + + + + + + +
Transparency 1: Source code of client specific software to be supplied with IT solution
What: Source code, including deployment and integration scripts, of the client specific IT solution is supplied as part of the deliverable and made publicly available under the EUPL-1.2 license
Why: To ensure transparency, it is essential to have clear insights into the client-specific software. This enables efficient future developments and modifications
Consequence: Source code of client specific software must be delivered with the IT solution. The source code shall include CI/CD (Continuous Integration/Continuous Deployment) and Docker recipes, and be published under the EUPL-1.2 license
Example: Source code of all the components of the specific IT solution must be delivered. Any updates or developments of the source code shall be reflected in the EEA GitHub repository, which is the main repository of the system. Moreover, the client specific IT solutions shall be published under the EUPL-1.2 license, so that openness and transparency are ensured
+ ++++ + + + + + + + + + + + + + + + + + + + + + + + + +
Transparency 2: Inline documentation of the source code
What: Source code of the client specific IT solution must be documented inline
Why: To ease the handover from one developer to the next, inline documentation is to be included to guide the developer on the job
Consequence: Source code must have inline documentation. Inline documentation should be formatted so that it may be easily extracted to generate online documentation
Example: Source code of all the components of the IT solution must have inline documentation. The documentation shall be structured, follow common conventions, and be kept concise but comprehensive
+ ++++ + + + + + + + + + + + + + + + + + + + + + + + + +
Transparency 3: Commercial software used in the production/development must be attainable by the EEA or a third-party provider
What: Commercial software that is a prerequisite must be attainable on comparable terms. Such software is justified only if no open alternative exists
Why: To ensure that further work may be carried out, any prerequisites in the form of software must be attainable by the EEA or a supplier
Consequence: Generally attainable commercial software used in production must be listed when delivering an IT solution. Name of software, version, EOL and EOS are to be supplied
Example: An IT solution deploys various components, and the virtual machines that house the components are set up by means of an infrastructure-as-code tool. All capabilities of the infrastructure-as-code tool that require purchasing must be listed when delivering the IT solution
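The listing required above (name, version, EOL, EOS) lends itself to a machine-checkable inventory. The sketch below is one possible shape, assuming a hypothetical `SoftwareEntry` record and made-up product names and dates; it is not a format defined by this annex.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class SoftwareEntry:
    """One row of a commercial-software inventory (hypothetical structure)."""

    name: str
    version: str
    eol: date  # end of life: no longer sold or renewed after this date
    eos: date  # end of support: no patches, updates or fixes after this date


def unsupported(entries: list[SoftwareEntry], today: date) -> list[str]:
    """Return the names of entries whose end-of-support date has passed."""
    return [entry.name for entry in entries if entry.eos < today]


if __name__ == "__main__":
    # Product names and dates are invented for illustration.
    inventory = [
        SoftwareEntry("ExampleIaC Pro", "4.2", date(2025, 6, 30), date(2027, 6, 30)),
        SoftwareEntry("LegacyETL", "1.9", date(2021, 1, 31), date(2023, 1, 31)),
    ]
    print(unsupported(inventory, date(2024, 11, 8)))
```

A check like this could run periodically to flag dependencies that have passed end of support.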
+
+
+

4.4 Scalability

+

The delivered IT solutions must be scalable to accommodate future enhancements and demands. Scalability in this sense covers both scalability in functionality and the ability to handle an increased load/traffic. To this end, the following scalability principles apply:

+ ++++ + + + + + + + + + + + + + + + + + + + + + + + + +
Scalability 1: Client specific IT solutions should have a modular structure
What: Modular structure of client specific IT solutions is a requirement. This may be achieved using e.g. microservices
Why: A modular structure is sought to ensure that further development and updates are possible. The possibility of substituting or adding modules in an IT solution will increase the lifespan of a solution
Consequence: Modular architecture of IT solutions is a requisite
Example: If the client specific IT solution has, for example, grown its user base since the launch of the solution, then scaling up shall be possible at any time, both vertically (more CPUs, RAM) and horizontally (more VMs)
+ ++++ + + + + + + + + + + + + + + + + + + + + + + + + +
Scalability 2: Client specific IT solutions must be able to interface with other IT solutions
What: The IT deliverable must be able to be used in conjunction with other deliverables to form a composite solution
Why: To make the most of the funds available, the developed solutions should form part of an IT ecosystem making up a whole
Consequence: IT deliverables must be equipped with documented APIs for interfacing with other IT applications
Example: A client specific product which can be used for extracting and manipulating data should be accessible programmatically through e.g. well documented REST services
+
+
+

4.5 Maintainability

+

The EEA aims in the CLMS program to be able to provide updated products when new data becomes available. To reduce the time to market the principle of maintainability is to be followed.

+ ++++ + + + + + + + + + + + + + + + + + + + + + + + + +
Maintainability 1: IT solutions are to be delivered on a principle of CI/CD
What: The launch of new releases of IT deliverables is to be configured and managed so that new functionality is available as soon as possible. The principles of CI/CD are to be adhered to
Why: Time to market is to be reduced by keeping deployments maintainable, so that new functionality can be released as soon as possible
Consequence: IT deliverables are to be supplied with a DevOps set-up which supports CI/CD
Example: A delivered IT solution is organized with a test server environment, and potentially a pre-production environment, used for quality assurance and continuous development, so that deployment to production can be initiated smoothly
+ ++++ + + + + + + + + + + + + + + + + + + + + + + + + +
Maintainability 2: IT solutions are to be Dockerized or similar
What: The use of container technology is encouraged
Why: Container technology eases the work of moving IT solutions around the IT infrastructure, making deployment easier to automate
Consequence: IT solutions are to be deployed using Dockerization or similar
Example: Software components of the client specific IT solution shall be provided as Docker images so that deployment is flexible with respect to hardware
+ ++++ + + + + + + + + + + + + + + + + + + + + + + + + +
Maintainability 3: Tests are to be organised so that they may be automated
What: Tests are to be structured so that they may be easily automated
Why: To ensure that the CI/CD process does not introduce bugs or deployment failures, tests are to be automated so that they can be run continuously to ensure the quality of the solution and its possible enhancements
Consequence: Tests are to be delivered so that they can be automated
Example: During the test phase, the delivered solution has run through a number of tests, e.g. unit tests and result verification tests. These will be the basis for automated regression tests
+ ++++ + + + + + + + + + + + + + + + + + + + + + + + + +
Maintainability 4: IT solutions are to be regularly assessed
What: IT solutions are to be automatically monitored with a notification service, and their performance routinely evaluated to ensure optimal functioning
Why: Regular assessments ensure that IT solutions can be maintained so as to meet emerging needs, threats and technological advancements
Consequence: The IT solution’s scalability, security, and overall performance are continuously monitored and evaluated to address performance and/or security issues
Example: The delivered IT solution and its associated dependencies are regularly assessed and evaluated. The evaluation process should also account for advancements in technology and track developments to ensure the solution remains relevant and effective
+
+
+

4.6 IT security

+

The IT solutions of the CLMS program shall ensure system integrity against various security threats, protection of the data, and maintenance of privacy. The following sub-principles are to be followed:

+ ++++ + + + + + + + + + + + + + + + + + + + + + + + + +
IT security 1: Incorporate security considerations from the beginning of the system development
What: Ensure security is integrated into all stages of the system development lifecycle, from planning to deployment
Why: Early integration of security measures reduces vulnerabilities, lowers costs associated with late-stage fixes, and ensures robust protection against threats
Consequence: Threat modelling and security assessments need to be conducted from the start, as well as allocation of resources for ongoing security reviews and testing
Example: Standard aspects such as two-factor authentication, protection against SQL injection, encryption of sensitive data, etc.
+ ++++ + + + + + + + + + + + + + + + + + + + + + + + + +
IT security 2: Compliance with relevant laws, regulations and industry standards
What: IT solutions must adhere to legal requirements, industry standards, and regulations, e.g. EUDPR, ISO
Why: Compliance ensures legal and regulatory adherence, builds trust, protects sensitive data, and mitigates the risk of legal penalties and breaches
Consequence: IT deliverables need to incorporate robust security measures, include documentation of compliance efforts, and ensure that features and processes are aligned with legal requirements and industry standards
Example: Data handling agreements must be in place, server location in the EU must be considered, etc.
+ ++++ + + + + + + + + + + + + + + + + + + + + + + + + +
IT security 3: Ensuring that users and systems have appropriate permissions based on their roles and responsibilities
What: Implement role-based access control (RBAC) to manage user and system permissions according to their roles
Why: It prevents unauthorized access, minimizes the risk of data breaches, and ensures that users only have access to the information necessary for their roles
Consequence: The provider will need to define clear roles and responsibilities, implement RBAC policies, and regularly review and update access controls
Example: A delivered IT solution has role-based access, which ensures that only admin users are allowed to manage (add, edit, activate, deactivate) users and organisations. Also, only administrators can view and edit any ingestion and extraction within the system, in order to support users who need help
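The role-based access described in the example can be illustrated with a simple permission lookup. This is a minimal sketch, assuming hypothetical role and permission names (`admin`, `user`, `manage_users`, and so on); a real deployment would typically rely on the access-control features of its framework or identity provider.

```python
# Hypothetical role-to-permission mapping; names are illustrative only.
ROLE_PERMISSIONS: dict[str, frozenset[str]] = {
    "admin": frozenset(
        {"manage_users", "manage_organisations", "view_any_ingestion", "edit_any_ingestion"}
    ),
    "user": frozenset({"view_own_ingestion", "edit_own_ingestion"}),
}


def is_allowed(role: str, permission: str) -> bool:
    """Return True if the role grants the permission; unknown roles get nothing."""
    return permission in ROLE_PERMISSIONS.get(role, frozenset())


def require(role: str, permission: str) -> None:
    """Raise PermissionError unless the role grants the permission."""
    if not is_allowed(role, permission):
        raise PermissionError(f"role {role!r} lacks permission {permission!r}")


if __name__ == "__main__":
    require("admin", "manage_users")          # passes silently
    print(is_allowed("user", "manage_users"))  # a regular user is denied
```

Denying by default for unknown roles keeps the check safe when a new role is introduced before its permissions are defined.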
+
+
+

4.7 Resilience

+ ++++ + + + + + + + + + + + + + + + + + + + + + + + + +
Resilience 1: IT solution should have a disaster recovery plan
What: The IT solution should have a well-defined process for restoring IT systems, data, and operations following a disruption
Why: To ensure that the IT solution and data are recoverable after an unforeseen event
Consequence: IT deliverables will be provided with a well-prepared disaster recovery plan that ensures rapid restoration of services and data integrity, and minimizes damage
Example: A delivered IT solution has a disaster recovery plan that includes backup protocols, data replication, and recovery timelines
+ ++++ + + + + + + + + + + + + + + + + + + + + + + + + +
Resilience 2: Ensuring IT solution continuity
What: The IT solution is designed and implemented in a way that ensures continuous operation during a disruption
Why: To maintain critical operations with minimal downtime, even when confronted with unforeseen events
Consequence: IT deliverables are designed for high availability, incorporating redundancy so that in case of a disruption/failure a restore service can immediately take over, minimizing downtime and ensuring continuous operation
Example: In the event of a system failure or disruption of the delivered IT solution, a restore service automatically takes over to maintain service continuity. For instance, if a primary system goes down, a secondary system activates, ensuring that users experience no downtime.
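The primary/secondary takeover in the example can be sketched at the application level as an ordered list of backends tried in turn. This is only an illustration of the idea: the function name `call_with_failover` and the backend labels are assumptions, and production systems usually implement failover in infrastructure (load balancers, replicated databases) rather than in client code.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def call_with_failover(backends: Sequence[tuple[str, Callable[[], T]]]) -> tuple[str, T]:
    """Call each backend in order and return (name, result) of the first success.

    Raises RuntimeError listing every failure if no backend succeeds.
    """
    errors = []
    for name, backend in backends:
        try:
            return name, backend()
        except Exception as exc:  # broad on purpose: any failure triggers failover
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all backends failed: " + "; ".join(errors))


if __name__ == "__main__":
    def primary() -> str:
        raise ConnectionError("primary is down")  # simulated outage

    def secondary() -> str:
        return "served by secondary"

    print(call_with_failover([("primary", primary), ("secondary", secondary)]))
```

Returning the name of the backend that answered makes it possible to log or alert when the secondary has taken over.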
+

+
+
+ + +
+ + +
+ + + + +
\ No newline at end of file
diff --git a/CheatSheet.html b/CheatSheet.html
index 3efd9eb..6a04fd3 100644
--- a/CheatSheet.html
+++ b/CheatSheet.html
@@ -2,12 +2,12 @@
- + - +
 A Cheatsheet for Developing Standards for Generative AI Training and Web Crawlers
@@ -25,7 +25,7 @@
 }
 /* CSS for syntax highlighting */
 pre > code.sourceCode { white-space: pre; position: relative; }
-pre > code.sourceCode > span { line-height: 1.25; }
+pre > code.sourceCode > span { display: inline-block; line-height: 1.25; }
 pre > code.sourceCode > span:empty { height: 1.2em; }
 .sourceCode { overflow: visible; }
 code.sourceCode > span { color: inherit; text-decoration: inherit; }
@@ -36,7 +36,7 @@
 }
 @media print {
 pre > code.sourceCode { white-space: pre-wrap; }
-pre > code.sourceCode > span { display: inline-block; text-indent: -5em; padding-left: 5em; }
+pre > code.sourceCode > span { text-indent: -5em; padding-left: 5em; }
 }
 pre.numberSource code { counter-reset: source-line 0; }
@@ -101,17 +101,12 @@

Contents

  • 2.3. PDF Structuring for AI Integration
  • 2.4. HTML Structuring for AI Integration
  • -
  • 3. Importance of Sitemap Indexing in HTML Documents
  • +
  • 3. Importance of Sitemap Indexing in HTML Documents for Easy Web Crawling and Generative AI Training
  • 4. Best Practices for Information Formatting
  • -
  • 5. Quarto Markdown Editors -
  • -
  • 6. Automation with GitHub Deployment
  • +
  • 5. Automation with GitHub Deployment
  • 6. Conclusion
  • - +

    Other Formats

    @@ -135,7 +130,7 @@

    A Cheatsheet for Developing Standards for Generative AI Traini
    Published
    -

    October 29, 2024

    +

    September 30, 2024

    @@ -143,17 +138,10 @@

    A Cheatsheet for Developing Standards for Generative AI Traini -
    -
    -
    Keywords
    -

    AI standards, web crawlers, AI training, content formatting

    -
    -
    - - -

    This document serve as a quick reference guide to ensure content follows structured formats essential for web crawlers and AI systems. Utilizing Quarto Markdown in HTMLs and generating sitemaps are critical for efficient crawling, helping search engines and AI models quickly index and retrieve well-structured content.

    +

    ::: {style=“font-family: ‘Times New Roman’, serif; text-align: justify;”}

    +

    This document serves as a quick reference guide to ensure content follows structured formats essential for web crawlers and AI systems. Using Quarto Markdown to produce HTML and generating sitemaps are critical for efficient crawling, helping search engines and AI models quickly index and retrieve well-structured content.

    1. Introduction

    @@ -201,28 +189,24 @@

    YAML Example for

    2.2. HTML Structuring for Web Crawlers

    Semantic HTML5 elements, such as <article>, <section>, and <header>, help web crawlers index and understand the content more efficiently.

    -
    ---
    -<article>
    -  <header>
    -    <h1>Understanding Web Crawlers</h1>
    -    <meta name="description" content="Overview of web crawlers and their role in AI training." />
    -  </header>
    -  <section>
    -    <h2>How Web Crawlers Index Content</h2>
    -    <p>Web crawlers use links and metadata to index the web.</p>
    -  </section>
    -</article>
    ----
    +
    <article>
    +  <header>
    +    <h1>Understanding Web Crawlers</h1>
    +    <meta name="description" content="Overview of web crawlers and their role in AI training." />
    +  </header>
    +  <section>
    +    <h2>How Web Crawlers Index Content</h2>
    +    <p>Web crawlers use links and metadata to index the web.</p>
    +  </section>
    +</article>
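
    As a rough illustration of how a crawler-side tool might notice the semantic elements in the example above, the following Python sketch counts them with the standard-library `html.parser` module. The tag set `SEMANTIC_TAGS` and the helper `count_semantic_tags` are hypothetical names chosen for this sketch; they are not part of the cheatsheet or of any crawler's actual implementation.

```python
from collections import Counter
from html.parser import HTMLParser

# Assumption: an illustrative subset of HTML5 semantic elements.
SEMANTIC_TAGS = {"article", "section", "header", "footer", "nav", "main", "aside"}


class SemanticTagCounter(HTMLParser):
    """Counts opening tags that belong to SEMANTIC_TAGS."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # html.parser lower-cases tag names before calling this method.
        if tag in SEMANTIC_TAGS:
            self.counts[tag] += 1


def count_semantic_tags(html: str) -> Counter:
    """Return how often each semantic element is opened in the given HTML."""
    parser = SemanticTagCounter()
    parser.feed(html)
    parser.close()
    return parser.counts
```

    A page that reports zero semantic elements is probably built from generic `<div>` containers only and may be worth restructuring.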

    2.2.1. Microdata for Structured Content

    -
    ---
    -<article itemscope itemtype="https://schema.org/Article">
    -  <header>
    -    <h1 itemprop="headline">AI and Web Crawling</h1>
    -    <meta itemprop="description" content="Overview of AI training using web crawlers." />
    -  </header>
    -</article>
    ----
    +
    <article itemscope itemtype="https://schema.org/Article">
    +  <header>
    +    <h1 itemprop="headline">AI and Web Crawling</h1>
    +    <meta itemprop="description" content="Overview of AI training using web crawlers." />
    +  </header>
    +</article>

    @@ -239,35 +223,33 @@

    2.3. PD

    2.4. HTML Structuring for AI Integration

    To optimize content for AI integration, HTML documents should include semantic elements, structured data formats like JSON-LD, and relevant metadata. This helps AI systems process and train on the content efficiently.

    -
    ---
    -<article itemscope itemtype="https://schema.org/Article">
    -  <header>
    -    <h1 itemprop="headline">AI Training Data and Web Crawlers</h1>
    -    <meta name="description" content="How to structure content for AI training and web crawling." />
    -  </header>
    -  <section>
    -    <h2>AI Model Training</h2>
    -    <p>Semantic structure is essential for AI to understand content.</p>
    -    <script type="application/ld+json">
    -    {
    -      "@context": "https://schema.org",
    -      "@type": "Dataset",
    -      "name": "AI Training Data",
    -      "description": "Dataset structured for AI and web crawlers.",
    -      "creator": {
    -        "@type": "Organization",
    -        "name": "Your Organization"
    -      }
    -    }
    -    </script>
    -  </section>
    -</article>
    ----
    +
    <article itemscope itemtype="https://schema.org/Article">
    +  <header>
    +    <h1 itemprop="headline">AI Training Data and Web Crawlers</h1>
    +    <meta name="description" content="How to structure content for AI training and web crawling." />
    +  </header>
    +  <section>
    +    <h2>AI Model Training</h2>
    +    <p>Semantic structure is essential for AI to understand content.</p>
    +    <script type="application/ld+json">
    +    {
    +      "@context": "https://schema.org",
    +      "@type": "Dataset",
    +      "name": "AI Training Data",
    +      "description": "Dataset structured for AI and web crawlers.",
    +      "creator": {
    +        "@type": "Organization",
    +        "name": "Your Organization"
    +      }
    +    }
    +    </script>
    +  </section>
    +</article>
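
    The JSON-LD block in the example above can also be generated rather than written by hand. The Python sketch below is one possible way to do that with the standard-library `json` module; the function name `dataset_jsonld` and its parameters are assumptions made for illustration, and the field values simply mirror the example.

```python
import json


def dataset_jsonld(name: str, description: str, organization: str) -> str:
    """Return a <script> element carrying schema.org Dataset metadata as JSON-LD."""
    data = {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "creator": {"@type": "Organization", "name": organization},
    }
    # Escape "</" so that a value containing "</script>" cannot close the tag early.
    # "\/" is a valid JSON escape for "/", so the payload still parses unchanged.
    payload = json.dumps(data, indent=2).replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{payload}\n</script>'
```

    Generating the block from a dictionary keeps the JSON syntactically valid and makes it easier to fill in the metadata from a document's front matter.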

    -
    -

    3. Importance of Sitemap Indexing in HTML Documents

    +
    +

    3. Importance of Sitemap Indexing in HTML Documents for Easy Web Crawling and Generative AI Training

    Sitemaps are essential for enhancing the discoverability and accessibility of web content for both web crawlers and AI systems. As an XML file, a sitemap provides a structured roadmap of a website, listing URLs, metadata, and details like last modified dates and update frequency. This helps crawlers efficiently index content and enables generative AI models to train on well-structured data, improving processing and retrieval accuracy. The key benefits of sitemap indexing for web crawling and AI training are:

    • Improved Discoverability: Sitemaps enable web crawlers to find all relevant resources on a site, especially for deep or hard-to-reach pages.

    • @@ -275,16 +257,13 @@

      3. Importance of Sitemap Indexing in HTML Documents

    • Structured Data for AI Training: Well-indexed documents help generative AI models understand relationships between content, improving relevance and accuracy in AI-generated responses.

    • Faster Content Retrieval: Sitemaps can speed up indexing and help pages surface in search results, so new content becomes accessible to AI models sooner.

    -
    ---
    -<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    -<url>
    -    <loc>https://<your-username>.github.io/<your-repo-name>/index.html</loc>
    -    <lastmod>2024-10-08T12:24:05Z</lastmod>
    -    <changefreq>monthly</changefreq>
    -    <priority>0.8</priority>
    -</url>
    -</urlset>
    ----
    +
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    +   <url>
    +      <loc>http://example.com/ai-training</loc>
    +      <lastmod>2024-09-30</lastmod>
    +      <changefreq>monthly</changefreq>
    +   </url>
    +</urlset>

    Submit your sitemap to search engines via tools like Google Search Console to ensure your content is indexed properly. This improves the discoverability of AI training datasets and documents by web crawlers and AI models.
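
    A sitemap of the shape shown above can be produced from the rendered HTML files with a short script. The following Python sketch is a minimal illustration, not the project's actual deployment script: the function name `build_sitemap`, its parameters, and the choice of the file modification time for `lastmod` are all assumptions.

```python
from datetime import datetime, timezone
from pathlib import Path
from urllib.parse import quote
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(site_dir: Path, base_url: str, changefreq: str = "monthly") -> str:
    """Return sitemap XML with one <url> entry per .html file under site_dir."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for page in sorted(site_dir.rglob("*.html")):
        # Percent-encode the relative path so file names with spaces stay valid URLs.
        rel = quote(page.relative_to(site_dir).as_posix())
        # Assumption: the file's modification time stands in for the last content change.
        mtime = datetime.fromtimestamp(page.stat().st_mtime, tz=timezone.utc)
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = f"{base_url.rstrip('/')}/{rel}"
        ET.SubElement(url, "lastmod").text = mtime.strftime("%Y-%m-%d")
        ET.SubElement(url, "changefreq").text = changefreq
    return ET.tostring(urlset, encoding="unicode")
```

    For example, `Path("sitemap.xml").write_text(build_sitemap(Path("_site"), "http://example.com"))` would write the file next to the script, where `_site` is a placeholder for whatever directory holds the rendered pages.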


    @@ -298,44 +277,13 @@

    4. Best Practices for Information Formatting


    -
    -

    5. Quarto Markdown Editors

    -

    To work with Quarto Markdown (.qmd) files and have them generated automatically, we can use several editors that integrate well with Quarto. VS Code (Visual Studio Code), RStudio, JupyterLab with Quarto Integration, and Atom with Quarto Plugin are some popular editors that support Quarto and can automatically generate .qmd files.

    -

    RStudio is lightweight and easy to use; it integrates with Quarto and provides tools for rendering, previewing, and managing .qmd documents.

    -
    -

    Steps to Set It Up

    -
      -
    1. Install RStudio: Download from RStudio.
    2. -
    3. Install Quarto: Follow Quarto installation instructions to install Quarto.
    4. -
    5. Create a New Quarto Document: -
        -
      • In RStudio, go to File > New File > Quarto Document.
      • -
      • Choose the type of document you want (e.g., HTML, PDF, Word).
      • -
      • A .qmd file will be created automatically.
      • -
    6. -
    7. Automatically Render .qmd: -
        -
      • After editing your document, you can preview it using Render or export it to various formats.
      • -
    8. -
    -
    -
    -

    Benefits

    -
      -
    • Full support for Quarto with an integrated environment.
    • -
    • Provides tools for live preview and exporting.
    • -
    • Ideal for users familiar with R or data science workflows.
    • -
    -
    -
    -
    -

    6. Automation with GitHub Deployment

    +

    5. Automation with GitHub Deployment

    Automation is crucial for ensuring efficiency and consistency in the deployment of content structured for AI integration and web crawlers. By automating the rendering of Quarto Markdown, Markdown, and Jupyter Notebook files into HTML, generating a sitemap, and deploying the output to GitHub Pages, the process becomes seamless and repeatable with minimal human intervention. This ensures that any changes to content are reflected on the website as soon as the workflow completes, keeping the content discoverable and up-to-date for web crawlers and AI systems. The steps in the automation pipeline are:

    1. Trigger on Push or Pull Requests:
        -
      • The workflow is triggered whenever .qmd files are modified or included in a pull request, ensuring content is updated automatically.
      • +
      • The workflow is triggered whenever .qmd, .md, or .ipynb files are modified or included in a pull request, ensuring content is updated automatically.
    2. Checkout Repository:
        @@ -355,7 +303,7 @@

        6. Automation with GitHub Deployment

    3. Generate Sitemap:
        -
      • Automatically creates a sitemap.xml following the Google structure, which helps search engines and web crawlers discover all available content on the website.
      • +
      • Automatically creates a sitemap.xml that helps search engines and web crawlers discover all available content on the website.
    4. Deploy to GitHub Pages:
        @@ -407,7 +355,18 @@

        6. Conclusion

        } return false; } - const onCopySuccess = function(e) { + const clipboard = new window.ClipboardJS('.code-copy-button', { + text: function(trigger) { + const codeEl = trigger.previousElementSibling.cloneNode(true); + for (const childEl of codeEl.children) { + if (isCodeAnnotation(childEl)) { + childEl.remove(); + } + } + return codeEl.innerText; + } + }); + clipboard.on('success', function(e) { // button target const button = e.trigger; // don't keep focus @@ -439,50 +398,11 @@

        6. Conclusion

        }, 1000); // clear code selection e.clearSelection(); - } - const getTextToCopy = function(trigger) { - const codeEl = trigger.previousElementSibling.cloneNode(true); - for (const childEl of codeEl.children) { - if (isCodeAnnotation(childEl)) { - childEl.remove(); - } - } - return codeEl.innerText; - } - const clipboard = new window.ClipboardJS('.code-copy-button:not([data-in-quarto-modal])', { - text: getTextToCopy }); - clipboard.on('success', onCopySuccess); - if (window.document.getElementById('quarto-embedded-source-code-modal')) { - // For code content inside modals, clipBoardJS needs to be initialized with a container option - // TODO: Check when it could be a function (https://github.com/zenorocha/clipboard.js/issues/860) - const clipboardModal = new window.ClipboardJS('.code-copy-button[data-in-quarto-modal]', { - text: getTextToCopy, - container: window.document.getElementById('quarto-embedded-source-code-modal') - }); - clipboardModal.on('success', onCopySuccess); - } - var localhostRegex = new RegExp(/^(?:http|https):\/\/localhost\:?[0-9]*\//); - var mailtoRegex = new RegExp(/^mailto:/); - var filterRegex = new RegExp('/' + window.location.host + '/'); - var isInternal = (href) => { - return filterRegex.test(href) || localhostRegex.test(href) || mailtoRegex.test(href); - } - // Inspect non-navigation links and adorn them if external - var links = window.document.querySelectorAll('a[href]:not(.nav-link):not(.navbar-brand):not(.toc-action):not(.sidebar-link):not(.sidebar-item-toggle):not(.pagination-link):not(.no-external):not([aria-hidden]):not(.dropdown-item):not(.quarto-navigation-tool):not(.about-link)'); - for (var i=0; i6. Conclusion

    interactive: true, interactiveBorder: 10, theme: 'quarto', - placement: 'bottom-start', + placement: 'bottom-start' }; - if (contentFn) { - config.content = contentFn; - } - if (onTriggerFn) { - config.onTrigger = onTriggerFn; - } - if (onUntriggerFn) { - config.onUntrigger = onUntriggerFn; - } window.tippy(el, config); } const noterefs = window.document.querySelectorAll('a[role="doc-noteref"]'); @@ -514,130 +425,7 @@

    6. Conclusion

    try { href = new URL(href).hash; } catch {} const id = href.replace(/^#\/?/, ""); const note = window.document.getElementById(id); - if (note) { - return note.innerHTML; - } else { - return ""; - } - }); - } - const xrefs = window.document.querySelectorAll('a.quarto-xref'); - const processXRef = (id, note) => { - // Strip column container classes - const stripColumnClz = (el) => { - el.classList.remove("page-full", "page-columns"); - if (el.children) { - for (const child of el.children) { - stripColumnClz(child); - } - } - } - stripColumnClz(note) - if (id === null || id.startsWith('sec-')) { - // Special case sections, only their first couple elements - const container = document.createElement("div"); - if (note.children && note.children.length > 2) { - container.appendChild(note.children[0].cloneNode(true)); - for (let i = 1; i < note.children.length; i++) { - const child = note.children[i]; - if (child.tagName === "P" && child.innerText === "") { - continue; - } else { - container.appendChild(child.cloneNode(true)); - break; - } - } - if (window.Quarto?.typesetMath) { - window.Quarto.typesetMath(container); - } - return container.innerHTML - } else { - if (window.Quarto?.typesetMath) { - window.Quarto.typesetMath(note); - } - return note.innerHTML; - } - } else { - // Remove any anchor links if they are present - const anchorLink = note.querySelector('a.anchorjs-link'); - if (anchorLink) { - anchorLink.remove(); - } - if (window.Quarto?.typesetMath) { - window.Quarto.typesetMath(note); - } - // TODO in 1.5, we should make sure this works without a callout special case - if (note.classList.contains("callout")) { - return note.outerHTML; - } else { - return note.innerHTML; - } - } - } - for (var i=0; i res.text()) - .then(html => { - const parser = new DOMParser(); - const htmlDoc = parser.parseFromString(html, "text/html"); - const note = htmlDoc.getElementById(id); - if (note !== null) { - const html = processXRef(id, note); - instance.setContent(html); - } 
- }).finally(() => { - instance.enable(); - instance.show(); - }); - } - } else { - // See if we can fetch a full url (with no hash to target) - // This is a special case and we should probably do some content thinning / targeting - fetch(url) - .then(res => res.text()) - .then(html => { - const parser = new DOMParser(); - const htmlDoc = parser.parseFromString(html, "text/html"); - const note = htmlDoc.querySelector('main.content'); - if (note !== null) { - // This should only happen for chapter cross references - // (since there is no id in the URL) - // remove the first header - if (note.children.length > 0 && note.children[0].tagName === "HEADER") { - note.children[0].remove(); - } - const html = processXRef(null, note); - instance.setContent(html); - } - }).finally(() => { - instance.enable(); - instance.show(); - }); - } - }, function(instance) { + return note.innerHTML; }); } let selectedAnnoteEl; @@ -681,7 +469,6 @@

    6. Conclusion

    } div.style.top = top - 2 + "px"; div.style.height = height + 4 + "px"; - div.style.left = 0; let gutterDiv = window.document.getElementById("code-annotation-line-highlight-gutter"); if (gutterDiv === null) { gutterDiv = window.document.createElement("div"); @@ -707,32 +494,6 @@

    6. Conclusion

    }); selectedAnnoteEl = undefined; }; - // Handle positioning of the toggle - window.addEventListener( - "resize", - throttle(() => { - elRect = undefined; - if (selectedAnnoteEl) { - selectCodeLines(selectedAnnoteEl); - } - }, 10) - ); - function throttle(fn, ms) { - let throttle = false; - let timer; - return (...args) => { - if(!throttle) { // first call gets through - fn.apply(this, args); - throttle = true; - } else { // all the others get throttled - if(timer) clearTimeout(timer); // cancel #2 - timer = setTimeout(() => { - fn.apply(this, args); - timer = throttle = false; - }, ms); - } - }; - } // Attach click handler to the DT const annoteDls = window.document.querySelectorAll('dt[data-target-cell]'); for (const annoteDlNode of annoteDls) { @@ -796,5 +557,4 @@

    6. Conclusion

    - \ No newline at end of file diff --git a/sitemap.xml b/sitemap.xml index 1d35aad..2f0592c 100644 --- a/sitemap.xml +++ b/sitemap.xml @@ -1,44 +1,50 @@ + + https://.github.io//ANNEX IV IT Principles.html + 2024-12-22T07:23:02Z + monthly + 0.8 + https://.github.io//CLMS_doc_example.html - 2024-11-12T15:41:23Z + 2024-12-22T07:23:02Z monthly 0.8 https://.github.io//CLMS_filenamingconvention.html - 2024-11-12T15:41:23Z + 2024-12-22T07:23:02Z monthly 0.8 https://.github.io//CheatSheet.html - 2024-11-12T15:41:23Z + 2024-12-22T07:23:02Z monthly 0.8 https://.github.io//README.html - 2024-11-12T15:41:23Z + 2024-12-22T07:23:02Z monthly 0.8 https://.github.io//clms.html - 2024-11-12T15:41:23Z + 2024-12-22T07:23:02Z monthly 0.8 https://.github.io//guidelines.html - 2024-11-12T15:41:23Z + 2024-12-22T07:23:02Z monthly 0.8 https://.github.io//test.html - 2024-11-12T15:41:23Z + 2024-12-22T07:23:02Z monthly 0.8