Improving FOLIO's SDLC

This is a working document on improving FOLIO’s software development lifecycle (SDLC). Several blockers force the current state of FOLIO’s SDLC. Removing these blockers will help FOLIO guarantee quality releases and improve development speed.

NB: In the table below, “integration tests” is synonymous with Karate tests. The solutions listed are not the only ones possible.


Category

Problem

Solution(s)

Effort

Build

  • The development environment is unwieldy. A single Vagrant box with all modules is the only viable option. The Vagrant box requires at least 24 GB of RAM and quite a bit of CPU.

    • ANECDOTE: “Takes 10 minutes to log into FOLIO in the morning”

    • Developer experience is poor, so developer productivity is limited. This in turn limits feature development speed.

    • Creates a high bar for contributions to the FOLIO project

    • Devs can’t execute integration tests locally.

      • Due to tight coupling of FOLIO Modules, integration tests are more effective than unit tests isolated to one module.

      • This means features being developed locally cannot truly be verified to pass smoke testing until the feature is deployed in a FOLIO-hosted environment.

Develop a resource-efficient dev environment. The Busybee environment uses 10 GB of RAM instead of more than 24 GB. It is currently under trial with Folijet and Thunderjet.

S

Test

  • Everyone owns the integration tests, so in practice no one owns them.

    • Karate tests have not evolved much since their introduction.

    • Karate tests cannot handle feature toggles appropriately.

    • Basic functionality like debugging in an IDE is locked behind a paywall by the creators of the Karate test framework. Source.

    • Test standards vary team by team.

      • Incorrect scenario definitions

      • Tests dependent on other tests.

      • Magic variables and functions with hard-to-decipher origin.

      • Very long feature files.

      • Hard to read tests

        • Large JSON objects are not externalized.

    • The value of the TestRail integration seems questionable.

    • No facilities exist to mark tests as flaky.

    • Tests should have the ability to accept credentials to an existing tenant, create required objects and perform assertions.

    • ANECDOTE: After completion of the file-splitting feature in Data Import, the integration tests for Data Import were down for at least a month.

Create a testing SIG or delegate to a similar group
Create standards for Karate tests
Investigate the viability of the Karate test framework
Monitor the overall health of integration tests
Investigate cross-cutting concerns for integration tests

L

Test

  • Integration tests do not attain 100% pass rate before a flower release.

    • A pass-rate threshold may currently be used to move forward with a release, but it does not consider the criticality of the failed tests. This makes integration tests a porous quality gate.

      • Since many test scenarios are not appropriately defined, using a threshold seems absurd.

Require a 100% pass rate at many points during development and release

M
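
The difference between a pass-rate threshold and a strict gate can be sketched as follows. This is a minimal illustration only: the `ScenarioResult` type and its `critical` flag are hypothetical, and the current Karate suites are not known to carry any such label.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScenarioResult:
    name: str
    passed: bool
    critical: bool  # hypothetical label, assumed for illustration


def threshold_gate(results, min_pass_rate):
    """Pass when the share of passing scenarios meets the threshold."""
    if not results:
        return False
    passed = sum(1 for r in results if r.passed)
    return passed / len(results) >= min_pass_rate


def strict_gate(results):
    """Pass only when every scenario passes.

    Also returns the failures, with critical ones listed first.
    """
    failures = sorted(
        (r for r in results if not r.passed),
        key=lambda r: (not r.critical, r.name),
    )
    return bool(results) and not failures, failures


# 19 passing scenarios and one failing, critical scenario.
results = [ScenarioResult(f"scenario-{i}", True, False) for i in range(19)]
results.append(ScenarioResult("login", False, True))

threshold_ok = threshold_gate(results, 0.9)  # 19 of 20 pass
strict_ok, failures = strict_gate(results)
```

With these inputs the threshold gate lets the build through even though the only failing scenario is a critical one, while the strict gate blocks it and reports that failure.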

Test

  • Global environment for executing integration tests can be flaky

    • This reduces trust in integration tests as a quality gate, since developers cannot execute integration tests anywhere else.

Engineering collaborates with Kitfox to resolve stability issues

M

Test

  • Because FOLIO is a distributed monolith, multiple modules are required to tell the story of a user action. That is why integration/E2E tests are more effective at finding issues than unit tests. The SDLC should push towards being able to verify modules independently.

    • Kafka-heavy modules do not have an appropriate testing harness or methodology.

    • Current contracts between modules are not enough to mock or specify behaviour. Often the real module is required to reproduce the full behaviour.

    • The best class of tests (currently integration tests) can’t be included in pre-checks before a PR is merged.

Propose a methodology for testing modules with heavy Kafka usage

M

Test

  • How do we test whether a data migration has been successful?

 

L

Test

  • Performance tests are not concluded before a flower release of FOLIO.

    • This implicitly means any issues found will be fixed in a patch. This distracts from feature development, and the context switch reduces developer productivity.

Require completion of performance tests before a flower release

S

Test

  • Performance tests are only executed against released versions of FOLIO modules

    • This reduces the time available to execute performance tests.

    • This is done because non-released versions of modules have no guarantee of functional viability. It would be a waste of time to test otherwise.

Provide a build of FOLIO with a 100% pass rate on integration and E2E tests.

S

Test

  • After many weeks of testing, it takes less than 48 hours for issues to be found during upgrades of FOLIO.

    • This means the testing scope needs to expand.

We need to bring production scenarios into the SDLC. This could be different on a per module basis.

For Data Import, keep a repository of job profiles from production systems. Only the “shape” needs to be stored; actual production reference data is not required. More details incoming.

XL

Test

  • There is no contract testing or API compatibility testing between modules

 

L
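
One possible shape for such a check is sketched below, assuming a consumer-driven approach in which the consuming module declares the fields it reads and the provider's response is verified against that declaration. The field names and the `contract_violations` helper are hypothetical and do not describe any existing FOLIO interface.

```python
# Hypothetical contract: the fields a consuming module reads from a
# provider's response, with the type it expects for each.
CONSUMER_CONTRACT = {"id": str, "status": str, "recordsCount": int}


def contract_violations(contract, response):
    """Return a list of human-readable violations; empty means compatible."""
    violations = []
    for field, expected_type in contract.items():
        if field not in response:
            violations.append(f"missing field: {field}")
        elif not isinstance(response[field], expected_type):
            violations.append(
                f"wrong type for {field}: expected {expected_type.__name__}, "
                f"got {type(response[field]).__name__}"
            )
    return violations


# A provider response that still satisfies the consumer. Extra fields are fine.
compatible = {"id": "42", "status": "COMMITTED", "recordsCount": 10, "extra": True}

# A provider change that renames one field and changes the type of another.
breaking = {"id": "42", "state": "COMMITTED", "recordsCount": "10"}

ok_report = contract_violations(CONSUMER_CONTRACT, compatible)
broken_report = contract_violations(CONSUMER_CONTRACT, breaking)
```

A check of this kind could run in the provider's own build, so an incompatible change is caught before a PR is merged rather than in a shared environment.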

Deploy

  • Environments such as Snapshot, Rancher, and Bugfest are integral to FOLIO's Software Development Life Cycle (SDLC) at present. However, they serve as temporary supports that conceal fundamental problems. Over time, it has become evident that their capacity to obscure these underlying issues is restricted. Examples:

    • Needing to spin up different flavors of Bugfest to test ECS and non-ECS, which adds operational overhead.

    • Inability of snapshot to test different states of feature toggles.

    • If three libraries were to contribute three dev teams, would each team get a Rancher environment in FOLIO’s limited infrastructure?

    • Somebody will change the language of the admin user in Bugfest to Mandarin.

      • Slack messages are sent to remedy the issue, until the next time it happens.

    • FOLIO dev teams have to share a perf rancher environment.

      • Three new dev teams joining FOLIO would have to be added to the queue, causing more delay.

      • It is difficult to plan a sprint if performance tests are part of the definition of done and access to the perf Rancher environment is not guaranteed.

    • To develop integration tests, developers have to connect to an environment like snapshot.

      • Requires an internet connection to develop.

      • Superuser credentials are stored in plaintext in a public source code repository for the integration tests.

      • A feature has to be in the master branch before integration tests can be developed for it.

        • Since new code is already in master, it might be too late.

        • This limits how Definition of Done can be implemented for a feature.

If we take away these environments, what is left of the SDLC? How can we function without them? Can we turn them into a nice-to-have rather than a mandatory dependency?

L
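
One way to keep credentials out of the repository, and to let the same tests target a local or an existing tenant, is to read the connection settings from the environment at run time. The sketch below assumes hypothetical variable names; none of them are existing FOLIO conventions.

```python
import os

# Hypothetical setting names, assumed for illustration.
REQUIRED_SETTINGS = (
    "FOLIO_BASE_URL",
    "FOLIO_TENANT",
    "FOLIO_USERNAME",
    "FOLIO_PASSWORD",
)


def load_test_target(env=None):
    """Read the test target from the environment, failing fast on gaps."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_SETTINGS if not env.get(name)]
    if missing:
        raise RuntimeError("missing test settings: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_SETTINGS}


# Example with an explicit mapping standing in for the process environment.
example_env = {
    "FOLIO_BASE_URL": "http://localhost:9130",
    "FOLIO_TENANT": "example_tenant",
    "FOLIO_USERNAME": "example_user",
    "FOLIO_PASSWORD": "example_password",
}
target = load_test_target(example_env)
```

Because the settings are injected rather than committed, the same test code could point at a developer's local environment or at a shared one without any change to the repository.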

Deploy

  • FOLIO is a distributed system that is hard to troubleshoot. A correlation identifier should be required in all requests, but this is not enforced. It is clearly not a priority, since OKAPI does not return a correlation identifier with every response for troubleshooting.

    • Troubleshooting issues on FOLIO can take a while, eating into developer productivity.

Introduce standards on modules for inclusion of a correlation identifier
Have OKAPI return a correlation identifier for every request

M
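
The behaviour being asked for can be sketched as a thin wrapper around a request handler: reuse the caller's correlation identifier if one was sent, mint one otherwise, and always echo it on the response. This is an illustration under assumptions, not OKAPI's implementation; the header name and the `handle` and `ensure_correlation_id` helpers are chosen for the example.

```python
import uuid

# Assumed header name for illustration; a real standard would fix this.
CORRELATION_HEADER = "X-Okapi-Request-Id"


def ensure_correlation_id(request_headers):
    """Return the incoming correlation ID, or generate one if absent."""
    for name, value in request_headers.items():
        # HTTP header names are case-insensitive.
        if name.lower() == CORRELATION_HEADER.lower() and value:
            return value
    return str(uuid.uuid4())


def handle(request_headers, handler):
    """Call the handler and attach the correlation ID to its response.

    `handler` takes the correlation ID and returns (status, headers, body).
    """
    correlation_id = ensure_correlation_id(request_headers)
    status, response_headers, body = handler(correlation_id)
    response_headers = dict(response_headers)
    response_headers[CORRELATION_HEADER] = correlation_id
    return status, response_headers, body


def echo_handler(correlation_id):
    # A stand-in module that would log and forward the ID downstream.
    return 200, {}, f"handled {correlation_id}"


status, headers, body = handle({"x-okapi-request-id": "abc-123"}, echo_handler)
```

Passing the identifier into the handler matters as much as returning it: each module can then include it in its logs and in any downstream calls, so a single request can be traced across modules.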

Release

  • Tight coupling of FOLIO modules forces monolithic releases

    • It is becoming difficult to release FOLIO three times a year.

Start having monthly releases. We could call them sanity/prerelease/alpha. They would not result in official artifacts, but could result in a deployment to Bugfest. This would allow us to optimize the steps needed for a release and to provide interim builds that can be tested. Eventually, these mini releases could be converted to full-blown releases.

L

Release

  • Activities required for a flower release take a lot of effort

    • This is a contributing factor to why FOLIO cannot release often.

Create a release SIG or designate an existing group
Introduce enough release tooling and repo standards such that one person can perform a FOLIO release.

L

Release

  • Comprehensive dashboards containing code quality status, integration test status, E2E test status and other metrics required to define a good release are not available to the community at large.

    • We should be able to see Karate test status and E2E test status in one view.

Provide a dashboard for release build health

S