[FOLIO-2633] Jenkins builds broken when okapi-3 until ModuleDescriptors have permissionsRequired, then verify Created: 03/Jun/20  Updated: 02/Jul/20  Resolved: 01/Jul/20

Status: Closed
Project: FOLIO
Components: None
Affects versions: None
Fix versions: None

Type: Task Priority: P2
Reporter: David Crossley Assignee: David Crossley
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original estimate: Not Specified

Attachments: File okapi-snapshot-test-161-20200611.log.gz     File okapi-snapshot-test-177-20200622.log.gz    
Issue links:
Blocks
is blocked by MODGQL-124 Securing APIs by default Open
is blocked by FOLIO-2660 duplicate roles in folio-ansible and ... Closed
is blocked by MODCAT-200 Securing APIs by default Closed
is blocked by MODLOGSAML-60 Securing APIs by default Closed
is blocked by OKAPI-859 Fail to enable module if tenant API h... Closed
is blocked by ERM-851 Securing APIs by default Closed
is blocked by MODCXINV-45 Securing APIs by default Closed
is blocked by MODPERMS-85 Invalid CQL when encoding permission ... Closed
is blocked by MODUSERBL-88 Securing APIs by default Closed
Relates
relates to CIRC-783 Remove the fake permissions sets used... Open
relates to FOLIO-2567 Create following up module tickets af... Closed
relates to MODPERMS-86 totalRecords count incorrect Closed
relates to FOLIO-2665 folio-testing-backend Jenkins build b... Closed
relates to OKAPI-767 permissionsRequired required (securin... Closed
Sprint: DevOps: sprint 92, DevOps: sprint 90
Development Team: FOLIO DevOps

 Description   

When okapi-3.0.0 was recently released, the reference environment builds broke with errors of the following form:

Module 'mod-authtoken-2.5.0-SNAPSHOT.67' handler /token: Missing field permissionsRequired

That one was soon fixed, but then the breakage moved on to the next module.

Until modules have at least the default empty permissionsRequired array in their ModuleDescriptor, Okapi is pinned in folio-ansible to okapi-2.40.0 (pull/352).
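
To illustrate the kind of check that trips these builds, here is a minimal Python sketch that lists the handlers in a ModuleDescriptor which lack the permissionsRequired field. The function name and the sample descriptor are hypothetical and for illustration only; this is not Okapi's actual validation code. It assumes the usual ModuleDescriptor layout, with handlers listed under "provides".

```python
import json


def handlers_missing_permissions_required(descriptor):
    """Return the paths of handlers that lack a permissionsRequired key."""
    missing = []
    for interface in descriptor.get("provides", []):
        for handler in interface.get("handlers", []):
            if "permissionsRequired" not in handler:
                # An empty list is acceptable; only an absent field is reported.
                missing.append(handler.get("pathPattern") or handler.get("path"))
    return missing


# Hypothetical descriptor, for illustration only.
SAMPLE_DESCRIPTOR = json.loads("""
{
  "id": "mod-example-1.0.0",
  "provides": [
    {
      "id": "example",
      "version": "1.0",
      "handlers": [
        {"methods": ["POST"], "pathPattern": "/token"},
        {"methods": ["GET"], "pathPattern": "/items", "permissionsRequired": []}
      ]
    }
  ]
}
""")

if __name__ == "__main__":
    print(handlers_missing_permissions_required(SAMPLE_DESCRIPTOR))
```

In this sample the second handler passes because it carries the default empty array, which is the minimum this ticket asks modules to provide.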

Note: updating the ref envs to Okapi 3.0 is in scope of this ticket.

Update: 20200619: Those modules with missing permissionsRequired are fixed.
Unpinning Okapi to the current v3 now reveals other troubles with the reference environment builds, perhaps related to the Ansible playbooks.



 Comments   
Comment by David Crossley [ 04/Jun/20 ]

There is a branch of folio-ansible to return to okapi-3

The Jenkins job "folio-testing-test" can be used to verify when ready to unpin Okapi to be the current version.

Configure that job to set "Branches to build" to be "refs/heads/folio-2633-monitor-permsrequired".

Run the build and search the output for "Missing field permissionsRequired".
For example folio-testing-test/73

Return configuration to "*/master"
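
The search step above could be scripted roughly as follows. This is only a sketch: the regular expression is inferred from the single error line quoted in the Description, the function name is hypothetical, and the surrounding lines of the sample output are invented for illustration.

```python
import re

# Pattern inferred from the one error line quoted in this ticket.
MISSING_FIELD = re.compile(
    r"Module '(?P<module>[^']+)' handler (?P<handler>\S+): "
    r"Missing field permissionsRequired"
)


def find_missing_fields(build_output):
    """Return sorted unique (module, handler) pairs reported in the output."""
    found = {
        (match.group("module"), match.group("handler"))
        for match in MISSING_FIELD.finditer(build_output)
    }
    return sorted(found)


# Only the middle line comes from this ticket; the rest is made up.
SAMPLE_OUTPUT = """\
TASK [example task] ***
Module 'mod-authtoken-2.5.0-SNAPSHOT.67' handler /token: Missing field permissionsRequired
ok: [example-host]
"""

if __name__ == "__main__":
    for module, handler in find_missing_fields(SAMPLE_OUTPUT):
        print(module, handler)
```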

Comment by David Crossley [ 05/Jun/20 ]

As explained above, that Jenkins job shows modules that are completely missing permissionsRequired for some handlers.

There are four such listed:

mod-login-saml has MODLOGSAML-60 Closed (and has an open pull-request)

mod-codex-inventory has MODCXINV-45 Closed (and no pull-request yet)

mod-graphql

mod-marccat

The latter two have no ticket linked via FOLIO-2567 Closed .
Hongwei Ji or Adam Dickmeiss I do not have sufficient knowledge to advise, so would you please add a ticket for those (MODGQL and MODCAT)

Comment by Hongwei Ji [ 05/Jun/20 ]

We only scanned q1 modules. Those two were not part of q1. Do we know who maintains those two modules?

Comment by Hongwei Ji [ 05/Jun/20 ]

Added MODGQL-124 Open and MODCAT-200 Closed

Comment by David Crossley [ 11/Jun/20 ]

The MODLOGSAML-60 Closed and MODCXINV-45 Closed were done recently.

Today i added temporary empty permissionsRequired for MODGQL-124 Open and MODCAT-200 Closed to enable this FOLIO-2633 Closed to proceed.

Did another test run. See folio-snapshot-test 161

That run failed. Perhaps a different problem. I attached the okapi log. Would someone please investigate.

Comment by Marc Johnson [ 16/Jun/20 ]

Jakub Skoczen Adam Dickmeiss Given that this issue is still outstanding and the hosted environments are not running Okapi 3.x, does that mean that the official version of Okapi for 2020 Q2 will be Okapi 2.x?

Comment by Hongwei Ji [ 16/Jun/20 ]

I looked into the attached Okapi log and have an idea why it broke, so I opened OKAPI-859 Closed .

Comment by David Crossley [ 18/Jun/20 ]

Following today's Okapi v3.1.1 release, i did a new run of Jenkins build folio-snapshot-test/164

However it fails with this:

...
TASK [folio-ansible/roles/tenant-admin-permissions :
Get all permissionSets not included in other permissionSets excluding okapi] ***
ok: [10.36.1.116]

TASK [folio-ansible/roles/tenant-admin-permissions :
Fail if all permissions not retrieved] ***
fatal: [10.36.1.116]: FAILED! => {"changed": false, "msg":
"Retrieved permissions don't match total permissions count"}
...

Attached the okapi logs. – later deleted because not needed.

Comment by Hongwei Ji [ 18/Jun/20 ]

In Okapi3, we added a feature to automatically generate permission set for module permissions. Seems the ansible script in this line https://github.com/folio-org/folio-ansible/blob/85ad9988f3c4fd91cfec35ea515140e0a942f5d5/roles/tenant-admin-permissions/tasks/main.yml#L19 should be updated to exclude those permissions. The naming convention for those permissions is prefixing with "SYS#" by the way. Wayne Schneider and Ian Hardy, can you take a look? Thanks.

Comment by Marc Johnson [ 18/Jun/20 ]

Hongwei Ji

we added a feature to automatically generate permission set for module permissions

Does that mean that the manual permission sets maintained by modules like mod-circulation should be changed back to direct module permissions in the future?

Comment by Hongwei Ji [ 18/Jun/20 ]

Marc Johnson, those manual ones can be changed back to use direct ones but do not have to. Both ways should work because perm sets are expanded recursively.

Comment by Hongwei Ji [ 18/Jun/20 ]

David Crossley, Wayne Schneider and Ian Hardy I created a PR to address the perm count error: https://github.com/folio-org/folio-ansible/pull/360 but I cannot request reviewers due to permission setup for that repo.

Comment by David Crossley [ 19/Jun/20 ]

Thanks Hongwei. I merged that and followed with test Jenkins builds:

folio-snapshot-test/170

folio-testing-test/76

They failed at different ansible tasks.

However i am out of time today to investigate further.

Comment by Marc Johnson [ 19/Jun/20 ]

Hongwei Ji

those manual ones can be changed back to use direct ones but do not have to. Both ways should work because perm sets are expanded recursively.

Thanks

Comment by David Crossley [ 22/Jun/20 ]

As explained in this ticket Description, the default permissionsRequired was added to those other modules so that we could proceed with this, and attempt to unpin Okapi version.

There are still permissions issues with the build. Hongwei and i have tried various changes over the weekend and today, in this folio-ansible branch.

Built the folio-snapshot-test again today. See folio-snapshot-test/177
but it fails.

It is using the current okapi-3.1.1 release.

See the attached portion of log at okapi-snapshot-test-177-20200622.log (i have more if needed).

There are many errors of the following form (but not sure if that is the actual problem):

2020-06-22T04:58:44,741 INFO  DockerModuleHandle   mod-permissions-5.12.0-SNAPSHOT.80 22 Jun 2020 04:58:44:739
ERROR PermsAPI [499977eqId] Error attempting to update permissions metadata:
org.folio.cql2pgjson.exception.QueryValidationException:
org.z3950.zing.cql.CQLParseException: expected boolean, got '/'

Our most recent change to folio-ansible was to extend the "length" of the permission query to "Get all permissionSets" from 500 to 2000. That was for the linked Jenkins run 177. Extended that again to 3000 and ran again for Jenkins build 178, but still failed.

Comment by Adam Dickmeiss [ 22/Jun/20 ]

This latest issue is going to be fixed with MODPERMS-85 Closed

Comment by Jakub Skoczen [ 22/Jun/20 ]

Adam Dickmeiss can you let Wayne Schneider know when the new mod-permissions release is ready? He will try to give it a go.

Comment by David Crossley [ 22/Jun/20 ]

Adam Dickmeiss we just need the fix in master to enable the test run again.

Comment by David Crossley [ 22/Jun/20 ]

The Jenkins builds folio-snapshot-test and folio-testing-test are currently configured to use "refs/heads/folio-2633-monitor-permsrequired-3" of folio-infrastructure.

So just press the button.

Comment by Adam Dickmeiss [ 23/Jun/20 ]

started run https://jenkins-aws.indexdata.com/job/Automation/job/folio-snapshot-test/ , now that MODPERMS-85 Closed is done

Comment by David Crossley [ 24/Jun/20 ]

Following the fix of MODPERMS-85 Closed yesterday, the run of folio-testing-test 83 is successful with okapi-3.1.1

Today's folio-snapshot is broken by something unrelated, so cannot test there.

Not yet organised the final PR for folio-ansible to unpin okapi.

Comment by David Crossley [ 25/Jun/20 ]

Hmm, so for the last two days i have been trying to get a clean run of folio-snapshot-test with okapi v3, before trying to set okapi to v3 for all reference environment builds.

No.

It still feels like a folio-ansible/Jenkins problem.

See today folio-snapshot-test/184 which fails twice in the one run.

and yesterday folio-snapshot-test/183 which fails in a slightly different way.

For good measure i ran folio-testing-test again today (#84) with okapi v3, and again it is happy.

So calling on Wayne Schneider or Ian Hardy for assistance.

Comment by Jakub Skoczen [ 25/Jun/20 ]

David Crossley Wayne Schneider Guys, can we try to investigate what is going on?

Comment by Wayne Schneider [ 25/Jun/20 ]

It appears that the CQL query (childOf==[] not permissionName=okapi.*) to /perms/permissions is either returning unreliable results or an unreliable totalRecords key. I haven't been able to reproduce the issue outside of the CI environment, which makes it hard to track down.

We have added some additional debugging information in the Ansible error message, and we updated the query to exclude the SYS#* permissions. Making a test run now...

Comment by Ian Hardy [ 25/Jun/20 ]

I'm getting the right number in totalRecords with the CQL query, but without it the behaviour is weird. There are 1659 permissions on snapshot right now, but totalRecords is reported as 1000 as long as the length is specified under 1000 (or the default length param). When you set length between 1001 and 1659, you get whatever you set for length back as totalRecords, and when you exceed 1659 you get that (the real total) for totalRecords:

$ http "https://folio-snapshot-okapi.aws.indexdata.com/perms/permissions" "x-okapi-token:$stoken" "x-okapi-tenant:diku" | jq '.totalRecords'
1000
$ http "https://folio-snapshot-okapi.aws.indexdata.com/perms/permissions?length=2" "x-okapi-token:$stoken" "x-okapi-tenant:diku" | jq '.totalRecords'
1000
$ http "https://folio-snapshot-okapi.aws.indexdata.com/perms/permissions?length=1200" "x-okapi-token:$stoken" "x-okapi-tenant:diku" | jq '.totalRecords'
1200
$ http "https://folio-snapshot-okapi.aws.indexdata.com/perms/permissions?length=2000" "x-okapi-token:$stoken" "x-okapi-tenant:diku" | jq '.totalRecords'
1659

Comment by Hongwei Ji [ 25/Jun/20 ]

We observed the count difference as well. It was discussed in https://folio-org.atlassian.net/browse/MODPERMS-86

Comment by Julian Ladisch [ 25/Jun/20 ]

totalRecords is an estimation based on PostgreSQL's query planner statistics.
For details see https://github.com/folio-org/raml-module-builder#estimated-totalrecords
Either get chunks, for example with chunk size 500, as described on https://github.com/folio-org/raml-module-builder#implement-chunked-bulk-download , until there are no more records.
Or use a high length to get all in one go, for example length=10000.
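
As a rough illustration of the first option, the following Python sketch collects records chunk by chunk and stops when a chunk comes back short, so it never relies on totalRecords. It is a simplified offset-based variant, not the approach described in the linked RAML Module Builder section, which may differ. The fetch_page callable and the fake backend are hypothetical stand-ins for a real request to /perms/permissions.

```python
def fetch_all(fetch_page, chunk_size=500):
    """Collect records chunk by chunk until a short or empty chunk arrives."""
    records = []
    offset = 0
    while True:
        page = fetch_page(offset, chunk_size)
        records.extend(page)
        if len(page) < chunk_size:
            # A short chunk means there is nothing left to fetch.
            return records
        offset += chunk_size


def make_fake_backend(total):
    """Hypothetical in-memory stand-in for the permissions endpoint."""
    data = [f"perm-{i}" for i in range(total)]

    def fetch_page(offset, length):
        return data[offset:offset + length]

    return fetch_page


if __name__ == "__main__":
    # 1659 is the permission count reported on snapshot in this ticket.
    print(len(fetch_all(make_fake_backend(1659))))
```

Stopping on a short chunk costs one extra request when the total is an exact multiple of the chunk size, which seems an acceptable price for not depending on an estimated count.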

Comment by Wayne Schneider [ 25/Jun/20 ]

Thanks, Hongwei Ji and Julian Ladisch. I think the issue with the apparently mismatched counts may be a red herring, as we are seeing what appears to be an unrelated failure in the last few builds.

Comment by Wayne Schneider [ 25/Jun/20 ]

At this point I'm stumped. The build is failing, not always in the same place, with the error:

FATAL: command execution failed
hudson.AbortException: Ansible playbook execution failed
	at org.jenkinsci.plugins.ansible.AnsiblePlaybookBuilder.perform(AnsiblePlaybookBuilder.java:262)
	at org.jenkinsci.plugins.ansible.workflow.AnsiblePlaybookStep$AnsiblePlaybookExecution.run(AnsiblePlaybookStep.java:400)
	at org.jenkinsci.plugins.ansible.workflow.AnsiblePlaybookStep$AnsiblePlaybookExecution.run(AnsiblePlaybookStep.java:321)
	at org.jenkinsci.plugins.workflow.steps.AbstractSynchronousNonBlockingStepExecution$1$1.call(AbstractSynchronousNonBlockingStepExecution.java:47)
	at hudson.security.ACL.impersonate(ACL.java:367)
	at org.jenkinsci.plugins.workflow.steps.AbstractSynchronousNonBlockingStepExecution$1.run(AbstractSynchronousNonBlockingStepExecution.java:44)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

I would be tempted to try to run the build on the master node, to rule out something strange with the slave container. Unfortunately, the master node seems a little strapped for RAM. I suspect that we can't run from the packet.io slave because it won't have ssh access to the snapshot system.

Comment by Hongwei Ji [ 26/Jun/20 ]

Not sure if you have seen https://issues.jenkins-ci.org/browse/JENKINS-54557. If the issue is similar, some comments (dirty fix) in that ticket might be helpful.

Comment by David Crossley [ 26/Jun/20 ]

Thanks Hongwei. I did add one of their suggestions: to sleep during Jenkins cleanup to enable ansible to report its errors. However it seems to not be any more informative than previous builds, e.g. comparing today's #191 with the #184.

Comment by David Crossley [ 26/Jun/20 ]

I do notice some other weirdness. For this branch of folio-ansible, Wayne improved the name of task "tenant-admin-permissions : Get all permissionSets not included in other permissionSets excluding okapi" and the failure message for the task "tenant-admin-permissions : Fail if all permissions not retrieved" to be more informative about this counts thing.

However those improved messages are not shown – still has the old messages. See folio-snapshot-test/191

This branch of folio-ansible does unpin the version of okapi, which i verified again today by inspecting the okapi.log file. So Jenkins must be using that branch, but it seems to be an old version. (Head hurts.)

Update: However the folio-testing-test build (which is successful) does show those updated messages, e.g. folio-testing-test/86

Comment by Julian Ladisch [ 26/Jun/20 ]

The CQL query (childOf==[] not permissionName=okapi.* not permissionName=SYS#*) should not use the = operator that is word matching ignoring punctuation. It should use the == operator that matches the complete field including punctuation:

(childOf==[] not permissionName==okapi.* not permissionName==SYS#*)

permissionName=SYS#* is the same as permissionName="SYS *" is the same as permissionName=SYS and this matches my-module.foo.sys.bar.read.

Details about CQL string matching: https://dev.folio.org/faqs/explain-cql/

The length parameter should be increased from length=500 to length=5000.
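
For anyone scripting this request, here is a small Python sketch of how the corrected query and the larger length might be assembled into a URL. The base URL and function name are hypothetical. The sketch percent-encodes the query because an unencoded "#" in a URL starts the fragment, which would cut the query off at "SYS".

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# The corrected query from this comment, using == for exact matching.
CQL_QUERY = "(childOf==[] not permissionName==okapi.* not permissionName==SYS#*)"


def permissions_url(base_url, length=5000):
    """Build a /perms/permissions URL with the query safely percent-encoded."""
    params = urlencode({"length": length, "query": CQL_QUERY})
    return f"{base_url}/perms/permissions?{params}"


if __name__ == "__main__":
    # Hypothetical Okapi address, for illustration only.
    url = permissions_url("https://okapi.example.org")
    print(url)
    # Decoding the query string gives back the original CQL unchanged.
    print(parse_qs(urlsplit(url).query)["query"][0])
```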

Comment by Wayne Schneider [ 26/Jun/20 ]

Thanks, Julian Ladisch, I've made those updates to the query. And Hongwei Ji, I think adding that little bit of sleep in case of failure does help ensure that we get the full log, thank you!

David and I think that we may have a clue as to what is going on – Jenkins seems to have hold of a different commit of folio-ansible in the Ansible roles_path. There may be some mitigation possible.

Comment by Wayne Schneider [ 26/Jun/20 ]

OK, we figured it out, I think.

  • There are two versions of the tenant-admin-permissions role in folio-infrastructure. One is local to that repository, one comes from the folio-ansible submodule.
  • The local version was not updated with the improved CQL query (increasing the result set length and excluding the SYS# permissions), so it was not getting all the permissions and the Ansible task was failing as designed.

The reason there are two versions of the role (and several other roles) is because at some point we could not figure out how to set the roles path for Ansible on Jenkins, so we just worked around it. We now know how to do that: create an Ansible configuration file with a roles_path default and use the ANSIBLE_CONFIG environment variable to point to the config file. Then we can remove all the local copies of the roles.

The environment variable update needs to be made in all the Jenkins jobs that use the Jenkins Ansible plugin.
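
The approach described above could look roughly like the following Python sketch, which renders a minimal Ansible configuration file with a roles_path default and prepares an environment with ANSIBLE_CONFIG pointing at it. The function names and the paths are hypothetical; the real values depend on how each Jenkins job checks out folio-ansible.

```python
import configparser
import io
import os


def render_ansible_cfg(roles_path):
    """Return the text of an ansible.cfg that sets a default roles_path."""
    config = configparser.ConfigParser()
    config["defaults"] = {"roles_path": roles_path}
    buffer = io.StringIO()
    config.write(buffer)
    return buffer.getvalue()


def job_environment(config_file, base_env=None):
    """Return a copy of the environment with ANSIBLE_CONFIG set."""
    env = dict(os.environ if base_env is None else base_env)
    env["ANSIBLE_CONFIG"] = config_file
    return env


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    print(render_ansible_cfg("folio-ansible/roles"))
    print(job_environment("/workspace/ansible.cfg", {})["ANSIBLE_CONFIG"])
```

With one shared configuration file, the local copies of the roles are no longer needed, which removes the risk of the two versions drifting apart again.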

Comment by David Crossley [ 29/Jun/20 ]

Excellent, Wayne. And thanks to everyone involved.

We will discuss this at today's (Monday) DevOps meeting.

Comment by David Crossley [ 29/Jun/20 ]

Blocked this ticket on FOLIO-2660 Closed to address the duplicate roles in folio-ansible and folio-infrastructure.

Update: That was handled in the context of this FOLIO-2633 Closed .

Comment by David Crossley [ 30/Jun/20 ]

The branches were merged today: folio-ansible/pull/362 and folio-infrastructure/pull/206

The subsequent reference environment builds related to folio-snapshot are successful.

Comment by David Crossley [ 30/Jun/20 ]

However, something is amiss with only folio-testing-backend FOLIO-2665 Closed .

Comment by David Crossley [ 02/Jul/20 ]

The final piece of the new system permissions puzzle was solved with FOLIO-2665 Closed .

Generated at Thu Feb 08 23:22:06 UTC 2024 using Jira 1001.0.0-SNAPSHOT#100246-sha1:7a5c50119eb0633d306e14180817ddef5e80c75d.