[FOLIO-3103] Restore mod-search/ui-inventory-es on ref envs Created: 05/Apr/21 Updated: 23/Apr/21 Resolved: 23/Apr/21 |
|
| Status: | Closed |
| Project: | FOLIO |
| Components: | None |
| Affects versions: | None |
| Fix versions: | None |
| Type: | Task | Priority: | P2 |
| Reporter: | Bohdan Suprun (Inactive) | Assignee: | John Malconian |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original estimate: | Not Specified | ||
| Attachments: |
|
||||||||||||
| Issue links: |
|
||||||||||||
| Sprint: | DevOps Sprint 111, DevOps Sprint 112 | ||||||||||||
| Development Team: | FOLIO DevOps | ||||||||||||
| Description |
|
The two modules was removed from snapshot deploy because of
Once it is fixed we're able to restore the modules. |
| Comments |
| Comment by Bohdan Suprun (Inactive) [ 07/Apr/21 ] |
|
Hi Ian Hardy,
|
| Comment by John Malconian [ 08/Apr/21 ] |
|
We continue to see the same failure after trying the latest snapshot version of mod-search committed to master (Build 66). fatal: [10.36.1.146]: FAILED! => {"changed": false, "connection": "close", "content": "POST request for mod-search-1.3.0-SNAPSHOT.66 /_/tenant failed with 500: {\"errors\":[{\"message\":\"Failed to perform elasticsearch request [index=instance_diku, type=indexExists, message: elasticsearch: System error]\",\"type\":\"SearchOperationException\",\"code\":\"elasticsearch_error\"}],\"total_records\":1}", "content_length": "298", "content_type": "text/plain", "elapsed": 160, "msg": "Status code was 400 and not [200]: HTTP Error 400: Bad Request", "redirected": false, "status": 400, "url": "http://10.36.1.146:9130/_/proxy/tenants/diku/install?deploy=true&tenantParameters=loadSample%3Dtrue%2CloadReference%3Dtrue", "vary": "origin"}
Attaching complete Okapi logs to this issue. I'm afraid I will need to remove inventory-es and mod-search again from the snapshot builds. |
| Comment by Bohdan Suprun (Inactive) [ 08/Apr/21 ] |
|
Am I correct understand that this is deployment descriptor is being used? So the descriptor.env is the list of ENV variables being passed to the container? changed: [10.36.1.146] => (item={'redirected': False, 'url': 'http://10.36.1.146:9130/_/discovery/modules/mod-search-1.3.0-SNAPSHOT.66', 'status': 404, 'vary': 'origin', 'content_type': 'text/plain', 'connection': 'close', 'content_length': '28', 'msg': 'HTTP Error 404: Not Found', 'elapsed': 0, 'changed': False, 'invocation': {'module_args': {'url': 'http://10.36.1.146:9130/_/discovery/modules/mod-search-1.3.0-SNAPSHOT.66', 'headers': {'X-Okapi-Tenant': 'supertenant', 'X-Okapi-Token': '', 'Accept': 'application/json'}, 'status_code': ['200', ' 404'], 'force': False, 'http_agent': 'ansible-httpget', 'use_proxy': True, 'validate_certs': True, 'force_basic_auth': False, 'body_format': 'raw', 'method': 'GET', 'return_content': False, 'follow_redirects': 'safe', 'timeout': 30, 'follow': False, 'url_username': None, 'url_password': None, 'client_cert': None, 'client_key': None, 'dest': None, 'body': None, 'src': None, 'creates': None, 'removes': None, 'unix_socket': None, 'mode': None, 'owner': None, 'group': None, 'seuser': None, 'serole': None, 'selevel': None, 'setype': None, 'attributes': None, 'content': None, 'backup': None, 'remote_src': None, 'regexp': None, 'delimiter': None, 'directory_mode': None, 'unsafe_writes': None}}, 'failed': False, 'item': {'key': 'mod-search', 'value': {'srvcId': 'mod-search-1.3.0-SNAPSHOT.66', 'nodeId': '10.36.1.146', 'descriptor': {'dockerImage': 'folioci/mod-search:1.3.0-SNAPSHOT.66', 'dockerPull': True, 'env': [{'name': 'JAVA_OPTIONS', 'value': '-XX:MaxRAMPercentage=66.0'}, {'name': 'KAFKA_HOST', 'value': 'kafka'}, {'name': 'KAFKA_PORT', 'value': '9092'}, {'name': 'ELASTICSEARCH_HOST', 'value': 'elasticsearch'}, {'name': 'ELASTICSEARCH_PORT', 'value': '9200'}, {'name': 'ELASTICSEARCH_URL', 'value': 'http://elasticsearch:9200'}, {'name': 'ELASTICSEARCH_USERNAME', 'value': 'elastic'}, {'name': 'ELASTICSEARCH_PASSWORD', 'value': 's3cret'}, {'name': 'DB_HOST', 'value': 'postgres'}, {'name': 'DB_PORT', 'value': '5432'}, {'name': 'DB_USERNAME', 'value': 'folio_admin'}, {'name': 'DB_PASSWORD', 'value': 'folio_admin'}, {'name': 'DB_DATABASE', 'value': 'okapi_modules'}, {'name': 'DB_QUERYTIMEOUT', 'value': '60000'}, {'name': 'DB_CHARSET', 'value': 'UTF-8'}, {'name': 'DB_MAXPOOLSIZE', 'value': '5'}, {'name': 'INITIAL_LANGUAGES', 'value': 'eng'}, {'name': 'SYSTEM_USER_PASSWORD', 'value': 'Mod-search-1-0-0'}], 'dockerArgs': {'HostConfig': {'Memory': 536870912, 'PortBindings': {'8081/tcp': [{'HostPort': '%p'}]}}}}}}, 'ansible_loop_var': 'item'})
If yes than could you make sure following:
The tenant init error suggests that mod-search event unable to connect to elasticsearch to check if index exists. |
| Comment by Bohdan Suprun (Inactive) [ 09/Apr/21 ] |
|
Ok, if my assumptions are correct and deployment tool copies env variables from module descriptor as they are, than the quickest fix is to leave only those variables that will be overridden by OKAPI: DB connection properties and ES host and port and remove all other properties. That's really surprising that we use sample values for the variables during the real deployment |
| Comment by John Malconian [ 09/Apr/21 ] |
|
I didn't set this up originally so let me investigate this. thanks. |
| Comment by Bohdan Suprun (Inactive) [ 13/Apr/21 ] |
|
Hi John Malconian, Have you had a chance to investigate it? We have https://github.com/folio-org/mod-search/pull/70 that should fix it, please review and approve if it looks fine. |
| Comment by John Malconian [ 16/Apr/21 ] |
|
Bohdan Suprun - I'm still running into problems initializing the module for the tenant. tenant initialization seems to result in an endless loop of log messages. I've attached a snippet of the log file (mod-search.log.gz) The following environment variables are passed to the module when the module is deployed (some are global): "KAFKA_HOST=10.36.1.77", ELASTICSEARCH_USERNAME and ELASTICSEARCH_PASSWORD are not defined. SYSTEM_USER_PASSWORD is not defined as well. I also verified that http://10.36.1.77:9301 (elasticsearch) is reachable.
ubuntu@ip-10-36-1-77:~$ curl http://10.36.1.77:9301
{
"name" : "d4d862e05e63",
"cluster_name" : "docker-cluster",
"cluster_uuid" : "RrE6TcMkT3i4y9hV_HXFhg",
"version" : {
"number" : "7.10.1",
"build_flavor" : "default",
"build_type" : "docker",
"build_hash" : "1c34507e66d7db1211f66f3513706fdf548736aa",
"build_date" : "2020-12-05T01:00:33.671820Z",
"build_snapshot" : false,
"lucene_version" : "8.7.0",
"minimum_wire_compatibility_version" : "6.8.0",
"minimum_index_compatibility_version" : "6.0.0-beta1"
},
"tagline" : "You Know, for Search"
}
Any ideas? |
| Comment by Bohdan Suprun (Inactive) [ 16/Apr/21 ] |
|
Hi John Malconian, That's something that expected if post tenant not invoked yet: org.folio.search.exception.TenantNotInitializedException: Following tenants might not be initialized yet: [diku] After installing the module these errors should disappear. |
| Comment by Marc Johnson [ 16/Apr/21 ] |
If it is expected and is not an issue, why is an exception thrown? |
| Comment by Bohdan Suprun (Inactive) [ 16/Apr/21 ] |
|
What should happen in following case then:
I don't think this is an issue and logging the exception is fine. |
| Comment by Marc Johnson [ 16/Apr/21 ] |
|
Maybe I misunderstood the situation.
I would consider your example to be somewhat exceptional as no HTTP requests should be made to a module directly only via Okapi and it knows which module versions are enabled for a given tenant, in which case this is either a bug in Okapi or a user is doing something that they shouldn't. Arguably, it shouldn't be an exception or a 500, rather a 400, as it could be considered a mistake on the part of the client to make a request to the module for a tenant that it is not responsible for or aware of. My understanding of the scenario that John Malconian was describing is somewhat different. Does this error occur when a message is read from the log for a tenant that mod-search is not enabled for or that it is consuming from a tenant specific topic for a tenant it is not enabled for?
I was only trying to understand the reasoning. Apologies if that caused offence :-/ I get the sense that I shouldn't have asked the question. By all means, ignore it and I won't follow up on it further. Apologies :-/ |
| Comment by Bohdan Suprun (Inactive) [ 16/Apr/21 ] |
|
No problem at all. We don't have separate topics for tenants for now. Issue occurs when mod-search tries to consume messages for tenant for which it is not enabled yet. The error is not handled, just to let Kafka send message again so that we're guaranteeing that it is handled when the tenant is enabled (or another temporary issue is resolved - e.g. if inventory unavailable for a moment - we use /inventory-view/instances API to get latest representation of instance/item/holding). Of course we can handle the exception but in this case some messages will be lost when a temporary failure occurred. We also had a spike to make it a little bit clearer, and now if we recognize that tenant is not initialized yet we wrap the cause with a TenantNotInitializedException. |
| Comment by John Malconian [ 16/Apr/21 ] |
|
For record keeping: mod-search deployed with the following options. ok: [10.36.1.41] => (item={'key': 'mod-search', 'value': {'srvcId': 'mod-search-1.3.0-SNAPSHOT.72', 'nodeId': '10.36.1.41', 'descriptor': {'dockerImage': 'folioci/mod-search:1.3.0-SNAPSHOT.72', 'dockerPull': True, 'env': [{'name': 'ELASTICSEARCH_URL', 'value': 'http://10.36.1.41:9301'}, {'name': 'OKAPI_URL', 'value': 'http://10.36.1.41:9130'}, {'name': 'JAVA_OPTIONS', 'value': '-XX:MaxRAMPercentage=66.0'}], 'dockerArgs': {'HostConfig': {'Memory': 536870912, 'PortBindings': {'8081/tcp': [{'HostPort': '%p'}]}}}}}})
|
| Comment by John Malconian [ 16/Apr/21 ] |
|
There is quite bit of resource consumption going on with mod-search while it waits for the tenant initialization and the "system_user". CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS |
| Comment by John Malconian [ 16/Apr/21 ] |
|
So far in my testing of FOLIO reference builds, I'm getting a mix of build failures and successes when mod-search is included. When successful, an additional 10 minutes is added to the total tenant initialization time. Right now, my sample size is small, so I will conduct more tests. |
| Comment by Bohdan Suprun (Inactive) [ 16/Apr/21 ] |
|
Thanks John Malconian. Is there any CPU limitation for mod-search? Could you please provide whole logs for mod-search when it failed next time? I want to check how long it takes to init the app, to call tenant init etc. |
| Comment by Bohdan Suprun (Inactive) [ 16/Apr/21 ] |
|
I guess there is something with this OKAPI orchestration mode. What I mentioned in logs that there is significant delay between when module initialized fully and when tenant init is called. For example MSEARCH-83: # Application started 00:10:39.884 [main] INFO org.folio.search.SearchApplication - Started SearchApplication in 9.25 seconds (JVM running for 10.249) # Tenant init called 00:29:19.092 [http-nio-8081-exec-2] INFO org.folio.spring.controller.TenantController - Initializing tenant... As you can see it is 19 mins delay. I'll have to investigate OKAPI code, OKAPI should should have a way when it is assumes that module is initialized and then sends the post tenant. There should be an issue here, I think. |
| Comment by John Malconian [ 16/Apr/21 ] |
|
thanks Bohdan Suprun - I had to pull inventory-es and mod-search out of snapshot again since I was getting more build failures than successful ones. There is a delay between module deployment and tenant initialization. First, Okapi deploys all backend modules which takes a certain amount of time and then we enable all backend modules at once for the tenant using the Okapi install endpoint. It is this tenant initialization process that is giving us troubles. Normally this step takes less than 5 minutes, but with mod-search included, it can take an additional 10 minutes (if successful). Otherwise, the it seems to abort. One thing that we also need to check is the connection timeout between the Ansible client and Okapi as well as the ALB and Okapi. When failures occur, there is no error from Okapi. It appears that Ansible is just aborting the connection. Regardless of a possible connection timeout, we need to determine what is causing the extra time to initialize the module with the tenant. |
| Comment by Bohdan Suprun (Inactive) [ 16/Apr/21 ] |
|
John Malconian, Wayne Schneider, Is it possible to run playbook inside a vagrant box so I can debug it? |
| Comment by John Malconian [ 20/Apr/21 ] |
|
@bohdan - I think on the vagrant box, the best option would be to create a second tenant and initialize the same set of modules for the second tenant that the default diku tenant is running + mod-search to get an idea of what's going on. But another issue has come up that is preventing me from doing further troubleshooting. ui-inventory-es needs a dependency update. Incompatible version for module folio_inventory-es-6.2.100047 interface circulation. Need 9.0. Have 10.0/mod-circulation-21.0.0-SNAPSHOT.836 I'm also wondering if it's possible to back off the polling loop in mod-search to every 10-15 seconds. I'm not sure what it is now, but it's generating a ton of noise in the logs. |
| Comment by Bohdan Suprun (Inactive) [ 21/Apr/21 ] |
|
Sure, there is already a PR - https://github.com/folio-org/mod-search/pull/74 to increase back-off to 20 seconds. And I also created a PR to update versions https://github.com/folio-org/ui-inventory-es/pull/44 |
| Comment by John Malconian [ 21/Apr/21 ] |
|
Excellent, thanks Bohdan Suprun. Ping me when https://github.com/folio-org/mod-search/pull/74 has been merged. |
| Comment by Bohdan Suprun (Inactive) [ 22/Apr/21 ] |
|
Hi John Malconian, We have merged the https://github.com/folio-org/mod-search/pull/74 and one more MSEARCH-112: Check that ES index exists before any indexing logic. improvement that prevents module to call OKAPI before ES indexes are created (i.e. before module is initialized). |
| Comment by John Malconian [ 22/Apr/21 ] |
|
Bohdan Suprun The recent changes made to logging were excellent. Before the changes, the module was generating ~10gb of I/O. After the change, ~2MB. Tenant initialization time after the changes is now relatively the same as before. I will re-add inventory-es/mod-search to snapshot and then close this issue. |