[FOLIO-2909] SPIKE: checkout-by-barcode receives 500 but loan is created Created: 10/Dec/20 Updated: 25/Jan/22 Resolved: 25/Jan/22 |
|
| Status: | Closed |
| Project: | FOLIO |
| Components: | None |
| Affects versions: | None |
| Fix versions: | None |
| Type: | Story | Priority: | P2 |
| Reporter: | Zak Burke | Assignee: | Unassigned |
| Resolution: | Won't Do | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original estimate: | Not Specified | ||
| Attachments: |
|
||||||||||||||||||||||||||||||||||||
| Issue links: |
|
||||||||||||||||||||||||||||||||||||
| Sprint: | |||||||||||||||||||||||||||||||||||||
| Development Team: | Vega | ||||||||||||||||||||||||||||||||||||
| Description |
|
SPIKE for technical investigation of how we can address partial success checkouts (so we can fix
Description from CIRC-1046: Overview: Although check out is a multi-step process that includes changing an item's status, creating a loan object, and possibly other actions such as emailing a receipt to the patron and/or publishing an Event Message in PubSub, from the point of view of an application making a single request to the checkout-by-barcode endpoint it is an atomic operation performed in a single request. At present, checkout-by-barcode returns the status of the final step in this pipeline, which may be a failure, even if early steps in the pipeline (such as changing the item status and creating the loan) succeed. This is misleading, if not downright wrong, and it causes the UI to indicate that the operation failed when in fact a loan was created for the item and patron. Expected Results: If an item's status changes from 'Available' to 'Checked out' and a loan item is created, the response-code should be a 200, even if other portions of the check out process fail and return 500. Actual Results: If an item's status changes from 'Available' to 'Checked out' and a loan item is created, but later portions of the check out process fail, the response-code is that of the failed portion of the check out. Additional Information: Related details/discussion at
Interested parties: Marc Johnson, Holly Mistlebauer, Cate Boerema |
| Comments |
| Comment by Cate Boerema (Inactive) [ 10/Dec/20 ] |
|
Marc Johnson, here's the spike for
|
| Comment by Kyle Felker [ 29/Mar/21 ] |
|
Zak Burke, Marc Johnson, I spent a few hours today crawling through the checkout by barcode service and documenting the process it goes through. I put up a document showing the steps in order. This whole thing is formed basically as a giant promise chain, with each result linked to the next operation in the chain. By the time the loan object is both created and saved to storage (step 29), 3/4ths of the checkout process is completed. Marc, am I right in thinking that if any of the previous promise calls fails, they abort the process? If that's so, then the only operations that would result in an error but still create a loan would be from step 30 onward. |
| Comment by Kyle Felker [ 30/Mar/21 ] |
|
Did a bit more poking around today-I coded one of the steps in the checkout process to fail automatically (retrieving the overdue fine policy). Then I loaded the module to the scratch environment and attempted to check out an item. The checkout failed with the message that the overdue policy could not be found. I then checked the item record for the item as well as the user record for the patron I attempted to check it out to-no loan, and item does not show checked out. So I think that any failure in the chain aborts the whole process at that point. I'm going to try and fail one of the steps after the loan is saved and see what happens. (Later) So, based on this, if you're getting an error message at the check-out screen, but the loan is showing in the UI, it means that one of the steps after the loan object has been saved has failed (steps 30 onward). Failed steps seem to forward whatever error message they get from the http endpoint they were dealing with-so, for example, if the attempt to publish the checkout event to the patron record fails, the failure message from that endpoint gets forwarded to the checkout screen UI. There are two basic approaches I can think of we could do here: 1. We could modify the code after the loan is saved so that failures at that point are not sent to the UI. That would probably be relatively easy-my problem with that is that then we have no notification that something went wrong that may need fixing. 2. We could try to figure out a way to make error messages indicate wether or not the overall checkout process was successful-something like "there was an error publishing an event to the patron record, but the checkout has completed." Then you'd know something went wrong, but the loan was created successfully. The validation code for phase one is doing something similar to this-that could be used as a model. |
| Comment by Marc Johnson [ 31/Mar/21 ] |
Yes, broadly. (Assuming the chaining is done correctly) any outcome of a step means that none of the following steps are actioned (this was my answer to dealing with the form of callback hell that FOLIO found itself in during the early days) |
| Comment by Marc Johnson [ 31/Mar/21 ] |
Agreed. This is broadly what I described when this issue first came up. The challenges being:
We don't really have a great way of accumulating non-fatal errors. I believe the Vega team did something in this area recently. I think the first step in this is listing all of the steps (after the loan is created) so that POs / stakeholders can decide what should happen when they fail. |
| Comment by Kyle Felker [ 31/Mar/21 ] |
|
>I think the first step in this is listing all of the steps (after the loan is created) so that POs / stakeholders can decide >what should happen when they fail. If I'm reading it right, there are really only three events after the loan object is saved successfully that could meaningfully fail:
There are things it does after these steps, but that looks like it's all connected to writing the loan data into the successful http response it sends to the UI. As long as the loan has been successfully created (and if it hasn't, we never get to this point), that stuff should not fail. |
| Comment by Marc Johnson [ 01/Apr/21 ] |
I think the next step is to surface these to the POs / stakeholders with an indication of the impact (e.g. the circulation log will not be consistent), so they can decide whether the impact of these failing without any notification is acceptable. |
| Comment by Zak Burke [ 05/Apr/21 ] |
Truly, both situations are problematic: creating a loan but reporting failure is bad (this is what we currently have), and completing only some of the steps in a chain of promises and reporting the chain as successful is also bad (this is what we are proposing). Somewhat facetious but actually not really facetious solution: return 279 Pyrrhic Victory, i.e. "success with errors". |
| Comment by Kyle Felker [ 06/Apr/21 ] |
|
>I think the next step is to surface these to the POs / stakeholders with an indication of the impact (e.g. the >circulation log will not be consistent), so they can decide whether the impact of these failing without any >notification is acceptable. 1. What is the best way to surface things and get decisions? Should I post something here? 2. The only one of those three steps I think I understand well enough to explain the impact of is the notice-I'm assuming that's for the patron, to remind them that the item is or will be due soon. I've looked at the technical documentation for the other steps, but that tells me how they work-not how they are used. I can see that the publishing step publishes a notice of the event to subscribers-which seem to be other modules? I'm not clear on what they do with that data or why they need it. Likewise, I can see the log logs details about the checkout event-but I'm not sure when this data is consulted or why. Is there documentation on how these are used? |
| Comment by Marc Johnson [ 27/Apr/21 ] |
Yes, I suggest starting with that. My suggestion for a format is a table describing each step (that is observable from outside, e.g. loan created, message published, notice scheduled) that could fail and the impact it has. As I think Zak Burke said in yesterday's planning meeting, we've tried to have these kinds of conversations in the past with limited success.
Yes, that would be a patron notice.
Indeed, this is where this work gets more challenging and time consuming (and why I hadn't managed to do it before you picked it up). We would need to trace which modules consume what messages and then figure out (maybe by asking teams or POs) what the impact is. For example, I believe that some messages are consumed by the circulation log, meaning that it won't reflect that event if they aren't published. I think we need to concentrate on first order side effects for the moment, e.g. active reactions to these steps, and defer second order side effects e.g. an item record is updated in storage, which then triggers a message to be published to Kafka, which is then use to updated the Elastic Search based inventory search.
I don't think there is any high level documentation for how messages published during this process are used across modules. I'm not sure there is any general process documentation for everything that should happen from a user perspective during a check out. The various circulation POs might know about that. |
| Comment by Kyle Felker [ 04/May/21 ] |
|
I've attached a document explaining my findings-please let me know if this is what you had in mind. |
| Comment by Kyle Felker [ 07/Jun/21 ] |
|
Blocking this until Marc Johnson can take a look at it. |
| Comment by Marc Johnson [ 07/Jun/21 ] |
Yes, that is kinda what I was thinking about. The idea I had in my head was that each step in these processes would be a row in a table, with columns (beyond the descriptions you have in the docs) like:
|
| Comment by Kyle Felker [ 15/Jun/21 ] |
|
Uploaded a new version with additional columns. |
| Comment by Holly Mistlebauer [ 27/Aug/21 ] |
|
Need to review the findings, see what the SIG wants us to do and figure out how to do it...for now we are focusing on checkout performance...we don't have the capacity to do this at the same time... |
| Comment by Marc Johnson [ 06/Jan/22 ] |
|
Holly Mistlebauer Stephanie Buck Given that
|
| Comment by Holly Mistlebauer [ 07/Jan/22 ] |
|
Marc Johnson: I have moved this to Vega as you suggested. Thanks... |
| Comment by Stephanie Buck [ 10/Jan/22 ] |
|
Holly Mistlebauer and Marc Johnson, does this mean that Kyle should be removed as Assignee? Kyle Felker, have you begun work on this ticket yet?
|
| Comment by Kyle Felker [ 10/Jan/22 ] |
|
I did some work on it a long, long time ago-you can see some of the documentation I produced attached to the issue itself. I have not touched it in months, though. If it's moving to another team, then I think I ought to be removed as the assignee. |
| Comment by Marc Johnson [ 25/Jan/22 ] |
|
Closing this as
|