[FOLIO-2909] SPIKE: checkout-by-barcode receives 500 but loan is created Created: 10/Dec/20  Updated: 25/Jan/22  Resolved: 25/Jan/22

Status: Closed
Project: FOLIO
Components: None
Affects versions: None
Fix versions: None

Type: Story Priority: P2
Reporter: Zak Burke Assignee: Unassigned
Resolution: Won't Do Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original estimate: Not Specified

Attachments: Microsoft Word checkout process.docx     Microsoft Word document checkout.docx    
Issue links:
Blocks
blocks CIRC-1046 Checkout-by-barcode receives 500 but ... Closed
Duplicate
is duplicated by CIRC-1045 Bugfest: PubSub error checking out Closed
Relates
relates to CIRC-1045 Bugfest: PubSub error checking out Closed
relates to CIRC-1136 BACKEND - Recieve "HTTP 500 Internal ... Closed
relates to UICHKOUT-712 Recieve "HTTP 500 Internal Server Err... Closed
relates to MODINVSTOR-620 Domain events sending in inventory-st... Closed
Sprint:
Development Team: Vega

 Description   

SPIKE for technical investigation of how we can address partial success checkouts (so we can fix CIRC-1046 Closed )

Description from CIRC-1046:

Overview: Although check out is a multi-step process that includes changing an item's status, creating a loan object, and possibly other actions such as emailing a receipt to the patron and/or publishing an Event Message in PubSub, from the point of view of an application making a single request to the checkout-by-barcode endpoint it is an atomic operation performed in a single request. At present, checkout-by-barcode returns the status of the final step in this pipeline, which may be a failure, even if early steps in the pipeline (such as changing the item status and creating the loan) succeed. This is misleading, if not downright wrong, and it causes the UI to indicate that the operation failed when in fact a loan was created for the item and patron.

Expected Results: If an item's status changes from 'Available' to 'Checked out' and a loan item is created, the response-code should be a 200, even if other portions of the check out process fail and return 500.

Actual Results: If an item's status changes from 'Available' to 'Checked out' and a loan item is created, but later portions of the check out process fail, the response-code is that of the failed portion of the check out.

Additional Information: Related details/discussion at CIRC-1045 Closed

Interested parties: Marc Johnson, Holly Mistlebauer, Cate Boerema



 Comments   
Comment by Cate Boerema (Inactive) [ 10/Dec/20 ]

Marc Johnson, here's the spike for CIRC-1046 Closed

Comment by Kyle Felker [ 29/Mar/21 ]

Zak Burke, Marc Johnson, I spent a few hours today crawling through the checkout by barcode service and documenting the process it goes through.  I put up a document showing the steps in order.  This whole thing is formed basically as a giant promise chain, with each result linked to the next operation in the chain.

By the time the loan object is both created and saved to storage (step 29), 3/4ths of the checkout process is completed.  Marc, am I right in thinking that if any of the previous promise calls fails, they abort the process?  If that's so, then the only operations that would result in an error but still create a loan would be from step 30 onward.

Comment by Kyle Felker [ 30/Mar/21 ]

Did a bit more poking around today-I coded one of the steps in the checkout process to fail automatically (retrieving the overdue fine policy).  Then I loaded the module to the scratch environment and attempted to check out an item.  The checkout failed with the message that the overdue policy could not be found.  I then checked the item record for the item as well as the user record for the patron I attempted to check it out to-no loan, and item does not show checked out.  So I think that any failure in the chain aborts the whole process at that point.  I'm going to try and fail one of the steps after the loan is saved and see what happens.

(Later)
As expected, failing a step after the loan object is saved also results in an error, but the loan object is still created, the item has been updated, etc.

So, based on this, if you're getting an error message at the check-out screen, but the loan is showing in the UI, it means that one of the steps after the loan object has been saved has failed (steps 30 onward).  Failed steps seem to forward whatever error message they get from the http endpoint they were dealing with-so, for example, if the attempt to publish the checkout event to the patron record fails, the failure message from that endpoint gets forwarded to the checkout screen UI. 

There are two basic approaches I can think of we could do here:

1.  We could modify the code after the loan is saved so that failures at that point are not sent to the UI.  That would probably be relatively easy-my problem with that is that then we have no notification that something went wrong that may need fixing.

2.  We could try to figure out a way to make error messages indicate wether or not the overall checkout process was successful-something like "there was an error publishing an event to the patron record, but the checkout has completed."  Then you'd know something went wrong, but the loan was created successfully.  The validation code for phase one is doing something similar to this-that could be used as a model. 

Comment by Marc Johnson [ 31/Mar/21 ]

Kyle Felker

am I right in thinking that if any of the previous promise calls fails, they abort the process? If that's so, then the only operations that would result in an error but still create a loan would be from step 30 onward.

Yes, broadly. (Assuming the chaining is done correctly) any outcome of a step means that none of the following steps are actioned (this was my answer to dealing with the form of callback hell that FOLIO found itself in during the early days)

Comment by Marc Johnson [ 31/Mar/21 ]

Kyle Felker

There are two basic approaches I can think of we could do here:

1. We could modify the code after the loan is saved so that failures at that point are not sent to the UI. That would probably be relatively easy-my problem with that is that then we have no notification that something went wrong that may need fixing.

2. We could try to figure out a way to make error messages indicate wether or not the overall checkout process was successful-something like "there was an error publishing an event to the patron record, but the checkout has completed." Then you'd know something went wrong, but the loan was created successfully. The validation code for phase one is doing something similar to this-that could be used as a model.

Agreed. This is broadly what I described when this issue first came up. The challenges being:

  • classifying which steps are considered fatal or non-fatal
  • how to make folks aware of (and potentially take compensating action for) non-fatal errors

We don't really have a great way of accumulating non-fatal errors. I believe the Vega team did something in this area recently.

I think the first step in this is listing all of the steps (after the loan is created) so that POs / stakeholders can decide what should happen when they fail.

Comment by Kyle Felker [ 31/Mar/21 ]

Marc Johnson

>I think the first step in this is listing all of the steps (after the loan is created) so that POs / stakeholders can decide >what should happen when they fail.

If I'm reading it right, there are really only three events after the loan object is saved successfully that could meaningfully fail:

  1. Save patron action session details about the checkout event.
  2. Publish the item checkout event information
  3. Schedule a notice for the loan due date

There are things it does after these steps, but that looks like it's all connected to writing the loan data into the successful http response it sends to the UI.  As long as the loan has been successfully created (and if it hasn't, we never get to this point), that stuff should not fail.

Comment by Marc Johnson [ 01/Apr/21 ]

Kyle Felker

there are really only three events after the loan object is saved successfully that could meaningfully fail

I think the next step is to surface these to the POs / stakeholders with an indication of the impact (e.g. the circulation log will not be consistent), so they can decide whether the impact of these failing without any notification is acceptable.

Comment by Zak Burke [ 05/Apr/21 ]

so they can decide whether the impact of these failing without any notification is acceptable.

Truly, both situations are problematic: creating a loan but reporting failure is bad (this is what we currently have), and completing only some of the steps in a chain of promises and reporting the chain as successful is also bad (this is what we are proposing). 

Somewhat facetious but actually not really facetious solution: return 279 Pyrrhic Victory, i.e. "success with errors". 

Comment by Kyle Felker [ 06/Apr/21 ]

Marc Johnson

>I think the next step is to surface these to the POs / stakeholders with an indication of the impact (e.g. the >circulation log will not be consistent), so they can decide whether the impact of these failing without any >notification is acceptable.

1.  What is the best way to surface things and get decisions?  Should I post something here?

2.  The only one of those three steps I think I understand well enough to explain the impact of is the notice-I'm assuming that's for the patron, to remind them that the item is or will be due soon.  I've looked at the technical documentation for the other steps, but that tells me how they work-not how they are used.  I can see that the publishing step publishes a notice of the event to subscribers-which seem to be other modules?  I'm not clear on what they do with that data or why they need it.  Likewise, I can see the log logs details about the checkout event-but I'm not sure when this data is consulted or why.  Is there documentation on how these are used?

Comment by Marc Johnson [ 27/Apr/21 ]

Kyle Felker

1. What is the best way to surface things and get decisions? Should I post something here?

Yes, I suggest starting with that. My suggestion for a format is a table describing each step (that is observable from outside, e.g. loan created, message published, notice scheduled) that could fail and the impact it has.

As I think Zak Burke said in yesterday's planning meeting, we've tried to have these kinds of conversations in the past with limited success.

The only one of those three steps I think I understand well enough to explain the impact of is the notice-I'm assuming that's for the patron, to remind them that the item is or will be due soon.

Yes, that would be a patron notice.

I've looked at the technical documentation for the other steps, but that tells me how they work-not how they are used. I can see that the publishing step publishes a notice of the event to subscribers-which seem to be other modules? I'm not clear on what they do with that data or why they need it.

Indeed, this is where this work gets more challenging and time consuming (and why I hadn't managed to do it before you picked it up). We would need to trace which modules consume what messages and then figure out (maybe by asking teams or POs) what the impact is. For example, I believe that some messages are consumed by the circulation log, meaning that it won't reflect that event if they aren't published.

I think we need to concentrate on first order side effects for the moment, e.g. active reactions to these steps, and defer second order side effects e.g. an item record is updated in storage, which then triggers a message to be published to Kafka, which is then use to updated the Elastic Search based inventory search.

Likewise, I can see the log logs details about the checkout event-but I'm not sure when this data is consulted or why. Is there documentation on how these are used?

I don't think there is any high level documentation for how messages published during this process are used across modules.

I'm not sure there is any general process documentation for everything that should happen from a user perspective during a check out. The various circulation POs might know about that.

Comment by Kyle Felker [ 04/May/21 ]

Marc Johnson

I've attached a document explaining my findings-please let me know if this is what you had in mind. 

Comment by Kyle Felker [ 07/Jun/21 ]

Blocking this until Marc Johnson can take a look at it.

Comment by Marc Johnson [ 07/Jun/21 ]

Kyle Felker

I've attached a document explaining my findings-please let me know if this is what you had in mind.

Yes, that is kinda what I was thinking about.

The idea I had in my head was that each step in these processes would be a row in a table, with columns (beyond the descriptions you have in the docs) like:

  • impact of this step failing e.g. no notices will be sent, change won't appear in the circulation log
  • should this step fail the process or not
  • should this step failing be communicated to the client
Comment by Kyle Felker [ 15/Jun/21 ]

Marc Johnson

Uploaded a new version with additional columns.

Comment by Holly Mistlebauer [ 27/Aug/21 ]

Need to review the findings, see what the SIG wants us to do and figure out how to do it...for now we are focusing on checkout performance...we don't have the capacity to do this at the same time...

Comment by Marc Johnson [ 06/Jan/22 ]

Holly Mistlebauer Stephanie Buck

Given that CIRC-1046 Closed , which is basically the same idea, has been moved to Vega, should this also be moved (and removed from the current sprint)?

Comment by Holly Mistlebauer [ 07/Jan/22 ]

Marc Johnson: I have moved this to Vega as you suggested. Thanks...

Comment by Stephanie Buck [ 10/Jan/22 ]

Holly Mistlebauer and Marc Johnson, does this mean that Kyle should be removed as Assignee? Kyle Felker, have you begun work on this ticket yet?

 

Comment by Kyle Felker [ 10/Jan/22 ]

Stephanie Buck

I did some work on it a long, long time ago-you can see some of the documentation I produced attached to the issue itself. I have not touched it in months, though. If it's moving to another team, then I think I ought to be removed as the assignee.

Comment by Marc Johnson [ 25/Jan/22 ]

Closing this as CIRC-1046 Closed which this was intended as a precursor to (according to the issue links) is stated as being in progress, making this redundant.

Generated at Thu Feb 08 23:24:09 UTC 2024 using Jira 1001.0.0-SNAPSHOT#100246-sha1:7a5c50119eb0633d306e14180817ddef5e80c75d.