[FOLIO-2050] SPIKE Design batch create / update API endpoint standard Created: 30/May/19 Updated: 25/Jan/23 Resolved: 25/Jan/23 |
|
| Status: | Closed |
| Project: | FOLIO |
| Components: | None |
| Affects versions: | None |
| Fix versions: | None |
| Type: | Task | Priority: | P2 |
| Reporter: | Marc Johnson | Assignee: | Ian Walls |
| Resolution: | Won't Do | Votes: | 0 |
| Labels: | platform-backlog, potential-decision, tech-debt | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original estimate: | Not Specified | ||
| Attachments: |
|
| Issue links: |
|
| Sprint: | |
| Development Team: | Core: Platform |
| Description |
Context

There have been frequent requests for batch APIs for various types of records in FOLIO. Some modules have started to implement batching, either in response to this or to performance concerns. These implementations all differ; I think it could be valuable to try to decide on a pattern for these.

Framing Questions

Size expectations or restrictions

Jon Miller has suggested that we may want to use this to load typical batch sizes of 100 .. 2000 records.

Synchronous or asynchronous response

Should the server not respond until the batch processing has finished, or should it respond promptly (maybe after some validation) with the ability to monitor the status of the operation? How does this affect the client? Is this decision affected by the size of batch we allow, as that is likely a primary component of latency? Should a response include a complete representation of all of the records created, references to them, or even no information at all (except failures, depending upon the question below)?

Complete or partial success / failure

Should a batch only succeed if all records are valid, or should it be acceptable for some records to be invalid? What should happen if all of the records are valid but persistence of some of them fails (this is likely related to the transactions topic below)?

Database transactions (specific to storage modules)

Should the records be created in a single transaction? How could this decision affect the handling of partial success or failure, if we decide we also want this? How does this affect resource usage, e.g. a connection has to be used exclusively for each batch operation, which could lead to connection contention within the module?

Streamed processing of records

To constrain memory usage during batch operations, should the set of records be processed as a stream of single (or multiple) records? How does this affect validation, any restrictions on batch size, or database transaction semantics? For example, if we wanted to validate all records prior to any persistence, we might need to be able to process the stream more than once.

Processing Semantics

What requirements are we missing? |
| Comments |
| Comment by Anatolii Starkov [ 18/Jun/19 ] |
|
Hi Marc Johnson, I see two options.

Minimum

POST /source-record/batch/records
[
{record 1 ... },
{record 2 ... },
...
{record n ... }
]
Response
HTTP/1.1 207 Multi-Status
Content-Type: application/json
Content-Length: xxxx
{
"results": [
{
"href" : "/source-record/records/1",
"status" : 200,
"headers" : [{"contentType" : "application/json"}],
"body" : {record 1 …}
},
{
"href": "/source-record/records/2",
"status": 200,
"headers" : [{"contentType" : "application/json"}],
"body" : {record 2 …}
},
{
"href": "/source-record/records/3",
"status": 400,
"headers" : [{"contentType" : "application/json"}],
"body": "The name 'some name' already exists"
},
{
"href": "/source-record/records/4",
"status": 403,
"headers" : [{"contentType" : "application/json"}],
"body": "You don't have access"
}
]
}
Notes:

Maximum

POST /source-record/batch
[
{
"id": "1",
"uri": "/source-storage/records/1",
"method": "DELETE"
},
{
"id": "2",
"uri": "/source-storage/snapshots/2",
"method": "PUT",
"body": {…}
}
,
{
"id": "3",
"uri": "/source-storage/records",
"method": "POST",
"body": {…}
}
]
Response
HTTP/1.1 207 Multi-Status
Content-Type: application/json
Content-Length: xxxx
{
"results": [
{
"href" : "/source-storage/records/1",
"status" : 204,
"headers" : [{"contentType" : "application/json"}],
"body" : ""
},
{
"href": "/source-storage/snapshots/2",
"status": 200,
"headers" : [{"contentType" : "application/json"}],
"body" : { …}
},
{
"href": "/source-storage/records",
"status": 400,
"headers" : [{"contentType" : "application/json"}],
"body": "The name 'some name' already exists"
}
]
}
Notes: https://evertpot.com/http/207-multi-status |
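A client consuming the 207 Multi-Status shape proposed above would need to partition the per-record results itself. A minimal sketch (in Python for illustration; the `results` payload shape is taken from the example above, the function name is invented):

```python
def partition_results(multi_status_body):
    """Split a 207 Multi-Status payload into succeeded and failed entries.

    Assumes the response shape proposed above: a top-level "results" list
    whose entries carry "href", "status", and "body".
    """
    succeeded, failed = [], []
    for entry in multi_status_body["results"]:
        if 200 <= entry["status"] < 300:
            succeeded.append(entry)
        else:
            failed.append(entry)
    return succeeded, failed

# Example payload mirroring the response shown above
response = {
    "results": [
        {"href": "/source-record/records/1", "status": 200, "body": {"id": "1"}},
        {"href": "/source-record/records/3", "status": 400,
         "body": "The name 'some name' already exists"},
    ]
}
ok, errors = partition_results(response)
```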
| Comment by Julian Ladisch [ 27/Jun/19 ] |
|
The primary key id should be optional: the server creates a new one unless the client provides one. This is how the single-record endpoints work. Regarding transactions: it is faster if the database puts all INSERTs into a single transaction, see "Disable Autocommit" on https://www.postgresql.org/docs/current/populate.html#DISABLE-AUTOCOMMIT If a record has a failure then
|
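The optional-id behaviour Julian describes can be sketched in a couple of lines (Python used for illustration; the function name is invented):

```python
import uuid

def ensure_id(record):
    """Assign a server-generated UUID only when the client omitted one,
    matching the single-record endpoint behaviour described above."""
    record.setdefault("id", str(uuid.uuid4()))
    return record
```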
| Comment by Theodor Tolstoy (One-Group.se) [ 28/Jun/19 ] |
|
Julian Ladisch I agree. For migration scripts I think it is key (no pun intended) that you can generate the IDs outside of FOLIO, in order to create linked objects in batch as well, and to maintain a fast lookup between legacy and new IDs on various objects. |
| Comment by Jon Miller [ 29/Jun/19 ] |
|
The HTTP 207 Multi-Status status code is interesting. I didn't know that existed. The following is how I was thinking it could be implemented. Inserts, updates, and deletes would be handled separately using HTTP POST, PUT, and DELETE, like they are for the single-object versions of the APIs, rather than combined in one. I can see how it could be useful to do it that way, though. The JSON request format would be the same as it is for the single-object version of the API, only the objects would be contained in arrays. It will definitely need to combine all the SQL updates into a single transaction. Having one transaction per object will kill performance. This would be true even if you weren't using a web API. I think it's the main reason the single-object APIs are as slow as they currently are.

The first thing the API would do is JSON validation. I was thinking that it would validate each of the objects in the array separately and return the id of the object in the error. However, if you need to support the case where the id values aren't supplied, maybe it would be better to validate the entire array at once. Presumably, the JSON validator will return the index values of the array elements that failed validation. Either that, or you could validate each element separately and return the index value instead. Doing the JSON validation first means that you can return an entire list of errors. However, an error could also happen at the SQL level; in this case, the returned array would only have one element. The main case where I could see this happening is a foreign key constraint violation.

In the success case, I think I would only return a list of id values, or, if there are other generated values of interest, maybe include those. Otherwise, I think it would unnecessarily slow things down. If you are doing a migration of, for example, instances, you could have 10 million of them. I don't know exactly how big that would be, but it could be, say, 20 GB. It would be a waste to both upload and download that much data. I can see how you might want to return it so that it is consistent with what the single-object version of POST does, but, in general, I think it is overkill. Most likely, you would just throw the data away and not even use it. I think the response format should be kept as simple as possible. The result format listed above seems overly complicated. I would eliminate the top-level "results" property, or maybe give it a more meaningful name like "ids", and only return the list of ids. Then, if the user wanted the full object, they could just use a GET to get it. In the default case, I don't think people are going to expect that data back.

Most likely, I would have two different response formats: one for success and one for error. For success, it would just be the list of ids. For error, it would be a list of errors. The error objects would include the id value of the object, or the index of the array element, if available. This way, if it was a list of validation errors, the client code could filter out the elements that had errors and resubmit the batch if they want the other ones to go through. I'm inclined to think that more complicated retry logic like this should be in the client.

I didn't know about HTTP 207 Multi-Status before. I think it's cool, but I am inclined to think that the API should return normal success or failure codes, since I think things need to be treated as batches (either the whole batch succeeds or the whole batch fails). If you wanted to support allowing some objects to succeed and some to fail, you would first need to attempt to do the update all in one transaction for performance. Then, if that failed, you would need to fall back to doing them one at a time. To me, implementing complicated logic like this seems like too much to ask if you are expecting developers to implement these APIs across the board in all the modules. Maybe if there were a core API that somehow handles this and makes it easy for each module, it might be doable.

Also, regarding the response format above, like I said, I would keep it as simple as possible. Including things like HTTP headers seems unnecessary. Regarding DELETEs, it might be better to make it more generalized, where you can provide a CQL filter. I created a separate JIRA for this: https://folio-org.atlassian.net/browse/FOLIO-2052. I created that one mainly for truncating tables. Maybe you could have two different query parameters: one with a comma-separated list of ids, and another that's a CQL filter. |
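Jon's per-element validation with indexed errors could look roughly like this (a sketch in Python for illustration; `validate_one` and the error-object fields are invented names, not FOLIO code):

```python
def validate_batch(records, validate_one):
    """Validate each array element separately; collect errors keyed by the
    record's id (if the client supplied one) and by its array index, so a
    client can filter out the failures and resubmit the rest.
    `validate_one` returns an error message, or None when the record is valid."""
    errors = []
    for index, record in enumerate(records):
        message = validate_one(record)
        if message is not None:
            errors.append({"id": record.get("id"), "index": index, "message": message})
    return errors
```

On success the endpoint would then return just the list of ids; on failure, the `errors` list.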
| Comment by Marc Johnson [ 01/Jul/19 ] |
|
Jon Miller Theodor Tolstoy (One-Group.se) Julian Ladisch Anatolii Starkov Thanks for your thoughts. Given how involved these topics are, I think this issue can be reframed to cover only create / update, and we can create separate issues for the other kinds of activities. Jakub Skoczen What do you think? |
| Comment by Anatolii Starkov [ 02/Jul/19 ] |
|
Hi guys, |
| Comment by Jakub Skoczen [ 02/Jul/19 ] |
|
Anatolii Starkov Kateryna Senchenko What problem are you trying to solve with the 207? A batch update operation with a large result set may take a long time. Are you proposing that the response is blocked until it fully completes? What if the client would like to learn about the status before the entire operation succeeds (or fails)? Did you consider a non-blocking API that returns 202 and a URI where the client can look up (poll) the status? What about error handling? How would the client retrieve errors for failed individual operations? Also, response and error handling are only some aspects of the batch API design. Likely the more critical elements are how the streaming of data is handled (via the HTTP socket and down to the database) and whether the client has any control over it (e.g. batch sizes). Were those elements taken into consideration? |
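The non-blocking alternative Jakub describes (202 Accepted plus a pollable status URI) can be sketched as follows (Python for illustration; `submit` and `get_status` stand in for the HTTP calls, and all names and states are hypothetical):

```python
import time

def submit_and_poll(submit, get_status, interval=1.0, timeout=60.0):
    """Non-blocking batch pattern: the POST returns 202 Accepted with a
    status URI, and the client polls it until the job reaches a terminal
    state ("completed" or "failed" here, as assumed state names)."""
    status_uri = submit()  # server replies 202 Accepted + a status location
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = get_status(status_uri)
        if status["state"] in ("completed", "failed"):
            return status
        time.sleep(interval)
    raise TimeoutError("batch job did not finish within the timeout")
```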
| Comment by Marc Johnson [ 02/Jul/19 ] |
|
I've added some framing questions to this issue, to try to help us express the needs / requirements of this API better. There is no need to answer each specifically; they are only intended to inspire ideas or aspects of the design. I think we need to decide on the high-level design / behaviour of these operations before we get too deep into the protocol design itself. |
| Comment by Jon Miller [ 02/Jul/19 ] |
|
Regarding batch operations taking a long time, the client shouldn't be sending huge batches. Arrays of 1,000 would be typical; it could even be 100. You don't want the batch size too big or too small. If it is too small, like 1, performance will be too slow. If it is too large, it would normally cause locking contention. However, I think the way PostgreSQL is designed, it doesn't have that problem. Nonetheless, clients IMHO shouldn't be sending huge arrays; the client should still be breaking things up. The reason the client should not make it too big is, like I said, to avoid locking contention. In the apps that I write, I normally use a batch (transaction) size of 1,000, or I might lower it; I try to keep the database update speeds to less than a second. Otherwise, it could block activities from other users of the system. I'm not sure PostgreSQL has this problem, but I think it is good to stick to that on general principle. The client should not have to be able to query for status updates separately; it shouldn't be using huge batches IMHO. |
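The "client breaks things up" discipline Jon describes is a simple chunking loop (a sketch in Python; the function name and default size are illustrative, with 1,000 taken from his comment):

```python
def chunked(records, batch_size=1000):
    """Yield successive slices of at most batch_size records, so the client
    never sends one huge array and each server-side transaction stays short."""
    for start in range(0, len(records), batch_size):
        yield records[start:start + batch_size]
```

A migration client would then POST each yielded slice as one batch request.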
| Comment by Marc Johnson [ 02/Jul/19 ] |
|
Jon Miller Thank you for that extra information.
Your previous mention of millions of records had led me to wonder if the expectation could be that clients could submit millions of records in a single operation. Setting expectations around a relatively small number helps with a number of my framing questions I think.
I don't know about the transaction overhead or locking contention (I imagine it is row-level). What little I've read suggested that a PostgreSQL connection operates on a single transaction and command at a time (no concurrent transactions like Oracle or MS SQL Server, I think), meaning a connection needs to be used exclusively for a batch operation. |
| Comment by Jon Miller [ 02/Jul/19 ] |
|
Marc Johnson Yes, this would be used to load millions of objects, for example when loading instances. However, the client would only send, say, 1,000 at a time. I have done tests in the past using SQL and have found that the transaction sizes don't have to be huge to speed things up a lot. For example, I'm not sure how much of a speed improvement you get by using a transaction size of 10,000, or 1,000, versus 100; it just can't be as small as 1. Yes, the connection would be tied up for the duration of the request. So, you are right, it could cause the connection pool to get depleted if there were a lot of batch updates going on. I think that's another reason not to use huge transaction sizes. The lock contention I was thinking of was with regard to table or row locks. I don't remember what it's called, but there is a certain design that PostgreSQL uses, at least with regard to queries, where every client has its own snapshot of the data and it doesn't have to use locking. I'm not an expert on it, so maybe there is some locking, but, from what I remember, the article that I read said that it has fewer locking problems than other databases do, due to that design. Also, I think it has slow count performance due to that design. I think if you have a transaction open for a long time, it can even block queries from other clients; I think it depends on the isolation level. I'm not really an expert on it. I just know from my own experience with other databases not to make the transaction size too big. I did tests of, say, 10,000 versus 1,000 and didn't see much of a performance gain; it may have even been slower. So, that's why I typically use something like 1,000. You may be able to get by with something smaller, like 100. Really, I think it is less about the size and more about not having the transaction open for too long. My experience is mostly working with the database directly and not going over a web API, so you may need to adjust for that. Personally, I wouldn't have an objection if there was a limit on how much you can send at once. At the moment, I don't think I would need streaming support (unlike for queries, where I think this is very much needed). If it just loaded it into an array on the back end, I don't think it would necessarily cause a problem. |
| Comment by Marc Johnson [ 02/Jul/19 ] |
That is good to know. Both are good reasons to decide on a range of batch sizes that provides this trade-off, e.g. typical batch sizes should be between 100 and 2000.
That sounds like some sort of read pass-through. I imagine it is sensitive to the transaction isolation level, especially read committed. It's useful to know the trade-offs in that design; I knew the count performance was slow, but I didn't know why.
I think these decisions about batch size and internal implementation are heavily connected. If we can find a way to limit the batch size, then streaming might not be important. A challenge with this, which I don't know an easy way to address, is determining the size before reading into an array. Reading into an array could make for a fragile implementation, where simply sending a few big batches cripples the module. |
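One way around "determine the size before reading into an array" is to count records as they stream in and fail fast past a cap, so an oversized batch is never buffered in full. A sketch (Python for illustration; FOLIO's back end is Java/Vert.x, and the names and limit are hypothetical):

```python
def bounded(record_stream, max_records=2000):
    """Pass records through one at a time, rejecting the batch as soon as it
    exceeds max_records, without ever buffering the whole batch in memory."""
    count = 0
    for record in record_stream:
        count += 1
        if count > max_records:
            raise ValueError(f"batch exceeds {max_records} records")
        yield record
```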
| Comment by Jon Miller [ 02/Jul/19 ] |
|
I'm wondering if maybe the word "bulk" should be used instead of "batch" in the title of this issue? I'm thinking a batch size could be bigger than the number of objects transferred in one API call. I guess it depends on the definition of batch. For example, say you are loading a file with 500,000 objects in it. Would that be considered a batch? |
| Comment by Marc Johnson [ 02/Jul/19 ] |
I've been struggling with either term; I likely don't understand the difference between them well enough. |
| Comment by Jon Miller [ 02/Jul/19 ] |
|
Marc Johnson The design that I was thinking of that PostgreSQL uses is called Multi-version Concurrency Control (MVCC). It mentions it in Wikipedia at https://en.wikipedia.org/wiki/PostgreSQL#Multiversion_concurrency_control_(MVCC). Yeah, I'm not sure what the best way to set a limit on upload size is. I don't know Vert.x or whatever it is you're using on the back end. I know with ASP.NET applications, it has a default maximum request size and there is a way to override it for specific pages in a web app. Maybe you could do something like that, where it's basically the web server that's counting raw bytes that limits it. Otherwise, I'm assuming you would need to read the data as a stream (decode each JSON object/row as it comes in). I know how to do this in .NET at a low-level, but, don't know how it would be done in Java or with Vert.x. I'm assuming Jakub Skoczen or Julian Ladisch would know? |
| Comment by Julian Ladisch [ 02/Jul/19 ] |
|
"onFailure=skip" can be implemented using SAVEPOINT - https://www.postgresql.org/docs/current/sql-savepoint.html :
postgres=# create table t (i integer unique);
CREATE TABLE
postgres=# begin;
BEGIN
postgres=# savepoint s;
SAVEPOINT
postgres=# insert into t values (1);
INSERT 0 1
postgres=# release savepoint s;
RELEASE
postgres=# savepoint s;
SAVEPOINT
postgres=# insert into t values (1);
ERROR: duplicate key value violates unique constraint "t_i_key"
DETAIL: Key (i)=(1) already exists.
postgres=# rollback to savepoint s;
ROLLBACK
postgres=# savepoint s;
SAVEPOINT
postgres=# insert into t values (3);
INSERT 0 1
postgres=# release savepoint s;
RELEASE
postgres=# end;
COMMIT
postgres=# select * from t;
i
---
1
3
(2 rows)
|
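The SAVEPOINT-per-record pattern shown above gives "onFailure=skip" semantics: each failing INSERT is rolled back individually while the rest of the transaction proceeds. The same semantics, sketched in application terms (Python for illustration; `persist` stands in for the guarded INSERT and all names are invented):

```python
def apply_with_skip(records, persist):
    """Persist each record; on failure, skip that record and carry on,
    mirroring the SAVEPOINT / ROLLBACK TO SAVEPOINT pattern above.
    `persist` may raise on, e.g., a unique-constraint violation."""
    saved, skipped = [], []
    for record in records:
        try:
            persist(record)            # analogous to SAVEPOINT; INSERT; RELEASE
            saved.append(record)
        except Exception as error:     # analogous to ROLLBACK TO SAVEPOINT
            skipped.append((record, str(error)))
    return saved, skipped
```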
| Comment by Julian Ladisch [ 02/Jul/19 ] |
|
Vert.x uses Jackson, and Jackson has a streaming API for reading JSON: https://github.com/FasterXML/jackson-docs/wiki/JacksonStreamingApi |
| Comment by Jon Miller [ 02/Jul/19 ] |
|
Julian Ladisch Thanks for the info. Never knew about SAVEPOINT. I wonder what effect it would have on performance. |
| Comment by Jon Miller [ 02/Jul/19 ] |
|
Julian Ladisch The downside to using SAVEPOINT though, is that you wouldn't be able to do INSERTs with multiple VALUE sets. As you showed me with your code, that speeds the inserts up a lot. I don't think I would want to sacrifice that. |
| Comment by Jon Miller [ 02/Jul/19 ] |
|
I would recommend setting the limit a bit higher from the start so that we can test it and see what works best. Maybe allow up to 100,000. Normally, I don't set it that high myself, but I'm curious whether there would be a speed boost with it higher. I'm not sure how much overhead the API adds for something like this. I wouldn't think it would be too much, seeing as the JSON sent over HTTP is basically exactly what's put into the database. |
| Comment by Dale Arntson [ 02/Jul/19 ] |
|
That's an interesting idea, Julian. It probably needs some real-world testing to see what kind of impact it would have on performance. On the one hand, it would slow things down, both because of the additional overhead of the savepoint catch and release, and because you couldn't make use of insert-manys to avoid the overhead associated with individual inserts. On the other hand, you wouldn't have to process the whole batch, try a commit, potentially roll back, and then commit each record individually as you march through the batch a second time, segregating and reporting on individual record errors. Perhaps the place to invoke savepoints would be on batches containing errors, for use on the second pass. That would allow you to fully optimize the initial batch load when no errors occur, and minimize the overhead of committing each record individually in the event that the batch save initially fails. The logic might be a little complex for module developers to implement, but if they have a well-documented model to follow, they could maybe just drop it in. |
| Comment by Dale Arntson [ 02/Jul/19 ] |
|
By the way, because of a bug in my inventory loader code, I inadvertently benchmarked batch commits of 1,000 records at a time vs individual record commits, and the latter took about 15 times as long; that's with no indexing or trigger overhead. |
| Comment by Julian Ladisch [ 03/Jul/19 ] |
|
This bash script runs the different types of database inserts against the instance table of mod-inventory-storage v15.5.1 created for tenant diku, including all triggers and indexes.

N=10000
VALUE="repeat('bar', 9999)"
p() {
  PGPASSWORD=postgres time -p psql -h localhost -p 5433 -U postgres | tail -4
  echo
}
delete() {
  echo "delete from diku_mod_inventory_storage.instance;"
}
one_insert() {
  echo "insert into diku_mod_inventory_storage.instance (_id, jsonb) values (md5('' || generate_series(1, 10000))::uuid, jsonb_build_object('foo', $VALUE));"
}
multi_inserts() {  # single transaction
  echo "begin;"
  for i in `seq $N`; do
    echo "insert into diku_mod_inventory_storage.instance (_id, jsonb) values (md5('' || $i)::uuid, jsonb_build_object('foo', $VALUE));"
  done
  echo "end;"
}
savepoints() {  # transaction with savepoints
  echo "begin;"
  for i in `seq $N`; do
    echo "savepoint s;"
    echo "insert into diku_mod_inventory_storage.instance (_id, jsonb) values (md5('' || $i)::uuid, jsonb_build_object('foo', $VALUE));"
    echo "release savepoint s;"
  done
  echo "end;"
}
commits() {  # no transaction, uses auto-commit
  for i in `seq $N`; do
    echo "insert into diku_mod_inventory_storage.instance (_id, jsonb) values (md5('' || $i)::uuid, jsonb_build_object('foo', $VALUE));"
  done
}
delete | p
one_insert | p
delete | p
multi_inserts | p
delete | p
savepoints | p
delete | p
commits | p

Results:
3.04 s one_insert
10.09 s multi_inserts
12.59 s savepoints
19.88 s commits

The differences are small. |
| Comment by Jon Miller [ 03/Jul/19 ] |
|
Thanks for testing, Julian Ladisch. The difference may seem small when you are only doing 10,000 inserts; however, it adds up when you are inserting 10 million. It could be a 2-hour load versus 6 or more. According to your own numbers, the single insert is 3 times faster than the next option. Also, it doesn't look like the data you are loading is real data. I'm not sure exactly what PostgreSQL does behind the scenes with respect to storing the jsonb; it may be storing it relationally behind the scenes. You might need to store data using the proper JSON schema to make it more realistic. |
| Comment by Anatolii Starkov [ 18/Jul/19 ] |
|
Hi there |
| Comment by Ian Walls [ 19/Jul/19 ] |
|
I think we should also look at the COPY ... FROM function in Postgres for this purpose. Others have found significant performance gains even over transaction-wrapped inserts; Stack Overflow has some analysis as to why this is the case. I've not seen any particularities around using this for JSONB columns... For certain orders of magnitude, I agree that we may not get the most time savings out of this technique; reducing our number of HTTP requests and optimizing the business logic are also going to be important factors. But as Jon Miller says, when you get up into the multi-million record range, a 3x or 10x speed-up really makes a difference. In terms of business logic, are we expecting this API to validate each individual record? That would add a significant amount of time per record to the process, and then we'd need to deal with the "bad" records in some reasonable way. Perhaps the bulk API could include a "validate" switch (defaulting to 'true')? This way, for migration, tooling could handle the validation as part of the transformation process, but for ongoing data loads (such as patron imports), which should be of a lower order of magnitude, validation could be performed. Similar to JSON schema validation, an option could also be provided for record matching, and on what set of fields (default: UUID if extant). The initial migration would not need this, and significant time could be saved by not touching each record, but ongoing imports could perform this matching and react accordingly. |
| Comment by Ian Walls [ 19/Jul/19 ] |
|
Also, I think it's important to point out that multi-million record loads are probably only going to happen in the bibliographic context; few (perhaps no?) libraries have millions of patrons, vendors, open loans/holds or historical circ transactions they want to import. We're really talking about being able to load our MARC records here (or MODS, METS or other schema....), and there are plenty of existing tools for validating those records. |
| Comment by Jenn Colt [ 19/Jul/19 ] |
|
There will be millions of item records, not in marc. |
| Comment by Jon Miller [ 20/Jul/19 ] |
|
If anyone wants to test COPY performance, the app I have at https://github.com/jemiller0/Folio uses COPY FORMAT BINARY when executing in SQL mode (the default). It can also be used for testing using the web API. Currently, when in web API mode, it is just using the single object POST methods. As multi-object POST methods are implemented on the back end, I will update my app to use them. Currently, there is one entity that supports multi-object POSTs, records in the Source Record Storage module. That is the closest thing currently available that I could use. The application is generated by a scaffolding generator app that I wrote. This is why I am hoping that FOLIO implements the batch APIs in a consistent standard way. That will allow me to easily generate client-side code for whatever entities support multi-object POST. Regarding using COPY in the web API, while it's the fastest, it's also the most dangerous. I think it may ignore foreign key constraints. Also, it ignores triggers. FOLIO uses triggers to set some columns with data pulled from the JSONB columns. The app that I wrote, sets those columns explicitly. Also, I think transaction logging is disabled when using COPY. I think it may be possible to corrupt the table structure if you do things like set data types incorrectly. It may be better/safer to stick to using multi-object INSERTs since this is a general purpose API that could be used at any time, not just during migration. Also, regarding validation, the app that I created allows you to test with and without JSON validation to see the difference in speed. That is mainly useful when running it in SQL mode. I should note that the app that I created also allows you to save all table data to files also (so, if you wanted to do a performance test of COPY BINARY FORMAT, you can load your data by whatever method you want, save it to files using my app, then, load it back in). Currently, it does this using a single API call when in web API mode. 
It streams the data on the client. However, I think most if not all of the FOLIO web APIs build up the result set in RAM first on the server-side. I am hoping that the FOLIO APIs switch to using streaming for everything. I don't want to make it page the results because I want everything to come back in a single database query in a single transaction for consistency. I.e. the data will be a snapshot of what it was when the query was executed. The logic to return the results is much simpler that way, a simple foreach. There have been a number of bugs in the FOLIO UI that have occurred because developers didn't page the data. IMHO, it would be better for paging to be opt-in instead of by default. It might be worth including query streaming in this issue as well. Streaming support should be added at the same time as bulk insert/update/delete. Regarding how many entities could be in the millions, the ones that I know of are instances, holdings, and items. I want to say loans can be quite large also. Orders and order items can have a lot also. I don't know how RAML works, but, if the bulk methods can be generated, I think all the entities should support the bulk APIs. Maybe for some of the small lookup tables, it might seem kind of dumb, but, in general, I like the idea of having a consistent interface across the board. Of course, if you have to implement them manually, then it is probably not worth the effort for bulk insert/update/delete. IMHO, for queries, they should all just be switched to stream unless the logic is a lot more complicated. |
| Comment by Jon Miller [ 20/Jul/19 ] |
|
In the document that Anatolii Starkov referenced, it mentions using POST for both INSERT and UPDATE and something about "idempotent principle." I don't know why update would be different for multiple objects versus single. Personally, I think it should use PUT for update, the same way the single version one does. I think the URLs should be kept as simple as possible, including the name of the operation in the URL seems ugly. |
| Comment by Jon Miller [ 20/Jul/19 ] |
|
If partial success support is included, I think there should be a query parameter to control whether that functionality is enabled or not. That goes above and beyond how transactions work if you are using SQL directly. A developer may not want partial success. |
| Comment by Anatolii Starkov [ 23/Jul/19 ] |
|
Hi Jon Miller, Thanks for your comment. |
| Comment by Jon Miller [ 23/Jul/19 ] |
|
Anatolii Starkov I am not an expert on REST principles or idempotence, so maybe I just don't understand what you are talking about, but why would updating an array of objects be any different from updating one? Using POST seems inconsistent, counter-intuitive, and confusing to me. When you say "we can not guarantee that UPDATE of a batch of items is idempotent", what do you mean by that? I'm assuming you mean it should give the same results if you call it more than once? I don't know why that wouldn't be true, the same way it is for single-object updates. Technically, most of the objects have timestamp values in them that will change every time (i.e. dateUpdated is set to the current time). Does that mean we have to use POST for all updates across the board, even for single-object updates? |
| Comment by Jon Miller [ 23/Jul/19 ] |
|
Also, regarding the URLs, it would be nice if there was a single endpoint that handled both single objects as well as arrays of objects. Then, you could just have something like mod-orders-storage/orders and use the same URL for either single or multiple inserts. The back end could just check if the input is an array and process it accordingly. It wouldn't be that unlike how GET works where if you supply an id in the URL it returns a single object and returns an array without it. Basically, I think the URLs should be kept as simple, clean, and consistent as possible. |
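The single-endpoint idea Jon describes is essentially a dispatch on the payload shape. A minimal sketch (Python for illustration; this is not FOLIO code, and `create_one` is an invented stand-in for the existing single-object handler):

```python
def handle_post(body, create_one):
    """One endpoint for both single objects and arrays: if the parsed JSON
    payload is an array, process each element with the single-object logic;
    otherwise, handle it as a single object."""
    if isinstance(body, list):
        return [create_one(record) for record in body]
    return create_one(body)
```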
| Comment by Anatolii Starkov [ 25/Jul/19 ] |
|
I am with you, Jon Miller, but I can't help thinking that some module in a chain of a request could make some calculation before updating a batch of entities. Anyway, I just explained Taras Spashchenko's and my thoughts in the How to design batch API (General recommendations) |
| Comment by Dale Arntson [ 29/Jul/19 ] |
|
Thanks for the write-up Anatolii Starkov. If I could comment on your previous post: I think the use of batch processing with business logic APIs will only work if there is no state shared between the individual calls. Lacking that, it seems to me they can't be idempotent. But for data migration, I think this is mostly not an issue. Data migration works best, in my opinion, when the data can be mapped from one data model directly to another, or in this case from the data model of the migrating institution to the Folio data model. The storage APIs are ideal for that kind of mapping, and they are much simpler to think about because there is no intervening business logic that you have to understand (often internally) before you can know how the data will be mapped. Bulk processing using the storage APIs does not share state because you are updating only one row of one table at a time. The only exception that needs to be considered here, and only because of the complexity of the processing involved, is the use of the BL APIs to process the legacy MARC data. But those tables could also be populated using the storage APIs, as long as the equivalent processing was done externally in the mapping code. Or maybe I am missing some portion of your argument. |
| Comment by Ian Walls [ 30/Jul/19 ] |
|
It seems to me that the simplest and most straightforward thing to do is update RMB to handle arrays of JSON objects in a naive loop. I think (and I'm certainly willing to be corrected if I'm misunderstanding RMB) that this is close enough to the boilerplate/templating kind of thing that RMB does that it could be built in. Apps that need savvier handling of this array of objects could re-implement the logic after the module is initially built. The question still remains as to whether the API endpoint should respond promptly or after completion. For small enough arrays, this waiting wouldn't be a problem, but once you start getting into the higher orders of magnitude (10^N where N > 3 or 4), the lack of feedback becomes problematic. Perhaps BL modules and storage modules need to be handled differently: BL modules could implement the naive loop (which would include the logic) as a separate process that can be polled for status updates by the client (I'm assuming that web sockets are outside the FOLIO paradigm at this point). This would allow for error handling and keep batch requests from blocking the client. Storage modules, on the other hand, would benefit from clumping DB writes into transactions of a certain size, or using COPY ... FROM. They could be a little more trusting of input (provided the request is properly permissioned), perhaps skipping steps that require processing individual records in the array (i.e. validation), since it would be assumed that either the BL module or a migration toolkit was handling the 'deep thinking'. It would also not be in a position to extract sub-records (i.e. holdings/inventory information from MARC), again leaving that to the BL module and its processing of individual records. |
| Comment by Jon Miller [ 30/Jul/19 ] |
|
Ian, the web API would operate in a synchronous manner, not asynchronous. I'm assuming it will be asynchronous internally (for example if it is using Vert.x), but the web API call itself will be synchronous and will not return until the operation has been completed. The API isn't intended to have more than, say, 1,000 objects sent to it at a time; maybe 10,000 max. The operation will be wrapped in a database transaction, and you don't want transactions to stay open for a long time. Normally, this would cause locking contention, though that may be less of an issue with PostgreSQL due to how it's implemented. In general, I think it is a good idea to keep transactions to a reasonable size; the transaction size doesn't have to be huge to speed things up a lot. Also, it isn't just the HTTP requests that slow things down: create a program that loads a lot of data directly using SQL with a COMMIT for each row, and you will see what I'm talking about. I know this for a fact, as I have written many applications over the years to load and work with bulk data; it's one of the main things I do with our current system. Also, with regard to large transactions, the DBMS probably uses temp space for them. That is probably another reason not to have a huge transaction size. |
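Jon's point about keeping transactions to a reasonable size can be sketched as chunked commits: commit once per chunk of rows rather than once per row (or once for the entire load). This is an illustrative sketch using in-memory SQLite as a stand-in for PostgreSQL; the table and chunk size are hypothetical.

```python
# Sketch: batching inserts into moderately sized transactions instead of
# committing per row. SQLite stands in for PostgreSQL here; real FOLIO
# storage modules go through RMB, not this code.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE item (id INTEGER PRIMARY KEY, title TEXT)")

rows = [(i, f"title-{i}") for i in range(10_000)]

def load_in_chunks(rows, chunk_size=1000):
    """Commit once per chunk: far fewer fsyncs than one commit per row,
    without holding a single giant transaction open for the whole load."""
    for start in range(0, len(rows), chunk_size):
        chunk = rows[start:start + chunk_size]
        with conn:                       # one transaction per chunk
            conn.executemany("INSERT INTO item VALUES (?, ?)", chunk)

load_in_chunks(rows)
count = conn.execute("SELECT COUNT(*) FROM item").fetchone()[0]
assert count == 10_000
```

The chunk size is the knob Jon describes: large enough to amortize per-commit overhead, small enough to limit lock hold time and the temp space a huge transaction would consume.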
| Comment by Ian Walls [ 30/Jul/19 ] |
|
Jon, I think we're still early enough in the design phase that we could decide whether the batch processing is done synchronously or asynchronously. If we're going with synchronous, then there would need to be a mechanism for enforcing array size limits, lest we start running afoul of timeouts. In the context of my last comment, where we treat BL and storage modules differently, I think having the asynchronous job model for the BL module makes sense, since we need to process each object individually and that can take an unknown amount of time (depending on the size of the array, the complexity of the logic, and the hardware resources provided). The BL module would then implement an aggregation/collection step to gather up a reasonable number of processed objects and submit them to the storage API in batch. This could be handled synchronously, because the BL would take care of defining the reasonable number (1,000 to 10,000, depending on config) and would be okay waiting for completion. If a developer wanted to bypass the BL module, and take responsibility for getting the data separated, cleaned, validated, and batched appropriately, then they could do so and address the storage API directly. |
| Comment by Jon Miller [ 30/Jul/19 ] |
|
I think there needs to be a synchronous version of the API for things like data migrations and bulk loading from command-line tools; I've mainly been thinking about it from that perspective. Though I can see how one might want it to be asynchronous if the client was a JavaScript app running in a browser, where you don't want to block the UI thread. If asynchronous is added in addition to synchronous, I don't have a problem with that; otherwise, I don't think asynchronous is appropriate for a command-line tool. Offhand, I am not sure how one would go about implementing it. Do you get back a token and then poll until the request is complete? I haven't used it personally, but I think there are ways to use an observer pattern so that you are notified when an operation is completed. I'm not a JavaScript programmer, so I can't comment on what the best way to do that is; I just know that is not what I'm looking for in the application I'm developing. |
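The poll-a-token pattern Jon asks about can be sketched as follows: the server accepts the batch and immediately returns a job id, and the client polls a status endpoint until the job reports completion. Everything here is hypothetical illustration; none of these names are existing FOLIO APIs.

```python
# Sketch: asynchronous batch submission with a pollable job token.
# submit_batch / process_step / poll are hypothetical names; in a real
# module the processing would run in the background, not via explicit calls.
import itertools

jobs = {}
_ids = itertools.count(1)

def submit_batch(records):
    """Server side: accept the batch and return a token right away."""
    job_id = f"job-{next(_ids)}"
    jobs[job_id] = {"status": "in_progress", "processed": 0, "total": len(records)}
    return job_id

def process_step(job_id):
    """Server side: process one record per call (simulating async work)."""
    job = jobs[job_id]
    job["processed"] += 1
    if job["processed"] == job["total"]:
        job["status"] = "complete"

def poll(job_id):
    """Client side: check the job's current status."""
    return jobs[job_id]["status"]

token = submit_batch([{"id": i} for i in range(3)])
while poll(token) != "complete":
    process_step(token)   # stands in for work the server does on its own
assert poll(token) == "complete"
```

For a command-line migration tool, the same client code degenerates to a blocking poll loop, which is why Jon's synchronous variant is arguably simpler for that use case.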
| Comment by Ian Walls [ 30/Jul/19 ] |
|
Can't find it in this comment thread, but I remember some discussion at one point of 'upsert' and which HTTP verb to use. Apparently, it's PUT: https://stackoverflow.com/questions/18470588/in-rest-is-post-or-put-best-suited-for-upsert-operation. So I'm just going to leave this here. |
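The PUT-as-upsert semantics from that link can be shown in a few lines (a sketch over a hypothetical in-memory store, not FOLIO code): PUT to a specific resource creates the record if absent and replaces it if present, and either way the end state after repeated identical calls is the same.

```python
# Sketch: PUT as create-or-replace (upsert). The conventional status codes
# are 201 Created on first write and 200 OK on replacement; the store and
# function names are hypothetical.
store = {}

def put(record_id, body):
    existed = record_id in store
    store[record_id] = body          # create-or-replace
    return 200 if existed else 201

assert put("a", {"title": "First"}) == 201   # created
assert put("a", {"title": "First"}) == 200   # replaced; same end state
assert store["a"] == {"title": "First"}
```

Only the status code differs between the two calls; the resulting resource state is identical, which is what makes PUT the natural idempotent fit for upsert.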
| Comment by Julian Ladisch [ 19/Aug/19 ] |
|
If we implement all the client options
|
| Comment by Julian Ladisch [ 19/Aug/19 ] |
|
I don't see that there is a use case where the server should generate the id. |
| Comment by Ian Walls [ 20/Sep/19 ] |
|
I'd like to move this forward, so I'm revisiting the framing questions with the answers we've discovered in our discussion:
Size limits: I'd advocate that we leave this to the client to manage. The POST size will be limited by the server configuration (1MB to 2GB, typically), and the tool using the API will better understand what batching is appropriate. But if we wanted a hard maximum on how many records the system will accept at a time in order to protect resources, I'd be willing to accept such a limit.
Sync/async: The API should be synchronous, at least when working with the storage modules. For BL modules, there may be a case for asynchronous processing, but that can be deferred to the BL module to implement, outside of this standard practice.
DB transaction method (for storage modules): I believe that for efficiency, we need to use either a single transaction or COPY ... FROM. Both of these are a solid order of magnitude faster than other techniques.
Complete/partial load: Due to the above transaction methods being recommended, complete load is the only viable option. That means if we have a problematic record in the middle of the array, the whole create/update will fail. While annoying, this makes idempotency easier, since the whole API request has a single result instead of record-by-record statuses. This leaves validation/integrity checks to the BL or migration toolkit.
Streaming: I believe this is now part of RMB, so let's use what we have now!
Semantics/business logic: I think this should be kept outside the bulk API, since cracking open each record in a batch to inspect it/act on it will significantly slow down the processing, and after all, speed is what we're going for here.
If there are no further objections or clarifying questions, shall we consider this wrapped up and move it forward towards implementation? |
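The "complete load only" decision above can be sketched as an all-or-nothing batch insert in a single transaction: one bad record rolls back the entire batch, so the request has exactly one outcome. In-memory SQLite stands in for PostgreSQL here, and the schema and data are hypothetical.

```python
# Sketch: all-or-nothing batch create in a single transaction. One invalid
# record poisons the whole batch and nothing is persisted. Illustrative
# only; actual FOLIO storage modules implement this through RMB.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE instance (id TEXT PRIMARY KEY, title TEXT NOT NULL)")

def batch_create(records):
    try:
        with conn:  # one transaction for the whole batch
            conn.executemany(
                "INSERT INTO instance VALUES (:id, :title)", records)
        return True
    except sqlite3.IntegrityError:
        return False  # whole batch rejected; nothing was persisted

bad_batch = [
    {"id": "1", "title": "ok"},
    {"id": "2", "title": None},  # violates NOT NULL: poisons the batch
]
assert batch_create(bad_batch) is False
assert conn.execute("SELECT COUNT(*) FROM instance").fetchone()[0] == 0

good_batch = [{"id": "1", "title": "ok"}, {"id": "2", "title": "also ok"}]
assert batch_create(good_batch) is True
assert conn.execute("SELECT COUNT(*) FROM instance").fetchone()[0] == 2
```

This is also why the single-outcome design aids idempotency: a retried batch either fully applies or fully fails, with no partially written state to reconcile.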
| Comment by Ian Walls [ 04/Oct/19 ] |
|
There have been no further comments or objections in the last 2 weeks, so I'm going to advance this. |
| Comment by Jon Miller [ 04/Oct/19 ] |
|
Sounds good to me. Though, could someone post a template of what the proposed implementation is? I would like to have an opportunity to provide feedback before it becomes immutable; a lot of this stuff seems to instantly become production code that can't be changed. |
| Comment by Jakub Skoczen [ 21/Oct/19 ] |
|
Ian Walls Jon Miller I have distilled this spike into an initial implementation ticket for the Platform team:
|
| Comment by Marc Johnson [ 27/Dec/19 ] |
|
Given that it appears decisions have been made about how folks want the batch APIs to work and a reference implementation completed, should this issue be closed? Could it still be useful to share thoughts, questions, concerns etc on this issue? |
| Comment by Ann-Marie Breaux (Inactive) [ 27/Dec/19 ] |
|
Hi Marc Johnson I don't have an opinion on whether this should be closed, but either in the context of this ticket or the reference implementation, it would be great if there were some documentation regarding any decisions or POC that could be linked/surfaced for developers and library sys admins. |
| Comment by Marc Johnson [ 27/Dec/19 ] |
|
Ann-Marie Breaux Thanks
I agree, I think it would be useful for this decision to be documented as an API standard. I imagine this would be on the developer website or wiki. There have been some mentions of these decisions only being partially applicable (e.g. storage only). I think it would be useful if that was also reflected in the documentation. My understanding is that the decisions are what Ian Walls stated in this comment and that the reference implementation was
I don't know how well that reference implementation concurs with Ian Walls' comment, or if some things changed as it was being implemented. |
| Comment by Jon Miller [ 27/Dec/19 ] |
|
I was going to ask where the API documentation is for the initial APIs, but I see that it is the ones listed as "sync" in inventory (instances, holdings, and items). It looks like it is simply a JSON array, which is basically all I wanted; I will give them a try. One other thing I'm wondering about: does anyone know what the status of streaming of query results is (queries, not add/update)? I don't recall if this was a separate JIRA or which one it is. |
| Comment by Jon Miller [ 27/Dec/19 ] |
|
I just noticed that there is a top-level property in the request JSON. I'm wondering if that serves a purpose? Why not just have an array with no top-level property?
{
"$schema": "http://json-schema.org/draft-04/schema#",
"description": "A collection of instance records",
"type": "object",
"properties": {
"instances": {
"description": "List of instance records",
"id": "instances",
"type": "array",
"items": {
"type": "object",
"$ref": "instance.json"
}
}
},
"additionalProperties": false,
"required": [
"instances"
]
}
|
| Comment by Jon Miller [ 05/Jan/20 ] |
|
Can you please remove the top-level property and just make it an array? The current implementation makes extra work with regard to generating client code for these methods. For example, in the case of holdings, the URL is "/holdings-storage/batch/synchronous", while the property name is "holdingsRecords" rather than just "holdings". |
| Comment by Julian Ladisch [ 07/Jan/20 ] |
|
The property name "holdingsRecords" is consistent with the property name of the GET /holdings-storage/holdings API: https://s3.amazonaws.com/foliodocs/api/mod-inventory-storage/p/holdings-storage.html#holdings_storage_holdings_get_response |
| Comment by Julian Ladisch [ 07/Jan/20 ] |
|
The dedicated array property
{
"instances": [
{ "id": ... },
{ "id": ... },
...
]
}
makes it easy to add additional properties and most clients will continue to work:
{
"priority": "low",
"instances": [
{ "id": ... },
{ "id": ... },
...
]
}
|
| Comment by Julian Ladisch [ 08/Jan/20 ] |
|
Jon Miller: For the status of GET batch/streaming see the "Issue Links" at
|