#1 Browsing by Contributors using dedicated Elasticsearch index
This approach is the same as browsing by subjects.
Pros | Cons |
---|---|
Easy to implement by reusing the existing code base | Requires additional space to store the dedicated index |
| | Requires additional code to manage update and delete events (each batch containing such an event refreshes the Elasticsearch index) |
#2 Browsing by Contributors using PostgreSQL table
This option browses through a PostgreSQL table that has an index on the contributors field.

    create table instance_subjects
    (
      subject     text not null,
      instance_id text not null,
      constraint instance_subject_pk primary key (subject, instance_id)
    );

    create index instance_subjects_subject
      on diku_mod_search.instance_subjects (lower(subject));

Insertions can be batched by configuring Spring Data JPA:

    insert into instance_subjects(instance_id, subject)
    values (?, ?)
    on conflict do nothing;

Hibernate batching properties:

    spring:
      jpa:
        properties:
          hibernate:
            order_inserts: true
            order_updates: true
            jdbc.batch_size: 500

Entity mapping with the custom insert statement:

    @Data
    @Entity
    @NoArgsConstructor
    @Table(name = "instance_subjects")
    @AllArgsConstructor(staticName = "of")
    @SQLInsert(sql = "insert into instance_subjects(instance_id, subject) values (?, ?) on conflict do nothing")
    public class InstanceSubjectEntity implements Persistable<InstanceSubjectEntityId> {

      @EmbeddedId
      private InstanceSubjectEntityId id;

      // Always report the entity as new so that Spring Data calls persist (a plain insert)
      // rather than merge; duplicates are skipped by "on conflict do nothing".
      @Override
      public boolean isNew() {
        return true;
      }
    }

    @Data
    @Embeddable
    @NoArgsConstructor
    @AllArgsConstructor(staticName = "of")
    public class InstanceSubjectEntityId implements Serializable {

      private String subject;
      private String instanceId;
    }

Browsing backward, returning up to `:limit` values that sort before `:anchor`:

    select subject, count(*)
    from instance_subjects
    where subject in (
      select distinct on (lower(subject)) subject
      from instance_subjects
      where lower(subject) < :anchor
      order by lower(subject) desc
      limit :limit
    )
    group by subject
    order by lower(subject);

Browsing forward, returning up to `:limit` values starting at `:anchor`:

    select subject, count(*)
    from instance_subjects
    where subject in (
      select distinct on (lower(subject)) subject
      from instance_subjects
      where lower(subject) >= :anchor
      order by lower(subject)
      limit :limit
    )
    group by subject
    order by lower(subject);

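The two queries above return the page just before the anchor and the page starting at it. The same window logic can be sketched in memory, assuming the values are already lowercased and the per-value counts sit in a sorted map. The class and method names are hypothetical and not part of mod-search.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;

// Hypothetical in-memory model of the two browse queries; not mod-search code.
final class BrowseWindow {

  // Mirrors the backward query: up to `limit` values strictly below the anchor,
  // picked closest-first and then returned in ascending order.
  static List<Map.Entry<String, Integer>> preceding(
      NavigableMap<String, Integer> counts, String anchor, int limit) {
    List<Map.Entry<String, Integer>> page = new ArrayList<>();
    for (Map.Entry<String, Integer> e : counts.headMap(anchor, false).descendingMap().entrySet()) {
      if (page.size() == limit) {
        break;
      }
      page.add(new AbstractMap.SimpleImmutableEntry<>(e.getKey(), e.getValue()));
    }
    Collections.reverse(page);
    return page;
  }

  // Mirrors the forward query: up to `limit` values at or above the anchor.
  static List<Map.Entry<String, Integer>> following(
      NavigableMap<String, Integer> counts, String anchor, int limit) {
    List<Map.Entry<String, Integer>> page = new ArrayList<>();
    for (Map.Entry<String, Integer> e : counts.tailMap(anchor, true).entrySet()) {
      if (page.size() == limit) {
        break;
      }
      page.add(new AbstractMap.SimpleImmutableEntry<>(e.getKey(), e.getValue()));
    }
    return page;
  }
}
```

Concatenating `preceding(...)` and `following(...)` gives one browse page centred on the anchor, which is presumably how the two SQL results would be combined as well.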
Pros | Cons |
---|---|
Fast to query (faster than the other options) | Requires additional space to store the dedicated table and index (~1 GB per million resources) |
Easy to manage update and delete events |
#3 Browsing by Contributors using PostgreSQL index and Elasticsearch terms aggregation
This approach can be implemented in two steps:
- Create a new PostgreSQL index on the lowercase contributor values in the instance JSON and browse by it
- Retrieve the counts with an Elasticsearch terms aggregation per contributor
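For the second step, the counts for the contributors returned by the PostgreSQL browse could be fetched with a single terms aggregation restricted to those values. The sketch below only builds the request body. The field name `contributors.name` and the class name are assumptions, and the actual mapping may differ.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper that builds a terms aggregation body; not mod-search code.
final class ContributorCountsQuery {

  // Assumed field name; replace with the actual keyword field from the index mapping.
  static final String FIELD = "contributors.name";

  // Escapes backslashes and double quotes so a value can be embedded in a JSON string.
  // Control characters are not handled in this sketch.
  static String quote(String value) {
    return "\"" + value.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
  }

  // Builds a request that returns no hits, only one bucket per listed contributor.
  static String build(List<String> contributors) {
    String include = contributors.stream()
        .map(ContributorCountsQuery::quote)
        .collect(Collectors.joining(","));
    return "{\"size\":0,\"aggs\":{\"contributors\":{\"terms\":{"
        + "\"field\":" + quote(FIELD) + ","
        + "\"size\":" + Math.max(contributors.size(), 1) + ","
        + "\"include\":[" + include + "]}}}}";
  }
}
```

An exact-value `include` list matches terms as they are indexed, so the values passed in must have the same casing as the indexed terms.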
Pros | Cons |
---|---|
Can be slightly better than option #1, because no dedicated index or table has to be stored and managed | Additional load on the existing data storage and mod-inventory-storage |
No need to manage update and delete events | Additional index can slow down document indexing for mod-inventory-storage |
| | Slower than option #2 |
#4 Browse by range query and numeric representation of incoming value
This approach can be modelled on Call-Number browsing. The main idea is to derive a numeric (long) value from each contributor string and use it in a range query, which limits the number of documents that have to be aggregated.
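One way to derive such a value is to read the first few characters as digits of a base-37 number, which keeps the numeric order consistent with the alphabetical order of the prefixes. The sketch below uses an assumed alphabet and prefix length. The actual Call-Number algorithm may differ, and the class name is hypothetical.

```java
// Hypothetical string-to-long mapping for range queries; not the Call-Number algorithm itself.
final class ContributorNumericKey {

  // Alphabet size: 0 for padding and any other character, 1-10 for digits, 11-36 for letters.
  private static final int BASE = 37;

  // 37^12 is below Long.MAX_VALUE, so 12 positions fit into a signed 64-bit long.
  private static final int MAX_LENGTH = 12;

  private static int code(char c) {
    char lower = Character.toLowerCase(c);
    if (lower >= '0' && lower <= '9') {
      return 1 + (lower - '0');
    }
    if (lower >= 'a' && lower <= 'z') {
      return 11 + (lower - 'a');
    }
    return 0;
  }

  // Encodes the first MAX_LENGTH characters; shorter strings are padded with zeros.
  static long toLong(String value) {
    long result = 0;
    for (int i = 0; i < MAX_LENGTH; i++) {
      int digit = i < value.length() ? code(value.charAt(i)) : 0;
      result = result * BASE + digit;
    }
    return result;
  }
}
```

A range query such as "numeric value >= `toLong(anchor)`" would then narrow the documents before aggregation. Strings that share the first 12 characters, or differ only in characters mapped to 0, get the same value, so the final ordering still has to compare the original strings.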
Items to check/investigate:
- Retrieve the number of instances per contributor
- Decide how to handle redundant contributors in the terms aggregation result (script query?)
Pros | Cons |
---|---|
Approximately the same performance as Call-Number browsing | An additional numeric value must be stored in each document for each contributor |
No need to store dedicated Elasticsearch index, PostgreSQL table or index |