Provenance Meets Big Data: Will they have a future together?

A paper read by Jim Hinck before the Symposium “Who Owned This? Libraries And The Rare Book Trade Consider Issues Surrounding Provenance, Theft And Forgery.” at the Grolier Club on Tuesday 5 March, 2019.

When I first read the announcement for this symposium it reminded me that the ABAA had also been behind what was, I believe, the first effort by a bookselling association to establish a database specifically for recording missing or stolen old and rare books.

I thought of this because, as it happened, I was involved on the technical side of that project.

As a consequence I have, ever since, had a strong personal interest in how new database applications can be useful in the prevention and recovery of stolen rare books. That is what I want to talk about today.

The early effort I referred to was not particularly ambitious and ultimately did not get very far. It started in the early 1990s in the wake of several serious library thefts. They were discovered, losses were identified, but recovery was slow.

Publicity regarding the losses was sometimes suppressed for a variety of reasons we are all too familiar with. Distributing lists of lost treasures was, understandably, not always something that library directors were eager to do. But that does not mean that recovery was not a priority. To the contrary, it was always a high priority. But all too often the items were not recovered until they were publicly offered for sale and then claimed by the library as stolen property.

At this point the original thief was usually no longer involved, and the seller who offered them openly to the public had no reason to suspect that the items had been stolen. In many cases it was a legitimate bookseller who ended up holding the bag and suffering the financial loss.

This has always been one of the great risks in being an antiquarian bookseller. If you were offered an item whose provenance was at all uncertain there was no place to go to check whether anyone had reported it to be lost or stolen.

It was a problem the ABAA thought it needed to do something about. The office at Rockefeller Center was already receiving lots of stolen book reports – from booksellers, libraries and even collectors – but it did not have an effective system to put order into this information once it was received and, most critically, no practical way to enquire whether a given book of uncertain provenance had already been reported missing.

What was needed was a database to record this information and make it searchable.

It was also about this time that the International Foundation for Art Research (IFAR) was working to establish the Art Loss Register, which was to be a database of stolen art, antiques and collectibles. The ABAA was very aware of this new project and recognised the relevance to its own situation. The need for a Stolen Book Register along similar lines was clear to all. The ABAA decided it needed to build one.

The database software it chose to use for this project was an obscure desktop program called DataPerfect. It was already being used in the ABAA headquarters by the association secretary, so it made perfect sense to use it for building the new Stolen Book Register as well.

Which is how I got involved. By chance, I was, at the time, working on a cataloguing program for booksellers. It was also being built with DataPerfect, a program with which I had a lot of experience. The ABAA secretary was in charge of the project, but I also pitched in and was happy to help.

Shortly after this project began we recognised that, to be effective, we needed to have something that would accommodate the needs of librarians as well as booksellers. They were, after all, the ones we expected to be reporting the largest and most serious thefts.

The original plan was quite simple. When books were discovered to be stolen or missing they would still be reported to the ABAA headquarters, individually or in lists. The ABAA secretary would then manually enter them into the database one at a time. Then, if someone needed to check on a book of uncertain provenance they would have to phone the ABAA office and ask the secretary to check it for them.

It didn’t occur to us that there would be any other way to go about it. But once consultation with librarians began another database came into view: OCLC. Why, the librarians wondered, should we build a separate database system for stolen books when something already existed that could be adapted to this purpose instead? The librarians were understandably concerned about the additional workload that would be required for manually compiling lists and sending them to the ABAA. And, of course, OCLC was already networked and accessible to nearly all the institutions that might have thefts to report. The data, for the most part, had already been entered. All that was needed was to add a new field, or adapt an existing one, and then use it to quickly flag stolen items as soon as they were discovered. This is what the librarians wanted, and the ABAA was happy to oblige.

Exit DataPerfect. And, of course, exit Jim Hinck.

At this point the project became much more ambitious. I was no longer involved and only learned about its progress from hearsay. The intentions were good, but so far as I know, in the end nothing happened, and the project died. I can only speculate as to why.

There was, of course, a committee, and we know what happens with them. Perhaps more significantly, there was also corporate OCLC to work with. They had a product to sell. I imagine that requests for significant changes to that product – changes without added income – would not have been likely to receive much support. And I’m sure there were other obstacles as well. But I would still be curious to hear the post-mortem from anyone who may have been involved at the end.

My guess is that money was the major obstacle, but knowing what I do about databases I think the real problem would have come when they ultimately recognised that strapping a stolen book database onto something designed for other purposes wouldn’t really work. You see, the critical data for stolen books is copy-specific. OCLC, on the other hand, is designed to treat books generically. It was the wrong tool for the job at hand. So we are probably all better off that this approach was abandoned when it was.

At this point I would be remiss not to mention that we do now have a stolen-book database. It is easily accessible through the ILAB web site. You should all be familiar with it by now. There are also sites operated by the ABA and ABAA. Interestingly, these are, at least in concept, not very different from what the ABAA first set out to build in the early nineties, before being distracted by the lure of OCLC. The big difference, of course, is the addition of internet access to the ILAB data, something that wasn’t possible in the 1990s. But there is, to my knowledge, no direct succession from one project to the other.

Sadly, there was already plenty of stolen book data available when the ABAA project began, but without a database to store it in the accumulated data was inevitably lost. That is a great shame. Records of this sort have no expiration date. They would still be useful if they were available today. ILAB’s reports, however, only go back to 2003. What was recorded before then we presume to be lost forever.

For the recovery of missing books a stolen books register, such as the one provided by ILAB, is a valuable tool. The ILAB database is, however, limited to only one category of data: missing books.

But there is also a second category of book data that I would like to consider – one that may be even more useful to us – because it covers not only what is already known to be lost, but also what has been found.

To explain what I mean here I can use as an example the recently discovered thefts at the Carnegie Library in Pittsburgh, which I assume we are all familiar with. In the course of his investigations, one of the detectives working on the Carnegie Library theft discovered that one of their missing books had appeared in the 563 years section of the viaLibri website. This is the place where we display a chronological list of selected interesting books that have been found using the viaLibri search engine. The detective wrote to me and asked if we had similar information about any other books.

Well, as a matter of fact, we did.

We have actually been archiving information of this nature for almost a dozen years. At first, we started doing it fairly selectively, but we have, over time, become more and more comprehensive in the data we keep.

As a result, we have now recorded in our archives the details of more than 50 million items. After some recent updates, we are now adding more than half a million new items every day.

I was, of course, happy to make this data available to the detective in Pittsburgh.

Once we were given the name of a suspect – in this case a bookseller – it became possible to identify books from that dealer that had, over a period of several years, appeared on our site. By comparing those results with the list of missing books the investigators were able to identify 37 stolen items that the suspect had, at one time or another, offered for sale. These would be a good example of what I referred to previously as “found books”.

In this case the information came from what I consider to be an especially useful source of digital data: an historical database of books that have been offered for sale in the online marketplace for old and rare books.

Just to be clear on this: What I am referring to here as an online marketplace is, in reality, nothing more than the combined listings of all the old and rare books that are searchable and for sale online at a given point in time. Because its content is constantly changing we may think of this marketplace as being ephemeral, but the books themselves are not at all ephemeral and there is no need for the data that describes them to be treated as such. Indeed, there is enormous value in this data, and good reason to preserve it, even after its original commercial purpose has been served.

As an example of this, we have already seen how historical data helped investigators discover what happened to some of the books missing in Pittsburgh. In that case, searching was fairly simple because the name of a suspect had already been connected with the missing books.

But what if there is, as yet, NO suspect to investigate? In that case, historical marketplace data may be able to help us find one.

For example, an investigator could search this retrospective marketplace data looking for the titles and authors of items already known to be missing. If any potential matching books were found they would necessarily be linked to an individual who had once tried to sell them. This could be useful to know, regardless of whether or not the item was still available for sale. Multiple matches from the same seller could be a good place to start looking. Of course, this might not be the actual thief. In fact, I would be surprised if it was. But it should, nevertheless, be someone who knew where their book had come from and, if they no longer had it, to whom it had been sold.
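The kind of search described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a description of any existing system: the record fields (`author`, `title`, `seller`), the function names and the `min_matches` threshold are all hypothetical, and matching on normalised author and title alone is a deliberately crude stand-in for the copy-specific comparison a real investigation would need.

```python
import re
from collections import defaultdict


def normalise(text):
    """Lower-case and strip punctuation so small cataloguing differences do not block a match."""
    return " ".join(re.sub(r"[^a-z0-9 ]", " ", text.lower()).split())


def sellers_with_matches(missing, listings, min_matches=2):
    """Group archived listings that match missing items by the seller who offered them.

    `missing` and `listings` are lists of dicts (hypothetical structure).
    Only sellers with at least `min_matches` matching listings are returned.
    """
    wanted = {(normalise(m["author"]), normalise(m["title"])) for m in missing}
    hits = defaultdict(list)
    for item in listings:
        key = (normalise(item["author"]), normalise(item["title"]))
        if key in wanted:
            hits[item["seller"]].append(item)
    return {seller: items for seller, items in hits.items() if len(items) >= min_matches}


# Invented sample data, for illustration only.
missing = [
    {"author": "Author, One", "title": "First Title"},
    {"author": "Author, Two", "title": "Second Title"},
]
listings = [
    {"author": "AUTHOR, One.", "title": "First title.", "seller": "Dealer A"},
    {"author": "Author, Two", "title": "Second Title", "seller": "Dealer A"},
    {"author": "Author, One", "title": "First Title", "seller": "Dealer B"},
]
leads = sellers_with_matches(missing, listings)
```

A result of this kind is a lead, not evidence. A title match says nothing about whether the copy offered is the copy that went missing, which is why the copy-specific details in the original listing matter so much.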

The value of marketplace data is not, however, limited to its usefulness for the detection and recovery of stolen books. Just knowing that traceable historical records will exist in the future may be enough to prevent not just thefts, but illegal exports and other potential threats to culturally important books and manuscripts. For that I would like to recount a personal experience that demonstrates what I mean.

At some point early in the history of viaLibri I received an email from someone in South America. He urgently requested that I remove an early manuscript that was being displayed by the 563 years feature that I have already mentioned.

The item had been purchased from a dealer in Spain. I explained to him that we could only remove something when the request was received from the bookseller who had originally offered it for sale.

The Spanish dealer then contacted me and the removal was done. Before I removed it, however, I took a close look at its description. It was a document written in Spanish that related to an important historical event written by one of its participants.

The price, date and content made it clear that it was something that would probably have required an export license to leave Spain. It also seemed very likely that it did not have one.

In the context of our program today, I think this request was very interesting. Here we have a case where, in contrast to our normal expectations, the interest of an owner was not to establish provenance, but rather, to conceal it.

This was an example of a phenomenon that interests me greatly. A desire to suppress the provenance of online purchases, for whatever reason, is not, I regret to say, exceptional. But in some important respects it is completely new.

In the pre-digital era, booksellers mostly sold their books by means of printed catalogues. This meant that book descriptions, often including valuable information about ownership and provenance, were generally committed to the relatively durable medium of paper and ink. There was no possibility of suppressing that information once the catalogue had been received from the printer and posted in the mail.

Online bookselling changed that dramatically. As things stand now, after a book or manuscript is sold online the information about it is quickly withdrawn from public view. The digital record of its existence disappears in a way that is impossible for its printed counterpart.

As a result, its availability as a reference for future booksellers and archivists is lost. The record of its passage through the marketplace is erased. Its history, its provenance, is gone.

Are we content to let this happen? I sincerely hope not.

So how can we prevent it?

We have already seen the role that digital historic marketplace records can have in the recovery of stolen books. We have also seen how retrospective data may help to inhibit the illegal export of important manuscripts and books.

But for these things to happen the data that enables them must first be saved. There needs to be an archive. In fact, there could be many archives. The more the merrier.

But to build archives we need archivists. Who will they be?

I can’t ask that question without also first acknowledging that, merely by saving the data we generate, our website becomes a de facto archive of the marketplace. It becomes an archive whether we intend it to or not. A very large number of books have passed through our marketplace. Over 50 million unique items have already been recorded. Every day we see and add over half a million more. We do this because we can, but we also do it because we think we should.

The moment for doing this has only recently arrived. It cannot be ignored. An archive of this size was not something we would have attempted a decade ago. The amount of data would have overwhelmed us and the costs of saving and processing it would have been prohibitive.

But as data storage capacities have increased, and costs have dropped, the opportunities created by access to massive quantities of data have now opened before us. That trend will inevitably continue. Terabytes need no longer intimidate us, and an important milestone has now been reached.

You see: We have now arrived at the point where we can save forever the details of every old and rare book that appears for sale online.

For the world of old and rare books The Age of Big Data is about to arrive.

The implications of this should not be ignored. Not only can this data be used to create new tools for the prevention and recovery of stolen books, it will also constitute, over time, a huge storehouse of information about the history of the book itself. It will provide the raw material for a kind of comprehensive bibliographic research that could not have been attempted previously.

And, in the context of today’s program, I think this data will be particularly valuable because it is, necessarily, copy-specific. It will contain the kind of unique details that are usually missing from the generic data typically found in bibliographies and library catalogues.

Information on things like bindings, text variants, inscriptions, bookplates, annotations, usage and, of course, provenance is found in the catalogue entries of booksellers and major auction houses, but only occasionally elsewhere.

How do we make sure this data is saved? It is not really that difficult. The marketplace generates millions of search results daily. The stream of new data is continuous. All we have to do is capture it as it passes by, identify what has been previously seen, and then save the rest.
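The three steps just named (capture, identify what has been seen, save the rest) can be sketched as follows. This is a minimal illustration, not an account of how any particular archive is built: the listing fields, the choice of a SHA-256 fingerprint and the in-memory `seen` set are all assumptions, and an archive on the scale described here would need a persistent store in place of the Python set and list.

```python
import hashlib


def fingerprint(listing):
    """Return a stable identifier for a listing.

    Case and whitespace are normalised first, so a listing that reappears
    with trivial formatting changes is still recognised as already seen.
    """
    parts = (listing["seller"], listing["author"], listing["title"], listing["description"])
    canonical = "\x1f".join(" ".join(p.lower().split()) for p in parts)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def archive_new(stream, seen, archive):
    """Append listings not seen before to `archive`; return how many were added."""
    added = 0
    for listing in stream:
        key = fingerprint(listing)
        if key in seen:
            continue  # already archived on an earlier pass
        seen.add(key)
        archive.append(listing)
        added += 1
    return added


# Invented sample data: the second record repeats the first with different spacing and case.
todays_results = [
    {"seller": "Dealer A", "author": "Author, One", "title": "First Title",
     "description": "Contemporary calf, armorial bookplate."},
    {"seller": "Dealer A", "author": "AUTHOR, One", "title": "First  Title",
     "description": "Contemporary calf,  armorial bookplate."},
    {"seller": "Dealer B", "author": "Author, One", "title": "First Title",
     "description": "Later cloth, ownership inscription on title."},
]
seen, archive = set(), []
added_first_pass = archive_new(todays_results, seen, archive)
added_second_pass = archive_new(todays_results, seen, archive)
```

Including the seller and the full description in the fingerprint keeps two different copies of the same title apart, which is the point of a copy-specific archive.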

That much is now already being done.

As for the future, I can only speculate on what else will be possible. When I do, it is easy to imagine, for example, that library catalogues will one day intersect with our cumulative marketplace data as it is generated. When that happens I would expect to see algorithms that know how to mine that intersected data and recognise suspicious activities that would otherwise be ignored.

Perhaps even more interesting…

With the quantity of data that will be available to us in the future it should be possible to develop increasingly objective methods for measuring the relative rarity of, and demand for, specific books.

Combining this with historic price information will eventually make it possible to perform automated audits that identify items whose value has increased beyond set limits. These can then be pulled from open shelves and moved to more secure and protected storage locations.
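An audit of this kind might look something like the sketch below. Everything in it is hypothetical: the holdings fields (`shelfmark`, `title_key`, `location`), the use of the median of past asking prices as a value estimate, and the single fixed `limit` are assumptions made for the sake of illustration. Asking prices for other copies are at best a rough guide to the value of a particular copy.

```python
from statistics import median


def estimate_value(prices):
    """Estimate value as the median of recorded prices, or None if there are none."""
    return median(prices) if prices else None


def items_to_secure(holdings, price_history, limit):
    """Return (shelfmark, estimate) for open-shelf items whose estimate exceeds `limit`."""
    flagged = []
    for item in holdings:
        if item["location"] != "open shelves":
            continue  # already in secure storage
        estimate = estimate_value(price_history.get(item["title_key"], []))
        if estimate is not None and estimate > limit:
            flagged.append((item["shelfmark"], estimate))
    return flagged


# Invented sample data, for illustration only.
holdings = [
    {"shelfmark": "A1", "title_key": "t1", "location": "open shelves"},
    {"shelfmark": "A2", "title_key": "t2", "location": "open shelves"},
    {"shelfmark": "A3", "title_key": "t1", "location": "vault"},
    {"shelfmark": "A4", "title_key": "t3", "location": "open shelves"},
]
price_history = {"t1": [800, 1200, 1500], "t2": [50, 60]}
to_move = items_to_secure(holdings, price_history, limit=1000)
```

Items with no price history at all (A4 here) are silently skipped in this sketch. A real audit would probably want to report them separately, since absence from the marketplace may itself be a sign of rarity.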

But these are only examples.

Looking yet further ahead, I don’t think it is preposterous to expect, some day, to see artificial intelligence applied to the historical bibliographic data that we will have accumulated over time. I can’t say I actually know how that might work, but I do, nevertheless, anticipate a day when cognitive computing and machine learning will be applied to the data we are now collecting from the marketplace.

These may eventually learn to recognise patterns of market activity that are able to predict the occurrence of thefts even before they have been discovered by the usual means. When that happens it will be like having a virtual smoke alarm monitoring the antiquarian marketplace and ready to alert us when something suspicious has occurred.

I could go on. But it is important to realise that it will not be the algorithms, it will not be the artificial intelligence of the future that will make these things happen.

It will be the data.

Without data all the rest is like a loom without yarn.

And that is my final and greatest concern in concluding what I have to say here today.

It is essential to recognise that the miraculous technologies that make it possible for us to use all the types of data I have mentioned, also have the ability to make that data disappear and be lost forever. That loss is already taking place and its importance cannot be overestimated.

Is there any reason to believe that the future will care less than the present about where its books have come from? Will it be indifferent to what was known about those books when they passed from one owner to the next?

I think not.

The question then is: Should we throw this information away for no better reason than that it is no longer useful to ourselves?

I, for one, do not think so.

I hope that others will feel the same.

2 thoughts on “Provenance Meets Big Data: Will they have a future together?”

Thomas Joyce says:

April 9, 2019 at 2:07 am

Thank you, Jim.

It strikes me that this is somewhat analogous to the arms race, insofar as, if a powerful new weapon might be built, then somebody will attempt to build it. Just so, as you have stated, as terabytes of storage become incredibly affordable, then we, booksellers, collectors and librarians, must have such a historical record, for so many reasons.

I recall an incident some years ago involving a list of not more than a hundred quality books that were stolen. They were meant to be given to a regional library. The hundred never arrived, but neither did the librarian distribute the lists of the known missing items – including some nearly unique books. The list still has not been promulgated. A globally accessible list would be a great boon, and save a lot of heartache from the victims and the unwitting buyers.

1. Jim Hinck says:
  
  April 9, 2019 at 7:46 pm
  
  Thanks Thomas.
  
  I would very much like to see the list of 100 books. I would publish it if I had it. The books themselves might now be beyond recovery, but their exposure would be cautionary to thieves of the future.
  
  Perhaps in this case it is tight lips that sink ships.
  
  -Jim
