A key part of the [project to remodel the Getty Provenance Index](http://www.getty.edu/research/tools/provenance/provenance_remodel/index.html) is the transformation of our datasets into Linked Open Data (LOD). Transforming to LOD will allow our data to be more easily discovered on the web and more easily linked to data sets at other institutions. But this transformation to LOD required us to standardize many of our data elements that had previously been left to be understood in the context of the surrounding data.

As a database editor for the Provenance Index, I focus on data from the sales projects, which is the largest group of records in the Index. In this post I’ll describe some of the thorny data-standardization issues involved in the move to LOD through the lens of a single group of fields, ones that presented interesting challenges.

### A Brief History of the Sales Projects

If you’re not familiar with the [Getty Provenance Index](http://www.getty.edu/research/tools/provenance/search.html), here’s a bit of history. Founded in the early 1980s by Burton Fredericksen, the first curator of paintings for the Getty Museum, the Provenance Index has grown over the past 30-plus years into a collection of online databases with 1.7 million records. Database records represent art objects drawn from archival source documents such as auction catalogs, inventories, and dealer stock books, ranging from the 17th to the 20th centuries. This data can be used to trace the ownership of works of art and to examine patterns in collecting and art markets in Europe and the US.

The largest group of records is what we call the sales projects, which currently include 1.25 million object records from about 16,000 sales catalogs. The sales projects began in the 1980s as an attempt to record all the paintings appearing in British sales catalogs in the 19th century. The work was done decade by decade, starting from 1801, and, so far, we’ve gotten up to 1840.
This work was followed by projects for French, Belgian, and Dutch sales. Then, through collaborations with other institutions, we produced projects for earlier sales, from the 17th and 18th centuries, and most recently, a 20th-century German sales project. We also started to include additional object types beyond paintings, such as drawings and sculptures. You can [see an overview of what’s covered in the sales projects here](http://www.getty.edu/research/tools/provenance/charts.html#catalogs).

These sales can all be searched together online, but they were originally developed as individual projects over four decades, with the first projects produced as print publications. One of our challenges has been to make sure that all of these individual datasets work together and adapt to changing formats and technology. This challenge has continued into the current remodel project.

Below you can see examples of how our data looked in our early print volumes compared to our current web-based platform.

Our country-based sales projects always had the same basic structure, and we attempted to capture the same basic information in a standard way. There are small differences between sales catalogs in different countries and periods, but they don’t cause a problem in our current web-based platform. For example, you can search by sale date, artist, buyer, seller, object type, title words, etc., across all the sales databases with no problem, other than the fact that the title words will be in the original language of the document.

Things changed, however, when we began making the shift to linked open data, and some of the differences in the data from country to country became an issue. In the LOD model, each data element needs a specific definition. And some of these elements rely on other elements in order to provide meaning.
So changes in the way one element was handled often had a ripple effect.

I’ve been working on content in the Provenance Index databases for many years and am not an IT expert. Therefore, I will try to avoid highly technical descriptions, such as “The software development team took the thingy and did some stuff to it.” You’re welcome. Instead, I’ll focus on the content side of the standardization process and explore how one issue led to the next as we sought to create more granular definitions for concepts that were more ambiguous in our old data model.

I could have written about any of the fields you see in the search screen above: artists, owners, object types, subject matter, sale locations, etc. But instead, I’m going to write about something that doesn’t even appear on that search screen. Ooh, do I have your attention now? What could this mysterious data element be? Well, it’s (wait for it) sale price. Oh, yes. We’re going there.

### The Many Types of Art Sales Prices

You read that right. There are different types of prices. I hope you are sitting down and holding on to something, because we are in for a bumpy ride. Many sales catalogs contain prices, but the prices don’t all mean the same thing. Moreover, prices can be written in by hand, they can come from published sale results, or they can be printed in a catalog. Prices printed in an auction catalog obviously cannot be the actual auction results, because they were published before the auction took place. So the printed prices in auction catalogs are estimates, starting prices, or reserves.

To add complexity, it is fairly unusual to have any prices printed in auction catalogs in the 17th, 18th, and early 19th centuries. Because of this rarity, we did not have a separate way to record this information. We simply recorded any price in our main price field and added a note if we needed to explain that it was a different “type” of price, like an estimate.
In 20th-century German catalogs, printed estimates and starting prices are common. So, when our Nazi-era German sales project started in 2011, individual data fields were added to record the different types of prices that could exist: price, estimated price, or starting price. In the remodeled Provenance Index, these will be linked to specific concepts and be standard across databases. So we had to go back to our older sales projects and identify which prices needed to be moved to new fields.

### Prices vs. Transactions

After this initial pass, in which we separated price information into three fields, we realized that our main sale price field could still mean two very different things. A price often represents an actual purchase price, but it can also be what’s known as the “bought-in” price. When bidding does not reach the reserve price, the highest bid is often recorded; but without a sale, this “price” does not represent a transfer of ownership. That distinction (whether or not the price also represents an actual sale) is not made in the price field. Instead, we note this in our databases through the “transaction” field.

This is an example of how the information in one field is dependent on the information in an entirely separate field for meaning. If the transaction is noted as “sold,” then the price represents the purchase price. If the transaction is noted as “bought-in,” it is *not* a purchase price because a transfer of ownership did not occur.

For the remodel project, we decided that we would consider *all* the prices that appear in the sale price field to be “bid” prices. In other words, they represent a bid that was made at auction. It might be the winning bid; it might be a high bid that didn’t reach the reserve; or we might simply not know what the bid represents.
The LOD modeling of the price will then be dependent on the transaction, so that only the “sold” transactions will have the price linked to a sale event.

### Oh No, Another Price Type!

Once we had sorted out the purchase price, bought-in price, starting price, and estimated price, we thought we had taken care of all the price-type issues for the sales catalogs. But, as we were separating the prices into their distinct fields, *another* price type surfaced. Something we didn’t consider in the initial analysis was the fact that 267 sales in our sales databases (less than 2%) are not auctions at all. Most of these are sales by private contract.

Unlike auctions, in which objects are sold to the highest bidder at a specified time, in a private contract sale the works are exhibited in a gallery for an extended period, usually from a few weeks to a few months. Customers can view and purchase the works at any time during the exhibition for set or negotiated prices. In most cases we don’t have any prices at all for such sales, so the price type is often not an issue. But in a few cases there *are* prices printed in this type of catalog, and we realized that our existing price types didn’t adequately define them. Because there is no bidding, the prices printed in a private contract sale catalog are not starting or reserve prices. They also don’t necessarily represent the eventual sale price, because customers could negotiate them down.

There are about 4,500 object records with this price type, out of 1.25 million sales records. By comparison, there are almost 100,000 records with estimated prices. Despite this relatively small number, we decided that we needed a new price type, which we would call the “asking” price. We requested that this term be added to the Getty’s [Art & Architecture Thesaurus®](http://www.getty.edu/research/tools/vocabularies/aat) so that it can be linked to a standard concept, which will be used in our new LOD model.
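For readers who think in code, here is a minimal Python sketch of how the price types discussed so far might be represented, and how the meaning of a bid price depends on the transaction. It is an illustration only, not the actual Provenance Index data model: the names `PriceType`, `PriceRecord`, and `represents_sale` are hypothetical, and the transaction values are simply the “sold” and “bought-in” labels mentioned above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class PriceType(Enum):
    """Hypothetical labels for the price types described in this post."""
    BID = "bid"              # anything recorded in the main sale price field
    ESTIMATED = "estimated"  # printed estimate in an auction catalog
    STARTING = "starting"    # printed starting price in an auction catalog
    ASKING = "asking"        # printed price in a private contract sale


@dataclass
class PriceRecord:
    amount: float
    price_type: PriceType
    # Assumed values: "sold", "bought-in", or None when the outcome is unknown.
    transaction: Optional[str] = None


def represents_sale(record: PriceRecord) -> bool:
    """Return True only when the price should be linked to a sale event.

    A bid price counts as a purchase price only if the transaction is
    "sold". Bought-in or unknown bids, and every other price type, do not
    represent a transfer of ownership.
    """
    return record.price_type is PriceType.BID and record.transaction == "sold"
```

With this sketch, `represents_sale(PriceRecord(120.0, PriceType.BID, "sold"))` is true, while the same amount recorded as “bought-in”, or any asking or estimated price, is not treated as a purchase price.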
### Types of Sales

In order to identify asking prices, we first had to identify all of the private contract sales. Many were clearly labeled as such in the notes for the sale. But others only included language such as “for ready money” or “sold out of hand” or similar terms in French or German. Still others didn’t include any of these specific terms, but we could identify them through other factors. For example, The European Museum, an exhibition space in London that held long-term sale exhibitions during the late 18th and early 19th centuries, would often include only the month, year, or season for the beginning of the exhibition. So these vague dates were a clue about the type of sale. Once all these sales were identified, the catalogs had to be checked to see which ones had prices that should be considered asking prices.

### Lotteries? Where Did These Come From?

Most of the non-auction sales in our databases are private contract sales. However, there are a few exceptions even to this. While I was searching through our 16,000 sales, I came across five catalogs for 18th-century lotteries, four German and one French. Lotteries were events in which participants would purchase tickets that gave them the chance to win one of the objects being raffled. They were a common way of dispersing art until the end of the 17th century, at which point auctions gained in popularity. But they continued to occur in the 18th and into the 19th centuries. In general, we have not included them in our sales projects, but apparently five crept in. Not only that, but one of the German lotteries includes prices that are valuations. Yes, valuations. Yet another price type!

### Shared Prices

As I mentioned above, some of this price description information, such as the fact that a price was an estimate, had previously been included as a note attached to the main price field in our database. But other price description information also appeared in that price note field.
The most common note explained that the object being viewed had been sold together with another object for a single price. Without the note it would appear that the price represented the purchase price for the individual object, when in fact, only a portion of the price would have been for that object. In the following example, you can see in the “Transaction” field that the price was actually for two separate lots.

There are over 82,000 sales records with shared prices in our databases. This shared price information usually comes from hand annotations, often indicated by brackets that link two or more sale lots to a single price, as in the following example, in which the above lot 5 is shown to have been sold with lot 6.

This concept was not a problem for the new model, as long as the specified lots were identified and could be linked together by matching the notes in the corresponding records. In order to create the link, though, the notes had to match perfectly, so this took some cleanup effort.

The real problem was that there were also hundreds of lots that had notations for a shared price, but lacked any associated lot to share the price with. It’s easy to assume this was a mistake, but it wasn’t. These missing records occurred when an object that was in scope for the project was sold with another object that was out of scope, and therefore not included. As I mentioned at the beginning of this post, our sales projects record specific types of art objects. We have not included prints, books, most decorative arts, etc. So, for example, when a drawing lot was sold with a print lot, we would only have the drawing lot in our database and not the print lot. But the drawing lot would still include a shared price note indicating that it was sold together with another lot.
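The matching step, and the problem of notes with no partner, can be sketched roughly as follows. This is a hypothetical illustration in Python, not the code used for the remodel: the `LotRecord` fields, the `group_shared_prices` function, and the whitespace and case normalization are all assumptions standing in for the real cleanup work.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Optional, Tuple

NoteKey = Tuple[str, str]  # (catalog id, normalized shared-price note)


class LotRecord(NamedTuple):
    """Hypothetical, simplified view of one sales record."""
    catalog_id: str
    lot_number: str
    shared_price_note: Optional[str]  # None when the lot was sold on its own


def group_shared_prices(
    records: List[LotRecord],
) -> Tuple[Dict[NoteKey, List[str]], Dict[NoteKey, str]]:
    """Group lots whose shared-price notes match within the same catalog.

    Returns (linked, orphans): `linked` maps each note to the lots that
    share it, and `orphans` maps notes that appear on only one record to
    that single lot number.
    """
    groups: Dict[NoteKey, List[str]] = defaultdict(list)
    for record in records:
        if record.shared_price_note is None:
            continue
        # Assumed cleanup: notes must match exactly, so trim and lowercase.
        key = (record.catalog_id, record.shared_price_note.strip().lower())
        groups[key].append(record.lot_number)

    linked = {key: lots for key, lots in groups.items() if len(lots) > 1}
    orphans = {key: lots[0] for key, lots in groups.items() if len(lots) == 1}
    return linked, orphans
```

In this sketch, two records in the same catalog whose notes differ only in spacing or capitalization end up linked, while a drawing lot whose partner print lot was never recorded shows up in `orphans`.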
The new model didn’t know what to do with this shared price note because it had nothing to link to.

We decided that the solution would be to create shell records that would stand in for the lots that were out of scope.

### Conclusion

Thank you for going on this action-packed adventure with me. As you can see, one seemingly simple data element can turn out to be remarkably complicated once you start digging in. And this post only covers a small part of the issues related to prices (don’t even get me started on currencies), which is itself only one small part of the overall data standardization process. But I hope this gives you a little bit of an idea of the work we’ve been doing as part of this remodel project, all with the aim of making Provenance Index records richer tools for research.