ThingFish
ThingFish is a network-accessible, searchable, extensible datastore. It can be used to store chunks of data on the network in an application-independent way, associate the chunks with other chunks through metadata, and then search for the chunk you need later and fetch it again, all through a REST API over HTTP.
Releases
Until we figure out why our Trac Downloads plugin stopped working, the releases are available as attachments (see Attachments at the end of this page).
There is a mailing list for discussion and questions hosted at RubyForge. You can subscribe here.
Goals
Simplicity
The system should, in its most basic form, only do two things:
- Store files via a network interface.
- Store metadata about the files and provide a search facility for finding files via their associated metadata.
Modularity
The system should have as much of the backend details abstracted out into plugin functionality as possible. This will allow the basic system to remain simple and be expanded to fit an environment's needs. It also makes incremental functionality easier, as plugins can be created when the functionality they encapsulate is required rather than up front.
We wish to minimize the dependencies necessary to get a basic installation up and running. The base system should only require a recent installation of Ruby and Mongrel. Plugins which extend it, or which replace the default simple backends with better-tuned and more functional ones, may depend on whatever they wish.
Language Neutrality
The service API presented by ThingFish should be as portable as possible, requiring only network sockets and a standards-compliant implementation of HTTP 1.1.
To this end, we've chosen the REST architectural style.
Scalability
While scalability is an obvious goal for almost every network-accessible service, we feel it's important to consider it up front.
Because of its modularity, ThingFish should be able to scale both deep and wide without sacrificing simplicity in the default configuration. New strategies for scalability (caching, file storage, metadata semantics) can be introduced as they are needed without having to take their implementation into consideration for the initial system.
Using a REST API also helps with wide scalability: each request is stateless, so requests can be load-balanced with little to no change to the server software.
REST API
Default Handler
- Fetch the toplevel index (exactly what this means is subject to content negotiation)
- GET /
- Return the data for a given file
- GET /«uuid»
- Upload a file
- POST /
- Replace a file's data
- PUT /«uuid»
- Delete a file from the datastore
- DELETE /«uuid»
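The operations above map directly onto HTTP requests. As a minimal client-side sketch, the helper below builds (but does not send) a request object for each one using Ruby's standard Net::HTTP classes. The helper name, the action symbols, and the default content type are illustrative assumptions, not part of ThingFish.

```ruby
require 'net/http'

# Hypothetical helper: maps the default handler's operations onto
# Net::HTTP request objects. Nothing is sent over the network here.
def default_handler_request(action, uuid: nil, data: nil,
                            content_type: 'application/octet-stream')
  case action
  when :index  then Net::HTTP::Get.new('/')
  when :fetch  then Net::HTTP::Get.new("/#{uuid}")
  when :delete then Net::HTTP::Delete.new("/#{uuid}")
  when :upload, :replace
    # Upload creates a new resource (POST /); replace targets an existing one.
    req = action == :upload ? Net::HTTP::Post.new('/') : Net::HTTP::Put.new("/#{uuid}")
    req['Content-Type'] = content_type
    req.body = data
    req
  else
    raise ArgumentError, "unknown action: #{action.inspect}"
  end
end

uuid = 'c10b7ee8-cdad-11db-a110-23336f446aba'
req  = default_handler_request(:replace, uuid: uuid, data: 'new contents',
                               content_type: 'text/plain')
puts "#{req.method} #{req.path}"
```

A request built this way could be sent with `Net::HTTP.start(host, port) { |http| http.request(req) }`, where `host` and `port` identify whatever ThingFish server is in use.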
Search Handler
Returns a list of URIs for files which match the given search criteria.
- Find files with a given filename
- GET /search?filename=ovenmitt.jpg
- Find files with given tags
- GET /search?tag=(pain|firing%20squad)+ovenmitt
- Find a list of files with a complex query
- GET /search?tag=nsfw;filename=logo*;created=before+1/12/2007;owner=mahlon
- Complex query interface
- Find a list of still images created by the same person in the same namespace as a given resource via a metastore implementation-specific query (RDF+SPARQL in this example):
    POST /search HTTP/1.1
    Content-type: application/sparql-query

    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX dc: <http://purl.org/dc/elements/1.1/>
    PREFIX dcmi: <http://dublincore.org/documents/dcmi-type-vocabulary/>
    PREFIX thingfish: <http://oss.laika.com/thingfish/rdf/2007/03/schema#>
    SELECT ?urn
    WHERE {
      <urn:uuid:c10b7ee8-cdad-11db-a110-23336f446aba> dc:creator ?person .
      <urn:uuid:c10b7ee8-cdad-11db-a110-23336f446aba> thingfish:namespace ?ns .
      ?urn dc:creator ?person .
      ?urn dc:type dcmi:StillImage .
      ?urn thingfish:namespace ?ns
    }
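Both search styles can be sketched from the client side. The first helper below joins key/value pairs with ';' as in the simple examples above; the second puts an implementation-specific query in the body of a POST. The helper names and the choice of form-encoding for values are assumptions, and the sketch does not try to reproduce the tag expression syntax shown earlier.

```ruby
require 'net/http'
require 'uri'

# Build a simple search request. Pairs are joined with ';' as in the
# examples above; values are form-encoded, so a space becomes '+'.
def simple_search_request(criteria)
  query = criteria.map { |key, value|
    "#{key}=#{URI.encode_www_form_component(value)}"
  }.join(';')
  Net::HTTP::Get.new("/search?#{query}")
end

# Build a raw query request: the query text travels in the POST body and
# the Content-Type tells the server which query language it is written in.
def raw_search_request(query_text, content_type: 'application/sparql-query')
  req = Net::HTTP::Post.new('/search')
  req['Content-Type'] = content_type
  req.body = query_text
  req
end

req = simple_search_request(tag: 'nsfw', filename: 'logo*', owner: 'mahlon')
puts req.path
```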
Metadata Handler
Returns a list of metadata tuples.
- Return a list of all metadata tuples for a given file
- GET /metadata/«uuid»
- Find all tags in the store
- GET /metadata/tag
- Find all tags for a given file
- GET /metadata/«uuid»/tag
- Return the first preview that matches the request's Accept header for a given file
- GET /metadata/«uuid»/preview
- Add a tag for the given file
- POST /metadata/«uuid»/tag
- Replace the namespace for the given file
- PUT /metadata/«uuid»/namespace
- Delete a namespace for the given file
- DELETE /metadata/«uuid»/namespace
- Note that the following requests are not allowed, since they would dirty the metadata derived from the underlying data. They should return a BAD_REQUEST.
- DELETE|POST /metadata/«uuid»
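The restriction on whole-resource DELETE and POST can be expressed as a small dispatch check. This is only a sketch: the function name is hypothetical, and every other request is given a plain 200 here, whereas a real handler would go on to read or modify the metastore.

```ruby
# Matches a textual UUID such as c10b7ee8-cdad-11db-a110-23336f446aba.
UUID_PATTERN = /\A\h{8}-\h{4}-\h{4}-\h{4}-\h{12}\z/

# Hypothetical check: returns the HTTP status the metadata handler
# should answer with for a given method and request path.
def metadata_status(http_method, path)
  segments = path.sub(%r{\A/metadata/?}, '').split('/')
  whole_resource = segments.length == 1 && segments.first.match?(UUID_PATTERN)

  if whole_resource && %w[DELETE POST].include?(http_method)
    400 # BAD_REQUEST: would dirty metadata derived from the stored data
  else
    200
  end
end

uuid = 'c10b7ee8-cdad-11db-a110-23336f446aba'
puts metadata_status('DELETE', "/metadata/#{uuid}")
puts metadata_status('DELETE', "/metadata/#{uuid}/namespace")
```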
Admin Handler
- Show available diskspace
- GET /admin/diskspace
- Show and edit space/quotas per user
- GET /admin/quotas
- Cleanup and maintenance (candidates for deletion?)
- GET /admin/cleanup
- Show current usage (status, stats, and graphs -- rrd tool, IO, trending performance)
- GET /admin/status
Additional Features
Auto-Generation of Metadata
ThingFish will also support extraction and auto-generation of metadata from the stored file.
Examples:
- Detection of filetype based on magic for less-useful upload mimetypes
- application/octet-stream
- text/plain
- Previews for appropriate mimetypes
- Extraction of embedded metadata (e.g., camera info, codec, etc.)
- Pluggable extractions
- Each extractor knows what mimetypes it can extract its metadata from
- upload time
- uploading agent (e.g., User-Agent header)
- uploading IP address
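Magic-based filetype detection, the first item in the list above, can be sketched as a lookup of well-known leading byte signatures that is consulted only when the uploaded mimetype is one of the less useful ones. The function name and the small signature table are illustrative assumptions; a real extractor would cover far more formats.

```ruby
# A few well-known file signatures ("magic" bytes). The table is
# illustrative, not a complete list.
MAGIC_SIGNATURES = {
  "\x89PNG\r\n\x1A\n".b => 'image/png',
  "\xFF\xD8\xFF".b      => 'image/jpeg',
  'GIF87a'.b            => 'image/gif',
  'GIF89a'.b            => 'image/gif',
  '%PDF-'.b             => 'application/pdf'
}.freeze

# Upload types that say little about the real content.
VAGUE_TYPES = %w[application/octet-stream text/plain].freeze

# Returns a more specific mimetype when the declared one is vague and the
# data starts with a known signature; otherwise returns the declared type.
def detect_mimetype(declared_type, data)
  return declared_type unless VAGUE_TYPES.include?(declared_type)

  bytes = data.b
  MAGIC_SIGNATURES.each do |signature, mimetype|
    return mimetype if bytes.start_with?(signature)
  end
  declared_type
end

puts detect_mimetype('application/octet-stream', '%PDF-1.4 ...')
```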
We're trying to name metadata according to the conventions of the Dublin Core where possible/appropriate. The default metadata we're currently extracting (from the HTTP request) is:
Description    | Dublin Core Type | Metastore Attribute
Content-type   | format           | format
Content-length | extent           | extent
User-agent     | n/a              | useragent
Uploading IP   | n/a              | uploadaddress
Upload Date    | created          | created
Modified Date  | modified         | modified
Checksum       | n/a              | checksum
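As a rough sketch of how these defaults might be gathered, the function below builds a metadata hash keyed by the metastore attribute names listed above. The function name, its arguments, the ISO 8601 date format, and the use of SHA-256 for the checksum are all assumptions; the source does not say which digest or date format ThingFish uses.

```ruby
require 'digest/sha2'
require 'time'

# Hypothetical extractor: builds the default metadata from the request
# headers, the client address, and the uploaded body.
def default_metadata(headers, remote_addr, body, now: Time.now.utc)
  {
    'format'        => headers['Content-Type'],
    # Fall back to the actual body size if no Content-Length was sent.
    'extent'        => Integer(headers.fetch('Content-Length', body.bytesize)),
    'useragent'     => headers['User-Agent'],
    'uploadaddress' => remote_addr,
    'created'       => now.iso8601,
    'modified'      => now.iso8601,
    'checksum'      => Digest::SHA256.hexdigest(body) # digest choice is an assumption
  }
end

headers = { 'Content-Type' => 'image/png', 'User-Agent' => 'curl/7.0' }
default_metadata(headers, '10.0.0.1', 'abc').each do |key, value|
  puts "#{key}: #{value}"
end
```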
Content Negotiation
The daemon will also support pluggable Transparent HTTP content negotiation, which will allow customizable serialization of complex datatypes and on-the-fly transformation of fetched files.
For URIs that return RDF triples or other structural data, the client will be able to fetch it in YAML, JSON, XML, HTML, or perhaps other formats (Turtle?, N3?).
This will be implemented with a table of transformations from one mimetype to another. If the mimetype of the file is in the request's Accept list, return it as-is. If not, a response format is determined by iterating over the formats the requester accepts and looking each one up in the table of transformations. If a transformation exists for a requested type, it is executed and the data is returned in that format.
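The lookup described above might be sketched as follows. The table keys, the made-up source mimetype `application/x-ruby-object`, and the function name are assumptions for illustration. The sketch also ignores q-values and wildcards in the Accept header, which a complete implementation would have to honor.

```ruby
require 'json'
require 'yaml'

# Hypothetical transformation table: [from, to] => converter.
TRANSFORMS = {
  ['application/x-ruby-object', 'application/json'] => ->(obj) { JSON.generate(obj) },
  ['application/x-ruby-object', 'text/x-yaml']      => ->(obj) { YAML.dump(obj) }
}.freeze

# Returns [mimetype, body], or nil when nothing acceptable can be produced.
def negotiate(source_type, data, accepted_types, transforms = TRANSFORMS)
  # The resource is already in an acceptable format: return it as-is.
  return [source_type, data] if accepted_types.include?(source_type)

  # Otherwise walk the accepted formats in order and use the first one
  # for which a transformation from the source type exists.
  accepted_types.each do |wanted|
    converter = transforms[[source_type, wanted]]
    return [wanted, converter.call(data)] if converter
  end
  nil
end

type, body = negotiate('application/x-ruby-object', { 'tag' => 'ovenmitt' },
                       ['text/html', 'application/json'])
puts "#{type}: #{body}"
```

When `negotiate` returns nil, the natural HTTP response would be 406 Not Acceptable.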
Pluggable output formats
- YAML
- JSON
- XML
- HTML
- RSS
- Image/audio re-encoding PNG->TIFF, etc.
- Human-readable text
Filetype conversion 'caching'
Depending on the filetypes, repeating conversions whose results are thrown away and recalculated on each request could tax the ThingFish server's CPU. The filter interface (parent class) should provide an API for storing conversion results as new ThingFish resources (with a special 'variant' metadata key) and for updating the original resource's metadata with a reference to the new type and UUID. Each filter that performs a conversion could first check the original resource's metadata for a reference to a pre-calculated version.
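One way this could look, using in-memory hashes as stand-ins for the real filestore and metastore: the class name, the method names, and the 'variants' key on the original resource are hypothetical. Only the 'variant' key on the new resource comes from the description above.

```ruby
require 'securerandom'

# Hypothetical in-memory stand-in for the filestore and metastore.
class VariantCache
  attr_reader :metadata

  def initialize
    @data     = {}                                # uuid => bytes
    @metadata = Hash.new { |h, k| h[k] = {} }     # uuid => { key => value }
  end

  # Store bytes as a new resource and return its UUID.
  def store(bytes, meta = {})
    uuid = SecureRandom.uuid
    @data[uuid] = bytes
    @metadata[uuid].merge!(meta)
    uuid
  end

  # Return the data for +uuid+ converted to +target_type+. The block does
  # the conversion and runs only when no stored variant exists yet.
  def fetch_variant(uuid, target_type)
    original = @data.fetch(uuid)
    variants = (@metadata[uuid]['variants'] ||= {})
    cached   = variants[target_type]
    return @data[cached] if cached

    converted = yield(original)
    # The new resource points back at its source via the 'variant' key,
    # and the original records the new type and UUID.
    variants[target_type] = store(converted, { 'variant' => uuid, 'format' => target_type })
    converted
  end
end

cache = VariantCache.new
id    = cache.store('hello', { 'format' => 'text/plain' })
2.times do
  puts cache.fetch_variant(id, 'text/x-upper') { |bytes| bytes.upcase }
end
```

The second call is answered from the stored variant, so the conversion block runs only once.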
Later, the /admin interface could have a "variants" section (or whatever) that would display total diskspace in use by variants, and optionally purge them all wholesale. Neat.
Implementation
Language
- Ruby
- Daemon built around Mongrel (http://mongrel.rubyforge.org/)
- C extensions where necessary for speed/scalability
File Storage
The storage backend should be pluggable. The default implementation should use a simple filesystem directory structure, perhaps with hashed names based on a resource's UUID.
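A hashed directory layout could look something like the sketch below, which derives a couple of directory levels from a digest of the UUID so that files spread evenly across directories. The function name, the use of SHA-1, the two-level depth, and the example root directory are all assumptions.

```ruby
require 'digest/sha1'
require 'pathname'

# Hypothetical layout: <root>/<d1>/<d2>/<uuid>, where d1 and d2 are the
# first two byte pairs of the SHA-1 hex digest of the (lowercased) UUID.
def storage_path(root, uuid, levels: 2)
  name   = uuid.downcase
  digest = Digest::SHA1.hexdigest(name)
  dirs   = digest.scan(/../).first(levels)
  Pathname.new(root).join(*dirs, name)
end

puts storage_path('/var/thingfish', 'c10b7ee8-cdad-11db-a110-23336f446aba')
```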
Duplicates should be avoided via checksumming, perhaps with the error response referring the client to the original via a Location header.
We also need to handle duplicate uploads when ACLs restrict access to the original copy. If the filestore already holds a resource with the same checksum that is not accessible to the uploading user, a duplicate should be created transparently. To avoid information leakage, the second uploader must not learn of the first object's existence if she doesn't have permission to view it.
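The duplicate-handling rule can be sketched with an in-memory store. The class, the per-resource reader list standing in for ACLs, and the SHA-256 checksum are illustrative assumptions. The point of the sketch is that an upload is reported as a duplicate only when the uploader can already see a resource with the same checksum.

```ruby
require 'digest/sha2'
require 'securerandom'

# Hypothetical store that de-duplicates uploads by checksum without
# revealing resources the uploading user is not allowed to read.
class DedupStore
  Result = Struct.new(:uuid, :duplicate)

  def initialize
    @by_checksum = Hash.new { |h, k| h[k] = [] }  # checksum => [uuid, ...]
    @readers     = {}                             # uuid => [user, ...]
  end

  def upload(user, bytes, readers: [user])
    checksum = Digest::SHA256.hexdigest(bytes)

    # Only a copy this user may read counts as a duplicate.
    visible = @by_checksum[checksum].find { |uuid| @readers[uuid].include?(user) }
    return Result.new(visible, true) if visible

    # Otherwise create a new resource, even if an inaccessible copy exists.
    uuid = SecureRandom.uuid
    @by_checksum[checksum] << uuid
    @readers[uuid] = readers
    Result.new(uuid, false)
  end
end

store  = DedupStore.new
first  = store.upload('alice', 'same bytes')
again  = store.upload('alice', 'same bytes')
other  = store.upload('bob', 'same bytes')
puts again.duplicate
puts other.duplicate
```

In an HTTP setting, a result flagged as a duplicate is where the referral to the original via a Location header would come in.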
Associated Metadata
The default metadata structure for ThingFish files will be a basic key/value list implemented with an in-memory hash. You can customize the metadata layer via pluggable metastore strategies.
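A key/value metastore backed by an in-memory hash might be as small as the sketch below. The class and method names are hypothetical and are not taken from the ThingFish source.

```ruby
# Hypothetical in-memory metastore: one hash of properties per resource.
class MemoryMetaStore
  def initialize
    @properties = Hash.new { |store, uuid| store[uuid] = {} }
  end

  # Set one property, replacing any previous value.
  def set(uuid, key, value)
    @properties[uuid][key.to_s] = value
  end

  # Fetch one property, or nil if the resource or key is unknown.
  def get(uuid, key)
    @properties.key?(uuid) ? @properties[uuid][key.to_s] : nil
  end

  # Return a copy of all properties for a resource.
  def properties(uuid)
    @properties.fetch(uuid, {}).dup
  end

  # Remove one property and return its old value.
  def delete(uuid, key)
    @properties[uuid].delete(key.to_s) if @properties.key?(uuid)
  end
end

store = MemoryMetaStore.new
uuid  = 'c10b7ee8-cdad-11db-a110-23336f446aba'
store.set(uuid, :namespace, 'laika')
store.set(uuid, :format, 'image/png')
puts store.properties(uuid).inspect
```

A pluggable metastore strategy would presumably expose a similar interface and back it with a database or an RDF store instead.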
We're writing a metadata plugin for the LAIKA ThingFish installation that adds an ontological layer implemented using RDF via the Redland RDF library. We'll start out with the Dublin Core Metadata Terms at a minimum, and then add other RDF vocabularies. Some likely additions:
- FOAF (Friend of a Friend)
- designed to describe people, their interests and interconnections.
- DOAC (Description of a Career)
- supplements FOAF to allow the sharing of résumé information.
- DOAP (Description of a Project)
- designed to describe software projects; uses FOAF to identify the people involved
- Images Ontology
- Ontology for Images, image regions (SVG), videos, frames, segments, and what they depict.
- Photography Vocabulary
- Definitions of various terms related to photographs and photography equipment.
Search
Metadata searching will support two interfaces: a basic query mapper that will generate simple queries via a naive interface on the current metastore strategy, and a more-robust query engine that will provide a raw, implementation-specific query interface via a POST request, with the body of the request containing the query text.
Results from a search will be returned via one of the results-serialization strategies.
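The basic query mapper could be sketched against a plain hash of resource metadata. The function name is hypothetical, and the use of shell-style wildcards (via `File.fnmatch?`) is an assumption made to cover patterns like `logo*`. Every criterion must match for a resource to be returned.

```ruby
# Hypothetical naive query mapper. +metadata+ maps each UUID to its
# properties; +criteria+ maps property names to values or glob patterns.
def simple_search(metadata, criteria)
  metadata.select { |_uuid, props|
    criteria.all? { |key, pattern|
      # A property may hold one value or a list of values (e.g. tags).
      Array(props[key.to_s]).any? { |value| File.fnmatch?(pattern, value.to_s) }
    }
  }.keys
end

metadata = {
  'a' => { 'filename' => 'logo-small.png', 'tag' => %w[nsfw art] },
  'b' => { 'filename' => 'ovenmitt.jpg',   'tag' => %w[pain] }
}
puts simple_search(metadata, { tag: 'nsfw', filename: 'logo*' }).inspect
```

The returned UUIDs would then be turned into URIs and passed to one of the results-serialization strategies.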
Sub-Topics
- ThingFish/DeveloperNotes
- We're keeping notes as we develop about various things so we don't forget them.
- ThingFish/SubProjects
- A list of ideas for smallish projects with ThingFish backends that will help drive the implementation.
Release Plan
- 06/01/08 - ThingFish 0.3 - The "N-Robot" Release
- (later) - ThingFish 0.4 - The "Fish Is Murder" Release
- (later) - ThingFish 0.5 - The "JunkFlusha" Release
References
- Simple Map/Reduce in Ruby - an interesting implementation using DRB/Rinda.
- MapReduce for Ruby: Ridiculously Easy Distributed Programming - another DRB/Rinda implementation which doesn't seem to actually implement map/reduce, but has some interesting ideas nonetheless.
- Hadoop - A Java distributed filesystem + mapreduce implementation that came out of the Lucene project
- Spotlight Common Metadata Attribute Keys - A list of commonly-used keys in the Spotlight indexing system in MacOS X.
- Bloom Filters - A space-efficient probabilistic data structure that is used to test whether an element is a member of a set. Could be used for very fast string-array indexes
- RFC 2616 - HTTP 1.1
- RFC 2518 - HTTP Extensions for Distributed Authoring -- WEBDAV
- RFC 4709 - Mounting Web Distributed Authoring and Versioning (WebDAV) Servers
- http://xmlarmyknife.org/docs/rdf/sparql/ - SPARQL Query Service description
- SPARQL/Update - An update language for RDF graphs
- sparql.js - SPARQL Javascript Library
- XML Plists - Apple's documentation of the XML plist format.
Attachments
- thingfish-0.0.1.tar.gz (42.2 kB) - ThingFish 0.0.1 Archive, added by mgranger on 01/10/08 12:10:59.
