Tim Hubbard on Panel "Information Diversity"

I'm going to talk about a few stories about the human genome project, because it has been reported in the media and not particularly accurately. Then I'll talk about an open source project for analysing the genome, and I'll talk about something going beyond that. Open source deals with software, but there is a question of the information that you generate with software: the actual annotation, where a gene is, things like that. How can you integrate that? How can you do something open-sourceish for that sort of information? And then some future directions from that. Finally, if there is time, I might talk about gene patents.

The human genome project. This is a huge project. It has sequenced three billion bases. To give you an idea of how large this is: it took eight years to sequence yeast, and that was finished in early 1999. That only had 30 million base pairs. So the human genome is enormously larger. It has been a huge logistic challenge to scale up to this size and to actually store the information correctly.

It was started as a concept. The politics is always complicated in these things. The concept actually came from the Department of Energy in the States. It was pushed by them partly because they were looking for something to spend the money that they were no longer going to spend on military research, and they were looking for something big to spend a lot of money on. This looked like an obvious target, because they said: well, three billion bases, one dollar per base, this is going to cost a lot of money. Anyway, that is where the concept originally came from.

Now, it is important to point out that obtaining the sequence is really just one little snapshot in the whole process. There is a lot of science that went on before obtaining the sequence which is important, involving a lot of researchers around the world. The concentration on sequencing was done in America and the UK, with contributions from Germany, France, Japan and China, but there were a lot of other researchers involved before that. Similarly, this is only just completing the sequence, and it is not completed yet, either. It is also just the start of a huge investigation to follow, to understand what it all means.

This is just giving you the scale again. You can see what the numbers were, and you can see how recent genomics is. The first complete genome, two million bases, was only completed in 1995. This is just history. Here is that first genome again. Large scale sequencing was set up, and the word Bermuda is important here, because the whole idea of making this data freely available is to some extent novel. Normally what happens in science is that people publish articles in scientific journals, and there has been a growing trend to release the data at that point. In the past there wasn't an issue of data: the scientific publications were the information. Increasingly, as datasets became larger and larger, there was a need for a repository to store the data that underlie the conclusions in the paper. And so people started to release that; they started to set up databases to store that information. But it got to a point with sequencing where sequences come out so fast, because the machines are so efficient, that you never even get around to writing a publication, and the sequence is so valuable to other researchers that you can't wait until you write the paper to release it. It sounds good to release it early, but then the question is: how do you release it? Do you release all of it?
And so there were large amounts of money being donated by governments to do this. A sort of deal was struck: these large institutes doing the sequencing would end up being very powerful, because they would have access to this information; so in exchange for the money they were getting to do the sequencing, they should also release it. And this was pushed forward as a bandwagon, eventually to the point that releasing data immediately is, in fact, now regarded as the correct thing to do. The Bermuda meetings codified this, such that data was released within 24 hours of sequencing. None of this holding it back to look at it, to see if you find something interesting you might patent. You just release it immediately, and that makes it simpler for everybody.

Of course, certain interests didn't completely like that. Particularly one of the people who was involved at that point: that is Craig Venter, who went on to form Celera. And Celera's mission, as set out, was to sequence the human genome in a much faster way than the public domain. In actual fact it didn't quite work out that way, because the public domain took up the challenge, and there was an announcement in 2000 that rough versions of the human genome had been generated by both sides. So that was what you read in the press.

Here is the stuff about Bermuda again: data is released every twenty-four hours, nothing is held back for patenting. It was connected to this funding of large scale sequencing, and the reasoning behind the underwriting by institutions like the Wellcome Trust, the world's richest charity, and the NIH is really that releasing this data gives the greatest public benefit. You have huge numbers of scientists. This data is complicated, and I'll go on to talk about that later. It is hard to understand what it means. A lot of eyes are needed. After all, it is like having source code written by someone else: if it is complicated source code, you potentially need a lot of people to work out what it really means.

So, one thing in sequencing is the huge scale. This is a production facility, not in the private but in the public domain. It just looks like a factory of highly automated robotic production. These are the sort of computer set-ups that are involved. The Sanger Centre has 40 terabytes of data. And this gives you an idea of the production speed-up from May 1999 to May 2000; it is actually a tenfold speed-up.

So, the one thing that this sort of scale being done in the public domain really squashes is the notion that researchers in the public domain do little research in their laboratories, and that you need a big, powerful, well organised private company to come and do the serious stuff to make drugs. This blows that away completely. You can have this sort of well organised operation entirely publicly funded, with academic salaries and associates. You might have to have more managers, you might have to have more production meetings, but it can be done. And this is really a demonstration of that. And so there is no reason why this sort of publicly funded approach can't tackle problems which at the moment are to some extent considered to be things that can only get done with private money.

This just gives a break-down of the people involved. It really was an international effort, although the large scale sequencing was done in five institutions. Japan made a very significant impact, and China, which joined very late on, still managed to do one percent of the production in time before the announcement.

So, the controversy. The controversy is all about the technique used.
The public domain's position was that if you just chop the genome up into little pieces and try to put them together, you would end up in a mess, because you wouldn't be able to work out the puzzle. And there were good reasons for that, because it is known that there are a lot of very similar pieces of sequence in the genome. Putting them together looked like it could be a very big problem. So this is the strategy used by the public domain: you take the twenty-four chromosomes in a cell, you chop them up into fragments which are around 100,000 bases long, and then you go and sequence each of those fragments using a random strategy. The private domain strategy was to just bypass that intermediate step and do it directly. And the claim was that they had clean enough sequencing facilities and clever enough computers that they could do it all in one step. That was the claim.

So what really happened? Here you have the Celera machine putting things together, and they generated a certain amount of data. And in fact, that was at no point in the publication — there was the press release in June 2000, but then there were the real scientific publications that came out in February this year. We didn't really discover the real story until February this year. Here you have this magical program, and nothing was ever announced that came as a result of just their data. In what they did talk about, they admitted they took the public domain data — which has a certain size, a certain length, a certain number of pieces, because it is a draft genome, a certain coverage. And they generated something, and the amazing thing is that it looks almost identical to the public domain version. And so a lot of people said: where is the need, and what are they generating out of all this? And the answer, if you actually go and look in detail, is: not actually that much. Because what turned out was that this assembler did not solve the problem. It didn't manage to put the whole thing together. It ended up in the mess that was predicted at the time Celera was announced — when, in fact, politically, what was going on in the Congress of the United States were representations from this company in various committees that the public domain should give up. They should let the company do everything; the public domain should go off and sequence the mouse genome or something like that, because the private domain could do it all. Of course the private domain would have liked to get access to and control of this data, but their claim was that the public domain could be shut down; it wasn't necessary to do it twice. In actual fact it turned out that this approach used by the public domain, with maps, was necessary, and you wouldn't have got a genome otherwise.

But this message has been hidden. It has been strategically hidden by a number of clever media events, including preempting the major announcement in February this year by leaking another story the day before, to make sure that was the thing that caught the attention of the media instead of the real story. So, there is a very general story here. These are scientists on both sides — scientists in the private domain and in the public domain — but there is a lot of money to be made here, and you shouldn't trust a scientist any more than anybody else. You need to go and look behind and check what people say in press releases. Check it against real data. And in this case, where you have scientists in a private company where the data is hidden, no one can go and check.
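The claim that very similar pieces of sequence would wreck a one-step assembly can be made concrete with a toy example. This is only a sketch — real assemblers deal with millions of overlapping, error-containing reads — but it shows the core ambiguity: a read drawn from inside a repeated element fits the whole genome in more than one place, while within a single ~100,000-base clone it usually fits only once. All names and numbers here are invented for illustration.

import random

random.seed(1)

# A repeated element that occurs twice in the toy "genome".
REPEAT = "".join(random.choice("ACGT") for _ in range(36))
genome = ("".join(random.choice("ACGT") for _ in range(60))
          + REPEAT
          + "".join(random.choice("ACGT") for _ in range(60))
          + REPEAT
          + "".join(random.choice("ACGT") for _ in range(60)))

def placements(read, reference):
    # Every position where the read matches the reference exactly.
    return [i for i in range(len(reference) - len(read) + 1)
            if reference[i:i + len(read)] == read]

# A read taken from inside the first copy of the repeat...
read = genome[65:85]

# ...fits the whole genome in two places, so a one-step whole-genome
# assembler cannot tell which copy it came from.
print("whole genome:", placements(read, genome))

# Within a single clone-sized piece there is only one copy of the repeat,
# so the same read places uniquely and the local puzzle is solvable.
clone = genome[40:140]
print("single clone:", placements(read, clone))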
So, even though the public domain was heavily sceptical, we believed for eight months that the private domain had done it — that they had done it at least as well as us — because we couldn't see their data. It was only when we saw the scientific publication that we were able to make any sort of check. And that was what we discovered.

Now, the opposite side, collaboration, has been mentioned already: the SNP Consortium. It is a very interesting partnership for a number of reasons, because it also underlines the reasons why this openness is a good thing. Here you have twelve companies giving away three million dollars each, generating a pile of data which is all freely released. The remarkable thing is that the companies got no specific benefit at all. The data was made available to the public the same day they saw it. They got no private pre-access at all, no special rights. The data is in effect cleverly protected, using an application of patent law, so that no one else can patent it. It is available to everybody, but it is still protected. The companies didn't want to spend this money and then have somebody else patent the results, which is quite reasonable. This has been extended to the mouse, because the mouse is another important genome for understanding the human, and similarly a large amount of money has been put in, because these companies would prefer this to be publicly available rather than having to go and buy it from a private company. I should just mention that this is also under discussion for a third project, involving protein structures.

Now, there is a reason why they are actually interested in doing this sort of thing. You could think it was just altruism. It is actually slightly different. Biology is too complicated for any organisation to have a monopoly on, and that includes pharmaceutical companies. They may be big companies with large research engines, but whenever they start a project — as has been quoted to me — they know that there is more research going on on that project elsewhere in the world than anything they are going to do inside their company. In particular, in the case of genomes, this pile of data is so valuable because it allows all kinds of research to be connected together. It is much more important in this case, but it generalises. If a core block of biological data is kept hidden from all those other researchers around the world, both in the public and in the private domain, who publish in the scientific literature, then as a company you shoot yourself in the foot. It is not just a question of 'do I spend three million in the public domain or do I spend three million accessing private data?' If you spend three million accessing private data, then that private data isn't going to be available to all those tens of thousands of researchers working on similar problems around the world, who are going to publish things and give you leads that will allow you to develop new drugs and make profits. So it is a completely non-altruistic view. It is a view that says: 'we get a lot of our information from everybody else out there.' It actually underlines how much private research depends on public research to get new ideas, to get new products.

Of course, going beyond all this, there are patents. So we have protected the DNA. That has actually been successful, no matter how it has been presented, because the genome is public. It is available to everybody.
You can go and buy one, sure, but it is the same as the one that is freely available, and the public domain project is filling in the gaps. The private domain is not doing anything on this, and so the public domain version is going to get completely finished — it is 50 percent finished now. When it is completely finished, there is no point in anybody selling anything, because there is only one human genome, and it is virtually identical for everybody in the world. So you only need one.

But then, the raw genomic data is one thing. The analysis of that, and the location of the genes and their functions, is something completely different. And so there is this issue of patenting genes. Business people have thought: well, the genome is protected, so maybe it is OK then. It is actually very seriously not OK, because it is very unlike the situation of a normal patent, regardless of whether you agree with patents or not. Even if you agree with patents, this is a special case which has more problems, and the extra problems are related to the fact that you cannot get around these patents.

In a normal situation, suppose you design a mousetrap and patent it, and you don't sell very many of them, because you want to sell them for a lot of money. Somebody else comes along and wants to license your patent on the mousetrap, and you say: 'No, I'm not interested. I'm making a lot of money anyway.' So there is the option for the second company to go and do some research and come up with another mousetrap — another design, a different type of mousetrap, a better mousetrap. And then there could be competition, and that will affect prices, and that will affect availability. That has been the standard model for arguing that patents are a useful thing: that they encourage research and development, but you can get competition nonetheless.

In the specific case of healthcare related to genes, that is not the case, because humans are all the same. We only have a fixed number of genes. There is no better gene for a particular type of thing. There is no alternative gene for breast cancer. There is only one — only one which has this specific effect. And so, if you have a patent on this gene, you have locked up research in this whole area, and that is the case very specifically with these two genes for breast cancer. We have already seen this company that holds patents affecting every application of these genes in the future. They have a lock on the system. The only application we have seen so far has been tests, but it is already there, and they are shutting down others. They have shut down all alternative testing in the US, and they have been pursuing Europe about this. The UK has done some sort of deal, because some of the research was done in the UK, and so the UK has a bargaining position. The French government is now challenging this in the courts. But this is just a sign of what's to come. Among the many gene patents that have been granted, there is an awful lot of submarine stuff in the States where it is not clear: people have patents pending — patents that are hidden in the US patent system which only appear if the person decides to activate them, if it turns out that the gene is actually really important. I'm sure James will talk about that a bit more.

So, analysing — another aspect. This business of interpretation is kind of how it feels sometimes: you've got this three-billion-piece jigsaw puzzle, and some guy comes along and says he's found a corner piece.
The awful truth, of course, is that we've got this genome, and it is huge by the standards of anything we have dealt with before. We're just beginning to work out the worm, which is 30 times smaller. And it is in these pieces — it is not one nice continuous thing. It is going to keep changing continuously for three years, which makes a nightmare for data tracking, but everybody in the world wants to use this thing now. So you have all these people sending you mails saying 'how do I get to it?'

This is part of the project which I'm jointly in charge of, called Ensembl. It is a joint project. It has around 30 people working on it, with a large grant supporting it from the Wellcome Trust. It is basically a website which shows all the information and the complete analysis of the human genome. And what is that analysis? Well, it is basically this. Here you have a tiny bit of the sequence. Up at the top here, this is a little bit of one of these; there is a whole chromosome down here — there are twenty-four of those; you can see this over here. So we are zooming in two levels. Now we have got down to a single chromosome, and now we have got down to a tiny region. This chromosome X is about 117 million bases long. At the top here we are looking at one megabase. Down here we have scrolled into a hundred thousand letters. If you printed the whole genome out on A4 paper — three billion letters, at a few thousand letters per page — it would be three quarters of a million pieces of paper. So it is kind of big. I haven't zoomed down here to the individual letters — there are four letters: A, C, T and G — it would be pretty pointless. But here you can see some genes. You can see a wide view of some genes up here, and there is an individual gene down here. I have got a gene structure over here, just to show you where a gene is.

Here is the genetic sequence. The gene sits in the middle, and it has things at the start and things at the end. It is like a piece of code: it has a start and a stop, and it has things controlling it, turning it on, at the beginning and the end maybe, maybe some way away. That is a gene, and it gets copied. You copy from the start to the stop, and then that gets turned into a protein. So the interesting thing is: how do we go about predicting these damn things? Well, what we do is we scan the sequence and look for things that look like a protein sequence — and that's about it, actually. This works reasonably well in small things like bacteria, because there this is what a gene really looks like, but it breaks down in higher organisms, because these genes are fragmented. They are chopped up into little pieces. So if we go back to this slide, this is down here. This is why the gene is shown as separate lumps with these connections between them: those are the links, and these are the bits of the gene here. And because it is all fragmented like this, it is very difficult to predict. At the moment we can just about work out where two thirds of the genes are, with an awful lot of effort. In terms of the things controlling them, no hope at the moment — completely streets away.

So, we have the sequence, but we don't really understand what it means, and we certainly can't work out how it will work. It is a critical resource for doing all this research. People work out what individual genes do, and you hear about those in the papers. But in terms of the ultimate objective, which is a complete understanding of a cell, then of a whole human body — you have a hundred million million cells in your body — it is a long way away, a lot more research to come. And so, I have said this now several times.
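To make the "scan for protein-like stretches" idea concrete, here is a minimal sketch of the simplest version of it: finding open reading frames, runs from a start codon to a stop codon. This is an illustration, not Ensembl's actual pipeline (which combines many kinds of evidence); it is roughly the level of trick that works on compact bacterial genes and fails on human genes, whose coding sequence is split into pieces.

STOP_CODONS = {"TAA", "TAG", "TGA"}

def open_reading_frames(seq, min_codons=30):
    # Scan the three forward reading frames for ATG...stop runs long
    # enough to plausibly encode a protein (reverse strand omitted).
    found = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    found.append((start, i + 3))
                start = None
    return found

# On a bacterium-style gene this finds the whole coding region in one go:
demo = "ATG" + "GCT" * 40 + "TAA"
print(open_reading_frames(demo))   # one ORF spanning the whole toy gene

Split the same coding region with intervening non-coding pieces, as in human genes, and a scan like this reports fragments or nothing at all — which is why human gene prediction needs much heavier machinery and still only finds about two thirds of the genes.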
It is too complicated for one organisation, and we want a lot of organisations working on this because it is so complicated. But whenever you have a lot of people working on something, you have the problem of integrating the data. How do you combine something over here with something over there? Of course you can use the web and links, but you end up with a situation where you click here and go to somebody else's website and it looks different. You have a lot of different interfaces; data is not presented quite the same way. Once you have ten or twenty of these things, it's a nightmare to work out who is predicting what.

So how can we address that? Well, there are various ways, and one of them is pure open source. Ensembl is an open source project, and the reason that's important is that we make our entire software system available, and our entire analysis available, and that at least helps people approach the problem using the same base system. It encourages some sort of standardisation. And there is open discussion of how we do things, so it is not as if we're cutting everybody out and imposing a standard.

But we really want to go beyond that, and so I want to talk about distributed annotation, because this is something I think has a lot of applications outside this area. Lincoln Stein in the States is behind the standard for this, but we at Ensembl have been heavily involved in actually implementing it. So, here is the idea. Imagine this is a piece of raw sequence here, and these little blobs here are features on the sequence: it might be the position of a gene, maybe a prediction of some repeat between the genes, I don't know. Anyway, here is a server providing this information, and here you are, viewing it on a web page. And basically, you get what they want to give you, and that's it. And if you are somebody outside — outside this rich group that has a big server set up, and you need to be fairly well off in order to set up a server for a human genome — then it is quite difficult to get your data in. If you can provide extra sequence to be incorporated, they are very happy to accept that. You might be able to persuade one of these big centres to run your programs, if you have developed some fancy new algorithm. But extra annotation — where you believe this gene is a little bit different for some reason — they are not going to accept that. So if you want to make that available to the world, maybe you can publish it in a scientific paper, but that just sits in the books somewhere. You can set up your own server, but then you have to have a reasonable amount of resources to duplicate what's here, because nobody is going to go to your server unless it has most of what is regarded as standard. And so that shuts a lot of people out.

So, here is a way of avoiding that. The idea is that you don't have to bother to serve everything. You just serve the little bit of extra information that you have calculated, and then you make sure everything is synchronised, so we are talking in the same coordinate system. And then you make the viewer cleverer, so the viewer now grabs information from two servers and does the synchronisation on the fly. Once you've done this with two systems, you can do it with n systems.
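Here is a minimal sketch of that idea — not the actual DAS wire protocol, just its shape: independent servers answer for the same agreed coordinate system, and the viewer merges their answers on the fly. The server names, the stand-in data and the fetch function are all invented for illustration.

from dataclasses import dataclass

@dataclass
class Feature:
    start: int    # position in the shared reference coordinate system
    end: int
    label: str    # e.g. "gene", "repeat"
    source: str   # which server claimed it

# Stand-in data; a real system would make a network request per server.
FAKE_SERVERS = {
    "big-centre.example": [(100, 500, "gene"), (800, 950, "gene")],
    "small-lab.example":  [(120, 480, "gene, corrected boundaries")],
}

def fetch(server, region):
    # Placeholder for "ask this server for its features in this region".
    lo, hi = region
    return [Feature(s, e, label, server)
            for (s, e, label) in FAKE_SERVERS[server]
            if s < hi and e > lo]

def merged_view(servers, region, muted=()):
    # The "cleverer viewer": pull from every server the user has not
    # switched off and sort everything onto the one coordinate axis.
    features = []
    for server in servers:
        if server not in muted:
            features.extend(fetch(server, region))
    return sorted(features, key=lambda f: f.start)

for f in merged_view(FAKE_SERVERS, (0, 1000)):
    print(f.source, f.start, f.end, f.label)

Nothing here privileges the big centre's server: the small lab's corrected gene boundaries appear on the same axis, and a user who distrusts either source simply adds it to the muted set.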
So here we have another one, and this could be somebody from bioinformatics analysing the whole genome with some stunning effort, or it could be a tiny biology group that is just working on one gene, as quite a lot of specialist groups are. And as far as the users are concerned, they might pull in hundreds or thousands of these different things from different servers. They can control what they see, so they can turn off things they don't like; if they think this guy is serving rubbish, they can turn him off. So this is democratisation of annotation. Everyone has an equal chance to speak, and you can choose what to listen to. I think there is a wealth of implications of this model in different fields. It is being strongly engineered in bioinformatics to handle this problem of annotation of the genome.

One of the things this does is make a clean split between databases, which curate and store data, and the front-ends: you can set up a data server, and somebody else can write the viewer. Which means that you don't have to do both. You can be good at one or the other, or you can do both if you want to, but it allows competition in the front-end view. It allows you to merge different sources of data if you think that is useful. It can be applied to all kinds of different things — I'm only talking about a linear sequence here, but you can annotate on stable identifiers if you have a system of stable identifiers. And, of course, non-biological systems: the thing which is most obvious to me is maps. You have seen various people trying to be portals for maps. What about somebody serving a reference map of Berlin, and then anybody else across the city being able to serve not just their own little website, but a little server saying: at these coordinates, I'm here, and this is a bit about me. Then anybody looking at that particular region of Berlin would be able to go off and talk to all these servers, pull in that information and see it for themselves. It wouldn't rely on a central person having to agree to accept everything. It would be decentralised.

And it also allows the possibility of servers providing summaries of other servers — digests. Say here we have three different annotations, and we don't want to look at all three of them; we would actually like to see a consensus. So you can have a server which talks to other servers and provides a consensus view of them (there is a small sketch of this below). This is being done for bioinformatics, but it has potential for a lot of things.

Open source, open standards — because to get a winning software project we'd actually like a lot of people's opinions to be involved. Open annotation — because it is not just the software; in the case of bioinformatics it is the data that is generated with it. Open data — because the data has got to be available. And, of course, the key application in this area is healthcare, and I'm sure Jamie will talk about everything that surrounds the availability of access to drugs. So, as I said earlier on, the fact that the genome was done in the public domain — and in fact even the private domain ended up needing it to be done in the public domain — indicates the power of being able to organise things if you have reasonable resources. It is clear that the genome project, particularly at the Sanger Centre, was well resourced by a charity; but if you have that reasonable resourcing, you can achieve things without a profit motive.
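As promised above, a sketch of the digest/consensus server, under an assumed and deliberately crude definition of consensus: report the regions that at least two of the contributing servers annotate. The interval arithmetic is a standard sweep over start/end events; the data is invented for illustration.

def consensus(annotations, min_votes=2):
    # annotations: one list of (start, end) intervals per upstream server.
    # Returns the intervals covered by at least `min_votes` servers.
    events = []
    for server_intervals in annotations:
        for start, end in server_intervals:
            events.append((start, +1))
            events.append((end, -1))
    events.sort()
    result, depth, open_at = [], 0, None
    for pos, delta in events:
        depth += delta
        if depth >= min_votes and open_at is None:
            open_at = pos                  # enough servers now agree
        elif depth < min_votes and open_at is not None:
            result.append((open_at, pos))  # agreement has dropped away
            open_at = None
    return result

three_servers = [[(100, 500)], [(150, 520)], [(400, 450), (900, 950)]]
print(consensus(three_servers))   # [(150, 500)] — where two or more agree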