Tim Hubbard on Panel "Information Diversity"

I'm going to talk about a few stories about the human genome project, because it has been reported in the media and not particularly accurately. Then I'll talk about an open source project for analysing the genome, and I'll talk about something going beyond that. Open source deals with software, but there is a question of the information that you generate with software: the actual annotation, where a gene is, things like that. How can you integrate that? How can you do something open-sourceish for that sort of information? And then some future directions from that. Finally, if there is time, I might talk about gene patents.

The human genome project. This is a huge project. It has sequenced three billion bases. To give you an idea of how large this is: it took eight years to sequence yeast, and that was finished in early 1999. That only had 30 million base pairs. So the human genome is enormously larger. It has been a huge logistic challenge to scale up to this size and to actually store the information correctly.

It was started as a concept. The politics is always complicated in these things. The concept actually came from the Department of Energy in the States. It was pushed by them partly because they were looking for something to spend the money that they were no longer going to spend on military research, and they were looking for something big to spend a lot of money on. This looked like an obvious target, because they said: well, three billion bases, one dollar per base, this is going to cost a lot of money. Anyway, that is where the concept originally came from.

Now, it is important to point out that obtaining the sequence is really just one little snapshot in the whole process. There is a lot of science that went on before obtaining the sequence which is important, involving a lot of researchers around the world. The concentration on sequencing was done in America and the UK, with contributions from Germany, France, Japan and China, but there were a lot of other researchers involved before that. Similarly, this is only just completing the sequence, and it is not completed yet, either. It is also just the start of a huge investigation to follow, to understand what it all means.

This is just giving you the scale again. You can see what the numbers were, and you can see how recent genomics is. The first complete genome, two million bases, was only completed in 1995. This is just history. Here is that first genome again. Large scale sequencing was set up, and the word Bermuda is important here, because the whole idea of making this data freely available is to some extent novel. Normally what happens in science is that people publish articles in scientific journals, and there has been a growing trend to release the data at that point. In the past there wasn't an issue of data: the scientific publications were the information. Increasingly, as datasets became larger and larger, there was a need for a repository to store the data that underlie the conclusions in the paper. And so people started to release that; they started to set up databases to store that information. But it got to a point with sequencing where sequences come out so fast, because the machines are so efficient, that you never even get around to writing a publication, and the sequence is so valuable to other researchers that you can't wait until you write the paper to release it. It sounds good to release it early, but then the question is: how do you release it? Do you release all of it?
And so there were large amounts of money being donated by governments to do this. A sort of deal was struck: these large institutes doing the sequencing would end up being very powerful, because they would have access to this information; so in exchange for the money they were getting to do the sequencing, they should also release it. And this was pushed forward as a bandwagon, eventually to the point that releasing data immediately is, in fact, now regarded as the correct thing to do. The Bermuda meetings codified this, such that data was released within 24 hours of sequencing. None of this holding it back to look at it, to see if you find something interesting you might patent. You just release it immediately, and that makes it simpler for everybody.

Of course, certain interests didn't completely like that. Particularly one of the people who was involved at that point: that is Craig Venter, who went on to form Celera. And Celera's mission, as set out, was to sequence the human genome in a much faster way than the public domain. In actual fact it didn't quite work out that way, because the public domain took up the challenge, and there was an announcement in 2000 that rough versions of the human genome had been generated by both sides. So that was what you read in the press.

Here is the stuff about Bermuda again: data is released every twenty-four hours, nothing is held back for patenting. It was connected to this funding of large scale sequencing, and the reasoning behind the underwriting by institutions like the Wellcome Trust, the world's richest charity, and the NIH is really that releasing this data gives the greatest public benefit. You have huge numbers of scientists. This data is complicated, and I'll go on to talk about that later. It is hard to understand what it means. A lot of eyes are needed. After all, it is like having source code written by someone else: if it is complicated source code, you potentially need a lot of people to work out what it really means.

So, one thing in sequencing is the huge scale. This is a production facility, not in the private but in the public domain. It just looks like a factory of highly automated robotic production. These are the sort of computer set-ups that are involved. The Sanger Centre has 40 terabytes of data. And this gives you an idea of the production speed-up from May 1999 to May 2000; it is actually a tenfold speed-up.

So, the one thing that this sort of scale being done in the public domain really squashes is the notion that researchers in the public domain do little research in their laboratories, and that you need a big, powerful, well organised private company to come and do the serious stuff to make drugs. This blows that away completely. You can have this sort of well organised operation entirely publicly funded, with academic salaries and associates. You might have to have more managers, you might have to have more production meetings, but it can be done. And this is really a demonstration of that. And so there is no reason why this sort of publicly funded approach can't tackle problems which at the moment are to some extent considered to be things that can only get done with private money.

This just gives a break-down of the people involved. It really was an international effort, although the large scale sequencing was done in five institutions. Japan made a very significant impact, and China, which joined very late on, still managed to do one percent of the production in time before the announcement.

So, the controversy. The controversy is all about the technique used.
The public domain's position was that if you just chop the genome up into little pieces and try to put them together, you would end up in a mess, because you wouldn't be able to work out the puzzle. And there were good reasons for that, because it is known that there are a lot of very similar pieces of sequence in the genome. Putting them together looked like it could be a very big problem. So this is the strategy used by the public domain: you take the twenty-four chromosomes in a cell, you chop them up into fragments which are around 100,000 bases long, and then you go and sequence each of those fragments using a random strategy. The private domain strategy was to just bypass that intermediate step and do it directly. And the claim was that they had clean enough sequencing facilities and clever enough computers that they could do it all in one step. That was the claim.

So what really happened? Here you have the Celera machine putting things together, and they generated a certain amount of data. And in fact, that was at no point in the publication — there was the press release in June 2000, but then there were the real scientific publications that came out in February this year. We didn't really discover the real story until February this year. Here you have this magical program, and nothing was ever announced that came as a result of just their data. In what they did talk about, they admitted they took the public domain data — which has a certain size, a certain length, a certain number of pieces, because it is a draft genome, a certain coverage. And they generated something, and the amazing thing is that it looks almost identical to the public domain version. And so a lot of people said: where is the need, and what are they generating out of all this? And the answer, if you actually go and look in detail, is: not actually that much. Because what turned out was that this assembler did not solve the problem. It didn't manage to put the whole thing together. It ended up in the mess that was predicted at the time Celera was announced — when, in fact, politically, what was going on in the Congress of the United States were representations from this company in various committees that the public domain should give up. They should let the company do everything; the public domain should go off and sequence the mouse genome or something like that, because the private domain could do it all. Of course the private domain would have liked to get access to and control of this data, but their claim was that the public domain could be shut down; it wasn't necessary to do it twice. In actual fact it turned out that this approach used by the public domain, with maps, was necessary, and you wouldn't have got a genome otherwise.

But this message has been hidden. It has been strategically hidden by a number of clever media events, including preempting the major announcement in February this year by leaking another story the day before, to make sure that was the thing that caught the attention of the media instead of the real story. So, there is a very general story here. These are scientists on both sides — scientists in the private domain and in the public domain — but there is a lot of money to be made here, and you shouldn't trust a scientist any more than anybody else. You need to go and look behind and check what people say in press releases. Check it against real data. And in this case, where you have scientists in a private company where the data is hidden, no one can go and check.
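The claim that very similar pieces of sequence would wreck a one-step assembly can be made concrete with a toy example. This is only a sketch — real assemblers deal with millions of overlapping, error-containing reads — but it shows the core ambiguity: a read drawn from inside a repeated element fits the whole genome in more than one place, while within a single ~100,000-base clone it usually fits only once. All names and numbers here are invented for illustration.

import random

random.seed(1)

# A repeated element that occurs twice in the toy "genome".
REPEAT = "".join(random.choice("ACGT") for _ in range(36))
genome = ("".join(random.choice("ACGT") for _ in range(60))
          + REPEAT
          + "".join(random.choice("ACGT") for _ in range(60))
          + REPEAT
          + "".join(random.choice("ACGT") for _ in range(60)))

def placements(read, reference):
    # Every position where the read matches the reference exactly.
    return [i for i in range(len(reference) - len(read) + 1)
            if reference[i:i + len(read)] == read]

# A read taken from inside the first copy of the repeat...
read = genome[65:85]

# ...fits the whole genome in two places, so a one-step whole-genome
# assembler cannot tell which copy it came from.
print("whole genome:", placements(read, genome))

# Within a single clone-sized piece there is only one copy of the repeat,
# so the same read places uniquely and the local puzzle is solvable.
clone = genome[40:140]
print("single clone:", placements(read, clone))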
So, even though the public domain was heavily sceptical, we believed for eight months that the private domain had done it — that they had done it at least as well as us — because we couldn't see their data. It was only when we saw the scientific publication that we were able to make any sort of check. And that was what we discovered.

Now, the opposite side, collaboration, has been mentioned already: the SNP Consortium. It is a very interesting partnership for a number of reasons, because it also underlines the reasons why this openness is a good thing. Here you have twelve companies giving away three million dollars each, generating a pile of data which is all freely released. The remarkable thing is that the companies got no specific benefit at all. The data was made available to the public the same day they saw it. They got no private pre-access at all, no special rights. The data is in effect cleverly protected, using an application of patent law, so that no one else can patent it. It is available to everybody, but it is still protected. The companies didn't want to spend this money and then have somebody else patent the results, which is quite reasonable. This has been extended to the mouse, because the mouse is another important genome for understanding the human, and similarly a large amount of money has been put in, because these companies would prefer this to be publicly available rather than having to go and buy it from a private company. I should just mention that this is also under discussion for a third project, involving protein structures.

Now, there is a reason why they are actually interested in doing this sort of thing. You could think it was just altruism. It is actually slightly different. Biology is too complicated for any organisation to have a monopoly on, and that includes pharmaceutical companies. They may be big companies with large research engines, but whenever they start a project — as has been quoted to me — they know that there is more research going on on that project elsewhere in the world than anything they are going to do inside their company. In particular, in the case of genomes, this pile of data is so valuable because it allows all kinds of research to be connected together. It is much more important in this case, but it generalises. If a core block of biological data is kept hidden from all those other researchers around the world, both in the public and in the private domain, who publish in the scientific literature, then as a company you shoot yourself in the foot. It is not just a question of 'do I spend three million in the public domain or do I spend three million accessing private data?' If you spend three million accessing private data, then that private data isn't going to be available to all those tens of thousands of researchers working on similar problems around the world, who are going to publish things and give you leads that will allow you to develop new drugs and make profits. So it is a completely non-altruistic view. It is a view that says: 'we get a lot of our information from everybody else out there.' It actually underlines how much private research depends on public research to get new ideas, to get new products.

Of course, going beyond all this, there are patents. So we have protected the DNA. That has actually been successful, no matter how it has been presented, because the genome is public. It is available to everybody.
You can go and buy one, sure, but it is the same as the one that is freely available, and the public domain project is filling in the gaps. The private domain is not doing anything on this, and so the public domain version is going to get completely finished — it is 50 percent finished now. When it is completely finished, there is no point in anybody selling anything, because there is only one human genome, and it is virtually identical for everybody in the world. So you only need one.

But then, the raw genomic data is one thing. The analysis of that, and the location of the genes and their functions, is something completely different. And so there is this issue of patenting genes. Business people have thought: well, the genome is protected, so maybe it is OK then. It is actually very seriously not OK, because it is very unlike the situation of a normal patent, regardless of whether you agree with patents or not. Even if you agree with patents, this is a special case which has more problems, and the extra problems are related to the fact that you cannot get around these patents.

In a normal situation, suppose you design a mousetrap and patent it, and you don't sell very many of them, because you want to sell them for a lot of money. Somebody else comes along and wants to license your patent on the mousetrap, and you say: 'No, I'm not interested. I'm making a lot of money anyway.' So there is the option for the second company to go and do some research and come up with another mousetrap — another design, a different type of mousetrap, a better mousetrap. And then there could be competition, and that will affect prices, and that will affect availability. That has been the standard model for arguing that patents are a useful thing: that they encourage research and development, but you can get competition nonetheless.

In the specific case of healthcare related to genes, that is not the case, because humans are all the same. We only have a fixed number of genes. There is no better gene for a particular type of thing. There is no alternative gene for breast cancer. There is only one — only one which has this specific effect. And so, if you have a patent on this gene, you have locked up research in this whole area, and that is the case very specifically with these two genes for breast cancer. We have already seen this company that holds patents affecting every application of these genes in the future. They have a lock on the system. The only application we have seen so far has been tests, but it is already there, and they are shutting down others. They have shut down all alternative testing in the US, and they have been pursuing Europe about this. The UK has done some sort of deal, because some of the research was done in the UK, and so the UK has a bargaining position. The French government is now challenging this in the courts. But this is just a sign of what's to come. Among the many gene patents that have been granted, there is an awful lot of submarine stuff in the States where it is not clear: people have patents pending — patents that are hidden in the US patent system which only appear if the person decides to activate them, if it turns out that the gene is actually really important. I'm sure James will talk about that a bit more.

So, analysing — another aspect. This business of interpretation is kind of how it feels sometimes: you've got this three-billion-piece jigsaw puzzle, and some guy comes along and says he's found a corner piece.
The awful truth, of course, is that we've got this genome, and it is huge by the standards of anything we have dealt with before. We're just beginning to work out the worm, which is 30 times smaller. And it is in these pieces — it is not one nice continuous thing. It is going to keep changing continuously for three years, which makes a nightmare for data tracking, but everybody in the world wants to use this thing now. So you have all these people sending you mails saying 'how do I get to it?'

This is part of the project which I'm jointly in charge of, called Ensembl. It is a joint project. It has around 30 people working on it, with a large grant supporting it from the Wellcome Trust. It is basically a website which shows all the information and the complete analysis of the human genome. And what is that analysis? Well, it is basically this. Here you have a tiny bit of the sequence. Up at the top here, this is a little bit of one of these; there is a whole chromosome down here — there are twenty-four of those; you can see this over here. So we are zooming in two levels. Now we have got down to a single chromosome, and now we have got down to a tiny region. This chromosome X is about 117 million bases long. At the top here we are looking at one megabase. Down here we have scrolled into a hundred thousand letters. If you printed the whole genome out on A4 paper — three billion letters, at a few thousand letters per page — it would be three quarters of a million pieces of paper. So it is kind of big. I haven't zoomed down here to the individual letters — there are four letters: A, C, T and G — it would be pretty pointless. But here you can see some genes. You can see a wide view of some genes up here, and there is an individual gene down here. I have got a gene structure over here, just to show you where a gene is.

Here is the genetic sequence. The gene sits in the middle, and it has things at the start and things at the end. It is like a piece of code: it has a start and a stop, and it has things controlling it, turning it on, at the beginning and the end maybe, maybe some way away. That is a gene, and it gets copied. You copy from the start to the stop, and then that gets turned into a protein. So the interesting thing is: how do we go about predicting these damn things? Well, what we do is we scan the sequence and look for things that look like a protein sequence — and that's about it, actually. This works reasonably well in small things like bacteria, because there this is what a gene really looks like, but it breaks down in higher organisms, because these genes are fragmented. They are chopped up into little pieces. So if we go back to this slide, this is down here. This is why the gene is shown as separate lumps with these connections between them: those are the links, and these are the bits of the gene here. And because it is all fragmented like this, it is very difficult to predict. At the moment we can just about work out where two thirds of the genes are, with an awful lot of effort. In terms of the things controlling them, no hope at the moment — completely streets away.

So, we have the sequence, but we don't really understand what it means, and we certainly can't work out how it will work. It is a critical resource for doing all this research. People work out what individual genes do, and you hear about those in the papers. But in terms of the ultimate objective, which is a complete understanding of a cell, then of a whole human body — you have a hundred million million cells in your body — it is a long way away, a lot more research to come. And so, I have said this now several times.
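To make the "scan for protein-like stretches" idea concrete, here is a minimal sketch of the simplest version of it: finding open reading frames, runs from a start codon to a stop codon. This is an illustration, not Ensembl's actual pipeline (which combines many kinds of evidence); it is roughly the level of trick that works on compact bacterial genes and fails on human genes, whose coding sequence is split into pieces.

STOP_CODONS = {"TAA", "TAG", "TGA"}

def open_reading_frames(seq, min_codons=30):
    # Scan the three forward reading frames for ATG...stop runs long
    # enough to plausibly encode a protein (reverse strand omitted).
    found = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    found.append((start, i + 3))
                start = None
    return found

# On a bacterium-style gene this finds the whole coding region in one go:
demo = "ATG" + "GCT" * 40 + "TAA"
print(open_reading_frames(demo))   # one ORF spanning the whole toy gene

Split the same coding region with intervening non-coding pieces, as in human genes, and a scan like this reports fragments or nothing at all — which is why human gene prediction needs much heavier machinery and still only finds about two thirds of the genes.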
It is too complicated for one organisation, and we want a lot of organisations working on this because it is so complicated. But whenever you have a lot of people working on something, you have the problem of integrating the data. How do you combine something over here with something over there? Of course you can use the web and links, but you end up with a situation where you click here and go to somebody else's website and it looks different. You have a lot of different interfaces; data is not presented quite the same way. Once you have ten or twenty of these things, it's a nightmare to work out who is predicting what.

So how can we address that? Well, there are various ways, and one of them is pure open source. Ensembl is an open source project, and the reason that's important is that we make our entire software system available, and our entire analysis available, and that at least helps people approach the problem using the same base system. It encourages some sort of standardisation. And there is open discussion of how we do things, so it is not as if we're cutting everybody out and imposing a standard.

But we really want to go beyond that, and so I want to talk about distributed annotation, because this is something I think has a lot of applications outside this area. Lincoln Stein in the States is behind the standard for this, but we at Ensembl have been heavily involved in actually implementing it. So, here is the idea. Imagine this is a piece of raw sequence here, and these little blobs here are features on the sequence: it might be the position of a gene, maybe a prediction of some repeat between the genes, I don't know. Anyway, here is a server providing this information, and here you are, viewing it on a web page. And basically, you get what they want to give you, and that's it. And if you are somebody outside — outside this rich group that has a big server set up, and you need to be fairly well off in order to set up a server for a human genome — then it is quite difficult to get your data in. If you can provide extra sequence to be incorporated, they are very happy to accept that. You might be able to persuade one of these big centres to run your programs, if you have developed some fancy new algorithm. But extra annotation — where you believe this gene is a little bit different for some reason — they are not going to accept that. So if you want to make that available to the world, maybe you can publish it in a scientific paper, but that just sits in the books somewhere. You can set up your own server, but then you have to have a reasonable amount of resources to duplicate what's here, because nobody is going to go to your server unless it has most of what is regarded as standard. And so that shuts a lot of people out.

So, here is a way of avoiding that. The idea is that you don't have to bother to serve everything. You just serve the little bit of extra information that you have calculated, and then you make sure everything is synchronised, so we are talking in the same coordinate system. And then you make the viewer cleverer, so the viewer now grabs information from two servers and does the synchronisation on the fly. Once you've done this with two systems, you can do it with n systems.
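Here is a minimal sketch of that idea — not the actual DAS wire protocol, just its shape: independent servers answer for the same agreed coordinate system, and the viewer merges their answers on the fly. The server names, the stand-in data and the fetch function are all invented for illustration.

from dataclasses import dataclass

@dataclass
class Feature:
    start: int    # position in the shared reference coordinate system
    end: int
    label: str    # e.g. "gene", "repeat"
    source: str   # which server claimed it

# Stand-in data; a real system would make a network request per server.
FAKE_SERVERS = {
    "big-centre.example": [(100, 500, "gene"), (800, 950, "gene")],
    "small-lab.example":  [(120, 480, "gene, corrected boundaries")],
}

def fetch(server, region):
    # Placeholder for "ask this server for its features in this region".
    lo, hi = region
    return [Feature(s, e, label, server)
            for (s, e, label) in FAKE_SERVERS[server]
            if s < hi and e > lo]

def merged_view(servers, region, muted=()):
    # The "cleverer viewer": pull from every server the user has not
    # switched off and sort everything onto the one coordinate axis.
    features = []
    for server in servers:
        if server not in muted:
            features.extend(fetch(server, region))
    return sorted(features, key=lambda f: f.start)

for f in merged_view(FAKE_SERVERS, (0, 1000)):
    print(f.source, f.start, f.end, f.label)

Nothing here privileges the big centre's server: the small lab's corrected gene boundaries appear on the same axis, and a user who distrusts either source simply adds it to the muted set.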
So here we have another one, and this could be somebody from bioinformatics analysing the whole genome with some stunning effort, or it could be a tiny biology group that is just working on one gene, as quite a lot of specialist groups are. And as far as the users are concerned, they might pull in hundreds or thousands of these different things from different servers. They can control what they see, so they can turn off things they don't like; if they think this guy is serving rubbish, they can turn him off. So this is democratisation of annotation. Everyone has an equal chance to speak, and you can choose what to listen to. I think there is a wealth of implications of this model in different fields. It is being strongly engineered in bioinformatics to handle this problem of annotation of the genome.

One of the things this does is make a clean split between databases, which curate and store data, and the front-ends: you can set up a data server, and somebody else can write the viewer. Which means that you don't have to do both. You can be good at one or the other, or you can do both if you want to, but it allows competition in the front-end view. It allows you to merge different sources of data if you think that is useful. It can be applied to all kinds of different things — I'm only talking about a linear sequence here, but you can annotate on stable identifiers if you have a system of stable identifiers. And, of course, non-biological systems: the thing which is most obvious to me is maps. You have seen various people trying to be portals for maps. What about somebody serving a reference map of Berlin, and then anybody else across the city being able to serve not just their own little website, but a little server saying: at these coordinates, I'm here, and this is a bit about me. Then anybody looking at that particular region of Berlin would be able to go off and talk to all these servers, pull in that information and see it for themselves. It wouldn't rely on a central person having to agree to accept everything. It would be decentralised.

And it also allows the possibility of servers providing summaries of other servers — digests. Say here we have three different annotations, and we don't want to look at all three of them; we would actually like to see a consensus. So you can have a server which talks to other servers and provides a consensus view of them (there is a small sketch of this below). This is being done for bioinformatics, but it has potential for a lot of things.

Open source, open standards — because to get a winning software project we'd actually like a lot of people's opinions to be involved. Open annotation — because it is not just the software; in the case of bioinformatics it is the data that is generated with it. Open data — because the data has got to be available. And, of course, the key application in this area is healthcare, and I'm sure Jamie will talk about everything that surrounds the availability of access to drugs. So, as I said earlier on, the fact that the genome was done in the public domain — and in fact even the private domain ended up needing it to be done in the public domain — indicates the power of being able to organise things if you have reasonable resources. It is clear that the genome project, particularly at the Sanger Centre, was well resourced by a charity; but if you have that reasonable resourcing, you can achieve things without a profit motive.
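As promised above, a sketch of the digest/consensus server, under an assumed and deliberately crude definition of consensus: report the regions that at least two of the contributing servers annotate. The interval arithmetic is a standard sweep over start/end events; the data is invented for illustration.

def consensus(annotations, min_votes=2):
    # annotations: one list of (start, end) intervals per upstream server.
    # Returns the intervals covered by at least `min_votes` servers.
    events = []
    for server_intervals in annotations:
        for start, end in server_intervals:
            events.append((start, +1))
            events.append((end, -1))
    events.sort()
    result, depth, open_at = [], 0, None
    for pos, delta in events:
        depth += delta
        if depth >= min_votes and open_at is None:
            open_at = pos                  # enough servers now agree
        elif depth < min_votes and open_at is not None:
            result.append((open_at, pos))  # agreement has dropped away
            open_at = None
    return result

three_servers = [[(100, 500)], [(150, 520)], [(400, 450), (900, 950)]]
print(consensus(three_servers))   # [(150, 500)] — where two or more agree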