Transcript
00:00:12[inaudible] Next time, which is not next week but the week after,
00:00:25the third, there will be a review of the topics, of course,
00:00:34so that you can ask questions ahead of the exam, and afterwards there will be
00:00:41a seminar from a colleague who is an expert on [inaudible].
00:00:49Another thing that I want to get completed during the course is that you have to form groups. So far, a number of people have assigned a group identifier to their team,
00:01:10so please leave a comment in this column to specify who your team members are for the final project.
00:01:18These are the study groups that you have to form for the final project. Groups of two are acceptable, but I would rather have groups of three or four people; otherwise there are too many projects.
00:01:38So please leave a comment next to your names saying who you want to work with for the final project.
00:01:45You are expected to present the project in the same exam session, but it is not mandatory: you can submit the project, and some of you can take the oral on one date and the others on another. The project is evaluated as a whole, and then the oral performance is individual.
00:02:12So please take a moment to [inaudible].
00:02:30I don't have much to say on this front. Today I would like to show you the RING web server, go deeper into the project, give you an example of an implementation and show the files and everything.
00:02:51The new topic that I want to cover is protein embeddings and how they can be used to predict some aspects or features of proteins, among other things.
00:03:11Do you have questions from the organizational point of view, or other types of questions?
00:03:18Remember that next week there isn't class because of this conference.
00:03:28Yesterday we had a problem with RING because the server was being updated. Today it should be available.
00:03:48Yes, okay. As I said last time, RING is available as a command-line tool (its C++ implementation), as a plugin for PyMOL, and as a web server.
00:04:14Online, the link is ring.biocomputingup.it, where "up" stands for University of Padova. This is also the URL of my lab, the lab of Professor Tosatto.
00:04:32It works this way: the input is a PDB file, and the file can have multiple models, as we have seen so far, but let's start with a simple case.
00:04:48For instance, we can click on one of these examples. You see that you have the parameters: the type of granularity of the nodes of the graph that we want to generate, that is, what we want to consider for every residue when it is evaluated to find the network;
00:05:12and the cardinality of the edges. We can return all combinations of possible interactions between two amino acids, we can provide just the most energetic interaction, or we can return the best one for every type of interaction, so multiple edges but only one per interaction type. These are different things.
00:05:46Then we have the thresholds: strict, relaxed, or you can define your own. There are other options, like [inaudible], whether you want to consider only amino acids or also water, or whether you want to use label IDs.
00:06:08This is very technical and is due to the fact that sometimes you have the same ID plus another identifier, a unique identifier for chains based on their entity.
00:06:27Those are the parameters. Let's switch to relaxed and see what the output is.
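The edge cardinality options described above can be illustrated with a minimal sketch. It is not RING's code: the mode names ("all", "one", "per_type"), the record layout and the energy values are assumptions made up for this example.

```python
# Hypothetical records: (residue_a, residue_b, interaction_type, energy).
# Residue labels and energies are invented for illustration only.
interactions = [
    ("A:10", "A:45", "HBOND", 17.0),
    ("A:10", "A:45", "HBOND", 12.0),
    ("A:10", "A:45", "VDW", 6.0),
    ("A:12", "B:30", "IONIC", 20.0),
]


def select_edges(records, mode):
    """Reduce the interactions of each residue pair according to `mode`.

    "all"      -> keep every interaction
    "one"      -> keep only the most energetic interaction per residue pair
    "per_type" -> keep the most energetic interaction per pair and per type
    """
    if mode == "all":
        return list(records)
    if mode not in ("one", "per_type"):
        raise ValueError(f"unknown mode: {mode}")
    best = {}
    for record in records:
        a, b, kind, energy = record
        key = (a, b) if mode == "one" else (a, b, kind)
        # Keep the record with the highest energy seen so far for this key.
        if key not in best or energy > best[key][3]:
            best[key] = record
    return list(best.values())


for mode in ("all", "one", "per_type"):
    print(mode, len(select_edges(interactions, mode)))
```

With the toy data, the three modes keep four, two and three edges respectively, which is the difference between "all", "most energetic" and "one per interaction type".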
00:06:48On the left you have a table with the number of interactions, divided into intra-chain and inter-chain interactions.
00:07:01We have the frequency; of course, this does not apply if you have a single-conformation structure.
00:07:09And you have the structure and the graph. They are interactive, so if you click on a node you should see the interaction in the structure.
00:07:25It should also work the other way around.
00:07:32It shows what is involved in the interaction, so it is super straightforward.
00:07:45Actually, I made a mistake: in this case we do have a few models, I think ten models, so we can filter, for instance, by the frequency of the various edges.
00:08:06We can ask for the edges present in a given fraction of the models, and you see the rest of them disappear. It is also interesting because the colors of the nodes correspond to the colors of the chains in this case.
00:08:31You can also look at the various models to see where there is a region that moves a lot, so it makes sense that not all the contacts are conserved. You can also look at it the other way around, which of course gives you different information: for instance, whether there are stable contacts or very variable contacts, and you see where they are here.
00:08:58Again, this is something that should highlight different aspects depending on the type of example that you are looking at.
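Filtering edges by their frequency across models can be sketched as follows. This is only an illustration of the idea, under the assumption that each model is available as a set of edges; the function names and the toy edges are hypothetical and do not come from RING.

```python
from collections import Counter


def edge_frequencies(models):
    """Fraction of models in which each edge appears.

    `models` holds one collection of edges per model; an edge is any hashable
    value, here a (residue_a, residue_b, interaction_type) tuple.
    """
    counts = Counter(edge for edges in models for edge in set(edges))
    return {edge: n / len(models) for edge, n in counts.items()}


def filter_edges(models, min_freq=0.0, max_freq=1.0):
    """Keep the edges whose frequency lies in [min_freq, max_freq]."""
    freqs = edge_frequencies(models)
    return {e: f for e, f in freqs.items() if min_freq <= f <= max_freq}


# Toy ensemble of four models (invented contacts).
models = [
    {("A:10", "A:45", "HBOND"), ("A:12", "A:30", "VDW")},
    {("A:10", "A:45", "HBOND")},
    {("A:10", "A:45", "HBOND"), ("A:50", "A:60", "IONIC")},
    {("A:10", "A:45", "HBOND"), ("A:12", "A:30", "VDW")},
]

stable = filter_edges(models, min_freq=0.9)    # contacts kept in almost all models
variable = filter_edges(models, max_freq=0.5)  # contacts seen in at most half of them
print(stable)
print(variable)
```

Looking at the high-frequency and the low-frequency edges separately corresponds to reading the network "the other way around": one view shows the stable core, the other the regions that move.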
00:09:10So what do you have below? You have an overview of the residues, the nodes, and the models: a matrix where you have models as columns and residues involved in contacts as rows, and you see where these contacts are.
00:09:36You see this for different types of interaction, for instance hydrogen bonds. You see the π-π stacking interactions: they are not there.
00:09:48You just have hydrogen bonds and Van der Waals interactions for this specific example.
00:09:55[inaudible]
00:10:05We also have very frequent contacts, like hydrogen bonds, and they are reported here. You can look for an example with more models.
00:10:22I can also show you how to search for examples in the PDB. In the PDB you can click on Advanced Search.
00:10:40In the advanced search you can choose the type of field that you want to search. In this case, maybe the number of models:
00:11:00it is called "Deposited Model Count", and we can ask for entries that have more than a hundred.
00:11:16Here are some examples.
00:11:27We can pick something similar to the previous example, go to its page, and just copy the identifier.
00:11:50The identifier is valid, and the server automatically fetches the structure.
00:12:00As you see, the program is very fast, and the overhead that you see here is to create the job and schedule it. The execution itself is quite quick.
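The same search can in principle be scripted. The sketch below only builds the JSON payload and does not send it; the endpoint URL and the attribute name `rcsb_entry_info.deposited_model_count` are assumptions about the RCSB PDB Search API and should be checked against its documentation before use.

```python
import json

# Assumed endpoint of the RCSB PDB Search API (nothing is sent in this sketch).
SEARCH_URL = "https://search.rcsb.org/rcsbsearch/v2/query"


def model_count_query(min_models):
    """Build a search payload for entries with more than `min_models` models."""
    return {
        "query": {
            "type": "terminal",
            "service": "text",
            "parameters": {
                # Assumed attribute behind the "Deposited Model Count" field.
                "attribute": "rcsb_entry_info.deposited_model_count",
                "operator": "greater",
                "value": min_models,
            },
        },
        "return_type": "entry",
    }


payload = json.dumps(model_count_query(100), indent=2)
print(payload)
```

The payload would be sent as the body of a POST request to `SEARCH_URL`; the identifiers in the response could then be pasted into the web server one at a time.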
00:12:39In this case you see that the network is crowded, because the edges are multiplied by the number of states that you have.
00:12:53Maybe you are not really interested in those interactions that are present in all models.
00:13:05You also see the tables showing contacts over models. In principle you could use this matrix to cluster the structures, and that is a different kind of clustering: you can compare two clusterings, one on contacts and one in coordinate (Cartesian) space.
00:13:28You see this for [inaudible]; you also have π-π stacking and ionic interactions.
00:13:40Of course, in this case we are looking at an NMR structure. An NMR structure is just a collection of possible solutions derived from NMR constraints, so it does not in any way represent the actual distribution of conformations in solution, the population of conformations in solution; they are just solutions.
00:14:07This is completely different from what you see in a molecular dynamics simulation, where you have time, which reflects the evolution of the structure over time. I don't have a simulation; actually I have a file, but I didn't try it, so I'm not sure about the format.
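Clustering models on contacts rather than on coordinates needs a distance between contact sets. A minimal sketch, assuming the Jaccard distance as the measure (the lecture does not name one) and invented contacts, is:

```python
def jaccard_distance(contacts_a, contacts_b):
    """One minus the size of the intersection over the size of the union."""
    union = contacts_a | contacts_b
    if not union:
        return 0.0
    return 1.0 - len(contacts_a & contacts_b) / len(union)


def distance_matrix(models):
    """Pairwise contact-based distances between models."""
    return [[jaccard_distance(a, b) for b in models] for a in models]


# Hypothetical contacts (pairs of residue numbers) for three models.
models = [
    {(10, 45), (12, 30), (50, 60)},
    {(10, 45), (12, 30)},
    {(70, 90)},
]

matrix = distance_matrix(models)
for row in matrix:
    print([round(value, 2) for value in row])
```

The resulting matrix could be handed to any standard clustering routine and the outcome compared with a clustering based on coordinate deviations.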
00:14:34The other thing that it shows is another representation, the probabilistic contact map: a contact map like the one we have seen during the course.
00:14:48The difference is that it is a heat map; the signal, which is not present in a distance-based map, is the frequency of those contacts across the models.
00:15:07And there is the table with the contacts and their frequency, so if you want to download the list of contacts and frequencies it is possible to do it by clicking here.
00:15:22As I said, these two representations make sense when you have a simulation or an ensemble. They make sense to spot, for instance, which residues are important.
00:15:38It is interesting to see what is quite stable.
00:15:49This is interactive.
00:16:03It seems to be interactive: it is showing all the interactions along the sequence, with the colors of the residues that are somehow in contact.
00:16:21Okay, questions?
00:16:31You can maybe use RING to understand how the networks have been generated. For the final project, you can compare what you have found with RING; you can take an example and check whether your predictions make sense.
00:16:54Or, if you have a few more interaction types that are not provided by RING in the structure, you can say: ok, look, it is not that the prediction is wrong, it is just that RING wasn't able to capture this type of interaction.
00:17:15So this is an easy way to also provide some figures in your reports.
00:17:23Otherwise you can always use PyMOL if you want to report an example.
00:17:36Ok, so we will stop here talking about RING. We will get back to the project later, but first I would like to tell you something about embeddings.
00:18:18ProtTrans embeddings. ProtTrans is the name of one of the most widely used AI models of this kind, developed in collaboration with [inaudible].
00:18:45They are everywhere: protein language models.
00:18:49Maybe you have heard about language models, but protein language models are something different.
00:18:57You can do function prediction, you can design proteins, you can predict classical features that were predicted in the past using multiple sequence alignments or just the sequence. They have been developed by many famous companies:
00:19:17Google, Microsoft, and especially [inaudible].
00:19:21ProtTrans was developed by the [inaudible] laboratory in collaboration with Google and the DeepMind infrastructure.
00:19:33You can understand why: the key is the resources that you need to train them.
00:19:42They are used for predicting function, and they are being used for predicting disorder.
00:19:51So many new methods use embeddings as input. How?
00:20:01We were used to predicting protein features where the input is usually the sequence.
00:20:10We have a model, a neural network for instance, and the model can be as complicated as you want.
00:20:24Then you predict something: the structure, the function, the effect of mutations; these are examples that we will see. So the question is: can we improve this representation? Does a different representation have an impact on the final output?
00:20:49The sequence itself is of course informative, but the number of possible combinations and patterns of amino acids is very large, and you always have to find the trade-off between learning everything and making a model that is able to generalize to new sequences.
00:21:14There are two types of classical encoding that have been used over the years.
00:21:23We have seen, or mentioned, one-hot encoding. One-hot encoding means that for every amino acid I generate a vector of twenty elements where I have all zeros except for one specific position, which is set to one.
00:21:47For one amino acid we have a one in position three, for another amino acid we have a one in position four, and so on for all the rest.
00:21:55You can stack these vectors in a line, so you can represent a window of the sequence as a stack of one-hot vectors, and this is the input of your neural network. You can also add some other
00:22:15input nodes or biases, but that is it.
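One-hot encoding of a sequence window can be sketched in a few lines. The alphabet ordering and the zero padding at the sequence ends are assumptions of this example; any fixed ordering of the twenty residues works as long as it is used consistently.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed ordering of the 20 standard residues
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot(residue):
    """Vector of 20 zeros with a single one at the position of `residue`."""
    vector = [0] * len(AMINO_ACIDS)
    vector[INDEX[residue]] = 1
    return vector


def encode_window(sequence, center, half_width):
    """Concatenate the one-hot vectors of a window centered on `center`.

    Positions that fall outside the sequence are padded with all-zero vectors.
    """
    encoded = []
    for pos in range(center - half_width, center + half_width + 1):
        if 0 <= pos < len(sequence):
            encoded.extend(one_hot(sequence[pos]))
        else:
            encoded.extend([0] * len(AMINO_ACIDS))
    return encoded


window = encode_window("ACDE", center=0, half_width=1)
print(len(window))  # three positions, 20 numbers each
```

A window of three residues therefore becomes a flat vector of sixty numbers, which is what the input layer of the network would receive.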
00:22:24You can do something similar, except that instead of having a binary vector you can have a vector with real numbers, which represent the frequency of an amino acid in a
00:22:41multiple sequence alignment of homologous sequences built from your input sequence: you can simply do statistics on the types of amino acids that you have in a given column and represent them.
00:22:59With this, the neural network is still able to learn.
00:23:11You can also simplify, for instance if you want to make the input less complicated. The idea is the same:
00:23:21instead of having just a one in the vector position corresponding to the amino acid, you provide some alternatives, indicating that this may not be the only amino acid that can be found in that position.
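The frequency-based encoding amounts to counting residues in each column of the alignment. A minimal sketch, assuming a toy alignment and ignoring gaps (real profiles usually add pseudocounts and sequence weighting, which are left out here), is:

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed ordering of the 20 standard residues


def column_frequencies(alignment):
    """Return one 20-element frequency vector per alignment column.

    Gap characters and any symbol outside the alphabet are ignored.
    """
    profile = []
    for column in zip(*alignment):
        counts = Counter(c for c in column if c in AMINO_ACIDS)
        total = sum(counts.values())
        profile.append(
            [counts[aa] / total if total else 0.0 for aa in AMINO_ACIDS]
        )
    return profile


# Toy alignment of four sequences and three columns.
alignment = ["ACD", "ACE", "A-E", "GCE"]
profile = column_frequencies(alignment)
for column in profile:
    print({aa: f for aa, f in zip(AMINO_ACIDS, column) if f > 0})
```

Each column now carries the observed alternatives for that position instead of a single one, which is the softer version of one-hot encoding described above.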
00:23:46Okay, so.
00:23:57The idea here is to see what the difference is with protein language models. A language model is meant to represent text input:
00:24:15you have the words, and the model is able to learn what the important words are, and it is able to replace or generate the missing word if you provide a sentence without that word.
00:24:36For proteins, the sentence is the protein sequence and the amino acids are the single words. Other things have been tried, like fragments of the sequence, domains, or other types of grouping, in different ways, but this seems to be the most effective way to build this type of model.
00:25:05The parallelism is that it is the same as natural language, of course. The idea is that the grammar that keeps an amino acid close to another amino acid, or in a specific position inside the sequence, is important: if I move amino acids or scramble the sequence, I lose the information.
00:25:36We know that this is the case because we know that positions are important for the structure. This is captured by language models, and it works very well.
00:25:52So again, the parallelism. The alphabet is similar: we have twenty amino acids, and natural language has just a few more letters.
00:26:07Words for protein sequences could be different things, for instance domains, with their classical InterPro identifiers.
00:26:22We could have information about motifs, about secondary structure and so on, but in the end we use single letters, and the sentence is the sequence.
00:26:37I'm sure you have seen this architecture. The idea is to train the parameters in these two blocks, where you have an encoder, which generates a representation of your input sequence, and a decoder that can translate this representation back into a sequence.
00:27:06The interesting thing is that in the training process you can train the model by masking words, and the model is asked to predict what the missing word is.
00:27:24So with "I love pizza very much", the output is the prediction of the masked word. In this way we can teach the model to learn the importance of a word in its context.
00:27:44In the case of protein sequences, you ask the model to predict the best letter that should be placed in a specific position.
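The masking step can be sketched independently of any model. The mask symbol, the 15% fraction and the function name are assumptions of this example; real protein language models use their own mask token and masking schedule.

```python
import random

MASK = "_"  # assumed placeholder; real models use a dedicated mask token


def mask_sequence(sequence, fraction=0.15, seed=0):
    """Hide a fraction of the residues.

    Returns the masked sequence and a dictionary mapping each masked position
    to the original residue, which is the answer the model has to predict.
    """
    rng = random.Random(seed)
    n_masked = max(1, round(len(sequence) * fraction))
    positions = sorted(rng.sample(range(len(sequence)), n_masked))
    masked = list(sequence)
    answers = {}
    for pos in positions:
        answers[pos] = masked[pos]
        masked[pos] = MASK
    return "".join(masked), answers


sequence = "MKTAYIAKQRQISFVKSHFS"
masked, answers = mask_sequence(sequence)
print(masked)
print(answers)
```

Because the answers come from the sequence itself, no manual annotation is needed to build the training examples.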
00:28:03You can consider it self-supervised learning: there is no need to specify any class. It is not unsupervised, because we actually know what the answer is,
00:28:19so it is something different; it is indeed self-supervised. The other thing that is very interesting is that, in principle,
00:28:31we have a lot of data to train this model, because we have billions of sequences that can be used to improve the models. So the question is how many parameters we want to have inside the models and how long we should train them for them to be accurate.
00:28:55So the thing that we can do is take the output of the encoder, this multidimensional representation that is generated by the encoder block, and use it as input for another model for
00:29:20downstream tasks. In this part, we train and find the parameters that optimize the generation of the letter in specific positions, so the model learns what the meaning of the letters is.
00:29:45The idea is that when we encode this information we have a representation that is very informative about the grammar and the biology of protein sequences.
00:29:58So ProtTrans does something like this: for every letter we have a number of features.
00:30:11The dimension of this matrix in this case is one thousand twenty-four for every letter, so for every residue we have one thousand twenty-four elements, numbers that are not only one or zero but real values, and they represent something.
00:30:44Getting back to what we saw before, for instance the prediction of secondary structure: you can simply give the embedding as input to the neural network that you want to train to predict something, in this case the secondary structure.
00:31:08The generation of embeddings starting from the sequence requires you to hold the parameters in memory.
00:31:20In general that means gigabytes; there are a lot of parameters, of course, but once they are loaded in memory the conversion of a sequence to the embedding space is very fast. In the end, this is an input for the neural network.
00:31:41To generate a multiple sequence alignment, instead, you have to take your sequence and search it against a database, which takes minutes, five minutes at least. Then you generate an alignment of thousands of sequences and you feed that to the network.
00:32:09The time for the generation of the embedding, let's say, is constant. It depends a little bit on the length, but the problem of the length is more about the memory of your machine rather than computational time. In general,
00:32:27if you have a very long sequence, you usually fragment the sequence into chunks of two thousand amino acids and embed one fragment at a time.
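Fragmenting a long sequence before embedding it is a simple slicing operation. The sketch below assumes non-overlapping chunks and uses the two thousand residue limit mentioned above as the default; whether overlapping windows would work better is not discussed in the lecture.

```python
def split_sequence(sequence, max_length=2000):
    """Split a sequence into consecutive chunks of at most `max_length` residues."""
    if max_length <= 0:
        raise ValueError("max_length must be positive")
    return [
        sequence[start:start + max_length]
        for start in range(0, len(sequence), max_length)
    ]


long_sequence = "A" * 4500
chunks = split_sequence(long_sequence)
print([len(chunk) for chunk in chunks])
```

Each chunk would be embedded separately, and the per-residue vectors could then be concatenated in the original order.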
00:32:41This is called transfer learning: we can use a model that has been trained in a self-supervised manner, so without knowing anything about biology, and we can transfer this information to learn something different, for instance some classification task.
00:33:08So there is a moment where we learn, or train, the representation, and there is a moment where we use this representation to train the model. Of course,
00:33:24training the second one is much faster.
00:33:31It is not always easy, but for instance function prediction or prediction of the effect of mutations is something that you can easily train on your own laptop.
00:33:48Another interesting aspect is that for training the embeddings they used billions of sequences, while for training a model on other annotations, other features, you sometimes have a limited amount of data.
00:34:11With transfer learning you need much less data, and you are not as limited by the examples that you have.
00:34:30Okay, so this is a figure from the paper where they explain this.
00:34:41You have the input sequence; the input sequence is encoded into tokens
00:34:50and you get the encoding for the positions. You have the transformer model, and the embedding is the output of the last layer.
00:35:02So you can take the values of the last layer, and this is what is fed to the downstream model.
00:35:13You see that for every residue you have one thousand twenty-four elements. You can, for instance, train a convolutional network and learn, for instance, secondary structure,
00:35:29that is, per-residue features. There is another thing you can do, which is called pooling: instead of training a model for learning features for every single residue, you can, for instance, predict features of the entire sequence. What you want to do is average or combine these vectors:
00:36:02you can take the average, or you can take the maximum, or whatever you want.
00:36:11In most cases they take the average, so you create a single vector of one thousand twenty-four elements that represents the protein: you average over all the residues to have a single vector, and that vector can capture properties of your protein. In this case, for instance, subcellular localization:
00:36:38whether this protein should belong to the nucleus, or to an organelle, or
00:36:52resides inside the cytosol.
00:36:58As I said, there are alternatives.
00:37:02You can also concatenate; it might work better
00:37:10for some other tasks.
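Pooling per-residue embeddings into a single per-protein vector can be sketched with plain lists. The toy embedding has four features per residue instead of the one thousand twenty-four mentioned in the lecture, and the concatenation of mean and maximum is just one possible way to combine them.

```python
def mean_pool(per_residue):
    """Average each feature over all residues."""
    length = len(per_residue)
    return [sum(column) / length for column in zip(*per_residue)]


def max_pool(per_residue):
    """Take the maximum of each feature over all residues."""
    return [max(column) for column in zip(*per_residue)]


def concat_pool(per_residue):
    """Concatenate mean and max pooling into one longer vector."""
    return mean_pool(per_residue) + max_pool(per_residue)


# Toy embedding: 3 residues x 4 features.
embedding = [
    [1.0, 0.0, 2.0, -1.0],
    [3.0, 0.0, 0.0, 1.0],
    [2.0, 3.0, 1.0, 0.0],
]

print(mean_pool(embedding))
print(max_pool(embedding))
print(len(concat_pool(embedding)))
```

Whatever the protein length, the pooled vector has a fixed size, so it can be fed to a classifier for per-protein properties such as localization.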
00:37:15I just want to show you the sizes of the datasets used in the paper. There is also
00:37:30another aspect: the amount of example sentences that we have for proteins is larger than the usual inputs used to train
00:37:45natural language models. Here is an example: the Google Billion Word corpus is several hundred million tokens. For proteins,
00:38:00you see here the number of proteins, in millions, and the number of amino acids.
00:38:09You have UniRef50, so forty-five million proteins and fourteen billion amino acids, and you see UniRef100, and so on.
00:38:32You see that there you have eighty-eight billion tokens, and then there is a database that we didn't see, BFD, the Big Fantastic Database, which also includes data from metagenomics.
00:38:50Metagenomics means that you sample sequences without focusing on a specific organism. For instance, you sample the gut microbiota, or the bottom of the ocean, or soil; in general, from the environment.
00:39:12It means that you are collecting a number of species all together. You lyse the cells, you get the genetic material and you sequence it; you don't actually know what you are collecting.
00:39:26You find new genes, new variants of genes that you know exist for sure, so you collect new natural sequences without actually knowing where they came from.
00:39:42It is impossible to identify which organism a sequence belongs to, because you have millions of organisms together, but you collect variation that you know exists for sure.
00:40:03So you get these numbers: about four hundred billion tokens available for
00:40:11your models. Which is better?
00:40:15The paper shows that you have to choose, and there are decisions to be taken.
00:40:24I just want to give you an idea of how much faster the generation of these embeddings is compared to using evolutionary information, for instance multiple sequence alignments.
00:40:40Here are examples of multiple sequence alignment generation using [inaudible];
00:40:49the second one is just to give you an alternative.
00:41:00The differences are in the input dataset used and in the type of model that is used.
00:41:15You see that calculating evolutionary information takes at least five or six times longer to process the same input.
00:41:25And this is the important part: how do we know that these models capture functional properties?
00:41:38They did a test: they took the embeddings generated by the various models and tried to cluster those vectors, and they also compared
00:41:57a classifier trained on the language model embeddings with one trained without using the model, on sequence alone. For this,
00:42:10you see here, they took the embedding
00:42:16for every single amino acid and computed a new, reduced representation; it is like [inaudible], but more complicated. It goes from one thousand twenty-four dimensions
00:42:33to two dimensions.
00:42:35What you see, for instance, if you look at the different types of amino acids, is that when you use protein language models
00:42:45the amino acids cluster based on their physicochemical properties. You see here, for instance,
00:42:59hydrophobic amino acids, aliphatic and aromatic,
00:43:04charged, positive and negative, and you also see the size here.
00:43:13It captures very well the properties of amino acids. Another task is the classification of secondary structure classes.
00:43:26In this case they used
00:43:31structural classes from SCOP, so the classification of every single protein in SCOP.
00:43:40Remember the SCOP classification, where you have all-alpha, all-beta, alpha/beta,
00:43:49and other classes like alpha plus beta. What you see here is that, depending on the secondary structure classification,
00:44:00some clusters are clearer here, with
00:44:10a more homogeneous distribution of the classes. Then they tried to classify proteins based on
00:44:20[inaudible]; this time you see
00:44:29eukaryotic proteins separate from bacteria, from viruses and from archaea, in a way, at least for some clusters. All this says that the embedding
00:44:47captures multiple and orthogonal aspects of the information that is in the sequence, and it is effective.
00:45:03For instance, here is an example of prediction performance.
00:45:12We have a dataset, and this axis represents the difficulty of the dataset: Neff is
00:45:21the number of effective sequences, that is, the number of sequences in the multiple sequence alignment after filtering.
00:45:31So at the beginning you have the more difficult part of the dataset. In blue you
00:45:42have a protein embedding method, and in the other color a classical state-of-the-art
00:45:51method for secondary structure, and you see that, especially for difficult cases, when the multiple sequence alignment only
00:46:04has a small number of representative sequences, embeddings are clearly better
00:46:12than the classical evolutionary-based methods, because the models have already learned what is
00:46:20important; they do not need to evaluate it at the input level. [inaudible]
00:46:33am so it will also for small families so for where.
00:46:45Are they also made a Saint consideration about the size of
00:46:51the model so you can
00:46:54not only play on the size and the number of aim parameters.
00:46:59You can also play on the amount of cycles.
00:47:03How long we are you would be training your your model so in.
00:47:10Found that for this specific prediction is better to the time for
00:47:18long term exchange rating mount of
00:47:23parameter so conclusion is that probably laser model for faster.
00:47:30They are not for enough to general is better
00:47:36and di the fix parameter bias di computing power for premium.
00:47:45So, other questions? This was just an introduction to
00:47:52the fact that embeddings are now the state of the art for
00:47:58feeding or implementing neural networks or models to predict features.
00:48:10Another interesting topic is another method, called Foldseek, which is very interesting. Essentially, now that we
00:48:25have the AlphaFold database, so we have structures for millions of proteins,
00:48:32the idea is to implement something like a sequence method, like the software used for searching sequence databases,
00:48:41but for structures, searching for similar structures at the database level very quickly.
00:48:51The simple idea would be that you take your structure and you align it against the whole database, but that is slow.
00:49:05The reason for preferring structure comparison over sequence comparison is
00:49:12that it should be much more sensitive.
00:49:23As we have seen so far, we can have proteins with
00:49:30low sequence similarity sharing the same structure.
00:49:34Structure is what lets you actually discover genes and proteins that are related but distant, and that you would otherwise miss.
00:49:49So in principle searching on structures should be way more sensitive than just using sequences. Sequence search is definitely precise, so you easily find proteins with high identity,
00:50:12but finding cases where we have the same structure and different sequences is of course impossible using sequence similarity alone.
00:50:24The question is whether we can do it.
00:50:28I should also mention that there is also the ESM Atlas database, which has been developed by Facebook and contains more than six hundred million proteins.
00:50:43It is even more complicated to think about that.
00:50:49So the thing to look at is the timing.
00:50:55If you have one protein versus one hundred million proteins, a structural alignment search takes
00:51:01about a month on a single CPU. If you want to do all versus all, and maybe cluster the structures, it would take ten millennia on a thousand cores.
00:51:17For sequences this is far faster: all versus all for sequences takes under one week with clustering software like [inaudible].
00:51:42So what you see is one week versus ten millennia. We have to find a way to imitate sequence search, but using structures.
00:51:53BLAST is fast because it employs
00:52:02word searching: it takes your input and searches for words, and your dataset is indexed
00:52:11in a way where the words are organized so that you know where every occurrence of a word is.
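Word searching against an index can be sketched as follows. This is a simplified, exact-match version of the idea: BLAST itself also considers similar words that score above a threshold and then extends the seeds into alignments, which is not shown here. The word length and the toy database are assumptions of the example.

```python
from collections import defaultdict


def build_index(database, k=3):
    """Map every word of length k to the places where it occurs.

    `database` maps sequence names to sequences; each occurrence is stored as
    a (name, position) pair.
    """
    index = defaultdict(list)
    for name, sequence in database.items():
        for pos in range(len(sequence) - k + 1):
            index[sequence[pos:pos + k]].append((name, pos))
    return index


def find_seeds(query, index, k=3):
    """Return (query_position, name, database_position) for exact word matches."""
    seeds = []
    for q_pos in range(len(query) - k + 1):
        for name, db_pos in index.get(query[q_pos:q_pos + k], []):
            seeds.append((q_pos, name, db_pos))
    return seeds


database = {"seq1": "MKTAYIAK", "seq2": "GGTAYGG"}
index = build_index(database)
seeds = find_seeds("TAYI", index)
print(seeds)
```

The index is built once; each query then only needs dictionary lookups, which is why the search does not have to scan every database sequence.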
00:52:20For structural similarity
00:52:24this is a problem, because it is not like sequence, where you can easily open gaps between words that are identical. Here, if you change the angle of one region,
00:52:38maybe you can still match what is aligned locally, but the rest is completely different, so it is not
00:52:47actually something you can implement in this way.
00:52:53So the idea is to do something similar to an
00:53:02embedding: to represent the amino acid sequence in an alphabet,
00:53:12one letter per amino acid. And what is
00:53:20actually represented is not the sequence,
00:53:23the connection of one amino acid to
00:53:25the next, or the structure at the local level,
00:53:28but rather the interactions
00:53:32in three-dimensional space.
00:53:39How does it do that? It is not encoding
00:53:47what is connected in the sequence; it is actually encoding
00:53:53each amino acid and its neighborhood, so it
00:53:58is like the contacts we have seen so far. For
00:54:03instance, you see that this amino acid is
00:54:07very close to this one. So what
00:54:12is the input used to generate the letter that
00:54:16represents this situation? You see the neighbor contacts,
00:54:24the connection between this and
00:54:27this. For each amino acid it takes the
00:54:34closest one in space, looking across the chain, and then it encodes information like
00:54:41the distance and the angles that are
00:54:44formed in space, all these angles.
00:54:50Here is the combination of
00:54:57the various features that are involved: you see
00:55:01there is the distance, and the others are just angles.
00:55:05So it captures the distance between the closest amino acids
00:55:11and also the local information of the two
00:55:13amino acids that are in this interaction.
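To make the "closest amino acid in space" step concrete, here is a minimal sketch. It is not the encoder described in the lecture: it only finds, for each residue, the nearest residue that is not adjacent in the sequence and reports the distance, leaving out the angle features. The `min_separation` cutoff and the toy hairpin coordinates are assumptions for illustration.

```python
import math

def nearest_spatial_neighbor(ca_coords, min_separation=2):
    """For each residue, return (index, distance) of the closest residue in space
    that is at least `min_separation` positions away in the sequence."""
    result = []
    for i, a in enumerate(ca_coords):
        best_j, best_d = None, math.inf
        for j, b in enumerate(ca_coords):
            if abs(i - j) < min_separation:
                continue  # skip the residue itself and its sequence neighbors
            d = math.dist(a, b)
            if d < best_d:
                best_j, best_d = j, d
        result.append((best_j, best_d))
    return result

# Hypothetical C-alpha trace of a small hairpin (units: angstrom).
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0),
          (7.6, 3.8, 0.0), (3.8, 3.8, 0.0), (0.0, 3.8, 0.0)]
neighbors = nearest_spatial_neighbor(coords)
```

In the hairpin, the first residue pairs with the last one, which is far away in sequence but close in space; this is the kind of information a sequence-only alphabet cannot see.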
00:55:18Last, this is an encoder:
00:55:28it represents the result of the encoder
00:55:31as a set of states, and for convenience the states
00:55:37are letters, so we have a number of
00:55:40letters, and then it maps back to this.
00:55:44So the model has been
00:55:49trained to learn these angles and this local information
00:55:55from native structures, and in the end
00:56:01what you get is that this is translated into
00:56:05a letter that corresponds to the environment
00:56:10of the specific amino acid. So you do not
00:56:12capture just sequence information, meaning
00:56:14that this is close to this amino acid in the sequence; you are actually encoding
00:56:21the structural properties of the amino acid and
00:56:25its neighborhood. So you are encoding the structural properties,
00:56:31the neighborhood, the actual conformation
00:56:37of the neighborhood, for every amino acid. So you can
00:56:42translate a chain connected in the chemical way into a
00:56:48sequence of letters that carry properties like these.
00:56:56In the end, once you encode the sequences in
00:57:05this way, you can apply
00:57:08sequence alignment. You can write your entire database
00:57:14encoded with this new type of alphabet, and you can search
00:57:19your input sequence against it. It is
00:57:24similar to what Rosetta does for
00:57:28short fragments in the PDB, searching for
00:57:32similar things.
00:57:36This is very clever, because once you
00:57:38have this representation you are not only capturing the
00:57:43local conformation of a specific fragment; you are actually
00:57:49capturing the environment, which is even more specific than in Rosetta.
00:57:57So they managed to encode this in twenty letters, like the twenty amino acids.
00:58:09You don't have a dependency between consecutive letters,
00:58:16because they actually describe things that are close in three-dimensional space.
00:58:25[unclear] is connected with the amount of information,
00:58:36so it is not like
00:58:38structure superposition,
00:58:43where [unclear] is more informative because we have more [unclear].
00:58:49[unclear] position has the same information,
00:58:53including things that are not structured:
00:58:59they have a signal that is
00:59:00less informative compared to the elements of structured proteins.
00:59:07And how can we search with this? It is a
00:59:12typical sequence search algorithm, the same thing as before: you
00:59:18have a query, you compare your query with every
00:59:23single target, you identify similar words,
00:59:29similar local elements, then you combine the elements together
00:59:35and extend them, and you can create
00:59:39something like this, where you have this part of the sequence,
00:59:42the words, the gap, and the second part.
00:59:49This is the last step of the algorithm:
00:59:55it also applies a TM-score or
01:00:01[unclear] alignment.
01:00:08For the top targets, out of two million you can end up with
01:00:16the ten best alignments, and
01:00:20the alignment is actually super, super fast.
01:00:26So, much faster.
01:00:30It is thousands, twenty thousand times faster than
01:00:38[unclear] when compared on SCOP, on small
01:00:44proteins: twenty thousand times faster than [unclear].
01:00:50[unclear] one billion. Here you have
01:00:55some statistics about [unclear], I don't know.
01:01:01They are very [unclear].
01:01:05OK, that was Foldseek, and it is very popular by now, so it is the standard.
01:01:15If you want to search for similarity across
01:01:18proteins it is recommended; you can also compare it with
01:01:23other software. There is a web server. And I want to show you
01:01:32another piece of software developed by
01:01:35the same group, for which the idea
01:01:42is not the same: it is the compression of structures.
01:01:49That is not a small problem: if you want to download the [unclear] database you need
01:01:57twenty-five terabytes of space.
01:02:00So it is not very manageable,
01:02:06because it is so large.
01:02:14PDB consumes eighty bytes for
01:02:20each atom, and of course we
01:02:22have many alternatives for storing the structures.
01:02:26We can use [unclear], which is a format developed for [unclear]
01:02:33files, and there is also another one, MMTF, which
01:02:40exploits [unclear] compression.
01:02:46What this is, is just a pipeline and
01:02:52the service that they provide. This is the logo.
01:02:58They make the database a single file that contains the PDBs and
01:03:03the index, but that is not the interesting part. The interesting part is the way
01:03:08they encode, or compress, the PDB. It
01:03:14is not just taking the text of the PDB and compressing it,
01:03:22but a new way to describe the coordinates.
01:03:27They describe the atoms not by Cartesian coordinates but
01:03:32instead only by angles. The idea is that you would
01:03:37need to provide three numbers for every atom, but you
01:03:42can just provide the angles that describe
01:03:45how to pass from one amino acid to
01:03:51the next, and also for the side chains. So for the backbone you
01:03:58have three angles instead of
01:04:01nine elements, so the compression is not that much,
01:04:05but for side chains the gain becomes much larger because of [unclear]. In
01:04:12general the torsion angles are
01:04:14chi: chi one, two, three, four, five, and so on.
01:04:23The compression works this way: you
01:04:28represent the amino acids just
01:04:32by angles, and every twenty-five amino acids
01:04:37you have the coordinates, which are used because when you
01:04:42decompress, using the function that
01:04:46converts angles to coordinates, you accumulate errors
01:04:50[unclear]. They are used to correct
01:04:53your conversion. [unclear] which
01:05:04converts angles into coordinates, and
01:05:07this is done in one direction between two
01:05:10consecutive anchors and [unclear] coordinates.
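The conversion between angles and coordinates mentioned above can be sketched as follows. This is an illustration under assumptions, not the code of the tool discussed in the lecture: it uses a standard internal-coordinate construction (often called NeRF) to place one atom from the three preceding atoms, a bond length, a bond angle, and a torsion angle. All function names and the four example points are hypothetical.

```python
import math

def sub(u, v):
    return tuple(a - b for a, b in zip(u, v))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cross(u, v):
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])

def unit(u):
    n = math.sqrt(dot(u, u))
    return tuple(a / n for a in u)

def bond_angle(b, c, d):
    """Angle b-c-d in radians."""
    return math.acos(dot(unit(sub(b, c)), unit(sub(d, c))))

def dihedral(a, b, c, d):
    """Torsion angle a-b-c-d in radians."""
    b0, b1, b2 = sub(a, b), unit(sub(c, b)), sub(d, c)
    # Remove the components along the central bond, then measure the angle
    v = sub(b0, tuple(dot(b0, b1) * x for x in b1))
    w = sub(b2, tuple(dot(b2, b1) * x for x in b1))
    return math.atan2(dot(cross(b1, v), w), dot(v, w))

def place_atom(a, b, c, length, angle, torsion):
    """Rebuild atom d from atoms a, b, c and its three internal coordinates."""
    bc = unit(sub(c, b))
    n = unit(cross(sub(b, a), bc))
    m = cross(n, bc)
    local = (-length * math.cos(angle),
             length * math.sin(angle) * math.cos(torsion),
             length * math.sin(angle) * math.sin(torsion))
    return tuple(c[i] + local[0] * bc[i] + local[1] * m[i] + local[2] * n[i]
                 for i in range(3))

# Round trip on four arbitrary points: coordinates -> internal values -> coordinates.
a, b, c, d = (1.2, 0.3, -0.5), (0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (2.0, 1.2, 0.7)
length = math.dist(c, d)
angle = bond_angle(b, c, d)
torsion = dihedral(a, b, c, d)
rebuilt = place_atom(a, b, c, length, angle, torsion)
```

With full floating-point precision the round trip is essentially exact. Once the angles are stored with few bits, each rebuilt atom carries a small error that propagates along the chain, which is why stored reference coordinates at regular intervals are useful for correcting the reconstruction.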
01:05:19Just to give you an idea,
01:05:23to visualize what the compression is: we have
01:05:26one byte that tells you what you need to know, how many [unclear].
01:05:37[unclear]. You see, you have ninety-seven
01:05:42bytes that are reduced
01:05:45to thirteen bytes per [unclear].
01:05:51In terms of size, you see it
01:05:56compared to the original PDB or [unclear] in terms of kilobytes,
01:06:01and also compared with gzip: [unclear]
01:06:06seventy kilobytes versus seven.
01:06:09You have the compression time, and you see
01:06:15it is fast.
01:06:19And you also have the errors in the compression:
01:06:25this is not lossless compression,
01:06:29and that, you see, is very, very important.
01:06:35Okay, so.
01:06:39On the other slides I have put
01:06:44some details related to the training of
01:06:51the language models. Questions? Thoughts?
01:06:58About the papers: there are just three papers on
01:07:02these models, and I invite you to at least [unclear]; it is
01:07:08very enlightening about what you can
01:07:12do and how novel it was, because it was published just a few years ago.
01:07:24What is [unclear]?
01:07:36All prediction methods,
01:07:41or classical methods developed in bioinformatics,
01:07:46were reimplemented using
01:07:49protein language models, and in all contexts, in all cases, there was,
01:07:56at least, in the worst-case scenario, a [unclear] improvement over using
01:08:01the best state-of-the-art evolution-based methods. So
01:08:06this is a really interesting topic,
01:08:09and these methods are still improving, because there is also
01:08:13a little bit of tuning of
01:08:17these models. Another aspect is to fine-tune
01:08:21the language model: in principle you could start with the parameters that
01:08:28have been trained in the [unclear] manner and
01:08:38then tune them for a class.
01:08:43So in a supervised manner. It works,
01:08:48also for function prediction, [unclear] prediction,
01:08:51whatever you want to do; it works,
01:08:53but of course it is demanding in terms of resources,
01:08:57even if it is less demanding than training everything from scratch. So
01:09:02starting from parameters that have already been trained is actually
01:09:09much less costly than training from scratch. So you
01:09:14can also take the whole set of parameters that have been trained,
01:09:19put them in your transformer, assign it
01:09:24to a specific class or task,
01:09:30and then you recalibrate, or tune,
01:09:35all the parameters and learn
01:09:37better how to map function [unclear].
01:09:42Which of these is better is still an open question, and it is still
01:09:51discussed; all the labs are working on these things
01:09:57in these very days. Other questions? Is it clear
01:10:02what the idea is? Okay, so I will close
01:10:10this part here now and get back
01:10:16to the final project.
01:10:48Okay, I just want to
01:10:54show the files that are available. There is this classification
01:11:04and features thing. I
01:11:08think I downloaded it. This is the training dataset, and
01:11:12this is a bunch of scripts, I think [unclear].
01:11:34Okay, I think that if you
01:11:37download it you have something like this.
01:11:43In the data folder I have put
01:11:49the features that I calculated for you, so in principle this is your training set.
01:12:02You see, for every PDB you have a table with the columns
01:12:07defined in the way that I showed you yesterday: just a few numbers
01:12:13or strings, and the interesting thing is the last column.
01:12:19The last column is the classification.
01:12:21So you have all the features for
01:12:23the source and the target amino acids,
01:12:26plus the class. Be aware that some pairs [unclear].
01:12:33[unclear] is the first one.
01:12:39No.
01:12:42You see, in this case I have amino acid 23
01:12:47that is interacting with amino acid 27.
01:12:54So we have two rows for the same pair of
01:12:58amino acids: you see, in one case it is an H-bond and in the other it is a van der Waals
01:13:04interaction. So be careful, because this is very
01:13:07common, and one way or the other you
01:13:11have to decide which one
01:13:15goes to the negatives, or whether it is a negative at all,
01:13:20depending on the model that you are using or the approach you are using.
01:13:30Be careful with this. Of course the correct answer would be to
01:13:35have both; you could also decide to keep one or the other, for instance.
01:13:41Maybe you want to keep just the H-bond instead of the
01:13:43van der Waals. Van der Waals is
01:13:45the least interesting thing, definitely; the H-
01:13:49bond, the ionic, or the other types of interaction are not.
01:13:57So, for instance,
01:14:01if you are training a model for one class versus the others, be careful
01:14:08not to remove this example, or consider this example as a
01:14:12negative, when you [unclear]. This is something to be careful about.
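The two options mentioned above for pairs that appear with more than one class can be sketched like this. The column names (`s_resi`, `t_resi`, `interaction`), the label strings, and the priority order are all assumptions for illustration, not the names used in the course tables.

```python
from collections import defaultdict

# Hypothetical rows: the pair (23, 27) appears twice with different classes.
rows = [
    {"s_resi": 23, "t_resi": 27, "interaction": "HBOND"},
    {"s_resi": 23, "t_resi": 27, "interaction": "VDW"},
    {"s_resi": 30, "t_resi": 41, "interaction": "VDW"},
]

def labels_per_pair(rows):
    """Collect the set of interaction labels observed for each residue pair."""
    pairs = defaultdict(set)
    for r in rows:
        pairs[(r["s_resi"], r["t_resi"])].add(r["interaction"])
    return dict(pairs)

def binary_target(pair_labels, positive_class):
    """One-vs-rest target: a pair is positive if it carries the class at all,
    so an H-bond pair that is also van der Waals is never counted as a negative."""
    return {pair: int(positive_class in labels)
            for pair, labels in pair_labels.items()}

def single_label(pair_labels, priority):
    """Keep one label per pair, choosing the one that comes first in `priority`."""
    return {pair: min(labels, key=priority.index)
            for pair, labels in pair_labels.items()}

pair_labels = labels_per_pair(rows)
hbond_target = binary_target(pair_labels, "HBOND")
vdw_target = binary_target(pair_labels, "VDW")
chosen = single_label(pair_labels, priority=["HBOND", "VDW"])
```

The first option keeps both labels and suits a one-class-versus-the-others model; the second collapses each pair to its most interesting label and suits a single multi-class model.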
01:14:20How was this table generated? You have all this here, and
01:14:25the features. What you have here is this
01:14:31script that, given a PDB file,
01:14:37a configuration file and an output directory,
01:14:41generates the features, meaning
01:14:46all the tables that you saw before, without the class;
01:14:51the class is something that I added
01:14:54afterwards. Essentially what it does is
01:14:57calculate a number of features:
01:15:00[unclear], the Atchley scales lookup;
01:15:09secondary structure from the DSSP software.
01:15:17To use DSSP, which is actually an executable, it is here,
01:15:26[unclear] by Python. [unclear]
01:15:33This is what you see: you need to be able to run the executable to generate it.
01:15:44Half sphere exposure, which I don't remember if I explained; it is
01:15:53a native function of Biopython. It
01:15:57essentially captures whether an
01:16:00amino acid is exposed or not exposed to the solvent,
01:16:02just looking at the contacts.
01:16:06Here we have the values and the result.
01:16:14Here is the secondary structure,
01:16:18just looking at the values. If you remember the Ramachandran plot,
01:16:25you can capture what the secondary structure is just by looking
01:16:28at the phi and psi angles, and you know where the regions are.
01:16:34I define some regions to capture it. The only difference from
01:16:39DSSP is that DSSP gives you the secondary structure in
01:16:44eight classes, while this way the secondary structure is just in
01:16:49three classes: alpha, beta sheet, and coil.
01:16:56And here I calculate the contacts.
01:17:02The configuration is in
01:17:07the configuration file: just information about
01:17:09the Ramachandran parameters, which
01:17:14are the ranges. So you see here
01:17:19I have ranges to define where the
01:17:25beta sheet is; P is for polyproline; class A is
01:17:36alpha helix; L is [unclear]. So it is four classes, sorry.
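A range-based assignment like the one described above could look like the following sketch. The numeric ranges are illustrative placeholders, not the values in the course configuration file, and the region names are assumptions based on the four classes mentioned in the lecture.

```python
# Hypothetical (phi, psi) ranges in degrees; the first matching region wins.
REGIONS = [
    ("beta",        (-180, -90), (90, 180)),
    ("polyproline", (-90, -45),  (100, 180)),
    ("alpha",       (-160, -20), (-120, 50)),
    ("L",           (20, 120),   (-20, 90)),
]

def classify_phi_psi(phi, psi, regions=REGIONS, default="coil"):
    """Assign a Ramachandran region from the backbone torsion angles.
    Terminal residues have an undefined phi or psi and fall back to the default."""
    if phi is None or psi is None:
        return default
    for name, (phi_lo, phi_hi), (psi_lo, psi_hi) in regions:
        if phi_lo <= phi <= phi_hi and psi_lo <= psi <= psi_hi:
            return name
    return default

examples = [(-60, -45), (-120, 130), (-70, 140), (60, 45), (150, -150), (None, 130)]
assigned = [classify_phi_psi(phi, psi) for phi, psi in examples]
```

Keeping the ranges in a separate structure mirrors the idea of reading them from a configuration file, so the regions can be adjusted without touching the code.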
01:17:44Then there are the sequence separation and the distance for contacts.
01:17:53Contacts are found using the neighbor search, so something that you
01:17:57know how to use, and it stores
01:18:01the information in a list that is then converted to a data frame.
01:18:09See, I also added a to-do:
01:18:12add the sequence separation. You can do it, because the sequence
01:18:16separation, knowing how far apart two amino acids are in the sequence, is probably very
01:18:22informative. I think that a distance of three or four amino acids
01:18:28may also be an indicator that you
01:18:33are looking at a contact within an alpha helix.
01:18:39Maybe adding the sequence separation
01:18:43to your feature table is something that could improve the model,
01:18:46although probably your network can learn this implicitly
01:18:51just by looking at the sequence positions. But be careful, because
01:18:57maybe you have sequence positions in
01:19:01two different chains; in that case it does not make sense.
01:19:05[unclear]
01:19:09Sequence positions are important when you
01:19:12are looking at contacts within
01:19:14the same chain; between
01:19:16two different chains the positions are not related.
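A minimal sketch of the sequence separation feature, with the same-chain caveat made explicit. The argument names are hypothetical; returning `None` for pairs on different chains is one possible choice among several, for example a dedicated sentinel value.

```python
def sequence_separation(s_chain, s_resi, t_chain, t_resi):
    """Number of positions between two residues of the same chain.
    For residues on different chains the value is undefined, so None is returned."""
    if s_chain != t_chain:
        return None
    return abs(t_resi - s_resi)

# Hypothetical contacts: (source chain, source position, target chain, target position)
contacts = [("A", 23, "A", 27), ("A", 23, "B", 27), ("B", 110, "B", 12)]
separations = [sequence_separation(*c) for c in contacts]
```

A separation of three or four within one chain is the pattern expected inside a helix, while the same numbers on two different chains carry no such meaning.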
01:19:22The other thing that I added is
01:19:28the 3Di alphabet,
01:19:33so the encoding of the sequence in
01:19:39this alphabet, and this is the script that does
01:19:44that. In this case the code for
01:19:50it is taken from the [unclear] repository,
01:19:54so it may be a little bit [unclear]. There is
01:19:58an encoder architecture, and
01:20:02another couple of functions that are used.
01:20:08You see, you have to import this file, which is provided here.
01:20:19You will need to install torch and [unclear].
01:20:27This is [unclear].
01:20:38Not here.
01:20:40This is what we know.
01:20:43If you need this one, you can download it from
01:20:46the repository; I have to get it at this time.
01:21:04It's not important anyway,
01:21:08or maybe the problem is because this is wrong.
01:21:20Something went wrong when I downloaded it.
01:21:27It is this one.
01:21:30In this case, check.
01:21:43OK. The arguments are a PDB file and an output directory, the same
01:21:51as before [unclear]
01:21:53configuration file, so it simply takes these [unclear] files.
01:21:59It collects [unclear], and then the rest, the largest part of the
01:22:04time, is the execution and
01:22:08application of the model to transform the sequence
01:22:13into the 3Di alphabet, and it stores a table using pandas. So this is the way
01:22:22[unclear].
01:22:26These are the features that are captured
01:22:30in these files. Here you have the
01:22:353Di alphabet, so the state and the letter;
01:22:40they are the same, they map one to one, so you can
01:22:45use either one or the other. Probably for you it is better to use
01:22:47the state, because a number is easier to use.
01:22:51And the other thing that I shared with you is this
01:22:58simple classifier, just to give you
01:23:04an example of how you can use the features. In
01:23:08this case I have just applied
01:23:12naive Bayes, [unclear] that you want to use it.
01:23:17You see, I am just reading
01:23:24the features from this folder, just to show you how to create the dataset.
01:23:33I fill in the class, because there are some contacts that
01:23:43are not classified, so they don't have a class; I have created a
01:23:47fake class, "missing", and this is the result.
01:23:53I drop all the not-a-number values,
01:23:57or the empty cells, or all those with an [unclear].
01:24:03Then I change, I cast, some of the columns, for
01:24:11instance interaction as a category. Here you have naive Bayes,
01:24:16because it would not be able to use categorical values.
01:24:22I consider just some of the features as training features.
01:24:31I convert everything into
01:24:36categories, so I calculate percentiles using this function.
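Turning a continuous feature into categories by percentile can be sketched with the standard library alone. This is not the function used in the lecture script; the number of bins and the example values are assumptions.

```python
import bisect
import statistics

def percentile_bins(values, n_bins=4):
    """Cut a continuous feature into n_bins categories of roughly equal size.
    Returns the bin index of every value and the cut points that were used."""
    cuts = statistics.quantiles(values, n=n_bins, method="inclusive")
    bins = [bisect.bisect_right(cuts, v) for v in values]
    return bins, cuts

# Hypothetical contact distances in angstrom.
distances = [2.8, 3.1, 3.5, 4.0, 4.4, 5.0, 6.2, 7.9]
bins, cuts = percentile_bins(distances)
```

In a real pipeline the cut points would be computed on the training split only and then reused unchanged on the test split, so that no information leaks from the test data.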
01:24:43And then I split the dataset: I create
01:24:49a training and test split.
01:25:01And I made the model, so I fit X and Y on the training
01:25:12dataset, and then I
01:25:17predict on the test features and compute the confusion matrix, which is
01:25:22the result that you should,
01:25:25at least as a preliminary step, generate to evaluate
01:25:29what you are doing. You have a comparison between
01:25:33the true label and the predicted label, so you see how many of
01:25:39the labels were predicted correctly. For instance,
01:25:44you see that some of the H-bonds
01:25:54were tagged as van der Waals.
01:25:57That can be correct, because in the training dataset
01:26:01many of them overlap. You see
01:26:05the number that is actually missing, and these are the interesting cases.
01:26:11We are not actually sure whether this is
01:26:14a problem of [unclear] that is not
01:26:17accurate. And you see the number
01:26:21of H-bonds that were predicted, and
01:26:27so on. So on one hand you should
01:26:32work on the model, and on the other you should also work
01:26:36on the dataset, by
01:26:40filtering and reasoning on the data [unclear],
01:26:46on what could be the best way
01:26:48to present the data to your training procedure.
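The comparison between true and predicted labels can be reproduced with a few lines. This sketch builds the confusion matrix by hand; the label strings and the toy predictions are hypothetical, and a library implementation would normally be used instead.

```python
from collections import Counter

def confusion_matrix(true_labels, predicted_labels, classes):
    """Rows are true classes, columns are predicted classes."""
    counts = Counter(zip(true_labels, predicted_labels))
    return [[counts[(t, p)] for p in classes] for t in classes]

def per_class_recall(matrix, classes):
    """Fraction of each true class that was predicted correctly (the diagonal)."""
    recall = {}
    for i, c in enumerate(classes):
        total = sum(matrix[i])
        recall[c] = matrix[i][i] / total if total else 0.0
    return recall

classes = ["HBOND", "VDW", "PIPISTACK"]
y_true = ["HBOND", "HBOND", "VDW", "VDW", "VDW", "PIPISTACK"]
y_pred = ["HBOND", "VDW", "VDW", "VDW", "HBOND", "VDW"]
matrix = confusion_matrix(y_true, y_pred, classes)
recall = per_class_recall(matrix, classes)
```

In this toy case the minority class is never predicted, so its row has a zero on the diagonal; that is the pattern to look for when a model ignores the rare interaction types.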
01:26:57In this case I am trying
01:27:02another naive Bayes model, the multinomial one, but you see
01:27:07the results are not very different. You also
01:27:12notice in this representation, you see, for instance, that in
01:27:16this case the minor classes are
01:27:21never predicted, so this is a limitation of
01:27:23this model. Using the multinomial one, at least the classes are predicted.
01:27:30So essentially the ideal is to have
01:27:33numbers only on the diagonal and zero everywhere else.
01:27:39What you see is that in this case, for instance, the pi-pi interactions
01:27:45are predicted; you see the number of pi-pi predicted.
01:27:51[unclear] but you also have a lot of false positives down here.
01:27:59Also in this case be careful,
01:28:01because it may be that your false positives are
01:28:05just a limitation of the dataset. Of course these
01:28:08are difficult things to demonstrate, because
01:28:12you are used to having a dataset and
01:28:16believing that your training dataset is correct, and
01:28:22this is not actually the case. So you can comment on cases where
01:28:27you have false positives, and for instance on the false negatives.
01:28:34[unclear]
01:28:40Here are more variants where you can see
01:28:46the different results. So one of my action
01:28:53items is to think about
01:28:58this dependency that was here. If you want to use this
01:29:05information, it is not a big deal at the moment, because if you are
01:29:10only using the training data that I have given to you it is not a problem,
01:29:15but for the final project the
01:29:19request is for you to provide inference, a piece of
01:29:24software that also generates the features. If you want to
01:29:27use the 3Di features, your software
01:29:31must also be able to execute
01:29:34this piece of code and generate
01:29:38the features. Using as
01:29:41input also the embeddings generated by the language model,
01:29:47I think, is a little bit too much, because you would have
01:29:50a vector of one thousand twenty-four elements for
01:29:54every [unclear] line of your training [unclear].
01:30:03The 3Di encoding should be [unclear], no?
01:30:09Do you have questions?
01:30:14Not yet? Yes.
01:30:16I am here for [unclear].
01:30:23[unclear] structure, and then we classify: you extract
01:30:31the features, including the contacts, and you
01:30:35classify the contacts. That is the question?
01:30:41Training on the features you already gave us?
01:30:45Yeah, yeah. In principle you can start working on
01:30:51the features you have and build the model, and in a
01:30:56second moment think about the feature calculation,
01:31:00most of which is provided. Actually everything is provided
01:31:04except this thing, so I need to check where it is.
01:31:10[unclear]
01:31:15In this archive.
01:31:24No, here.
01:31:31Okay. I will provide that somewhere.
01:31:40All right, other questions?
01:31:46Okay. I will make all the recordings available starting from today.