Nascent

Jeff Jonas Web Seminar at Nature

On Friday the 4th of April Jeff Jonas came in to give the latest installment of our Tech Talks. Jeff is the chief scientist for IBM’s Entity Analytics, but that is just one data point in what, over the course of his talk, became apparent to be a very rich context.

He managed to jam in about 90 slides in 45 minutes, so I’m mostly going to paraphrase what he was saying in his presentation, as it went by so quickly.

As this is quite a long blog post I’ll save you the trouble of reading it by giving away the ending right now: the main theme that Jeff talked about was data. Lots of data, almost staggeringly huge volumes of data, and how to deal with it all. The answer is to construct a system in which each of the nodes (or sensors) reporting information provides that information in a format that can be stitched together in a contextually aware way. By stepping away from extracting a signal from a single datum, and instead building a way to look at the context in which that datum lives, you can solve interesting problems. That’s kind of the big picture.

At the end of his talk he also entertained us with some of his thoughts on diverse topics, from the total surveillance state to how safe the world really is. The longer write-up is below the break.


The longer version begins now. Jeff started out in the IT industry by looking at questions relating to identifying people who were trying to hide their identities. This was initially work for credit and collection agencies, who often deal with people falsifying records for lots of different reasons. Though he didn’t say so explicitly, it was probably solving these kinds of problems that led to his very interesting take on dealing with data. This work soon found interested clients in the casinos of Las Vegas.

Las Vegas is a place where there is quite an incentive to cheat, in that a correctly worked scam can net a large profit for the perpetrators in a very short space of time. He showed a video of a table where one of the gamblers swapped his own deck for the dealer’s deck. These cards had been ordered in such a way that everyone at the table knew the ordering of the cards, enabling them to play a deterministic game and cheat the casino. It would probably be prudent for these kinds of people to try to hide their identities, and also prudent for the casino to try to recognise such people when they turn up at its doorstep.

He detailed the lengths that some criminal organisations go to in order to introduce people who will cheat to the casinos in such a way that the casino will have no prior information about these people. Quite a way is the answer.

On the other hand there is a lot of data out there, and if you could solve the puzzle by stitching the data together then you might be able to stay one step ahead and intercept these scams before or while they happen, instead of finding out a lot later.

He devised a system called NORA (non-obvious relationship awareness) for these clients to tackle just this problem, then in 1998 he was asked to speak about this work at a public NSA-hosted conference. It was after this that his company was acquired by IBM.

He described a conversation with a counterterrorism intelligence analyst where he asked her what she could wish for. She said that she wished she could get answers faster, to which Jeff replied, ‘what are the chances that you can ask every smart question every day?’. The point here is that sometimes a question that is asked today needs to wait until some event happens in the future before it can have a meaningful or relevant answer. You probably can’t ask that question every day, but if there were a way to put that question into storage and allow it to become active when the data that is required to give it relevance shows up, then this would be a useful way of dealing with the question. In fact what you are doing is treating the question like data. One of Jeff’s key points is that you have to allow the data to find the data, and the relevance must find the user. OK, so that might sound a bit cryptic, but the basic idea behind it is pretty straightforward.

The current situation is that organisations have huge piles of data, but that data tends to reside in separate relational silos, and these silos don’t talk to one another. Moreover a query against one of these silos tends to only match exact terms, so if you are searching for ‘Bill’ and you have perhaps millions of names in your data set you are not going to get results for anyone called ‘William’, or ‘Bil’ or ‘Billy’, even though as humans we know that semantically these forms are all related. Aside from the lack of semantic reconciliation that exists within one data set, different data sets rarely connect to one another. For instance a database containing fraud investigations is rarely connected to one’s own employee database. Jeff described this by saying that this data is isolated and, as a result, our perceptions of the data are isolated.
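As a rough illustration of the exact-match problem, here is a minimal Python sketch (my own, with a made-up nickname table, not anything Jeff showed) of how a literal lookup misses the variants while a normalising lookup catches them:

```python
# Minimal sketch of why exact matching misses name variants.
# The nickname table and names here are invented for illustration.

NICKNAMES = {
    "bill": "william",
    "billy": "william",
    "bil": "william",   # even a common misspelling can be mapped explicitly
    "will": "william",
}

def normalise(name: str) -> str:
    """Reduce a first name to a canonical form before comparing."""
    key = name.strip().lower()
    return NICKNAMES.get(key, key)

records = ["William", "Billy", "Bil", "Wilma"]

# Exact match: only a literal 'Bill' would ever be found.
print([r for r in records if r == "Bill"])                         # []

# Semantic match: anything that normalises to 'william' is found.
print([r for r in records if normalise(r) == normalise("Bill")])   # ['William', 'Billy', 'Bil']
```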

In order to solve a puzzle, such as determining connections between fraud rings and insiders, you need to give a context to your data so that you can begin to infer relationships within the data. He described this as creating persistent context.

OK, so what is the pseudo-algorithm for dealing with all of this messiness?

Let’s take as an example a record for renting an apartment. This might contain a name, address, date of birth, and perhaps a phone number. Let’s say that the name is ‘Bill Weather’. Now let’s take a record in the apartment’s eviction database. Again this may contain a name, date of birth, and address. Perhaps the name in this case is ‘William Weather’, but the address is the same as in the first case. Tying these together tells us that any time we encounter ‘William Weather’ or ‘Billy Weather’ at this address in the future, we are probably dealing with the same person, through the glue of the same address. In order to create a data engine that can do this kind of matching you have to extract key features (names, addresses, phones, etc.) from all of your sources into one store where semantic reconciliation is attempted on each new record in real time. These key features generally represent ‘who’, ‘what’, ‘where’ and ‘when’ as available on each individual observation (i.e., the atomic level of data).
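To make that pseudo-algorithm a little more concrete, here is a minimal Python sketch of the matching step, using invented records and a deliberately crude name normalisation. It is an illustration of the idea, not Jeff’s NORA implementation:

```python
from collections import defaultdict

# Hypothetical records from two separate silos; the values are invented for illustration.
rental = {"source": "rental", "name": "Bill Weather", "dob": "1970-02-01",
          "address": "12 Example St", "phone": "555-0100"}
eviction = {"source": "eviction", "name": "William Weather", "dob": "1970-02-01",
            "address": "12 Example St"}

def key_features(record):
    """Extract normalised 'who' and 'where' features used for matching."""
    name = record["name"].lower().replace("bill", "william")   # crude name reconciliation
    address = record["address"].lower().strip()
    return {("name", name), ("address", address)}

# Index from each feature to the records that mention it: the single store in which
# every new observation is reconciled against what is already known, on arrival.
index = defaultdict(list)

def ingest(record):
    """Add a record and return any previously seen records that share a feature with it."""
    matches = []
    for feature in key_features(record):
        for other in index[feature]:
            if other not in matches:
                matches.append(other)
        index[feature].append(record)
    return matches

ingest(rental)             # nothing known yet, so no matches
print(ingest(eviction))    # the rental record comes back, glued by name and address
```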

You accumulate and store. You can also do the same for questions. A question might be structured as ‘did this person buy anything in our store?’. If you have no record of that person right now, instead of throwing away the question, add it to the data store in its atomic form, and if any record for that person comes along you already have some indication that there is something interesting about this person. In effect you are getting rid of the distinction between questions and data, and replacing them with relationships, or contexts, about entities. After all, questions also usually concern people, places, times or events. You won’t know whether a piece of data is important until someone asks, but by melding everything together you build up persistent contexts.
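Here is a similarly hypothetical sketch of treating a question as data: the unanswered question is stored against the entity it concerns and fires when a matching record later arrives (again my own illustration, not Jeff’s system):

```python
# Hypothetical sketch: a question that cannot be answered today is stored in its
# atomic form and fires when the data that makes it relevant finally arrives.

pending_questions = {}   # person -> list of stored questions about that person
purchases = set()        # the data store: people we have a purchase record for

def ask(person, question):
    """Answer now if we can; otherwise persist the question as data."""
    if person in purchases:
        return f"{question}? yes"
    pending_questions.setdefault(person, []).append(question)
    return f"{question}? unknown for now, question stored"

def ingest_purchase(person):
    """New data arrives; any stored question about this person becomes relevant."""
    purchases.add(person)
    for question in pending_questions.pop(person, []):
        print(f"ALERT: new data answers a stored question -> {question}")

print(ask("william weather", "did this person buy anything in our store"))
ingest_purchase("william weather")   # triggers the alert for the stored question
```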

Jeff contrasted this method with one of mining huge amounts of latent data, which he described as trying to boil the ocean. The flip side of Jeff’s approach is that not only do you treat queries as data, but data also become queries, as a new datum can tie together pieces in your puzzle and trigger a reconciliation of information which leads to insight about the data set. The trick is to contextualise each datum as it arrives. This is the most efficient way to deal with data. By processing upon receipt you can begin to scale with new information. You don’t have to go back periodically and try to mine through the ocean. This reminded me of a comment about fixing bugs in computer code: the most efficient place to fix a bug is just after you have typed it. Any delay past this point adds to the cost of fixing the bug.

He cited the example of a US federal agency which probably has multiple zettabytes of data lying around the place. There are not enough computers on Earth to sift through all of this data by brute force, and this is a problem that is getting worse as our capacity as a species to leave digital trails increases.

Another emergent aspect of this system is that queries find queries, giving a deeper picture of the information that you are dealing with. What you are doing is constructing context in an ongoing way, and when that context reaches a relevance threshold you can publish insight.

Jeff said that bad data was good for solving problems like this because it helps to spread out the interaction of pieces of the puzzle. He said that if you polish all of your data you end up losing essential features of the data. So how do you treat ambiguities and false positives in the data? The answer seems to be to throw more data at the problem, and in the process of reconciling bits together you get rid of the ambiguities. He also said that orthogonal data sets were very important for gluing disparate data together.

He gave an example of where he was asked to find invented identities in a population. The total number of individuals was known. Data from a variety of sources was ingested into the system, and each time a new name, or set of information regarding a person, was encountered, a potential identity was created in the system. As data was poured in, the number of possible people in the population at first grew to a multiple of the actual figure, before data reconciliation kicked in and the potential number of individuals rapidly dropped down towards the real figure, with identifications of false identities popping out on the way down.

Now, of course, these tools work where there is a lot of information about people, and indeed Jeff was talking about use cases in situations where the population of an entire country was being queried, which obviously raises questions about data, privacy and surveillance. As he was talking I was wondering about the issue of false positives, and the example of passenger no-fly lists came to my mind, but then I realised that the TSA no-fly lists (http://www.cbsnews.com/stories/2006/10/05/60minutes/main2066624.shtml) are a perfect example of not using the techniques that Jeff was describing for reconciling knowledge about people based on multiple silos of data.

In fact Jeff addressed questions of privacy directly in two ways. In the first case he asked how you go about reconciling information in two data silos when you might not want to share all the information between both of the stores of data. He said that this is a big problem for government agencies, where there are very strict regulations about data sharing between departments, so that one group looking at terrorist activity may not be able to access the database of a group looking at drug smuggling. To overcome this Jeff suggested that it is possible to share one-way hashed representations of portions of the data between data stores. The hashed representation is a unique, non-reversible representation of a piece of information. If you give me one of these hashes there is nothing that I can do to reverse the transformation and extract the original information, but if you tell me the algorithm you used to create the one-way hash, I can apply that algorithm to my data and see if any of my data produces the same hash that you gave me. If it does then I know that you have some information about something that I also have information about, and it might be worth looking at cooperating on this particular item.
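As an illustration of that hashing idea, here is a minimal sketch assuming both parties have agreed out of band on a normalisation step and a salted SHA-256 recipe; real schemes need far more care (small keyspaces can still be guessed), so treat this purely as a toy:

```python
import hashlib

# Both parties must agree on the normalisation and hashing recipe in advance;
# the shared salt is a hypothetical detail to hinder simple precomputed lookups.
SHARED_SALT = b"agreed-out-of-band"

def one_way_hash(value: str) -> str:
    """Normalise a value and return a non-reversible digest of it."""
    normalised = value.strip().lower()
    return hashlib.sha256(SHARED_SALT + normalised.encode("utf-8")).hexdigest()

# Agency A publishes only the hashes of the identifiers it holds.
agency_a_hashes = {one_way_hash(name) for name in ["William Weather", "Ada Lovelace"]}

# Agency B hashes its own data with the same recipe and looks for overlaps.
agency_b_records = ["Grace Hopper", "william weather"]
overlap = [r for r in agency_b_records if one_way_hash(r) in agency_a_hashes]

print(overlap)   # ['william weather'] -- a hit worth a follow-up conversation
```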

He stressed that this technique does not do away with proper policy controls on how you manage access to your data, but rather what it can do is simplify the conversation you might want to have about looking into sharing information between different agents.

That was pretty much most of what Jeff talked about during his main talk; then he started telling us about some things he had been thinking about. One of these, unsurprisingly, is the emergence of the total surveillance society. His opinion is that this is both irresistible and inevitable, and that it is being driven by consumer forces. The fact that people love GPS on their phones is just the start of what will be a series of technologies that will make total surveillance a reality. His other example was the idea of a pair of glasses with an embedded RFID chip. The convenience of not being able to lose such items, owing to their constant traceability, will make people buy them simply due to our tendency to try to optimise our lives, and a consequence of this pursuit will in effect be the creation of a surveillance infrastructure.

Virtual worlds also came up, and Jeff said that he expects a significant amount of time to be spent in virtual worlds, and that we will be drawn into them through our need to conduct business there. He said that at the moment he finds them kind of boring, but the indicators that they will be important are that they provide an immersive experience, that the 100 dollar laptop is becoming a reality, and that for billions of people the experience of life represented in virtual worlds is significantly more appealing than the circumstances they face in their day-to-day lives. If you have a business model where the very poor can gain access to a virtual world for a micropayment of a few cents a month, and you add this to a potential market of a couple of billion people, then you have a viable, indeed a compelling, business.

The last thing that Jeff talked about was how safe the world is. He pointed out two opposing trends here. The first is that at the moment the world is safer than it has ever been before: the current average life expectancy worldwide is 67, which is higher than at any point in the history of the world. He compared mortality rates from the Black Death in the 14th century to the kinds of threats that tend to be broadcast across the media today. The Black Death killed 17% of the population of the world. Jeff pointed out that even if you took the US and Europe and dropped them into the sea you would only manage to get rid of 5.5% of the population of the Earth, and so the kinds of threats from terrorist activity that we worry over today are incredibly minor in historical perspective, and the reality is that there has been no better time to be alive.

In contrast, the cost of manufacturing tools for killing lots of people has been dropping as our technology advances. The cost of the first nuclear weapon was a significant percentage of US GDP, but now we can manufacture potentially lethal virus strains that could be more damaging, and at a fraction of the cost. He called this section ‘more death, faster, cheaper’.

All in all, Jeff raised a lot of thinking points. At the end we had a chance to ask him a few questions. He said that the kind of work he is involved with is not just applicable to casinos and to government agencies, but to all manner of businesses. When asked how we might apply these ideas at Nature he advised that we ask ourselves what we do and what we are good at, and then try to map these things onto the kinds of atomic questions that can be used for gluing lots of data together.

These ideas also seem to have some applicability in science, for instance using the one-way hash idea to see if different labs are working on the same genes or chemicals without saying directly what those specific entities are; however, it’s not clear whether science could deal with the costs involved in doing this.

Jeff writes prolifically about his ideas on his blog, and he suggested a number of his posts as further reading on the topics that he touched on in his talk.
