Editor's Choice
January 30, 10:37

[Translation] The Difference Between Statistics and Data Science

Hello, dear readers. Once again we would like to consult you about the relevance of a new O'Reilly release. This time the subject is statistics for data science. The original runs to 250 pages and is due out on February 25. The book covers concise case studies with a small number of charts and examples in R. To make thinking it over and voting more interesting, below the cut you will find an article whose author tried to capture and describe the difference between statistics and data science.

Editor's Choice
January 29, 15:04

Random Forest: A Walk Through the Winter Forest

1. Introduction. This is a short practical guide to applying machine learning algorithms. There are, of course, a great many machine learning algorithms and methods of mathematical (statistical) data analysis, but this note is devoted specifically to Random Forest. It shows examples of using the algorithm for classification and regression tasks and gives some theoretical background.

Editor's Choice
January 26, 14:52

Why We Need Even More Data Centers: The Present and Future of Big Data Analytics

Why store so much data in the ever-growing number of data centers being built? One application of big data is predictive analytics. It answers questions such as: what do these numbers say about us, where is analytics used today, and what will happen in three years? Forecasting is the foundation of optimization. The amount of data is growing at a rate that is impossible for a person to imagine. Data is nothing without analysis: just an unimaginable quantity of information encoded in ones and zeros. Why are new data centers being built? What is stored and processed in their depths, and why? We have all heard about contextual advertising, which is shown based on our preferences, which search engines learn from our online activity. But few people talk to the general public about the other areas. Yet besides letting advertisers and banks earn incredible money, big data combined with predictive analytics helps save human lives.

Editor's Choice
January 26, 13:21

[From the Sandbox] An Overview of KNIME Analytics Platform, an Open Source System for Data Analysis

About KNIME. This is an overview of KNIME Analytics Platform, an open source framework for data analysis. The framework supports the full data analysis cycle: reading data from various sources, transformation and filtering, the analysis itself, visualization, and export. KNIME (an Eclipse-based desktop application) can be downloaded from www.knime.org. Who may find this platform interesting: those who want to analyze data; those who want to analyze data but lack programming skills; and those who want to dig into a decent library of implemented algorithms and perhaps learn something new.

Editor's Choice
January 26, 09:24

Development in R: The Secrets of Loops

Less than a week ago, Xakep magazine published the author's version of an article on the finer points of using loops when developing in R. By agreement with Xakep, we are sharing the full version of the first article. You will learn how to write loops correctly when processing large volumes of data.

Editor's Choice
January 26, 00:25

This Data Mining Startup Lets Consumers Own Their Digital Footprint

I spoke with Digi.me Founder and CEO Julian Ranger about the vision behind his business, disrupting the data industry, and his plans to help online users capitalize on their digital activity.

Editor's Choice
January 24, 17:50

A Review of the Job Market in Big Data and Data Science

Hi, Habr! Relevant search queries turned up about 1,000 vacancies, which were then manually filtered by title and description; for this review we used 288 active big data and data science vacancies from HeadHunter. In reality there are more active vacancies, since other resources (for example, SuperJob, Blastim, social networks, and company websites) were not taken into account. Keep in mind, too, that this is only a snapshot of the current situation: every day vacancies are filled and new ones appear.

January 19, 16:10

Oracle (ORCL) Faces Lawsuit on Labor Discrimination Charges

Oracle Corporation (ORCL) is facing allegations over its hiring and rewarding practices.

Editor's Choice
January 18, 16:10

jl-sql: Working with JSON Logs on the Command Line Using SQL

Nobody is interested in introductions, so I'll start straight away with usage examples.

    % cat log.json
    {"type": "hit", "client": {"ip": "127.1.2.3"}}
    {"type": "hit", "client": {"ip": "127.2.3.4"}}
    {"type": "hit", "client": {"ip": "127.3.4.5"}}
    {"type": "hit", "client": {"ip": "127.3.4.5"}}
    {"type": "hit", "client": {"ip": "127.1.2.3"}}
    {"type": "click", "client": {"ip": "127.1.2.3"}}
    {"type": "click", "client": {"ip": "127.2.3.4"}}

Run the query:

    % cat log.json | jl-sql 'SELECT client.ip, COUNT(*) AS count WHERE type = "hit" GROUP BY client.ip'
    {"client":{"ip":"127.1.2.3"},"count":2}
    {"client":{"ip":"127.2.3.4"},"count":1}
    {"client":{"ip":"127.3.4.5"},"count":2}
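For readers who want to see what the query above computes, the same filter-and-group step can be approximated in plain Python. This is only a sketch under stated assumptions: the function name `count_hits_by_ip` and the inlined `LOG_LINES` sample are hypothetical, introduced here for illustration, and the code says nothing about how jl-sql itself is implemented.

```python
import json
from collections import Counter

# Sample log, one JSON object per line (copied from the teaser's log.json).
LOG_LINES = """\
{"type": "hit", "client": {"ip": "127.1.2.3"}}
{"type": "hit", "client": {"ip": "127.2.3.4"}}
{"type": "hit", "client": {"ip": "127.3.4.5"}}
{"type": "hit", "client": {"ip": "127.3.4.5"}}
{"type": "hit", "client": {"ip": "127.1.2.3"}}
{"type": "click", "client": {"ip": "127.1.2.3"}}
{"type": "click", "client": {"ip": "127.2.3.4"}}
"""


def count_hits_by_ip(lines):
    """Hypothetical helper: count records with type == "hit", grouped by client.ip."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        if record.get("type") == "hit":            # WHERE type = "hit"
            counts[record["client"]["ip"]] += 1    # GROUP BY client.ip, COUNT(*)
    return counts


hits = count_hits_by_ip(LOG_LINES.splitlines())

# Emit one compact JSON object per group, sorted by IP for stable output.
for ip, count in sorted(hits.items()):
    print(json.dumps({"client": {"ip": ip}, "count": count}, separators=(",", ":")))
```

A `Counter` keeps the grouping step to a single line; for logs too large to hold in memory the same loop could read from `sys.stdin` line by line instead of a string.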

Editor's Choice
January 17, 16:59

[From the Sandbox] Deep Reinforcement Learning of a Virtual Manager in a Game Against Inefficiency

The successes of Google DeepMind are now widely known and discussed. DQN (Deep Q-Network) algorithms beat humans by a comfortable margin in a growing number of games. The achievements of recent years are impressive: in literally tens of minutes of training, the algorithms learn to beat a human at Pong and other Atari games. Recently they moved into the third dimension: they beat humans at DOOM in real time and also learn to drive cars and fly helicopters. Reinforcement learning was also used to train AlphaGo, which played thousands of games against itself. Back in 2015, before this was fashionable, Phobos management, in the person of Alexey Spassky, anticipated the trend and commissioned the Research & Development department to conduct a study. The task was to review existing machine learning technologies to see whether they could be used to automate winning at management games. This article therefore discusses the design of a self-learning algorithm for a game in which a virtual manager plays against a live team to raise productivity.

January 14, 21:02

Debunking Loretta Lynch's One-Sided 'Chicago Cops Are Racist Villains' Statistics

Is it fake news when, on MLK weekend, Loretta Lynch issues a scorching 164-page report blasting Chicago police for using force on Blacks 10 times more often than Whites... but nowhere mentions that Blacks are murdered 15 times more often than Whites, that Blacks are the murderers 20 times more often than Whites, or that police are 30 times more likely to be killed by a Black than by a White? Seems like a shot in the face at Jeff Sessions and a gift to BLM and civil rights leaders, in the final hour...

Statistical Ideas blog's Salil Mehta exposes the one-sided statistics outgoing AG Loretta Lynch used to vilify Chicago cops...

Outgoing Attorney General for the Department of Justice, Loretta Lynch, has distributed an environmentally-friendly, 164-page report that finds (after a year-long investigation) that my hometown Chicago Police Department "engages in a pattern or practice of using force, including deadly force, in violation of the Fourth Amendment of the Constitution." The long-winded defense glosses over critical statistics that any reader would need in order to truly understand what is at the nucleus of its most scorching claim against the Chicago Police: that, in addition to economic hardships for minorities (and is that 100% essential?), police use excessive force 10 times more often against Blacks than they do against Whites. And that's a deceptive headline shocker which, combined with selective data mining, simply states what they feel is obvious to their constituents across the country. The leaders of the Black Lives Matter movement joined other civil rights groups in responding in lockstep to the findings, declaring "we are not going to take it any more." What does that mean? Are the big concerns in life that (Chicago) police are simply villains? Life would be coziest if they used kid gloves? And why were these groups righteously hushed in recent months, as Blacks were multiple times caught on video group-assaulting Whites who "may have voted for Trump"?
Look at the top of the chart below, which shows this 10x rate for Blacks versus Whites when it comes to use of excessive police force. Now we'll discuss some other self-computed statistics that should be considered, and which were not provided as meaningful background.

“more often”
This naughty expression incorrectly implies that each Black criminal is subject to 10 times as much excessive police force as each White criminal. But that's a false trap. The actual statistics, for the few who cared to look at them, show simply that the overall population of Blacks saw 10 times more excessive use of force than the overall population of Whites.

Blacks are a small minority
It is true that in the United States there are nearly five times as many Whites as there are Blacks (put differently, there are 0.2 Blacks per White). So it would be tempting to incorrectly deduce that while there are 0.2 Blacks per White, there are nearly 10 times as many Blacks experiencing excessive force as Whites. Recall we are discussing the U.S. here. Instead we must drill down to just the Windy City, and there we see an equal number of Blacks and Whites. Blacks are not a small minority at all. And let's hold on to this 1:1 statistic as we go through some other relevant background information below.

who is getting murdered
Even though there is one Black per White in Chicago, there are nearly 15 murdered Blacks per murdered White. It's not the Chicago Police Department killing these Blacks, and certainly not White civilians. We should note that in 2016 Chicago Mayor Emanuel, former Obama Chief of Staff, oversaw the largest number of homicides in Chicago in the past two decades! These murders are one of the primary reasons that Chicago is no longer among the top 100 places to live, and it sees great emigration of its citizens to other parts of the U.S., including people who want to be police officers.
And with strained municipal and state budgets, and fearing for their own lives, the Chicago Police Department has to bring to justice whoever is killing all of these Blacks (happening at a rate of nearly a dozen weekly). It’s unfortunately a treasured, yet sometimes unappreciated job.

who is doing the murdering
Without surprise, but part of the probability data, Blacks are committing these record murders across Chicago (even under a Democratic leader and with their unemployment rate falling to cyclical lows). Butchering at 20 times the rate of Whites! So while the population is equally split between Blacks and Whites, the Department of Justice (DOJ) report chose to keep from us the fact that Blacks account for 20 times as many of the city's slayings (instead highlighting only the policy focus that Blacks experience 10 times as much use of excessive force). As appalling as these levels of excessive use of force against Blacks are, we also need to appreciate that the Chicago Police (as a demographic segment) are >30 times more likely to be killed by a Black than any other demographic segment is by any other (in the chart below, see these drivers of the current 5-year national record slaughtering of police). This should be part of the report, and it is not a random result but follows from the most statistically significant predictive factors, such as: the nature of the suspect in relation to others at the scene, the age difference between the suspect and police, and how this same suspect interacts with police when approached.

Now given all of this context above, wouldn't you conclude that there should be some balanced compassion for the Chicago Police? Even if not, understand that police have their supporters who don’t feel Whites deserve to experience excessive force in equal number as Blacks, even if they see less violent crime.
This absurd measure of executing violent criminal justice is unhinged, even though we have the same number of brash disparagers against the police, anyway. Though we should equally note that nothing in here proves that the Chicago Police have executed their public duties in a racially fair way. Minorities deserve to feel more at ease in their own homes and communities. It's just that showcasing one-sided statistics, as the DOJ did here, is clearly more confrontational and less likely to be embraced.

All of this mortality science has been brought up earlier (here, here) and in a couple of high-profile, peer-reviewed academic research articles on race within the police ranks (one of which I was the journal editor for). Though it is worth noting here every time we get untrue violent-justice statistics put forward. This final-hour parting gesture by Attorney General Lynch falls squarely into that category, and it is one that her successor (perhaps President-elect Trump's nominee Jeff Sessions) is entitled to handle differently.

Mortality probability math is very tough, as shown in our calm debunking of an Oxford University research paper, which they then immediately and mortifyingly amended with errata. Though the teachings are often the same and are worth reminding ourselves of every time the occasion arises.

January 14, 15:08

Book Bits | 14 January 2017

● A Good Disruption: Redefining Growth in the Twenty-First Century By Martin Stuchtey, et al. Summary via publisher (Bloomsbury) Disruptive technology is one of the defining economic trends of our age, transforming one major industry after another. But what is the true impact of such disruption on the world’s economies, and does it really have […]

January 12, 15:21

Peter Thiel: I won't take a job with Trump's administration

Peter Thiel, the billionaire venture capitalist who donated more than a million dollars to President-elect Donald Trump’s campaign and serves on his transition team, would not accept a job in the incoming administration if offered one, he told the New York Times.

“Confirm,” Thiel said in an interview published Wednesday when asked if it’s true that there is no administration job that could lure him to Washington. “I want to stay involved in Silicon Valley and help Mr. Trump as I can without a full-time position.”

While the PayPal co-founder has said he would not join the White House in an official capacity, he has played an active role in Trump’s transition process, most notably arranging a Trump Tower meeting last month between the president-elect and some of the nation’s most prominent tech executives, including Apple’s Tim Cook and Facebook’s Sheryl Sandberg. Amazon CEO Jeff Bezos and Elon Musk, the head of SpaceX and Tesla, both attended as well despite their criticism of Trump during the campaign.

Thiel also spoke at the Republican National Convention last July, discussing during his remarks that he is gay, a first for the GOP event. The billionaire told the Times that he drew more criticism from the liberal gay community than from Christians for speaking at the RNC and that overall, “Trump is very good on gay rights.” The president-elect is unlikely to reverse any of the progress the gay community has made in recent years, Thiel said, adding that he would “obviously be concerned if I thought otherwise.”

Throughout his campaign, but especially during the Republican primary, Trump campaigned hard on his promise to temporarily ban Muslims from entering the U.S., a position he has since backed away from. Executives from multiple tech companies have preemptively said that they would not assist the Trump administration in the creation of any Muslim registry system.
Asked if the data-mining company he helped found would be willing to help the Trump administration build such a registry, Thiel told the Times: “We would not do that.”

Editor's Choice
January 9, 03:44

Where to Start When Introducing Hadoop in a Company

Alexey Eremikhin (alexxz). I want to clear up the confusion so that people understand what Hadoop is, what the products around Hadoop are, and, through examples, what both Hadoop and those products can be used for. That is why the topic is "Where to start when introducing Hadoop in a company?" The talk is structured as follows. I will cover: which tasks I suggest solving with Hadoop in the early stages; what Hadoop is and how it works internally; what exists around it; and how Hadoop is used at Badoo to solve the tasks from the first point.

January 6, 13:30

Why the French Email Law Won’t Restore Work-Life Balance

A new law establishing workers’ “right to disconnect” went into effect in France on January 1 of this year. The law requires companies with more than 50 employees to establish hours when staff should not send or answer emails. In an interview with the BBC, French legislator Benoit Hamon described the law as an answer to the travails of employees who “leave the office, but they do not leave their work. They remain attached by a kind of electronic leash — like a dog.” We all know intuitively that we are more connected to the workplace than ever before. When Bain & Company examined e-communications and other forms of collaboration at two dozen large global companies, we found that the time devoted to email, instant messaging (IM), crowdsourcing, and other online communications is extensive and, unfortunately, on the rise. We used Microsoft Workforce Analytics (formerly VoloMetrix) and other data mining tools to comb through information captured in Microsoft Outlook, Gmail, and similar applications to understand precisely how much time is dedicated to processing e-communications — that is, sending, reading, and responding to email, IM, and other messages. What we found confirmed what many of us have long suspected, namely: Senior executives now receive 200 (or more) emails per day. The average frontline supervisor devotes about eight hours each week — a full business day — to sending, reading and answering e-communications. The level of e-communication has grown every year since 2008 (the year we started examining this data), and much of it now creeps into off-hours and weekends. Of the eight hours managers devote to e-communications each week, we estimate 25% of that time is consumed reading emails that should not have been sent to that particular manager and 25% is spent responding to emails that the manager should never have answered. Stated differently, the average frontline supervisor devotes almost half a day each week to processing unnecessary e-communications. 
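The "almost half a day" figure follows directly from the numbers quoted above, and a few lines of Python make the arithmetic explicit. This is a minimal sketch, not part of the original study: the constant and function names are hypothetical, and the eight-hour business day is taken from the article's own phrase "eight hours each week, a full business day".

```python
# Figures quoted in the article; the names themselves are illustrative only.
HOURS_PER_WEEK = 8.0            # weekly time a frontline supervisor spends on e-communications
SHARE_MISDIRECTED = 0.25        # reading emails that should not have been sent to the manager
SHARE_NEEDLESS_REPLIES = 0.25   # answering emails the manager should never have answered


def unnecessary_hours(total_hours, *shares):
    """Hours lost to unnecessary messages, given the wasted share of each kind."""
    return total_hours * sum(shares)


wasted = unnecessary_hours(HOURS_PER_WEEK, SHARE_MISDIRECTED, SHARE_NEEDLESS_REPLIES)
print(f"{wasted:.1f} hours per week, or {wasted / HOURS_PER_WEEK:.0%} of a business day")
```

Under these assumptions the two 25% shares add up to four hours, which is half of the eight-hour day the article uses as its yardstick.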
There is nothing an individual employee can do to combat this onslaught. Neglect too many emails or IMs, and you risk irritating your peers or, worse, your boss. And if sending endless email chains is the way your organization gets things done, then you have no real choice but to adopt the ways of the tribe. In short, excessive e-communication is an organizational problem. It demands organizational solutions. While the intent of France’s new law is laudable, rules like this confuse effect with cause, and as a result probably will not slow the tide of e-communication. At best, these measures will merely shift the timing of workplace communications from off-hours to the workday and push other “work” to weekends and after hours. In short, government officials can tell employers that they should not expect employees to respond to e-communications during off-hours, but unless the need — or perceived need — for excessive email, IM, crowdsourcing, and the like is somehow addressed, no government mandate will have much of an impact on the total time devoted to e-communications by employees or supervisors. Indeed, the French may quickly discover that their most productive workers are routine “lawbreakers” who stay connected during off-hours to reduce the need to take time away from family and friends to complete other work-related tasks. The only way to decrease the total time dedicated to e-communications is to encourage leaders and employees to manage the load they put on the organization through email, IM, and so on. In our work with clients, we have come to believe that the best way to do this is to provide real-time information to leaders regarding organizational load, defined as the total hours devoted to reading and responding to emails originating from each executive. 
The leadership team at Seagate, for example, found that merely providing information on the total load each manager generated each week compared to peer executives helped to reduce unnecessary e-communications. Internal competition encouraged leaders to reduce the number of employees copied on each email as well as the responses they sent to emails that did not require one. Combined, these actions reduced the time devoted to processing e-communications, without the need for mandates. Information alone modified management’s behavior. Another simple but powerful action is to eliminate “Reply All,” figuratively or literally. Since it takes time to read any email, even those that are unnecessary or not intended for you, the Reply All feature can be a big time waster. In the organizational time audits we described earlier, we found that the ease of using Reply All costs the average frontline supervisor more than 30 minutes a week in processing unnecessary e-communications. Eliminate the feature and you will liberate unproductive time across the organization. There is little doubt that unnecessary e-communications are costly, not just to the individual employee but to society at large. They contribute to employee burnout and lost productivity. But legal mandates focused on the symptoms, rather than the cause, of excessive emails are likely to have little effect. It’s time for leaders to take responsibility for the load they put on the organization and to take steps to change the way work gets done on the job. Only then will employees be able to successfully cut the leash and focus their precious time on delivering great results.

Editor's Choice
January 3, 07:58

[Translation] Training a Neural Network Written in TensorFlow in the Cloud with Google Cloud ML and Cloud Shell

In the previous article we discussed how to train a chatbot based on a recurrent neural network on an AWS GPU instance. Today we will see how easily the same network can be trained using Google Cloud ML and Google Cloud Shell. Thanks to Google Cloud Shell, you will need to do almost nothing on your local machine! By the way, we took the network from the previous article only as an example; you can just as well use any other network that uses TensorFlow.

January 2, 02:30

Clif High-2017 Predictions on Everything

Internet data mining expert Clif High uses what he calls “Predictive Linguistics” to mine the Internet and collect billions of data points to produce forecasts of the future. High has predictions on Trump, gold, silver, housing, stocks, bonds, the dollar, interest rates and even new...